Skip to content

SuperHeavyBallet/Coptic-ASCII-UNICODE-Translator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 

COPTIC TRANSLATOR

ASCII > NFC UNICODE

License

This project is licensed under the MIT License – see the LICENSE file for details.

In a nutshell, this is an internal tool built to take older format ASCII coptic symbols and translate them to the more modern NFC Unicode format.

For example: .swtM ero.f xM.pek.`.maaje becomes: ⲤⲰⲦⲘ ⲈⲢⲞ Ϥ ϨⲘ ⲠⲈⲔ̅ ⲘⲀⲀϪⲈ (Lit Eng: Listen to Him with your mind).

Example

Input:  .swtM ero.f xM.pek.`.maaje
Output: ⲤⲰⲦⲘ ⲈⲢⲞ Ϥ ϨⲘ ⲠⲈⲔ̅ ⲘⲀⲀϪⲈ

Input:  PN` KC` XC`
Output: Ⲡ̅Ⲓ̅ Ⲕ̅Ⲥ̅ Ⲭ̅Ⲥ̅

But Why?

As mentioned, this is a purely internal tool. It may be of use to others who have a need for this specific purpose, so I wanted to expose it in that off-chance.

For me, it allowed me to work on my project of transcribing and translating the apocryphal Gospel of Thomas into English for interpretation. Initially it was purely for that, but in the process of research, I found that the existing archives are very old and poorly maintained or made usable. This runs the risk - in my opinion - that this and many other ancient texts, not deemed correct to include in the main canon of religious text are likely to be quietly forgotten and lost to the world outside of dusty archives in private collections. I am a very curious individual, and so the loss of anything even remotely interesting struck me as no bueno!

So, How?

So, as you can see on the browser UI, an input text field is given, with a translate button and an output field.

If you press Translate without any valid text in the input field, the output will complain.

Instead, you must paste a string of ASCII format characters. All ASCII works, and even English or others will be translated, since they all draw from the same symbol alphabet - but those might not make much sense...

.swtM ero.f xM.pek.`.maaje is the given example of ASCII symbols. In the correct environment, this will actually show as the correct Coptic symbols you aim for. But, it is difficult to copy paste and use in other places.

After hitting translate, the JavaScript will compare the presented characters and look for edge cases (it is not a simple A = 1, B = 2 chart) and then output the result in the output field.

From here, you can highlight, copy, paste and reuse the new Unicode symbols wherever you desire (If it accepts Unicode...)

There is also a button that will copy the contents of the output text to your clipboard, without having to highlight and right click.

Finally, there is a 'Clear' button to remove all text contents after a confirmation pop up.

Limitations

  • q → Ⳓ is corpus-specific; typical mapping is q → Ⲑ.

  • ti → Ϯ heuristic (article) is context-based; not universal inside words.

  • Place-name contractions are attested but not universal.

  • Rendering depends on font; recommend Noto Sans Coptic (and note combining mark behavior).

Design Choices

  • Two-pass digraphs: inside sacred segments first, then globally to catch stragglers.

  • Overline strategy: uppercase implies overline in segment; tail guard ensures the final letter is overlined without duplication.

  • Backtick & dot conventions: legacy ASCII quirks normalized early for rule stability.

Performance & correctness notes

  • Current approach is O(n) with a few replace passes. Fine for text-length inputs; no UI jank expected.

  • Word-boundary use (\b) only applies while still in ASCII; once mapped to Coptic, rules don’t rely on \b.

  • Security: no HTML injection; uses .textContent.

Font/Rendering

  • Recommend Noto Sans Coptic; note that combining U+0305 may render thicker if duplicated.

The Algorithm

I will walk through the process after placing suitable text and pressing submit.

On the click event, the variable:

let currentInputText = inputTextArea.value;

is assigned with the value of the Input Text Area.

Then, we check if the Input Text Area is composed of valid text by using

String.prototype.trim();

to remove whitespaces, and check that the contents is actual characters. If it is, then we create a new variable:

let convertedText...;

Which is assigned the value of:

copticConverter(currentInputText).normalize('NFC');

Which places the currentInputText into that function as the parameter, and normalizes the result to NFC format

Using:

String.prototype.normalize('NFC')

If the Input Text Area value is NOT valid, ie empty space, the output returns a String message to express this.

Now let's look at the actual function:

copticConverter(input)

First, a new variable is declared:

let t = input;

With the value of the input parameter.

QUICK FIX FOR SPECIFIC, FREQUENT CASES

First, we have an 'ugly' solution:

t = t.replace(/\bcwmas\b/g, "ⲐⲰⲘⲀⲤ"); // Thomas

Which quickly checks t via regex for the known quantity of 'Thomas' (bcwmas) which may be quite frequent, and simply replaces those characters with the correct Unicode (ⲐⲰⲘⲀⲤ) rather than continuing through the further checks.

NOMINA SACRA

Next, we check against the common Coptic NOMINA SACRA (Sacred Names, usually abbreviations of character names) and replace any found symbols:

NOMINA.forEach(([pat, repl]) => { t = t.replace(pat, repl); });

A collection of common Nomina Sacra in Coptic:

| Full Coptic Word    | Contracted Form | English Equivalent |
| ------------------- | --------------- | ------------------ |
| ⲓⲏⲥⲟⲩⲥ (Iēsous)     | ⲒⲤ̅ / ⲒⲨ̅       | Jesus              |
| ⲭⲣⲓⲥⲧⲟⲥ (Christos)  | ⲬⲤ̅             | Christ             |
| ⲛⲟⲩⲧⲉ (Noute)       | ⲚⲤ̅             | God                |
| ⲕⲩⲣⲓⲟⲥ (Kyrios)     | ⲔⲤ̅             | Lord               |
| ⲡⲛⲉⲩⲙⲁ (Pneuma)     | ⲠⲒ̅ / ⲠⲒⲢ̅      | Spirit             |
| ⲥⲱⲧⲏⲣ (Sōtēr)       | ⲤⲨ̅             | Savior             |
| ⲥⲧⲁⲩⲣⲟⲥ (Stauros)   | ⲤⲢ̅             | Cross              |
| ⲓⲥⲣⲁⲏⲗ (Israēl)     | Ⲓⲗ̅             | Israel             |
| ⲅⲁⲗⲓⲗⲁⲓⲁ (Galilaia) | ⲅⲗ̅             | Galilee            |
| ⲅⲟⲗⲅⲟⲑⲁ (Golgotha)  | ⲅⲗ̅             | Golgotha           |
| ⲥⲩⲣⲓⲁ (Syria)       | ⲥⲣ̅             | Syria              |
| ⲃⲏⲑⲗⲉⲉⲙ (Bethleem)  | ⲃⲗ̅             | Bethlehem          |

The NOMINA SACRA array of key–value pairs (Updated as new names emerge)

const NOMINA = [
  // Core names
  [/\bIS\b/g, "Ⲓ" + OVERLINE + "Ⲥ" + OVERLINE],   // Jesus
  [/\bIC\b/g, "Ⲓ" + OVERLINE + "Ⲥ" + OVERLINE],   // Jesus (variant)
  [/\bXC\b/g, "Ⲭ" + OVERLINE + "Ⲥ" + OVERLINE],   // Christ
  [/\bNC\b/g, "Ⲛ" + OVERLINE + "Ⲥ" + OVERLINE],   // God
  [/\bKC\b/g, "Ⲕ" + OVERLINE + "Ⲥ" + OVERLINE],   // Lord
  [/\bPN\b/g, "Ⲡ" + OVERLINE + "Ⲓ" + OVERLINE],   // Spirit (Pneuma)
  [/\bSY\b/g, "Ⲥ" + OVERLINE + "Ⲩ" + OVERLINE],   // Savior
  [/\bCR\b/g, "Ⲥ" + OVERLINE + "Ⲣ" + OVERLINE],   // Cross (Stauros)

  // Place names (less common, but attested)
  [/\bIL\b/g, "Ⲓ" + OVERLINE + "ⲗ" + OVERLINE],   // Israel
  [/\bGL\b/g, "ⲅ" + OVERLINE + "ⲗ" + OVERLINE],   // Galilee
  [/\bGLT\b/g, "ⲅ" + OVERLINE + "ⲗ" + OVERLINE],  // Golgotha (variant contraction)
  [/\bSR\b/g, "ⲥ" + OVERLINE + "ⲣ" + OVERLINE],   // Syria
  [/\bBL\b/g, "ⲃ" + OVERLINE + "ⲗ" + OVERLINE]    // Bethlehem
];

Backtick Normalizing

Next, checks are made for: .` which needs to be normalized to: ` so that later rules run reliably

Normalize: .` ` Code:

t = t.replace(/\.`/g, "`");

Overline Capitals

Next, we check within t for any general overline dotted capitals

t = overlineDottedCapitals(t);

In this

overlineDottedCapitals()

function, we take an argument of t and check through it for any capitals that require an Overline symbol above to denote special status, eg NOMINA SACRA Abbreviations.

function overlineDottedCapitals(t) {
    return t.replace(/\b([A-Z])\./g, (_, cap) => {
      const mapped = SINGLE[cap.toLowerCase()] || cap;
      return mapped + OVERLINE;
    });
  }

This function checks through t as a sequence of characters, for any that match the Regex

/\b([A-Z])\./g

\b → word boundary

([A-Z]) → capture a single uppercase letter

. → immediately followed by a literal period .

So it matches things like I. or X. at the start of a word.

Callback: (_, cap) => {  }

The first argument _ is the whole match (ignored).

cap is the single uppercase letter matched (e.g., "I").

Mapping:

SINGLE[cap.toLowerCase()] || cap;

Looks up the lowercase equivalent in a dictionary called SINGLE.

If there’s no mapping, it just keeps the original uppercase.

Return:

Concatenates the mapped character with OVERLINE (which is \u0305, the combining overline).

So "I." becomes " Ⲓ̅ ", for instance.

In plain English:

This function finds capital letters followed by a dot (a shorthand scribal way of writing nomina sacra) and replaces them with the mapped Coptic letter, marked with an overline.

SINGLE DICTIONARY

const SINGLE = {
    "a":"Ⲁ","b":"Ⲃ","g":"Ⲅ","d":"Ⲇ","e":"Ⲉ","z":"Ⲍ",
    "h":"Ⲏ","q":"Ⳓ","i":"Ⲓ","k":"Ⲕ","l":"Ⲗ","m":"Ⲙ",
    "n":"Ⲛ","o":"Ⲟ","p":"Ⲡ","r":"Ⲣ","s":"Ⲥ","t":"Ⲧ",
    "u":"Ⲩ","w":"Ⲱ","f":"Ϥ","y":"Ⲩ","c":"Ⲥ","v":"Ⲃ"
  };

NOTE: "q" by most cases maps to "Ⲑ", but in this old text, "Ⳓ" is required

Sacred Segments

Next we walk through this replacement process

t = t.replace(/([A-Za-z$]+)`/g, (_, seg) => {

        DIGRAPHS.forEach(([pat, repl]) => { seg = seg.replace(pat, repl); });
      
        let out = "";

        for (const ch of seg) {

          const sp = specials(ch);
          if (sp) { out += sp; continue; }

          const mapped = SINGLE[ch.toLowerCase()] || ch;
      
          if (/[A-Z]/.test(ch)) out += mapped + OVERLINE;
          else out += mapped;
        }

        return /\p{L}$/u.test(out) && !out.endsWith(OVERLINE) ? out + OVERLINE : out;

      });

END OF SEGMENT

First, we check via regex for:

/([A-Za-z$]+)`/g

Which captures a run of ASCII letters (and $) right before a backtick. Examples matched: IS in IS ` , KyriocinKyrioc in KyriocinKyrioc `. The backtick itself isn’t captured; it just marks “end of sacred segment”.

SEGMENT TREATMENT

Each of these captured segments is then treated to this process:

DIGRAPHS

(_, seg) => {
  DIGRAPHS.forEach(([pat, repl]) => { seg = seg.replace(pat, repl); });
const DIGRAPHS = [
    [/ch/gi, "Ⲑ"],   
    [/th/gi, "Ⲑ"],
    [/ph/gi, "Ⲫ"],
    [/kh/gi, "Ⲭ"],
    [/ps/gi, "Ⲯ"],
    [/sh/gi, "Ϣ"],
    [/ti/gi, "Ϯ"],
    [/ou/gi, "ⲞⲨ"]
  ];

Digraphs first: We normalize things like OU, TH, etc., before single-letter mapping. This prevents splitting digraphs into separate letters prematurely.

SPECIALS

function specials(ch) {
  if (ch === "$") return "Ⳃ";
  if (ch === "j") return "Ϫ";
  if (ch === "x") return "Ϩ";
  if (ch === ".") return " ";  // dots = space

  return null;
}

Note - " . " , Period as an input ASCII symbols maps to a blank space, as that is the convention in the ACII-Coptic source alphabet. Decimal places will not work as input.

  let out = "";
  for (const ch of seg) {
    const sp = specials(ch);
    if (sp) { out += sp; continue; }

specials(ch) Acts as a fast escape hatch for basic punctuation/symbols we want to inject verbatim.

SINGLE MAP

const mapped = SINGLE[ch.toLowerCase()] || ch;

Simple Latin→Coptic letter map check for the character

UPPERCASE OVERLINE RULE

    if (/[A-Z]/.test(ch)) out += mapped + OVERLINE;
    else out += mapped;
  }

Uppercase rule: If the input letter is uppercase, you add OVERLINE immediately after the mapped letter.

FINAL LETTER OVERLINE

  return /\p{L}$/u.test(out) && !out.endsWith(OVERLINE) ? out + OVERLINE : out;
}

Final overline: Regardless of case, we always add an OVERLINE to the last output character in the segment, unless it already features an overline.

Iota With Diaresis

    t = t.replace(/!/g, "Ⲓ\u0308");

Next, we replace any character that matches ! in ASCII as "Ⲓ\u0308" The input "Ⲓ\u0308" represents a character combination of the Coptic letter iota (ⲓ) and the combining diaeresis (U+0308). Looking like this: Ⲓ̈

Digraphs

Diraphs are checked again for characters that did not fall into previous patterns but may still require conversion.

const DIGRAPHS = [
    [/ch/gi, "Ⲭ"],   
    [/th/gi, "Ⲑ"],
    [/ph/gi, "Ⲫ"],
    [/kh/gi, "Ⲭ"],
    [/ps/gi, "Ⲯ"],
    [/sh/gi, "Ϣ"],
    [/\bti\b/gi, "Ϯ"],                // exact word 'ti'
    [/\bti(?=\s|[·.,;:!?'")\]])/gi, "Ϯ"], // article 'ti' before space/punct
    [/ou/gi, "ⲞⲨ"]
  ];

NOTE - " ti " is given context variations, depending on its position

    DIGRAPHS.forEach(([pat, repl]) => { t = t.replace(pat, repl); });

FINAL CHECK

First, declare an empty string variable for out

    let out = "";

Specials

Then, for each character in t, check if it falls under any special cases that were missed to this point:

function specials(ch) {
  if (ch === "$") return "Ⳃ"; 
  if (ch === "j") return "Ϫ";
  if (ch === "x") return "Ϩ";
  if (ch === ".") return " ";  // dots = space

  return null;
}

Note - " . " , Period as an input ASCII symbols maps to a blank space, as that is the convention in the ACII-Coptic source alphabet. Decimal places will not work as input.

    for (let ch of t) {
      const sp = specials(ch);
      if (sp) { out += sp; continue; }
      const map = SINGLE[ch.toLowerCase()];
      out += map ? map : ch;
    }

ENDING

And that should ideally present you with a fully converted set of characters properly formatted in Unicode NFC.

About

Taking old format Coptic ASCII text symbols and converting them to modern UNICODE NFC text.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published