An undeciphered script in the age of AI : a corpus-constrained computational analysis of Linear A
Ένα άγνωστο σύστημα γραφής στην εποχή της ΤΝ : υπολογιστική ανάλυση της Γραμμικής Α με περιορισμένα δεδομένα

Master Thesis
Author
Briakos, Nikolaos
Μπριάκος, Νικόλαος
Date
2026-05
Advisor
Venetis, Ioannis
Βενέτης, Ιωάννης
Keywords
AI ; Undeciphered scripts ; Computational linguistics ; Linear A ; Τεχνητή νοημοσύνη ; Γραμμική Α
Abstract
This thesis investigates an unresolved question in computational linguistics: can artificial intelligence contribute to the study and eventual decipherment of Linear A, the undeciphered Bronze Age script of Minoan Crete (c. 1800–1450 BCE)? Using a corpus of 419 tablets from the lineara.xyz database (GORILA-derived, N = 2,481 sign tokens), the study approaches this question through three complementary analyses.
What AI can already detect. The corpus exhibits Zipfian sign distributions, systematic positional biases, register divergence (Jensen-Shannon divergence, JSD = 0.0944 bits, permutation p = 0.018), cross-site variation in word length (Cohen’s d = 0.692, Phaistos vs. Khania), and 23 scribal fingerprints characterised by distinctive bigram preferences. Of particular interest, an unsupervised formula-detection algorithm recovered 5 of the 9 human-identified elements of the known libation formula from a 31-tablet stone-vessel corpus spanning 14 archaeological sites, without prior knowledge of the formula structure. These findings indicate that AI methods can recover structurally meaningful regularities relevant to epigraphic analysis.
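As an illustration of the register-divergence statistic mentioned above, the following is a minimal sketch of a Jensen-Shannon divergence (in bits) with a permutation test. It is not the thesis's code. The function names, the toy token lists, and the choice to shuffle individual sign tokens (rather than whole tablets) are assumptions made for this example only.

```python
import math
import random
from collections import Counter


def jsd_bits(tokens_a, tokens_b):
    """Jensen-Shannon divergence (log base 2) between two sign-token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    total = 0.0
    for sign in set(ca) | set(cb):
        p = ca[sign] / na
        q = cb[sign] / nb
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


def permutation_p(tokens_a, tokens_b, n_perm=2000, seed=0):
    """One-sided permutation p-value for the observed JSD.

    Assumption: tokens are exchangeable under the null hypothesis,
    so group labels are shuffled at the token level.
    """
    rng = random.Random(seed)
    observed = jsd_bits(tokens_a, tokens_b)
    pooled = list(tokens_a) + list(tokens_b)
    na = len(tokens_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if jsd_bits(pooled[:na], pooled[na:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data, not drawn from the Linear A corpus.
register_a = ["KU", "RO", "KU", "RO", "A", "TA"] * 10
register_b = ["A", "TA", "I", "JA", "KU", "SA"] * 10
obs, p = permutation_p(register_a, register_b, n_perm=500)
print(f"JSD = {obs:.4f} bits, permutation p = {p:.4f}")
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 bit (disjoint sign inventories), which gives a scale against which a value such as 0.0944 bits can be read.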
What AI can learn from the current corpus. A small Transformer model (4-layer masked language model, ~2M parameters) trained on a GPU achieved 90.2% peak validation accuracy and correctly reconstructed the most frequent Linear A sign collocation (KU-RO, p = 0.59 in both directions). The results suggest that the model captures important aspects of the distributional structure of the Linear A sign inventory. A multimodal analysis combining PIL-derived image features with distributional sign embeddings identified no meaningful geographic clustering; the apparent separation observed in visual PC1/PC2 space appears primarily attributable to photographic exposure conditions (visual PC1 r = −0.990 with image brightness, R² = 0.98).
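The brightness confound reported above amounts to correlating first-component scores with a per-image brightness value. The sketch below shows that check in plain Python; the function name and the toy numbers are hypothetical, and how the thesis actually computed brightness or PC1 is not stated in this abstract.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-image mean brightness values (0 = black, 1 = white).
brightness = [0.21, 0.35, 0.48, 0.52, 0.67, 0.80]
# Hypothetical PC1 scores that fall linearly as brightness rises.
pc1 = [-2.0 * b + 0.1 for b in brightness]

r = pearson_r(pc1, brightness)
print(f"r = {r:.3f}, R^2 = {r * r:.3f}")

# Consistency check on the reported figures: r = -0.990 implies R^2 of about 0.98.
print(round((-0.990) ** 2, 4))
```

For a single predictor, R² is simply r squared, so the two reported figures (r = −0.990, R² = 0.98) agree with each other.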
What AI cannot yet achieve. A Linear B calibration experiment using a synthetic corpus constructed from published frequency distributions showed that frequency rank matching at N = 2,481 tokens reaches only 13% top-1 accuracy. Bayesian phonetic inference produced near-uniform posterior distributions (maximum deviation 2.1× above the uniform baseline), indicating that the model cannot yet transform learned statistical structure into reliable phonetic value assignments without a substantially larger corpus of known correspondences. The calibration experiments place the practical threshold for useful phonetic inference at approximately N = 10,000 sign tokens, roughly 7,500 more than the currently available corpus contains.
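To make the idea of frequency rank matching concrete, here is a minimal sketch under stated assumptions: a synthetic corpus is sampled from a Zipf-like reference distribution (weight 1/rank), signs are ranked by observed frequency, and a sign counts as correctly matched when its observed rank equals its reference rank. The inventory size, the Zipf weights, and the function name are illustrative choices, not the thesis's setup, so the sketch is not expected to reproduce the 13% figure.

```python
import random
from collections import Counter


def rank_match_accuracy(n_tokens, n_signs=60, seed=0):
    """Fraction of signs whose observed frequency rank equals their true rank."""
    rng = random.Random(seed)
    signs = list(range(n_signs))  # sign k has true rank k (0 = most frequent)
    # Assumed Zipf-like reference distribution: weight proportional to 1/(k+1).
    weights = [1.0 / (k + 1) for k in signs]
    sample = rng.choices(signs, weights=weights, k=n_tokens)
    counts = Counter(sample)
    # Rank by observed count, breaking ties at random so that the sign
    # index (which encodes the true rank) cannot leak into the ordering.
    observed = sorted(signs, key=lambda s: (-counts[s], rng.random()))
    correct = sum(1 for rank, s in enumerate(observed) if s == rank)
    return correct / n_signs


for n in (2481, 10000, 100000):
    print(n, rank_match_accuracy(n))
```

Running the loop at several corpus sizes shows how the matching degrades when rare signs have too few tokens for their counts to separate. As a side note on the arithmetic in the paragraph above, 10,000 − 2,481 = 7,519 tokens, consistent with "roughly 7,500".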
The results consistently indicate that the principal limiting factor is corpus size rather than script complexity or model architecture. AI methods can already model important aspects of the distributional structure of Linear A sign sequences, but they cannot yet infer reliable phonetic correspondences. The gap, however, is now quantifiable and may narrow as additional inscriptions become available.


