Lerna VIIa: Classical and Late vocabulary

Here, I'll try making some sense of how the vocabularies of Greek have shifted between the corpora.

This is where we got to.
Corpus                                                Lemmata   Excluding Proper Names
TLG + PHI #7 (viii–XVI, +tech +christ +inscr/pap)     214,381   172,646
TLG (viii–XVI, +tech +christ -inscr/pap)              201,823   162,009
LSJ Corpus (viii–VI, +tech -christ +inscr/pap)        159,636   124,215
Mostly Pagan (viii–IV, -tech -christ -inscr/pap)       99,485    76,067
Strictly Ancient (viii–iv, +tech -christ +inscr/pap)   66,390    54,898

The corpora differ in whether they include "technical" texts, Christian texts, and inscriptions and papyri. In case it wasn't obvious, "Christian texts" means texts about the Christian religion, which have an editorial and linguistic tradition distinct from that of the Classics. We're not banning authors for their creed; we're sorting texts by the corpus they fit into. The Mostly Pagan corpus chooses to end with Synesius rather than Nonnus, but both of them started as pagans and ended as Christian bishops.

I'm going to try and work out how the vocabularies differ over time between the corpora. Two postings ago, we cut down on post-classical and suspect-looking analyses by restricting our word form counts to forms of good standing and pedigree. This permitted us to describe a more homogeneous corpus. We can put the same restrictions on our lemma counts.
Corpus                       Lemmata   Excluding Proper Names
TLG + PHI #7                 204,393   167,640
TLG (viii–XVI)               192,342   157,302
LSJ Corpus (viii–VI)         151,962   120,018
Mostly Pagan (viii–IV)        97,906    75,845
Strictly Ancient (viii–iv)    65,842    54,743

Forms of Good Standing (no numbers, hypothetical, hypercorrect, unattested tenses, uncertain inflection, anomalous inflection, transliterated Latin)

Nary a dent on the Strictly Ancient corpus or even the Mostly Pagan corpus, which contain literature. But we've got rid of words made up by grammarians as etymologies; e.g. ἀόλλησις "thronging" as an etymology of ἀλλᾶς "sausage". (The jokes just write themselves, don't they.) We also got rid of Latin terminology, which didn't always make it to the dictionaries when it was undigested Latin; e.g. νερεδιτάς hereditas "inheritance". What with that and Milesian numbers, we can take about 10,000 lemmata out of the overall corpus.
hereditas ends up as /nereðitas/? Yes, Byzantine lawmen liked transliterating Latin /h/ as <n>; I'm not clear on why.
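The size of the dent can be checked directly from the two tables. A minimal sketch, with the counts copied from above; the names `before`, `after` and `dent` are made up for the illustration:

```python
# (all lemmata, excluding proper names) for each corpus,
# before and after restricting to forms of good standing
before = {
    "TLG + PHI #7": (214_381, 172_646),
    "TLG": (201_823, 162_009),
    "LSJ Corpus": (159_636, 124_215),
    "Mostly Pagan": (99_485, 76_067),
    "Strictly Ancient": (66_390, 54_898),
}
after = {
    "TLG + PHI #7": (204_393, 167_640),
    "TLG": (192_342, 157_302),
    "LSJ Corpus": (151_962, 120_018),
    "Mostly Pagan": (97_906, 75_845),
    "Strictly Ancient": (65_842, 54_743),
}


def dent(corpus):
    """Lemmata removed by the filter: (all, excluding proper names)."""
    return tuple(b - a for b, a in zip(before[corpus], after[corpus]))


for corpus in before:
    dropped_all, dropped_common = dent(corpus)
    share = dropped_all / before[corpus][0]
    print(f"{corpus:17} {dropped_all:6,} {dropped_common:6,} {share:5.1%}")
```

The overall corpus loses 9,988 lemmata, while the Strictly Ancient corpus loses only 548.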

Now comes the hatchet. How many lemmata can be called Ancient grammatically, even if they only show up in later texts? That sounds nonsensical, right? Surely if a lemma is first attested post-Classically, it counts as post-Classical. Well, it does, but I'll crank the handle anyway.
  • I'm leaving out proper names now, because the lemmatiser has only occasionally assigned them a period.
  • OTOH, lemmata do by default get called Late in the lemmatiser's database if they are unique to Lampe or Trapp, and specifically Demotic if they're unique to Kriaras. Finding I'd missed a class of lemmata for tagging was what made me start revisiting all the counts.
  • This also means that by default, if the lemma is in LSJ, it's counted as classical.
  • LSJ stops nominally at VI AD (and on occasion with scholia, a lot later); so it decidedly includes Koine—but lemmata have not been consistently periodised in the lemmatiser as Koine as distinct from Classical. So the counts of lemmata tagged as classical are inflated enough not to be useful.
  • I'm cranking the handle anyway.
  • Beyond that, though, any verb derived from a Classical verb (by prefixing a preposition) still counts as linguistically Classical, because that process was fully productive from the beginning. So ἀντιδιαλοιδορέομαι "to be mocked thoroughly in response" is attested only in Trapp; but all of ἀντί, διά and λοιδορέω are Classical, and the combination was licensed in antiquity, so the compound of all three is counted as Classical.
  • The same goes for lemmata formed through derivational morphology—unless the word does show up in a later dictionary. So ἀβελτίωτος "unimproved" could have been formed at any time of Greek, from ἀ-, βελτιόω, and -τος. But because it is explicitly attested in Trapp, it is counted as a new Byzantine word.
  • The discrepancy between how I handle prefixing (always counts as Ancient) and suffixing (counts as a new word if it is in later dictionaries) is, as you may have guessed, an artefact of how the lemmatiser has been implemented.
  • OTOH, some later text has crept into the nominally Ancient corpus, notably in Testimonia (later descriptions of authors in literature), and more so in "technical" texts, which were often written in Koine. In fact, LSJ has plenty of Koine in it and, as we'll see, a lot more Koine in technical and daily-life texts than in literary texts, something which should surprise precisely no one.
  • And yet... I'm still cranking the handle.
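The defaulting rules in the list above can be sketched roughly as follows. This is not the lemmatiser's actual code: the `Lemma` record, its fields and the `period` function are all invented for illustration. It assumes that `preverbs` only ever lists Classical prepositions, and it lumps any mix of later dictionaries in with "Late", which the list above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lemma:
    form: str
    dictionaries: frozenset = frozenset()  # e.g. {"LSJ"}, {"Trapp"}
    preverbs: tuple = ()                   # prepositional prefixes, if any
    base: str = ""                         # the verb the preverbs attach to


def period(lemma, lexicon):
    """Guess a linguistic period for a lemma (hypothetical rules)."""
    # Prefixing a preposition to a Classical verb always counts as Classical,
    # even if the compound is only attested in a later dictionary.
    if lemma.preverbs and lemma.base in lexicon:
        if period(lexicon[lemma.base], lexicon) == "Classical":
            return "Classical"
    # Anything in LSJ is Classical by default; so is a derived lemma
    # that turns up in no dictionary at all.
    if "LSJ" in lemma.dictionaries or not lemma.dictionaries:
        return "Classical"
    # Unique to Kriaras: specifically Demotic.
    if lemma.dictionaries == {"Kriaras"}:
        return "Demotic"
    # Otherwise (Lampe, Trapp): Late.
    return "Late"


lexicon = {
    "λοιδορέω": Lemma("λοιδορέω", frozenset({"LSJ"})),
    "ἀντιδιαλοιδορέομαι": Lemma(
        "ἀντιδιαλοιδορέομαι", frozenset({"Trapp"}), ("ἀντί", "διά"), "λοιδορέω"
    ),
    "ἀβελτίωτος": Lemma("ἀβελτίωτος", frozenset({"Trapp"})),
}

for form, lemma in lexicon.items():
    print(form, period(lemma, lexicon))
```

With these rules ἀντιδιαλοιδορέομαι comes out as Classical despite being attested only in Trapp, while ἀβελτίωτος comes out as Late, which is the prefixing/suffixing discrepancy described above.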

So if I crank the handle, and exclude analyses that the lemmatiser thinks, for better or worse, are post-Classical, what do I get?
Corpus                       Lemmata (excluding proper names)
TLG + PHI #7                 132,098
TLG (viii–XVI)               122,579
LSJ Corpus (viii–VI)         110,417
Mostly Pagan (viii–IV)        73,260
Strictly Ancient (viii–iv)    54,176

Forms of Good Standing and Classical Pedigree

Let's try and make sense of this. The two corpora we'll compare are the LSJ corpus, which goes up to VI AD and excludes Christian writings; and the complete TLG + PHI #7 corpus. As always, excluding proper names:
Corpus         All lemmata   Classical lemmata only   Difference
TLG + PHI #7   167,640       132,098                  33,000 Middle + 2,000 Modern lemmata
LSJ Corpus     120,018       110,417                  10,000 Middle lemmata

  • There are 47,000 lemmata that turn up only after VI AD, after the LSJ corpus.
  • There are another 10,000 lemmata in the LSJ corpus (8%) that are marked as late ("Middle"). Given that the LSJ corpus does include Koine texts, that whether a lemma got marked as Koine or not is a little haphazard, and that the technical Koine texts in the LSJ are linguistically messy, that's not that surprising.
  • By contrast, the Mostly Pagan corpus, which skips the papyri and technical texts, has just 2,500 Middle lemmata (3%). Literary texts avoid linguistically innovative lemmata. Technical texts account for another 2,600 Middle lemmata; the remaining 5,000 are from papyri.
  • Of the 47,000, almost half (22,000) are linguistically still Classical. Some of these are late lemmata that just happen to have made it into LSJ (a lot of those are legal Latinisms); others are derived lemmata.

Restating: just over a quarter of all lemmata in our corpus turned up only after VI AD; but almost half of those new lemmata don't look new to the lemmatiser at all: they look classical. Between productive word formation and accidents of tagging, only about a fifth of all lemmata in our corpus appear to be post-Classical to the lemmatiser.
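For anyone who wants to check the arithmetic behind those proportions, here is a small sketch using the counts from the comparison table. The variable names are made up, and subtracting the two "Classical" counts to get the new-but-Classical lemmata is one reading of the figures: it relies on the LSJ corpus being a subset of the full corpus.

```python
# Lemma counts excluding proper names, from the comparison table
all_lemmata = {"TLG + PHI #7": 167_640, "LSJ Corpus": 120_018}
classical = {"TLG + PHI #7": 132_098, "LSJ Corpus": 110_417}

total = all_lemmata["TLG + PHI #7"]

# Lemmata that only turn up after the LSJ corpus ends (VI AD)
new_after_vi = total - all_lemmata["LSJ Corpus"]

# Of those, the ones the lemmatiser still counts as Classical
new_but_classical = classical["TLG + PHI #7"] - classical["LSJ Corpus"]

# Lemmata the lemmatiser counts as post-Classical, in the whole corpus
post_classical = total - classical["TLG + PHI #7"]

print(f"new after VI AD:     {new_after_vi:,} ({new_after_vi / total:.0%} of all)")
print(f"...still Classical:  {new_but_classical:,} ({new_but_classical / new_after_vi:.0%} of new)")
print(f"post-Classical:      {post_classical:,} ({post_classical / total:.0%} of all)")
```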

Let's look at these new lemmata more closely, by looking at the most frequent lemmata in each category. The distinction between "linguistically Classical" and "linguistically Middle" does not turn out to matter much, because it's an accident of what has been included or excluded from LSJ. OTOH the distinction with "linguistically Modern" (i.e. Early Modern Greek) is quite revealing. Be warned too that the frequency of lemmata is all about the types of text included in the corpus.

And because a little bit of vernacular has leaked into Photius' lexicon, which was after all compiled pretty late, I'm excluding the lexica from the LSJ counts again. This pushes 15,000 lemmata back into the mediaeval period: 140 of them linguistically Modern, 1,500 of them linguistically Middle, and the rest of them linguistically Ancient. They're dictionaries; of course they have lots of one-off words.

So what are the most frequent lemmata new to mediaeval Greek?

Linguistically Ancient (34,641)
  • ἐκπόρευσις (85) "proceeding forth"
  • στιχολογία (735) "recitation"
  • περιβόλιον (561) "garden"
  • ἐναντιοφανής (533) "apparently contradictory"
  • πανσέβαστος (453) "most august"
  • λατινικός (388) "Latin"
  • πακτεύω (375) "make a pact"
  • κορμίον (365) "trunk of body"
  • χρυσορρήμων (365) "golden-speaking"
  • παραταγή (365) "order for payment"

Linguistically Middle (25,111)
  • θεοτοκίον (1425) "hymn to Virgin Mary"
  • μετόχιον (1370) "monastic property"
  • κονδικτίκιος (928) "relating to repossession of property"
  • κανείς (671) "no one"
  • ὀκτώηχος (863) "hymnal with all eight modes"
  • πρωτοσπαθάριος (696) "chief of imperial bodyguard"
  • ἱεραρχία (638) "hierarchy"
  • τώρα (622) "now"
  • γυρίζω (582) "I return"
  • μισέρ (556) "monsieur"

Linguistically Modern (2,526)
  • ἔτζι (749) "so"
  • ἠμπορέω (424) "I can"
  • ἀμή (290) "but"
  • τέτοιος (261) "such"
  • κάθε (202) "each"
  • ἀντάμα (187) "together"
  • ἀπαυτοῦ (125) "thence"
  • κάποιος (114) "someone"
  • βουλέω (113) "I want"
  • ὁλόρθος (111) "upright"

(A couple of texts nominally in LSJ's time period still contain νά, which I treat as diagnostic of Modern Greek; but again, these lists are only meant to be indicative.)
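Lists like these amount to a frequency count grouped by category. A toy sketch of the idea, assuming the corpus has already been reduced to (lemma, category) pairs, one per word instance; the function name `top_lemmata` and the sample data are invented, and the counts in the sample are not real corpus figures:

```python
from collections import Counter, defaultdict


def top_lemmata(tokens, n=10):
    """Most frequent lemmata per category.

    tokens: iterable of (lemma, category) pairs, one per word instance.
    """
    counts = defaultdict(Counter)
    for lemma, category in tokens:
        counts[category][lemma] += 1
    return {category: c.most_common(n) for category, c in counts.items()}


# Toy input, for illustration only
sample = (
    [("τώρα", "Middle")] * 3
    + [("γυρίζω", "Middle")] * 2
    + [("ἔτζι", "Modern")] * 2
    + [("κάθε", "Modern")]
)

top = top_lemmata(sample, n=1)
print(top)
```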

The ancient-looking new words deal with theology, logic, law, or the public sphere: disciplines which kept innovating their own specialist vocabulary. The middle-looking words largely deal with the church; the theotokion count is inflated relative to its counterparts because the word is used to head a section in a lot of the hymns in the corpus. There are a couple of novel grammatical words in Middle Greek ("no one", "now"). But for Modern Greek, almost all the most frequent words are grammatical. And what that foreshadows is that Modern Greek has a grammatical system distinct from that of Ancient Greek, while Middle Greek is a lot closer to the Ancient grammatical system. That's no surprise, given that much of our Middle Greek corpus is Atticist to begin with.

Finally, for what it's worth, this is how many lemmata the lemmatiser thinks are linguistically Attic:
Corpus                       Lemmata (excluding proper names)
TLG + PHI #7                 127,169
TLG (viii–XVI)               118,697
LSJ Corpus (viii–VI)         105,960
Mostly Pagan (viii–IV)        70,726
Strictly Ancient (viii–iv)    51,666

Forms of Good Standing and Attic Pedigree

That's not worth that much, it must be said, since Attic is taken as the default dialect in the lemmatiser. Though excluding dialect word forms eliminates a substantial number of forms from the corpus, the lemma count itself is not affected much: an Attic-compatible word form usually turns up someplace.

