This post is boring—a bunch of numbers—but necessary for the record. Because the counts are dependent on how the lemmatiser recognises words, and the lemmatiser is not static (and neither is the corpus), these counts are not definitive; but they are more correct than the last reports. The major bug fix (as far as I can tell!) was that I'd forgotten to factor out case and accentuation for not only the raw word forms, but also their normalised counterparts; so all the normalised word counts were off by some 10%. But because the main conclusions were comparing vocabulary sizes relative to each other, they still hold.
That's why testing is a good thing, right?
I've added a new corpus into the mix: the LSJ Corpus is meant to approximate the coverage of LSJ. It excludes any Christian-related writing, apart from the Scriptures themselves. Otherwise (and that's a big Otherwise), it includes all pagan authors up to VI AD, including technical authors. It also includes the ancient inscriptions and papyri from PHI #7. The LSJ corpus additionally includes the lexica of Hesychius, Photius, and the Etymologicum Magnum, which were written later, but (some of the time) reach back earlier. It still leaves out the scholia on Classical literature, which explain Ancient texts with Byzantine words.
It also leaves out two Demotic texts which have ended up in collections of Ancient authors in the TLG, one under Pseudo-Hippocrates, one under the Hippiatrica. I'm taking those out of the Mostly Pagan and Strictly Ancient corpora too. The fact that a clearly XVI AD text has been lumped in with a v BC corpus should give you pause: use the author dates on the TLG with caution—they apply to the authors, but not to all the spurious works included under the author's name.
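Counts of unique strings in the corpora
Where a cell gives two figures, the first is the previously reported count and the second is the corrected one.
|Corpus||Coverage||Word Instances||Word Forms|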
|TLG + PHI #7||(viii-XVI, +tech +christ +inscr/pap)||102,005,245 101,684,658||1,861,358 1,815,540|
|TLG (viii–XVI)||(viii–XVI, +tech +christ -inscr/pap)||95,475,128||1,567,892|
|LSJ Corpus (viii–VI)||(viii-VI, +tech -christ +inscr/pap)||34,746,312||1,147,454|
|Mostly Pagan (viii–IV)||(viii–IV, -tech -christ -inscr/pap)||16,312,159||605,335|
|Strictly Ancient (viii–iv)||(viii–iv, +tech -christ +inscr/pap)||5,464,913 5,463,292||334,428 334,187|
This is where the differences start. By correcting incomplete word indications, hyphenation, and rejected scribal forms in PHI #7, I've lost about 320,000 word instances and 46,000 distinct word forms.
You can also see that going from Mostly Pagan to the LSJ corpus almost doubles the count of distinct word forms. That's adding in two more centuries of pagan literature, technical writing, inscriptions, papyri, and late lexica. The PHI #7 texts account for around a third of that increase; the rest comes from the technical writing and lexica. The lexica include a large number of one-off words, and a lot of loose Byzantine spelling. Technical writing includes even more loose Byzantine spelling, because these texts are not closely bound to Atticist literary norms.
But it also includes a lot of idiosyncratic vocabulary—medical, astrological, engineering, mathematical, not to mention all the random place names in Ptolemy and the other geographical texts. Technical writing also encompasses grammatical and philological commentary—which often means grammarians just making up tenses and cases to explain words. So there is a lot of distinctive vocabulary in technical writing; but there is also a lot of inflated vocabulary.
Stripping case and forms without diacritics
I've fixed the calculations to take out more forms with partial diacritics, so I'm now making sure that all of ανδρι, ἀνδρι and ανδρί are folded under ἀνδρί. So fewer forms from here on are considered truly distinct:
|TLG + PHI #7||1,649,083 1,545,491|
|TLG (viii–XVI)||1,376,016 1,355,062|
|LSJ Corpus (viii–VI)||1,001,079|
|Mostly Pagan (viii–IV)||562,744 555,843|
|Strictly Ancient (viii–iv)||314,887 312,255|
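One way to picture the folding of partially marked forms is the sketch below. It is not the code behind these counts: it assumes that a form counts as "partial" when it has the same letters as another form (ignoring case) and carries only a subset of that form's diacritics. The function names and the ἄλλα/ἀλλά pair are illustrative additions, included to show that two fully but differently accented forms stay distinct under this assumption.

```python
import unicodedata

def decompose(form):
    """Return a tuple of (case-folded base letter, frozenset of combining marks)."""
    units = []
    for ch in unicodedata.normalize("NFD", form):
        if unicodedata.combining(ch) and units:
            units[-1][1].add(ch)               # attach the mark to the preceding letter
        else:
            units.append((ch.casefold(), set()))
    return tuple((base, frozenset(marks)) for base, marks in units)

def subsumes(full, partial):
    """True if `partial` has the same letters as `full` and no marks that `full` lacks."""
    return len(full) == len(partial) and all(
        fb == pb and pm <= fm for (fb, fm), (pb, pm) in zip(full, partial)
    )

def fold_partial_diacritics(forms):
    """Drop every form that is a less fully marked variant of another form in the list."""
    keyed = {}
    for form in forms:
        keyed.setdefault(decompose(form), form)   # case variants collapse onto one key
    # Quadratic comparison: fine for a sketch, too slow for a real corpus.
    return [
        form for key, form in keyed.items()
        if not any(other != key and subsumes(other, key) for other in keyed)
    ]

forms = ["ανδρι", "ἀνδρι", "ανδρί", "ἀνδρί", "Ἀνδρί", "ἄλλα", "ἀλλά"]
kept = fold_partial_diacritics(forms)
print(kept)  # ['ἀνδρί', 'ἄλλα', 'ἀλλά']
```

A cruder alternative is to strip every diacritic and compare the bare letters, but that would also merge forms that differ only in accent, which is presumably not what is wanted here.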
Restricting to recognised forms
Though I've added a thousand-odd proper names and some Arcadian and Cretan grammar to the lemmatiser, it still struggles with the PHI #7 corpus, as you'd expect: it's now understanding 62% of all word forms instead of 59%. There are 73,000 capitalised word forms in PHI #7, and 21,000 uncapitalised, that the lemmatiser has no idea about. For the TLG corpus, the equivalent is currently 42,000 capitalised word forms, and 43,000 uncapitalised, that are going unrecognised, and the TLG has seven times more word forms than PHI #7.
So there is a *lot* of vocabulary, particularly proper names, that is unique to the PHI #7 corpus, and that the lemmatiser does not yet understand. In fact, I already know there should be 16,000 distinct proper names in the papyri alone, as I mentioned last post. But once again, if I am using the lemmatiser to make morphological judgements about distinct word forms, I can't count words that the lemmatiser doesn't understand. So I have to pretend those words don't exist, for any remaining counts to mean anything.
OTOH, it's been a month, and recognition of the TLG corpus has gone up (partly because of this series of posts). The word counts are not static.
|TLG + PHI #7||1,435,391 1,391,855|
|TLG (viii–XVI)||1,282,298 1,272,773|
|LSJ Corpus (viii–VI)||905,044|
|Mostly Pagan (viii–IV)||557,574 551,651|
|Strictly Ancient (viii–iv)||313,354 311,428|
Normalisation of forms (crasis, apostrophe, respellings)
Yeah, more bugs here. I've been case-folding word forms up to this point; I, uh, think I forgot to case-fold the normalised word forms as well. Which ends up making quite a difference.
|TLG + PHI #7||1,352,303 1,152,682|
|TLG (viii–XVI)||1,232,209 1,101,191|
|LSJ Corpus (viii–VI)||736,932|
|Mostly Pagan (viii–IV)||539,469 481,424|
|Strictly Ancient (viii–iv)||301,005 275,703|
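The effect of forgetting to case-fold the normalised forms can be illustrated with a toy example. Everything here is hypothetical: the `NORMALISED` table and the `count_distinct` function stand in for whatever the real pipeline does, and the four elided forms are made-up sample data rather than corpus output.

```python
import unicodedata

def nfc(s):
    """Normalise to NFC so that tonos/oxia encoding variants compare equal."""
    return unicodedata.normalize("NFC", s)

# Hypothetical normalisation table (elided form -> full form); not the real data.
NORMALISED = {nfc(k): nfc(v) for k, v in {
    "ἀλλ'": "ἀλλά",
    "Ἀλλ'": "Ἀλλά",
    "ἐπ'": "ἐπί",
    "Ἐπ'": "Ἐπί",
}.items()}

def count_distinct(forms, fold_normalised):
    """Count distinct normalised forms, optionally case-folding them first."""
    seen = set()
    for form in forms:
        form = nfc(form)
        norm = NORMALISED.get(form, form)    # fall back to the form itself
        seen.add(norm.casefold() if fold_normalised else norm)
    return len(seen)

forms = ["ἀλλ'", "Ἀλλ'", "ἀλλά", "ἐπ'", "Ἐπ'", "ἐπί"]
print(count_distinct(forms, fold_normalised=False))  # 4: capitalised variants counted separately
print(count_distinct(forms, fold_normalised=True))   # 2
```

In this toy set the unfolded count is double the folded one; the real corpora show a much smaller inflation, since only some forms occur both capitalised and uncapitalised.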
Eliminating nu movable
Here too I changed the way I was considering a form to have nu movable: I relied on the morphological analysis rather than doing a blanket transformation. So fewer forms now get conflated.
|TLG + PHI #7||1,307,842 1,125,784|
|TLG (viii–XVI)||1,189,688 1,074,767|
|LSJ Corpus (viii–VI)||720,855|
|Mostly Pagan (viii–IV)||519,498 470,096|
|Strictly Ancient (viii–iv)||289,812 270,115|
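To see why a blanket transformation conflates more than an analysis-driven one, here is a rough sketch. Both rules are assumptions: the blanket rule (drop any final ν after ε or ι) is a guess at what such a transformation might look like, and `HAS_NU_MOVABLE` is a hypothetical stand-in for the lemmatiser's verdict on each form.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def base_letter(ch):
    """Return a precomposed letter with its diacritics removed."""
    return unicodedata.normalize("NFD", ch)[0]

def strip_nu_blanket(form):
    """Assumed blanket rule: drop a final nu whenever it follows epsilon or iota."""
    form = nfc(form)
    if len(form) > 2 and form.endswith("ν") and base_letter(form[-2]) in "ει":
        return form[:-1]
    return form

# Hypothetical lemmatiser verdicts: is the final nu of this form movable?
HAS_NU_MOVABLE = {nfc(k): v for k, v in {
    "ἔλεγεν": True,    # verb ending with nu movable
    "ἐστίν": True,
    "πόλιν": False,    # accusative singular: the nu is part of the ending
}.items()}

def strip_nu_by_analysis(form):
    """Drop the final nu only when the analysis says it is movable."""
    form = nfc(form)
    return form[:-1] if HAS_NU_MOVABLE.get(form, False) else form

forms = ["ἔλεγεν", "ἔλεγε", "ἐστίν", "ἐστί", "πόλιν", "πόλι"]
blanket = {strip_nu_blanket(f) for f in forms}
by_analysis = {strip_nu_by_analysis(f) for f in forms}
print(len(blanket), len(by_analysis))  # 3 4
```

Under the blanket rule the accusative πόλιν collapses into the vocative πόλι, while the analysis-driven version keeps them apart.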
Eliminating non-words (abbreviations, Greek numerals, or geometric lines)
The more aggressive folding of diacritics I've put in means there aren't many of these left at all.
|TLG + PHI #7||1,300,717 1,125,699|
|TLG (viii–XVI)||1,183,120 1,074,683|
|LSJ Corpus (viii–VI)||720,800|
|Mostly Pagan (viii–IV)||518,321 470,096|
|Strictly Ancient (viii–iv)||289,275 270,093|
So, sheepishly, I find that I overestimated unique word forms by say 120,000, and the errors in how I was handling PHI #7 made me overestimate by another 50,000. When I was comparing Three Thousand Years Of Greek to Slovenian and Telugu, my average word forms per thousand word instances in the TLG was 12.6; it is now 11.3. Telugu still has 30.8, so it still wins...
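The revised figure of 11.3 can be checked against the tables, on the assumption that it is taken from the TLG row: the corrected count of word forms after non-words are eliminated, divided by the TLG word instances. The function name is made up for this check.

```python
def forms_per_thousand(word_forms, word_instances):
    """Distinct word forms per thousand running word instances."""
    return 1000 * word_forms / word_instances

# Figures taken from the TLG rows of the tables in this post.
tlg_instances = 95_475_128
tlg_forms_corrected = 1_074_683   # after eliminating non-words

rate = forms_per_thousand(tlg_forms_corrected, tlg_instances)
print(f"{rate:.1f}")  # 11.3
```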
Forms of Good Standing (without: hypothetical, hypercorrect, uncertain inflection, anomalous inflection, transliterated Latin)
|TLG + PHI #7||1,267,434 1,101,948|
|TLG (viii–XVI)||1,158,529 1,053,549|
|LSJ Corpus (viii–VI)||708,669|
|Mostly Pagan (viii–IV)||515,275 468,698|
|Strictly Ancient (viii–iv)||288,305 269,448|
Forms of Good Standing and Pedigree (linguistically Classical)
|TLG + PHI #7||1,135,915 980,867|
|TLG (viii–XVI)||1,041,520 938,084|
|LSJ Corpus (viii–VI)||676,114|
|Mostly Pagan (viii–IV)||505,302 458,756|
|Strictly Ancient (viii–iv)||285,856 266,891|
Forms of Good Standing and Cecropian Pedigree (linguistically Attic)
|TLG + PHI #7||1,020,232 889,759|
|TLG (viii–XVI)||952,993 857,008|
|LSJ Corpus (viii–VI)||604,107|
|Mostly Pagan (viii–IV)||458,933 415,869|
|Strictly Ancient (viii–iv)||248,914 232,008|