2009-07-07

Lerna VIc: A correction of word form counts

This post corrects the counts given in Lerna Va and Lerna Vb, using corrected counts from the PHI #7 disc and a couple of weeks' work on the archaic dialects and proper names of the PHI #7 corpus. I've also fixed several errors in how I was counting forms as unique. The end result is that the previous counts were inflated by about 15% overall.

This post is boring (a bunch of numbers) but necessary for the record. Because the counts depend on how the lemmatiser recognises words, and the lemmatiser is not static (and neither is the corpus), these counts are not definitive; but they are more correct than the last reports. The major bug fix (as far as I can tell!) was that I'd factored out case and accentuation for the raw word forms, but had forgotten to do the same for their normalised counterparts; so all the normalised word counts were off by some 10%. But because the main conclusions compared vocabulary sizes relative to each other, they still hold.

That's why testing is a good thing, right?

I've added a new corpus to the mix: the LSJ Corpus, which is meant to approximate the coverage of LSJ. It excludes any Christian-related writing, apart from the Scriptures themselves. Otherwise (and that's a big Otherwise), it includes all pagan authors up to VI AD, including technical authors. It also includes the ancient inscriptions and papyri from PHI #7. The LSJ corpus additionally includes the lexica of Hesychius, Photius, and the Etymologicum Magnum, which were written later, but (some of the time) reach back earlier. It still leaves out the scholia on Classical literature, which explain Ancient texts with Byzantine words.

It also leaves out two Demotic texts which have ended up in collections of Ancient authors in the TLG, one under Pseudo-Hippocrates, one under the Hippiatrica. I'm taking those out of the Mostly Pagan and Strictly Ancient corpora too. The fact that a clearly XVI AD text has been lumped in with a v BC corpus should give you pause: use the author dates on the TLG with caution—they apply to the authors, but not to all the spurious works included under the author's name.

Lerna Va


Counts of unique strings in the corpora

Corpus                      Coverage                               Word Instances (previous → corrected)  Word Forms (previous → corrected)
TLG + PHI #7                (viii–XVI, +tech +christ +inscr/pap)   102,005,245 → 101,684,658              1,861,358 → 1,815,540
TLG (viii–XVI)              (viii–XVI, +tech +christ -inscr/pap)   95,475,128                             1,567,892
LSJ Corpus (viii–VI)        (viii–VI, +tech -christ +inscr/pap)    34,746,312                             1,147,454
Mostly Pagan (viii–IV)      (viii–IV, -tech -christ -inscr/pap)    16,312,159                             605,335
Strictly Ancient (viii–iv)  (viii–iv, +tech -christ +inscr/pap)    5,464,913 → 5,463,292                  334,428 → 334,187

This is where the differences start. By correcting incomplete word indications, hyphenation, and rejected scribal forms in PHI #7, I've lost 400,000 word instances, and 46,000 distinct word forms.

You can also see that going from Mostly Pagan to the LSJ corpus almost doubles the count of distinct word forms. That's adding in two more centuries of pagan literature, technical writing, inscriptions, papyri, and late lexica. The PHI #7 texts account for around a third of that increase; the rest comes from the technical writing and lexica. The lexica include a large number of one-off words, and a lot of loose Byzantine spelling. Technical writing includes even more loose Byzantine spelling, because these texts are not closely bound to Atticist literary norms.

But it also includes a lot of idiosyncratic vocabulary: medical, astrological, engineering, mathematical, not to mention all the random place names in Ptolemy and the other geographical texts. Technical writing also encompasses grammatical and philological commentary, which often means grammarians just making up tenses and cases to explain words. So there is a lot of distinctive vocabulary in technical writing; but there is also a lot of inflated vocabulary.

Stripping case and forms without diacritics

I've fixed the calculations to take out more forms with partial diacritics, so I'm now making sure that all of ανδρι, ἀνδρι and ανδρί are folded under ἀνδρί. So fewer forms from here on are considered truly distinct:

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,649,083 → 1,545,491
TLG (viii–XVI)              1,376,016 → 1,355,062
LSJ Corpus (viii–VI)        1,001,079
Mostly Pagan (viii–IV)      562,744 → 555,843
Strictly Ancient (viii–iv)  314,887 → 312,255
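
To make the folding of partially marked forms concrete, here is a minimal sketch in Python. It is not the lemmatiser's actual code: the function names (decompose, is_partial_of, fold_partial) are made up for illustration, and it assumes that "partial diacritics" means a form carries a subset of the marks found on another attested form with the same letters.

```python
import unicodedata

def decompose(form):
    """Return one (lower-case base letter, set of diacritics) pair per letter."""
    letters = []
    for ch in unicodedata.normalize("NFD", form):
        if not unicodedata.combining(ch):
            letters.append((ch.casefold(), frozenset()))
        elif letters:
            base, marks = letters[-1]
            letters[-1] = (base, marks | {ch})
    return tuple(letters)

def is_partial_of(a, b):
    """True if a and b have the same letters and every diacritic of a is also on b."""
    return len(a) == len(b) and all(
        la == lb and ma <= mb for (la, ma), (lb, mb) in zip(a, b)
    )

def fold_partial(forms):
    """Drop any form that is a less fully marked copy of another form in the list."""
    keys = {decompose(f): f for f in forms}  # case variants collapse onto one key
    return sorted(
        form for key, form in keys.items()
        if not any(other != key and is_partial_of(key, other) for other in keys)
    )

forms = ["ανδρι", "ἀνδρι", "ανδρί", "ἀνδρί"]
print(fold_partial(forms))  # only the fully marked form survives
```

Two forms with the same letters but conflicting accents stay distinct under this rule, which is the point of folding by subset rather than stripping every diacritic. The pairwise comparison is quadratic, so at corpus scale the forms would first need to be grouped by their bare letters.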

Restricting to recognised forms

Though I've added a thousand-odd proper names and some Arcadian and Cretan grammar to the lemmatiser, it still struggles with the PHI #7 corpus, as you'd expect: it's now understanding 62% of all word forms instead of 59%. There are 73,000 capitalised word forms in PHI #7, and 21,000 uncapitalised, that the lemmatiser has no idea about. For the TLG corpus, the equivalent is currently 42,000 capitalised word forms and 43,000 uncapitalised that are going unrecognised, and the TLG has seven times more word forms than PHI #7.

So there is a *lot* of vocabulary, particularly proper names, that is unique to the PHI #7 corpus, and that the lemmatiser does not yet understand. In fact, I already know there should be 16,000 distinct proper names in the papyri alone, as I mentioned last post. But once again, if I am using the lemmatiser to make morphological judgements about distinct word forms, I can't count words that the lemmatiser doesn't understand. So I have to pretend those words don't exist, for any remaining counts to mean anything.

OTOH, it's been a month, and recognition of the TLG corpus has gone up (partly because of this series of posts). The word counts are not static.

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,435,391 → 1,391,855
TLG (viii–XVI)              1,282,298 → 1,272,773
LSJ Corpus (viii–VI)        905,044
Mostly Pagan (viii–IV)      557,574 → 551,651
Strictly Ancient (viii–iv)  313,354 → 311,428
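
The restriction to recognised forms amounts to a filter plus a recognition rate. Here is a rough sketch in Python, assuming a hypothetical analyse callable that returns a (possibly empty) list of analyses for a form. The toy lexicon and the sample forms are invented for illustration and say nothing about what the real lemmatiser contains.

```python
def split_recognised(forms, analyse):
    """Separate forms with at least one analysis from forms with none."""
    recognised = [f for f in forms if analyse(f)]
    unrecognised = [f for f in forms if not analyse(f)]
    return recognised, unrecognised

# Hypothetical toy lexicon standing in for the lemmatiser.
TOY_LEXICON = {
    "λόγος": ["λόγος, nominative singular"],
    "ἀνδρί": ["ἀνήρ, dative singular"],
}

def toy_analyse(form):
    return TOY_LEXICON.get(form, [])

# Two known forms plus a proper name the toy lexicon does not cover.
forms = list(TOY_LEXICON) + ["Ψενοσῖρις"]
recognised, unrecognised = split_recognised(forms, toy_analyse)
rate = len(recognised) / len(forms)
print(f"{rate:.0%} of forms recognised; {len(unrecognised)} set aside")
```

Only the recognised list feeds the later counts, which is why any improvement in the lemmatiser moves every table after this point.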

Normalisation of forms (crasis, apostrophe, respellings)

Yeah, more bugs here. I've been case-folding word forms up to this point; I, uh, think I forgot to case-fold the normalised word forms as well. Which ends up making quite a difference.

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,352,303 → 1,152,682
TLG (viii–XVI)              1,232,209 → 1,101,191
LSJ Corpus (viii–VI)        736,932
Mostly Pagan (viii–IV)      539,469 → 481,424
Strictly Ancient (viii–iv)  301,005 → 275,703
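
The effect of forgetting to case-fold the normalised side can be shown with a toy example. This is only a sketch under assumptions: the NORMALISED table is a hypothetical stand-in for whatever the normaliser returns, and count_distinct is an invented helper, not the code that produced the figures above.

```python
import unicodedata

# Hypothetical normaliser output (raw form -> normalised form); illustrative only.
NORMALISED = {
    "ἀλλ'": "ἀλλά",
    "Ἀλλ'": "Ἀλλά",
    "ἐπ'": "ἐπί",
    "Ἐπ'": "Ἐπί",
}

def count_distinct(raw_forms, fold_output):
    """Count distinct normalised forms, optionally case-folding the normalised side."""
    seen = set()
    for raw in raw_forms:
        norm = unicodedata.normalize("NFC", NORMALISED.get(raw, raw))
        seen.add(norm.casefold() if fold_output else norm)
    return len(seen)

raw_forms = list(NORMALISED)
print(count_distinct(raw_forms, fold_output=False))  # 4: capitalised variants counted separately
print(count_distinct(raw_forms, fold_output=True))   # 2
```

Without the fold, every capitalised variant that survives normalisation is counted as a separate form, which is the kind of inflation the corrected column removes.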

Eliminating nu movable

Here too I changed the way I was deciding whether a form has nu movable: I now rely on the morphological analysis rather than doing a blanket transformation. So fewer forms now get conflated.

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,307,842 → 1,125,784
TLG (viii–XVI)              1,189,688 → 1,074,767
LSJ Corpus (viii–VI)        720,855
Mostly Pagan (viii–IV)      519,498 → 470,096
Strictly Ancient (viii–iv)  289,812 → 270,115
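
The difference between a blanket rule and an analysis-driven one can be sketched as follows. Both functions are illustrative assumptions: the blanket rule here (drop a final ν after ε or ι) is a guess at what such a transformation might look like, and NU_MOVABLE is a hypothetical stand-in for the lemmatiser's morphological analysis.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def strip_nu_blanket(form):
    """Blanket rule: drop a final ν whenever it follows ε or ι, whatever the word is."""
    form = nfc(form)
    if len(form) > 2 and form.endswith("ν"):
        vowel = unicodedata.normalize("NFD", form[-2])[0]  # vowel without its accent
        if vowel in "ει":
            return form[:-1]
    return form

# Hypothetical stand-in for the lemmatiser: forms it analyses as carrying nu movable.
NU_MOVABLE = {nfc(f) for f in ("ἐστίν", "ἔλυσεν")}

def strip_nu_by_analysis(form):
    """Drop the final ν only where the analysis says it is a nu movable."""
    form = nfc(form)
    return form[:-1] if form in NU_MOVABLE else form

forms = ["ἐστίν", "ἐστί", "πόλιν", "πόλι"]
print(len({strip_nu_blanket(f) for f in forms}))      # 2
print(len({strip_nu_by_analysis(f) for f in forms}))  # 3
```

The blanket rule wrongly merges the accusative πόλιν with the vocative πόλι, while the analysis-driven version only merges ἐστίν with ἐστί. The real rules are certainly more involved than this.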

Eliminating non-words (abbreviations, Greek numerals, or geometric lines)

The more aggressive folding of diacritics I've put in means there aren't many of these left at all.

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,300,717 → 1,125,699
TLG (viii–XVI)              1,183,120 → 1,074,683
LSJ Corpus (viii–VI)        720,800
Mostly Pagan (viii–IV)      518,321 → 470,096
Strictly Ancient (viii–iv)  289,275 → 270,093


So, sheepishly, I find that I overestimated unique word forms by, say, 120,000, and the errors in how I was handling PHI #7 made me overestimate by another 50,000. When I was comparing Three Thousand Years Of Greek to Slovenian and Telugu, my average word forms per thousand word instances in the TLG was 12.6; it is now 11.3. Telugu still has 30.8, so it still wins...
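
As a quick check on the arithmetic, the figures can be recomputed from the tables. This sketch assumes that the 11.3 figure comes from the corrected TLG row of the last Lerna Va table divided by the TLG word instance count in the first table, and that the overall inflation is measured on the TLG + PHI #7 row of that same last table. Those pairings are my reading of the post, not something it states outright.

```python
def forms_per_thousand(word_forms, word_instances):
    """Distinct word forms per thousand running words."""
    return 1000 * word_forms / word_instances

tlg_instances = 95_475_128       # TLG row, first table
tlg_forms_corrected = 1_074_683  # TLG row, after eliminating non-words
ratio = forms_per_thousand(tlg_forms_corrected, tlg_instances)
print(round(ratio, 1))  # 11.3

# TLG + PHI #7 row after eliminating non-words: previous vs corrected.
old_all, new_all = 1_300_717, 1_125_699
overestimate = old_all - new_all
inflation = overestimate / new_all
print(overestimate)          # 175018
print(f"{inflation:.1%}")    # 15.5%
```

The 175,018 difference is in line with the rough 120,000 plus 50,000 split given above, and the 15.5% matches the inflation figure quoted at the top of the post.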

Lerna Vb


Forms of Good Standing (without: hypothetical, hypercorrect, uncertain inflection, anomalous inflection, transliterated Latin)

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,267,434 → 1,101,948
TLG (viii–XVI)              1,158,529 → 1,053,549
LSJ Corpus (viii–VI)        708,669
Mostly Pagan (viii–IV)      515,275 → 468,698
Strictly Ancient (viii–iv)  288,305 → 269,448

Forms of Good Standing and Pedigree (linguistically Classical)

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,135,915 → 980,867
TLG (viii–XVI)              1,041,520 → 938,084
LSJ Corpus (viii–VI)        676,114
Mostly Pagan (viii–IV)      505,302 → 458,756
Strictly Ancient (viii–iv)  285,856 → 266,891

Forms of Good Standing and Cecropian Pedigree (linguistically Attic)

Corpus                      Word Forms (previous → corrected)
TLG + PHI #7                1,020,232 → 889,759
TLG (viii–XVI)              952,993 → 857,008
LSJ Corpus (viii–VI)        604,107
Mostly Pagan (viii–IV)      458,933 → 415,869
Strictly Ancient (viii–iv)  248,914 → 232,008
