2009-07-10

Lerna VId: A correction of lemma counts

Last post had its share of egg on my face, showing systematic overcounts of word forms in the corpora. This post is another healthy serving of omelette, correcting the lemma counts given in Lerna VIa. The overall story is:
  • There are fewer distinct word forms in the PHI #7 corpus than I thought
  • There are fewer scribal alternate forms left in PHI #7: if an editor thought they knew better than the scribe, the scribe's form is left out of consideration
  • There is less dialectal and orthographic wiggle-room allowed for PHI #7
  • So as a result of all this, the count of lemmata distinctive to PHI #7 has crashed: ignoring proper names, 3,800 lemmata that the lemmatiser thought it saw in PHI #7 are no longer there.
  • The count has still crashed, even though I've added a fair few lemmata to deal with PHI #7—the most frequent names, the overlaps with Trapp's dictionary, a few stragglers from DGE—as well as some dialectal grammar and some more respelling rules. I've picked up around 800 non-names and 1200 proper names; so I'm down by 1800 lemmata from before, rather than 3800.
  • I could have kept going to add more names than that, but it's been two weeks already, for gorsakes.
  • OTOH, because I've added extra names in particular, recognition of the TLG has slightly improved. So there are a few more lemmata for just the TLG-based corpora. (*Very* few.)
  • I also did some debugging of orthographic variation in lemmata, which resulted in some conflation of variants.
  • So if you ignore proper names, the TLG lemma count... actually ended up losing a few lemmata. (Again, *very* few: a couple of hundred lemmata each way.)


So.

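(Where a cell has two figures, the first is the earlier count and the second is the corrected one.)

Corpus | Lemmata | Excluding Greek Numerals | Excluding Proper Names
TLG + PHI #7 | 216,234 → 214,381 | 211,794 → 209,952 | 175,791 → 172,646
TLG (viii–XVI) | 201,680 → 201,823 | 197,448 → 197,591 | 162,219 → 162,009
LSJ (viii–VI) | 159,636 | 156,720 | 124,215
Mostly Pagan (viii–IV) | 99,426 → 99,485 | 98,593 → 98,652 | 76,145 → 76,067
Strictly Ancient (viii–iv) | 66,437 → 66,390 | 66,078 → 66,031 | 55,003 → 54,898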
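
The size of each correction is easier to see as a difference. Here is a minimal Python sketch that works out the change in every cell; it assumes that the first figure of each pair in the table is the earlier count and the second the corrected one, and the variable names are mine, not anything from the lemmatiser.

```python
# Earlier and corrected lemma counts, copied from the table above.
# Assumption: in each pair the first figure is the earlier count and the
# second the corrected one. Column order: all lemmata, excluding Greek
# numerals, excluding proper names. (The LSJ row has a single figure per
# cell, so it is left out.)
counts = {
    "TLG + PHI #7": [(216_234, 214_381), (211_794, 209_952), (175_791, 172_646)],
    "TLG": [(201_680, 201_823), (197_448, 197_591), (162_219, 162_009)],
    "Mostly Pagan": [(99_426, 99_485), (98_593, 98_652), (76_145, 76_067)],
    "Strictly Ancient": [(66_437, 66_390), (66_078, 66_031), (55_003, 54_898)],
}

# Change per cell: corrected count minus earlier count.
changes = {
    corpus: [new - old for old, new in pairs]
    for corpus, pairs in counts.items()
}

for corpus, deltas in changes.items():
    print(f"{corpus:18}", *(f"{d:+7,}" for d in deltas))
```

On that reading, the TLG + PHI #7 total drops by about 1,850 lemmata, while the TLG-only rows move by a couple of hundred either way.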


I also had a tally that included the also-ran analyses:

Corpus | Lemmata
TLG + PHI #7 | 220,560 → 218,727
TLG (viii–XVI) | 206,161 → 206,470
LSJ (viii–VI) | 166,387
Mostly Pagan (viii–IV) | 107,257 → 107,512
Strictly Ancient (viii–iv) | 73,427 → 73,532


In all of this, I've not been paying the PHI #7 corpus that much attention, though I did make a point of slipping it into the LSJ corpus. (The LSJ coverage of inscriptions and papyri is in fact why I called up PHI #7 in the first place.) I knew there would be extra lemmata there, and this lemma count is the PHI #7 disc's chance to shine. PHI #7 has added 6.5% more word instances to the TLG's, but 16% more word forms, and 6% more lemmata! That's phenomenal!

... What on Earth am I talking about? Remember Zipf's Law: the frequency of a word form is roughly inversely proportional to its frequency rank, so a handful of forms account for most instances and the rest trail off into rarities. It's a Long Tail. If you add 6% more word instances, by the time you're already at 95 million instances, you should be getting... well, I can't do the maths, but you should be getting at most hundreds of new lemmata, not (as the table above shows) 12,000, of which only a couple of thousand are proper names. The 10,000 more lemmata of ordinary vocabulary show you that the inscriptions and papyri (the Greek of daily life and of far-flung dialects) have a very different vocabulary from the Greek of literature.
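
For the record, here is how the 12,000 and 10,000 figures fall out of the table, as a small Python sketch. It assumes the corrected counts are the second figure in each pair, and that the columns are cumulative (the "Excluding Proper Names" column also excludes Greek numerals); neither assumption is spelled out in the table itself.

```python
# Corrected counts for TLG alone and for TLG + PHI #7, from the table above.
# Assumptions: the second figure in each pair is the corrected count, and
# the "excluding proper names" column also excludes Greek numerals.
tlg_all, combined_all = 201_823, 214_381
tlg_no_numerals, combined_no_numerals = 197_591, 209_952
tlg_no_names, combined_no_names = 162_009, 172_646

# Lemmata that only appear once PHI #7 is added to the TLG.
added_all = combined_all - tlg_all
added_no_numerals = combined_no_numerals - tlg_no_numerals
added_ordinary = combined_no_names - tlg_no_names

# What remains once ordinary vocabulary is taken out: the new proper names.
added_names = added_no_numerals - added_ordinary

# Relative growth in the overall lemma count.
share = added_all / tlg_all

print(f"new lemmata overall:     {added_all:,} ({share:.1%})")
print(f"new ordinary vocabulary: {added_ordinary:,}")
print(f"new proper names:        {added_names:,}")
```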

Of course, getting 16% more word forms from PHI #7 means there are a lot of different inflections in the corpus that lie outside the TLG's ambit, because of all the non-literary dialects represented in the inscriptions. It also means a lot of misspellings that didn't belong in the TLG.

In VIa, I went into an extended riff extrapolating how many more lemmata of Greek could turn up. Let me attempt that again, this time with more detail on proper names—but *not* including proper names in the final estimate.

The reason proper names don't belong in a final tally is worth restating, because not enough people are laughing at the notion. When we want to know how many words of English there are (which we shouldn't, but I've already been through that), we don't add the New York State White Pages to the Oxford English Dictionary, and we don't start screen-scraping geonames.org. We recognise that proper names are a different kind of thing from normal words (although the boundaries are fuzzy); and we also recognise that it's problematic to say a name belongs to one language and not another.

Does Κόρινθος count as a Greek name, even though it has the prehellenic telltale -νθ-? Well sure it does. Does Ομπάμα count as a Greek name? Or Σαίξπηρ for Shakespeare? Surely not. But what about the older declinable transliteration Σακεσπήριος? Doesn't that at least look Greek? What about Αὐρήλιος? But then again, what about Ἰσαάκ? Is Αμπντουλάχ not a Greek name? But does it become a Greek name once it has been hellenised, as the Byzantines did, as Ἀβδελλᾶς? And is counting these names as part of the vocabulary of Greek a meaningful thing to do?

Well, better not to count proper names in the final tally at all; but let me add the counts I do know of, just in case someone is curious.
  • Right now, the TLG lemmatiser knows about almost 42,000 proper names. That includes most names of the Strictly Classical canon; a fair few names from later literature (including lots of Byzantine surnames); the names in Smith's Dictionary of Greek and Roman Geography; and the thousand-odd names I was shovelling in over the past fortnight, to deal with the inscriptions and papyri.
  • Pape-Benseler went into its second edition in 1863, which increased it by a third. It covers geographical, personal, and mythological names in Ancient literature, and has some coverage of later stages. It has good coverage of such inscriptions as were known at the time, and is starting to notice papyri—though remember, this is thirty years before the discovery of Oxyrhynchus. And the dictionary is reasonably good about conflating variants.

    Benseler does not say how many names he has in total, but he does say that Alpha under his revision went from 3820 names to 6120. Extrapolating from Alpha's share of LSJ, that should mean 38,000 names overall. There are clearly lemmata in Pape-Benseler that aren't in the TLG lemmatiser: I added 500 names while dealing with PHI #7, and those were only the names occurring 10 times or more in PHI #7. How much more am I missing? No idea. But I'd be surprised if it was more than 10,000.
  • In the following, I need a sense of how many of these names are personal, and how many are geographical. The Heidelberg word lists for papyri are a bit more reluctant to conflate variants than I prefer, but at least they list personal and geographical names separately: 8838 personal, 2637 geographical. Good enough for me: I'll say personal names to place names are 4:1.
  • 1863 is a long time ago in epigraphy, and the Lexicon of Greek Proper Names has been running for the past three decades to record the torrent of names found on inscriptions. It avoids mythological names (which are covered well enough in literature and Pape-Benseler), and it also does not do geographical names. It's ongoing, but its online search knows of 35,000 distinct names of people (whereas Pape-Benseler has 38,000 names of people, places, and gods). Now, the TLG lemmatiser recognises 17,600 distinct names, personal and geographical, in the ancient inscriptions on the PHI #7 disc. Guessing that 14,000 of those are personal names (4:1 ratio), that means it's missing at least 21,000 personal names.
  • The Leuven projects recognise 16,000 personal names in the papyri (with 7,000 extra variants), using the Duke Documentary Papyri corpus. The TLG lemmatiser recognises 9,600 distinct names, personal and geographical, in the same corpus on PHI #7. Guessing that 7,700 of those names are personal, it's missing at least another 8,000 names.
  • Some of the Leuven names will overlap with LGPN; but the Egyptian names won't. Let's say that all up, we're owed at least another 27,000 personal names. And using that 4:1 ratio again, another 5,000 place names. Heidelberg counts 9,000 personal names to Leuven's 16,000, and Heidelberg counts 2,600 geographical names; extrapolating up, that's consistent with 5,000.
  • That's not even scratching the surface of Byzantine and Modern names (let alone Σαίξπηρ or Ομπάμα, or the Thessalonica and Environs phone book). But so far, we can guess 42+27+5 = 74, so 74,000 names.
  • Flipping things around, there are 72,000 unrecognised capitalised words in PHI #7. That does not mean 72,000 missing names: lots of these will be misspellings of known names that the lemmatiser isn't dealing with, or different inflections of the same name. And those names are in the scope of LGPN and Leuven. I'd say the personal names are already accounted for in the 27,000 (say) personal names of the two initiatives.
  • There are a further 42,000 unrecognised capitalised words in TLG. Most of these won't be in LGPN and Leuven, though some will be in Pape-Benseler. Most of these by far are from post-Classical texts, and they include ancient gazetteers. (Ptolemy's Geography alone accounts for close to 3,000 unrecognised names.) How many of these are legitimate novel proper names? Again, no idea, but by this stage we're getting into one-offs, because all proper name word forms occurring more than 7 times in the TLG have been added to the database. I'll guess 30,000. There'll be some overlap with Leuven and LGPN, but not a lot, because many of these names are Byzantine.
  • As mentioned, the TLG is maybe 70%, maybe 75% complete for Byzantine literature, and only starting to go into Early Modern literature. It does have a lot of Byzantine surnames through church deeds (which account for 5,000 unrecognised capitalised words); so it'll have a reasonable cross-section. I haven't gone through the Byzantine prosopographies though (285-641, 642-1265, Palaeologan), to work out how many surnames they've unearthed in sum.
  • And I have not spent quality time with the Attica or Thessalonica or Nicosia phonebooks.
  • So at least 70,000 proper names to go, adding up to something like 110,000 proper names, and that count only goes up to the Fall of Constantinople.
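
The back-of-envelope steps in the list above can be sketched in a few lines of Python. The function name and the variables are mine, the 4:1 ratio is the rough guess made above, and the unrounded results are only there to show where the rounded figures in the list come from.

```python
def personal_share(total_names, ratio=4):
    """Personal names in a mixed total, assuming personal : place = ratio : 1."""
    return total_names * ratio / (ratio + 1)

# Inscriptions: LGPN's 35,000 personal names, against the 17,600 names of
# both kinds that the lemmatiser recognises in PHI #7.
missing_inscriptions = 35_000 - personal_share(17_600)

# Papyri: Leuven's 16,000 personal names, against the 9,600 names of both
# kinds that the lemmatiser recognises in the same corpus.
missing_papyri = 16_000 - personal_share(9_600)

# Place names: scale Heidelberg's 2,600 geographical names by the ratio of
# Leuven's 16,000 personal names to Heidelberg's 9,000.
place_estimate = 2_600 * 16_000 / 9_000

# Running total, in thousands: names already known, plus the personal and
# place names still owed.
running_total = 42 + 27 + 5

print(f"missing from inscriptions: {missing_inscriptions:,.0f}")
print(f"missing from papyri:       {missing_papyri:,.0f}")
print(f"place names, extrapolated: {place_estimate:,.0f}")
print(f"running total:             {running_total},000")
```

The unrounded results sit close to the 21,000, 8,000 and 5,000 used in the list; the drop from their sum to 27,000 personal names is the allowance for overlap between Leuven and LGPN, which is a judgement call rather than something this arithmetic produces.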

Anyone who wants to start boasting of the 110,000 proper names of Two And A Half Thousand Years of Greek needs to be smacked upside the head with all three volumes of the Dictionary of American Family Names, and have the printout of all 8,000,000 places on geonames.org dropped on their foot. Because all of those count as proper names of One Year of English, by the same criterion.

(The Blogger Writing These Lines enjoyed contributing to the Dictionary of American Family Names, even before he realised its value as a tool of percussive persuasion.)

So. Banishing proper names, we're left with 173,000 lemmata, as guesstimated. How much is left to go again? As it turns out, I'm doing the same guesstimates as before—but they make more sense without including proper names:
  • I keep my guesstimate of 20,000 lemmata more from Trapp (including texts not yet added to the TLG and volumes not yet published), and 10,000 lemmata more from Kriaras (ditto). That's 203,000.
  • There are words in LSJ that are not represented in this corpus. The biggest gap is the mediaeval Latin-Greek glossaries, with 1,000 missing lemmata; but there are several other oddities. The latest I've encountered, under ἐλεφαντουργική "of or pertaining to ivory-working": the 1161 AD commentary to the astrologer Paul of Alexandria, writing in 378—and last published in 1588. (The irony here is, the same adjective turns up in the rather more mainstream Heliodorus, a century beforehand.) But again, once the PHI #7 texts are in, and with the changes in text editions between the original LSJ and the TLG—not to mention the rejected scribal forms—I don't think there's more than 3,000 lemmata to add. That takes us to 206,000.
  • I'm inclined to revise my extrapolation for DGE downwards. Volume I updated may have 3500 lemmata not in LSJ, but it's competing not only with Bauer, Lampe, and Trapp, but also with the LSJ Supplement—which on its own adds 10,000 lemmata to LSJ, and which also has made a point of covering more inscriptions and papyri. I haven't taken the time to do any counting with DGE. It's a long plane trip tomorrow to Montreal—so maybe I will.

    But there's no way Volume I has 3,500 lemmata not also in LSJ/Bauer/Lampe/Trapp/LSJSupp. DGE looks like taking 20 volumes if and when it finishes. (I wasn't planning on living until 2100 AD to find out.) If there's just 500 novel lemmata in Volume I, that means 10,000 novel lemmata all up; if 1000, then 20,000, as I proposed last time. I'm feeling jaundiced, but I'll still give them 20,000. That takes us to 226,000 lemmata, up to the fall of Candia.
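
The running tally above, as a short Python sketch. All the inputs are the guesstimates from the list; the labels and variable names are only mine.

```python
# Starting point: current lemmata with proper names excluded (rounded).
base = 173_000

# Guesstimated additions, in the order they are made above.
additions = [
    ("Trapp", 20_000),
    ("Kriaras", 10_000),
    ("LSJ gaps", 3_000),
    ("DGE", 20_000),
]

# Accumulate a running total after each source is added.
running = base
totals = []
for source, extra in additions:
    running += extra
    totals.append((source, running))

# DGE scaling: novel lemmata in Volume I times the 20 projected volumes.
volumes = 20
dge_low, dge_high = 500 * volumes, 1_000 * volumes

for source, total in totals:
    print(f"after {source:9} {total:,}")
print(f"DGE range: {dge_low:,} to {dge_high:,}")
```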


Ούφ. On those figures, English still wins, :-) though not by much. The level of precision I've given is of course illusory, and in a following post I will tackle what is a more sensible question: how much vocabulary do you need to recognise n% of a text. But these counts should at least be indicative.
