Lerna VIb: A derailing of lemma counts

You may have noticed an extended radio silence for the last couple of weeks in the series counting lemmata. The people at the Magnificent Nikos Sarantakos' blog, where the good fight against Lerna is fought, know why: I found some problems in the way I was counting lemmata in the inscriptions and papyrus corpus (PHI #7), which I've been nowhere near as familiar with as the TLG corpus. As a result, I'm down 2,000-odd lemmata from where I thought I was. Because I spent lots of posts on how contingent and provisional any count of lemmata is, that should not be that big a deal: a ±1% shift in the lemma count is within the bounds of what can happen when you fix first-cut errors.

Still, it's embarrassed me enough, now that people are starting to quote the Lerna VIa count of 211,794 (including Nikos Sarantakos, fighting the good fight), that I tried to get to the bottom of it. In the process, I've worked to treat the PHI #7 corpus less cursorily than I had done. Cleaning up problems in the PHI #7 markup, and clueing the lemmatiser in on some of the peculiarities of the dialects in the corpus, mean that the counts give a more accurate picture of what is going on in those texts. The problem is, the longer I spent fixing my handling of PHI #7, the more the lemma count fell, *even as I was busy adding lemmata from elsewhere* (DGE, Pape-Benseler, Foraboschi). Erk. The counts are more accurate (with a catch I'll talk about), but they're not what they were.

I'm going to air some of the dirty laundry here, to cement the point yet again that any count of lemmata is going to be unstable. After that, next post is going to revise the counts that need revising. Then, the promised posts that got derailed: how many of these lemmata count as Ancient; relating lemma counts to recognition percentages (which is the only way lemma counts are meaningful); and distinguishing word variants from lemmata.

The first issue was when I wanted to count how many lemmata should be considered Ancient. I realised I had not been counting a couple of thousand lemmata from Lampe's and Trapp's dictionaries (I-VIII AD and "IX-XII" AD) as post-classical. That did not particularly affect the accuracy of recognition for the TLG (as I confirmed by rerunning the program), but it was distorting the numbers: there are fewer "word forms of good pedigree" than I said there are. So you'll get new numbers for that.

The second catch was when I found a bug in how I was extracting word forms from the PHI #7 corpus, which meant that several hyphens were being ignored, so a hyphenated word would be extracted as two separate words. Once I fixed that bug, I also noticed that some of the markers that a word was fragmentary weren't being picked up. For instance, I knew that notation like ...]atisatio[ indicated bits of a word were missing from the papyrus or inscription; I didn't know that PHI #7 was also using dashes, like – – ]atisatio[ – –. Fixing these problems results in fewer complete word instances being extracted, but of course the ones that are extracted are more often correct. Even if some lemmata that looked like they were there are no longer recognised, there should be more correct long words turning up. So that should not cause any drastic drops in the size of the vocabulary.
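
To make the two fixes concrete, here is a minimal sketch of that kind of extraction step. It is not my actual extraction code: the conventions it assumes (a trailing hyphen for a word continued on the next line, square brackets for a broken edge, runs of dots or spaced dashes for a gap) are simplified from the examples above, and the function name extract_words is made up for the illustration.

```python
import re

# Assumed, simplified conventions (not the real PHI #7 encoding):
#   "-" at the end of a line   -> the word continues on the next line
#   "[" or "]" inside a token  -> part of the word is lost or restored
#   "..." or "– –"             -> a gap of unknown length
LACUNA = re.compile(r"\.{2,}|–(?:\s*–)+")

def extract_words(lines):
    """Return (word, is_fragment) pairs from a list of transcribed lines."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            text = text[:-1] + line       # rejoin the two halves of a hyphenated word
        else:
            text = f"{text} {line}"
    words = []
    for token in LACUNA.sub(" ", text).split():
        fragment = "[" in token or "]" in token
        word = re.sub(r"[\[\]]", "", token).strip(".,;·")
        if word:
            words.append((word, fragment))
    return words
```

Fed the three lines "the lemma-", "tiser saw ...]atisatio[ and", "– – ]atisatio[ – –", this keeps lemmatiser as one complete word and flags both instances of atisatio as fragments, so they can be kept out of the count of complete word instances.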

The next three problem fixes seem to be what's caused issues. Papyri are spelled phonetically, by the norms of Koine Greek, so the lemmatiser allows for some spelling variation: ι for ει, for instance, or ω for ο. Inscriptions and legal deeds from Late Byzantium need to allow for a lot more spelling variation, because of the many Ancient phonemes that had ended up pronounced identically: so ι could now be a misspelling of any of η ει οι υ υι.

Archaic inscriptions, on the other hand, may have a narrower range of respellings than papyri (depends on how early), but they also have different spellings of their own, because they use different versions of the Greek alphabet: ω and ου were Ionic innovations in the alphabet, for example, and what conventional Ancient orthography spells as ω and ου, most inscriptions before iv BC spell as just ο. So unlike papyri or church deeds, a system dealing with inscriptions has to allow ο to stand for ω or ου.

The lemmatisation run over PHI #7 that I'd reported was allowing all possible respellings from all periods indiscriminately. So an XIV AD document was being allowed the same latitude in spelling as a vii BC document.

Yeah, you can see how that might be a problem. I fixed this by allowing different respelling rules for the three parts of the corpus: the ancient inscriptions, the papyri, and the Christian inscriptions (which run all the way to Ottoman times). There'll still be some wrong respellings, because each part of the corpus spans a long period. But it'll be a lot better than allowing XIV AD iotacism in a vii BC text. Of course, restricting respellings means that lemmata that were being over-recognised in texts now aren't. That's fair enough.
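
As a rough illustration of what per-part respelling rules look like, here is a sketch using only the letter correspondences mentioned above. The table layout, the labels for the three parts, and the function name candidate_spellings are assumptions made for the example; the real rule sets are longer, and a real lemmatiser also has to cope with accents and breathings, which this ignores.

```python
from itertools import product

# Written letter -> standard spellings it may stand for, per part of the corpus.
# Only the correspondences mentioned in the post; the real tables are longer.
RESPELLINGS = {
    "ancient_inscriptions": {"ο": ["ω", "ου"]},
    "papyri": {"ι": ["ει"], "ω": ["ο"]},
    "christian_inscriptions": {"ι": ["η", "ει", "οι", "υ", "υι"]},
}

def candidate_spellings(written, subcorpus):
    """All standard spellings the written form could stand for in that sub-corpus."""
    rules = RESPELLINGS[subcorpus]
    # Each letter may stay as written or be replaced by any of its alternatives.
    options = [[ch] + rules.get(ch, []) for ch in written]
    return {"".join(combo) for combo in product(*options)}
```

With these tables, written το can be read as του or τω in an ancient inscription but only as το in a papyrus, while written ις has two candidate readings in a papyrus and six in a Christian inscription. The candidates would then be checked against the dictionary; the number of combinations grows quickly with word length, so anything beyond a sketch would want to prune as it goes.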

I also tried to restrict the lemmata that were allowed for each part of the corpus, to prevent absurdities. Modern Greek words couldn't be allowed for Ancient texts of course, but they do show up in the late Christian inscriptions. The ancient inscriptions do keep going well into Roman times, so I couldn't ban Koine lemmata from there; but I did try to keep recognition plausible, by blocking from the papyri and ancient inscriptions any words unique to Trapp's dictionary.

That's underestimating both Trapp and the papyri. The papyri keep going until Greek yielded to Arabic in Egypt—a generation or so after the Islamic conquest, so VIII AD. Trapp, OTOH, badges itself as IX-XII AD—but it also sets out to fill in gaps left by other dictionaries, so it can be the only place where late papyri get covered. So some lemmata that should have been allowed for the papyri were being blocked. But having checked, only 150-odd legitimate lemmata were affected (and are now back in). So that wasn't the major disruption.

The other problem, as far as I can tell, was that PHI #7 allows in its markup both the word or phrasing the editor thinks the text is saying, and (in special brackets) the odd wording the scribe actually wrote; e.g. lemmatisation {4lmmeatsiantion}4. If an editor has decided to correct lmmeatsiantion as lemmatisation, I decided, I shouldn't be trying to analyse both. The editor's fix should count as the word instance for recognition: the "misspelling" (as the editor has judged it) shouldn't be considered an independent word. It looks like, in the process, some words LSJ says existed no longer turn up, because LSJ didn't trust the editor as much as I do. But all texts from a papyrus or inscription get filtered by the editor publishing it, and making sense of it, just like all the literary texts in the TLG. So that's the consistent thing to do.
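
In code terms the decision amounts to something like the following sketch. The {4 ... }4 bracket form is taken from the single example above; whether the real markup nests or has other variants is not covered here, and both function names are invented for the illustration.

```python
import re

# Scribal original in the assumed form {4...}4, plus any whitespace before it.
SCRIBAL = re.compile(r"\s*\{4[^{}]*\}4")

def editor_reading(text):
    """Drop the scribal originals, keeping only the editor's corrected wording."""
    return SCRIBAL.sub("", text)

def scribal_originals(text):
    """Collect the scribal originals separately, in case they are revisited later."""
    return re.findall(r"\{4([^{}]*)\}4", text)
```

So "a lemmatisation {4lmmeatsiantion}4 run" is analysed as "a lemmatisation run", and lmmeatsiantion is set aside rather than thrown away.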

All up, skipping proper names, 3,500-odd lemmata are no longer turning up as recognised. OTOH, 700 lemmata are now newly turning up that weren't before. Those numbers are still subject to change; but most of the 3,500 lemmata that disappeared should have disappeared. Scribal originals like lmmeatsiantion arguably shouldn't have disappeared, and I may end up revisiting them down the road. But I've already spent two weeks trying to deal with the vanished 3,500, and I shouldn't be holding postings up much longer.

To compensate for the missing 3,500, I went through the PHI #7 corpus, and looked more closely at what kinds of words weren't being recognised—making sure that words occurring frequently in the corpus were accounted for. That involved some tweaking in the allowable spelling variations, and some filling in of the more obscure dialects' grammar.
I had no idea what the Arcadian first declension genitive was like—or how it's spread. Arcadian τρίταυ /trítau/ "of the third" corresponds to Homeric masculine τρίταο /trítao/ (Attic τρίτου /trítoː/), but it's also spread to the feminine, displacing Proto-Greek and Doric τρίτας /trítaːs/ (Attic τρίτης /trítɛːs/). Arcadian τρίταυ reminds me of the Esperanto -aŭ ending; I wonder if I'm the first person to have had that mental short-circuit.

Beyond that, if the dictionaries that the TLG lemmatiser already knew about didn't account for frequent word forms, I checked them in DGE. After all, part of DGE's reason for existence was to broaden the coverage of LSJ into new finds in inscriptions and papyri. For lowercase words (i.e. excluding proper names), I went through all word forms occurring more than twice in the corpus; DGE is up to εκ-, and I did end up adding new lemmata from DGE, unique to this corpus.

The count of lemmata I added to the vocabulary from DGE... was 12. This surprised me, especially because even between α and αλ (for which DGE went back and redid Vol. I) there were a few word forms still unaccounted for on PHI #7. Going down to word forms occurring just twice or once will account for a lot more than 12 lemmata from DGE; but it won't account for thousands. The remaining gaps even after DGE are something I'll be looking at again: I'm curious to work out what's going on. Of course, PHI #7 is nowhere near a complete corpus even for 1995 when it was published, let alone now, with the continuous stream of inscriptions and papyri being transcribed and published. Only the Athenian curse tablets from Audollent's 1904 collection, for example, are in. (So when I looked at how καταχθόνιος and χθόνιος are used in the tablets for a paper, I had to do eyeballing as well as keyboard searching.)

I also wanted to improve the recognition of proper names particular to PHI #7, where the lemmatiser is really struggling: it now recognises 46% of all capitalised words, vs. 89% of all lowercase words. As I keep saying, proper names shouldn't count at all, but a couple of thousand instances of Πεθέως drawing a blank from the lemmatiser was a bit much for me. Moreover, if the lemmatiser isn't told about a proper name, it will end up making wrong guesses about what the lemma actually is. There are several inscriptions-only names that I was able to find in Pape-Benseler; but the big store of unrecognised names is in the papyri. And there's a simple reason why so many names from papyri drew a blank from the Greek lemmatiser: they're not Greek names, but Egyptian.

Of course, adding 500 or 1000 Egyptian names to improve Greek word recognition sounds suspect, right? But no more suspect than adding Hebrew names to improve recognition of words in the Septuagint, or Roman names to improve recognition of Cassius Dio. That, after all, is why proper names don't count when you count lemmata.

I'm using Foraboschi as my Egyptian phone book; it's the update to Preisigke's Namenbuch, which itself seems to be AWOL at the moment, in transit from Monash University to the University of Melbourne. My bloody fault for not waiting to drive over to Monash on the weekend: it's just 10 minutes up the road from my place.

People whose day job it is to look at names in papyri (several projects based at Leuven) have already been counting the proper names in the Duke Database of Documentary Papyri, which is what PHI #7 uses for the papyri. So they're doing the electronic counterpart to the dead tree phone book I'm sampling. The Leuven projects have come up with 26,000 name variants in the corpus, in 16,500 lemmata—and the majority of them are Egyptian, and unknown to other corpora of Greek (although a few of them make it to Athanasius of Alexandria or the Desert Fathers, who after all were also Egyptians). I'm not proposing to sit down and add 16,500 lemmata to the lemmatiser database: this is not my day job. I'm aiming at adding around 1,000, as triage prioritising the most frequent names; that'll account for uppercase word forms turning up 10 or more times in the corpus.

So, I'm going to tell you I know of 1,000 Egyptian names in PHI #7, when the Leuven papyrologists know there are 16,500? Why yes. Just like I'm telling you I know of 35,000 proper names in the TLG, when there are 42,000 uppercase words in the TLG unaccounted for. I don't know how many names there are, and that's not what the current lemma count is about. But I do know how many names account for n% of the corpus, for a suitably large n. The name count is not open-ended, but it is pretty large, larger than for common nouns.
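
The "how many names account for n% of the corpus" question is just a cumulative frequency count. A minimal sketch, assuming the corpus has already been reduced to a flat list of name instances (the function name and the toy data in the note below are made up for the illustration):

```python
from collections import Counter

def forms_needed_for_coverage(tokens, target):
    """Smallest number of distinct forms, most frequent first, whose instances
    make up at least `target` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)  # only reached when there are no tokens at all
```

On a toy list of ten name instances, six of one name, three of a second and one of a third, a single name covers half the instances, two names cover 80%, and the third name is only needed to reach 100%. That long tail is why adding the most frequent thousand names is worth doing even when 16,500 are known to exist.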

In fact, the good folk at Heidelberg Uni Centre for Research on Antiquity have produced lists of lemmata in papyri. They've got around 22,000 lemmata, half of them names. So Leuven knows more names than Heidelberg knows, presumably because Heidelberg is using a smaller corpus. I know fewer than either. And the more papyri turn up, the more names and nouns and verbs will turn up. Lemmata are open-ended.

But while Heidelberg's 11,000 names can turn into 16,500, they won't turn into a million. And while 175,000 lemmata without names can turn into 173,000 when I fix PHI #7 (and maybe 220,000 once both the dictionaries and the corpus are complete up to 1453 or 1669), it's not going to turn into five million. Even if you count all the variants of dialect and spelling and phonology in lemmata, as I'll attempt in the final installment (which is how Leuven get from 16,500 names to 26,000, and the OED gets from 230,000 lemmata to 610,000 variants), even then, you're not getting to a million. (My current back-of-the-envelope calculation without names is around 350,000.)

OK, I've got some Egyptian names to go before I revise the published counts.


