Ἡλληνιστεύκοντος: July 2009

2009-07-21

Lerna VIIc: Variants

The various counts of lemmata that I've been putting out for the last while have made little mention of the difficulty in deciding whether two forms belong to variants of the same lemma, or distinct lemmata. The judgement call is difficult enough within a homogeneous language, with slight variations in derivational morphology. It's even worse with a large linguistic span like we've been dealing with, with lots of dialectal variation, phonological change across time, and spelling mutability.

So confronted with two similar nominative singulars in the vocabulary, or two 1st person present indicatives, you need to decide whether you'll count them as the same lexeme or not. And how liberal you are in your counting will decide how many lemmata you count, and how many you dismiss as variants.

To illustrate with English: you won't count color and colour as distinct lemmata, nor publicise and publicize. You won't count recieve as a distinct lemma from receive: misspellings happen, and you still need to count those misspellings as something. Though imbed is not a misspelling of embed, but a different derivation, you'd still want to conflate them too as the one headword.

OTOH, some people draw a distinction between racist and racialist. You may not, but if enough people do, you have to treat them as distinct. (The fact that the OED has decided they're now the same thing does not mean the entire language community has.) That's a judgement call in itself: people are uncomfortable with morphological variation as much as with any other kind, and people swore to the OED that there's a meaning distinction between gray and grey too. So there's no right answers. But there are arbitrary decisions.

Dictionaries normally make those arbitrary decisions for you, because they cross-reference variants to main headwords, and you can rely on their judgement. But dictionaries' judgement can be as arbitrary as any other's: the distinctions can be fine—particularly with variant suffixes, as we'll see. And some dictionaries conflate variants more effectively than others. Kriaras has a lot of phonetic variability, and hides a lot more spelling variability, because of its normalised modern spelling. But it does a reasonable job of indicating which forms are just variants.

By contrast, LSJ's more discursive entries include derived lemmata and similar lemmata, as well as simple variants, in the one entry. So it's harder for automated processing of dictionaries, such as the TLG lemmatiser does, to trust that two words cited in the one entry are in fact the same lexeme. Because the automated processing errs on the side of caution, it will distinguish variants as lemmata more than it should, which inflates the lemma count. The TLG lemmatiser currently sees two or more lemmata in some 12,000 LSJ entries; eyeballing, I'd say maybe a fifth of those could arguably be conflated. There are also a number of lemmata in later dictionaries (Trapp and Kriaras most notably) which could also arguably be conflated with lemmata in LSJ, but haven't been yet, because I haven't been through all 70,000-odd of their lemmata manually. (I'm not bold enough to guess how many.)

So there is a margin of overcounting lemmata; OTOH, there are many more instances where one could argue lemmata are undercounted, because the lemmatiser conflates variants. *I* won't argue that, because I've been responsible for a lot of the conflating. But this has been driven by a particular take on the vocabulary of Greek: if you're searching for words through a search engine, you're likelier to care about meaning than inflection, and you'll want the search to retrieve any word instance that looks enough like your word to match. You won't want to skip matches just because the spelling is slightly off. So for that purpose, fewer lemmata, meaning more search hits per lemma, is a good thing.

This has meant that, where there was doubt about whether a form is a distinct lemma or not, e.g. in LSJ cross-references, or variation between LSJ and Trapp, I've usually conflated those forms that I've been through manually. That's not everyone's purpose. If you're trying to inflate the count of distinct words of Greek, it's definitely not your purpose. Even if that isn't what you're trying to do, there will be disagreements on how much to conflate.

Part of those disagreements are tied to the dead tree: paper dictionaries have been understandably more reluctant to conflate variants that are alphabetically distant from each other. But there have been times that the choice to conflate hasn't been clear; and times when I've decided not to conflate. (The TLG search engine does display cross-referencing to the user, so that decision is not fatal.)

So what sorts of things may or may not be conflated as the same lemma in Greek? This laundry list will cover at least some of it:

Dialectal phonological differences: For some of the ancient dialects some of the time, these differences are so predictable, they're not even mentioned in LSJ entries. The big example is the different treatment of Proto-Greek */aː/ in Doric and Aeolic (stays α), in Ionic (goes to η), and in Attic (η except after ε, ι, ρ). The second example is Aeolic being accented as far back in the word as humanly possible, something which in the rest of Greek regularly happens for just verbs and compounds. There's more, and together they all mean that you shouldn't count Attic βοηθέω, Ionic βωθέω, and Doric βοαθοέω as different verbs for "help". Nor for that matter Aeolic βαθόημι: the different inflection is normal for Aeolic, and α is a plausible fate for /oaː/. (Just like the Ionic ω is.)
Classical spelling variations: If the odd inscription spells βοηθέω as βοιηθέω, that still isn't reason enough to call it a different lemma. Some spelling variation is endemic to the Classical language, because the pronunciation of those words was in flux—as occurs at any stage of any language. Usually, dictionaries will sweep this spelling variation under the carpet too, in generic cross-references like "κρεω-: see κρεο-". Once LSJ has worked out that λιπ- was the original pronunciation of compounds meaning "lacking in", every compound that can be spelled with λειπ- is listed under λιπ- .
... With the inconvenient exception of λειπογνώμων "without an inspector (or tariff, or distinguishing mark)", for which no ancient authority gives a λιπ- spelling, because the pronunciation was already changing. LSJ does dismiss the mediaeval spellings of the word as λιπογνώμων; but really, who could blame the mediaevals?
Late spelling variations: λιπογνώμων is very, *very* far from the only instance when mediaeval or even Hellenistic writers could not cope with the increasingly historical spelling of Greek, and spelled words more creatively than LSJ allows. LSJ is an historical dictionary, so its business is to go back to the original phonology of the word where possible, and dismiss everything outside its ambit. But if the manuscripts and papyri spell the word differently, that does not mean it's a different word. I've spent several entertaining months helping the lemmatiser cope with λλ as an alternate spelling of λ, or ι as an alternate spelling of ει, or σζ as an... interesting spelling of σ.
Hypotheticals: In making sense of the derivation of words, grammarians would often suggest what they thought the real underlying form of a word was. For example, in discussing ἀζηχής "continuous", they would suggest the form is underlyingly ἀδιηχής "unseparated" or ἀδιεχθής "unhostile". I have on occasion conflated these hypothetical derivations with the word they're explaining, when there was not anything else to be done with them.
Hypercorrections: With written Greek increasingly remote from the spoken language, Byzantine writers did increasingly oddball things to make their words sound high-falutin'. Theodore Metochites' Homer-Through-The-Looking-Glass is one egregious instance, the result of too much Classical learning ("if Homer said νοῦσος for νόσος, then I'll say σουφία for σοφία"). Theodore Studites' is another, and the result of not enough Classical learning: I'll never quite get over him working out that the imperfect of περισσεύω is περι-έσσευε. Throughout the period, there's a persistent tendency to stress words on the "wrong" syllable. I assume the authorities don't discuss it because it was beneath their notice; and I assume the Byzantines did it Because They Could.
This means that Byzantines have made up a bunch of variants for lemmata which did not exist, quite artificially. However artificial they are, they turn up in the corpus, so they need to be accounted for; but they shouldn't be counted as new words. They're old words in clown outfits.
Language change: But Classical words don't only turn up in the corpus in either chlamys or clown outfits. They also turn up in denim. Hm, I don't know if that analogy is going to work...
Variation in words is also going to result from natural language change. (In fact, differentiating natural and artificial language change is harder than it looks.) As we saw in earlier episodes, the *anr- stem of Proto-Greek, "man", turns up as ἀνήρ, ἀνδρός in Classical Greek, as ἄνδρας, ἄνδρα in Byzantine Greek, and as ἄντρας, ἄντρα in Modern Greek. The third declension of the original has been done away with, and the Ancient /ndr/ cluster is now spelled differently, leaving the eye-pronunciation /nðr/ to learned forms.
For the purposes that the TLG lemmatiser is put to—searching words in a diachronic corpus—that's not enough reason to call them different lemmata. ἄντρας is the natural development of ἀνήρ, it means the same thing, so they count as the same thing. But morphologically ἄντρας has already moved on from ἀνήρ, and counting them as the same is a diachronic artifice. It's only because we're trying to cover three thousand years in the one corpus, that we're conflating words three thousand years apart.
Variations in compounding: Putting stem A and stem B together forms you a new compound AB. That compound should for the most part count as the same word, regardless of slight differences in how the compound is put together. So ὑδατοφόρος "water-bearing" does not show up in the dictionaries, but ὑδροφόρος "water-bearing" does; they should count as variants of the same lemma, since hydro and hydato are allomorphs of the same noun, /húdɔːr/ gen. /húdatos/ "water". This, already, is a conflation lexicographers will not be equally eager to embrace.
Similarly, Classical Greek has δαφνηφόρος "carrying laurels" (as a ceremonial act), using -η- as a combining vowel; Late Greek no longer used -η- as a combining vowel, so the word turns up there as the more regular δαφνοφόρος. I'm disinclined to call these distinct lemmata. Others may not be.
Variations in inflection: Here things really do start getting murky: if the inflection of two forms is slightly different, though their stem is the same, should they count as the same lemma? I have sometimes allowed them to, especially if there is a distinction between an earlier and a later, more transparent or commonplace form. So when Theodore Studites, with his shaky command of classical morphology, uses εὐέλπης, εὐέλπες instead of εὔελπις, εὔελπι, or when he forms the aorist passive participle of εὐκρινέω as εὐκριθέν, implying a citation form εὐκρίνω, I smile benignly, and allow that he's gotten the Classical form wrong, not that he's come up with a brand new lemma. When the variants are contemporary, I'm more reluctant to make that conflation. So I have let ἀθλεύω and ἀθλέω remain distinct verbs. That's not to say I've never done such a conflation, especially when LSJ has said "= sq."; but I've had less of a motivation to.

So I have come up with a count of lemmata that conflates some variants and deflates others. In conflating variants, I am coming up with a lower count of lemmata than others might. Which means there is a question mark over the count of 173,000 lemmata I've claimed.

Well, good. Like I keep saying, there's a question mark over any count of words of any sort. But given that the lemmata contain multitudes, it's worth uncovering how many of those multitudes there are. So I'm going to count up the variants.

The first warning about these counts is that this is still not how you're going to beat English in counts of words. The OED counts some 250,000 lemmata (and its coverage of Old and Middle English is *not* exhaustive); with variants, that gets up to 615,000. There will be a lot more than 173,000 variants to count in the Greek corpus; but there won't be 615,000.

The second is to explain what I'm counting.

I allow any variation in the lexical root, recorded in the lexical database, to count as a distinct variant. So ἀνήρ, ἄνδρας, and ἄντρας count as three variants, and Doric ἀγέννατος and Attic ἀγέννητος as two.
I do *not* count dialectal or diachronic change in the same inflection paradigm (including tense stems) as different. So I don't count both Ionic χώρ-η and Attic χώρ-α, or Doric δωρ-ίσδω and Attic δωρ-ίζω, as different.
I don't count dialectal variants in preverbs, because they are not part of the root; so ξυμμαχέω is not counted separately from συμμαχέω.
I don't count uppercase and lowercase variants of the same lemma. (But I do count them when they are distinct lemmata.)
I don't count active and passive variants of the same root verb as distinct.
And I don't count the ad hoc respellings that the lemmatiser can do on the fly to recognise deviations, particularly in diplomatic editions.

So. I had 214,381 lemmata in the corpus. Without proper names and Milesian numbers, that came down to 172,646. How many variants does that translate to?

Variants: 362,947
Without numbers: 352,895
Without names and numbers: 286,652.

(I'd guessed around 350,000 variants a couple of postings ago. That's pretty good. It would be even better, if my guess hadn't excluded proper names...)

This amounts to 1.7 variants per lemma. I'll admit to some surprise that the OED ratio of variants to lemmata is more like 2.5: Greek historical spelling should allow for comparable confusion. My suspicion is it does, and I'm discounting ad hoc misspellings which the OED doesn't.

Names are slightly more variable than normal words: 66,243 variants for 37,306 names, which is a ratio of 1.8:1. Foreign names in particular get mangled in several ways, including creative hellenisations: that's why there are 75 different variants of "Muhammad", and 43 variants of "Lombard". (Those examples aren't fair, since "Muhammad" includes the Turkish "Mehmed", and "Lombard" also includes the earlier "Longibard"—so again, I'm conflating variants more than some might.)
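
The ratios quoted above are simple divisions, and a few lines of Python reproduce them from the counts given in this post. This is only an arithmetic check: the labels and the function name are made up for the sketch, and the OED figures are the rounded ones quoted earlier.

```python
# (variants, lemmata) pairs as quoted in the post; the labels are ad hoc.
counts = {
    "all lemmata": (362_947, 214_381),
    "without names and numbers": (286_652, 172_646),
    "proper names": (66_243, 37_306),
    "OED (rounded figures)": (615_000, 250_000),
}


def variants_per_lemma(variants: int, lemmata: int) -> float:
    """Average number of recorded variants per lemma."""
    return variants / lemmata


for label, (variants, lemmata) in counts.items():
    print(f"{label}: {variants_per_lemma(variants, lemmata):.2f} variants per lemma")
```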

That count can be whittled down further of course.

If we discount all variants noted as hypothetical—which were made up by grammarians, and were not used in the actual language, we come down to 274,650.
If we ignore variation in accentuation, which is mostly a Byzantine hypercorrection, we're down to 265,233.
If we ignore the uncertainty between ει and ι, which bedevilled the Koine, we're down to 260,362.
Ignore double consonants: 253,891.
Ignore the distinction between η and α, as a brute-force levelling of Doric and Attic, and we're down to 247,294.
Ignore the distinction between smooth and rough breathing (which occasionally tripped scribes up): 245,672.
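
To give a feel for what "ignoring" a distinction means in practice, here is a minimal Python sketch that folds variant spellings onto a shared key and counts how many distinct keys are left. It is not the TLG lemmatiser's code: the function names, the keyword switches, and the crude string replacements (for instance treating every ει as ι, and every η as α) are assumptions made for illustration, and a real implementation would need to be far more careful.

```python
import re
import unicodedata

ACCENTS = {"\u0300", "\u0301", "\u0342"}  # combining grave, acute, perispomeni
BREATHINGS = {"\u0313", "\u0314"}         # combining smooth and rough breathing
DOUBLED = re.compile(r"([βγδζθκλμνξπρστφχψ])\1")


def fold(form, *, accents=True, ei_i=True, doubles=True,
         eta_alpha=True, breathings=True):
    """Reduce a spelling to a comparison key by levelling selected distinctions."""
    # Decompose so that accents and breathings become separate code points.
    s = unicodedata.normalize("NFD", form)
    drop = set()
    if accents:
        drop |= ACCENTS
    if breathings:
        drop |= BREATHINGS
    s = "".join(ch for ch in s if ch not in drop)
    if ei_i:
        s = s.replace("ει", "ι")      # ει/ι confusion
    if doubles:
        s = DOUBLED.sub(r"\1", s)     # λλ for λ, and the like
    if eta_alpha:
        s = s.replace("η", "α")       # brute-force Doric/Attic levelling
    return unicodedata.normalize("NFC", s)


def count_distinct(forms, **switches):
    """Number of variants left once the selected distinctions are ignored."""
    return len({fold(form, **switches) for form in forms})


variants = ["ἀγέννατος", "ἀγέννητος", "λειπογνώμων", "λιπογνώμων", "ἀνήρ"]
print(count_distinct(variants))  # all distinctions levelled
print(count_distinct(variants, accents=False, ei_i=False, doubles=False,
                     eta_alpha=False, breathings=False))  # nothing levelled
```

Switching the keyword arguments on one at a time mimics the step-by-step whittling down in the list above, though on real data the order and interaction of the steps would matter more than it does in this toy sample.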

Even with these common causes of variation excluded, that's still some 73,000 added variants (41%) that I have not counted as distinct lemmata. That means that one could argue some of them should be counted as separate—although I have trouble seeing how a consistent criterion could be devised, especially over such a large timespan.

So the lemma count may be an underestimate, because of different judgements on what counts as distinct; but at its most inflated, the lemma count will no more than double. In reality, I think the debatable instances are closer to 20% than to 70%. So no, not even this way are we getting to 5,000,000 lemmata.

2009-07-15

Lerna VIIb: Lemma counts and proportion of text recognised

We can keep dredging lemmata up to move towards a target of 300,000. But of course for a living language, as Modern Greek now is and as Ancient Greek once was, there is no ceiling in lemmata: people can always make up new words, and do. And because dictionaries will never exhaust what words people come up with, even if they work off a limited corpus, the constructive thing to do is not to say how many lemmata are in a language.

The constructive thing rather is to say, if I know n lemmata in a language, how many word instances in a corpus will I understand? If I know n lemmata, how much of the text I'm confronted with can I make sense of? If a vocabulary of 500 words lets you understand just 70% of all the text you'll see, you're in some trouble. If a vocabulary of 50,000 lets you understand 99.7% of a text, on the other hand, that's one word instance out of three hundred that you'll draw a blank on. Assuming 500 words a page, that's around three unknown words every couple of pages. That's still a lot: if you're having to run to the dictionary once a page, you've got catching up to do in the Word-A-Day club. One word every ten pages—say 99.98% of all word instances: that's probably more reasonable.
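
The conversion between a recognition percentage and "pages per unknown word" is worth spelling out. The sketch below is plain arithmetic in Python; the function names are invented here, and the default of 500 words a page is simply the assumption used in the paragraph above.

```python
def instances_per_unknown(coverage_percent: float) -> float:
    """Average number of word instances read for each unrecognised one."""
    return 100 / (100 - coverage_percent)


def pages_per_unknown(coverage_percent: float, words_per_page: int = 500) -> float:
    """Average number of pages read for each unrecognised word instance."""
    return instances_per_unknown(coverage_percent) / words_per_page


for coverage in (70.0, 99.7, 99.98):
    print(f"{coverage}% recognised: one unknown word every "
          f"{instances_per_unknown(coverage):.0f} instances, "
          f"or every {pages_per_unknown(coverage):.2f} pages")
```

At 99.7% this gives roughly one unknown word every 333 instances, which is about three unknown words every two pages; at 99.98% it is one every ten pages.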

That gives you a statistic of how many word instances are recognised; but when you're listing the words you don't know yet, you tend to list unfamiliar word forms, not word instances. So if you come away from your reading of The Superior Person's Little Book of Words with a list of words you need to look up in the dictionary, you won't count contrafibularity three times and floccipaucinihilation seven times. You only have to look up contrafibularity once to understand it, so you'll list it as a single unknown word form.

The proportion of recognised word forms is going to be much lower than the proportion of recognised word instances. The word instances will give you credit for knowing words like and and the and of: hey presto, with those three words, you already understand 20% of all printed words of English! The words you won't know will tend to be one-offs, occurring just once or twice in a text: it's rare words, which people don't come across a lot in texts, that they won't have needed to learn. But with word forms, and and the and of don't count as 20% of all printed words of English: they only count as three word forms. The unfamiliar one-offs will make a much larger dent in the size of your vocabulary, than in the proportion of a page you can grok.
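
The gap between the two measures is easy to see on a toy example. The following Python sketch counts both for a made-up sentence and a made-up known vocabulary; the function name and the sample data are purely illustrative.

```python
from collections import Counter


def coverage(tokens, known):
    """Return (share of word instances recognised, share of word forms recognised)."""
    freq = Counter(tokens)
    instances_known = sum(n for word, n in freq.items() if word in known)
    forms_known = sum(1 for word in freq if word in known)
    return instances_known / sum(freq.values()), forms_known / len(freq)


tokens = ("the cat and the dog and the floccipaucinihilation "
          "of the contrafibularity").split()
known = {"the", "and", "of", "cat", "dog"}

instance_share, form_share = coverage(tokens, known)
print(f"instances recognised: {instance_share:.1%}")
print(f"forms recognised:     {form_share:.1%}")
```

Here 9 of the 11 word instances are recognised, but only 5 of the 7 distinct word forms: the frequent words inflate the first figure and count just once each in the second.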

Again, the point of this is to say, not that there are n words in a language, which is deeply problematic in ways I've gone into great length on. It's to say that, if you know n lemmata, you will understand n₁% of the vocabulary of a corpus (its word forms), and n₂% of all the text in a corpus (its word instances). The value of n can go up or down, and the proportions of words you understand can go up or down with it. This means two things which are more useful to keep in mind than any grand How Many Words statement.

First, it's not about how many words there are ever, so much as how many words are *useful* to know. If there are fifty words which were made up for Joe Blow's autobiography, and Joe Blow's autobiography has never been published or indeed sighted outside his kitchen, then those fifty words will not form part of your corpus, so they need not count. Or, if there are three hundred words of phrenology which no one has used for the past century, and they only got used once in a blue moon, then even if those three hundred words do show up in your corpus, they will be marginal enough to cut out most of the time. Tying vocabulary size to recognition allows you to limit the lexicon to what you will actually use, and how frequently you will use it.

The second realisation is the admission this makes, that the size of a vocabulary is asymptotic. People can keep making up words, or using words in increasingly niche and esoteric contexts. If n words let you understand 99% of the vocabulary, then you may well be able to come up with 5n words to recognise 99.9% of the vocabulary, and 20n to recognise 99.99% of the vocabulary, and even 100n to recognise 99.999% of the vocabulary. But by the time you're up to 99.99% recognised words, you can reasonably ask whether it's worth spending an extra five years building up your vocabulary, just to deal with the remaining 0.01%.

The answer is no. Dictionaries do not wait forever before they decide they're done: they have a large corpus, and dip into it fairly eclectically, but they do miss stuff (not just "sausage" in that Blackadder episode on Johnson's Dictionary). And that's OK, if the word is obscure enough for the dictionary's purposes. Dictionaries employ some subjectivity in leaving out words until they think they're worth taking seriously; but the way a corpus is put together usually filters the obscurities out for you already. If you're relying on printed text to prove a word is worth describing, you're leaving out all the made up and nonce words and speech errors that were never written down. Of course, that's an elitist way of viewing language, and print is nowhere near the barrier it used to be. But it does cut your corpus down to something manageable.

If you're working on a Classical language, the cruelty of Time (and Bastard Crusader scum), the indifference of scribes, and the snootiness of schoolmasters do plenty of filtering for you as well. That's why the PHI #7 corpus, which was not subject to the same filters as the literary corpus, has so much distinct vocabulary.

(For an example of why the bones of the Bastard Fourth-Crusader scum should boil in pitch in eternity, see the fate of the text of Ctesias and the other manuscripts unfortunate enough to be in Constantinople in 1204.)

So what sort of recognitions do the figures I've been quoting represent? I'm using the five corpora as before, and I'm also differentiating between all word forms, and just lowercase word forms—because proper name recognition lags behind recognition of common words in general. Lowercase word forms is a somewhat crude metric for leaving out proper names, and there are a few TLG editions which follow the e.e.cummings stylings of their manuscripts, leaving names lowercase. But it's all about the indicative figures, always.

	% Instances Recognised	Recognised Instances Ratio	% Lowercase Instances Recognised	Recognised Lowercase Instances Ratio
TLG + PHI #7	99.66%	1:294	99.86%	1:740
TLG	99.84%	1:624	99.915%	1:1170
LSJ	99.37%	1:158	99.83%	1:585
Mostly Pagan	99.964%	1:2759	99.979%	1:4750
Strictly Classical	99.967%	1:2993	99.975%	1:4019

	% Forms Recognised	Recognised Forms Ratio	% Lowercase Forms Recognised	Recognised Lowercase Forms Ratio
TLG + PHI #7	89.56%	1:9.6	94.33%	1:17.6
TLG	93.91%	1:16.4	95.59%	1:22.7
LSJ	89.51%	1:9.5	95.77%	1:23.6
Mostly Pagan	99.16%	1:118	99.42%	1:172
Strictly Classical	99.56%	1:226	99.62%	1:263

Let's go through this slowly.

The lemmatiser understands the Strictly Classical corpus (literary Greek up to iv BC) quite well. It only fails to pick up 1 in every 226 distinct word forms, which means you have to go through on average 2993 word instances (say six pages of text) before you hit a word it does not understand. But you can ignore capitalised words, because they're typically proper names, and we don't expect to have those in our vocabulary anyway. You can make sense of "Alcidamophron slaughtered the servant of Tlesipator" more readily than you can "Jack fnocilurphed the smorchnepot of Jill". If we do ignore capitalised words, the lemmatiser fails to understand just 1 in 263 word forms, and gets through over eight pages of text on average before it finds a problem word. As machine understanding of morphology goes, that's not bad at all.

So the 55,000 lemmata that the lemmatiser knows of for the Strictly Classical corpus get you through eight pages of Greek on average as smooth sailing. And that is the real meaning of "55,000" lemmata here. Of course, that's an eight page average across a corpus that is still not terribly homogeneous; and some bits of the corpus are going to be understood a lot better than others. The lemmatiser understands all 199,000 word instances in Homer, for instance: 400 pages by our reckoning, not just 8. On the other hand, the Strictly Classical corpus also includes Aeschylus, whose transmission has been corrupted frequently, and where the lemmatiser falls over 63 word instances of 74,000—once every couple of pages.

With the Mostly Pagan corpus, which sticks to literary texts up to IV AD, the lemmatiser understands the corpus almost as well: 76,000 lemmata give you all but 1 in 172 word forms, and in fact because the later texts are slightly more homogeneous linguistically, almost 10 pages on average of text before there is a problem word. So 76,000 lemmata for Mostly Pagan is about as meaningful a claim as 55,000 lemmata for Strictly Classical: it lets you understand almost the same proportion of text in the corpus. There's bound to be more lemmata than that in the corpus, that the dictionaries have not officially recorded; but it's not going to be overwhelmingly more. I'd guessed maybe 500 lemmata underestimated for the Strictly Classical corpus, with 1,500 unrecognised word forms. The Mostly Pagan corpus has 5,000 unrecognised word forms, so I'll guess maybe 2,000 underestimated lemmata.

The LSJ corpus is much less well understood, partly because it includes technical writing, but mostly because it includes the more unruly texts from the inscriptions and papyri, with their distinct vocabulary and grammars, and confusing spellings. We claimed 124,000 lemmata here, but that only gets you one word form unrecognised per 23; including potential proper names, it's as bad as one word form unrecognised in ten. And you'll be stumbling over one word per page and a bit. Our unrecognised word forms are now up to 35,000 lowercase forms. That does not necessarily mean 10,000 more lemmata unaccounted for, given the problems in spelling and grammar; so I'm reluctant to guess how many more lemmata you need to get to the same level of recognition as with the Strictly Classical corpus. But there are clearly more lemmata to go.

You can see the trouble the papyri and inscriptions bring more clearly in the last two counts, which include and exclude them. Without them, the TLG corpus has one lowercase word form in 23 unrecognised, and a little over a word per two pages unrecognised. That's not that bad for the claimed 162,000 lemmata, given the bewildering diversity of texts in the corpus. Let the inscriptions and papyri back in, and you now miss a word form for every 18, and a word every one and a half pages. And that's for increasing the size of the corpus by just a twentieth.

So the lemma counts are more and less reliable for different periods of Greek: we can tell how much text they allow you to recognise in different corpora, and we can allow that there are cut-offs for how many lemmata it is useful to know in a corpus. The lemma count is still not open-ended, so long as the corpus is finite. (That's the thing about langue instead of parole: the corpus size of *potential* text, using language as a theoretical system, is infinite.) And the word form coverage of the lemmatiser will keep improving, as an ongoing project; as I'd already mentioned before, TLG word form recognition has gone up from 90% to 94% in the past two years. But the lemma count does peter off.

So let me give one last batch of numbers to illustrate the relativity of lemma counts: how much less of a corpus do we understand, if we cut down on the number of lemmata. I'll do that using the word instances per lemma count for the TLG. Because there is a fair bit of ambiguity in Greek morphology, many word forms are ambiguous between two lemmata (and a few between more than two); so there is some double counting of instances to be had. As a result, the 202,000 lemmata recognised in the TLG corpus—proper names and not—account for 112 million word instances, though the corpus really contains only 95 million.

So if we take 112 million as our baseline, how many instances are accounted for by admitting fewer lemmata?

Lemmata	Word Instances	% of Word Instances
100	61,166,253	54.44%
500	78,932,451	70.25%
1,000	86,575,286	77.06%
2,000	94,016,671	83.68%
5,000	102,370,243	91.11%
10,000	106,926,324	95.17%
20,000	109,884,248	97.80%
50,000	111,727,544	99.44%
100,000	112,191,095	99.85%
120,000	112,251,181	99.91%
150,000	112,302,895	99.95%
180,000	112,332,895	99.98%
190,000	112,342,895	99.98%
202,000	112,354,703	100%

There's your Zipf's Law in action. The table neatly parallels what Wikipedia says for vocabulary size, quoting a 1989 paper presumably on English: "We need to understand about 95% of a text in order to gain close to full understanding and it looks like one needs to know more than 10,000 words for that."
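
A table like the one above falls out of a ranked frequency list: sort the lemmata by how often they occur, keep a running total, and divide by the grand total. The Python sketch below shows the idea on invented toy frequencies (the TLG's per-lemma counts are not reproduced here), with hypothetical function names; the one line using real figures only re-derives a percentage already quoted in the table.

```python
from itertools import accumulate


def cumulative_share(frequencies):
    """Share of all instances covered by the top 1, 2, 3, ... lemmata."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]


def lemmata_needed(frequencies, target):
    """Smallest number of top-ranked lemmata whose instances reach the target share."""
    for rank, share in enumerate(cumulative_share(frequencies), start=1):
        if share >= target:
            return rank
    return None


# Toy per-lemma instance counts, invented for illustration only.
toy = [50, 25, 10, 5, 4, 3, 1, 1, 1]
print(lemmata_needed(toy, 0.95))

# Re-deriving one row of the table above from its own figures.
top_10000_share = 106_926_324 / 112_354_703
print(f"{top_10000_share:.2%}")
```

In the toy list, the three lemmata occurring once each make up a third of the vocabulary but only 3% of the instances, which is the long tail in miniature.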

The difference between 100,000 and 200,000 lemmata accounts for just 163,000 word instances out of 112 million, around one word in 700. The difference between 180,000 and 200,000 accounts for less than a word every ten pages. So there's a very very long tail of increasingly rare words: the last 60,000 lemmata each occur just once in the 95 million word corpus, and the last 25,000 lemmata before that occur just twice. There's a *lot* of these one-offs, which is why all together they account for an unknown word every four pages. And we need dictionaries for words we don't come across every day, not words we do.

Still, they are one-offs (hapaxes). They're not useless: they were clearly useful to whoever used them that one time in the 2,500 year span of the corpus. But no one needs all 60,000 of them at once. And by the time you're down to lemmata that happen just once or twice in a roomful of books (a small room admittedly), you can appreciate why real human beings walk around with close to 20,000 lemmata in their skulls, and not 200,000. For the rest, we have guessing from context (and related words); and we have dictionaries. And once Classical Greek became a bookish language, the Byzantines used dictionaries too.

2009-07-12

Lerna VIIa: Classical and Late vocabulary

Here, I'll try making some sense of how the vocabularies of Greek have shifted between the corpora.

This is where we got to.

		Lemmata	Excluding Proper Names
TLG + PHI #7	(viii-XVI, +tech +christ +inscr/pap)	214,381	172,646
TLG	(viii–XVI, +tech +christ -inscr/pap)	201,823	162,009
LSJ Corpus	(viii-VI, +tech -christ +inscr/pap)	159,636	124,215
Mostly Pagan	(viii–IV, -tech -christ -inscr/pap)	99,485	76,067
Strictly Ancient	(viii–iv, +tech -christ +inscr/pap)	66,390	54,898

The corpora have varying mixes of including "technical" texts, Christian texts, and inscriptions and papyri. In case it wasn't obvious, "Christian texts" means texts about the Christian religion, which have a distinct editorial and linguistic tradition deviant from the Classics. We're not banning authors for their creed, but for what corpora their texts fit into. The Mostly Pagan corpus chooses to end with Synesius rather than Nonnus, but both of them started as pagans and ended as Christian bishops.

I'm going to try and work on how the vocabularies differ in time between the corpora. Two postings ago, we cut down on post-classical and suspect-looking analyses, by restricting our word form counts to forms of good standing and pedigree. This permitted us to describe a more homogeneous corpus. We can put the same restrictions on our lemma counts.

	Lemmata	Excluding Proper Names
TLG + PHI #7	204,393	167,640
TLG (viii–XVI)	192,342	157,302
LSJ Corpus (viii-VI)	151,962	120,018
Mostly Pagan (viii–IV)	97,906	75,845
Strictly Ancient (viii–iv)	65,842	54,743

Forms of Good Standing (no numbers, hypothetical, hypercorrect, unattested tenses, uncertain inflection, anomalous inflection, transliterated Latin)

Nary a dent on the Strictly Classical corpus or even the Mostly Pagan corpus, which contain literature. But we've got rid of words made up by grammarians as etymologies; e.g. ἀόλλησις "thronging" as an etymology of ἀλλᾶς "sausage". (The jokes just write themselves, don't they.) We also got rid of Latin terminology, which didn't always make it to the dictionaries when it was undigested Latin; e.g. νερεδιτάς hereditas "inheritance". What with that and Milesian numbers, we can take 10,000 lemmata out for the overall corpus.

hereditas ends up as /nereðitas/? Yes, Byzantine lawmen liked transliterating Latin /h/ as <n>; I'm not clear on why.

Now comes the hatchet. How many lemmata can be called Ancient grammatically, even if they only show up in later texts? That sounds nonsensical, right? Surely if a lemma is first attested post-Classically, it counts as post-Classical. Well, it is, but I'll crank the handle anyway.

I'm leaving out proper names now, because the lemmatiser has only occasionally assigned them a period.
OTOH, lemmata do by default get called Late in the lemmatiser's database if they are unique to Lampe or Trapp, and specifically Demotic if they're unique to Kriaras. Finding I'd missed a class of lemmata for tagging was what made me start revisiting all the counts.
This also means that by default, if the lemma is in LSJ, it's counted as classical.
LSJ stops nominally at VI AD (and on occasion with scholia, a lot later); so it decidedly includes Koine—but lemmata have not been consistently periodised in the lemmatiser as Koine as distinct from Classical. So the counts of lemmata tagged as classical are inflated enough not to be useful.
I'm cranking the handle anyway.
Beyond that, though, any verb derived from a Classical verb (by prefixing a preposition) still counts as linguistically Classical, because that process was fully productive from the beginning. So ἀντιδιαλοιδορέομαι "to be mocked thoroughly in response" is attested only in Trapp; but all of ἀντί, διά and λοιδορέω are Classical, and the combination was licensed in antiquity, so the compound of all three is counted as Classical.
The same goes for lemmata formed through derivational morphology—unless the word does show up in a later dictionary. So ἀβελτίωτος "unimproved" could have been formed at any time of Greek, from ἀ-, βελτιόω, and -τος. But because it is explicitly attested in Trapp, it is counted as a new Byzantine word.
The discrepancy between how I handle prefixing (always counts as Ancient) and suffixing (counts as a new word in later dictionaries) is, as you may have guessed, an artifice of how the lemmatiser has been implemented.
OTOH, some later text has crept into the nominally Ancient corpus, notably in Testimonia (later descriptions of authors in literature), and more so in "technical" texts, which were often written in Koine. In fact, LSJ has plenty of Koine in it; and as we'll see, a lot more Koine in technical and daily-life texts than in literary texts, something which should surprise precisely no one.
And yet... I'm still cranking the handle.
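
The tagging rules in the list above can be sketched roughly as follows. This is only an illustration of the decision order as described, not the lemmatiser's actual code: the `Lemma` fields, the function name `classify_period`, and the treatment of a lemma listed in both Trapp and Kriaras are all assumptions made for the example.

```python
from dataclasses import dataclass, field

LATE_DICTIONARIES = {"Lampe", "Trapp"}   # unique to these: Late by default
DEMOTIC_DICTIONARIES = {"Kriaras"}       # unique to this: Demotic by default

@dataclass
class Lemma:
    form: str
    dictionaries: set = field(default_factory=set)  # dictionaries listing the lemma
    prefixed_from_classical_verb: bool = False      # e.g. ἀντί + διά + λοιδορέω

def classify_period(lemma: Lemma) -> str:
    # Anything in LSJ counts as Classical by default (Koine included).
    if "LSJ" in lemma.dictionaries:
        return "Classical"
    # Preposition + Classical verb always counts as Classical,
    # even when the compound is attested only in a later dictionary.
    if lemma.prefixed_from_classical_verb:
        return "Classical"
    if lemma.dictionaries and lemma.dictionaries <= DEMOTIC_DICTIONARIES:
        return "Demotic"
    # Assumption: a lemma shared by Lampe/Trapp and Kriaras is called Late.
    if lemma.dictionaries and lemma.dictionaries <= LATE_DICTIONARIES | DEMOTIC_DICTIONARIES:
        return "Late"
    # Derivations found in no later dictionary fall through as Classical.
    return "Classical"
```

With these rules, ἀντιδιαλοιδορέομαι (Trapp only, but prefixed) comes out Classical, while ἀβελτίωτος (Trapp only, suffixed) comes out Late, matching the two examples in the list.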

So if I crank the handle, and exclude analyses that the lemmatiser thinks, for better or worse, are post-Classical, what do I get?

	Lemmata
TLG + PHI #7	132,098
TLG (viii–XVI)	122,579
LSJ Corpus (viii-VI)	110,417
Mostly Pagan (viii–IV)	73,260
Strictly Ancient (viii–iv)	54,176

Forms of Good Standing and Classical Pedigree

Let's try and make sense of this. The two corpora we'll compare are the LSJ corpus, which goes up to VI AD and excludes Christian writings; and the complete TLG + PHI #7 corpus. As always, excluding proper names:

	All lemmata	Classical lemmata only
TLG + PHI #7	167,640	132,098	Difference: 33,000 Middle + 2,000 Modern lemmata
LSJ Corpus	120,018	110,417	Difference: 10,000 Middle lemmata

There are 47,000 lemmata that turn up only after VI AD, after the LSJ corpus.
There are another 10,000 lemmata in the LSJ corpus (8%) that are marked as late ("Middle"). Given that the LSJ corpus does include Koine texts, that whether a lemma got marked as Koine or not is a little haphazard, and that the technical Koine texts in the LSJ are linguistically messy, that's not that surprising.
By contrast, the Mostly Pagan corpus, which skips the papyri and technical texts, has just 2,500 middle lemmata (3%). Literary texts avoid linguistically innovative lemmata. Technical texts account for another 2,600 middle lemmata; the remaining 5,000 are from papyri.
Of the 47,000, almost half—22,000—are linguistically still Classical. Some of these are late lemmata that just happen to have made it to LSJ. (A lot of those are legal Latinisms.) Some of those are derived lemmata.

Restating: a quarter of all lemmata in our corpus turned up only after VI AD; but half of those new lemmata don't look new to the lemmatiser at all: they look classical. Because of productivity of lemmata vs. accidents of tagging, somewhat less than a quarter of all lemmata in our corpus appear to be post-Classical to the lemmatiser.
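
For anyone who wants to check the rounding, the figures in this comparison can be recomputed from the two tables above. The variable names are mine; the counts are copied from the tables (proper names excluded).

```python
# Lemma counts excluding proper names, copied from the tables above.
all_lemmata = {"TLG + PHI #7": 167_640, "LSJ Corpus": 120_018}
classical = {"TLG + PHI #7": 132_098, "LSJ Corpus": 110_417}

# Lemmata that turn up only after the LSJ corpus (after VI AD).
new_after_lsj = all_lemmata["TLG + PHI #7"] - all_lemmata["LSJ Corpus"]
# Of those, the ones the lemmatiser still tags as Classical.
still_classical = classical["TLG + PHI #7"] - classical["LSJ Corpus"]
# Everything in the full corpus tagged as post-Classical.
post_classical = all_lemmata["TLG + PHI #7"] - classical["TLG + PHI #7"]

share_new = new_after_lsj / all_lemmata["TLG + PHI #7"]
share_post_classical = post_classical / all_lemmata["TLG + PHI #7"]

print(new_after_lsj, still_classical, post_classical)
print(f"{share_new:.0%} new after VI AD, {share_post_classical:.0%} tagged post-Classical")
```

The unrounded values are 47,622 new lemmata, 21,681 of them still Classical, and 35,542 post-Classical lemmata overall, which is where "a quarter" and "somewhat less than a quarter" come from.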

Let's look at these new lemmata more closely, by looking at the most frequent lemmata in each category. The distinction between "linguistically Classical" and "linguistically Middle" does not turn out to matter much, because it's an accident of what has been included or excluded from LSJ. OTOH the distinction with "linguistically Modern" (i.e. Early Modern Greek) is quite revealing. Be warned too that the frequency of lemmata is all about the types of text included in the corpus.

And because a little bit of vernacular has leaked into Photius' lexicon, which was after all compiled pretty late, I'm excluding the lexica from the LSJ counts again. This pushes 15,000 lemmata back into the mediaeval period—140 of them linguistically Modern, 1500 of them linguistically Middle—and the rest of them linguistically Ancient. They're dictionaries; of course they have lots of one-off words.

So what are the most frequent lemmata new to mediaeval Greek?

Linguistically Ancient (34,641)

ἐκπόρευσις (85) "proceeding forth"
στιχολογία (735) "recitation"
περιβόλιον (561) "garden"
ἐναντιοφανής (533) "apparently contradictory"
πανσέβαστος (453) "most august"
λατινικός (388) "Latin"
πακτεύω (375) "make a pact"
κορμίον (365) "trunk of body"
χρυσορρήμων (365) "golden-speaking"
παραταγή (365) "order for payment"

Linguistically Middle (25,111)

θεοτοκίον (1425) "hymn to Virgin Mary"
μετόχιον (1370) "monastic property"
κονδικτίκιος (928) "relating to repossession of property"
κανείς (671) "no one"
ὀκτώηχος (863) "hymnal with all eight modes"
πρωτοσπαθάριος (696) "chief of imperial bodyguard"
ἱεραρχία (638) "hierarchy"
τώρα (622) "now"
γυρίζω (582) "I return"
μισέρ (556) "monsieur"

Linguistically Modern (2,526)

ἔτζι (749) "so"
ἠμπορέω (424) "I can"
ἀμή (290) "but"
τέτοιος (261) "such"
κάθε (202) "each"
ἀντάμα (187) "together"
ἀπαυτοῦ (125) "thence"
κάποιος (114) "someone"
βουλέω (113) "I want"
ὁλόρθος (111) "upright"

(A couple of texts nominally in LSJ's time period still contain νά, which I treat as diagnostic of Modern Greek; but again, these lists are only meant to be indicative.)

The ancient-looking new words deal with theology, logic, law, or the public sphere: disciplines which kept innovating their own specialist vocabulary. The middle-looking words largely deal with the church: the theotokion count is inflated relative to its counterparts because the word is used to signal a section for a lot of hymns in the corpus. There are a couple of novel grammatical words in middle Greek ("no one, now"). But for Modern Greek all the most frequent words are grammatical. And what that foreshadows is that Modern Greek has a grammatical system distinct from that of Ancient Greek, while Middle Greek is a lot closer to the Ancient grammatical system. That's no surprise, given that much of our Middle Greek corpus is Atticist to begin with.

Finally, for what it's worth, this is how many lemmata the lemmatiser thinks are linguistically Attic:

	Lemmata
TLG + PHI #7	127,169
TLG (viii–XVI)	118,697
LSJ Corpus (viii-VI)	105,960
Mostly Pagan (viii–IV)	70,726
Strictly Ancient (viii–iv)	51,666

Forms of Good Standing and Attic Pedigree

That's not worth that much, it must be said, since Attic is taken as the default dialect in the lemmatiser. Though excluding dialect word forms eliminates a substantial number of forms from the corpus, the lemma count itself is not affected much: an Attic-compatible word form usually turns up someplace.

2009-07-10

Lerna VId: A correction of lemma counts

Last post had its share of egg on my face, showing systematic overcounts of word forms in the corpora. This post is another healthy serving of omelette, correcting the lemma counts given in Lerna VIa. The overall story is:

There are fewer distinct word forms in the PHI #7 corpus than I thought
There are fewer scribal alternate forms left in PHI #7: if an editor thought they knew better than the scribe, the scribe's form is left out of consideration
There is less dialectal and orthographic wiggle-room allowed to PHI #7
So as a result of all this, the count of lemmata distinctive to PHI #7 has crashed: ignoring proper names, 3,800 lemmata that the lemmatiser thought it saw in PHI #7 are no longer there.
The count has still crashed, even though I've added a fair few lemmata to deal with PHI #7—the most frequent names, the overlaps with Trapp's dictionary, a few stragglers from DGE—as well as some dialectal grammar and some more respelling rules. I've picked up around 800 non-names and 1200 proper names; so I'm down by 1800 lemmata from before, rather than 3800.
I could have kept going to add more names than that, but it's been two weeks already, for gorsakes.
OTOH, because I've added extra names in particular, recognition of the TLG has slightly improved. So there are a few more lemmata for just the TLG-based corpora. (*Very* few.)
I also did some debugging of orthographic variation in lemmata, which resulted in some conflation of variants.
So if you ignore proper names, the TLG lemma count... actually ended up losing a few lemmata. (Again, *very* few: a couple of hundred lemmata each way.)

So.

(old → corrected)	Lemmata	Excluding Greek Numerals	Excluding Proper Names
TLG + PHI #7	216,234 → 214,381	211,794 → 209,952	175,791 → 172,646
TLG (viii–XVI)	201,680 → 201,823	197,448 → 197,591	162,219 → 162,009
LSJ (viii-VI)	159,636	156,720	124,215
Mostly Pagan (viii–IV)	99,426 → 99,485	98,593 → 98,652	76,145 → 76,067
Strictly Ancient (viii–iv)	66,437 → 66,390	66,078 → 66,031	55,003 → 54,898

I also had a tally including also-ran analyses:

(old → corrected)	Lemmata
TLG + PHI #7	220,560 → 218,727
TLG (viii–XVI)	206,161 → 206,470
LSJ (viii-VI)	166,387
Mostly Pagan (viii–IV)	107,257 → 107,512
Strictly Ancient (viii–iv)	73,427 → 73,532

In all of this, I've not been paying the PHI #7 corpus that much attention, though I did make a point of slipping it into the LSJ corpus. (The LSJ coverage of inscriptions and papyri is in fact why I called up PHI #7 in the first place.) I knew there would be extra lemmata there, and this lemma count is the PHI #7 disc's chance to shine. PHI #7 has added 6.5% more word instances to the TLG's, but 16% more word forms, and 6% more lemmata! That's phenomenal!

... What on Earth am I talking about? Remember Zipf's Law: the cumulative number of word forms that turn up is inversely proportional to the instance count for each word form. It's a Long Tail. If you add 6% more word instances, by the time you're already at 95 million instances, you should be getting... well, I can't do the maths, but you should be getting at most hundreds of new lemmata, not (as the table above shows) 12,000, of which only a couple of thousand are proper names. The 10,000 more lemmata of ordinary vocabulary show you that the inscriptions and papyri (the Greek of daily life and of far-flung dialects) have a very different vocabulary from the Greek of literature.

Of course, that you get 16% more word forms in PHI #7 means there are a lot of different inflections in the corpus that lie outside the TLG's ambit, because of all the non-literary dialects represented in the inscriptions. It also means a lot of misspellings that didn't belong in the TLG.
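
The three percentages can be reproduced from the corrected counts reported in this series (word instances and raw word forms from the Lerna VIc correction, lemmata from the table above). A small sketch, with dictionary keys of my own choosing:

```python
# Corrected counts for the TLG alone and for the TLG plus PHI #7.
tlg = {"instances": 95_475_128, "forms": 1_567_892, "lemmata": 201_823}
tlg_phi7 = {"instances": 101_684_658, "forms": 1_815_540, "lemmata": 214_381}

# Percentage growth contributed by adding PHI #7 to the TLG.
growth = {key: (tlg_phi7[key] / tlg[key] - 1) * 100 for key in tlg}

# How many of the extra lemmata are ordinary vocabulary (not proper names).
extra_lemmata = tlg_phi7["lemmata"] - tlg["lemmata"]
extra_non_names = 172_646 - 162_009

for key, pct in growth.items():
    print(f"{key}: +{pct:.1f}%")
print(extra_lemmata, extra_non_names)
```

That gives roughly 6.5%, 15.8% and 6.2%, with 12,558 extra lemmata of which 10,637 are not proper names.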

In VIa, I went into an extended riff extrapolating how many more lemmata of Greek could turn up. Let me attempt that again, this time with more detail on proper names—but *not* including proper names in the final estimate.

The reason proper names don't belong in a final tally is worth restating, because not enough people are laughing at the notion. When we want to know how many words of English there are (which we shouldn't, but I've already been through that), we don't add the New York State White Pages to the Oxford English Dictionary, and we don't start screen-scraping geonames.org. We recognise that proper names are a different kind of thing from normal words (although the boundaries are fuzzy); and we also recognise that it's problematic to say a name belongs to one language and not another.

Does Κόρινθος count as a Greek name, even though it has the prehellenic telltale -νθ-? Well sure it does. Does Ομπάμα count as a Greek name? Or Σαίξπηρ for Shakespeare? Surely not. But what about the older declinable transliteration Σακεσπήριος? Doesn't that at least look Greek? What about Αὐρήλιος? But then again, what about Ἰσαάκ? Is Αμπντουλάχ not a Greek name? But does it become a Greek name when it was hellenised, as the Byzantines did, as Ἀβδελλᾶς? And is counting these names as part of the vocabulary of Greek a meaningful thing to do?

Well, better not to count proper names in the final tally at all; but let me add the counts I do know of, just in case someone is curious.

Right now, the TLG lemmatiser knows about almost 42,000 proper names. That includes most names of the Strictly Classical canon; a fair few names from later literature (including lots of Byzantine surnames); the names in Smith's Dictionary of Greek and Roman Geography; and the thousand-odd names I was shovelling in over the past fortnight, to deal with the inscriptions and papyri.
Pape-Benseler went into its second edition in 1863, which increased it by a third. It covers geographical, personal, and mythological names in Ancient literature, and has some coverage of later stages. It has good coverage of such inscriptions as were known at the time, and is starting to notice papyri—though remember, this is thirty years before the discovery of Oxyrhynchus. And the dictionary is reasonably good about conflating variants.

Benseler does not say how many names he has in total, but he does say that Alpha under his revision went from 3820 names to 6120. Extrapolating based on LSJ, that should mean 38,000 names overall. There are clearly lemmata in Pape-Benseler that aren't in the TLG lemmatiser: I added 500 names because of dealing with PHI #7, and that was only dealing with names occurring 10 times or more in PHI #7. How much more am I missing? No idea. But I'd be surprised if it was more than 10,000.
In the following, I need a sense of how many of these names are personal, and how many are geographical. The Heidelberg word lists for papyri are a bit more reluctant to conflate variants than I prefer, but at least they list personal and geographical names separately: 8838 personal, 2637 geographical. Good enough for me: I'll say personal names to place names are 4:1.
1863 is a long time ago in epigraphy, and the Lexicon of Greek Proper Names has been running for the past three decades to record the torrent of names found on inscriptions. It avoids mythological names (which are covered well enough in literature and Pape-Benseler), and it also does not do geographical names. It's ongoing, but its online search knows of 35,000 distinct names of people (whereas Pape-Benseler has 38,000 names of people, places, and gods). Now, the TLG lemmatiser recognises 17,600 distinct names, personal and geographical, in the ancient inscriptions on the PHI #7 disc. Guessing that 14,000 of those are personal names (4:1 ratio), that means it's missing at least 21,000 personal names.
The Leuven projects recognise 16,000 personal names in the papyri (with 7,000 extra variants), using the Duke Documentary Papyri corpus. The TLG lemmatiser recognises 9,600 distinct names, personal and geographical, in the same corpus on PHI #7. Guessing that 7,700 of those names are personal, it's missing at least another 8,000 names.
Some of the Leuven names will overlap with LGPN; but the Egyptian names won't. Let's say that all up, we're owed at least another 27,000 personal names. And using that 4:1 ratio again, another 5,000 place names. Heidelberg counts 9,000 personal names to Leuven's 16,000, and Heidelberg counts 2,600 geographical names; extrapolating up, that's consistent with 5,000.
That's not even scratching the surface of Byzantine and Modern names (let alone Σαίξπηρ or Ομπάμα, or the Thessalonica and Environs phone book). But so far, we can guess 42+27+5=74,000 names.
Flipping things around, there are 72,000 unrecognised capitalised words in PHI #7. That does not mean 72,000 missing names: lots of these will be misspellings of known names that the lemmatiser isn't dealing with, or different inflections of the same name. And those names are in the scope of LGPN and Leuven. I'd say the personal names are already accounted for in the 27,000 (say) personal names of the two initiatives.
There are a further 42,000 unrecognised capitalised words in TLG. Most of these won't be in LGPN and Leuven, though some will be in Pape-Benseler. Most of these by far are from post-Classical texts, and they include ancient gazetteers. (Ptolemy's Geography alone accounts for close to 3,000 unrecognised names.) How many of these are legitimate novel proper names? Again, no idea, but by this stage we're getting into one-offs, because all proper name word forms occurring more than 7 times in the TLG have been added to the database. I'll guess 30,000. There'll be some overlap with Leuven and LGPN, but not a lot, because many of these names are Byzantine.
As mentioned, the TLG is maybe 70%, maybe 75% complete for Byzantine literature, and only starting to go into Early Modern literature. It does have a lot of Byzantine surnames through church deeds (which account for 5,000 unrecognised capitalised words); so it'll have a reasonable cross-section. I haven't gone through the Byzantine prosopographies though (285-641, 642-1265, Palaeologan), to work out how many surnames they've unearthed in sum.
And I have not spent quality time with the Attica or Thessalonica or Nicosia phonebooks.
So at least 70,000 proper names to go, adding up to something like 110,000 proper names, and that count only goes up to the Fall of Constantinople.
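
The back-of-envelope sums in the list above can be written out as follows. The 4:1 split is the guess made above, and the rounded totals (27,000 and 5,000) are the guesses stated in the list, not values the code derives; the variable names are mine.

```python
# Counts quoted in the list above.
lgpn_personal = 35_000           # LGPN online search, personal names
phi7_inscription_names = 17_600  # recognised in PHI #7 inscriptions (all names)
leuven_personal = 16_000         # Leuven, personal names in papyri
phi7_papyri_names = 9_600        # recognised in PHI #7 papyri (all names)

def personal_share(names: int) -> int:
    # 4:1 personal to place names means four fifths are personal.
    return names * 4 // 5

missing_inscriptions = lgpn_personal - personal_share(phi7_inscription_names)
missing_papyri = leuven_personal - personal_share(phi7_papyri_names)

# Rounded guesses from the list, allowing for LGPN/Leuven overlap.
known, owed_personal, owed_places = 42_000, 27_000, 5_000
subtotal = known + owed_personal + owed_places

print(missing_inscriptions, missing_papyri, subtotal)
```

Unrounded, the shortfalls are 20,920 and 8,320 personal names, which the list rounds to 21,000 and 8,000; the subtotal is the 74,000 quoted before the TLG's unrecognised names are added in.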

Anyone who wants to start boasting of the 110,000 proper names of Two And A Half Thousand Years of Greek needs to be smacked upside the head with all three volumes of the Dictionary of American Family Names, and have the printout of all 8,000,000 places on geonames.org dropped on their foot. Because all of those count as proper names of One Year of English, by the same criterion.

(The Blogger Writing These Lines enjoyed contributing to the Dictionary of American Family Names, even before he realised its value as a tool of percussive persuasion.)

So. Banishing proper names, we're left with 173,000 lemmata, as guesstimated. How much is left to go again? As it turns out, I'm doing the same guesstimates as before—but they make more sense without including proper names:

I keep my guesstimate of 20,000 lemmata more from Trapp (including texts not yet added to the TLG and volumes not yet published), and 10,000 lemmata more from Kriaras (ditto). That's 203,000.
There are words in LSJ that are not represented in this corpus. The biggest gap is the mediaeval Latin-Greek glossaries, with 1,000 missing lemmata; but there are several other oddities. The latest I've encountered, under ἐλεφαντουργική "of or pertaining to ivory-working": the 1161 AD commentary to the astrologer Paul of Alexandria, writing in 378—and last published in 1588. (The irony here is, the same adjective turns up in the rather more mainstream Heliodorus, a century beforehand.) But again, once the PHI #7 texts are in, and with the changes in text editions between the original LSJ and the TLG—not to mention the rejected scribal forms—I don't think there's more than 3,000 lemmata to add. That takes us to 206,000.
I'm inclined to revise my extrapolation for DGE downwards. Volume I updated may have 3500 lemmata not in LSJ, but it's competing not only with Bauer, Lampe, and Trapp, but also with the LSJ Supplement—which on its own adds 10,000 lemmata to LSJ, and which also has made a point of covering more inscriptions and papyri. I haven't taken the time to do any counting with DGE. It's a long plane trip tomorrow to Montreal—so maybe I will.

But there's no way Volume I has 3,500 lemmata not also in LSJ/Bauer/Lampe/Trapp/LSJSupp. DGE looks like taking 20 volumes if and when it finishes. (I wasn't planning on living until 2100 AD to find out.) If there's just 500 novel lemmata in Volume I, that means 10,000 novel lemmata all up; if 1000, then 20,000, as I proposed last time. I'm feeling jaundiced, but I'll still give them 20,000. That takes us to 226,000 lemmata, up to the fall of Candia.
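
As a sanity check on the running total, here are the guesstimates above added up in order. The labels are shorthand of mine; every figure is a guess from the text, so the precision is illusory.

```python
running = 173_000  # lemmata excluding proper names, rounded
steps = [
    ("Trapp", 20_000),
    ("Kriaras", 10_000),
    ("LSJ words missing from the corpus", 3_000),
    ("DGE", 20_000),
]

totals = []
for source, extra in steps:
    running += extra
    totals.append((source, running))

for source, total in totals:
    print(f"after {source}: {total:,}")
```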

Ούφ. On those figures, English still wins, :-) though not by much. The level of precision I've given is of course illusory, and in a following post I will tackle what is a more sensible question: how much vocabulary do you need to recognise n% of a text. But these counts should at least be indicative.

2009-07-07

Lerna VIc: A correction of word form counts

This post fixes counts given in Lerna Va and Lerna Vb, with corrected counts from the PHI #7 disc—and a couple of weeks' work on the archaic dialects and proper names of the PHI #7 corpus. I've also fixed several errors in how I was counting forms as unique. The end result is that the previous counts were inflated all up by 15%.

This post is boring—a bunch of numbers—but necessary for the record. Because the counts are dependent on how the lemmatiser recognises words, and the lemmatiser is not static (and neither is the corpus), these counts are not definitive; but they are more correct than the last reports. The major bug fix (as far as I can tell!) was that I'd forgotten to factor out case and accentuation for not only the raw word forms, but also their normalised counterparts; so all the normalised word counts were off by some 10%. But because the main conclusions were comparing vocabulary sizes relative to each other, they still hold.

That's why testing is a good thing, right?

I've added a new corpus into the mix: the LSJ Corpus is meant to approximate the coverage of LSJ. It excludes any Christian-related writing, apart from the Scriptures themselves. Otherwise (and that's a big Otherwise), it includes all pagan authors up to VI AD, including technical authors. It also includes the ancient inscriptions and papyri from PHI #7. The LSJ corpus additionally includes the lexica of Hesychius, Photius, and the Etymologicum Magnum, which were written later, but (some of the time) reach back earlier. It still leaves out the scholia on Classical literature, which explain Ancient texts with Byzantine words.

It also leaves out two Demotic texts which have ended up in collections of Ancient authors in the TLG, one under Pseudo-Hippocrates, one under the Hippiatrica. I'm taking those out of the Mostly Pagan and Strictly Ancient corpora too. The fact that a clearly XVI AD text has been lumped in with a v BC corpus should give you pause: use the author dates on the TLG with caution—they apply to the authors, but not to all the spurious works included under the author's name.

Lerna Va

Counts of unique strings in the corpora

(old → corrected)		Word Instances	Word Forms
TLG + PHI #7	(viii-XVI, +tech +christ +inscr/pap)	102,005,245 → 101,684,658	1,861,358 → 1,815,540
TLG (viii–XVI)	(viii–XVI, +tech +christ -inscr/pap)	95,475,128	1,567,892
LSJ Corpus (viii–VI)	(viii-VI, +tech -christ +inscr/pap)	34,746,312	1,147,454
Mostly Pagan (viii–IV)	(viii–IV, -tech -christ -inscr/pap)	16,312,159	605,335
Strictly Ancient (viii–iv)	(viii–iv, +tech -christ +inscr/pap)	5,464,913 → 5,463,292	334,428 → 334,187

This is where the differences start. By correcting incomplete word indications, hyphenation, and rejected scribal forms in PHI #7, I've lost around 320,000 word instances, and 46,000 distinct word forms.

You can also see that going from Mostly Pagan to the LSJ corpus almost doubles the count of distinct word forms. That's adding in two more centuries of pagan literature, technical writing, inscriptions, papyri, and late lexica. The PHI #7 texts account for around a third of that increase; the rest comes from the technical writing and lexica. The lexica include a large number of one-off words, and a lot of loose Byzantine spelling. Technical writing includes even more loose Byzantine spelling, because these texts are not closely bound to Atticist literary norms.

But it also includes a lot of idiosyncratic vocabulary—medical, astrological, engineering, mathematical, not to mention all the random place names in Ptolemy and the other geographical texts. Technical writing also encompasses grammatical and philological commentary—which often means grammarians just making up tenses and cases to explain words. So there is a lot of distinctive vocabulary in technical writing; but there is also a lot of inflated vocabulary.

Stripping case and forms without diacritics

I've fixed the calculations to take out more forms with partial diacritics, so I'm now making sure that all of ανδρι, ἀνδρι and ανδρί are folded under ἀνδρί. So fewer forms from here on are considered truly distinct:

(old → corrected)	Word Forms
TLG + PHI #7	1,649,083 → 1,545,491
TLG (viii–XVI)	1,376,016 → 1,355,062
LSJ Corpus (viii–VI)	1,001,079
Mostly Pagan (viii–IV)	562,744 → 555,843
Strictly Ancient (viii–iv)	314,887 → 312,255
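
To make the folding concrete, here is a minimal sketch of one way to do it. It is not the code behind these counts: the helper names `split_marks` and `fold_partial` are hypothetical, and the rule used (fold a form under another form with the same letters whose diacritics are a strict superset) is my assumption about what "partial diacritics" means here.

```python
import unicodedata

def split_marks(form: str):
    """Return (bare letters, set of (letter index, combining mark)) for a form."""
    base, marks = [], set()
    for ch in unicodedata.normalize("NFD", form.lower()):
        if unicodedata.combining(ch):
            marks.add((len(base) - 1, ch))  # mark attaches to the previous letter
        else:
            base.append(ch)
    return "".join(base), frozenset(marks)

def fold_partial(forms):
    """Map each form to the fullest form whose diacritics strictly include its own."""
    parsed = {f: split_marks(f) for f in forms}
    folded = {}
    for f, (base, marks) in parsed.items():
        fuller = [g for g, (b, m) in parsed.items() if b == base and marks < m]
        folded[f] = max(fuller, key=lambda g: len(parsed[g][1])) if fuller else f
    return folded

forms = ["ανδρι", "ἀνδρι", "ανδρί", "ἀνδρί"]
print(fold_partial(forms))
```

On the four forms above, everything ends up under ἀνδρί. Two fully marked forms that differ only in where the accent falls, such as ἄλλα and ἀλλά, stay distinct, because neither set of diacritics contains the other.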

Restricting to recognised forms

Though I've added a thousand-odd proper names and some Arcadian and Cretan grammar to the lemmatiser, it still struggles with the PHI #7 corpus, as you'd expect: it's now understanding 62% of all word forms instead of 59%. There are 73,000 capitalised word forms in PHI #7, and 21,000 uncapitalised, that the lemmatiser has no idea about. For the TLG corpus, the equivalent is currently 42,000 capitalised word forms, and 43,000 uncapitalised that are going unrecognised; and the TLG has seven times more word forms than PHI #7.

So there is a *lot* of vocabulary, particularly proper names, that is unique to the PHI #7 corpus, and that the lemmatiser does not yet understand. In fact, I already know there should be 16,000 distinct proper names in the papyri alone, as I mentioned last post. But once again, if I am using the lemmatiser to make morphological judgements about distinct word forms, I can't count words that the lemmatiser doesn't understand. So I have to pretend those words don't exist, for any remaining counts to mean anything.

OTOH, it's been a month, and recognition of the TLG corpus has gone up (partly because of this series of posts). The word counts are not static.

(old → corrected)	Word Forms
TLG + PHI #7	1,435,391 → 1,391,855
TLG (viii–XVI)	1,282,298 → 1,272,773
LSJ Corpus (viii–VI)	905,044
Mostly Pagan (viii–IV)	557,574 → 551,651
Strictly Ancient (viii–iv)	313,354 → 311,428

Normalisation of forms (crasis, apostrophe, respellings)

Yeah, more bugs here. I've been case-folding word forms up to this point; I, uh, think I forgot to case-fold the normalised word forms as well. Which ends up making quite a difference.

(old → corrected)	Word Forms
TLG + PHI #7	1,352,303 → 1,152,682
TLG (viii–XVI)	1,232,209 → 1,101,191
LSJ Corpus (viii–VI)	736,932
Mostly Pagan (viii–IV)	539,469 → 481,424
Strictly Ancient (viii–iv)	301,005 → 275,703

Eliminating nu movable

Here too I changed the way I was considering a form to have nu movable: I relied on the morphological analysis rather than doing a blanket transformation. So fewer forms now get conflated.

(old → corrected)	Word Forms
TLG + PHI #7	1,307,842 → 1,125,784
TLG (viii–XVI)	1,189,688 → 1,074,767
LSJ Corpus (viii–VI)	720,855
Mostly Pagan (viii–IV)	519,498 → 470,096
Strictly Ancient (viii–iv)	289,812 → 270,115

Eliminating non-words (abbreviations, Greek numerals, or geometric lines)

The more aggressive folding of diacritics I've put in means there aren't many of these left at all.

(old → corrected)	Word Forms
TLG + PHI #7	1,300,717 → 1,125,699
TLG (viii–XVI)	1,183,120 → 1,074,683
LSJ Corpus (viii–VI)	720,800
Mostly Pagan (viii–IV)	518,321 → 470,096
Strictly Ancient (viii–iv)	289,275 → 270,093

So, sheepishly, I find that I overestimated unique word forms by say 120,000, and the errors in how I was handling PHI #7 made me overestimate by another 50,000. When I was comparing Three Thousand Years Of Greek to Slovenian and Telugu, my average word forms per thousand word instances in the TLG was 12.6; it is now 11.3. Telugu still has 30.8, so it still wins...
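
The new ratio can be checked against the corrected TLG figures above. I am assuming the figure meant is the last TLG word-form count in the Lerna Va tables (after eliminating non-words) divided by the TLG word instance count.

```python
tlg_instances = 95_475_128       # TLG word instances
tlg_forms_corrected = 1_074_683  # TLG word forms after eliminating non-words

# Distinct word forms per thousand word instances.
forms_per_thousand = tlg_forms_corrected / tlg_instances * 1000
print(f"{forms_per_thousand:.1f}")
```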

Lerna Vb

Forms of Good Standing (without: hypothetical, hypercorrect, uncertain inflection, anomalous inflection, transliterated Latin)

(old → corrected)	Word Forms
TLG + PHI #7	1,267,434 → 1,101,948
TLG (viii–XVI)	1,158,529 → 1,053,549
LSJ Corpus (viii–VI)	708,669
Mostly Pagan (viii–IV)	515,275 → 468,698
Strictly Ancient (viii–iv)	288,305 → 269,448

Forms of Good Standing and Pedigree (linguistically Classical)

(old → corrected)	Word Forms
TLG + PHI #7	1,135,915 → 980,867
TLG (viii–XVI)	1,041,520 → 938,084
LSJ Corpus (viii–VI)	676,114
Mostly Pagan (viii–IV)	505,302 → 458,756
Strictly Ancient (viii–iv)	285,856 → 266,891

Forms of Good Standing and Cecropian Pedigree (linguistically Attic)

(old → corrected)	Word Forms
TLG + PHI #7	1,020,232 → 889,759
TLG (viii–XVI)	952,993 → 857,008
LSJ Corpus (viii–VI)	604,107
Mostly Pagan (viii–IV)	458,933 → 415,869
Strictly Ancient (viii–iv)	248,914 → 232,008

2009-07-03

Lerna VIb: A derailing of lemma counts

You may have noticed an extended radio silence for the last couple of weeks in the series counting lemmata. The people at the Magnificent Nikos Sarantakos' blog, where the good fight against Lerna is fought, know why: I found some problems in the way I was counting lemmata in the inscriptions and papyrus corpus (PHI #7), which I've been nowhere near as familiar with as the TLG corpus. As a result, I'm down 2,000-odd lemmata from where I thought I was. Because I spent lots of posts on how contingent and provisional any count of lemmata is, that should not be that big a deal: a ±1% in the lemma count is within the bounds of what can happen when you fix first-cut errors.

Still, it's embarrassed me enough, now that people are starting to quote the Lerna VIa count of 211,794 (including Nikos Sarantakos, fighting the good fight), that I tried to get to the bottom of it. In the process, I've worked to treat the PHI #7 corpus less cursorily than I had done. Cleaning up problems in the PHI #7 markup, and clueing the lemmatiser in on some of the peculiarities of the dialects in the corpus, mean that the counts would give a more accurate picture of what was going on with those texts. The problem is, the longer I spent fixing my handling of PHI #7, the more the lemma count fell, *even as I was busy adding lemmata from elsewhere* (DGE, Pape-Benseler, Foraboschi). Erk. The counts are more accurate (with a catch I'll talk about), but they're not what they were.

I'm going to air some of the dirty laundry here, to cement the point yet again that any count of lemmata is going to be unstable. After that, next post is going to revise the counts that need revising. Then, the promised posts that got derailed: how many of these lemmata count as Ancient; relating lemma counts to recognition percentages (which is the only way lemma counts are meaningful); and distinguishing word variants from lemmata.

The first issue was when I wanted to count how many lemmata should be considered Ancient. I realised I had not been counting a couple of thousand lemmata from Lampe's and Trapp's dictionaries (I-VIII AD and "IX-XII" AD) as post-classical. That did not particularly affect the accuracy of recognition for the TLG (as I confirmed by rerunning the program), but it was distorting the numbers: there are fewer "word forms of good pedigree" than I said there are. So you'll get new numbers for that.

The second catch was a bug in how I was extracting word forms from the PHI #7 corpus, which meant that several hyphens were being ignored, so a hyphenated word would be extracted as two separate words. Once I fixed that bug, I also noticed that some of the markers that a word was fragmentary weren't being picked up. For instance, I knew that notation like ...]atisatio[ indicated bits of a word were missing from the papyrus or inscription; I didn't know that PHI #7 was also using dashes, like – – ]atisatio[ – –. Fixing these problems results in fewer complete word instances extracted, but of course more correct ones. Even if some lemmata that looked like being there were no longer recognised, there should be more correct long words turning up. So that should not cause any drastic drops in the size of the vocabulary.
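To make the extraction fix concrete, here is a minimal sketch of the idea, not the actual extraction code. It assumes plain-text input rather than the real PHI #7 markup, and the function name `complete_words` and the marker set (square brackets, runs of dots, en dashes) are illustrative assumptions based only on the notations mentioned above.

```python
import re

# Assumed markers that part of a word is lost: square brackets,
# runs of two or more dots, and en dashes (as in "– – ]atisatio[ – –").
_FRAGMENT = re.compile(r"[\[\]]|\.{2,}|\u2013")


def complete_words(lines):
    """Yield whole words only, rejoining words hyphenated across line ends."""
    text = ""
    for line in lines:
        line = line.rstrip()
        if line.endswith("-"):
            # The word continues on the next line: drop the hyphen, add no space.
            text += line[:-1]
        else:
            text += line + " "
    for token in text.split():
        # Skip anything that touches a marker of missing text.
        if not _FRAGMENT.search(token):
            yield token
```

With this sketch, a word split as "lemma-" / "tiser" comes out as one word rather than two, and both kinds of fragment notation are skipped instead of being counted as word instances.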

The next three problem fixes seem to be what's caused issues. Papyri are spelled phonetically, by the norms of Koine Greek, so the lemmatiser allows for some spelling variation: ι for ει, for instance, or ω for ο. Inscriptions and legal deeds from Late Byzantium need to allow for a lot more spelling variation, because of the many Ancient phonemes that had ended up pronounced identically: so ι could now be a misspelling of any of η ει οι υ υι.

Archaic inscriptions, on the other hand, may have a narrower range of respellings than papyri (depends on how early), but they also have different spellings of their own, because they use different versions of the Greek alphabet: ω and ου were Ionic innovations in the alphabet, for example, and what conventional Ancient orthography spells as ω and ου, most inscriptions before iv BC spell as just ο. So unlike papyri or church deeds, a system dealing with inscriptions has to allow ο to stand for ω or ου.

The lemmatisation run over PHI #7 that I'd reported was allowing all possible respellings from all periods indiscriminately. So an XIV AD document was being allowed the same latitude in spelling as a vii BC document.

Yeah, you can see how that might be a problem. I fixed this by allowing different respelling rules for the three parts of the corpus: the ancient inscriptions, the papyri, and the Christian inscriptions (which run all the way to Ottoman times). There'll still be some wrong respellings, because each part of the corpus spans a long period. But it'll be a lot better than allowing XIV AD iotacism in a vii BC text. Of course, restricting respellings means that lemmata that were being over-recognised in texts now aren't. That's fair enough.
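A rough sketch of what per-part respelling rules can look like follows. This is an illustration under stated assumptions, not the lemmatiser's real rule set: the table `RESPELLINGS`, the part names, and the function `candidate_spellings` are all hypothetical, the rules are only the handful mentioned above, and accents and breathings are ignored.

```python
from itertools import product

# Hypothetical rule tables: written letter -> spellings it may stand for.
# Only the examples named in the text are included.
RESPELLINGS = {
    # Pre-Ionic alphabets: ο can stand for ω or ου.
    "ancient_inscriptions": {"ο": ["ο", "ω", "ου"]},
    # Koine phonetic spelling: ι for ει, ω for ο.
    "papyri": {"ι": ["ι", "ει"], "ω": ["ω", "ο"]},
    # Late iotacism: ι can stand for any of η ει οι υ υι.
    "christian_inscriptions": {"ι": ["ι", "η", "ει", "οι", "υ", "υι"]},
}


def candidate_spellings(word, part):
    """Return every normalised spelling the written form may represent."""
    rules = RESPELLINGS[part]
    # Each letter offers either its allowed respellings or just itself.
    options = [rules.get(letter, [letter]) for letter in word]
    return {"".join(combo) for combo in product(*options)}
```

The point of keying the table by corpus part is that written το can be looked up as τω or του in an archaic inscription, but only as το in a papyrus. The candidate set grows multiplicatively with each ambiguous letter, which is one reason indiscriminate rules over-recognise.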

I also tried to restrict the lemmata that were allowed for each part of the corpus, to prevent absurdities. Modern Greek words couldn't be allowed for Ancient texts of course, but they do show up in the late Christian inscriptions. The ancient inscriptions do keep going well into Roman times, so I couldn't ban Koine lemmata from there; but I did try to keep recognition plausible, by blocking from the papyri and ancient inscriptions any words unique to Trapp's dictionary.

That's underestimating both Trapp and the papyri. The papyri keep going until Greek yielded to Arabic in Egypt—a generation or so after the Islamic conquest, so VIII AD. Trapp, OTOH, badges itself as IX-XII AD—but it also sets out to fill in gaps left by other dictionaries, so it can be the only place where late papyri get covered. So some lemmata that should have been allowed for the papyri were being blocked. But having checked, only 150-odd legitimate lemmata were affected (and are now back in). So that wasn't the major disruption.

The other problem, as far as I can tell, was that PHI #7 allowed in its markup both the word or phrasing the editor thinks the text is saying, and (in special brackets) the odd wording the scribe actually wrote; e.g. lemmatisation {4lmmeatsiantion}4. If an editor has decided to correct lmmeatsiantion as lemmatisation, I decided, I shouldn't be trying to analyse both. The editor's fix should count as the word instance for recognition: the "misspelling" (as the editor has judged it) shouldn't be considered an independent word. It looks like, in the process, some words LSJ says existed no longer turn up, because LSJ didn't trust the editor as much as I do. But all texts from a papyrus or inscription get filtered by the editor publishing it and making sense of it, just like all the literary texts in the TLG. So that's the consistent thing to do.
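As a sketch of that decision, assuming the scribal original always appears as a flat `{4...}4` span as in the example above (the real PHI #7 markup may well be messier), the filtering could be as simple as the following. The function name `editors_reading` is hypothetical.

```python
import re

# Assumed shape of the markup: the scribe's wording sits in {4 ... }4,
# right after the editor's corrected reading. Nested braces are not handled.
_SCRIBAL = re.compile(r"\s*\{4[^{}]*\}4")


def editors_reading(text):
    """Drop the scribal originals, keeping only the editor's corrected text."""
    return _SCRIBAL.sub("", text)
```

Revisiting the scribal originals later would amount to extracting the contents of those spans into a separate word list rather than discarding them.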

All up, skipping proper names, 3,500-odd lemmata are no longer turning up as recognised. OTOH, 700 lemmata are now newly turning up that weren't before. Those numbers are still subject to change; but most of the 3,500 lemmata that disappeared should have disappeared. The scribal originals like lmmeatsiantion arguably shouldn't have disappeared, and I may end up revisiting them down the road. But I've already spent two weeks trying to deal with the vanished 3,500, and I shouldn't be holding postings up much longer.

To compensate for the missing 3,500, I went through the PHI #7 corpus, and looked more closely at what kinds of words weren't being recognised—making sure that words occurring frequently in the corpus were accounted for. That involved some tweaking in the allowable spelling variations, and some filling in of the more obscure dialects' grammar.

I had no idea what the Arcadian first declension genitive was like—or how it's spread. Arcadian τρίταυ /trítau/ "of the third" corresponds to Homeric masculine τρίταο /trítao/ (Attic τρίτου /trítoː/), but it's also spread to the feminine, displacing Proto-Greek and Doric τρίτας /trítaːs/ (Attic τρίτης /trítɛːs/). Arcadian τρίταυ reminds me of the Esperanto -aŭ ending; I wonder if I'm the first person to have had that mental short-circuit.

Beyond that, if the dictionaries that the TLG lemmatiser already knew about didn't account for frequent word forms, I checked them in DGE. After all, part of DGE's reason for existence was to broaden the coverage of LSJ into new finds in inscriptions and papyri. For lowercase words (i.e. excluding proper names), I went through all word forms occurring more than twice in the corpus; DGE is up to εκ-, and I did end up adding new lemmata from DGE, unique to this corpus.

The count of lemmata I added to the vocabulary from DGE... was 12. This surprised me, especially because even between α and αλ (for which DGE went back and redid Vol. I) there were a few word forms still unaccounted for on PHI #7. Going down to word forms occurring just twice or once will account for a lot more than 12 lemmata from DGE; but it won't account for thousands. The gaps remaining even after DGE are something I'll be looking at again: I'm curious to work out what's going on. Of course, PHI #7 is nowhere near a complete corpus even for 1995 when it was published, let alone now, with the continuous stream of inscriptions and papyri being transcribed and published. Only the Athenian curse tablets from Audollent's 1904 collection, for example, are in. (So when I looked at how καταχθόνιος and χθόνιος are used in the tablets for a paper, I had to do eyeballing as well as keyboard searching.)

I also wanted to improve the recognition of proper names particular to PHI #7, where the lemmatiser is really struggling: it now recognises 46% of all capitalised words, vs. 89% of all lowercase words. As I keep saying, proper names shouldn't count at all, but a couple of thousand instances of Πεθέως drawing a blank from the lemmatiser was a bit much for me. Moreover, if the lemmatiser isn't told about a proper name, it will end up making wrong guesses about what the lemma actually is. There are several inscriptions-only names that I was able to find in Pape-Benseler; but the big store of unrecognised names is in the papyri. And there's a simple reason why so many names from papyri drew a blank from the Greek lemmatiser: they're not Greek names, but Egyptian.

Of course, adding 500 or 1000 Egyptian names to improve Greek word recognition sounds suspect, right? But no more suspect than adding Hebrew names to improve recognition of words in the Septuagint, or Roman names to improve recognition of Cassius Dio. That, after all, is why proper names don't count when you count lemmata.

I'm using Foraboschi as my Egyptian phone book; it's the update to Preisigke's Namenbuch, which itself seems to be AWOL at the moment, in transit from Monash University to the University of Melbourne. My bloody fault for not waiting to drive over to Monash on the weekend: it's just 10 minutes up the road from my place.

People whose day job it is to look at names in papyri (several projects based at Leuven) have already been counting the proper names in the Duke Database of Documentary Papyri, which is what PHI #7 uses for the papyri. So they're doing the electronic counterpart to the dead tree phone book I'm sampling. The Leuven projects have come up with 26,000 name variants in the corpus, in 16,500 lemmata—and the majority of them are Egyptian, and unknown to other corpora of Greek (although a few of them make it to Athanasius of Alexandria or the Desert Fathers, who after all were also Egyptians). I'm not proposing to sit down and add 16,500 lemmata to the lemmatiser database: this is not my day job. I'm aiming at adding around 1,000, as triage prioritising the most frequent names; that'll account for uppercase word forms turning up 10 or more times in the corpus.

So, I'm going to tell you I know of 1,000 Egyptian names in PHI #7, when the Leuven papyrologists know there are 16,500? Why yes. Just like I'm telling you I know of 35,000 proper names in the TLG, when there are 42,000 uppercase words in the TLG unaccounted for. I don't know how many names there are, and that's not what the current lemma count is about. But I do know how many names account for n% of the corpus, for a suitably large n. The name count is not open-ended, but it is pretty large: larger than for common nouns.
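The "how many names account for n% of the corpus" figure is a plain cumulative-frequency calculation. Here is a minimal sketch of it; the function name `types_for_coverage` and the token-list input are assumptions for illustration, and the real counts obviously come out of the lemmatiser's own data rather than a Python list.

```python
from collections import Counter


def types_for_coverage(tokens, percent):
    """Return how many of the most frequent types cover `percent`% of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the types from most to least frequent, accumulating instances.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= percent * total:
            return rank
    return 0  # empty input: nothing to cover
```

Because name frequencies are heavily skewed, the number returned for a high percentage is far smaller than the number of distinct names, which is why triage on the most frequent thousand or so is worth doing at all.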

In fact, the good folk at Heidelberg Uni Centre for Research on Antiquity have produced lists of lemmata in papyri. They've got around 22,000 lemmata, half of them names. So Leuven knows more names than Heidelberg knows, presumably because Heidelberg is using a smaller corpus. I know fewer than either. And the more papyri turn up, the more names and nouns and verbs will turn up. Lemmata are open-ended.

But while Heidelberg's 11,000 names can turn into 16,500, they won't turn into a million. And while 175,000 lemmata without names can turn into 173,000 when I fix PHI #7 (and maybe 220,000 once both the dictionaries and the corpus are complete up to 1453 or 1669), it's not going to turn into five million. Even if you count all the variants of dialect and spelling and phonology in lemmata, as I'll attempt in the final installment (which is how Leuven get from 16,500 names to 26,000, and the OED gets from 230,000 lemmata to 610,000 variants), even then you're not getting to a million. (My current back-of-the-envelope calculation without names is around 350,000.)

OK, I've got some Egyptian names to go before I revise the published counts.
