So confronted with two similar nominative singulars in the vocabulary, or two 1st person present indicatives, you need to decide whether you'll count them as the same lexeme or not. And how liberal you are in your counting will decide how many lemmata you count, and how many you dismiss as variants.
To illustrate with English: you won't count color and colour as distinct lemmata, nor publicise and publicize. You won't count recieve as a distinct lemma from receive: misspellings happen, and you still need to count those misspellings as something. Though imbed is not a misspelling of embed, but a different derivation, you'd still want to conflate them too as the one headword.
OTOH, some people draw a distinction between racist and racialist. You may not, but if enough people do, you have to treat them as distinct. (The fact that the OED has decided they're now the same thing does not mean the entire language community has.) That's a judgement call in itself: people are uncomfortable with morphological variation as much as with any other kind, and people swore to the OED that there's a meaning distinction between gray and grey too. So there are no right answers. But there are arbitrary decisions.
Dictionaries normally make those arbitrary decisions for you, because they cross-reference variants to main headwords, and you can rely on their judgement. But dictionaries' judgement can be as arbitrary as any other's: the distinctions can be fine—particularly with variant suffixes, as we'll see. And some dictionaries conflate variants more effectively than others. Kriaras has a lot of phonetic variability, and hides a lot more spelling variability, because of its normalised modern spelling. But it does a reasonable job of indicating which forms are just variants.
By contrast, LSJ's more discursive entries include derived lemmata and similar lemmata, as well as simple variants, in the one entry. So it's harder for automated processing of dictionaries, such as the TLG lemmatiser does, to trust that two words cited in the one entry are in fact the same lexeme. Because the automated processing errs on the side of caution, it will distinguish variants as lemmata more than it should, which inflates the lemma count. The TLG lemmatiser currently sees two or more lemmata in some 12,000 LSJ entries; eyeballing, I'd say maybe a fifth of those could arguably be conflated. There are also a number of lemmata in later dictionaries (Trapp and Kriaras most notably) which could arguably be conflated with lemmata in LSJ, but haven't been yet, because I haven't been through all 70,000-odd of their lemmata manually. (I'm not bold enough to guess how many.)
So there is a margin of overcounting lemmata; OTOH, there are many more instances where one could argue lemmata are undercounted, because the lemmatiser conflates variants. *I* won't argue that, because I've been responsible for a lot of the conflating. But this has been driven by a particular take on the vocabulary of Greek: if you're searching for words through a search engine, you're likelier to care about meaning than inflection, and you'll want the search to retrieve any word instance that looks enough like your word to match. You won't want to skip matches just because the spelling is slightly off. So for that purpose, fewer lemmata, meaning more search hits per lemma, is a good thing.
This has meant that, where there was doubt about whether a form is a distinct lemma or not, e.g. in LSJ cross-references, or variation between LSJ and Trapp, I've usually conflated those forms that I've been through manually. That's not everyone's purpose. If you're trying to inflate the count of distinct words of Greek, it's definitely not your purpose. Even if that isn't what you're trying to do, there will be disagreements on how much to conflate.
Some of those disagreements are tied to the dead tree: paper dictionaries have been understandably more reluctant to conflate variants that are alphabetically distant from each other. But there have been times that the choice to conflate hasn't been clear; and times when I've decided not to conflate. (The TLG search engine does display cross-referencing to the user, so that decision is not fatal.)
So what sorts of things may or may not be conflated as the same lemma in Greek? This laundry list will cover at least some of it:
- Dialectal phonological differences
- For some of the ancient dialects some of the time, these differences are so predictable, they're not even mentioned in LSJ entries. The big example is the different treatment of Proto-Greek */aː/ in Doric and Aeolic (stays α), in Ionic (goes to η), and in Attic (η except after ε, ι, ρ). The second example is Aeolic being accented as far back in the word as humanly possible, something which in the rest of Greek regularly happens for just verbs and compounds. There's more, and together they all mean that you shouldn't count Attic βοηθέω, Ionic βωθέω, and Doric βοαθοέω as different verbs for "help". Nor for that matter Aeolic βαθόημι: the different inflection is normal for Aeolic, and α is a plausible fate for /oaː/. (Just like the Ionic ω is.)
- Classical spelling variations
- If the odd inscription spells βοηθέω as βοιηθέω, that still isn't reason enough to call it a different lemma. Some spelling variation is endemic to the Classical language, because the pronunciation of those words was in flux—as occurs at any stage of any language. Usually, dictionaries will sweep this spelling variation under the carpet too, in generic cross-references like "κρεω-: see κρεο-". Once LSJ has worked out that λιπ- was the original pronunciation of compounds meaning "lacking in", every compound that can be spelled with λειπ- is listed under λιπ- .
... With the inconvenient exception of λειπογνώμων "without an inspector (or tariff, or distinguishing mark)", for which no ancient authority gives a λιπ- spelling, because the pronunciation was already changing. LSJ does dismiss the mediaeval spellings of the word as λιπογνώμων; but really, who could blame the mediaevals?
- Late spelling variations
- λιπογνώμων is very, *very* far from the only instance when mediaeval or even Hellenistic writers could not cope with the increasingly historical spelling of Greek, and spelled words more creatively than LSJ allows. LSJ is an historical dictionary, so its business is to go back to the original phonology of the word where possible, and dismiss everything outside its ambit. But if the manuscripts and papyri spell the word differently, that does not mean it's a different word. I've spent several entertaining months helping the lemmatiser cope with λλ as an alternate spelling of λ, or ι as an alternate spelling of ει, or σζ as an... interesting spelling of σ.
- In making sense of the derivation of words, grammarians would often suggest what they thought the real underlying form of a word was. For example, in discussing ἀζηχής "continuous", they would suggest the form is underlyingly ἀδιηχής "unseparated" or ἀδιεχθής "unhostile". I have on occasion conflated these hypothetical derivations with the word they're explaining, when there was not anything else to be done with them.
- With written Greek increasingly remote from the spoken language, Byzantine writers did increasingly oddball things to make their words sound high-falutin'. Theodore Metochites' Homer-Through-The-Looking-Glass is one egregious instance, the result of too much Classical learning ("if Homer said νοῦσος for νόσος, then I'll say σουφία for σοφία"). Theodore Studites' is another, and the result of not enough Classical learning: I'll never quite get over him working out that the imperfect of περισσεύω is περι-έσσευε. Throughout the period, there's a persistent tendency to stress words on the "wrong" syllable. I assume the authorities don't discuss it because it was beneath their notice; and I assume the Byzantines did it Because They Could.
This means that Byzantines have made up a bunch of variants for lemmata which did not exist, quite artificially. However artificial they are, they turn up in the corpus, so they need to be accounted for; but they shouldn't be counted as new words. They're old words in clown outfits.
- Language change
- But Classical words don't only turn up in the corpus in either chlamys or clown outfits. They also turn up in denim. Hm, I don't know if that analogy is going to work...
Variation in words is also going to result from natural language change. (In fact, differentiating natural and artificial language change is harder than it looks.) As we saw in earlier episodes, the *anr- stem of Proto-Greek, "man", turns up as ἀνήρ, ἀνδρός in Classical Greek, as ἄνδρας, ἄνδρα in Byzantine Greek, and as ἄντρας, ἄντρα in Modern Greek. The third declension of the original has been done away with, and the Ancient /ndr/ cluster is now spelled differently, leaving the eye-pronunciation /nðr/ to learned forms.
For the purposes that the TLG lemmatiser is put to—searching words in a diachronic corpus—that's not enough reason to call them different lemmata. ἄντρας is the natural development of ἀνήρ, it means the same thing, so they count as the same thing. But morphologically ἄντρας has already moved on from ἀνήρ, and counting them as the same is a diachronic artifice. It's only because we're trying to cover three thousand years in the one corpus, that we're conflating words three thousand years apart.
- Variations in compounding
- Putting stem A and stem B together gives you a new compound AB. That compound should for the most part count as the same word, regardless of slight differences in how the compound is put together. So ὑδατοφόρος "water-bearing" does not show up in the dictionaries, but ὑδροφόρος "water-bearing" does; they should count as variants of the same lemma, since hydro and hydato are allomorphs of the same noun, /húdɔːr/ gen. /húdatos/ "water". This, already, is a conflation lexicographers will not be equally eager to embrace.
Similarly, Classical Greek has δαφνηφόρος "carrying laurels" (as a ceremonial act), using -η- as a combining vowel; Late Greek no longer used -η- as a combining vowel, so the word turns up there as the more regular δαφνοφόρος. I'm disinclined to call these distinct lemmata. Others may not be.
- Variations in inflection
- Here things really do start getting murky: if the inflection of two forms is slightly different, though their stem is the same, should they count as the same lemma? I have allowed them to sometimes, especially if there is a distinction between an earlier and a later, more transparent or commonplace form. So when Theodore Studites, with his shaky command of classical morphology, uses εὐέλπης, εὐέλπες instead of εὔελπις, εὔελπι, or when he forms the aorist passive participle of εὐκρινέω as εὐκριθέν, implying a citation form εὐκρίνω, I smile benignly, and allow that he's gotten the Classical form wrong, not that he's come up with a brand new lemma. When the variants are contemporary, I'm more reluctant to make that conflation. So I have let ἀθλεύω and ἀθλέω remain distinct verbs. That's not to say I've never done such a conflation, especially when LSJ has said "= sq."; but I've had less of a motivation to.
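The search-driven conflation described above can be pictured as a lookup from each recorded variant to its headword. What follows is a minimal sketch, not the TLG lemmatiser's own code: the table `LEMMA_VARIANTS` and the function `expand_query` are hypothetical names, and the groupings simply follow the examples given in the list.

```python
import unicodedata

# Hypothetical headword -> variants table, built from the examples in the text.
LEMMA_VARIANTS = {
    "βοηθέω": ["βοηθέω", "βωθέω", "βοαθοέω", "βαθόημι"],
    "ἀνήρ": ["ἀνήρ", "ἄνδρας", "ἄντρας"],
    "ὑδροφόρος": ["ὑδροφόρος", "ὑδατοφόρος"],
    "δαφνηφόρος": ["δαφνηφόρος", "δαφνοφόρος"],
    "ἀθλεύω": ["ἀθλεύω"],  # contemporary variants, left as distinct lemmata
    "ἀθλέω": ["ἀθλέω"],
}

def nfc(text):
    """Normalise so that equivalent encodings of accented Greek compare equal."""
    return unicodedata.normalize("NFC", text)

# Reverse index: variant -> headword.
VARIANT_TO_LEMMA = {
    nfc(variant): nfc(lemma)
    for lemma, variants in LEMMA_VARIANTS.items()
    for variant in variants
}

def expand_query(word):
    """Return every recorded variant that shares a lemma with `word`."""
    lemma = VARIANT_TO_LEMMA.get(nfc(word))
    if lemma is None:
        return {nfc(word)}  # unknown form: search for it as typed
    return {v for v, l in VARIANT_TO_LEMMA.items() if l == lemma}

print(len(LEMMA_VARIANTS), "lemmata,", len(VARIANT_TO_LEMMA), "variants")
```

Under this toy table a search for any one of the three "man" forms retrieves all three, while ἀθλεύω and ἀθλέω each retrieve only themselves. Every extra conflation lowers the lemma count and raises the hits per lemma, which is the trade-off at issue.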
So I have come up with a count of lemmata that conflates some variants and keeps others apart. In conflating variants, I am coming up with a lower count of lemmata than others might. Which means there is a question mark over the count of 173,000 lemmata I've claimed.
Well, good. Like I keep saying, there's a question mark over any count of words of any sort. But given that the lemmata contain multitudes, it's worth uncovering how many of those multitudes there are. So I'm going to count up the variants.
The first warning about these counts is that this is still not how you're going to beat English in counts of words. The OED counts some 250,000 lemmata (and its coverage of Old and Middle English is *not* exhaustive); with variants, that gets up to 615,000. There will be a lot more than 173,000 variants to count in the Greek corpus; but there won't be 615,000.
The second is to explain what I'm counting.
- I allow any variation in the lexical root, recorded in the lexical database, to count as a distinct variant. So ἀνήρ, ἄνδρας, and ἄντρας count as three variants, and Doric ἀγέννατος and Attic ἀγέννητος as two.
- I do *not* count dialectal or diachronic change in the same inflection paradigm (including tense stems) as different. So I don't count both Ionic χώρ-η and Attic χώρ-α, or Doric δωρ-ίσδω and Attic δωρ-ίζω, as different.
- I don't count dialectal variants in preverbs, because they are not part of the root; so ξυμμαχέω is not counted separately from συμμαχέω.
- I don't count uppercase and lowercase variants of the same lemma. (But I do count them when they are distinct lemmata.)
- I don't count active and passive variants of the same root verb as distinct.
- And I don't count the ad hoc respellings that the lemmatiser can do on the fly to recognise deviations, particularly in diplomatic editions.
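The counting rules above boil down to counting distinct roots per lemma, while ignoring the preverb and the inflectional ending. Here is a minimal sketch under that reading. The row layout (lemma, preverb, root, ending), the lemma keys and the segmentation of each form are all assumptions made for illustration, not the structure of the actual lexical database.

```python
import unicodedata
from collections import defaultdict

# Hypothetical rows: (lemma, preverb, root, ending). Segmentation is illustrative.
ROWS = [
    ("ἀνήρ", "", "ἀνήρ", ""),
    ("ἀνήρ", "", "ἄνδρ", "ας"),
    ("ἀνήρ", "", "ἄντρ", "ας"),
    ("ἀγέννητος", "", "ἀγέννατ", "ος"),   # Doric root
    ("ἀγέννητος", "", "ἀγέννητ", "ος"),   # Attic root
    ("χώρα", "", "χώρ", "η"),             # Ionic ending, same root
    ("χώρα", "", "χώρ", "α"),             # Attic ending, same root
    ("δωρίζω", "", "δωρ", "ίσδω"),        # Doric ending, same root
    ("δωρίζω", "", "δωρ", "ίζω"),         # Attic ending, same root
    ("συμμαχέω", "ξυμ", "μαχ", "έω"),     # dialectal preverb, same root
    ("συμμαχέω", "συμ", "μαχ", "έω"),
]

def nfc(text):
    """Normalise so that equivalent encodings of accented Greek compare equal."""
    return unicodedata.normalize("NFC", text)

def variant_counts(rows):
    """Count distinct roots per lemma; preverb and ending are deliberately ignored."""
    roots = defaultdict(set)
    for lemma, _preverb, root, _ending in rows:
        roots[nfc(lemma)].add(nfc(root))
    return {lemma: len(found) for lemma, found in roots.items()}

counts = variant_counts(ROWS)
print(sum(counts.values()), "variants across", len(counts), "lemmata")
```

On these rows the "man" lemma contributes three variants and ἀγέννητος two, while the ending and preverb differences contribute none beyond the first form.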
So. I had 214,381 lemmata in the corpus. Without proper names and Milesian numbers, that came down to 172,646. How many variants does that translate to?
- Variants: 362,947
- Without numbers: 352,895
- Without names and numbers: 286,652.
(I'd guessed around 350,000 variants a couple of postings ago. That's pretty good. It would be even better, if my guess hadn't excluded proper names...)
This amounts to 1.7 variants per lemma. I'll admit to some surprise that the OED ratio of variants to lemmata is more like 2.5: Greek historical spelling should allow for comparable confusion. My suspicion is it does, and I'm discounting ad hoc misspellings which the OED doesn't.
Names are slightly more variable than normal words: 66,243 variants for 37,306 names, which is a ratio of 1.8:1. Foreign names in particular get mangled in several ways, including creative hellenisations: that's why there are 75 different variants of "Muhammad", and 43 variants of "Lombard". (Those examples aren't fair, since "Muhammad" includes the Turkish "Mehmed", and "Lombard" also includes the earlier "Longibard"—so again, I'm conflating variants more than some might.)
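For anyone who wants to check the ratios, the arithmetic can be sketched as follows. The figures are the ones quoted above; the helper name `ratio` is just for illustration, and the name-variant total is derived by subtracting the "without names and numbers" count from the "without numbers" count.

```python
def ratio(variants, lemmata):
    """Average number of variants per lemma."""
    return variants / lemmata

greek_words = ratio(286_652, 172_646)        # variants vs lemmata, no names or numbers
oed = ratio(615_000, 250_000)                # OED figures as quoted
name_variants = 352_895 - 286_652            # variants that belong to proper names
greek_names = ratio(name_variants, 37_306)

print(f"Greek words: {greek_words:.2f}")     # 1.66
print(f"OED:         {oed:.2f}")             # 2.46
print(f"Greek names: {greek_names:.2f}")     # 1.78
```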
That count can be whittled down further of course.
- If we discount all variants noted as hypothetical (which were made up by grammarians, and were not used in the actual language), we come down to 274,650.
- If we ignore variation in accentuation, which is mostly a Byzantine hypercorrection, we're down to 265,233.
- If we ignore the uncertainty between ει and ι, which bedevilled the Koine, we're down to 260,362.
- Ignore double consonants: 253,891.
- Ignore the distinction between η and α, as a brute-force levelling of Doric and Attic, and we're down to 247,294.
- Ignore the distinction between smooth and rough breathing (which occasionally tripped scribes up): 245,672.
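Each of these exclusions amounts to comparing forms under a coarser key. Here is a minimal sketch of that idea, assuming forms are plain Unicode strings. The function `fold`, its flag names and the sample list are hypothetical and not the lemmatiser's own code; the sketch also compares forms across the whole list, where a real count would only fold variants within the same lemma.

```python
import re
import unicodedata

ACCENTS = "\u0300\u0301\u0342"   # combining grave, acute, perispomeni (circumflex)
BREATHINGS = "\u0313\u0314"      # combining smooth and rough breathing
CONSONANTS = "βγδζθκλμνξπρστφχψ"

def fold(word, *, accents=False, ei_i=False, double=False,
         eta_alpha=False, breathings=False):
    """Return a comparison key for `word` with the selected distinctions erased."""
    key = unicodedata.normalize("NFD", word)   # split letters from diacritics
    if accents:
        key = "".join(ch for ch in key if ch not in ACCENTS)
    if breathings:
        key = "".join(ch for ch in key if ch not in BREATHINGS)
    if ei_i:
        key = key.replace("ει", "ι")           # brute force: every ει becomes ι
    if double:
        key = re.sub(f"([{CONSONANTS}])\\1", r"\1", key)
    if eta_alpha:
        key = key.replace("η", "α")            # brute-force Doric/Attic levelling
    return unicodedata.normalize("NFC", key)

def count_distinct(forms, **folds):
    """Number of forms that remain distinct under the chosen folds."""
    return len({fold(form, **folds) for form in forms})

FORMS = ["λειπογνώμων", "λιπογνώμων", "ἀγέννητος", "ἀγέννατος"]
print(count_distinct(FORMS))                             # 4
print(count_distinct(FORMS, ei_i=True))                  # 3
print(count_distinct(FORMS, ei_i=True, eta_alpha=True))  # 2
```

The folds are deliberately crude. Replacing every η with α, for instance, levels far more than the Doric/Attic difference, which is why it is called brute force above.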
Even with these common causes of variation excluded, that's still some 73,000 added variants (about 42% on top of the lemma count) that I have not counted as distinct lemmata. That means that one could argue some of them should be counted as separate, although I have trouble seeing how a consistent criterion could be devised, especially over such a large timespan.
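To see how much each exclusion trims, and how many variants are left over beyond one form per lemma, the figures quoted above can be run through a short calculation. This is only a sketch of the arithmetic; the labels and the helper name `surplus` are illustrative.

```python
LEMMATA = 172_646  # lemmata without proper names and numbers

# Variant counts after each successive exclusion, as quoted in the text.
STEPS = [
    ("all variants", 286_652),
    ("no hypothetical forms", 274_650),
    ("no accent variation", 265_233),
    ("no ει/ι variation", 260_362),
    ("no double consonants", 253_891),
    ("no η/α variation", 247_294),
    ("no breathing variation", 245_672),
]

def surplus(count, lemmata=LEMMATA):
    """Variants left over once each lemma has been credited with one form."""
    return count - lemmata

previous = None
for label, count in STEPS:
    removed = 0 if previous is None else previous - count
    extra = surplus(count)
    print(f"{label:<24}{count:>9,}  removed {removed:>6,}  "
          f"surplus {extra:>7,} ({extra / LEMMATA:.1%})")
    previous = count
```

The first row gives a surplus of 114,006 (66.0% of the lemma count) and the last row 73,026 (42.3%).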
So the lemma count may be an underestimate, because of different judgements on what counts as distinct; but at its most inflated, the lemma count will no more than double. In reality, I think the debatable instances are closer to 20% than to 70%. So no, not even this way are we getting to 5,000,000 lemmata.