In the last post, we did some pruning of the word form count of our corpora, and came up with some numbers. We also noted that, once you pruned away the 137 forms of ἀνήρ, you're still left with 42 forms of ἀνήρ.

(Did I say 43? I miscounted. Dangerous thing to admit, with all these numbers flying about. But you should be taking those numbers with a grain of salt anyway. As I'm going to keep saying.)

42 is a lot more than the 11 forms ἀνήρ should have, based solely on the Attic dialect. Here, we're going to look at where the remaining 31 forms came from, and what that tells us about the morphological heterogeneity of the TLG corpus. We're also going to keep pruning at those numbers we came up with last time, and see if we can arrive at something like a count of Good Reliable Attic word forms.

The Attic forms of ἀνήρ are shown today in glorious Galatia SIL:

The classicists among you will have picked up that a bunch of the remaining forms are Epic or "poetic". Another 12 of them:

The tricky proto-Greek stem, *anr-, shows up in Epic with the variant stem a(ː)nér-:
Dat    ἀνέρι    ἀνέρεσι, ἀνέρεσσι, ἀνέρσι

The multiple choices are typical of Epic: Epic is a conventional, mixed dialect, and it was handy for Epic to have multiple choices, to fit the metre that the dialect was used in. Hence the variation in the dative plural between -si(n), -esi(n), and -essi(n).
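To see how quickly those alternatives multiply the form count, here is a minimal sketch that attaches the three dative plural endings to the Epic stem ἀνέρ-, with and without the movable ν. The stem and ending strings are typed in by hand for illustration, and the function name is made up; this is not how the TLG lemmatiser generates or recognises forms.

```python
import unicodedata

# Epic stem and the three dative plural endings mentioned in the text.
STEM = "ἀνέρ"
ENDINGS = ["σι", "εσι", "εσσι"]


def dative_plurals(stem, endings, movable_nu=True):
    """Return stem + ending for each ending, optionally followed by its -ν variant."""
    forms = []
    for ending in endings:
        # Normalise so that the result compares equal to precomposed text.
        form = unicodedata.normalize("NFC", stem + ending)
        forms.append(form)
        if movable_nu:
            forms.append(form + "ν")
    return forms


for form in dative_plurals(STEM, ENDINGS):
    print(form)
```

One stem and one case already yield six spellings, which is the kind of inflation the corpus counts have to absorb.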

The lack of ἀνδρ- forms in the table, btw, doesn't mean Epic literature avoided the ἀνδρ- stem. Homer used both. It just means that we've already checked off the ἀνδρ- forms for Attic. But because Epic inflections can also appear on the ἀνδρ- stem, the Epic count also includes a fourth dative plural, ἄνδρεσσι, which we did not count under Attic.

That leaves 19. We can pick off four forms of ἀνήρ as Modern Greek:

Of course, treating Ancient ἀνήρ, ἀνδρός as the same lemma as Modern άντρας, άντρα is a bit of a leap, and it shows the problem with having a single vocabulary try to span three thousand years: there is a continuum from ἀνέρος to ἀνδρός to ἄνδρα to άντρα, but the endpoints are far apart. Still, spanning even a century in a corpus raises problems, because language is a moving object. And on the flip side, much of Greek literature—including the Epics themselves—are attics full of relics. Much like any literary language, really, just over a longer timespan. So we'll treat these as the same lemma (because the TLG has the one search engine for everything); but we'll note that this is a difficult judgement to make in general—and that it has limited synchronic reality.

A further 8 forms look Epic (both ἀνερ- and ἄνδρ- stems), but are accented further back than they would be in Epic Greek. That should make them Aeolic:

We have very, very few literary texts actually in Aeolic. Five of these eight forms do actually turn up in what literary Aeolic we have: ἄνηρ (Alcaeus, Julia Balbilla), ἄνδρος (Sappho), ἄνερος (Theocritus), ἄνδρων (Theocritus, Alcaeus), ἄνδρεσι (Alcaeus).

Of the rest, ἄνερα shows up in fragments of Euripides and Numenius, and ἄνδρασι in fragments of Diocles and Phylarchus. Scribal errors? Maybe; at any rate, there's nothing Aeolic about any of those authors.

The oddest of the eight is ἄνδρι. The form shows up in Jacoby's collection of the Fragments of Greek Historians. This collection gathers up the bits of ancient historians who were not preserved in intact books, and it gathers them from wherever it can; lots of fragments come from citations in later authors. Jacoby has ἄνδρι in a passage by the historian Ion of Chios, as cited in Athenaeus. That means the passage in question turns up twice in the corpus: once in Jacoby's edition of Ion, and once in Kaibel's edition of Athenaeus. (That kind of duplication happens quite a lot in the TLG, though it involves small bits of text, so it does not inflate the word count all that much.)

The thing is, Kaibel's edition of Athenaeus has the word as the normal ἀνδρί. Is this a typo in Jacoby? Is this an earlier version of the text of Athenaeus? Is this an emendation to Athenaeus by Jacoby, because he knows something about Ion that I don't? I don't know, and I'm not burning right now to find out. The point is that this kind of variability does happen in the corpus, and it does increase its morphological diversity more than it should.

So of the eight Aeolic forms, three don't occur in Aeolic texts, and just look like misaccentuations. But this kind of misaccentuation turns out to be routine in Byzantine Greek: in fact, it accounts for most instances in the corpus of the first five "Aeolic" forms. This misaccentuation is too frequent a feature of Byzantine Greek to be an accident or scribal whim. It is a kind of systematic hypercorrection: "I'll misaccent this word because it will sound more recherché." So it's not like Didymus the Blind or St Athanasius are aping Alcaeus specifically; they're just randomising where the accent goes, as part of their game of Greek.

We know that they aren't aping Alcaeus, because the Byzantines don't only put the accent where the Aeolians would have put it; they also put it where no one would have put it. So Byzantine misaccentuation also accounts for four forms of ἀνήρ stressed on the final syllable:

This leaves us with three last forms of ἀνήρ.

To get from Ancient ἀνήρ to modern άντρας, you need to switch from the third declension to the first declension, because the third declension was thrown out in Modern Greek as too hard. (There's some Lazarus—or Zombie—third declensions in the contemporary language, but outside of -ις, -εως plurals, people are uncomfortable with them.) This means that the Ancient nominative ἀνήρ became the Byzantine nominative ἄνδρας. That nominative does turn up in the corpus, but it's spelled identically to the Attic accusative plural ἄνδρας, so we've already crossed it off our list. It also means, though, that there is an accusative singular ἄνδραν in the corpus, which soon became Modern άντρα. So that's where that form has come from.

The second form is ἄνδραις. This is a dative plural, turning up in one church hymn, of that Byzantine first declension variant ἄνδρας. It is also an old-fashioned spelling of the Demotic accusative plural (which would now be spelled άντρες), for reasons of morphological analogy denialism that I'm not going to get into here.

The last form is ἀνρός, and it's not any form of Greek at all. It's proto-Greek: it's Herodian, reconstructing (correctly) what the original genitive of ἀνήρ must have been:
τὸ δὲ ἀνδρός κατὰ συγκοπὴν γενόμενον ἐκ τοῦ ἀνέρος ἐξ ἀνάγκης ἐπλεόνασε τὸ δ. οὐκ ἠδύνατο γὰρ εἶναι ἀνρός χωρὶς τοῦ δ, ἐπεὶ τὸ ν πρὸ τοῦ ρ οὔτε ἐν συλλήψει δύναται εἶναι οὔτε ἐν διαστάσει.
When andrós was formed by syncope [deleting a phoneme] from anéros (anéros > *anrós > andrós), /d/ was a necessary redundancy. For /n/ cannot directly precede /r/, either within a single word or between words. (Herodian De Prosodia Catholica p. 406 Lentz)

In fact, we'd say the proto-Greek was *anrós to begin with; but given the poor track record of Greek etymologists in general, Herodian gets cut plenty of slack from me.

We've just accounted for the 42 "legitimate" forms of ἀνήρ, and we can see some problems with the range of forms we've found:
  • 11 are in one dialect.
  • 12 are in a different dialect—albeit the literary dialect that almost all Classical literature draws on.
  • 5 are in a third, marginal dialect.
  • 3 look like they're in the third, marginal dialect, but are really just Byzantines making accents up. As are most instances of the previous 5.
  • 4 are also Byzantines making accents up, in the opposite direction.
  • 4 are in Modern Greek—and you can argue about the extent to which it is the same language at all.
  • 2 are Almost-Modern Greek.
  • 1 is a hypothetical reconstruction of proto-Greek by a Roman-era grammarian (and the Byzantines that copied him).
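A quick tally confirms that the breakdown accounts for all 42 forms. This is only a sketch of the arithmetic: the category labels are my own shorthand, the order follows the running text rather than the list, and the finally-stressed Byzantine group is taken as the four forms mentioned where those forms are introduced.

```python
# Counts of ἀνήρ forms per category, in the order the running text picks them off.
# Labels are shorthand; the finally-stressed group is taken as 4, following the
# paragraph that introduces those forms.
CATEGORIES = {
    "Attic": 11,
    "Epic": 12,
    "Modern Greek": 4,
    "Aeolic, attested in Aeolic texts": 5,
    "pseudo-Aeolic Byzantine accentuation": 3,
    "Byzantine final-syllable accentuation": 4,
    "Almost-Modern (Byzantine first declension)": 2,
    "grammarian's reconstruction": 1,
}

total = sum(CATEGORIES.values())
remaining = total
for name, count in CATEGORIES.items():
    remaining -= count
    print(f"{name:45s} {count:3d}   left: {remaining}")
print("total:", total)
```

The running "left" column passes through 31, 19 and 3, the same checkpoints the text mentions along the way.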

All of these forms are Greek, in one way or another. But counting all of proto-Greek *ἀνρός, Modern άντρα, Poetic Aeolic ἄνερος and Attic ἀνδρός as genitives of "man" should make you nervous. These are not all part of the same linguistic system. We can concede Epic mixed with Attic, because everyone who wrote literature had Homer in mind; literary languages are not pure and uniform langues. (Spoken languages aren't either—although a dialect with four different datives rightly makes people suspicious.) But listing ἀνρός and άντρα together... that's weighing down the scales.

We whittled down the word form count in the previous post to something more reasonable—something that wasn't lurching at every change in casing or apostrophe. But there are still oddball forms in the corpus, and it would be useful to filter out some of the more problematic word forms, to get a more accurate sense of what is going on in the language—to try to restrict the word form count to forms that might plausibly have been spoken by someone once. The lemmatiser can make some judgements about which forms are more oddball than others. It won't be infallible—after all, it thought ἄνδρι was Aeolic. But it's better than nothing, and it's what I've got at hand.

We've seen above that some forms of ἀνήρ are perversely accented, and one form is a grammarian's reconstruction. We can come up with a word form count which eliminates those four forms of ἀνήρ (though it will preserve the accidental Aeolic of the Byzantines). I'm going to filter out the following categories of word forms marked by the current lemmatiser:
  • Hypothetical forms (like *ἀνρός)
  • Hypercorrect forms (like ἀνδράς, or any number of other Byzantine hybrids and not-quite-genuine Doricisms)
  • Uncertain inflection categories (the lemmatiser has insufficient information on how a stem should be inflected)
  • The tense stem used to account for the verb form is not in the lemmatiser lexicon (so this could still be guesswork)
  • The inflection is anomalous (typically, it's the "wrong" class of inflection by conventional norms—which covers lots of confused Byzantine optatives)
  • The form is a transliteration of Latin (occurs in Legal texts)

This should give us a count of Forms of Good Standing. There's more grammatical eccentricities than that, but those are the most egregious.
Corpus                        Word Forms    Reduced
TLG + PHI #7                   1,300,717  1,267,434
TLG (viii–XVI)                 1,183,120  1,158,529
Mostly Pagan (viii–IV)           518,321    515,275
Strictly Ancient (viii–iv)       289,275    288,305

Not a massive cut, but a necessary one. Again, the Strictly Ancient corpus is better behaved overall, so there are fewer anomalies there that need culling.
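The Forms of Good Standing filter described above can be sketched roughly as follows. Everything here is hypothetical: the flag names, the `Analysis` record and the rule that a form survives if at least one of its analyses is unflagged are my illustration of the idea, not the lemmatiser's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical flag names standing in for the six excluded categories;
# the lemmatiser's real labels are not given in this post.
EXCLUDED_FLAGS = frozenset({
    "hypothetical",
    "hypercorrect",
    "uncertain_inflection",
    "unknown_tense_stem",
    "anomalous_inflection",
    "latin_transliteration",
})


@dataclass(frozen=True)
class Analysis:
    """One (hypothetical) lemmatiser analysis of a word form."""
    form: str
    lemma: str
    flags: frozenset = field(default_factory=frozenset)


def good_standing(analyses):
    """Keep every form that has at least one analysis with no excluded flag."""
    return {a.form for a in analyses if not (a.flags & EXCLUDED_FLAGS)}


SAMPLE = [
    Analysis("ἀνδρός", "ἀνήρ"),
    Analysis("ἀνρός", "ἀνήρ", frozenset({"hypothetical"})),
    Analysis("ἀνδράς", "ἀνήρ", frozenset({"hypercorrect"})),
    Analysis("ἄνερος", "ἀνήρ", frozenset({"aeolic"})),  # dialect flag: not excluded here
]

print(sorted(good_standing(SAMPLE)))
```

Note that a dialect label like the made-up `aeolic` flag does not trigger exclusion at this stage, which matches the point that the accidental Aeolic of the Byzantines is preserved by this cut.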

The next cut will be more cruel. Stems and inflections are marked for dialect and period in the lemmatiser; again, not infallibly, but indicatively enough. There's still a lot of Late forms in the corpus, including Graecobarbara. There's also lots of forms from non-literary dialects, that weren't lucky enough to have an Alcman or a Sappho.

We can filter out the Boeotian and Cretan and Locrian, and the Koine and Byzantine and Demotic, to give just forms compatible with literary Ancient Greek dialects. That's an artificial barrier, sure, but no less artificial than including them all in the same corpus to begin with; and there are plenty of people, ancient and modern, who would look approvingly on this "no riff-raff" policy. It will make the corpus somewhat more morphologically consistent: we'll at least be talking about five centuries' worth of morphology, not twenty-five.

Limiting word forms to Forms of Good Standing And Pedigree does still include lots of word forms that were only devised in the fourteenth or fifteenth century, because the ancient corpus did not exhaust all the possibilities of the ancient language(s). Moreover, proper names haven't been quarantined off from antiquity the way vocabulary proper has. So the count will be much more approximate and fuzzy than it seems. (That holds for all the counts here, of course; come back tomorrow, and the lemmatiser will give different counts.) Still, applying a No Riff-Raff constraint on the corpus—excluding post-Classical and dialectally marginal forms, and keeping just linguistically Classical forms, as the lemmatiser currently understands them—gives us:
Corpus                        Word Forms    Reduced
TLG + PHI #7                   1,267,434  1,135,915
TLG (viii–XVI)                 1,158,529  1,041,520
Mostly Pagan (viii–IV)           515,275    505,302
Strictly Ancient (viii–iv)       288,305    285,856

The final cut is the cruellest of all, and it's so cruel only a linguist would do it. Forms of Good Standing And Cecropian Pedigree; in other words, Naught but Attic. No Aeolic, no Doric, no Ionic, and—here's the killer—no Epic. That cannibalises any classical literary work there is, and it's an over-idealised notion of what was spoken in Athens: there would have been some Aristophanean peasants that spoke like that, but no educated Athenian would have. And of course in the other direction, Byzantines kept coming up with Attic-compatible words too. Still, this cruellest of all cuts will give us a word form count that describes just one dialect at a time. Subject to how much the lemmatiser knows about Greek dialect, and again, the lemmatiser is not infallible, and will never be complete.

But again, as long as you understand these numbers will be just indicative, and just illustrative, and are worth what you paid for: these are the Attic-only word form counts for the corpora:
Corpus                        Word Forms    Reduced
TLG + PHI #7                   1,135,915  1,020,232
TLG (viii–XVI)                 1,041,520    952,993
Mostly Pagan (viii–IV)           505,302    458,933
Strictly Ancient (viii–iv)       285,856    248,914

Finally got the Strictly Ancient count to budge. :-)

So we started with 1.3 million attested forms of Greek in our corpora; limiting them to forms compatible with Attic Greek takes that down to 1 million. For the Strictly Ancient corpus (when Attic was still a living dialect), that's 250,000, down from 290,000.
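To compare the three cuts across corpora, it helps to put the counts from the tables above side by side and express each stage as a share of the starting count. The sketch below does just that; the stage order (Good Standing, No Riff-Raff, Attic only) follows the post, and the function name is my own.

```python
# Word form counts from the three tables in this post, one tuple per corpus:
# (initial, after Good Standing, after No Riff-Raff, Attic only).
COUNTS = {
    "TLG + PHI #7": (1_300_717, 1_267_434, 1_135_915, 1_020_232),
    "TLG (viii–XVI)": (1_183_120, 1_158_529, 1_041_520, 952_993),
    "Mostly Pagan (viii–IV)": (518_321, 515_275, 505_302, 458_933),
    "Strictly Ancient (viii–iv)": (289_275, 288_305, 285_856, 248_914),
}


def retained(counts):
    """Fraction of the initial forms that survive each successive cut."""
    start = counts[0]
    return [n / start for n in counts[1:]]


for corpus, counts in COUNTS.items():
    shares = ", ".join(f"{share:.1%}" for share in retained(counts))
    print(f"{corpus:28s} {shares}")
```

The same caveat applies to these percentages as to the raw counts: they describe what one version of the lemmatiser made of one corpus.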

Now, what did all that prove?
  • For two millennia after Attic was no longer a living language, it remained a language of literature. There are some 890,000 post-classical wordforms, but over two thirds of them are compatible with Attic, and only a tenth of them are linguistically Late. Now, in part, that's simply because Greek did not turn into Lithuanian: there are plenty of words in Modern Greek that are compatible with Attic too—so long as you're relying on historical orthography. But if the literary language reflected the vernacular more accurately, the count would be a lot less than two thirds. So writers kept using Attic morphology productively.
  • The word form count for the corpora includes a lot of problematic forms, particularly the later we get (and the more artificial the literary language becomes). These problematic forms are part of the heritage of literary Greek; but it is misleading to include them in evidence of the productivity of Greek morphology: many of them are fantasy morphology. That said, these problematic forms are not frequent as forms (2% of TLG forms), and are even less frequent as instances (0.7% of all the words in the TLG corpus).
  • Nonetheless, cutting the morphological variety of Greek down from three millennia to a couple of centuries of Attic does make a difference: 85% of all forms in the Strictly Ancient corpus are Attic, and 80% in the full TLG corpus.
  • ... although in cultural, literary, and even sociolinguistic terms, limiting the morphology to Attic Only is an artificial thing to do. But that's weighing the scales for you: there's a reason why it happens.
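The proportions in these conclusions can be rechecked from the tables. The sketch below assumes that the forms of the Strictly Ancient corpus are a subset of the TLG forms, so that subtracting one count from the other isolates the forms first attested after the classical period; that is my reading of how the 890,000 figure arises, not something the post spells out.

```python
# Counts copied from the tables for the TLG (viii–XVI) and
# Strictly Ancient (viii–iv) corpora.
TLG_ALL, TLG_GOOD, TLG_ATTIC = 1_183_120, 1_158_529, 952_993
ANCIENT_ALL, ANCIENT_ATTIC = 289_275, 248_914

# Assumption: Strictly Ancient forms are a subset of TLG forms, so the
# difference is the set of forms attested only post-classically.
post_classical = TLG_ALL - ANCIENT_ALL
post_classical_attic = TLG_ATTIC - ANCIENT_ATTIC

print(f"post-classical forms:          {post_classical:,}")
print(f"  Attic-compatible share:      {post_classical_attic / post_classical:.1%}")
print(f"problematic share of TLG:      {(TLG_ALL - TLG_GOOD) / TLG_ALL:.1%}")
print(f"Attic share, Strictly Ancient: {ANCIENT_ATTIC / ANCIENT_ALL:.1%}")
print(f"Attic share, TLG:              {TLG_ATTIC / TLG_ALL:.1%}")
```

On that assumption the Attic-compatible share of post-classical forms comes out comfortably over two thirds, and the other figures land within a point or so of the rounded percentages quoted above.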

So should we cite Classical Greek as having just 1 million, or 250,000 word forms, instead of 1.3 million or 1.8 million? Nah, we should not be counting word forms in a corpus at all, and limiting ourselves to accidents of attestation. But we should also be aware that any corpus like this is going to have forms that are more at home or less at home. And that it all depends.

One final note. The lemmatiser, as I keep saying, is changeable and fallible: the numbers I've been giving—and which I will give in later posts, once I start counting lemmata—are transitory, indicative, and unreliable: they only tell you how far one piece of software has gotten with one lexicon and one corpus. Because the TLG lemmatiser has to do a lot more than lemmatisers normally do—coping with six dialects and three thousand years—it runs into a lot more ambiguity than is usual; and it tries to deal with that ambiguity by ranking analyses of word forms as more or less plausible. If you're using the TLG lemmatised search, you can view the word forms which the lemmatiser thinks *might* belong to the lemma you're searching for, but probably don't.

So if you search for ἀνήρ as a lemma, you'll get the 42 forms we've been talking about. In fact, you'll get 103 forms, because of all the variations in accent and crasis and apostrophes we've mentioned before (though the list is case-folded). But you can also access, by clicking Show lower confidence forms, word forms that the lemmatiser thinks might be, but probably aren't, instances of ἀνήρ. As of this writing, that list includes:
  • κἄνδρος (1) (More probable lemma: Ἄνδρος)
  • ἄνδρου (56) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • ἀνδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • ανδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • αντρον (1) (More probable lemma: ἄντρον)
  • ἄντρου (126) (More probable lemma: ἄντρον)
  • αντρου (1) (More probable lemma: ἄντρον)
  • ἄντρ’ (4) (More probable lemma: ἄντρον)
  • ἄνδρους (1) (More probable lemma: ἀνδρόω)

For the most part, the lemmatiser is correct in dismissing these analyses: ἄντρ’ is not a Demotic analysis, but a Euripidean mention of "cave", and ἄνδρου (Ἄνδρου) refers only to the island of Andros.

But the lemmatiser is fallible, and it has slipped up with ἄνδρους. (Yes, I'm fixing the analysis now.) The lemmatiser had the alternatives of treating this as a Byzantine attempt at "thou wert manning" (with no augment, so the Byzantines would be play-acting at Homer); or a Demotic accusative of "men", in the completely wrong declension. The lemmatiser decided the Demotic wrong declension was even more absurd than the Byzantine play-acting. As it happens, it's wrong: this is a Demotic wrong declension after all (in the notoriously patchy vernacular Historia Imperatorum). So the next time the lemmatised search engine is updated at the TLG, there'll be 43 simple forms of ἀνήρ after all.
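The low-confidence list above has ἄνδρου, ἀνδρου and ανδρου as separate entries that differ only in diacritics. As a rough illustration of how much of the form count is down to accentuation alone, here is a minimal sketch that folds forms by stripping accents and breathings and lower-casing. It relies only on Python's standard `unicodedata` module; the `fold` helper is my own, and nothing here is claimed to be what the TLG search engine does internally.

```python
import unicodedata
from collections import defaultdict


def fold(form):
    """Strip accents and breathings from a Greek word form and lower-case it."""
    decomposed = unicodedata.normalize("NFD", form)
    # combining() is non-zero for accents, breathings, iota subscript, etc.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


# A few forms from the low-confidence list above.
FORMS = ["ἄνδρου", "ἀνδρου", "ανδρου", "ἄντρου", "αντρου"]

groups = defaultdict(list)
for form in FORMS:
    groups[fold(form)].append(form)

for key, members in groups.items():
    print(key, "<-", ", ".join(members))
```

Folding this aggressively would also merge forms that the discussion above keeps apart precisely because of where the accent falls, so it is useful for spotting spelling noise, not for counting forms.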

Once more with feeling: don't take the numbers too seriously. (As if the preceding posts didn't argue that ad nauseam already.) Just use them to get an order of magnitude sense of what's going on with Greek.


