2009-06-18

Lerna VIa: For Zeus' Sake, How Many Words?

[Counts in this post have been corrected in Lerna VId]

At long last, after nine posts of teasing, will I finally give the punters a count of lemmata of Greek?

Why yes. Yes I will. And then for a change, I will also set to work inflating it, to extrapolate from the current corpus and lexicon I have access to, to how much larger it could conceivably get.

Ready for it? The count of lemmata, known to date to the TLG lemmatiser, and recognised in the four corpora we've set up as they stand to date, is...
Corpus                        Lemmata    Excluding Greek Numerals    Excluding Proper Names
TLG + PHI #7                  216,234    211,794                     175,791
TLG (viii–XVI)                201,680    197,448                     162,219
Mostly Pagan (viii–IV)        99,426     98,593                      76,145
Strictly Ancient (viii–iv)    66,437     66,078                      55,003

1. Lemma counts

Did I have to quibble even here? Why, of course I did. The lemmatiser makes sense of Milesian numerals like αϠοα = 1971 and χξϛ = 666, but including them in the vocabulary of Greek as lemmata is a bit much. And dictionaries do not include proper names. So if you're going to compare the headword count in LSJ with the headword count in OED, you won't be including proper names in your count. Proper names are not exactly open-ended in count, but they do work differently from core vocabulary, they come from a lot more sources and cultures, and knowing lots of people doesn't really prove your vocabulary is more expressive.

So if we're comparing the TLG corpus to the OED's, we'll say 162,000. OTOH, adding proper names is most of my fun with the TLG lemmatiser: Athenian courtesans one minute, Albanian chieftains the next. In terms of the lemmatised search engine, they are search targets like any other. So if we're not comparing the TLG corpus to any dictionaries, we'll say 202,000 lemmata.
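To put round numbers on those quibbles, here is a minimal sketch over the lemma counts given above. It assumes the columns are cumulative (that is, the "Excluding Proper Names" column also excludes numerals), which the table does not state outright; the names `TABLE_1` and `breakdown` are mine, not the lemmatiser's.

```python
# Lemma counts: (all lemmata, excluding numerals, excluding proper names).
# Assumption: the third figure excludes numerals as well as proper names.
TABLE_1 = {
    "TLG + PHI #7": (216_234, 211_794, 175_791),
    "TLG": (201_680, 197_448, 162_219),
    "Mostly Pagan": (99_426, 98_593, 76_145),
    "Strictly Ancient": (66_437, 66_078, 55_003),
}

def breakdown(row):
    """Return (numeral lemmata, proper-name lemmata) implied by one row."""
    total, no_numerals, no_names = row
    return total - no_numerals, no_numerals - no_names

for corpus, row in TABLE_1.items():
    numerals, names = breakdown(row)
    print(f"{corpus}: {numerals:,} numerals, {names:,} proper names")
```

On this reading, proper names make up roughly 35,000 of the TLG's 202,000 lemmata, which is the gap between the two headline figures.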

Not So Fast!

Only, if you'd asked me two years ago, I'd have told you 231,000. And that was when I was recognising 90% of all word forms, as opposed to now, when I'm recognising close to 94%. Does that mean the Greek language has lost 25,000 lemmata in the intervening two years, even as the lemmatiser now recognises 60,000 more word forms? No. The lemmatiser has just gotten more discerning about when it claims a new lemma has shown up.

There is a lot of ambiguity in dealing with three thousand years and six dialects of Greek, and incomplete dictionaries. The lemmatiser has been allowed to make up its own lemmata (more on this below); it does this to cover gaps in dictionaries, whether they've gotten to pi or omega. But in the past two years, the lemmatiser has been constrained to make up a new lemma only if it doesn't have a "legitimate" alternative, already recorded in a dictionary.

The lemmatiser has also gotten better at conflating variant stems under the one lemma. That's a huge issue, which will have to wait for a couple of posts: the number of stems I count as distinct lemmata is in several ways different to the number LSJ counts as being distinct lemmata. There are 216,000 lemmata in the overall corpus, following the lexical database's definition of when two stems belong to different lemmata. But its definition will be different to someone else's definition. As we'll see, it's not always clear from the dictionaries when they consider two lemmata distinct, or whether they should when they do.

All this should make you distrustful of any lemma counts I give you. As well it should. Counting lemmata is an artefact of how you set about counting lemmata; different criteria, and different methods of analysis, will give different results. As will different sizes of corpora. So in this and the next couple of posts I'm going to stretch the lemma count, then shrink it, then stretch it again. (And then I'll bring this thread to an end; there's Mariupolitan dialect waiting to be blogged about.)

Fuzzy Boundaries

To begin stretching, I'm going to allow for the uncertainty of lemmatisation. The TLG lemmatiser, confronted with much too much ambiguity, ranks potential analyses of word forms as belonging to different lemmata. The counts I've just given are for the "winning" lemmata for each word form. If I include the "also-rans" in the analyses, then I'll also be counting lemmata which never give the preferred analysis for any word form—but which the lemmatiser keeps in reserve, in case one of them turns out to be correct after all. You will get search results if you look up those lemmata in the TLG. You will also get lots of warning, saying "but this form is probably X instead, and that form is probably Y instead."
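The difference between counting winners and counting also-rans can be shown with a toy data structure. This is only a sketch: the forms and lemma labels are placeholders, and nothing here reflects how the TLG lemmatiser actually stores or ranks its analyses.

```python
# Toy ranked analyses: each word form maps to candidate lemmata, best first.
# The forms and lemma labels are placeholders, not real TLG output.
ANALYSES = {
    "form1": ["lemmaA", "lemmaB"],
    "form2": ["lemmaA"],
    "form3": ["lemmaC", "lemmaB"],
}

def winning_lemmata(analyses):
    """Lemmata that are the top-ranked analysis of at least one form."""
    return {candidates[0] for candidates in analyses.values() if candidates}

def all_lemmata(analyses):
    """Lemmata that appear anywhere in the rankings, also-rans included."""
    return {lemma for candidates in analyses.values() for lemma in candidates}

winners = winning_lemmata(ANALYSES)
also_rans_only = all_lemmata(ANALYSES) - winners
print(len(winners), sorted(also_rans_only))
```

Here `lemmaB` never wins any form, so it is missing from the first kind of count and present in the second.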

If I include these also-rans in the counts, the counts go up to:
Corpus                        Lemmata
TLG + PHI #7                  220,560
TLG (viii–XVI)                206,161
Mostly Pagan (viii–IV)        107,257
Strictly Ancient (viii–iv)    73,427

2. Lemma counts, including also-ran analyses

This gives a curious result. There are lemmata which the Strictly Ancient corpus rejects as implausible for all its forms; but somewhere in the subsequent mediaeval morass, a word form turns up for which the rejected lemma makes the most sense after all. Of course, a lot of those lemmata rejected for Strictly Ancient Greek will turn out to be Byzantine after all. So it makes sense they'd become more acceptable, once bona fide Byzantine texts are included.
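The size of that effect can be read off the two sets of counts. A minimal sketch, using only the figures already given (winning analyses versus winners plus also-rans); the function name is mine.

```python
# (winning analyses only, winners plus also-rans), from the two count tables.
COUNTS = {
    "TLG + PHI #7": (216_234, 220_560),
    "TLG": (201_680, 206_161),
    "Mostly Pagan": (99_426, 107_257),
    "Strictly Ancient": (66_437, 73_427),
}

def also_ran_gain(winners, with_also_rans):
    """Extra lemmata gained by counting also-rans, and the gain as a share."""
    extra = with_also_rans - winners
    return extra, extra / winners

for corpus, (winners, with_also_rans) in COUNTS.items():
    extra, share = also_ran_gain(winners, with_also_rans)
    print(f"{corpus}: +{extra:,} ({share:.1%})")
```

The two smaller corpora gain the most (roughly 7,000 to 7,800 lemmata each), while the full TLG gains under 4,500: many of the lemmata that are only also-rans in the ancient texts become winners once the Byzantine texts are in.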

More dictionaries

There are two constraints on what a lemmatiser recognises: how many words it knows about, and how many words you're asking it to recognise. Increase the corpus—as we did by excluding and including Byzantine texts—and it will find more lemmata. Give it more dictionaries—as the TLG did by adding Lampe, Trapp, and Kriaras—and it will also recognise more lemmata. So these counts could be bigger with more texts (and more texts are being added), and more dictionaries.

You can increase your dictionary size by allowing the lemmatiser to do what human beings do: make words up. You can make words up from whole cloth, as a random but plausible sequence of sounds. But that's fairly rare in human language. What is much more usual is making up words based on existing words, using rules present in the language (derivational morphology). The lemmatiser does know something about Greek derivational morphology: as a result, the TLG counts include some 15,000 lemmata that are not in its dictionaries, but are derived from lemmata that are. Two thirds of these are from prefixing prepositions to verbs, which is quite productive in Greek. One third is from derivational morphology forming new stems through suffixes, and those proposed analyses are more tentative.
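As an illustration of the preposition-plus-verb case, here is a deliberately naive sketch. The verb list, the prefix list and the function names are all mine, chosen for the example; the real lemmatiser's rules are not described in this post and will certainly be more involved.

```python
import unicodedata

def nfc(text):
    """Normalise to NFC so visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)

# Toy data, chosen for illustration; not the TLG's lexicon or rule set.
KNOWN_VERBS = {nfc(v) for v in ["γράφω", "λύω", "βάλλω"]}
PREFIXES = [nfc(p) for p in ["κατα", "ἀπο", "περι"]]

def derive(form):
    """Return (prefix, known verb) if form is prefix + known verb, else None.

    Deliberately naive: it ignores elision, assimilation and accent shifts,
    all of which real Greek compounding involves.
    """
    form = nfc(form)
    if form in KNOWN_VERBS:
        return None  # already in the dictionary, nothing to derive
    for prefix in PREFIXES:
        rest = form[len(prefix):]
        if form.startswith(prefix) and rest in KNOWN_VERBS:
            return prefix, rest
    return None

print(derive("καταγράφω"))
print(derive("περιβάλλω"))
```

A form that matches neither the dictionary nor any prefix-plus-verb pattern returns `None`, which corresponds to the unrecognised residue discussed next.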

But derivational morphology will only catch some words. Otherwise, if a lemma is unrecorded in the dictionaries that the lemmatiser has access to, then that lemma won't show up in the list of lemmata recognised: you have to tell the lemmatiser that πιθανοθηρία exists, for it to make sense of πιθανοθηρίας. If you give the lemmatiser access to more dictionaries, it will recognise more words.

Much of the TLG is Byzantine, and more of the TLG is going to be Early Modern; so the fact that the dictionaries of both stages of the language are currently stuck at pi means there are future volumes of those dictionaries that the lemmatiser hasn't been given access to yet (because they don't exist). That means a lot of lemmata in the corpus going unrecognised, and being missed in these counts.

My back of an envelope can beat your back of an envelope

How many? Here start the back of the envelope calculations: take them with several satchels of salt. Trapp has finished six volumes out of a projected eight, and adding Trapp's lemmata to the lemmatiser has accounted for 25,000 lemmata newly recognised in the TLG. The remaining two volumes (to be completed by 2013) should add another 8,000 lemmata to the TLG. So 202,000 will go to 210,000—give or take a couple of thousand. Remember those 15,000 lemmata the lemmatiser is recognising, even though they're in no dictionaries? Some of them will turn up in the forthcoming volumes of Trapp.

The same holds for Kriaras. With the 2.5 million words of Early Modern Greek in the TLG, adding Kriaras' dictionary to the lemmatiser accounts for 2650 lemmata. Since Kriaras is up to 15 volumes of a projected 19, that would mean we're owed another 550 lemmata. Except, the TLG is not going to stay at 2.5 million words of Early Modern Greek: it's expanding deliberately into the Early Modern corpus, and there'll be a lot more lemmata added to the TLG as it does so. The 15 volumes of Kriaras have added 9900 lemmata to the lexical database, so with the 2650 already seen in the corpus, we can expect another 9900 lemmata once the entire Early Modern corpus is in and Kriaras is completed. That takes us to 220,000 lemmata.
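The working behind the 8,000 and the 9,900 is not shown, but a straight linear extrapolation by volume count lands close to both. The sketch below assumes every volume contributes equally, which is my simplification and nothing more than envelope arithmetic.

```python
def remaining_by_volume(seen_so_far, volumes_done, volumes_planned):
    """Linear extrapolation: lemmata still owed by the unpublished volumes."""
    per_volume = seen_so_far / volumes_done
    return per_volume * (volumes_planned - volumes_done)

# Trapp: 25,000 lemmata recognised from 6 of 8 projected volumes.
trapp_more = remaining_by_volume(25_000, 6, 8)

# Kriaras: 9,900 lemmata in the lexical database from 15 of 19 volumes;
# 2,650 of them are already attested in the current TLG corpus.
kriaras_total = 9_900 + remaining_by_volume(9_900, 15, 19)
kriaras_more = kriaras_total - 2_650

print(round(trapp_more), round(kriaras_more))
# Running total, rounding each step to the nearest thousand.
print(202_000 + round(trapp_more, -3) + round(kriaras_more, -3))
```

Rounded to the nearest thousand at each step, this goes from 202,000 to 210,000 to 220,000, the same stepping stones as in the text.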

In fact, because we haven't run out of Byzantine learnèd texts to data enter into the TLG corpus either, 8,000 more lemmata from Trapp is an underestimate. There are 35,000 headwords in Trapp that the TLG does not already recognise from its other dictionaries; but the current corpus only accounts for 25,000 of them. There's some spelling variation and derivational morphology clouding the results, but all up, and assuming headwords and lemmata are the same (which they're not), we should expect not 8,000 more lemmata once all the texts are in (and all the spelling variation accommodated for), but 20,000. That takes us to 232,000 lemmata.
"A headword is not a lemma"?! But the definition of "lemma" *is* "headword"! I'm being a little idiosyncratic in my usage: the source dictionaries each have their own headwords, but I'm calling "lemma" the canonical lexeme that the lemmatiser uses, to bind those headwords and variants together—in case it merges two different headwords from its sources into the one lexeme. In other words, I distinguish dictionaries' print "headwords" from the lemmatiser database's "lemmata".
Even with Strictly Ancient Literary Greek, there are word forms that the dictionaries are missing, because editions have changed, or new fragments have turned up, or lexicographers (who often copied each other) had a blind spot. The DGE do use the TLG to compile their list of words, so there's a good chance they'll catch the gaps in the TLG corpus at least. But the corpora are fluid, for reasons we've already discussed. So πιθανοθηρία "hunting for possibilities" turns up as a variant reading in Plato's Sophist; because it's a variant reading, lexicographers have not been panicked to include it. The dictionaries don't (yet) register anything that deals with ὀκταχοίνικον "eight choinixes heavy" in Aristophanes, or πεντηκοντακισμυρίους "five hundred thousand" in Polybius. (That's the other blind spot of lexicographers: numbers are boring.)

The count of missing lemmata won't be massive for antiquity: there are currently just 1,600 word forms unrecognised in the Strictly Ancient corpus, and from inspection, skipping proper names and geometric lines, there'll be a lot less than 500 lemmata's worth there. Still, 500 is 500; and it's thousands, not hundreds, of skipped lemmata moving forwards from Strictly Ancient to Mostly Pagan. Moreover, there are a couple of late ancient texts that, once added, will give us a lot of lemmata: we're owed 1000 lemmata from the old Latin-Greek dictionaries, a couple of hundred from the Hexapla.

PHI #7 has even more treasures to unlock—although it'll have to be someone else's job to do the unlocking. Remember that Vol I of the DGE in the second edition has 3500 lemmata not in LSJ, from α to ἀλλά; extrapolating, that would mean around 80,000 lemmata all up not in LSJ, if and when DGE finishes. How many of those lemmata are already going to be in Bauer, Lampe, and Trapp? My guess is, a fair few; DGE is nowhere near as Pagan-centric as LSJ.

But there are also lemmata specific to the inscriptions and papyri, and not recorded elsewhere. The lemmatiser failed to deal with 40% of the 300,000 word forms unique to PHI #7; and that corpus is not the latest and greatest in papyrology and epigraphy. With 6 word forms per lemma in the Strictly Ancient corpus (which has about the same number of forms), that could mean 20,000 more lemmata unaccounted for. I'm sure that maths is completely wrong, but let's pretend it's not: that would mean that, if Kriaras, Trapp, and DGE were completed and added to the TLG lemmatiser, the overall lemma count for PHI #7 and TLG would go from 216,000 to 268,000. Give or take several thousand and more and more satchels of salt.
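For what it is worth, the 20,000 comes straight out of the three figures quoted in that paragraph. A sketch of the envelope arithmetic, with variable names of my own choosing:

```python
unique_forms = 300_000      # word forms unique to PHI #7
unrecognised_percent = 40   # share the lemmatiser failed to deal with
forms_per_lemma = 6         # ratio borrowed from the Strictly Ancient corpus

def missing_lemmata(forms, percent, per_lemma):
    """Unrecognised forms divided by an assumed forms-per-lemma ratio."""
    unrecognised_forms = forms * percent // 100
    return unrecognised_forms // per_lemma

print(missing_lemmata(unique_forms, unrecognised_percent, forms_per_lemma))
```

The result is only as good as the assumption that inscriptions and papyri inflect their lemmata as richly as literary texts do, which is exactly the kind of thing the satchels of salt are for.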

Keep going? The Triantaphyllidis dictionary of Modern Greek (the contemporary language) has around 47,000 headwords. There'll be substantial overlap with LSJ, let alone with Kriaras. Maybe 10,000 more? Maybe 15,000? And there's the dialects of Modern Greek to count too. It's starting to look like all the lemmata of any form of Greek ever spoken anywhere will be around 300,000.

And is that count meaningful? Of course not. Modern Greek is borrowing words from English all the time. (Some of them, it even spells with Greek characters.) And if I did a lemma count for every Italic language spoken over the same time span in the general area of the Apennine Peninsula—from Iron Age South Picene through to modern Bulgnais, via Latin, Standard Italian, and a lot more than six literary dialects in between—then I'd probably get more than 300,000. And of course, any lemma count spanning more than a century is not linguistically kosher anyway. But hey, people ask.

(Bulgnais? The dialect of Bologna. Sounds like Provençal with a head cold. See description in Bulgnais.)

Back to LSJ

You may have noticed that I was only half-heartedly using counts of dictionary headwords in all of this, even though dictionary writers have a bigger corpus than I do, and a more authoritative sense of what should count as a headword. Mapping headwords to lemmata is more complicated than you might think, especially if you're spanning millennia.

For instance, are ἀνήρ and άντρας the same lemma? We've decided they are. But the headwords look completely different, because they are two thousand years apart. And this doesn't just happen going from Ancient to Modern Greek; it happens from Ancient to Byzantine Greek, and even between Ancient dialects. So there will be fewer lemmata than headwords. On the flip side, some variants in the printed dictionaries are being treated as separate lemmata, because there's no consistent indication when they're variants and when they're more loosely related forms.
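In data-structure terms, the mapping is many-to-one. A toy sketch, with a made-up three-entry table (the first two entries follow the decision just described; the third is an unrelated filler):

```python
# Toy headword-to-lemma map; not an extract from the lexical database.
HEADWORD_TO_LEMMA = {
    "ἀνήρ": "ἀνήρ",     # Ancient headword
    "άντρας": "ἀνήρ",   # Modern headword, filed under the same lemma
    "γυνή": "γυνή",     # filler entry with a one-to-one mapping
}

headwords = len(HEADWORD_TO_LEMMA)
lemmata = len(set(HEADWORD_TO_LEMMA.values()))
print(headwords, lemmata)
```

Every merge like the first two rows pulls the lemma count below the headword count; every variant that gets its own row with its own lemma pushes it back up.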

For the record, I'll note that the 1940 LSJ has 122,000 headwords by my reckoning, and 6,000 of them are cross-references; that leaves 116,000 headwords. The LSJ supplement of 1996 has an additional 10,000 headwords, which makes it 126,000. These headwords span the Mostly Pagan corpus (75,000 lemmata that aren't names or numbers), plus the inscriptions and papyri (at least 10,000 more lemmata), plus a bunch of technical vocabulary that I'd skipped in excluding Galen and his fellows. If I put Galen and his fellow technical writers back in, but I still leave out the Church Fathers (like LSJ does), and I also add in the papyri and inscriptions from PHI #7 (but not the "Christian Empire" inscriptions), then I get a corpus pretty much like LSJ's corpus. And my lemma count for that corpus, leaving out numbers and names, is 98,333.

98,333 isn't 126,000; but then, lemmata aren't the same as headwords, some LSJ lemmata have disappeared with new editions, we're missing a few late ancient texts, the lemmatiser is really struggling with understanding inscriptions. In addition, the lexica of Hesychius and Photius alone, which are outside the Mostly Pagan corpus (but document older stages of Greek) account for close to 9000 headwords in LSJ, and well over 1000 for the LSJ Supplement. With them taken out, 98,333 is certainly in the same neighbourhood as 116,000.
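The arithmetic in the last two paragraphs can be laid out as follows. Treating "close to 9000" and "well over 1000" as 9,000 and 1,000 is my rounding, and reading the final 116,000 as 126,000 minus those two figures is my interpretation of the comparison.

```python
lsj_1940 = 122_000         # headwords in the 1940 LSJ, by the post's reckoning
cross_refs = 6_000         # of which cross-references
supplement_1996 = 10_000   # additional headwords in the 1996 Supplement
hesychius_photius = 9_000 + 1_000  # rounded: LSJ proper plus Supplement
corpus_lemmata = 98_333    # lemma count for the LSJ-like corpus

lsj_headwords = lsj_1940 - cross_refs + supplement_1996
comparable = lsj_headwords - hesychius_photius
print(lsj_headwords, comparable, f"{corpus_lemmata / comparable:.0%}")
```

On those roundings the lemma count covers about 85% of the comparable headword count, which is what "the same neighbourhood" amounts to.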

I'll have more to say about headword-to-lemma mapping in a couple of posts, anyway.

For Zeus' Sake, How many lemmata of Greek?

So. With the evidence available to me right now, with the current status of the TLG corpus and lemmatiser, and my satchels of salt, and OK, OK, enough already...
  • How Many Ever? Depending on how long a time window you put, and whether names count or not, anything from 55,000 to 300,000.
  • How many lemmata of Ancient Greek? Depends again, but if names don't count, and we allow Synesius and not St Athanasius—i.e. LSJ's Pagan-centric definition of Ancient Greek—then 98,000 is a reasonable first guess. Though the way DGE is going, and with new material showing up in the sands of Egypt, add maybe 20,000 lemmata to that.
  • How many lemmata of Literary Ancient Greek? Depends for a third time, but if time didn't stop with Aristotle but Synesius (and names still don't count), 75,000. If time did stop with Aristotle, 55,000.
  • If names count, Ancient Greek goes up to 124,000. Literary Ancient Greek goes up to 99,000. Homer-to-Aristotle Greek goes up to 66,000.

No, that's not a straight answer. You want a straight answer, don't do lexicography. And in the next couple of posts, I'll complicate the answer more again.

2009-06-15

Lerna Vb: Forms of Good Pedigree

[Counts in this post have been corrected in Lerna VIc]

In the last post, we did some pruning of the word form count of our corpora, and came up with some numbers. We also noted that, once you pruned away the 137 forms of ἀνήρ, you're still left with 42 forms of ἀνήρ.

(Did I say 43? I miscounted. Dangerous thing to admit, with all these numbers flying about. But you should be taking those numbers with a grain of salt anyway. As I'm going to keep saying.)

42 is a lot more than the 11 forms ἀνήρ should have, based solely on the Attic dialect. Here, we're going to look at where the remaining 31 forms came from, and what that tells us about the morphological heterogeneity of the TLG corpus. We're also going to keep pruning at those numbers we came up with last time, and see if we can arrive at something like a count of Good Reliable Attic word forms.

The Attic forms of ἀνήρ are shown today in glorious Galatia SIL:



The classicists among you will have picked up that a bunch of the remaining forms are Epic or "poetic". Another 12 of them:



The tricky proto-Greek stem, *anr-, shows up in Epic with the variant stem a(ː)nér-:
       Sg        Du       Pl
Nom    -         ἀνέρε    ἀνέρες
Gen    ἀνέρος    -        ἀνέρων
Dat    ἀνέρι     -        ἀνέρεσι, ἀνέρεσσι, ἀνέρσι
Acc    ἀνέρα     -        ἀνέρας
Voc    ἆνερ      -        -

The multiple choices are typical of Epic: Epic is a conventional, mixed dialect, and it was handy for Epic to have multiple choices, to fit the metre that the dialect was used in. Hence the variation in the dative plural between -si(n), -esi(n), and -essi(n).

The lack of ἀνδρ- forms in the table, btw, doesn't mean Epic literature avoided the ἀνδρ- stem. Homer used both. It just means that we've already checked off the ἀνδρ- forms for Attic. But because Epic inflections can also appear on the ἀνδρ- stem, the Epic count also includes a fourth dative plural, ἄνδρεσσι, which we did not count under Attic.

That leaves 19. We can pick off four forms of ἀνήρ as Modern Greek:



Of course, treating Ancient ἀνήρ, ἀνδρός as the same lemma as Modern άντρας, άντρα is a bit of a leap, and it shows the problem with having a single vocabulary try to span three thousand years: there is a continuum from ἀνέρος to ἀνδρός to ἄνδρα to άντρα, but the endpoints are far apart. Still, spanning even a century in a corpus raises problems, because language is a moving object. And on the flip side, much of Greek literature—including the Epics themselves—are attics full of relics. Much like any literary language, really, just over a longer timespan. So we'll treat these as the same lemma (because the TLG has the one search engine for everything); but we'll note that this is a difficult judgement to make in general—and that it has limited synchronic reality.

A further 8 forms look Epic (both ἀνερ- and ἄνδρ- stems), but are accented further back than they would be in Epic Greek. That should make them Aeolic:



We have very, very few literary texts actually in Aeolic. Five of these eight forms do actually turn up in what literary Aeolic we have: ἄνηρ (Alcaeus, Julia Balbilla), ἄνδρος (Sappho), ἄνερος (Theocritus), ἄνδρων (Theocritus, Alcaeus), ἄνδρεσι (Alcaeus).

Of the rest, ἄνερα shows up in fragments of Euripides and Numenius, and ἄνδρασι in fragments of Diocles and Phylarchus. Scribal errors? Maybe; at any rate, there's nothing Aeolic about any of those authors.

The oddest of the eight is ἄνδρι. The form shows up in Jacoby's collection of the Fragments of Greek Historians. This collection gathers up the bits of ancient historians who were not preserved in intact books, and it gathers them from wherever it can; lots of fragments come from citations in later authors. Jacoby has ἄνδρι in a passage by the historian Ion of Chios, as cited in Athenaeus. That means the passage in question turns up twice in the corpus: once in Jacoby's edition of Ion, and once in Kaibel's edition of Athenaeus. (That kind of duplication happens quite a lot in the TLG, though it involves small bits of text, so it does not inflate the word count all that much.)

The thing is, Kaibel's edition of Athenaeus has the word as the normal ἀνδρί. Is this a typo in Jacoby? Is this an earlier version of the text of Athenaeus? Is this an emendation to Athenaeus by Jacoby, because he knows something about Ion that I don't? I don't know, and I'm not burning right now to find out. The point is that this kind of variability does happen in the corpus, and it does increase its morphological diversity more than it should.

So of the eight Aeolic forms, three don't occur in Aeolic texts, and just look like misaccentuations. But this kind of misaccentuation turns out to be routine in Byzantine Greek: in fact, it accounts for most instances in the corpus of the first five "Aeolic" forms. This misaccentuation is too frequent a feature of Byzantine Greek to be an accident or scribal whim. It is a kind of systematic hypercorrection: "I'll misaccent this word because it will sound more recherché." So it's not like Didymus the Blind or St Athanasius are aping Alcaeus specifically; they're just randomising where the accent goes, as part of their game of Greek.

We know that they aren't aping Alcaeus, because the Byzantines don't only put the accent where the Aeolians would have put it; they also put it where no one would have put it. So Byzantine misaccentuation also accounts for four forms of ἀνήρ stressed on the final syllable:



This leaves us with three last forms of ἀνήρ.



To get from Ancient ἀνήρ to modern άντρας, you need to switch from the third declension to the first declension, because the third declension was thrown out in Modern Greek as too hard. (There's some Lazarus—or Zombie—third declensions in the contemporary language, but outside of -ις, -εως plurals, people are uncomfortable with them.) This means that the Ancient nominative ἀνήρ became the Byzantine nominative ἄνδρας. That nominative does turn up in the corpus, but it's spelled identically to the Attic accusative plural ἄνδρας, so we've already crossed it off our list. It also means, though, that there is an accusative singular ἄνδραν in the corpus, which soon became Modern άντρα. So that's where that form has come from.

The second form is ἄνδραις. This is a dative plural, turning up in one church hymn, of that Byzantine first declension variant ἄνδρας. It is also an old-fashioned spelling of the Demotic accusative plural (which would now be spelled άντρες), for reasons of morphological analogy denialism that I'm not going to get into here.

The last form is ἀνρός, and it's not any form of Greek at all. It's proto-Greek: it's Herodian, reconstructing (correctly) what the original genitive of ἀνήρ must have been:
τὸ δὲ ἀνδρός κατὰ συγκοπὴν γενόμενον ἐκ τοῦ ἀνέρος ἐξ ἀνάγκης ἐπλεόνασε τὸ δ. οὐκ ἠδύνατο γὰρ εἶναι ἀνρός χωρὶς τοῦ δ, ἐπεὶ τὸ ν πρὸ τοῦ ρ οὔτε ἐν συλλήψει δύναται εἶναι οὔτε ἐν διαστάσει.
When andrós was formed by syncope [deleting a phoneme] from anéros (anéros > *anrós > andrós), /d/ was a necessary redundancy. For /n/ cannot directly precede /r/, either within a single word or between words. (Herodian De Prosodia Catholica p. 406 Lentz)

In fact, we'd say the proto-Greek was *anrós to begin with; but given the poor track record of Greek etymologists in general, Herodian gets cut plenty of slack from me.

We've just accounted for the 42 "legitimate" forms of ἀνήρ, and we can see some problems with the range of forms we've found:
  • 11 are in one dialect.
  • 12 are in a different dialect—albeit the literary dialect that almost all Classical literature draws on.
  • 5 are in a third, marginal dialect
  • 3 look like they're in the third, marginal dialect, but are really just Byzantines making accents up. As are most instances of the previous 5.
  • 4 are also Byzantines making accents up, in the opposite direction.
  • 4 are in Modern Greek—and you can argue about the extent to which it is the same language at all.
  • 2 are Almost-Modern Greek
  • 1 is a hypothetical reconstruction of proto-Greek by a Roman-era grammarian (and the Byzantines that copied him).
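As a sanity check on the bookkeeping, the tallies from the running text (11 Attic, 12 Epic, 4 Modern, 8 with Aeolic-looking accent, 4 stressed on the final syllable, and the 3 last forms) can be added up. The category labels are mine.

```python
# Tallies taken from the walk-through of the forms of ἀνήρ in the running text.
FORMS_OF_ANER = {
    "Attic": 11,
    "Epic": 12,
    "Modern Greek": 4,
    "Aeolic accent, attested in Aeolic texts": 5,
    "Aeolic accent, not in Aeolic texts": 3,
    "Accent on the final syllable": 4,
    "Byzantine first declension": 2,   # ἄνδραν, ἄνδραις
    "Herodian's reconstruction": 1,    # ἀνρός
}

total = sum(FORMS_OF_ANER.values())
non_attic = total - FORMS_OF_ANER["Attic"]
print(total, non_attic)
```

The categories add up to the 42 forms we started with, 31 of them outside Attic.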

All of these forms are Greek, in one way or another. But counting all of proto-Greek *ἀνρός, Modern άντρα, Poetic Aeolic ἄνερος and Attic ἀνδρός as genitives of "man" should make you nervous. These are not all part of the same linguistic system. We can concede Epic mixed with Attic, because everyone who wrote literature had Homer in mind; literary languages are not pure and uniform langues. (Spoken languages aren't either—although a dialect with four different datives rightly makes people suspicious.) But listing ἀνρός and άντρα together... that's weighing down the scales.

We whittled down the word form count in the previous post to something more reasonable—something that wasn't lurching at every change in casing or apostrophe. But there are still oddball forms in the corpus, and it would be useful to filter out some of the more problematic word forms, to get a more accurate sense of what is going on in the language—to try to restrict the word form count to forms that might plausibly have been spoken by someone once. The lemmatiser can make some judgements about which forms are more oddball than others. It won't be infallible—after all, it thought ἄνδρι was Aeolic. But it's better than nothing, and it's what I've got at hand.

We've seen above that some forms of ἀνήρ are perversely accented, and one form is a grammarian's reconstruction. We can come up with a word form count which eliminates those four forms of ἀνήρ (though it will preserve the accidental Aeolic of the Byzantines). I'm going to filter out the following categories of word forms marked by the current lemmatiser:
  • Hypothetical forms (like *ἀνρός)
  • Hypercorrect forms (like ἀνδράς, or any number of other Byzantine hybrids and not-quite-genuine Doricisms)
  • Uncertain inflection categories (the lemmatiser has insufficient information on how a stem should be inflected)
  • The tense stem used to account for the verb form is not in the lemmatiser lexicon (so this could still be guesswork)
  • The inflection is anomalous (typically, it's the "wrong" class of inflection by conventional norms—which covers lots of confused Byzantine optatives)
  • The form is a transliteration of Latin (occurs in Legal texts)

This should give us a count of Forms of Good Standing. There's more grammatical eccentricities than that, but those are the most egregious.
Corpus                        Word Forms    Reduced
TLG + PHI #7                  1,300,717     1,267,434
TLG (viii–XVI)                1,183,120     1,158,529
Mostly Pagan (viii–IV)        518,321       515,275
Strictly Ancient (viii–iv)    289,275       288,305

Not a massive cut, but a necessary one. Again, the Strictly Ancient corpus is better behaved overall, so there are fewer anomalies there that need culling.
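The Forms of Good Standing filter amounts to dropping any form that carries one of the listed markings. A minimal sketch, assuming each form carries a set of flags; the flag names and the `WordForm` class are hypothetical labels of mine, not the lemmatiser's own.

```python
from dataclasses import dataclass, field

# Hypothetical flag names standing in for the categories listed above.
EXCLUDED_FLAGS = {
    "hypothetical", "hypercorrect", "uncertain_inflection",
    "unlisted_tense_stem", "anomalous_inflection", "latin_transliteration",
}

@dataclass
class WordForm:
    text: str
    flags: set = field(default_factory=set)

def good_standing(forms):
    """Keep only forms that carry none of the excluded flags."""
    return [f for f in forms if not (f.flags & EXCLUDED_FLAGS)]

forms = [
    WordForm("ἀνδρός"),
    WordForm("ἀνρός", {"hypothetical"}),
    WordForm("ἀνδράς", {"hypercorrect"}),
    WordForm("ἄνδρι", {"aeolic"}),  # kept: dialect labels are not excluded yet
]
kept = good_standing(forms)
print([f.text for f in kept])
```

Note that a form tagged only with a dialect survives this pass, in line with the accidental Aeolic of the Byzantines being preserved at this stage.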

The next cut will be more cruel. Stems and inflections are marked for dialect and period in the lemmatiser; again, not infallibly, but indicatively enough. There's still a lot of Late forms in the corpus, including Graecobarbara. There's also lots of forms from non-literary dialects, that weren't lucky enough to have an Alcman or a Sappho.

We can filter out the Boeotian and Cretan and Locrian, and the Koine and Byzantine and Demotic, to give just forms compatible with literary Ancient Greek dialects. That's an artificial barrier, sure, but no less artificial than including them all in the same corpus to begin with; and there are plenty of people, ancient and modern, who would look approvingly on this "no riff-raff" policy. It will make the corpus somewhat more morphologically consistent: we'll at least be talking about five centuries' worth of morphology, not twenty-five.

Limiting word forms to Forms of Good Standing And Pedigree does still include lots of word forms that were only devised in the fourteenth or fifteenth century, because the ancient corpus did not exhaust all the possibilities of the ancient language(s). Moreover, proper names haven't been quarantined off from antiquity the way vocabulary proper has. So the count will be much more approximate and fuzzy than it seems. (That holds for all the counts here, of course; come back tomorrow, and the lemmatiser will give different counts.) Still, applying a No Riff-Raff constraint on the corpus—excluding post-Classical and dialectally marginal forms, and keeping just linguistically Classical forms, as the lemmatiser currently understands them—gives us:
Corpus                        Word Forms    Reduced
TLG + PHI #7                  1,267,434     1,135,915
TLG (viii–XVI)                1,158,529     1,041,520
Mostly Pagan (viii–IV)        515,275       505,302
Strictly Ancient (viii–iv)    288,305       285,856

The final cut is the cruellest of all, and it's so cruel only a linguist would do it. Forms of Good Standing And Cecropian Pedigree; in other words, Naught but Attic. No Aeolic, no Doric, no Ionic, and—here's the killer—no Epic. That cannibalises any classical literary work there is, and it's an over-idealised notion of what was spoken in Athens: there would have been some Aristophanean peasants that spoke like that, but no educated Athenian would have. And of course in the other direction, Byzantines kept coming up with Attic-compatible words too. Still, this cruellest of all cuts will give us a word form count that describes just one dialect at a time. Subject to how much the lemmatiser knows about Greek dialect, and again, the lemmatiser is not infallible, and will never be complete.

But again, as long as you understand these numbers will be just indicative, and just illustrative, and are worth what you paid for: these are the Attic-only word form counts for the corpora:
Corpus                        Word Forms    Reduced
TLG + PHI #7                  1,135,915     1,020,232
TLG (viii–XVI)                1,041,520     952,993
Mostly Pagan (viii–IV)        505,302       458,933
Strictly Ancient (viii–iv)    285,856       248,914

Finally got the Strictly Ancient count to budge. :-)

So we started with 1.3 million attested forms of Greek in our corpora; limiting them to forms compatible with Attic Greek takes that down to 1 million. For the Strictly Ancient corpus (when Attic was still a living dialect), that's 250,000, down from 290,000.

Now, what did all that prove?
  • For two millennia after Attic was no longer a living language, it remained a language of literature. There are some 890,000 post-classical wordforms, but over two thirds of them are compatible with Attic, and only a tenth of them are linguistically Late. Now, in part, that's simply because Greek did not turn into Lithuanian: there are plenty of words in Modern Greek that are compatible with Attic too—so long as you're relying on historical orthography. But if the literary language reflected the vernacular more accurately, the count would be a lot less than two thirds. So writers kept using Attic morphology productively.
  • The word form count for the corpora includes a lot of problematic forms, particularly the later we get (and the more artificial the literary language becomes). These problematic forms are part of the heritage of literary Greek; but it is misleading to include them in evidence of the productivity of Greek morphology: many of them are fantasy morphology. That said, these problematic forms are not frequent as forms (2% of TLG forms), and are even less frequent as instances (0.7% of all the words in the TLG corpus).
  • Nonetheless, cutting the morphological variety of Greek down from three millennia to a couple of centuries of Attic does make a difference: 85% of all forms in the Strictly Ancient corpus are Attic, and 80% in the full TLG corpus.
  • ... although in cultural, literary, and even sociolinguistic terms, limiting the morphology to Attic Only is an artificial thing to do. But that's weighing the scales for you: there's a reason why it happens.

So should we cite Classical Greek as having just 1 million, or 250,000 word forms, instead of 1.3 million or 1.8 million? Nah, we should not be counting word forms in a corpus at all, and limiting ourselves to accidents of attestation. But we should also be aware that any corpus like this is going to have forms that are more at home or less at home. And that it all depends.

One final note. The lemmatiser, as I keep saying, is changeable and fallible: the numbers I've been giving—and which I will give in later posts, once I start counting lemmata—are transitory, indicative, and unreliable: they only tell you how far one piece of software has gotten with one lexicon and one corpus. Because the TLG lemmatiser has to do a lot more than lemmatisers normally do—coping with six dialects and three thousand years—it runs into a lot more ambiguity than is usual; and it tries to deal with that ambiguity by ranking analyses of word forms as more or less plausible. If you're using the TLG lemmatised search, you can view the word forms which the lemmatiser thinks *might* belong to the lemma you're searching for, but probably don't.

So if you search for ἀνήρ as a lemma, you'll get the 42 forms we've been talking about. In fact, you'll get 103 forms, because of all the variations in accent and crasis and apostrophes we've mentioned before (though the list is case-folded). But you can also access, by clicking Show lower confidence forms, word forms that the lemmatiser thinks might be, but probably aren't, instances of ἀνήρ. As of this writing, that list includes:
  • κἄνδρος (1) (More probable lemma: Ἄνδρος)
  • ἄνδρου (56) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • ἀνδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • ανδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
  • αντρον (1) (More probable lemma: ἄντρον)
  • ἄντρου (126) (More probable lemma: ἄντρον)
  • αντρου (1) (More probable lemma: ἄντρον)
  • ἄντρ’ (4) (More probable lemma: ἄντρον)
  • ἄνδρους (1) (More probable lemma: ἀνδρόω)

For the most part, the lemmatiser is correct in dismissing these analyses: ἄντρ’ is not a Demotic analysis, but a Euripidean mention of "cave", and ἄνδρου (Ἄνδρου) refers only to the island of Andros.

But the lemmatiser is fallible, and it has slipped up with ἄνδρους. (Yes, I'm fixing the analysis now.) The lemmatiser had the alternatives of treating this as a Byzantine attempt at "thou wert manning" (with no augment, so the Byzantines would be play-acting at Homer); or a Demotic accusative of "men", in the completely wrong declension. The lemmatiser decided the Demotic wrong declension was even more absurd than the Byzantine play-acting. As it happens, it's wrong: this is a Demotic wrong declension after all (in the notoriously patchy vernacular Historia Imperatorum). So the next time the lemmatised search engine is updated at the TLG, there'll be 43 simple forms of ἀνήρ after all.

Once more with feeling: don't take the numbers too seriously. (As if the preceding posts didn't argue that ad nauseam already.) Just use them to get an order of magnitude sense of what's going on with Greek.

2009-06-12

Lerna Va: Word Form Counts, pruning

[Counts in this post have been corrected in Lerna VIc]

So surely, after all the disclaimers in previous posts, I will now tell you how many words there are in Greek?

Oh no. Not at all. Not even close.

Before I alight at the burning question of how many lemmata of Greek (and when), I'm going to spend a good deal of time on how many word forms of Greek. I've bandied a count already on these pages, and I'm going to reduce that count, slice by slice, until it represents something more reasonable. Not completely reasonable, but more reasonable.

Recall that we established four concentric corpora. When we extract unique strings from each corpus, we can (and do) do some normalisation of those strings. We delete non-textual material: Jona[tha]n and Jŏnă|thān are both counted as Jonathan, because those brackets and diacritics don't change the meaning of the word. For Greek, we also do some basic normalisation of accent: the grave is a positional variant of the acute, and words with two accents are phonological variants of words with one accent. So ἄνθρωπός is counted as the same word form as ἄνθρωπος, and καλὸς is counted as the same word form as καλός. We also reattach hyphenated words (which for some texts is trickier than it should be), and we ignore words which are only fragmentary (as routinely happens in inscriptions and papyri).
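For the curious, the accent part of that normalisation can be approximated in a few lines of Python. This is a minimal sketch and not the code actually used for the TLG counts: the function name normalise_accent is made up for illustration, and it assumes the text is Unicode polytonic Greek, so that the standard unicodedata module can split letters from their accents.

```python
import unicodedata

GRAVE = "\u0300"       # combining grave accent
ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni
ACCENTS = {ACUTE, CIRCUMFLEX}


def normalise_accent(word: str) -> str:
    """Turn graves into acutes and keep only the first accent of a word.

    Illustrative only: breathings, iota subscripts and diaereses are
    left untouched, and nothing here knows about enclitics as such.
    """
    # Decompose so that each accent is a separate combining character.
    decomposed = unicodedata.normalize("NFD", word).replace(GRAVE, ACUTE)
    kept = []
    seen_accent = False
    for ch in decomposed:
        if ch in ACCENTS:
            if seen_accent:
                continue  # drop a second accent, as in ἄνθρωπός
            seen_accent = True
        kept.append(ch)
    # Recompose so that equal word forms compare equal as strings.
    return unicodedata.normalize("NFC", "".join(kept))


print(normalise_accent("ἄνθρωπός") == normalise_accent("ἄνθρωπος"))
print(normalise_accent("καλὸς") == normalise_accent("καλός"))
```

Working on the decomposed form is the design choice that matters here: it means the grave on any vowel, with or without a breathing, is handled by one replacement instead of a table of precomposed characters.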

With that normalisation done, we get the following counts of unique strings in the corpora, for the TLG corpus as of this date.
Corpus | Word Instances | Word Forms
TLG + PHI #7 | 102,005,245 | 1,861,358
TLG (viii–XVI) | 95,475,128 | 1,567,892
Mostly Pagan (viii–IV) | 16,312,159 | 605,335
Strictly Ancient (viii–iv) | 5,464,913 | 334,428

We can already notice a few things:
  1. The PHI #7 papyri and inscriptions have 7% more word instances, but 19% more word forms; so there's lots of novel strings in the papyri and inscriptions. Because there's lots of new lemmata? Sure. But also because there's lots of misspellings. That's right, a misspelling counts as a unique string; so we'll have some sifting ahead of us.
  2. More word instances is not directly proportional to more word forms: most word instances belong to a small number of very common word forms, and novel word forms follow a law of diminishing returns. Going from Strictly Ancient to Entire TLG multiplies your word count by 18, but it only multiplies your word forms by 4.5. Because 18 times more text means 18 times more occurrences of and, and 18 times more occurrences of the, and only at the bottom of the sieve do you find lots of novel words.
  3. Even factoring all that in, later texts did come up with lots of novel word forms. How many, we'll see later.
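The arithmetic behind those ratios is easy to redo from the table. The snippet below is just a sketch for checking the proportions; the dictionary keys and the growth function are my own labels, and the figures are copied from the table above.

```python
# (word instances, word forms) per corpus, copied from the table above.
# The short keys are made-up labels, not official corpus names.
COUNTS = {
    "tlg_phi": (102_005_245, 1_861_358),
    "tlg": (95_475_128, 1_567_892),
    "mostly_pagan": (16_312_159, 605_335),
    "strictly_ancient": (5_464_913, 334_428),
}


def growth(smaller: str, larger: str) -> tuple[float, float]:
    """Return (instance ratio, form ratio) of the larger corpus to the smaller."""
    inst_a, forms_a = COUNTS[smaller]
    inst_b, forms_b = COUNTS[larger]
    return inst_b / inst_a, forms_b / forms_a


inst_ratio, form_ratio = growth("strictly_ancient", "tlg")
phi_inst, phi_forms = growth("tlg", "tlg_phi")

print(f"Ancient to TLG: x{inst_ratio:.1f} instances, x{form_ratio:.1f} forms")
print(f"TLG to TLG + PHI: +{phi_inst - 1:.0%} instances, +{phi_forms - 1:.0%} forms")
```

On these figures the instance count grows by a little under 18 times while the form count grows by a little under 5 times, which is the diminishing return the list is describing.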

So, what does 1.6 (or 1.8, or 0.6) million unique strings mean? As we'll see, not as much as you might think. Let's take the lemma ἀνήρ, "man". By this criterion, the TLG has no less than 137 distinct word forms corresponding to ἀνήρ. Pretty impressive, when it should just have 11 forms in any given dialect. This is what it should look like in Attic:
Case | Sg | Du | Pl
Nom | ἀνήρ | ἄνδρε | ἄνδρες
Gen | ἀνδρός | ἀνδροῖν | ἀνδρῶν
Dat | ἀνδρί | [Like Gen.Du] | ἀνδράσι
Acc | ἄνδρα | [Like Nom.Du] | ἄνδρας
Voc | ἄνερ | [Like Nom.Du] | [Like Nom.Pl]

So how did we get from 11 forms to 137? For one, yes, we have multiple dialects in there. But that's by no means the main reason; in fact, we're not even going to get to *that* issue, messy as it is, until the next post in the series. Take a look at the 137, this time resplendent in the Greek Font Society Didot typeface:

See the problem? The "unique strings" are case sensitive. Now, there is a reason why I did that: Greek has capitonyms—words that have different definitions if they are capitalised or not; so Ὅμηρος is "Homer", but ὅμηρος is "hostage", and Ἱππίας is "Hippias" while ἱππίας is the adjective "of the equestrian [fem]". The distinction needs to be made for lemmatisation, but it is not extremely frequent; and for words that aren't capitonyms, it leads to drastic inflation of word form counts. If we do away with casing in our strings, we get something closer to the spoken (and early written) linguistic reality of Greek. Yes these word forms become more ambiguous, but we're not left trying to claim that ΑΝΔΡΑΣ, ἄνδρας and Ἄνδρας are different words.
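Doing away with casing is about as simple as string processing gets. The following is only a sketch of the idea, with a made-up function name (fold_case) and a toy list of forms; it assumes Unicode strings, and it deliberately gives up the capitonym distinction, as described above.

```python
import unicodedata


def fold_case(forms):
    """Collapse strings that differ only in capitalisation.

    str.lower() is used rather than str.casefold() so that a word-final
    capital sigma comes out as final ς, the way the lowercase forms in
    a corpus would normally be spelled.
    """
    return {unicodedata.normalize("NFC", form).lower() for form in forms}


forms = {"ΑΝΔΡΑΣ", "ἄνδρας", "Ἄνδρας", "Ὅμηρος", "ὅμηρος"}
folded = fold_case(forms)
print(sorted(folded))
```

Note what this does and does not do. Ἄνδρας and ἄνδρας collapse into one string, and so do Ὅμηρος and ὅμηρος (which is the ambiguity we have agreed to live with). But ΑΝΔΡΑΣ folds to an unaccented ανδρας, which is still a different string from ἄνδρας; case folding alone does not take care of that.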

Our 137 forms then go down to 105, and our overall counts tumble as follows:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,861,358 | 1,698,134
TLG (viii–XVI) | 1,567,892 | 1,408,908
Mostly Pagan (viii–IV) | 605,335 | 572,537
Strictly Ancient (viii–iv) | 334,428 | 319,512


That's not enough though: notice that we've eliminated ΑΝΔΡΕΣ, because there's also a lowercase ανδρες, but we've kept ΑΝΔΡΙ, because there is no lowercase ανδρι. But of course, ανδρι is just ἀνδρί shorn of its accents, for whatever reason, and shouldn't be counted separately. If any word form is missing its stress or breathing, we should ignore it if the same word form occurs with a stress or breathing. That will mangle a couple of enclitics, but we'll undo that damage in a couple of counts, and at any rate it will affect only a dozen or so word forms.
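That rule (ignore a bare form if the same letters also turn up with an accent or breathing) can be sketched as follows. Again this is an illustration under assumptions, not the actual counting code: both function names are invented, the input is assumed to be case-folded already, and the enclitics mentioned above get no special treatment.

```python
import unicodedata


def strip_diacritics(form: str) -> str:
    """Return the bare letters of a form, without accents or breathings."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_unaccented_duplicates(forms):
    """Drop a bare form when an accented form with the same letters exists."""
    forms = {unicodedata.normalize("NFC", f) for f in forms}
    skeleton = {f: strip_diacritics(f) for f in forms}
    # Letter skeletons of every form that does carry some diacritic.
    accented_skeletons = {s for f, s in skeleton.items() if s != f}
    return {
        f for f in forms
        if skeleton[f] != f or f not in accented_skeletons
    }


sample = {"ανδρι", "ἀνδρί", "ανδρες", "ἄνδρες", "και"}
print(sorted(drop_unaccented_duplicates(sample)))
```

In the toy sample, ανδρι and ανδρες disappear because ἀνδρί and ἄνδρες are present, while the bare και survives because nothing accented shares its letters. A real run would then have to decide what to do with such survivors, which is where requiring breathings and accents comes in.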

So, conflating ΑΝΔΡΙ and ανδρι to ἀνδρί, and requiring word forms to have breathings and accents, our 105 forms go down to 86, and our overall counts to:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,698,134 | 1,649,083
TLG (viii–XVI) | 1,408,908 | 1,376,016
Mostly Pagan (viii–IV) | 572,537 | 562,744
Strictly Ancient (viii–iv) | 319,512 | 314,887


To go any further in interpreting word forms, we have to associate them to particular morphological analyses and lemmata. That means we should restrict our counts to word forms that the lemmatiser recognises, because we can't say much reasonable about the word forms that it doesn't. Right now, with casing intact, the TLG lemmatiser recognises close to 94% of the word forms in the TLG corpus, and 60% of the word forms in PHI #7. That's sacrificing something (6% and 40% of the word forms respectively), but we can't talk about word forms that we don't understand; and a lot of those words won't be words anyway—there's incantations and geometrical lines and all sorts of stuff in there.

(Of course, if you talk to me tomorrow, I'll be throwing out fewer word forms, because the lemmatiser is constantly being made cleverer.)

Eliminating unrecognised word forms and folding case, as we've been doing, gives us:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,649,083 | 1,435,391
TLG (viii–XVI) | 1,376,016 | 1,282,298
Mostly Pagan (viii–IV) | 562,744 | 557,574
Strictly Ancient (viii–iv) | 314,887 | 313,354

Let's pause here. So far, we've normalised case and (somewhat) accentuation, and we've constrained our word forms to those the lemmatiser understands. Our overall count has gone from 1.86 to 1.44 million. Our Strictly Ancient count has gone from 334 to 313 thousand—that corpus is overall much better behaved, so there's less there to clean up. Notice that getting rid of unrecognised word forms makes a huge dent in PHI #7 (the lemmatiser doesn't like phonetic spellings), but barely a scratch on the Ancient corpora (because Ancient Greek is well documented.)

Now, the lemmatiser does cleaning up of its own when it recognises words.
  • When it sees an apostrophe, it analyses it by filling in the missing vowel: ’νδρες = ἄνδρες.
  • When it is confronted with words unrecognisable on their own, it comes up with alternate spellings which can make sense of the word as spelled—that's how it gets anywhere with phonetically spelled church deeds or papyri. So it understands the monstrous diplomatic spelling δειακαίλἐυονται as διακελεύονται.
    What's a monstrous spelling like that doing in the TLG to begin with? Diplomatically published church deeds. That's why editors normalise. In fact, for all the chaos in the spelling of PHI #7, a lot of the words do have a bracketed normalisation next to them on the CD, and I've used those normalisations rather than the original readings in the counts.
  • If the accentuation has an acute in the fourth last syllable, or something else absurd like that, it analyses the word as if it were accented more sensibly. So it knows ἤλλοιτριωσθησαν is meant to be ἠλλοιτριώσθησαν, and ἦλπισαν is ἤλπισαν.
  • Iota adscripts are respelled as iota subscripts. So the lemmatiser treats ἦιδε the same as ᾖδε.
  • And if a word has undergone crasis, merging two words phonetically, the lemmatiser pries them apart again: κἀνδρῶν is broken up into καὶ ἀνδρῶν, and counted as an instance of ἀνδρῶν.

So the lemmatiser does some normalisation of words: it dismisses what are to it obvious misspellings, and it fills in phonologically missing bits of words. This does not get rid of all potential "misspellings": a lot of them have been added manually to the lemmatiser as variants in the texts. But these normalisations do need to be taken into account when counting word forms. ανιρ is just a phonetic spelling of ἀνήρ, not a novel word form. ἄνδρ’ is not a distinct word form from ἄνδρα, nor is ’νδρες distinct from ἄνδρες, or κἀνδρῶν distinct from ἀνδρῶν.

With the normalisation the lemmatiser can do on its own, the 86 forms of ἀνήρ go down to 50—getting rid of all crases and apostrophes; and the word counts go to:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,435,391 | 1,352,303
TLG (viii–XVI) | 1,282,298 | 1,232,209
Mostly Pagan (viii–IV) | 557,574 | 539,469
Strictly Ancient (viii–iv) | 313,354 | 301,005


Greek phonology has always featured the nu movable, an /n/ which can occur optionally at the end of some inflections, depending on what phoneme follows it. So "is" is ἐστι before a consonant, and ἐστιν before a vowel—leading to the Classic example of why the Ancients should have spaced their words, ἐστι νοῦς "it's a mind", ἐστιν οὖς "it's an ear" (esti nôːs, estin ôːs).

In other words, this /n/ is a liaison phoneme, and its presence or absence does not make the word distinct. So pairs differing only by a nu movable should not be differentiated as novel word forms (and the lemmatiser knows which /n/s are movable). That takes the 50 forms of ἀνήρ down to 43, and the word counts to:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,352,303 | 1,307,842
TLG (viii–XVI) | 1,232,209 | 1,189,688
Mostly Pagan (viii–IV) | 539,469 | 519,498
Strictly Ancient (viii–iv) | 301,005 | 289,812
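The nu movable conflation is simple once something has told you which final nus are movable. In the sketch below that knowledge is faked with a hand-written set, movable_with_nu, standing in for what the lemmatiser knows; the set and the function name are both hypothetical.

```python
def conflate_nu_movable(forms, movable_with_nu):
    """Count a form with a movable final ν as its ν-less counterpart.

    movable_with_nu is a stand-in for the lemmatiser's knowledge: the
    forms whose final ν is a liaison consonant, not part of the ending.
    """
    conflated = set()
    for form in forms:
        if form in movable_with_nu and form.endswith("ν"):
            conflated.add(form[:-1])  # ἐστιν is counted as ἐστι
        else:
            conflated.add(form)       # ἀνδρῶν keeps its ν
    return conflated


forms = {"ἐστι", "ἐστιν", "ἀνδράσι", "ἀνδράσιν", "ἀνδρῶν"}
movable_with_nu = {"ἐστιν", "ἀνδράσιν"}
print(sorted(conflate_nu_movable(forms, movable_with_nu)))
```

The point of passing in the set is that the final ν of ἀνδρῶν belongs to the genitive plural ending and must stay put; stripping every final ν blindly would merge forms that really are distinct.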



The lemmatiser also recognises some strings of Greek that it knows are not words, but abbreviations (Αν is used to abbreviate ἀνήρ at least once), Greek numerals, or geometric lines. (The corpus does include Archimedes and Euclid, after all.) Excluding such non-words takes us to:
Corpus | Word Forms | Reduced
TLG + PHI #7 | 1,307,842 | 1,300,717
TLG (viii–XVI) | 1,189,688 | 1,183,120
Mostly Pagan (viii–IV) | 519,498 | 518,321
Strictly Ancient (viii–iv) | 289,812 | 289,275



We could keep going, but we won't, because going further is going to be a lot more onerous. There are lots of "wrong" spellings in the Byzantine era:
  • uncertainty about whether to circumflex or acute stems (which count as different word forms here): κῦμα κύμα
  • uncertainty about whether to have double or single consonants (which is what I've been dealing with for the past couple of months): ἁγνόρυτος ἁγνόρρυτος
  • accents on a wrong but legal syllable of a word (which as far as I can tell, Byzantines did Just For Fun): ἄβυσσος ἀβύσσος.
At a guess, that kind of spelling variation may account for 2% of the word forms of the TLG. But this has already gone on plenty, and the point's been made: it's true that there are almost 1.6 million distinct strings as far as the TLG Word Index is concerned, but chop off a quarter of that to get closer to a realistic word form count. And if you limit yourself to just an Ancient Greek corpus, the 1.2 million becomes 500 or 300 thousand word forms.

Is that a lot? Well, no one said that Greek wasn't a highly inflected language. We've already seen at length why being a highly inflected language doesn't automatically give your culture extra IQ points: it's what you say, not how many suffixes you use to say it. Still, at a rough guess, this means between 3 and 6 word forms per lemma on average in the Greek corpus: common verbs will have hundreds of word forms corresponding to them, while the Long Tail of lemmata will have only one or two forms represented in a corpus. That's not bad, but it's not exceptional even among inflected languages, let alone agglutinative ones.

Let's compare Slovenian, which is certainly up there among modern inflected Indo-European languages. Rotovnik et al. used a newspaper corpus comparable to what we're talking about here, and a dictionary of 60,000 lemmata. Now, the thing about lemmata we will see in future episodes is, you never stop counting them: lemma counts are open-ended. All you can do is say, if I know this many lemmata, I can recognise this percentage of word forms in a corpus. So:
Measure | Ancient + Byzantine Greek (TLG) | Contemporary Slovenian
Word instances | 95 million | 105 million
Word forms | 1.2 million | 660,000
Lemma count | say 205,000? | 60,000
Unrecognised word forms | 6.2% | 8.7%
Avg. word forms per 1000 word instances | 12.6 | 6.3
Avg. word forms per lemma | 6 | 10

No, this is not a race, and we're not going to call Slovenian better or worse than Atticist Greek. Nor am I going to go into the sophistication of Rotovnik et al.'s word recognition model, which uses sub-words to improve recognition—and goes from 8.7% unrecognised down to 1.2%. I'm already doing some less sophisticated tricks to get as far down as 6.2%, because the TLG corpus is much messier than the corpus of Večer articles. No, the point is that a language like contemporary Slovenian, without three thousand years' and six dialects' worth of weighing down the scales, gives you the same order of magnitude of morphological diversity as do the Three Thousand Years of Greek.

And of course, Three Thousand Years of Greek may have double the word forms of five years of Večer; but once you go to an agglutinative language, Greek's out of the running, because agglutinative languages pack a lot more into their words. I can't get a lemma count from Kamadev Bhanuprasad's study on speech recognition in Telugu; but his newspaper corpus has 20 million word instances, and 615,000 different word forms: 30.8 word forms per thousand word instances, to the TLG's 12.6. Which tells us what we already knew: Greek is not the most morphologically productive language on the planet.
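If you want to check the word-forms-per-thousand figures, they come straight from the rounded counts quoted above. A small sketch follows; the function name and the corpus labels are mine, and the Telugu figures are the ones cited from Bhanuprasad's study.

```python
def forms_per_thousand(word_forms: int, word_instances: int) -> float:
    """Distinct word forms per 1000 running words of corpus."""
    return 1000 * word_forms / word_instances


# (word forms, word instances), using the rounded figures quoted in the text.
corpora = {
    "Greek (TLG)": (1_200_000, 95_000_000),
    "Slovenian (Večer)": (660_000, 105_000_000),
    "Telugu (newspaper)": (615_000, 20_000_000),
}

rates = {name: forms_per_thousand(*counts) for name, counts in corpora.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} word forms per 1000 word instances")
```

Bear in mind that this measure falls as a corpus grows (the diminishing returns noted earlier), so the Telugu corpus, at a fifth of the size of the other two, is somewhat flattered by the comparison.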

We've cut the word form count down for the Greek corpus to something more realistic; but "Realistic" is a problematic thing to say anyway, because we've still got to explain how 11 forms of ἀνήρ got to 43. That's the story about the dialectal and diachronic diversity of the corpus, and it will have to wait for the next instalment.

2009-06-10

Lerna IV: Corpora

So having spent four posts on why we should not count words of Greek, I will count words of Greek. The counts are only meaningful relative to a corpus, so here I detail what's in the corpus I'll be using, PHI #7 + TLG—and how I will end up treating it as four concentric corpora. There is also some information on the distribution and coverage of the TLG, which may be of interest even if you're not interested in counting words.

The corpus I'm using consists of a group of texts I've come to know well, the TLG; and a group of texts I know less well, the PHI #7 disc. The Thesaurus Linguae Graecae is a digital library of Ancient and Byzantine texts, which has been steadily moving forwards in time: it's increasing by around 3 million words a year. Counting words of data-entered text, including markup, hyphenated fragments, and symbols, it currently has close to 105 million word instances; if we restrict the count to just words of Greek, it has 95 million words. (That's your first indication that counting words is complicated.)

The TLG doesn't have all the text there is, but it does have a lot, and it's filling in texts as it goes. I'll reuse the grid I used for dictionary coverage of Greek to show how:

This is a crude representation of the current coverage of the TLG:

The TLG does not cover ancient non-literary texts, which are attested in papyri and inscriptions. That matters for counting words, because a lot of lemmata are only attested in non-literary sources. Non-literary sources range over details of daily life (especially in the papyri), and dialects absent from the literary canon (in the inscriptions). Non-literary texts are where texts keep showing up from antiquity, and where both the LSJ Supplement and the Diccionario Griego-Español get many of their new lemmata from.

This area is not covered by the TLG, with only a couple of exceptions (Epistulae Privatae); to address that, I'm also including the PHI #7 disc from the Packard Humanities Institute in the corpus.

The PHI #7 disc, which has 6.5 million words of Greek, was issued in 1995, and it includes three collections: a corpus of ancient inscriptions [now online], compiled by Cornell and Ohio State Universities (3.1 million); the Duke Databank of Documentary papyri (3.1 million) [also online]; and Inscriptions of the Christian Empire, compiled by John Mansfield of Cornell (0.3 million). New inscriptions and papyri keep showing up all the time, so PHI #7 does not cover everything we know we have; but it is representative enough.

The TLG does admit non-literary texts for the mediaeval period. The monastic acts in particular are diplomatic editions (i.e. preserving the original spelling of these monastery legal documents in all their inventive confusion). Their misspellings cause our counts of word forms all manner of trouble, as we'll see. But the TLG has not been working on mediaeval inscriptions either, so PHI #7's Christian inscriptions fill in a gap as well. The Christian inscription corpus is small, but it covers a lot of ground: the proto-Bulgarian inscriptions are here, and the texts go late enough to include Χακῆ as a rendering of Hajji.

For literary texts, the TLG is pretty much complete for antiquity, strictly defined. There are some gaps for the early Christian era, which are currently being filled in: the TLG is still missing some apocrypha, liturgical texts, and the Hexapla, including the Hebrew Scripture translations of Aquila and Symmachus. In terms of raw lemma count, the Latin–Greek glossaries are not in yet; when they are added to the TLG, they will account for something like a thousand LSJ lemmata currently absent from the corpus.
That an ancient dictionary should have so many one-off lemmata in it is no surprise: dictionaries contain words that people didn't know, and which are unlikely to turn up anywhere else. Which is why Hesychius is so important to comparative linguistics—and such a pain to do anything sensible with in lemmatisation. The Latin–Greek glossaries are a mixed bag: they contain words never heard of before or again (e.g. τηκεδονικός, -ή, -όν tabificabile); but they also have the first instances of common modern words (e.g. τζάπιον, τό bidens, ligo, raster—i.e. Modern τσάπα "hoe").

For the mediaeval period proper, TLG work on expanding the corpus is ongoing. A couple of years ago we guessed that we had 70% of the texts covered by Trapp's dictionary: that would translate to some 20 million more words. The TLG has now started to include Early Modern texts as well: it has a while to go yet, but it already has 2.5 million words of the vernacular. Of course, this is (a) only a small proportion of Early Modern Greek texts, and (b) nothing about the contemporary Modern Greek language. So this corpus doesn't tell you much about anything past 1600 at the moment.

We saw in the post on the dictionary coverage of Greek that various periods do better or worse in how well they are covered by grammars and dictionaries—and how "clean" their texts are. (Way too much Migne still for editions of the Church Fathers, for one.) That's reflected in the lemmatiser I've been working on for the TLG: it deals with Ancient Greek proper exceedingly well (99.4% recognition up to Aristotle), but more patchily with Mediaeval Greek (94.6% for viii-xvi AD learnèd, as of May). These can be illustrated with degrees of certainty of recognition by the TLG lemmatiser, something I'll talk more about later. (And note that these figures change month by month, as the lemmatiser is improved.)


Lemma recognition is at a loss with the PHI #7 texts (around 65%). In large part, that's because of the more chaotic spelling used in those texts. In at least some part, that's because I've spent 6 years tweaking the lemmatiser to TLG texts, and only a couple of hours tweaking it to PHI. I'm missing a whole lot of inscription- and papyrus-specific lemmata from DGE (where the growth spurt is), and there's a whole lot of Egyptian proper names the lemmatiser hasn't heard of, so PHI is going to be underrepresented in any lemma counts I try to work out.

Moreover, we saw in more recent posts that the longer the time span of a corpus, the more incoherent any counts are. Given that later Greek is less well documented, less well edited, and less Classical than earlier Greek, I'm going to split my corpus in four, and give counts for each, moving progressively closer to the Ancient core. So:
  1. Counts for TLG + PHI #7.
  2. Counts for just TLG, which I've had more command over than the PHI #7 corpus (and which concentrates us on literary texts for antiquity)
  3. A Mostly Pagan Mostly Ancient corpus.
  4. A strictly Ancient corpus.

The strictly Ancient corpus stops with Aristotle, fourth century BC. That covers the classical canon, which everyone since has admired and emulated; but it's not all the texts counted as ancient in the broad sense: it leaves out Polybius, Plutarch, Lucian, and the Judaeo-Christian scriptures. Antiquity conventionally ends with Nonnus, in the sixth century AD. But having an ancient corpus go up to the sixth century will include too many "unruly" texts: texts in poor editions, or texts where the classical norms aren't as consistently observed.

To clean up the corpus somewhat, and present a middle ground between the full TLG and Homer-through–Aristotle, I'm positing a Mostly Pagan Mostly Ancient subcorpus. This goes up to the fourth century AD (so Synesius is in, Nonnus is out), and it includes the Jewish and Christian scriptures. But it excludes any other Christian writings, and technical writing: medical, legal, alchemy, astrology, lexicography, grammar, scholiastic, philology, geography, mathematics, mechanics [engineering], and magical. That's pretty brutal, but both the technical and the Patristic corpora are linguistically distinct from the literature of Lucian and Synesius, and are the kinds of text that Classicists, for better or worse, have paid less attention to. So this Mostly Pagan Mostly Ancient corpus is a literary corpus, comparable to the strict Homer-to-Aristotle grouping, but with a less straitened timespan.

Limiting the time span like that cuts down the 95 million word corpus significantly, because of how unequally texts are attested from different periods. The strictly Ancient corpus is just 5 million words large; and there are some striking disparities in how texts are represented in the TLG by century:

Obligatory provisos about the century breakdown: it's by author not work, so a small number of later texts get included in earlier centuries. The most egregious instance is in the Hippocratic corpus, which includes among its Ionic a text so modern, it uses Italian words for "virtue" and "colour" (βερτοῦ, κλόρε). The "Varia" are mostly scholia, which cover any time from Roman times to the late Middle Ages. But the proportions are indicative enough.

The inconsistencies will be clearer in bar chart form:

Most of the spikes in the graph can be explained. The iv AD spike is the major church fathers, and texts attributed to them, which make John Chrysostom the most prolific author in the corpus. The ii AD spike is in large measure because of the disproportionate representation of medical authors (and the Second Sophistic), and texts attributed to *them*, which make Galen the second most prolific author in the corpus. The dip in vii–viii AD is presumably the Byzantine Dark Ages (yes, yes, I know the term is problematic). The dips in other centuries, especially ii BC and iii AD, I don't really have an explanation for.

The disproportionate spikes go away if we take Christian and technical texts out of the equation, and restrict ourselves to literature (à la the Mostly Pagan Mostly Ancient subcorpus, which adds up to 19 million words).

The Golden Age of Classical Literature does not look so underwhelming, for one:

There's still some spikes that may or may not come as a surprise. vi AD is bolstered by the voluminous Neoplatonists; even without the medicos, the Second Sophistic was prolific; the Comnenan Renaissance and Palaeologan Renaissance, xi AD and xiii AD, are now visible. And once the Byzantine legal texts are taken out of the picture, the Dark Ages look Darker: they weren't as dark a time for lawyers...

2009-06-06

Lerna IIId: Why we do not count lemmata

Now, the whole point of any word counting venture, such as Lerna attempts and gets galumphingly wrong, is not the corpus size, which is contingent and always less than infinity; nor is it the number of word forms, which tells you about morphological happenstance but not about vocabularies. When people talk about words, they mean dictionary words.

This veers off into Eskimo Words For Snow territory, so it's even more fraught for a linguist to talk about. Especially because, even more than for word forms, there is a lot of arbitrariness to be had about how you count lemmata. Enough arbitrariness to make the whole venture deeply problematic. It's especially problematic if, like the artificially inflated corpus of the TLG (or the OED, or indeed any dictionary), the corpus spans more than the vocabulary contained in one skull, and ranges over more than one region, and more than one decade. That brings together all the words you might need to know if you ever come across them, in a literate culture that preserves words in print for centuries. It does not bring together all the words you ever will have in your skull: it's not modelling the vocabulary that any speaker will ever command. Dictionaries are documenting an inflation inherent in any written language; it is particularly pronounced for Greek, for reasons already seen.

Now, it's reasonable to assume that if your language gets used by more people, to talk about more stuff, in a culture where more stuff is around, and in contact with lots of other languages and their speakers' stuff, then that language will have more words. The Greek of the Roman Empire was like that. The English of the Globalisation Empire is much more like that. So if the guesstimates are that contemporary English has twice the dictionary words as contemporary Spanish, that's plausible.

The Greek of the Classical Age invented much of how the West understands the world. But it was not exploding with words. The Spartans weren't the only Greeks to be Laconic: Classical Greek was frugal with its words—enough for its philosophy to look basic (or unsophisticated), compared to the German experience. As we'll see, the vocabulary explosion happened much later. Look at how Plato writes about philosophy, how a speech in Euripides works—how insistently Aristophanes snipes at Socrates' and Euripides' new-fangled words, and how unremarkable those new-fangled words turn out to be. "Verse" στίχος, Frogs 1239, was such a new-fangled word, for goodness' sake, as Andreas Willi writes: The Languages of Aristophanes, p. 58; yet it's merely reusing the word for "line".

Plausibility was never the point of the Lernaean text, nor is it perturbed by any actual familiarity with Classical Greek. But even with three millennia of vocabulary buildup pitted against 500 years' worth of Modern English, the world is working out in such a way that Greek is not going to beat English in the "my lexicon is bigger than your lexicon" games. The information overload explosion is being engineered in English, and involves English coinings. Where the vocabularies are growing, other languages are struggling to keep up, and most don't bother: IT done outside of English is now all about the codeswitching. Lernaeanists hear the codeswitching and see the scriptswitching all around them, yet still they assert in their Letter to the Editor that English having more words than Greek must be some kind of joke. ("Και προβάλλεται ως τέτοια η Αγγλική, που μόνο σαν ανέκδοτο μπορεί να θεωρηθεί.") That... must be some kind of joke itself.

But as the about.com answerer hastened to add, English having double the words of Spanish doesn't mean Spanish doesn't have nuances which English can't readily express. Or that any other language doesn't. There are still notions particular to any given culture, which that culture's vehicle language will have words for, and another culture's language won't have had a reason to come up with a word for. That's true of farm implements vs. modem protocols, and it's true of all the subtle constructs that each language's poets embrace zealously, and that the Meaning of Tingo book series did such a superficial job on. (At least the guy has a blog, so there's some avenue for the readership to fine tune things.)
It always struck me as amusing, btw, that most such "untranslatable" Modern Greek words... are Turkish or Venetian. Although of course, whatever meaning they've since picked up is quite distinct from when they first entered the language. It's a long way from merak "hypochondria" to μεράκι "outburst of creativity" [EDIT: better, "sustained creative effort"]. The sequence, from what I surmise, is: hypochondria > lovesick > yearning > fastidious about one's work > taking pride in one's work. By a similar pathway with a last-minute detour, meraklı "hypochondriac" > μερακλής "bon vivant, connoisseur"... Come to think of it, those French words are untranslatable too, aren't they.

There's likely more animal husbandry terms in Masai than Pitjantjatjara, and more terms for kinds of cheese in Italian than Laotian, and more terms for intellectual property arrangements in English than in Sorbian. That's the anodyne version of the Eskimo Words For Snow business, and not particularly surprising. Again, it doesn't mean brains are wired differently. You can translate μεράκι with some work. Unlike Spanish (and Old English), German does not have a verb to distinguish essential being from contingent being (ser/estar, bēon/wesan). That didn't put the brakes on German philosophy (!), and it didn't prevent them making a noun for persistent existence, Dasein. Not having as many words as the language up the road is not such a deal-breaker in the end.

But as to this urge to have more words than English, in a game that can't be won and makes no sense anyway... it's malicious of me, but I'm compelled to recount the 1980 Richard Feynman in Greece episode (Link 1, Link 2):
They were very upset when I said the development of the greatest importance to mathematics in Europe was the discovery by Tartaglia that you can solve a cubic equation: although it is of little use in itself, the discovery must have been psychologically wonderful. It therefore helped in the Renaissance, which was freeing man from the intimidation of the ancients. What the Greeks are learning in school is to be intimidated into thinking they have fallen so far below their ancestors.

Tartaglia's work was done more than 1000 years after the Greeks and showed to the Greeks that a modern man could do something no ancient Greeks could do.
(Richard Feynman, What Do You Care What Other People Think?)

Lerna is a hoax, and Lerna is an annoyance, and Lerna is an embarrassment; but it will not die, because more than anything else, Lerna is a symptom. It's a symptom of what Feynman found. And the way to singe the head of the Hydra is to get over that nagging sense of not measuring up to the Hellenes. Generations have failed to make headway there; but Lerna's not making the job any easier, by attributing to a literature already bestriding the world a vocabulary 1000 times larger than life.

Lerna IIIc: Why the Greek scales are rigged

Even if you allow for the fact that Greek is flexional and has lots of inflections, a literary corpus of Greek is going to have a lot more morphological variety than most other literary languages. That doesn't tell you something about the superiority of the Greek language. But it does tell you a bit about Greek culture. And it does mean that, if the word form and lemma counts of Greek come out better than expected, the comparison is not exactly fair.

The first catch is that the literary corpus spans three thousand years, as many a Greek ideologue likes to remind you: a trick only Chinese has gotten away with. Does that prove it's the same language? That's a loaded question, of course: if you believe the Moderns are the same people as the Ancients, you'll call both Greek, as everyone does now; and if you don't, you'll distinguish Hellenic from Romeic, as everyone did three centuries ago. (That's unless you were calling Romeic Graecobarbaric, which was also all the rage in some circles.) More to the point, if you believe the Moderns are the same people as the Ancients, your language will reflect that belief. A lot of that in the contemporary Standard is engineered: it results from the conscious efforts of Puristic to bring older forms of the language back. Some of it is older conservative forces, notably the language of the church.

Greek is no Icelandic: the written literary tradition has had much more of an effect on the spoken language up North, and Iceland is a much smaller place. Greek may be on the conservative side morphologically, compared to say English; but the morphology has still changed quite a bit. Which means, if you count Homeric morphology and contemporary morphology in the same word form count, you're going to get a lot more word forms than if you were doing one millennium at a time. And most counts of what a language's word forms are take just a decade or so at a time, because most counts are synchronic: they're snapshots of a language, not the whole Theseus' Boat ten-part series. A synchronic count of Greek is going to show you a lot less variation, because people don't normally have conversational command of three millennia's worth of speech.
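To make the pooling effect concrete, here is a minimal Python sketch. It is a toy illustration, not a corpus count: the data are a few nominative and genitive singular forms of the word for "city", grouped under three period labels I have picked for the example, and the names `forms_by_period`, `per_period` and `pooled` are made up for the sketch.

```python
# Toy data (an assumption for illustration, not TLG output):
# nominative and genitive singular forms of "city" in three periods.
forms_by_period = {
    "Homeric": {"πόλις", "πόλιος", "πόληος"},
    "Attic": {"πόλις", "πόλεως"},
    "Modern": {"πόλη", "πόλης"},
}

# Synchronic counts: distinct forms within each period taken alone.
per_period = {period: len(forms) for period, forms in forms_by_period.items()}

# Diachronic count: distinct forms once all periods are pooled.
pooled = len(set().union(*forms_by_period.values()))

print(per_period)
print(pooled)
```

No single period in the toy data has more than three forms, but the pooled set has six. The same one lemma looks twice as rich in forms once the snapshots are stacked on top of each other.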

Normally, no one does: that's not the language people have in their skulls, which is what most linguists deal with. Of course, you could compile a corpus of three millennia's worth of language spoken in Rome; and you'd get Classical Latin, Vulgar Latin, several stages of Romanesco, and Standard Italian in the one list. With lots and lots of morphological variation. There's a reason why you wouldn't call that one language's worth of morphology over three millennia in Rome, but three: so the different morphology shouldn't be on the same listing. There's a reason why you may choose to call it one language's worth of morphology over three millennia in Athens (as long as you leave out the Arvanitika of Pllaka). The reasons for that aren't entirely linguistic. They aren't entirely non-linguistic, and the development of Greek has been affected by the underlying thinking. But these are all gradients and slippery slopes; and Greek is at one extreme of the slope. It proves Greek covers a long period; it doesn't prove Greek-speakers have their brains wired differently.

There's not just the three millennia upping the word count. All languages have regional variation, with different grammar and lexicon, up until they get spoken in just one place—or the mass media convince you that they are. People normally speak one dialect at a time, just like they speak one century at a time; so having a corpus span 3000 km of language doesn't tell you more about what language is contained in a single skull than does having a corpus span 3000 years of language. So including 3000 km bumps up your word form count more than is strictly speaking fair.

The thing about Greek is, the literary culture made the same language span not just thousands of years, but thousands of kilometres. Literary Greek is pretty distinctive in having no less than six literary dialects: Epic, which is Old Ionic with other bits, New Ionic, Doric, Aeolic, Attic, and Koine. They're conventionalised in the literary texts, and are not always linguistically reliable; but someone with a literary grounding by Hellenistic times was expected to be conversant in the lot of them; and the literary corpus does need to reflect them all. The literary corpus is not reflecting what was in any one Greek's skull as their native speech, so comparing its morphological diversity to what other language corpora tell you is artificial. But once a language is literary, artifice happens: there's more King James and Shakespeare in contemporary English than there should be, too, and more pepperings of American in Australian English than would have made sense a century ago. And at least some Byzantine scholars did have some command of much of this inflated repertoire of Greek morphology, as artificial as it got.

All this, though, is reason why counting word forms in Greek is misleading. I'm still going to attempt it, because it raises some further interesting questions, and we're going to see the 1.5 million word forms I quoted whittled down a fair bit. (Having to control for spelling variation, for starters.) Last stop in the ritual abjuring of grocery calculations, lemmata.

Lerna IIIb: Why we do not count word forms

Greek is a flexional language: it's not English. A single noun can have 11 different inflections. A single adjective can have 23 inflections. A single verb? I'll throw in the second aorist as well as the first, though I really shouldn't—verbs mostly had just one aorist at a time. I'll be generous, we'll call it 740 forms.

Many a student has gazed in wonder at the subtlety and copiousness of the Greek verb table. I'm sure about as many have been annoyed at the rote memorisation; but the reason the verb table gets admiring remarks is, the 740 forms are not random: they follow a mesh of patterns, which you can reconstruct in proto-Greek back to something pretty neatly agglutinative. On the other hand, a few centuries of phonological shuffling reconfigured the 740 forms enough to be interesting.

Of course, very few verbs are attested in a corpus with all 740 forms. Few verbs have both first and second aorists, to start with. And any corpus is going to display only a subset of what is possible in a language, and what language speakers will recognise as valid verbs. To our knowledge, πεπαίκοιτον "you two would have played" is not attested anywhere in Greek. But it's a regular perfect dual optative, and the perfect indicative πέπαικα is well attested enough: it's as valid a verb form of Greek as any other, whether anyone ever wrote it—indeed, whether anyone ever spoke it, or not. So though the TLG happens to have 219 forms of παίζω, all 534 possible forms of παίζω should count. (No second aorist.)
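As a quick back-of-the-envelope check on the gap between corpus and paradigm, the two figures just quoted for παίζω can be turned into a coverage ratio. This is only arithmetic on the numbers in the paragraph above; the variable names are mine.

```python
# Figures quoted above for παίζω.
attested_forms = 219   # forms that happen to occur in the TLG
possible_forms = 534   # forms the paradigm allows (no second aorist)

# Share of the paradigm that the corpus actually attests.
coverage = attested_forms / possible_forms
print(f"{coverage:.0%}")  # prints "41%"
```

So even for a well-attested verb, the corpus shows well under half of what a speaker would have recognised as valid.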

But once we admit all possible forms, and aren't constrained by what's in a corpus, we're comparing langues, not paroles. And there are languages with more morphology than Ancient Greek. Finnish has fifteen cases. Sanskrit has comfortably over a thousand verb forms. Agglutinative languages, which don't moosh affixes together into idiosyncratic combinations, can go a lot further than that. Turkish? OVER TWO MILLION verb forms.

So is Ancient Greek the only language with an interesting verb table? No, Sanskrit beats it. Is it the only language with lots of morphology? No, Turkish beats it, and Lakhota beats it, and Telugu beats it.

And does that prove Greek inferior to Sanskrit, or Turkish, or Lakhota? No, because that's no valid criterion for judging a language's merit. And the reason Greek should bail out of this comparison is not that its 740 lose out to Turkish's TWO MILLION, but that this particular flavour of grocers' calculation doesn't prove much of anything. Just as the 98 inflections of Modern Greek, or the four inflections of Modern English, don't prove their inferiority to the 740 of Ancient Greek.

And really, why would they? There's poetry in Chinese, and poetry in Lakhota; there's oratory in Latin, and oratory in Arabic. Is a culture lesser for lack of a dative? Μὴ γένοιτο. Is it impoverished through absence of an optative? Ας σοβαρετούμε λίγο. In fact, just as Hellenomaniacs ponder whether you really are impoverished for lack of a dative, English-speakers in different venues—though no more scholarly—ponder whether you are impoverished for having one. Both can't be right, and really, do we want to say that either is right? That's a dodgy calculus to embark on.

Now, I have to swap hats from linguistician to literato for a minute, because the Hellenomaniacs do ponder the loss of the dative for a reason. "I prefer the synthetic nature of Ancient Greek to the analytical nature of Modern Greek", one of the Sarantakos bloggers posted. With a linguistician's hat on, that's sentimental claptrap. But language is about a lot of things, including sentimental claptrap. It's a vehicle for people's ideologies, and it gets affected by those ideologies.

The dative ain't coming back in Greek—been through that. But Puristic has had major impact on the Modern Standard, even if it didn't augment its inflection count. And just because Puristic Greek failed to revive the dative doesn't mean standard languages can't choose to switch their typology, through deliberate acts of engineering. Estonian even changed its word order because of one language reformer. Are there linguistic reasons to do so? Not really, the languages were trundling along fine without the engineering. I mean, language typologies left on their own do change: they've got more analytical for the European languages we're familiar with, but less analytical for Chinese. So it can happen. But it doesn't have to, and natural ebb and flow of language structures is not why the language engineering happens. It's "sentimental claptrap" that does it. If a language community is convinced to do something about its morphology, that has linguistic consequences, so it's not something alien to linguistics.

That aside, you can have aesthetic judgements about how a language works. If you got Classical Greek under humane conditions in your schooling, you can look at a phrase in Lucian and say, "that's elegant". The datives and the optatives are part of that elegance. I've even thought "that's elegant" about George Chatzidakis' Puristic Greek. At the sentence level; because like most 19th century linguists, he was incapable of structuring an argument, and Chatzidakis' elegant sentences add up to fifty pages of "and another thing" meanderings that can only be broached via a subject index.

Aesthetics matters; but aesthetics is informed by many a factor, not all of them linguistic. Modern Greek speakers have been attracted to the dative, which they don't have and wish they did, like the Ancients; and they've been repelled by it, after being badgered that they should have it. It's emotive either way because the Ancients are part of the equation. The Turkish ablative is as elegant, in purely linguistic terms, as the Latin one; but those who have longed for the Greek dative aren't on record admiring Turkish sentence structure.

There's no shame in aesthetic judgement being culturally informed. That's the nature of aesthetics. But that tells you that the aesthetics are not mathematically provable, certainly not through a word form count. People used to want to nudge English in the direction of Latin, back when they too were burdened with its heritage. They're chill about it now. Which means the few English-speakers who read Latin can appreciate it without the gnawing feeling they should emulate it—whether the emulation makes sense in English or not. (See infinitive, split.)

I don't say this to dismiss the learning of Ancient Greek in Greece. I'm not even saying there aren't things Modern Greek style can emulate from Ancient Greek: it has done, just as Gibbon owed a debt to Cicero. But none of that is inherent in the dative case. And at the end of the day, πεπαίκοιτον is more compact than "you two would have played", but is it "better"? Objectively? Without considering which civilisation the word was at home to? And if it is, is Upper Sorbian byštej zahrałoj "you two would have played" any less "better"? How? How about Nenets manzarajidinz' "you two would have worked"?

Right. More grocery calculations coming up.

Lerna IIIa: Why we do not count word instances

This blogpost in the ongoing thread on the Lernaean Text and counting words in Greek (see Lerna II, Lerna I) may be misdirected to the readership of this blog. It goes through basic notions in linguistics that some of you will be familiar enough with to be annoyed at. And given how the Lernaean text has been propagated, this post should be in Greek. Then again, Nikos Sarantakos has been posting brilliantly about it in Greek for a decade—not that this has singed off any of the heads of the Hydra, because it really is a Hydra. From Lerna.

Still, I have to ritually cast out the implications of "my language has more words than your language, nyuh", before I start counting the words of Greek on the record. And obvious statements are worth writing down too. Especially when they are not obvious down at the fevered swamps of Lerna. Besides, I have a mission statement for the blog: "making Greek more googleable" (through English).

I arrived at this mission statement in corresponding with my one-time student Matt Treyvaud, who has long been making Japanese more googleable at No-Sword. Read ye his blog, for it is hale: thou hast exceeded thy sensei! :-)

The previous post already goes into reasons why the size of the corpus you're using doesn't mean all that much. The millions of times /malaka/ gets said daily (let alone the hundreds of millions of times /fʌkɪŋ/ gets said) do not outweigh the smaller number of words surviving from Greek antiquity. The fact that five times more people speak French than Dutch does not make French a five times better language. The fact that the Mahabharata is ten times longer than the Iliad does not make it an inherently better poem. Nor an inherently worse poem.

We could go on with this. Let's. And let's go to the real point of promoting the size of the TLG corpus in the Lernaean text: the fact that this is not a count of any old words, but of the words of Classical Greek literature.

Size as a metric for literary quality. Doesn't sound convincing, does it? Of course, that's not why the figure of 90 million got inserted: it got inserted because the writers really had no idea what the difference is between a word instance count and a lemma count. Among all the other things they had no idea about. But let's spend some paragraphs on this strawman anyway.
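For anyone hazy on the difference between the three kinds of count this series keeps returning to, here is a minimal Python sketch on an English toy sentence. Everything in it is an assumption made for the example: the sentence, the whitespace tokenisation, and the two-entry `toy_lemmata` lookup, which stands in for a real lemmatiser and has nothing to do with how the TLG one works.

```python
# Toy input; real corpora need proper tokenisation, not str.split().
text = "the dog saw the dogs and the dogs saw the dog"

# Hypothetical lookup from inflected form to dictionary headword.
toy_lemmata = {"dogs": "dog", "saw": "see"}

tokens = text.split()

word_instances = len(tokens)                           # every occurrence
word_forms = set(tokens)                               # distinct spellings
lemmata = {toy_lemmata.get(form, form) for form in word_forms}  # headwords

print(word_instances, len(word_forms), len(lemmata))   # prints "11 5 4"
```

Eleven instances, five word forms, four lemmata: the instance count grows with every extra page of text, while the lemma count barely moves. A figure like 90 million is the first kind of number, and a dictionary headword count is the third.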

Reading the erudite though self-important Esperanto literary journal Literatura Foiro, I came across a quote from Italo Calvino that obviously a widely spoken language would produce greater literature than a small language like Bulgarian. I don't know anything about Italo Calvino, and after that quote, I didn't care to. I did recently ask a friend who did know something about Italo Calvino, and it makes sense, given his conscious cosmopolitanism, that Italian would be better at producing the kind of literature he valued than would Bulgarian. The response to that of course is, English would be a lot better still, if contemporary cosmopolitanism is your primary aesthetic criterion. It's not the only aesthetic criterion in existence, and it's not like Calvino wrote in English anyway; so I wouldn't use population counts or readership size or extent of bilingualism to invalidate as inferior the literature of Bulgarian. Or Italian. Or Greek. Or Esperanto.

Besides, what's "greatness" about? Esperantists were hankering for an Epic of their own, and were overjoyed when William Auld gave them The Infant Race (La Infana Raso) in 1956—although, it being 1956, the cantos it sounded like were Pound's not Alighieri's, and it eulogised the perpetuation of the species, not the rage of the son of Thetis. The poem is good; Auld did good short poems too, though they're not why he kept being nominated for the Nobel Prize. But the jewel of Esperanto Modernism was Victor Sadler's Self-Criticism (Memkritiko) in 1968; and it was a jewel because it thought Small, not Big. Greek readers whose eyes are glazing over about now might want to compare how much pound per verse you get out of Palamas and Kazantzakis, versus Cavafy and Karyotakis. The fact that they wrote Big is not an argument against Palamas and Kazantzakis, any more than it is against the Mahabharata. But it decidedly isn't an argument against Cavafy and Karyotakis, either.

Even within the TLG's ambit, the size of the Byzantine corpus is bigger than the Ancient corpus. A lot bigger. All up, it'd be at least 10 times bigger, depending on how you count. Now, that does not mean the Byzantine corpus is worthless: judgement has been severe on Byzantine literature, and the artificiality of the learnèd language did not help, but a thousand years of writing did not produce nothing. Still, students don't enrol in Ancient Greek classes to read Theodore Prodromus or John Chrysostom. It would be cool if they did, but they don't. They enrol to read Plato and Homer, or Mark and Paul. If they do read Prodromus and Chrysostom, it's after they've read Plato and Mark. And there's a lot less of Plato, or Mark, than of John Chrysostom.

Which brings us back to the actual Classical Greek corpus. As we'll see, the TLG is not just Classical but Byzantine Greek, and the actual Classical Greek corpus that has survived is not 90 million words: not even close. What did survive, survived precisely because of how great its impact was, and the impact was out of proportion to its word count. Nor is the actual Classical Greek corpus particularly prolix: at its best, it used words carefully and frugally. It wasn't in a race to come up with lots of words: Aristophanes has a little fun now and again, but he doesn't go to town like Constantine of Rhodes did.

The literary corpus from Homer to Aristotle is 5 million words, not 90. Do we really want to say that makes it 18 times less important?

OK, we now put such grocers' calculations aside. And we move to other grocers' calculations in following posts. (Lerna IIIa is the first of four.)

To take us out, some Constantine of Rhodes, which I cited in the Entertaining Tale of Quadrupeds, pp. 91-92, to illustrate... well, the race to come up with lots of words. Leo Choerosphactes, you must have really gotten beneath this guy's skin:
λαρυγγοφλασκοξεστοχανδοεκπότα!
κασαλβοπορωομαχλοπροικτεπεμβάτα!
ὀλεθροβιβλιογαλσογραμματοφθόρε!
σολοικοβαττοβαρβαροσκυτογράφε!
καὶ ψευδομυθοσαυροπλασματοπλόκε!
ἑλληνοθρησκοχριστοβλασφημοτρόπε!
καὶ παντοτολμοψευδομηχανορρόφε!
καὶ τρωκτοφερνοπροικοχρηματοφθόρε!
ἀρρητοποιονυκτεροσκοτεργάτα!
καὶ νεκροτυμβοκλεπτολωποεκδύτα!

You flask-in-gullet–pint–mouth-gaping–gulper!
You harlot-whore–lewd-beggar–shirt-lifter!
Disastrous-book–false-letter–ruiner!
Solecist-babbling–barbarous-hide–writer!
Fake-fairytale–and–cracked-creation–monger!
You pagan-creed–and–Christ-blaspheming-type!
All-daring–and–mendacious–mal-intriguer!
You bride-gift-gnawing–dowry-money–waster!
Acts-unspeakable–nightly-darkness–worker!
You gravesite-corpses-robbing–clothes-despoiler!