2009-06-12

Lerna Va: Word Form Counts, pruning

[Counts in this post have been corrected in Lerna VIc]

So surely, after all the disclaimers in previous posts, I will now tell you how many words there are in Greek?

Oh no. Not at all. Not even close.

Before I alight at the burning question of how many lemmata of Greek (and when), I'm going to spend a good deal of time on how many word forms of Greek. I've bandied a count already on these pages, and I'm going to reduce that count, slice by slice, until it represents something more reasonable. Not completely reasonable, but more reasonable.

Recall that we established four concentric corpora. When we extract unique strings from each corpus, we can (and do) do some normalisation of those strings. We delete non-textual material: Jona[tha]n and Jŏnă|thān are both counted as Jonathan, because those brackets and diacritics don't change the meaning of the word. For Greek, we also do some basic normalisation of accent: the grave is a positional variant of the acute, and words with two accents are phonological variants of words with one accent. So ἄνθρωπός is counted as the same word form as ἄνθρωπος, and καλὸς is counted as the same word form as καλός. We also reattach hyphenated words (which for some texts is trickier than it should be), and we ignore words which are only fragmentary (as routinely happens in inscriptions and papyri).
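
The normalisation just described can be sketched in a few lines of Python. This is an illustration only, not the TLG's actual code: the name `normalise`, the particular editorial marks stripped, and the choice to keep the first accent and drop the second are all assumptions made for the example.

```python
import unicodedata

GRAVE = "\u0300"       # combining grave (varia)
ACUTE = "\u0301"       # combining acute (oxia/tonos)
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni
ACCENTS = {ACUTE, CIRCUMFLEX}
METRICAL = {"\u0304", "\u0306"}           # macron and breve (assumed non-textual)
EDITORIAL = str.maketrans("", "", "[]|")  # brackets and dividers (assumed set)


def normalise(word: str) -> str:
    """Strip editorial marks and reduce accent variants to one form."""
    word = word.translate(EDITORIAL)
    # Decompose so that each accent is a separate combining character,
    # then treat the grave as a positional variant of the acute.
    decomposed = unicodedata.normalize("NFD", word).replace(GRAVE, ACUTE)
    kept = []
    seen_accent = False
    for ch in decomposed:
        if ch in METRICAL:
            continue  # length marks do not make a new word form
        if ch in ACCENTS:
            if seen_accent:
                continue  # drop the second accent of a two-accent word
            seen_accent = True
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))


print(normalise("Jona[tha]n"), normalise("Jŏnă|thān"))
print(normalise("ἄνθρωπός") == normalise("ἄνθρωπος"))
print(normalise("καλὸς") == normalise("καλός"))
```

Breathings are left alone here, so a form with no diacritics at all still counts as a different string from its accented counterpart.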

With that normalisation done, we get the following counts of unique strings in the corpora, for the TLG corpus as of this date.
| Corpus | Word Instances | Word Forms |
|---|---|---|
| TLG + PHI #7 | 102,005,245 | 1,861,358 |
| TLG (viii–XVI) | 95,475,128 | 1,567,892 |
| Mostly Pagan (viii–IV) | 16,312,159 | 605,335 |
| Strictly Ancient (viii–iv) | 5,464,913 | 334,428 |

We can already notice a few things:
  1. The PHI #7 papyri and inscriptions add 7% more word instances, but 19% more word forms; so there's lots of novel strings in the papyri and inscriptions. Because there's lots of new lemmata? Sure. But also because there's lots of misspellings. That's right, a misspelling counts as a unique string; so we'll have some sifting ahead of us.
  2. More word instances is not directly proportional to more word forms: most word instances belong to a handful of very common word forms, and novel word forms follow a law of diminishing returns. Going from Strictly Ancient to the entire TLG multiplies your word count by about 17.5, but it only multiplies your word forms by about 4.7. Because 17 times more text means 17 times more occurrences of "and", and 17 times more occurrences of "the", and only at the bottom of the sieve do you find lots of novel words.
  3. Even factoring all that in, later texts did come up with lots of novel word forms. How many, we'll see later.
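
The proportions in these observations can be checked straight from the table of counts. The snippet below is just that arithmetic; the dictionary keys and the helper name `growth` are made up for the example.

```python
# (word instances, word forms) per corpus, copied from the table above
counts = {
    "TLG + PHI #7": (102_005_245, 1_861_358),
    "TLG": (95_475_128, 1_567_892),
    "Mostly Pagan": (16_312_159, 605_335),
    "Strictly Ancient": (5_464_913, 334_428),
}


def growth(smaller: str, larger: str) -> tuple[float, float]:
    """Return (instance ratio, form ratio) when moving to the larger corpus."""
    instances_small, forms_small = counts[smaller]
    instances_large, forms_large = counts[larger]
    return instances_large / instances_small, forms_large / forms_small


phi_instances, phi_forms = growth("TLG", "TLG + PHI #7")
print(f"PHI #7 adds {phi_instances - 1:.0%} instances, {phi_forms - 1:.0%} forms")

tlg_instances, tlg_forms = growth("Strictly Ancient", "TLG")
print(f"Ancient to TLG: x{tlg_instances:.1f} instances, x{tlg_forms:.1f} forms")
```

On these figures the text grows roughly 17.5-fold between Strictly Ancient and the full TLG, while the word forms grow a little under 5-fold.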

So, what does 1.6 (or 1.8, or 0.6) million unique strings mean? As we'll see, not as much as you might think. Let's take the lemma ἀνήρ, "man". By this criterion, the TLG has no less than 137 distinct word forms corresponding to ἀνήρ. Pretty impressive, when it should just have 11 forms in any given dialect. This is what it should look like in Attic:
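| Case | Sg | Du | Pl |
|---|---|---|---|
| Nom | ἀνήρ | ἄνδρε | ἄνδρες |
| Gen | ἀνδρός | ἀνδροῖν | ἀνδρῶν |
| Dat | ἀνδρί | [Like Gen.Du] | ἀνδράσι |
| Acc | ἄνδρα | [Like Nom.Du] | ἄνδρας |
| Voc | ἄνερ | [Like Nom.Du] | [Like Nom.Pl] |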

So how did we get from 11 forms to 137? For one, yes, we have multiple dialects in there. But that's by no means the main reason; in fact, we're not even going to get to *that* issue, messy as it is, until the next post in the series. Take a look at the 137, this time resplendent in the Greek Font Society Didot typeface:

[Image: the 137 word forms of ἀνήρ, set in GFS Didot]

See the problem? The "unique strings" are case sensitive. Now, there is a reason why I did that: Greek has capitonyms, words that have different definitions depending on whether they are capitalised; so Ὅμηρος is "Homer", but ὅμηρος is "hostage", and Ἱππίας is "Hippias" while ἱππίας is the adjective "of the equestrian [fem]". The distinction needs to be made for lemmatisation, but it is not extremely frequent; and for words that aren't capitonyms, it leads to drastic inflation of word form counts. If we do away with casing in our strings, we get something closer to the spoken (and early written) linguistic reality of Greek. Yes, these word forms become more ambiguous, but we're not left trying to claim that ΑΝΔΡΑΣ, ἄνδρας and Ἄνδρας are different words.
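
As a rough sketch of what doing away with casing means in practice (the function name `fold_case` is invented, and this is not necessarily how the TLG Word Index does it), lowercasing is enough to merge the capitalised variants. Python's `str.lower()` is used rather than `str.casefold()`, because full case folding rewrites vowels with iota subscript as vowel plus iota, which would blur distinctions we still care about.

```python
import unicodedata


def fold_case(forms):
    """Return the set of forms with capitalisation differences removed."""
    # NFC first, so precomposed and decomposed spellings compare equal.
    return {unicodedata.normalize("NFC", form).lower() for form in forms}


attested = {"ΑΝΔΡΑΣ", "ἄνδρας", "Ἄνδρας", "Ὅμηρος", "ὅμηρος"}
folded = fold_case(attested)
print(len(attested), "strings before,", len(folded), "after")
```

Note what survives: the all-caps ΑΝΔΡΑΣ folds to an unaccented lowercase string, which is still distinct from ἄνδρας, and the capitonym pair Ὅμηρος / ὅμηρος collapses into one string.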

Our 137 forms then go down to 105, and our overall counts tumble as follows:
| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,861,358 | 1,698,134 |
| TLG (viii–XVI) | 1,567,892 | 1,408,908 |
| Mostly Pagan (viii–IV) | 605,335 | 572,537 |
| Strictly Ancient (viii–iv) | 334,428 | 319,512 |


That's not enough though: notice that we've eliminated ΑΝΔΡΕΣ, because there's also a lowercase ανδρες, but we've kept ΑΝΔΡΙ, because there is no lowercase ανδρι. But of course, ανδρι is just ἀνδρί shorn of its accents, for whatever reason, and shouldn't be counted separately. If any word form is missing its stress or breathing, we should ignore it if the same word form occurs with a stress or breathing. That will mangle a couple of enclitics, but we'll undo that damage in a couple of counts, and at any rate it will affect only a dozen or so word forms.
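
A minimal sketch of that rule, under the simplifying assumption that "missing its stress or breathing" means "carries no diacritics at all" (the helper names are invented for the example, and the enclitic repair mentioned above is not attempted):

```python
import unicodedata


def strip_marks(form: str) -> str:
    """Return the form with every combining diacritic removed."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_unaccented_duplicates(forms):
    """Ignore a bare form when the same letters also occur with diacritics."""
    forms = {unicodedata.normalize("NFC", form) for form in forms}
    # Letter skeletons of all forms that do carry at least one diacritic.
    marked_skeletons = {strip_marks(f) for f in forms if strip_marks(f) != f}
    return {
        f for f in forms
        if strip_marks(f) != f or f not in marked_skeletons
    }


sample = {"ανδρι", "ἀνδρί", "ανδρες"}
print(sorted(drop_unaccented_duplicates(sample)))
```

Here ανδρι disappears because ἀνδρί is attested, while ανδρες is kept because this toy sample has no accented counterpart for it.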

So, conflating ΑΝΔΡΙ and ανδρι to ἀνδρί, and requiring word forms to have breathings and accents, our 105 forms go down to 86, and our overall counts to:

| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,698,134 | 1,649,083 |
| TLG (viii–XVI) | 1,408,908 | 1,376,016 |
| Mostly Pagan (viii–IV) | 572,537 | 562,744 |
| Strictly Ancient (viii–iv) | 319,512 | 314,887 |


To go any further in interpreting word forms, we have to associate them to particular morphological analyses and lemmata. That means we should restrict our counts to word forms that the lemmatiser recognises, because we can't say much reasonable about the word forms that it doesn't. Right now, with casing intact, the TLG lemmatiser recognises close to 94% of the word forms in the TLG corpus, and 60% of the word forms in PHI #7. That's sacrificing something (6% and 40% of the word forms respectively), but we can't talk about word forms that we don't understand; and a lot of those words won't be words anyway—there's incantations and geometrical lines and all sorts of stuff in there.

(Of course, if you talk to me tomorrow, I'll be throwing out fewer word forms, because the lemmatiser is constantly being made cleverer.)

Eliminating unrecognised word forms and folding case, as we've been doing, gives us:
| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,649,083 | 1,435,391 |
| TLG (viii–XVI) | 1,376,016 | 1,282,298 |
| Mostly Pagan (viii–IV) | 562,744 | 557,574 |
| Strictly Ancient (viii–iv) | 314,887 | 313,354 |

Let's pause here. So far, we've normalised case and (somewhat) accentuation, and we've constrained our word forms to those the lemmatiser understands. Our overall count has gone from 1.86 to 1.44 million. Our Strictly Ancient count has gone from 334 to 313 thousand—that corpus is overall much better behaved, so there's less there to clean up. Notice that getting rid of unrecognised word forms makes a huge dent in PHI #7 (the lemmatiser doesn't like phonetic spellings), but barely a scratch on the Ancient corpora (because Ancient Greek is well documented.)

Now, the lemmatiser does cleaning up of its own when it recognises words.
  • When it sees an apostrophe, it analyses it by filling in the missing vowel: ’νδρες = ἄνδρες.
  • When it is confronted with words unrecognisable on their own, it comes up with alternate spellings which can make sense of the word as spelled—that's how it gets anywhere with phonetically spelled church deeds or papyri. So it understands the monstrous diplomatic spelling δειακαίλἐυονται as διακελεύονται.
    What's a monstrous spelling like that doing in the TLG to begin with? Diplomatically published church deeds. That's why editors normalise. In fact, for all the chaos in the spelling of PHI #7, a lot of the words do have a bracketed normalisation next to them on the CD, and I've used those normalisations rather than the original readings in the counts.
  • If the accentuation has an acute in the fourth last syllable, or something else absurd like that, it analyses the word as if it were accented more sensibly. So it knows ἤλλοιτριωσθησαν is meant to be ἠλλοιτριώσθησαν, and ἦλπισαν is ἤλπισαν.
  • Iota adscripts are respelled as iota subscripts. So the lemmatiser treats ἦιδε the same as ᾖδε.
  • And if a word has undergone crasis, merging two words phonetically, the lemmatiser pries them apart again: κἀνδρῶν is broken up into καὶ ἀνδρῶν, and counted as an instance of ἀνδρῶν.

So the lemmatiser does some normalisation of words: it dismisses what are to it obvious misspellings, and it fills in phonologically missing bits of words. This does not get rid of all potential "misspellings": a lot of them have been added manually to the lemmatiser as variants in the texts. But these normalisations do need to be taken into account when counting word forms. ανιρ is just a phonetic spelling of ἀνήρ, not a novel word form. ἄνδρ’ is not a distinct word form from ἄνδρα, nor is ’νδρες distinct from ἄνδρες, or κἀνδρῶν distinct from ἀνδρῶν.

With the normalisation the lemmatiser can do on its own, the 86 forms of ἀνήρ go down to 50—getting rid of all crases and apostrophes; and the word counts go to:
| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,435,391 | 1,352,303 |
| TLG (viii–XVI) | 1,282,298 | 1,232,209 |
| Mostly Pagan (viii–IV) | 557,574 | 539,469 |
| Strictly Ancient (viii–iv) | 313,354 | 301,005 |


Greek phonology has always featured the nu movable, an /n/ which can occur optionally at the end of some inflections, depending on what phoneme follows it. So "is" is ἐστι before a consonant, and ἐστιν before a vowel—leading to the Classic example of why the Ancients should have spaced their words, ἐστι νοῦς "it's a mind", ἐστιν οὖς "it's an ear" (esti nôːs, estin ôːs).

In other words, this /n/ is a liaison phoneme, and its presence or absence does not make the word distinct. So pairs differing only by a nu movable should not be differentiated as novel word forms (and the lemmatiser knows which /n/s are movable). That takes the 50 forms of ἀνήρ down to 43, and the word counts to:
| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,352,303 | 1,307,842 |
| TLG (viii–XVI) | 1,232,209 | 1,189,688 |
| Mostly Pagan (viii–IV) | 539,469 | 519,498 |
| Strictly Ancient (viii–iv) | 301,005 | 289,812 |
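
To make the nu movable step concrete, here is a toy version. The real lemmatiser knows from its morphological analysis which final /n/s are movable; the ending list below is a crude stand-in invented for this sketch, and so are the function names.

```python
import unicodedata

# ASSUMPTION: a rough list of endings that can take nu movable.
# The lemmatiser decides this from the analysis, not from a suffix list.
MOVABLE_ENDINGS = ("σι", "ξι", "ψι", "τι", "ε")


def strip_marks(form: str) -> str:
    """Return the form with every combining diacritic removed."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_movable_nu(form: str) -> bool:
    """Guess whether the final nu of this form is a nu movable."""
    return form.endswith("ν") and strip_marks(form[:-1]).endswith(MOVABLE_ENDINGS)


def conflate_nu_movable(forms):
    """Drop the nu-final member of each pair differing only by nu movable."""
    forms = {unicodedata.normalize("NFC", form) for form in forms}
    return {f for f in forms if not (has_movable_nu(f) and f[:-1] in forms)}


sample = {"ἀνδράσι", "ἀνδράσιν", "ἐστι", "ἐστιν", "ἀνδρῶν"}
print(sorted(conflate_nu_movable(sample)))
```

Requiring the nu-less partner to be attested keeps the sketch from touching words like ἀνδρῶν, whose final nu is part of the ending.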



The lemmatiser also recognises some strings of Greek that it knows are not words, but abbreviations (Αν is used to abbreviate ἀνήρ at least once), Greek numerals, or geometric lines. (The corpus does include Archimedes and Euclid, after all.) Excluding such non-words takes us to:
| Corpus | Word Forms | Reduced |
|---|---|---|
| TLG + PHI #7 | 1,307,842 | 1,300,717 |
| TLG (viii–XVI) | 1,189,688 | 1,183,120 |
| Mostly Pagan (viii–IV) | 519,498 | 518,321 |
| Strictly Ancient (viii–iv) | 289,812 | 289,275 |



We could keep going, but we won't, because going further is going to be a lot more onerous. There are lots of "wrong" spellings in the Byzantine era:
  • uncertainty about whether to circumflex or acute stems (which count as different word forms here): κῦμα κύμα
  • uncertainty about whether to have double or single consonants (which is what I've been dealing with for the past couple of months): ἁγνόρυτος ἁγνόρρυτος
  • accents on a wrong but legal syllable of a word (which as far as I can tell, Byzantines did Just For Fun): ἄβυσσος ἀβύσσος.
At a guess, that kind of spelling variation may account for 2% of the word forms of the TLG. But this has already gone on plenty, and the point's been made: it's true that there are almost 1.6 million distinct strings as far as the TLG Word Index is concerned, but chop off a quarter of that to get closer to a realistic word form count. And if you limit yourself to just an Ancient Greek corpus, the 1.2 million becomes 500 or 300 thousand word forms.
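
For anyone who wants to see where the quarter comes from, here are the TLG (viii–XVI) counts from each step lined up. The stage labels are my own shorthand for the steps above, and the helper name is invented.

```python
# TLG (viii–XVI) word form counts after each reduction step in this post
stages = [
    ("raw unique strings", 1_567_892),
    ("case folded", 1_408_908),
    ("unaccented duplicates dropped", 1_376_016),
    ("recognised by the lemmatiser", 1_282_298),
    ("lemmatiser normalisations applied", 1_232_209),
    ("nu movable conflated", 1_189_688),
    ("non-words excluded", 1_183_120),
]


def step_losses(stages):
    """Return (label, forms removed, share of previous count) for each step."""
    losses = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        removed = before - after
        losses.append((label, removed, removed / before))
    return losses


for label, removed, share in step_losses(stages):
    print(f"{label:36s} -{removed:>8,} ({share:.1%})")

total_cut = 1 - stages[-1][1] / stages[0][1]
print(f"overall reduction: {total_cut:.1%}")
```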

Is that a lot? Well, no one said that Greek wasn't a highly inflected language. We've already seen at length why being a highly inflected language doesn't automatically give your culture extra IQ points: it's what you say, not how many suffixes you use to say it. Still, at a rough guess, this means between 3 and 6 word forms per lemma on average in the Greek corpus: common verbs will have hundreds of word forms corresponding to them, while the Long Tail of lemmata will have only one or two forms represented in a corpus. That's not bad, but it's not exceptional even among inflected languages, let alone agglutinative ones.

Let's compare Slovenian, which is certainly up there among modern inflected Indo-European languages. Rotovnik et al. used a newspaper corpus comparable to what we're talking about here, and a dictionary of 60,000 lemmata. Now, the thing about lemmata we will see in future episodes is, you never stop counting them: lemma counts are open-ended. All you can do is say, if I know this many lemmata, I can recognise this percentage of word forms in a corpus. So:
| | Ancient + Byzantine Greek (TLG) | Contemporary Slovenian |
|---|---|---|
| Word instances | 95 million | 105 million |
| Word forms | 1.2 million | 660,000 |
| Lemma count | say 205,000? | 60,000 |
| Unrecognised word forms | 6.2% | 8.7% |
| Avg. word forms per 1000 word instances | 12.6 | 6.3 |
| Avg. word forms per lemma | 6 | 10 |

No, this is not a race, and we're not going to call Slovenian better or worse than Atticist Greek. Nor am I going to go into the sophistication of Rotovnik et al.'s word recognition model, which uses sub-words to improve recognition—and goes from 8.7% unrecognised down to 1.2%. I'm already doing some less sophisticated tricks to get as far down as 6.2%, because the TLG corpus is much messier than the corpus of Večer articles. No, the point is that a language like contemporary Slovenian, without three thousand years' and six dialects' worth of weighing down the scales, gives you the same order of magnitude of morphological diversity as do the Three Thousand Years of Greek.

And of course, Three Thousand Years of Greek may have double the word forms of five years of Večer; but once you go to an agglutinative language, Greek's out of the running, because agglutinative languages pack a lot more into their words. I can't get a lemma count from Kamadev Bhanuprasad's study on speech recognition in Telugu; but his newspaper corpus has 20 million word instances, and 615,000 different word forms: 30.8 word forms per thousand word instances, to the TLG's 12.6. Which tells us what we already knew: Greek is not the most morphologically productive language on the planet.
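
The density figures for Greek, Slovenian and Telugu are simple divisions, sketched below with invented function names. One assumption to flag: the Greek 1.2 million already excludes unrecognised forms, so the Slovenian forms-per-lemma figure is computed here on recognised forms only (660,000 less 8.7%), which is one reading that lands on 10; I can't vouch that it is how the figure was originally derived.

```python
def forms_per_thousand(word_forms: int, word_instances: int) -> float:
    """Distinct word forms per 1000 running words."""
    return 1000 * word_forms / word_instances


def forms_per_lemma(word_forms: int, lemmata: int, unrecognised: float = 0.0) -> float:
    """Average word forms per lemma, counting only recognised forms."""
    return word_forms * (1 - unrecognised) / lemmata


greek_density = forms_per_thousand(1_200_000, 95_000_000)
slovenian_density = forms_per_thousand(660_000, 105_000_000)
telugu_density = forms_per_thousand(615_000, 20_000_000)

greek_per_lemma = forms_per_lemma(1_200_000, 205_000)
# ASSUMPTION: discount the 8.7% of Slovenian forms the dictionary misses.
slovenian_per_lemma = forms_per_lemma(660_000, 60_000, unrecognised=0.087)

print(f"Greek     {greek_density:.2f} per 1000, {greek_per_lemma:.2f} per lemma")
print(f"Slovenian {slovenian_density:.2f} per 1000, {slovenian_per_lemma:.2f} per lemma")
print(f"Telugu    {telugu_density:.2f} per 1000")
```

Since the corpora differ in size, and type counts grow more slowly than token counts, these densities are only roughly comparable.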

We've cut the word form count down for the Greek corpus to something more realistic; but "Realistic" is a problematic thing to say anyway, because we've still got to explain how 11 forms of ἀνήρ got to 43. That's the story about the dialectal and diachronic diversity of the corpus, and it will have to wait for the next instalment.

5 comments:

  1. Oooh, cool, inflection as a measure of cultural complexity: Klingon beats Greek any time.

  2. John, you've gone and done it now :-)

    The Klingon Hamlet (which some hellenistikeuthentes may not know I had a hand in): 21575 tokens, 8220 types: 381 word forms per thousand word instances.

    The Crude Terran Forgery Hamlet: 31938 tokens, 5353 types: 168 word forms per thousand word instances.

    Argal, Klingon is twice as good as English (er, Federation Standard). DoS tobnISlu'ta'bogh [Q.E.D.]

    (The disparity between 168 and the figures given above is because of the size of the corpus, and the long tail of the Zipfian distribution.)

  3. It's not my fault if you don't know how to use the Warrior's Tongue properly! It's all about how many inflectional forms there could be, like old Quine and his Latin nolitessituriesco 'I am not beginning to want to flutter hard'. But, as he says, it hardly seems worthwhile.

  4. Quine source. My grouchy inclination is to say that, like other polymaths (Dougie Hofstadter, I'm looking at you), Quine doesn't quite know all the subtleties of the field he's gatecrashing: there is a difference between inflectional and derivational morphology. As can rightly be counterargued, the semantic difference between inflection and derivation is murky (it's one of the problems underlying counting words in dictionaries), and is more a family resemblance than anything else.

    Still, there's something legitimate about tessituriesco and illegitimate about nolitessituriesco—and nolo and nescio do not represent productive derivational morphology, the way the Latin inceptive or even desiderative might have.

    As for the potentialities of the Klingon langue, ah, I know, and have talked about that too. (Lerna IIIb, I believe.) 31 person prefixes in Klingon verbs, 9 classes of verb suffix, *calculates*... gets us to 2 million possible forms. Slice off 5% because -vIS can only occur with -taH, but you're up to Turkish numbers there. What with them both being agglutinative, and all...

  5. While I appreciate the learned argumentation, I must point out that the only proper refutation in this matter is done with broadswords.
