2009-07-15

Lerna VIIb: Lemma counts and proportion of text recognised

We can keep dredging lemmata up to move towards a target of 300,000. But of course for a living language, as Modern Greek now is and as Ancient Greek once was, there is no ceiling in lemmata: people can always make up new words, and do. And because dictionaries will never exhaust what words people come up with, even if they work off a limited corpus, the constructive thing to do is not to say how many lemmata are in a language.

The constructive thing rather is to say, if I know n lemmata in a language, how many word instances in a corpus will I understand? If I know n lemmata, how much of the text I'm confronted with can I make sense of? If a vocabulary of 500 words lets you understand just 70% of all the text you'll see, you're in some trouble. If a vocabulary of 50,000 lets you understand 99.7% of a text, on the other hand, that's one word instance out of three hundred that you'll draw a blank on. Assuming 500 words a page, that's around three unknown words every couple of pages. That's still a lot: if you're having to run to the dictionary once a page, you've got catching up to do in the Word-A-Day club. One word every ten pages—say 99.98% of all word instances: that's probably more reasonable.
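
To spell out that arithmetic, here's a quick sketch, assuming the 500-words-a-page figure I'm using throughout:

```python
# A quick sketch of the coverage arithmetic, assuming 500 words to a page.
WORDS_PER_PAGE = 500

def unknown_words_per_page(coverage):
    """Expected unrecognised word instances per page, given the proportion
    of word instances you recognise (e.g. 0.997 for 99.7%)."""
    return (1 - coverage) * WORDS_PER_PAGE

def pages_per_unknown_word(coverage):
    """Expected pages of text between one unknown word and the next."""
    return 1 / ((1 - coverage) * WORDS_PER_PAGE)

print(round(unknown_words_per_page(0.997), 2))   # 1.5: about three unknown words every couple of pages
print(round(pages_per_unknown_word(0.9998), 2))  # 10.0: one unknown word every ten pages
```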

That gives you a statistic of how many word instances are recognised; but when you're listing the words you don't know yet, you tend to list unfamiliar word forms, not word instances. So if you come away from your reading of The Superior Person's Little Book of Words with a list of words you need to look up in the dictionary, you won't count contrafibularity three times and floccipaucinihilation seven times. You only have to look up contrafibularity once to understand it, so you'll list it as a single unknown word form.

The proportion of recognised word forms is going to be much lower than the proportion of recognised word instances. The word instances will give you credit for knowing words like and and the and of: hey presto, with those three words, you already understand 20% of all printed words of English! The words you won't know will tend to be one-offs, occurring just once or twice in a text: it's rare words, which people don't come across a lot in texts, that they won't have needed to learn. But with word forms, and and the and of don't count as 20% of all printed words of English: they only count as three word forms. The unfamiliar one-offs will make a much larger dent in the size of your vocabulary than in the proportion of a page you can grok.
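
To make the word instance versus word form distinction concrete, here's a toy count; the sentence is made up, the point is just the counting:

```python
from collections import Counter

# Toy illustration of word instances (tokens) versus word forms (types).
text = "the cat and the dog and the contrafibularity of it all".split()

instances = len(text)   # 11 word instances
forms = Counter(text)   # 8 distinct word forms

print(instances, len(forms))     # 11 8
print(forms["the"] / instances)  # 'the' alone is over a quarter of the instances,
                                 # but only one of the eight word forms
```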

Again, the point of this is not to say that there are n words in a language, which is deeply problematic in ways I've already gone into at great length. It's to say that, if you know n lemmata, you will understand n1% of the vocabulary of a corpus (its word forms), and n2% of all the text in a corpus (its word instances). The value of n can go up or down, and the proportions of words you understand can go up or down with it. This means two things which are more useful to keep in mind than any grand How Many Words statement.
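
As a sketch of what those two statistics amount to, assuming (hypothetically) that you have the corpus as a list of (word form, lemma) pairs and a set of lemmata you know; it ignores, for now, the ambiguous forms I'll get to below:

```python
from collections import Counter

def recognition_rates(lemmatised_corpus, known_lemmata):
    """Given a corpus as (word_form, lemma) pairs and the set of lemmata you
    know, return the proportion of word forms recognised (n1) and the
    proportion of word instances recognised (n2)."""
    lemma_of = {}          # each distinct word form and its lemma
    instances = Counter()  # how often each word form occurs
    for form, lemma in lemmatised_corpus:
        lemma_of[form] = lemma
        instances[form] += 1

    known_forms = {f for f, lemma in lemma_of.items() if lemma in known_lemmata}
    n1 = len(known_forms) / len(lemma_of)
    n2 = sum(instances[f] for f in known_forms) / sum(instances.values())
    return n1, n2
```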

First, it's not about how many words there are ever, so much as how many words are *useful* to know. If there are fifty words which were made up for Joe Blow's autobiography, and Joe Blow's autobiography has never been published or indeed sighted outside his kitchen, then those fifty words will not form part of your corpus, so they need not count. Or, if there are three hundred words of phrenology which noone has used for the past century, and which only got used once in a blue moon even then, then even if those three hundred words do show up in your corpus, they will be marginal enough to cut out most of the time. Tying vocabulary size to recognition allows you to limit the lexicon to what you will actually use, and how frequently you will use it.

The second realisation is the admission this makes, that the growth of a vocabulary is asymptotic. People can keep making up words, or using words in increasingly niche and esoteric contexts. If n words let you understand 99% of the vocabulary, then you may well be able to come up with 5n words to recognise 99.9% of the vocabulary, and 20n to recognise 99.99% of the vocabulary, and even 100n to recognise 99.999% of the vocabulary. But by the time you're up to 99.99% recognised words, you can reasonably ask whether it's worth spending an extra five years building up your vocabulary, just to deal with the remaining 0.01%.

The answer is no. Dictionaries do not wait forever before they decide they're done: they have a large corpus, and dip into it fairly eclectically, but they do miss stuff (not just "sausage" in that Blackadder episode on Johnson's Dictionary). And that's OK, if the word is obscure enough for the dictionary's purposes. Dictionaries employ some subjectivity in leaving out words until they think they're worth taking seriously; but the way a corpus is put together usually filters the obscurities out for you already. If you're relying on printed text to prove a word is worth describing, you're leaving out all the made up and nonce words and speech errors that were never written down. Of course, that's an elitist way of viewing language, and print is nowhere near the barrier it used to be. But it does cut your corpus down to something manageable.

If you're working on a Classical language, the cruelty of Time (and Bastard Crusader scum), the indifference of scribes, and the snootiness of schoolmasters do plenty of filtering for you as well. That's why the PHI #7 corpus, which was not subject to the same filters as the literary corpus, has so much distinct vocabulary.

(For an example of why the bones of the Bastard Fourth-Crusader scum should boil in pitch in eternity, see the fate of the text of Ctesias and the other manuscripts unfortunate enough to be in Constantinople in 1204.)

So what sort of recognition rates do the figures I've been quoting represent? I'm using the five corpora as before, and I'm also differentiating between all word forms, and just lowercase word forms—because proper name recognition lags behind recognition of common words in general. Counting only lowercase word forms is a somewhat crude way of leaving out proper names, and there are a few TLG editions which follow the e.e. cummings stylings of their manuscripts, leaving names lowercase. But it's all about the indicative figures, always.
| Corpus | % Instances Recognised | Unrecognised Instances Ratio | % Lowercase Instances Recognised | Unrecognised Lowercase Instances Ratio |
| --- | --- | --- | --- | --- |
| TLG + PHI #7 | 99.66% | 1:294 | 99.86% | 1:740 |
| TLG | 99.84% | 1:624 | 99.915% | 1:1170 |
| LSJ | 99.37% | 1:158 | 99.83% | 1:585 |
| Mostly Pagan | 99.964% | 1:2759 | 99.979% | 1:4750 |
| Strictly Classical | 99.967% | 1:2993 | 99.975% | 1:4019 |

| Corpus | % Forms Recognised | Unrecognised Forms Ratio | % Lowercase Forms Recognised | Unrecognised Lowercase Forms Ratio |
| --- | --- | --- | --- | --- |
| TLG + PHI #7 | 89.56% | 1:9.6 | 94.33% | 1:17.6 |
| TLG | 93.91% | 1:16.4 | 95.59% | 1:22.7 |
| LSJ | 89.51% | 1:9.5 | 95.77% | 1:23.6 |
| Mostly Pagan | 99.16% | 1:118 | 99.42% | 1:172 |
| Strictly Classical | 99.56% | 1:226 | 99.62% | 1:263 |
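
The percentage columns and the ratio columns are the same figures expressed two ways; the conversion, as a quick sketch:

```python
# The percentage and ratio columns are two ways of writing the same figure.
def one_in_n(recognised_pct):
    """How many word forms or instances you get through, on average,
    for each one the lemmatiser does not recognise."""
    return 1 / (1 - recognised_pct / 100)

print(round(one_in_n(99.66)))     # 294: the TLG + PHI #7 instance column
print(round(one_in_n(89.56), 1))  # 9.6: the TLG + PHI #7 form column
```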

Let's go through this slowly.

The lemmatiser understands the Strictly Classical corpus—literary Greek up to IV BC—quite well. It only fails to pick up 1 in every 226 distinct word forms, which means you have to go through on average 2993 word instances—say six pages of text—before you hit a word it does not understand. But you can ignore capitalised words, because they're typically proper names, and we don't expect to have those in our vocabulary anyway. You can make sense of "Alcidamophron slaughtered the servant of Tlesipator" more readily than you can "Jack fnocilurphed the smorchnepot of Jill". If we do ignore capitalised words, the lemmatiser fails to understand just 1 in 263 word forms, and gets through over eight pages of text on average before it finds a problem word. As machine understanding of morphology goes, that's not bad at all.

So the 55,000 lemmata that the lemmatiser knows of for the Strictly Classical corpus get you through eight pages of Greek on average as smooth sailing. And that is the real meaning of "55,000" lemmata here. Of course, that's an eight page average across a corpus that is still not terribly homogeneous; and some bits of the corpus are going to be understood a lot better than others. The lemmatiser understands all 199,000 word instances in Homer, for instance: 400 pages by our reckoning, not just 8. On the other hand, the Strictly Classical corpus also includes Aeschylus, whose transmission has been frequently corrupted, and where the lemmatiser stumbles on 63 word instances out of 74,000—once every couple of pages.

With the Mostly Pagan corpus, which sticks to literary texts up to IV AD, the lemmatiser understands the corpus almost as well: 76,000 lemmata give you all but 1 in 172 word forms, and in fact, because the later texts are slightly more homogeneous linguistically, almost 10 pages of text on average before there is a problem word. So 76,000 lemmata for Mostly Pagan is about as meaningful a claim as 55,000 lemmata for Strictly Classical: it lets you understand almost the same proportion of text in the corpus. There are bound to be more lemmata than that in the corpus, which the dictionaries have not officially recorded; but it's not going to be overwhelmingly more. I'd guessed maybe 500 lemmata underestimated for the Strictly Classical corpus, with 1,500 unrecognised word forms. The Mostly Pagan corpus has 5,000 unrecognised word forms, so I'll guess maybe 2,000 underestimated lemmata.

The LSJ corpus is much less well understood, partly because it includes technical writing, but mostly because it includes the more unruly texts from the inscriptions and papyri, with their distinct vocabulary and grammars, and confusing spellings. We claimed 124,000 lemmata here, but that only gets you one word form unrecognised per 23; including potential proper names, it's as bad as one word form unrecognised in ten. And you'll be stumbling over one word every page and a bit. Our unrecognised word forms are now up to 35,000 lowercase forms. That does not necessarily mean 10,000 more lemmata unaccounted for, given the problems in spelling and grammar; so I'm reluctant to guess how many more lemmata you need to get to the same level of recognition as with the Strictly Classical corpus. But there are clearly more lemmata to go.

You can see the trouble the papyri and inscriptions bring more clearly in the last two counts, which include and exclude them. Without them, the TLG corpus has one lowercase word form in 23 unrecognised, and an unrecognised word every two pages and a bit. That's not that bad for the claimed 162,000 lemmata, given the bewildering diversity of texts in the corpus. Let the inscriptions and papyri back in, and you now miss a word form for every 18, and a word every one and a half pages. And that's for increasing the size of the corpus by just a twentieth.

So the lemma counts are more reliable for some periods of Greek and less for others: we can tell how much text they allow you to recognise in different corpora, and we can allow that there are cut-offs for how many lemmata it is useful to know in a corpus. The lemma count is still not open-ended, so long as the corpus is finite. (That's the thing about langue as opposed to parole: the corpus size of *potential* text, using language as a theoretical system, is infinite.) And the word form coverage of the lemmatiser will keep improving, as an ongoing project; as I've already mentioned, TLG word form recognition has gone up from 90% to 94% in the past two years. But the lemma count does peter out.

So let me give one last batch of numbers to illustrate the relativity of lemma counts: how much less of a corpus do we understand, if we cut down on the number of lemmata. I'll do that using the word instances per lemma count for the TLG. Because there is a fair bit of ambiguity in Greek morphology, many word forms are ambiguous between two lemmata (and a few between more than two); so there is some double counting of instances to be had. As a result, the 202,000 lemmata recognised in the TLG corpus—proper names and not—account for 112 million word instances, though the corpus really contains only 95 million.
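
A made-up sketch of how that double counting arises; the forms, lemmata and counts here are all hypothetical, not taken from the TLG:

```python
from collections import Counter

# Toy illustration of the double counting: an ambiguous word form
# gets its instances credited to every lemma it can belong to.
form_instances = Counter({"on": 1000, "ion": 200})  # hypothetical counts
form_lemmata = {
    "on": ["eimi", "hos"],  # suppose this form parses under two lemmata
    "ion": ["ion"],
}

per_lemma = Counter()
for form, lemmata in form_lemmata.items():
    for lemma in lemmata:
        per_lemma[lemma] += form_instances[form]

print(sum(form_instances.values()))  # 1200 instances actually in the corpus
print(sum(per_lemma.values()))       # 2200 instances credited across lemmata
```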

So if we take 112 million as our baseline, how many instances are accounted for by admitting fewer lemmata?

| Lemmata | Word Instances | % of Baseline |
| --- | --- | --- |
| 100 | 61,166,253 | 54.44% |
| 500 | 78,932,451 | 70.25% |
| 1,000 | 86,575,286 | 77.06% |
| 2,000 | 94,016,671 | 83.68% |
| 5,000 | 102,370,243 | 91.11% |
| 10,000 | 106,926,324 | 95.17% |
| 20,000 | 109,884,248 | 97.80% |
| 50,000 | 111,727,544 | 99.44% |
| 100,000 | 112,191,095 | 99.85% |
| 120,000 | 112,251,181 | 99.91% |
| 150,000 | 112,302,895 | 99.95% |
| 180,000 | 112,332,895 | 99.98% |
| 190,000 | 112,342,895 | 99.98% |
| 202,000 | 112,354,703 | 100% |


There's your Zipf's Law in action. The table neatly parallels what Wikipedia says for vocabulary size, quoting a 1989 paper presumably on English: "We need to understand about 95% of a text in order to gain close to full understanding and it looks like one needs to know more than 10,000 words for that."
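
To see that shape without the real data, here's a sketch with synthetic Zipf frequencies (frequency proportional to 1/rank); it reproduces the diminishing returns of the table, though not its exact figures:

```python
# Synthetic Zipf frequencies (proportional to 1/rank) for 202,000 lemmata,
# to show the shape of the cumulative coverage curve -- not the real TLG counts.
N_LEMMATA = 202_000
freqs = [1 / rank for rank in range(1, N_LEMMATA + 1)]
total = sum(freqs)

checkpoints = {100, 1_000, 10_000, 100_000, 202_000}
cumulative = 0.0
for rank, freq in enumerate(freqs, start=1):
    cumulative += freq
    if rank in checkpoints:
        print(f"{rank:>7} lemmata: {100 * cumulative / total:.1f}% of instances")
```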

The difference between 100,000 and 200,000 lemmata accounts for just 163,000 word instances out of 112 million, around one word in 700. The difference between 180,000 and 200,000 accounts for less than a word every ten pages. So there's a very very long tail of increasingly rare words: the last 60,000 lemmata each occur just once in the 95 million word corpus, and the last 25,000 lemmata before that occur just twice. There's a *lot* of these one-offs, which is why all together they account for an unknown word every four pages. And we need dictionaries for words we don't come across every day, not words we do.
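
Those back-of-the-envelope figures fall straight out of the table; a quick check, again assuming 500 words to a page:

```python
# Quick check of the back-of-the-envelope figures, from the table above,
# assuming 500 words to a page.
BASELINE = 112_354_703
WORDS_PER_PAGE = 500

tail_100k = BASELINE - 112_191_095  # instances added from 100,000 to 202,000 lemmata
print(tail_100k, round(BASELINE / tail_100k))           # 163608, roughly 1 word in 700

tail_180k = BASELINE - 112_332_895  # instances added from 180,000 to 202,000 lemmata
print(round(BASELINE / tail_180k / WORDS_PER_PAGE, 1))  # 10.3: less than a word every ten pages

hapaxes = 60_000                    # the last 60,000 lemmata occur once each
print(round(BASELINE / hapaxes / WORDS_PER_PAGE, 1))    # 3.7: an unknown word every four pages or so
```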

Still, they are one-offs (hapaxes). They're not useless—they were clearly useful to whoever used them that one time in the 2,500 year span of the corpus. But noone needs all 60,000 of them at once. And by the time you're down to lemmata that happen just once or twice in a roomful of books (a small room admittedly), you can appreciate why real human beings walk around with close to 20,000 lemmata in their skulls, and not 200,000. For the rest, we have guessing from context (and related words); and we have dictionaries. And once Classical Greek became a bookish language, the Byzantines used dictionaries too.

3 comments:

  1. Well, I don't know. I once decided to decipher a poster written in French, a language in which my natural and unfettered capacity for human communication is decidedly fettered. I was able to understand the whole thing, with some reference to the pictures, with the exception of one preposition: did it mean 'before' or 'after' the date which followed it? I posted to Conlang, and the answer was that it meant 'as of' the date in question, and so in effect 'before'.

  2. Keep on with this great presentation... It's a treasure of well-presented knowledge even for a layman like me... With your every post I learn two or three (or more) new things, I get some new insight.
    Today, f.e., and among many others, I found it interesting that the percentages of word knowledge for the good understanding of a language are similar to those for a good quality industrial production (six-sigma theories and so on). Life's own quality controls?

  3. An inevitable consequence of words, like defects, having normal distributions, I'd have thought. But you know more about it than I do; feel free to expand!
