2009-06-03

Lerna II: Definitions

I've started a series of posts on counting words in Greek (see: Lerna I). This is the kind of thing that revokes your linguistics cabal membership card, so I have to add that the posts are really about the journey to counting words, and the questions that come up along the way, rather than the destination. There'll be at least a couple of posts about why the destination doesn't make a lot of sense. I also don't hold out any hope that any heads of the Hydra this chops off will stay chopped off: too many people Want To Believe that there are 200 times more words in Greek than English; and if IT professionals can be duped by a text claiming that computers can only communicate with each other in Ancient Greek, no amount of numbers will convince them to the contrary.

(I owe a lot of these insights to the commenters at Team Fortier, aka the magnificent Nikos Sarantakos' blog's host and its commenters, and this thread is of course a shout-out to them.)

In this post, some definitions, to introduce the concepts we're counting. Many of you will already know all of this, but I'm writing for an undefined audience here.

Word counts


An individual instance of a word is a token—or an instance. The count of instances is what you get when you hit Word Count on Word, or wc -w on Unix. They're what you're aiming towards when you have a 1,000-word essay.

Now, counting word instances in a language doesn't prove much of anything. The larger the population of language-speakers, the more words spoken in a language. By that token, since there have been more speakers of English in the history of the world than there have been speakers of Greek, then there have been more word instances of English than of Greek, case closed, thank you for playing. Like human population, the count of word instances ever spoken has gone up exponentially (so far), and until now the number of words ever spoken has dwarfed the number of words ever written. A Google researcher guesstimated in 2007 that there were 100 trillion words (written words) in Google's cache; and that count isn't going down. Many a wag commented that half those words are PORN or XXX; but rest assured that most of those words are still in English.

You wanna talk to me about the 90 million words of Greek?

That's unfair, I hear you say? Because the words of English are barbarous and the words of Greek inspired, and half those 100 trillion words are PORN or XXX anyway? Well, with 15 million Greek speakers alive today, the number of times someone says μ#@#$# daily exceeds the word count of all of Ancient Greek literature. Come again? How dare I debase Our Ancient Ancestors (ΑΗΠ) to the gutter talk of our modern decrepitude? As if no philosophy is ever discussed in Modern Greek; and as if everyone who ever spoke Ancient Greek, Attic, Doric or Pamphylian, was in Socratic dialogue mode 24/7, and never once abused their fellows. And if you want your word count of Ancient Greek free of scurrility, then get to work: time and censors have rid us of a lot of antick filth, but there's still some Hipponax and Aristophanes they've spared for you to expunge.

Trust me, word counts are not really the game you want to get into to prove the superiority of Greekdom. It's a game you can't win, and a game you shouldn't be trying to win anyway. There are more serious things than this that you can do with your enthusiasm for Hellenism.

Of course, those aren't the calculations we go into when we count words, and those kinds of counts are of little interest to anyone, with the possible exception of computer models of language change. When we do go about counting words (in the process of doing something else), there are two kinds of construct we deal with. What a language allows in potentiality, as a system; and what evidence of language use you actually have in your records, as a sample. The old Saussurean langue/parole opposition, in other words.

If you're dealing with language as a theoretical system, the number of words you can put together with it is infinite. And it's no less infinite whether you're dealing with Greek, English, or Umbu-Ungu. So word counts don't even come into it. If you're dealing with some concrete records of a language, on the other hand, you're using a corpus: a body of texts assembled for linguistic research. The corpus is not going to represent all possibilities of a language, which are after all infinite. But typically it will represent enough to tell you what is going on in a langue. Or a range of langues, depending on what's gone into the corpus, and how you define your language.

Corpora don't normally try or need to be exhaustive, but they do need to be large enough to tell you stuff, and small enough to be feasible—especially if you're putting not just words into them, but markup. The British National Corpus, for instance, has 100 million words of British spoken and written text, and it's annotated for morphology and syntax. The Corpus of Contemporary American English has 350 million words. With the internet, of course, what's small enough to be feasible has grown a little. (See Google, 100 Trillion.)

As you have no doubt gathered, I work for the Thesaurus Linguae Graecae, and the TLG has been building a corpus of literary Greek for close to 40 years. Like any corpus, the TLG has a word count, and that word count is now well over 90 million words. That's the count that has made its way to the Lernaean text, as had earlier counts in earlier viral strains of the text. But of course, the TLG is a corpus: it's the subset of all words in the Greek language that happened to have been
  1. written down
  2. preserved to the present day
  3. considered of literary or linguistic interest
  4. published in a scholarly edition
  5. digitised for the University of California to date
  6. dated before Modern times. (The cutoff is moving forwards from 1453, but won't soon go past 1669.)
There have been a lot more Greek words in history than that. Then again, like we said, there's been a lot more English words in history than that too. So on a global scale, the number 90 million does not prove superiority or perfection. If there is value in the texts of Classical Greek, that value does not come from quantity.

Wordforms


When dictionaries count words, word instances are not what they count. They don't count wordforms either, but let's start with wordforms anyway.

There's lots of words in text, but most words get used more than once, and some words get used a *lot* more than once. If you want to count words, the first constructive thing to do is to count how many of those are distinct words, wordforms. That's the list of words that begins to tell you what the language is actually coming up with as a system.

Having a million words of English in a corpus doesn't tell you much in itself, precisely because a small number of common, functional wordforms accounts for a lot of word instances. In those million words of English, you'll find 70,000 of those words are "the", another 36,000 are "of", and 135 distinct wordforms are enough to account for half your word count: there is an inverse proportionality of word frequency to word frequency ranking, as Zipf's Law finds. That's why tag clouds routinely take out stopwords: you don't need to be told that your text contains a lot of instances of "the" and "of".
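The "half your word count" observation can be checked on any text with a few lines of Python. What follows is a minimal sketch, not how any particular corpus was counted: the function name is hypothetical, and the tokeniser (lowercase, split on whitespace) is a deliberately crude assumption.

```python
from collections import Counter

def forms_covering_half(text: str) -> int:
    """Return how many of the most frequent wordforms are needed
    to account for at least half of all word instances in `text`."""
    tokens = text.lower().split()   # crude tokeniser: an assumption
    counts = Counter(tokens)
    target = len(tokens) / 2
    covered = 0
    # Walk down the frequency ranking, accumulating instances.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target:
            return rank
    return 0  # empty text: nothing to cover

sample = "the cat and the dog and the bird saw the fish"
print(forms_covering_half(sample))  # 2: "the" and "and" cover 6 of 11 instances
```

On a real corpus the returned rank stays tiny relative to the number of distinct wordforms, which is the Zipfian skew in action.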

You can see that principle at work by checking out this tag cloud, generated by Notis Toufexis for an excerpt of Ptochoprodromos. The biggest words in the Wordle tag cloud are καὶ, νὰ, τὰ, "and, to, the". They're the biggest by far, because they're so common in Early Modern Greek: but you have to squint past them to see the content words that actually reveal Ptochoprodromos' frame of mind: ἀνάθεμαν, τεχνίτην, κρασίν, στάμενον "damn, artisan, wine, coin".

No different in the TLG corpus of course; of the most frequent wordforms in the corpus, you don't hit a verb until #43 εἶναι "to be", a noun until #103 Θεοῦ "of God" (as we'll see, there's a *lot* of Christian texts in the TLG), and an unambiguous adjective until #158 ἄλλων "of others" and #162 πολλά "many". That's a lot of particles and pronouns and articles to go through before you get a content word.

We can see the repetitiveness of word forms in practice in Ancient Greek. Here's the first 1000 words of Thucydides, resplendent in 3 point Alexander font:



Lots of red words. Now, let's blue out all repetitions of word forms in the passage:



Of the 1000 words, 558 are unique. So 1000 words translates to 558 wordforms. As you can imagine, the more text you have, the more you're going to repeat words, so the 90-odd million word instances of the TLG corpus translate to 1.5 million distinct strings. (And as we'll see when we look at those counts more closely, not all 1.5 million wordforms are equal.)
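The step from word instances to wordforms is just the step from a list to a set. Here is a minimal Python sketch of it; the function name is hypothetical, and splitting on whitespace (with no handling of punctuation or capitalisation) is an assumption made to keep the example short.

```python
def count_instances_and_wordforms(text: str) -> tuple[int, int]:
    """Return (word instances, distinct wordforms) for `text`."""
    tokens = text.split()               # what `wc -w` would count
    return len(tokens), len(set(tokens))

sample = "to be or not to be"
print(count_instances_and_wordforms(sample))  # (6, 4)
```

Run over a longer text, the second number grows much more slowly than the first, which is why 90-odd million instances can shrink to 1.5 million strings.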

1.5 million is smaller than 90 million, I'd like to point out.

Lemmata


We're not done though. Dictionaries of English don't list dog, dog's, dogs as separate words, nor take, takes, took, taken. Dictionaries of Greek don't list κύων, κυνός, κυνί as separate, nor λαμβάνω, λαμβάνει, ἔλαβον, εἰλημμένος. These are grammatical variants of the same word, and outside the grammatical function they serve in a sentence, they don't mean something different. So they're considered the same word underlyingly: the same lexeme, or dictionary word, grouped in a dictionary under the same lemma, or headword.

That's not all the wordforms you conflate under a lemma. Some wordforms don't differ grammatically, but are merely phonological alternates: the distinction between εἰσί, εἰσι, and εἰσιν in Ancient Greek, or κυσί, κυσίν, κυσὶν depends only on the sounds of the following word. (There's not a lot of this in English, but the difference between a and an is the same principle.) This kind of variation, you could fold into the morphological variation (inflection) we've just seen; but they don't carry the same weight. Ancient Greek allows five cases and three numbers for its nouns, so "dog" can theoretically appear in 15 forms. (In reality, it's never more than 11 forms, even if there's 15 grammatical meanings.) But the phonological variation of κυσί, κυσίν, κυσὶν does not contribute to that count.
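Folding phonological alternates together before counting can be sketched in Python with Unicode decomposition. This is a toy under loud assumptions: the function name is hypothetical, the only rules are "grave becomes acute" and "drop a final ν after -σι", and the second rule is over-eager, since it would also strip a ν that is a real case ending.

```python
import unicodedata

GRAVE, ACUTE = "\u0300", "\u0301"   # combining accents after NFD

def fold_phonological(wordform: str) -> str:
    """Collapse grave/acute and movable-nu alternates (toy rules only)."""
    decomposed = unicodedata.normalize("NFD", wordform)
    # A grave is only an acute conditioned by the following word.
    decomposed = decomposed.replace(GRAVE, ACUTE)
    # Toy movable-nu rule: strip a final nu after sigma-iota.
    for ending in ("σι" + ACUTE + "ν", "σιν"):
        if decomposed.endswith(ending):
            decomposed = decomposed[:-1]
            break
    return unicodedata.normalize("NFC", decomposed)

variants = ["κυσί", "κυσίν", "κυσὶν"]
print({fold_phonological(w) for w in variants})  # a single folded form
```

Working on the decomposed (NFD) string means the accent is a separate code point, so the rules don't need to enumerate every precomposed accented vowel.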

Some wordforms get conflated under the same lemma because they differ from each other only as spelling alternates. You'd be hard put to say conjurer and conjuror, let alone colour and color, are distinct lexemes. Spelling variation already happens in Ancient Greek—e.g. writing iota adscripts vs. subscripts, or the spelling variation between λειπ- and λιπ- in compounds. Once the language has changed enough that the spelling is no longer phonetic—and once editors aren't as concerned to normalise texts—you get a lot of spelling variation: what you might choose to call "misspellings", if that was a productive thing to do. (If you're respecting what the editors chose not to normalise, then it isn't: you deal with the words as you find them.)

We don't get misspellings in our text of Thucydides, but we do get lots of grammatical variants. The words left in red above, as unique wordforms, include Ἀθηναῖος "Athenian" and Ἀθηναίων "of Athenians", μέγας "great (masc.)" and μεγίστη "greatest (fem.)", ᾤκουν "they dwelled" and οἰκουμένη "dwelled in (fem.)". If we shade those grammatical variants in green, and leave red just for distinct lemmata, we get something a lot less red than we started with:



1000 words, 558 wordforms, 409 lemmata. And of course the more text you have, the more lemmata you're going to repeat. The 1.5 million odd wordforms of the TLG as a corpus boil down to the neighbourhood of 200,000 lemmata.
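The three levels of counting (instances, wordforms, lemmata) can be lined up in a short Python sketch. The lookup table below is a hypothetical hand-made stand-in covering only the forms quoted from the passage; a real lemmatiser works from morphological rules and a lexicon, not a six-line dictionary.

```python
import unicodedata

def nfc(s: str) -> str:
    """Normalise so canonically equivalent Greek strings compare equal."""
    return unicodedata.normalize("NFC", s)

# Hypothetical toy table: wordform -> lemma (an assumption, for illustration).
LEMMA_OF = {nfc(form): nfc(lemma) for form, lemma in [
    ("Ἀθηναῖος", "Ἀθηναῖος"),
    ("Ἀθηναίων", "Ἀθηναῖος"),
    ("μέγας", "μέγας"),
    ("μεγίστη", "μέγας"),
    ("ᾤκουν", "οἰκέω"),
    ("οἰκουμένη", "οἰκέω"),
]}

def count_levels(tokens: list[str]) -> tuple[int, int, int]:
    """Return (word instances, distinct wordforms, distinct lemmata)."""
    tokens = [nfc(t) for t in tokens]
    wordforms = set(tokens)
    # Unknown forms fall back to being their own lemma.
    lemmata = {LEMMA_OF.get(form, form) for form in wordforms}
    return len(tokens), len(wordforms), len(lemmata)

tokens = ["μέγας", "Ἀθηναίων", "μεγίστη", "Ἀθηναῖος",
          "μέγας", "ᾤκουν", "οἰκουμένη"]
print(count_levels(tokens))  # (7, 6, 3)
```

The fallback for unknown forms is a design choice that inflates the lemma count; a stricter sketch would report unrecognised forms separately instead.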

It's a neighbourhood, and counting lemmata is a very problematic thing to do. Moreover, that's a count specific to a corpus: the bigger your corpus, obviously, the more lemmata it's going to contain, especially as the TLG adds more Early Modern Greek texts—although the Law of Diminishing Returns does apply. And of course, if you're thinking about langue and not parole—what is possible in a language rather than what you happen to have recorded—then any counts are pointless: you can make up words in any language, and people do, all the time. Some languages let you make up words more readily than others do: German more than French, Greek more than Latin, Sanskrit more than Greek. That doesn't really prove much of course. In fact, none of these counts really proves that much. But that's a topic for a separate post.

Still, when dictionaries count words, they work off a corpus, and dictionary words, lemmata, are what they count. When the Lernaean text talks about 490,000 words of English, what it's counting are lemmata. Basic decency tells you to compare lemma counts and lemma counts, not word instance counts and lemma counts. Common sense tells you that the lemma count of Greek is going to be in the same order of magnitude as English, not 200 times greater: and common sense, away from the fevered swamps of Lerna, is correct.

Nonetheless, counting lemmata is problematic. You'll see 490,000 cited for English; you'll also see 172,000, 615,000, and a million. I said around 200,000 lemmata for Greek, but I said 227,000 last year, in some contexts I should be saying 116,000, and in some much shakier contexts discussed below, I could even speak of 1.5 million (but won't). I'll go into the reasons why, and what aspects of Greek they reflect, in future posts. But I'll give some background on why counting lemmata is difficult.

Derived lemmata


In many languages, a new word can be made up based on an old word. So given employ in English, you can derive other words—other lemmata: employer, employee, employment, employable, employability. Some of the rules for forming new words are live in the language: given ginormous, you can form ginormousness, and given abdominoplasty, you can form abdominoplastic. Some of those rules are frozen in time, and can't be used to make up new words now; truth comes from true, but you can't get coolth from cool. (OK, you can, but if the style guide calls it "tiresomely jocular", that tells you something's wrong with trying to reintroduce -th as a suffix.)

Greek can derive lexemes from other lexemes, just like English can. In fact, because a lot of legitimate derived lemmata are not included in published dictionaries, the TLG lemmatiser allows words to be recognised based on these derivation rules. Any verb can, in theory, generate a couple of dozen nouns and adjectives; so if you crank those rules all the way to eleven, you could say that Greek in theory has over 1.5 million lexemes. Not even a hundredth of those potential lexemes are of any use in Greek text, though: they really are lexemes only in theory, and the theory only occasionally translates into practice. I'd like to think no one is going to start quoting me as proving Greek has 1.5 million lexemes after all. I'd *like* to think...

Anyway: concerned to save space, and to lump related forms together, many dictionaries list derived lemmata like this in the same entry as the original form. See how the dictionary.com dictionaries treat cleverness, for example. LSJ does this a lot: the entry for Ἀντίγονος "Antigonus" also includes Ἀντιγόνειος and Ἀντιγονικός "Antigonian", Ἀντιγονίς "an Antigonus cup", and Ἀντιγονίζω "to side with Antigonus". That's four or five lexemes, even if they're all related. But that means that the count of headwords in a dictionary is not the same as the count of lexemes.
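The gap between headword count and lexeme count is easy to see once an entry is modelled as a headword plus the derived forms tucked inside it. A minimal Python sketch follows; the data structure and function name are assumptions for illustration, and the single entry just restates the LSJ example.

```python
# Toy model: headword -> derived lexemes listed inside the same entry.
ENTRIES = {
    "Ἀντίγονος": ["Ἀντιγόνειος", "Ἀντιγονικός", "Ἀντιγονίς", "Ἀντιγονίζω"],
}

def headword_and_lexeme_counts(entries: dict[str, list[str]]) -> tuple[int, int]:
    """Return (number of headwords, number of lexemes) for a toy dictionary."""
    headwords = len(entries)
    # Each entry contributes its headword plus every derived form it lists.
    lexemes = sum(1 + len(derived) for derived in entries.values())
    return headwords, lexemes

print(headword_and_lexeme_counts(ENTRIES))  # (1, 5)
```

Which of the two numbers a dictionary advertises as its "word count" depends entirely on its layout policy, not on the language.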

Variants


So far, the variation between wordforms under a lemma has been only grammatical or spelling; and as long as you're looking at a single, uniform version of a language, that's mostly enough to make sense of it. But a lemma can also encompass variation that, strictly speaking, involves different-sounding words still meaning the same thing. And the more broad the definition of the language is, the more such variation you're going to have to deal with.

So some wordforms get conflated because they're pronounced similarly rather than identically. In English, take embed and imbed. They sound and look different, but mean the same, and dictionaries will not bother to define them separately: at most, the entry for imbed will be something along the lines of "see embed". There is a lot more of this in Greek, whether because the stems look different, or the endings look different. There's no essential difference between δωδεκάμηνος, δυοδεκάμηνος, and δυοκαιδεκάμηνος for "twelve-monthly"; so it's a judgement call whether to consider them as the same lemma or different. ἰσόσταθμος and ἰσοσταθμής mean the same thing, "equal in weight". Should they count as the same lemma? That's not obvious either, because the declension of the two is distinct—but LSJ don't see a reason to give them separate definitions.

In strictly morphological terms, ἰσόσταθμος and ἰσοσταθμής, and δυοδεκάμηνος and δυοκαιδεκάμηνος are distinct lexemes. But dictionaries tend to conflate those lexemes, especially as they're driven by semantics. So more often than not, ἰσόσταθμος and ἰσοσταθμής are going to turn up in the same dictionary entry, which in effect treats them as the same. That doesn't mean that everything has to treat them as the same—as we saw with derived lemmata. But it is a consideration.

Beyond that, Ancient Greek covers a range of dialects, and several centuries of language change, and that inflates the variation under the same lemma. So combing wool gets to be all of κνάπτω, γνάπτω, and κνάμπτω: κνάπτω seems to be the original Attic form, γνάπτω Ionic (and later Attic). δυσευνάτωρ is just the Doric for Attic δυσευνήτωρ; if you're going to include both in your lexicon, you're not going to treat them as distinct lemmata. The longer the timespan of your lexicon, the more merging you'll be doing of wordforms that vary across time, until the change is so great that it no longer makes sense to. Modern Greek μπορώ "I can" derives from Ancient εὐπορέω "I prosper", but no one looking for instances of εὐπορέω in text is well served by getting hits for μπορώ.

On the other hand, Ancient ἀνήρ "man" regularly developed into Modern άντρας, and if your dictionary is going to span both Ancient and Modern Greek, it should treat them as variants of the same lemma.

That of course brings us to the delicate issue of whether a lexicon has any business listing both ἀνήρ and άντρας in the same tome. (It'd have to be ἄντρας of course if it did: you wouldn't allow different accentuation rules in the same tome.) You won't find any dictionaries listing Latin homo and Italian uomo as the same word, because they're not regarded as the same language. Greek is, but it's not untoward to say, that's not exclusively or even primarily for linguistic reasons. (Those extralinguistic reasons are of course a lot of the motivation behind the Lernaean text to begin with.)

Outside Dimitrakos' dictionary, no one has attempted to have the one dictionary cover the whole of the τρισχιλιετοῦς*, and there's reasons for that too. The TLG lemma search engine does have to deal with the entire corpus, Ancient and Not, so it does do those conflations, though not without problems. The counts I'll be giving in subsequent posts will reflect those conflations; but there is also more to say about why those conflations are problematic within linguistics.

[GLOSS: τρισχιλιετοῦς is the archaic genitive of τρισχιλιετής, "three-thousand year long", an epithet for the Greek language favoured by those who emphasise its continuity. It's also favoured by those on the other side of the debate, who regard Modern Greek as autonomous from the Ancient Grammar: the archaic genitive is uncomfortable enough in the Modern language (see again Sarantakos in Greek on how), that they believe the word's very grammar supports *their* side of the argument.]

2 comments:

  1. I'll just mention (because I'm not allowed to say anything concrete about it until the article is published) that there now exists a searchable corpus made up of Google-scanned English-language books that contains about 100 million words published in each year since 1800. More than 200 BNC-equivalents, plus lesser amounts of corpus (corpus becomes a mass noun at this scale) covering the years 1550-1799. Eat that, hellenomaniacs. (What we know about the history of collocations in Modern English is gonna change. A lot.)

    As I posted on Language Log a while back, the Google n-gram data strongly suggests that "Trademarks and brands are the property of their respective owners" is now the most common written sentence of English, with about 25 million tokens of it extant.

  2. It is a great time in human history to be a corpus linguist. And thank you John for giving my office mates something to muse over for a few minutes with the n-gram. :-) Said post here.

    Hellenomaniacs: now go forth and do likewise. There is Greek OCR at some level in Google Books already...

