2010-02-01

Comparison, TLG BC and AD

In the previous post, I used Wordle to illustrate stop words in Greek (and, by the by, the power-law distribution of function words following Zipf's Law). After getting rid of a whole bunch of stop words, I ended up with a Wordle of the lemmata of the TLG:

But I stopped short of making sense of the Wordle, because the TLG contains both Ancient and Mediaeval texts, and they talk about different things. I promised Wordles of the texts in the TLG from BC and AD, which will give at least a rough sense of the difference.

So here they are:





Images created by the Wordle.net web application are licensed under a Creative Commons Attribution 3.0 United States License.


The Wordle images are hyperlinked to the Wordle applets hosted there, so you can play with the applets by eliminating words. The stopwords are as before, but I also got rid of πολύς "much", which was crowding the BC texts a bit much.

A few things jump out quickly: there's a lot more God AD, as you'd expect (θεός), slightly more talk of "people" than of "men" (ἄνθρωπος, ἀνήρ), less talk of the City and more talk of power (πόλις, δύναμις).

But I'm not really a visual person, so I'm going to use more quantitative ways of working out the changes in vocabulary.

To begin with, the two Wordles show the 150 most frequent lemmata for each period, not counting stop words. These are the differences between the two—words in the top 150 of one period, but not the other.

Ancients talked more about...
Ἕλλην, Ἀθηναῖος, Ζεύς, ἀμφότερος, διαφέρω, ἑκάτερος, ἐλάσσων, εἶμι, εὖ, ἥλιος, ἡγέομαι, ἱερός, κεῖμαι, κύκλος, ναῦς, νέος, νομίζω, ὀρθός, οἶκος, πλέως, πλεῖστος, πλῆθος, πόλεμος, πολέμιος, ποταμός, θάλασσα, θεά, σημεῖον, ταχύς, ὕστερος, χρῆμα, χώρα, ζῷον
Greek, Athenian, Zeus, both, to differ, either, less, go, good, sun, to lead, holy, to lie, circle, ship, new, to think, right, house, full, most, crowd, war, enemy, river, sea, goddess, point, fast, last, need, land, animal

...and less about...
Χριστός, ἅγιος, ἁπλόος, ἄξιος, ἀδελφός, ἀλήθεια, βασιλεία, δέχομαι, δηλόω, δόξα, ἐκκλησία, ἐνέργεια, εἶδος, φωνή, κίνησις, κόσμος, νόος, οἰκεῖος, οὐρανός, οὐσία, πάθος, πίστις, πνεῦμα, πρόσωπον, θάνατος, θεῖος, σάρξ, τέλος, τρίτος, χάρις, ζητέω, ζωή
Christ, holy, simple, worthy, brother, truth, kingdom, to accept, to declare, glory, church, activity, form, voice, movement, world, mind, own, heaven, substance, passion, faith, spirit, face, death, divine, flesh, end, third, grace, to ask, life
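
For anyone who wants to repeat this kind of comparison, it boils down to a set difference between two top-N lists. Here is a minimal Python sketch, not the code behind the lists above: the function names are made up, the counts are invented toy numbers rather than TLG figures, and I am assuming the lemma counts for each period are already available as plain dictionaries.

```python
from collections import Counter


def top_lemmata(counts, n, stop_words=frozenset()):
    """Return the n most frequent lemmata, skipping stop words."""
    ranked = [
        lemma
        for lemma, _ in Counter(counts).most_common()
        if lemma not in stop_words
    ]
    return ranked[:n]


def top_n_differences(counts_a, counts_b, n, stop_words=frozenset()):
    """Lemmata in the top n of one period but not of the other."""
    top_a = top_lemmata(counts_a, n, stop_words)
    top_b = top_lemmata(counts_b, n, stop_words)
    set_a, set_b = set(top_a), set(top_b)
    only_a = [lemma for lemma in top_a if lemma not in set_b]
    only_b = [lemma for lemma in top_b if lemma not in set_a]
    return only_a, only_b


# Invented toy counts, for illustration only (not TLG data).
bc = {"καί": 100, "πόλις": 50, "ναῦς": 40, "θεός": 30, "λόγος": 20}
ad = {"καί": 120, "θεός": 90, "Χριστός": 60, "πόλις": 25, "λόγος": 55}

only_bc, only_ad = top_n_differences(bc, ad, n=3, stop_words={"καί"})
print(only_bc)  # lemmata in the BC top 3 only
print(only_ad)  # lemmata in the AD top 3 only
```

With n=150 and real counts, the two returned lists would correspond to the "more about" and "less about" columns.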

The effect of Christianity on vocabulary use is pretty obvious. A few other changes are worth noting:
  • Byzantines nominalised a lot more than Ancients did. That's at least some of the reason for ἀλήθεια "truth" (instead of the more Attic τὸ ἀληθές "the true"), and it may relate to other nominalisations like κίνησις "movement" and ἐνέργεια "activity". (βασιλεία "kingdom" has a Biblical pedigree; but that is also because the Bible was not written in Attic.)
  • Many of the differences are a matter of language change, rather than different ideology. For all that most Byzantines did not write in the vernacular, their language was usually more akin to Koine than to Attic. That explains the absence of εἶμι, εὖ, ναῦς, πλέως, ἐλάσσων, ἱερός, πολέμιος (replaced by στέλλω, καλός, πλοῖον, πλήρης, μικρότερος/ὀλιγότερος, ἅγιος, ἐχθρός) "send, good, ship, full, less, holy, enemy", and presumably also the avoidance of ἀμφότερος and ἑκάτερος "both, either".


I've left out from those lists words that show up in the top 150 only because they're ambiguous with other legitimate words. (Yes, I should have pruned the Wordles.)
  • BC: δίκαιον, δοκεύς, ἠώς, θέα: rights, beam, dawn, view
  • AD: ἅγιον, βασίλειος, ἴδιον, κενόω, πρόσωπος, ζωός: sanctuary, royal, particularity, make void, face, alive

There's one further comparison I'll attempt: the words whose frequency changed the most between the two periods. To track this, I'm going to use the 2000 most frequent lemmata for each period—including both normal words and stop words; that constraint means we're only looking at words that are likely to matter. I'll go through the lemmata in those lists whose ranking changed by the greatest amount (e.g. from #1537 to #10342).

Because it's a pretty heterogeneous list, and different kinds of words tell us different things, I'll split them up into categories. (And I will do some silent suppressing of ill-recognised ambiguous words.)
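
The rank-shift measure can be sketched in Python along the following lines. This is a sketch under stated assumptions, not the code actually used here: the function names and toy counts are hypothetical, and so is the treatment of lemmata absent from one period (ranked just past the end of that period's list). The sign convention, BC rank minus AD rank, is inferred from the example of #1537 to #10342, which gives -8805.

```python
def rank_table(counts):
    """Map each lemma to its 1-based frequency rank (1 = most frequent)."""
    # Ties are broken alphabetically so that the ranking is deterministic.
    ordered = sorted(counts, key=lambda lemma: (-counts[lemma], lemma))
    return {lemma: rank for rank, lemma in enumerate(ordered, start=1)}


def rank_shifts(bc_counts, ad_counts, top_n=2000):
    """Return (lemma, shift) pairs sorted from most negative to most positive.

    shift = BC rank - AD rank, so a negative shift means the lemma ranked
    higher (was more prominent) in the BC texts.
    """
    bc_ranks = rank_table(bc_counts)
    ad_ranks = rank_table(ad_counts)
    # Assumption: a lemma missing from a period is ranked just past the end.
    bc_missing = len(bc_ranks) + 1
    ad_missing = len(ad_ranks) + 1
    # Only consider lemmata in the top_n of at least one period.
    candidates = {l for l, r in bc_ranks.items() if r <= top_n}
    candidates |= {l for l, r in ad_ranks.items() if r <= top_n}
    shifts = {
        lemma: bc_ranks.get(lemma, bc_missing) - ad_ranks.get(lemma, ad_missing)
        for lemma in candidates
    }
    return sorted(shifts.items(), key=lambda item: (item[1], item[0]))


# Invented toy counts, for illustration only (not TLG data).
bc = {"ὁ": 900, "ναυμαχία": 80, "θεός": 60, "κῶνος": 50, "βαπτίζω": 1}
ad = {"ὁ": 950, "θεός": 400, "βαπτίζω": 120, "ἀμήν": 90,
      "ναυμαχία": 5, "κῶνος": 2}

shifts = rank_shifts(bc, ad, top_n=3)
for lemma, shift in shifts:
    print(lemma, shift)
```

Restricting the candidates to the top 2000 of either period, while taking the other period's rank from the full list, is what allows shifts far larger than 2000.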

These are the biggest shifts in proper names:
Ancients talked more about...   Rank Shift
Ἔφορος "Ephorus"   -8530
Ποσειδώνιος "Posidonius"   -8397
Πελοποννήσιος "Peloponnesian"   -6655
Αἰτωλός "Aetolian"   -5399
Ἑκαταῖος "Hecataeus"   -5157
Θεόπομπος "Theopompus"   -5046
Ἀπολλόδωρος "Apollodorus"   -4948
Φωκεύς "Phocian"   -4786
Τυρρηνικός "Tyrrhenian"   -4587
Χρύσιππος "Chrysippus"   -4043

Two things are going on here. First, some ancient authorities—primarily historians, if I read the names right—were of interest to several ancient writers, but of less interest to the Byzantines. They tend to be the historians whose texts didn't survive, which is related to them being of less interest to the Byzantines. (I don't know offhand whether that's cause or effect.)

Second, Greece was very important to Ancient Greeks, and so were the various regions of Greece. To the Byzantines though, Greece was a backwater, and the old regions did not survive into the Byzantine system of themes. So there was no reason to talk about Aetolia or Phocis outside of Ancient History; and less reason to talk about the Peloponnese than you might think, even while the name survived. The same goes for Tyrrhenians: it wasn't Etruscans that the Byzantines were having to deal with in Italy, but Lombards.

Ancients talked less about...   Rank Shift
Κύριλλος "Cyril"   +214,509
Κωνσταντινούπολις "Constantinople"   +214,399
Γρηγόριος "Gregory"   +214,391
Ἀθανάσιος "Athanasius"   +214,154
Γεώργιος "George"   +85,856
Κωνσταντῖνος "Constantine"   +47,064
Πέτρος "Peter"   +40,947
Χριστιανός "Christian"   +36,162
Βασίλειος "Basil"   +28,217
Χριστός "Christ"   +23,988

The only surprise is that Christians turn up in BC texts at all; there are only 5 instances though, and the dating of texts in the corpus is porous (late citations can appear as testimonia of earlier authors).

These are the biggest shifts in common nominals:
Ancients talked more about...   Rank Shift
εὔδοξος "reputable"   -8805
κύλινδρος "cylinder"   -6569
ἀσύμμετρος "asymmetrical"   -5939
δημοκρατία "democracy"   -5389
πυραμίς "pyramid"   -4714
ναυμαχία "sea battle"   -4274
κῶνος "cone"   -4205
παραλληλόγραμμος "parallelogram"   -3837
παρεμβολή "interpolation; encampment"   -3668
ψήφισμα "decree passed by vote"   -3194

If the AD texts have more theology, they clearly have a lot less geometry, and a lot less to do with representative systems of government. The drop in εὔδοξος is surprising, given it's in Plato; I wonder if the change of -δοξ- in compounds from "reputation" to "glory" made the adjective confusing for later writers.

Ancients talked less about...   Rank Shift
ἀποστολικός "apostolic"   +214,282
θεοτόκος "God-bearing (Theotokos)"   +85,966
βάπτισμα "baptism"   +85,945
θεότης "divinity"   +59,016
μόδιος "bushel"   +58,616
μοναστήριον "monastery"   +57,696
σεβάσμιος "reverend"   +57,691
αἱρετικός "heretic"   +35,602
χάρισμα "(spiritual) gift"   +35,588
πατριάρχης "patriarch"   +27,001

No surprises again; the only non-religious term is μόδιος "bushel", both as a vessel and a measure.

These are the biggest shifts in verbs:
Ancients talked more about...   Rank Shift
διαπορεύω "pass across"   -5939
βλώσκω "go"   -5710
εἰσοράω "look upon"   -4134
ἄημι "blow (wind)"   -4113
κλύω "hear"   -3392
ἐπιζεύγνυμι "join to"   -3039
ἀμφισβητέω "doubt"   -2436
ἐφάπτω "hang on"   -2122
ἱκνέομαι "come"   -2088
μεταπέμπω "send for"   -1974

Many of the missing verbs are poetic and/or dialectal, and would not have a natural place in Byzantine prose; that includes βλώσκω, εἰσοράω, ἄημι, κλύω, ἱκνέομαι. The surprise here is the vanishing of doubt in the Middle Ages.

... Yes, yes, the jokes just write themselves, I know...

Ancients talked less about...   Rank Shift
ἐνάγω "persuade"   +10,366
βαπτίζω "baptise"   +8529
ψάλλω "chant"   +5021
φανερόω "reveal"   +3988
καταδικάζω "condemn"   +3483
φωτίζω "illuminate"   +3308
περισπάω "take a circumflex"   +2948
βαστάζω "carry"   +2911
ἀνέρχομαι "go up"   +2809
προλαμβάνω "anticipate"   +2769

I admit to being less sure about some of the shifts here, such as ἐνάγω and προλαμβάνω. The Christian influence is clear in βαπτίζω, ψάλλω, φανερόω and φωτίζω. Language change accounts for βαστάζω and ἀνέρχομαι replacing φέρω and ἄνειμι, and I assume καταδικάζω for "condemn" replaced what came to look like more generic verbs, in καθαιρέω or καταγιγνώσκω. And unlike the Ancients, the Byzantines had to learn about polytonic orthography; so what word took a circumflex and what word took an acute was a matter much ink was spilled about.

Finally, these are the biggest shifts in function words:
Ancients talked more about...   Rank Shift
τοτέ "at times"   -5796
αὖτε "again"   -4707
δισχίλιοι "two thousand"   -4676
αἴ "alas"   -3334
πεντακόσιοι "five hundred"   -2859
ἠέ "or"   -2844
νή "[I swear] by [deity]"   -2663
διακόσιοι "two hundred"   -2597
μά "yea"   -2470
πω "yet, at all"   -2466

There is some Epic dialect here, in αὖτε and ἠέ; some strictly Attic rather than Koine words in τοτέ, πω, and δισχίλιοι; and a rather different approach to exclamations, with the old oaths by the Gods dispensed with, and the ai!'s of tragedy avoided in theological discourse. (There are 2100 instances AD of φεῦ "alas"; maybe αἴ was too specific to tragedy? *shrug*) Not sure why the written-out 500 and 200 were less popular. Maybe the armies just got bigger, so historians talked in the thousands instead of 300...

Ancients talked less about...   Rank Shift
ἀμήν "amen"   +19,195
νά "to (Modern Greek)"   +18,984
ἀλλαχοῦ "elsewhere"   +7689
δηλαδή "that is"   +7587
ἤγουν "that is"   +6367
ιζʹ "XVII"   +4541
καθό "insofar as"   +4524
ιϛʹ "XVI"   +4001
ιηʹ "XVIII"   +3727
ιδʹ "XIV"   +3196

It's obvious why amen is there; it's also obvious why να, the Modern Greek equivalent of the ancient infinitive inflection, is there. ἀλλαχοῦ for "elsewhere" is attested in Sophocles and Xenophon, but it became prevalent much later, and LSJ reports that Moeris proscribed it as vernacular, in favour of ἄλλοθι. The other conjunctions are run-in phrases, which Byzantine texts in general are rather more sympathetic to treating as single words than are ancient texts: δῆλα δή "so [they are] obvious", ἤ γε οὖν "or indeed then", καθ’ ὅ "according to what".

Finally, the numerals aren't there because the Byzantines were more numerate than the Ancients. After all, the Byzantines had given up on geometry, from what the counts tell us. (And that's a silly enough thing to conclude that you should not take much of this too seriously.) No, the reason there's a whole lot of XVII's and XIV's in the AD corpus is that there are a lot more chapter headings in the theologians...

6 comments:

  1. You should really try a bit of disambiguation sometime (doesn't take a lot of text) -- that will do wonders to your viper presence:-) Lots of (τοὺς) ἔχεις are pretty easy to distinguish from (σὺ) ἔχεις.
    But, more to the point, if you really want to compare two corpora with Wordles, check out Martin Mueller's Wordles (enter him as a Wordle user to find them). Neat stuff that contrasts Iliad and Odyssey, and similar. So his Wordles are not raw counts but differentials (more specifically, Dunning's log likelihood ratio). For English stuff, check out his Jane_Austen_avoids (used less in Austen than in her contemporary novelists). By the way, I'm completely in agreement that Wordles are fairly useless as an analytical tool after the first glance. Ordering by frequency, or alphabetically, gets you where you want to be sooner.

  2. Helma, hello, and good to hear from you again.

    Syntactic disambiguation will do some things, agreed, but I don't think it will go so far as "wonders" for the vipers. I played with syntactic disambiguation in 2004—although I should revisit it now that the lemmatiser is doing a better job: I used the collocation of parts of speech and inflectional categories for words that were unambiguous, and tried to apply those to words that were ambiguous.

    My findings at the time were that this kind of disambiguation would deal with a quarter to a third of all ambiguous word instances. Certainly a meaningful step, but it wouldn't get rid of "viper" completely. Moreover, though I went as far as a three-word window either side of the ambiguous word, the only consistently reliable syntactic cue I found was a preceding definite article (which cannot precede a verb). And that's because it combines a syntactic restriction with inflectional agreement.

    Otherwise, syntactic restrictions in Ancient Greek are thin on the ground: articles, prepositions, and that's about it. The word order is too free for such a restriction to emerge through pure statistics. Which means I know τῷ ἔχει is "to the viper" and not *"to the she has"; but I can't do as much with σὺ ἔχεις "thou hast". For starters, ἔχεις could be the accusative plural, and could occur in a pattern like σὺ ἔχεις ἐφόνευσας "thou murderedst vipers". For seconds, pronouns like σύ are rare in Greek in general, so they wouldn't account for a lot of instances.

    I don't want to be uncharitable to Wordle, because the "first glance" still counts for something. But yes, it's only an initial glance.

    Thank you for the cluestick on Dunning via Mueller. I'm going to blunt-instrument use the formulas as interpreted by Wordhoard, and see if the Wordles I get that way are more informative.

  3. Hi Nick,
    I guess I like cluesticks as much as anyone :-)

    So let me leave another comment here. I think that you'll find that morphological disambiguation (less work than syntax) would not be 100% for things like ἔχις and δοκεύς (another stand-out, for obvious reasons) but quite successful nonetheless. In my corpus, after a really small amount of disambiguation, I get frequencies for ἔχω 1000 times higher than for ἔχις and few mishits, that are mostly there because the system in its first go-round had not seen a lot of postpositions yet (and because we didn't deal with caps and apostrophes well. oops!). We'll soon do a round of re-training with more data, so we'll see what happens then.
    But I understand that in what it makes available for the general user, TLG has decided to give users the option of looking for all δοκεῖ or ἔχεις instances, without taking a first stab at which is which. I guess this fits in with the general philosophy of the project and the obsessions of the average classicist -- the search for completeness above all (collecting all the trees in the Greek forest). Because I'm interested in frequency distribution *and* fast searching, I've taken the other decision: go with the probables, not the possibles, for the sake of searching, lemma frequency, collocation, etc.

  4. The distinction is more a matter of degree: I've been more conservative about displacing ambiguous parses, for the reasons you indicate; but there is a system of ranking analyses in place—especially because the TLG lemmatiser, because of the nature of its corpus, overgenerates analyses.

    The disambiguation is rule-driven rather than stochastic (since words are analysed in isolation currently); its criteria are morphological and lexical. (So δοκεύς, a hapax pretty much, won't be competing with δοκέω any more: I've added a lexical rule to that effect.) But the morphological rules are of limited scope, or else conservatively applied.

  5. Thanks for that comment. Especially amusing to me since my more statistically trained, non-classicist, co-conspirators try to keep me from implementing word-specific rules:-) I should trust the system to take care of it, after all. So far, they've pretty much been right, but I'm always tempted.
