Ἡλληνιστεύκοντος: Comparison, TLG BC and AD

2010-02-01

Comparison, TLG BC and AD

In the previous post, I used Wordle to illustrate stop words in Greek (and, by the by, the power-law distribution of function words following Zipf's Law). After getting rid of a whole bunch of stop words, I ended up with a Wordle of the lemmata of the TLG:

But I stopped short of making sense of the Wordle, because the TLG contains both Ancient and Mediaeval texts, and they talk about different things. I promised Wordles of the texts in the TLG from BC and AD, which will give at least a rough sense of the difference.

So here they are:

Images created by the Wordle.net web application are licensed under a Creative Commons Attribution 3.0 United States License.

The Wordle images are hyperlinked to the Wordle applets hosted there, so you can play with the applets by eliminating words. The stopwords are as before, but I also got rid of πολύς "much", which was crowding the BC texts a bit much.

A few things jump out quickly: there's a lot more God AD, as you'd expect (θεός), slightly more talk of "people" than of "men" (ἄνθρωπος, ἀνήρ), less talk of the City and more talk of power (πόλις, δύναμις).

But I'm not really a visual person, so I'm going to use more quantitative ways of working out the changes in vocabulary.

To begin with, the two Wordles show the 150 most frequent lemmata for each period, not counting stop words. These are the differences between the two—words in the top 150 of one period, but not the other.

Ancients talked more about...	and less about...
Ἕλλην, Ἀθηναῖος, Ζεύς, ἀμφότερος, διαφέρω, ἑκάτερος, ἐλάσσων, εἶμι, εὖ, ἥλιος, ἡγέομαι, ἱερός, κεῖμαι, κύκλος, ναῦς, νέος, νομίζω, ὀρθός, οἶκος, πλέως, πλεῖστος, πλῆθος, πόλεμος, πολέμιος, ποταμός, θάλασσα, θεά, σημεῖον, ταχύς, ὕστερος, χρῆμα, χώρα, ζῷον	Χριστός, ἅγιος, ἁπλόος, ἄξιος, ἀδελφός, ἀλήθεια, βασιλεία, δέχομαι, δηλόω, δόξα, ἐκκλησία, ἐνέργεια, εἶδος, φωνή, κίνησις, κόσμος, νόος, οἰκεῖος, οὐρανός, οὐσία, πάθος, πίστις, πνεῦμα, πρόσωπον, θάνατος, θεῖος, σάρξ, τέλος, τρίτος, χάρις, ζητέω, ζωή
Greek, Athenian, Zeus, both, to differ, either, less, go, good, sun, to lead, holy, to lie, circle, ship, new, to think, right, house, full, most, crowd, war, enemy, river, sea, goddess, point, fast, last, need, land, animal	Christ, holy, simple, worthy, brother, truth, kingdom, to accept, to declare, glory, church, activity, form, voice, movement, world, mind, own, heaven, substance, passion, faith, spirit, face, death, divine, flesh, end, third, grace, to ask, life
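
The comparison behind this table boils down to a set difference between two top-150 lists. Here is a minimal sketch in Python, assuming the lemma counts for each period are already available as plain dictionaries. The function names, the alphabetical tie-breaking, and the made-up toy counts are all illustrative assumptions, not the scripts actually used for this post:

```python
def top_n(counts, n, stop_words=frozenset()):
    """Return the n most frequent lemmata, ignoring stop words.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    content = {w: c for w, c in counts.items() if w not in stop_words}
    ordered = sorted(content, key=lambda w: (-content[w], w))
    return set(ordered[:n])


def top_n_differences(bc_counts, ad_counts, n=150, stop_words=frozenset()):
    """Lemmata in the top n of one period but not of the other."""
    bc_top = top_n(bc_counts, n, stop_words)
    ad_top = top_n(ad_counts, n, stop_words)
    return sorted(bc_top - ad_top), sorted(ad_top - bc_top)


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    bc = {"καί": 100, "πόλις": 9, "ναῦς": 8, "θεός": 7, "πίστις": 1}
    ad = {"καί": 120, "θεός": 10, "πίστις": 9, "πόλις": 8, "ναῦς": 1}
    only_bc, only_ad = top_n_differences(bc, ad, n=2, stop_words={"καί"})
    print("BC only:", only_bc)
    print("AD only:", only_ad)
```

With real data, `n` would be 150 and `stop_words` would be the stop list from the previous post.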

The effect of Christianity on vocabulary use is pretty obvious. A few other changes are worth noting:

Byzantines nominalised a lot more than Ancients did. That's at least some of the reason for ἀλήθεια "truth" (instead of the more Attic τὸ ἀληθές "the true"), and it may relate to other nominalisations like κίνησις "movement" and ἐνέργεια "activity". (βασιλεία "kingdom" has a Biblical pedigree, but that is also because the Bible was not written in Attic.)
Many of the differences are a matter of language change, rather than different ideology. For all that most Byzantines did not write in the vernacular, their language was usually more akin to Koine than to Attic. That explains the absence of εἶμι, εὖ, ναῦς, πλέως, ἐλάσσων, ἱερός, πολέμιος (replaced by στέλλω, καλός, πλοῖον, πλήρης, μικρότερος/ὀλιγότερος, ἅγιος, ἐχθρός) "send, good, ship, full, less, holy, enemy", and presumably also the avoidance of ἀμφότερος and ἑκάτερος "both, either".

I've left out from those lists words that show up in the top 150 only because they're ambiguous with other legitimate words. (Yes, I should have pruned the Wordles.)

BC: δίκαιον, δοκεύς, ἠώς, θέα: rights, beam, dawn, view
AD: ἅγιον, βασίλειος, ἴδιον, κενόω, πρόσωπος, ζωός: sanctuary, royal, particularity, make void, face, alive

There's one further comparison I'll attempt: the words whose frequency changed the most between the two periods. To track this, I'm going to use the 2000 most frequent lemmata for each period—including both normal words and stop words; that constraint means we're only looking at words that are likely to matter. I'll go through the lemmata in those lists whose ranking changed by the greatest amount (e.g. from #1537 to #10342).
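
A minimal sketch of that rank-shift calculation in Python. The sign convention (BC rank minus AD rank, so that negative shifts mark words more prominent BC) is inferred from the tables that follow. How to rank a lemma that is missing from one period altogether is not specified here, so giving it one rank past the end of that period's list is purely an assumption of this sketch, as are the function names:

```python
def ranks(counts):
    """Map each lemma to its 1-based frequency rank (ties broken alphabetically)."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}


def rank_shifts(bc_counts, ad_counts, top_n=2000):
    """Rank shift (BC rank minus AD rank) for lemmata in either period's top_n.

    Returns (lemma, shift) pairs sorted from most negative to most positive.
    """
    bc_rank, ad_rank = ranks(bc_counts), ranks(ad_counts)
    candidates = {w for w, r in bc_rank.items() if r <= top_n}
    candidates |= {w for w, r in ad_rank.items() if r <= top_n}
    # Assumption: a lemma absent from a period ranks just past that period's list.
    bc_missing, ad_missing = len(bc_rank) + 1, len(ad_rank) + 1
    shifts = {
        w: bc_rank.get(w, bc_missing) - ad_rank.get(w, ad_missing)
        for w in candidates
    }
    return sorted(shifts.items(), key=lambda kv: (kv[1], kv[0]))
```

The head of the returned list gives the "Ancients talked more about" words and the tail the "Ancients talked less about" words. Splitting them into proper names, nominals, verbs and function words would still be a manual step.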

Because it's a pretty heterogeneous list, and different kinds of words tell us different things, I'll split them up into categories. (And I will do some silent suppressing of ill-recognised ambiguous words.)

These are the biggest shifts in proper names:

Ancients talked more about...		Rank Shift
Ἔφορος	Ephorus	-8530
Ποσειδώνιος	Posidonius	-8397
Πελοποννήσιος	Peloponnesian	-6655
Αἰτωλός	Aetolian	-5399
Ἑκαταῖος	Hecataeus	-5157
Θεόπομπος	Theopompus	-5046
Ἀπολλόδωρος	Apollodorus	-4948
Φωκεύς	Phocian	-4786
Τυρρηνικός	Tyrrhenian	-4587
Χρύσιππος	Chrysippus	-4043

Two things are going on here. First, some ancient authorities—primarily historians, if I read the names right—were of interest to several ancient writers, but of less interest to the Byzantines. They tend to be the historians whose texts didn't survive, which is related to them being of less interest to the Byzantines. (I don't know offhand whether that's cause or effect.)

Second, Greece was very important to Ancient Greeks, and so were the various regions of Greece. To the Byzantines though, Greece was a backwater, and the old regions did not survive into the Byzantine system of themes. So there was no reason to talk about Aetolia or Phocis outside of Ancient History; and less reason to talk about the Peloponnese than you might think, even while the name survived. The same goes for Tyrrhenians: it wasn't Etruscans that the Byzantines were having to deal with in Italy, but Lombards.

Ancients talked less about...		Rank Shift
Κύριλλος	Cyril	+214,509
Κωνσταντινούπολις	Constantinople	+214,399
Γρηγόριος	Gregory	+214,391
Ἀθανάσιος	Athanasius	+214,154
Γεώργιος	George	+85,856
Κωνσταντῖνος	Constantine	+47,064
Πέτρος	Peter	+40,947
Χριστιανός	Christian	+36,162
Βασίλειος	Basil	+28,217
Χριστός	Christ	+23,988

The only surprise is that Christians turn up in BC texts at all; there are only 5 instances though, and the dating of texts in the corpus is porous (late citations can appear as testimonia of earlier authors).

These are the biggest shifts in common nominals:

Ancients talked more about...		Rank Shift
εὔδοξος	reputable	-8805
κύλινδρος	cylinder	-6569
ἀσύμμετρος	asymmetrical	-5939
δημοκρατία	democracy	-5389
πυραμίς	pyramid	-4714
ναυμαχία	sea battle	-4274
κῶνος	cone	-4205
παραλληλόγραμμος	parallelogram	-3837
παρεμβολή	interpolation; encampment	-3668
ψήφισμα	decree passed by vote	-3194

If the AD texts have more theology, they clearly have a lot less geometry, and a lot less to do with representative systems of government. The drop in εὔδοξος is surprising, given it's in Plato; I wonder if the change of -δοξ- in compounds from "reputation" to "glory" made the adjective confusing for later writers.

Ancients talked less about...		Rank Shift
ἀποστολικός	apostolic	+214,282
θεοτόκος	God-bearing (Theotokos)	+85,966
βάπτισμα	baptism	+85,945
θεότης	divinity	+59,016
μόδιος	bushel	+58,616
μοναστήριον	monastery	+57,696
σεβάσμιος	reverend	+57,691
αἱρετικός	heretic	+35,602
χάρισμα	(spiritual) gift	+35,588
πατριάρχης	patriarch	+27,001

No surprises again; the only non-religious term is μόδιος "bushel", both as a vessel and a measure.

These are the biggest shifts in verbs:

Ancients talked more about...		Rank Shift
διαπορεύω	pass across	-5939
βλώσκω	go	-5710
εἰσοράω	look upon	-4134
ἄημι	blow (wind)	-4113
κλύω	hear	-3392
ἐπιζεύγνυμι	join to	-3039
ἀμφισβητέω	doubt	-2436
ἐφάπτω	hang on	-2122
ἱκνέομαι	come	-2088
μεταπέμπω	send for	-1974

Many of the missing verbs are poetic and/or dialectal, and would not have a natural place in Byzantine prose; that includes βλώσκω, εἰσοράω, ἄημι, κλύω, ἱκνέομαι. The surprise here is the vanishing of doubt in the Middle Ages.

... Yes, yes, the jokes just write themselves, I know...

Ancients talked less about...		Rank Shift
ἐνάγω	persuade	+10,366
βαπτίζω	baptise	+8529
ψάλλω	chant	+5021
φανερόω	reveal	+3988
καταδικάζω	condemn	+3483
φωτίζω	illuminate	+3308
περισπάω	take a circumflex	+2948
βαστάζω	carry	+2911
ἀνέρχομαι	go up	+2809
προλαμβάνω	anticipate	+2769

I admit to being less sure about some of the shifts here, such as ἐνάγω and προλαμβάνω. The Christian influence is clear in βαπτίζω, ψάλλω, φανερόω and φωτίζω. Language change accounts for βαστάζω and ἀνέρχομαι replacing φέρω and ἄνειμι, and I assume καταδικάζω for "condemn" replaced what came to look like more generic verbs, in καθαιρέω or καταγιγνώσκω. And unlike the Ancients, the Byzantines had to learn about polytonic orthography; so what word took a circumflex and what word took an acute was a matter much ink was spilled about.

Finally, these are the biggest shifts in function words:

Ancients talked more about...		Rank Shift
τοτέ	at times	-5796
αὖτε	again	-4707
δισχίλιοι	two thousand	-4676
αἴ	alas	-3334
πεντακόσιοι	five hundred	-2859
ἠέ	or	-2844
νή	[I swear] by [deity]	-2663
διακόσιοι	two hundred	-2597
μά	yea	-2470
πω	yet, at all	-2466

There is some Epic dialect here, in αὖτε and ἠέ; some strictly Attic rather than Koine words in τοτέ, πω, and δισχίλιοι; and a rather different approach to exclamations, with the old oaths by the Gods dispensed with, and the ai!'s of tragedy avoided in theological discourse. (There are 2100 instances AD of φεῦ "alas"; maybe αἴ was too specific to tragedy? *shrug*) Not sure why the written-out 500 and 200 were less popular. Maybe the armies just got bigger, so historians talked in the thousands instead of 300...

Ancients talked less about...		Rank Shift
ἀμήν	amen	+19,195
νά	to (Modern Greek)	+18,984
ἀλλαχοῦ	elsewhere	+7689
δηλαδή	that is	+7587
ἤγουν	that is	+6367
ιζ΄	XVII	+4541
καθό	insofar as	+4524
ιϛʹ	XVI	+4001
ιηʹ	XVIII	+3727
ιδʹ	XIV	+3196

It's obvious why amen is there; it's also obvious why να, the Modern Greek equivalent of the ancient infinitive inflection, is there. ἀλλαχοῦ for "elsewhere" is attested in Sophocles and Xenophon, but it became prevalent much later, and LSJ reports that Moeris proscribed it as vernacular, in favour of ἄλλοθι. The other conjunctions are run-in phrases, which Byzantine texts in general are rather more sympathetic to treating as single words than are ancient texts: δῆλα δή "so [they are] obvious", ἤ γε οὖν "or indeed then", καθ’ ὅ "according to what".

Finally, the numerals aren't there because the Byzantines were more numerate than the Ancients. After all, the Byzantines had given up on geometry, from what the counts tell us. (And that's a silly enough thing to conclude that you should not take much of this too seriously.) No, the reason there's a whole lot of XVII's and XIV's in the AD corpus is that there are a lot more chapter headings in the theologians...

6 comments:

filologanoga / Neven Jovanović, February 03, 2010 1:14 AM
Wow! Stuff for thought.
Helma, February 06, 2010 9:49 AM
You should really try a bit of disambiguation sometime (doesn't take a lot of text) -- that will do wonders to your viper presence:-) Lots of (τοὺς) ἔχεις are pretty easy to distinguish from (σὺ) ἔχεις.
But, more to the point, if you really want to compare two corpora with Wordles, check out Martin Mueller's Wordles (enter him as a Wordle user to find them). Neat stuff that contrasts Iliad and Odyssey, and similar. So his Wordles are not raw counts but differentials (more specifically, Dunning's log likelihood ratio). For English stuff, check out his Jane_Austen_avoids (used less in Austen than in her contemporary novelists). By the way, I'm completely in agreement that Wordles are fairly useless as an analytical tool after the first glance. Ordering by frequency, or alphabetically, gets you where you want to be sooner.
opoudjis, February 06, 2010 4:40 PM
Helma, hello, and good to hear from you again.

Syntactic disambiguation will do some things, agreed, but I don't think it will go so far as "wonders" for the vipers. I played with syntactic disambiguation in 2004—although I should revisit it now that the lemmatiser is doing a better job: I used the collocation of parts of speech and inflectional categories for words that were unambiguous, and tried to apply those to words that were ambiguous.

My findings at the time were, that kind of disambiguation would deal with a quarter to a third of all ambiguous word instances. Certainly a meaningful step, but it wouldn't get rid of "viper" completely. Moreover, though I went as far as a three word window either side of the ambiguous word, the only consistently reliable syntactic cue I found was preceding definite article (which cannot precede a verb). And that's because it combines a syntactic restriction with inflectional agreement.

Otherwise, syntactic restrictions in Ancient Greek are thin on the ground: articles, prepositions, and that's about it. The word order is too free for such a restriction to emerge through pure statistics. Which means I know τῷ ἔχει is "to the viper" and not *"to the she has"; but I can't do as much with σὺ ἔχεις "thou hast". For starters, ἔχεις could be the accusative plural, and could occur in a pattern like σὺ ἔχεις ἐφόνευσας "thou murderedst vipers". For seconds, pronouns like σύ are rare in Greek in general, so they wouldn't account for a lot of instances.

I don't want to be uncharitable to Wordle, because the "first glance" still counts for something. But yes, it's only an initial glance.

Thank you for the cluestick on Dunning via Mueller. I'm going to blunt-instrument use the formulas as interpreted by Wordhoard, and see if the Wordles I get that way are more informative.
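
For reference, a minimal sketch of the log-likelihood comparison mentioned here, in Python. This is the commonly used two-term form of Dunning's G² for one word across two corpora; whether it matches Wordhoard's implementation in every detail is an assumption, and the function and parameter names are hypothetical:

```python
import math


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Two-term log-likelihood ratio (G2) for one word in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total word counts of corpus A and corpus B.
    """
    combined = freq_a + freq_b
    # Expected counts if the word were equally likely in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

The score is zero when the word has the same relative frequency in both corpora and grows as the frequencies diverge. It does not say which corpus overuses the word, so that has to be read off the relative frequencies separately.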
Helma, February 07, 2010 5:14 AM
Hi Nick,
I guess I like cluesticks as much as anyone :-)

So let me leave another comment here. I think that you'll find that morphological disambiguation (less work than syntax) would not be 100% for things like ἔχις and δοκεύς (another stand-out, for obvious reasons) but quite successful nonetheless. In my corpus, after a really small amount of disambiguation, I get frequencies for ἔχω 1000 times higher than for ἔχις and few mishits, that are mostly there because the system in its first go-round had not seen a lot of postpositions yet (and because we didn't deal with caps and apostrophes well. oops!). We'll soon do a round of re-training with more data, so we'll see what happens then.
But I understand that in what it makes available for the general user, TLG has decided to give users the option of looking for all δοκεῖ or ἔχεις instances, without taking a first stab at which is which. I guess this fits in with the general philosophy of the project and the obsessions of the average classicist -- the search for completeness above all (collecting all the trees in the Greek forest). Because I'm interested in frequency distribution *and* fast searching, I've taken the other decision: go with the probables, not the possibles, for the sake of searching, lemma frequency, collocation, etc.
opoudjis, February 09, 2010 2:32 AM
The distinction is more a matter of degree: I've been more conservative about displacing ambiguous parses, for the reasons you indicate; but there is a system of ranking analyses in place—especially because the TLG lemmatiser, because of the nature of its corpus, overgenerates analyses.

The disambiguation is rule-driven rather than stochastic (since words are analysed in isolation currently); its criteria are morphological and lexical. (So δοκεύς, a hapax pretty much, won't be competing with δοκέω any more: I've added a lexical rule to that effect.) But the morphological rules are of limited scope, or else conservatively applied.
Helma, February 09, 2010 4:10 PM
Thanks for that comment. Especially amusing to me since my more statistically trained, non-classicist, co-conspirators try to keep me from implementing word-specific rules:-) I should trust the system to take care of it, after all. So far, they've pretty much been right, but I'm always tempted.