2010-02-06

Comparison, TLG BC and AD: log-likelihood

Helma Dik left a comment on my post on comparing TLG AD and BC through Wordle, suggesting I use Dunning's Log-Likelihood measure of differential word frequencies in corpora, as Wordled by Martin Mueller. That lets you work out what the real shifts in frequency are, rather than trying to eyeball them through the aggregate word counts.

Here for instance is his comparison of the Iliad to the Odyssey—which words are more frequent in the one, or the other:
Wordle: Odyssey_plusCorrected
Wordle: Iliad_plus
I looked up Ted Dunning's paper, failed to understand it :-( , and used instead the walkthrough of the computation in the user manual of the WordHoard corpus software package.
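
For readers who want to see the arithmetic, here is a minimal sketch of the log-likelihood score for a single word, in the two-term form commonly used for corpus comparison. It is not necessarily the exact computation behind the Wordles in this post; the function name `log_likelihood` and its parameter names are made up for illustration.

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning-style G2 score for one word across two corpora.

    a: occurrences of the word in corpus 1
    b: occurrences of the word in corpus 2
    c: total number of words in corpus 1
    d: total number of words in corpus 2
    (All names are illustrative assumptions, not taken from any package.)
    """
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    # A zero count contributes nothing (the limit of x * ln x as x -> 0 is 0).
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2
```

The score is zero when the word has the same relative frequency in both corpora and grows as the frequencies diverge. It says nothing about direction, so which corpus overuses the word has to be checked separately by comparing `a / c` with `b / d`.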

And this is the more statistically sound Wordle comparison. Words more frequent BC are in red, words more frequent AD are in black. I'm leaving in stop words this time, and not cleaning up the ambiguity, because this says some interesting things about the changes in Greek grammar between Classical and Late Greek. Do click:
Wordle: TLG AD vs BC comparison, using Dunning's Log-Likelihood metric
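
The red/black colouring can be sketched roughly as follows: score every word, then colour it by whichever period uses it relatively more often. This is an illustration under assumptions, not the script actually used here; `compare_corpora` is a hypothetical name, and the counts in the usage note are invented toy numbers, not TLG figures.

```python
import math
from collections import Counter

def compare_corpora(bc_counts, ad_counts):
    """Return (word, score, colour) tuples, highest log-likelihood first.

    bc_counts, ad_counts: Counter objects mapping word -> occurrences.
    Colour is "red" for words relatively more frequent BC and "black"
    for words relatively more frequent AD, following the post's scheme.
    """
    n_bc = sum(bc_counts.values())
    n_ad = sum(ad_counts.values())
    rows = []
    for word in set(bc_counts) | set(ad_counts):
        a, b = bc_counts[word], ad_counts[word]  # Counter gives 0 if absent
        # Expected counts under the assumption of equal relative frequency.
        e_bc = n_bc * (a + b) / (n_bc + n_ad)
        e_ad = n_ad * (a + b) / (n_bc + n_ad)
        score = 0.0
        if a > 0:
            score += a * math.log(a / e_bc)
        if b > 0:
            score += b * math.log(b / e_ad)
        colour = "red" if a / n_bc > b / n_ad else "black"
        rows.append((word, 2 * score, colour))
    # The biggest discrepancies come first; a word cloud would draw them largest.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

For example, with invented counts such as `Counter({"δέ": 50, "καί": 100, "θεός": 10})` for BC and `Counter({"δέ": 10, "καί": 120, "Χριστός": 30})` for AD, the function puts Χριστός (black) and δέ (red) at the top of the list and καί near the bottom.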

Here are my impressionistic notes on points not already covered in the previous post (where I was working through rankings):
  • Both corpora talk about θεός God, but the big jump, of course, is Χριστός Christ. The second biggest jump is in ἅγιος holy, displacing ἱερός. (Was ἱερός too pagan-sounding?)
  • But the biggest discrepancy between BC and AD Greek is the avoidance of δέ but, on the other hand, followed by avoidance of μέν on the one hand. That tells you that AD Greek used different sentence structures, such as a lot more ἀλλά but. Tucked away, there's also more καί and (i.e. more coordinating constructions) and a lot less τε and (a very archaic phrase-second construction).
  • There are a lot more ἤγουν and τουτέστι that is, and a lot less ἐάν if and ἄρα therefore; I'm tempted to think that says something about changing rhetoric in the genres popular in the respective periods—less logic, more exemplification. It's foolhardy, but not impossible.
  • There is a lot more τίς who? being reported, and that's an error in ambiguity, but it's an illuminating error. τοῦ in Attic (though not Late Greek) is ambiguous between "whose?" and the genitive definite article. And there are a lot more definite articles in Late Greek, as you can see by the black ὁ. (My friend Io Manolessou actually wrote her PhD on that shift; nice to see it visually confirmed.)
  • There's also more ἵνα in order to, which suggests Late Greek was already moving towards more subjunctive constructions rather than participles and infinitives, even before Early Modern Greek made the switch completely.
  • Clearly less ὦ O!, a very Classical way of addressing people.
  • Some of the odder looking words more prevalent in BC Greek are there because there are a lot more geometric texts in the BC corpus: Ἄβ is actually mistakenly picking up the line ΑΒ, and you can also see in smaller print ΑΒΓ, ΒΔ, ΓΔ, ΕΖ, ΞΖ.

Hm. Yes, that was somewhat more illuminating. Thanks, Helma!

3 comments:

  1. Nicely done Nick; thank you very much!
    On-demand log-likelihood wordles, who'd have thought! And thanks for leaving the stopwords in, which of course is what as a linguist I am most interested in. I'm interested in doing this with morphology distribution as well, but of course, do not have the full corpus that you have available.

  2. PS: See how besides the figure descriptions AB, BC etc, you actually have an amazing presence of geometry: circle, area, circumference, slice (what is the English for this), ?straight (obviously some of this is ordinary 'correct'), line, diameter, angle..
    The preponderance of hgoun, dhlonoti, toutesti may be a reflection not so much of a change in rhetoric/language generally but of the immense commentary tradition present in the corpus: all those volumes of works *about* the canon. (Just think of how big Eustathius is compared to what he comments on, and compared to, say, primary works of literature produced. Even all the sermons of course will want to comment on the only Christian canonical work.)
    So if you bracket these two categories as accidents of written culture, you can talk about the amazing persistence of lexical items through the history of Greek..(agw, poor thing, will show back up if you count upagw and phgainw..)

  3. Helma: Morphology distribution, eh? The late corpus won't be as informative there, you'd have to limit the comparison to the Classical period; and the Classical corpus that matters is already widely available. But what kind of distributions did you have in mind? I'd rather not get into collocation, it'd be a lot of work.

    The geometry is a curious absence; I'd already noticed it when I did the comparison by ranking, in the preceding post. (For "slice", I think the geometry term is "section", isn't it?)

    You're quite right about the preponderance of exegesis being the reason behind ἤγουν and δηλονότι; "exemplification" was a misleading way of putting it.

    I'm thinking of doing a Wordle of Byzantine Learnèd vs. Early Modern vs Contemporary Greek, which will really show the changes in function words. The results are predictable, but it'll be interesting to see which nouns and verbs took over most.

    No fair on double counting ἄγω, although as it turns out, I *have* been double counting those root verbs, so the instances of ἵημι do include ἀφίημι. (I reused a script that was doing that; it's a defensible preference, but it's arguably misleading here.) So ἄγω is overrepresented, and I don't think it will show up.

    And oh, ἄγω is utterly dead now, and its reduplicated aorist αγαγ- causes nothing but confusion for Modern Greek speakers. Tipoukeitos has a comment thread on how the aorist has come to be used in the present in Modern Greek for "kidnap": not απάγω, but απαγάγω. The reason is simple: the verb is being back-formed from "kidnapping", απαγωγή. (It would now sound comical to use the "correct" απάγω, though people have no fewer problems with other presents like εξάγω.)

