Monday, July 24, 2006
1 / (common sense) = animus
My post last week goofing on entropy wasn't entirely tongue-in-cheek. And it looks as if there may actually be something to it. I've updated the ETC software to experiment with the idea. The original programming selected bigrams using simple weighted selection--the more often a bigram appeared in the model, the more likely it was to be chosen for an utterance. There were a couple of reasons for that approach. First, I'd hypothesized that poetry wants to be mostly normal text, with unexpected variations here and there. Second, low-frequency bigrams seemed to trash the poetics.
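That frequency-weighted selection can be sketched roughly like this--a minimal illustration, not the actual ETC code, with a toy model and invented names standing in for whatever data structures the software really uses:

```python
import random

# Toy bigram model: each word maps to the words observed following it,
# with corpus counts. Purely illustrative data.
model = {
    "the": {"cat": 12, "dog": 7, "idea": 1},
    "cat": {"sat": 9, "ran": 3},
}

def next_word(model, word):
    """Pick a follower with probability proportional to bigram frequency,
    so common bigrams dominate the generated text."""
    followers = model[word]
    words = list(followers)
    counts = [followers[w] for w in words]
    return random.choices(words, weights=counts, k=1)[0]
```

With these counts, "the cat" comes up roughly twelve times as often as "the idea," which is what pushes the output toward "mostly normal" text.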
This low-frequency distortion turned out not to be "normal." What I was seeing was the result of tagging mistakes and anomalies in the source text (the British National Corpus): mispunctuated compound sentences, where the two words actually sat in two different sentences and so weren't a natural bigram; spelling errors; phonetically contrived attempts to capture an accent in a literary passage; and so on. A closer look at the model revealed that about 30% of the volume under the distribution curve consisted of bigrams with a frequency of one! I trashed all of those, and the output was much less distorted. Now that the model is a better representation of language as it's "normally" used, I can more easily reason about things like Barthes' poetic zone of speech within the model's constraints.
So I changed the bigram selection code to favor infrequent bigrams. That's what's running now. We'll see if the poetry is any better.
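Flipping the selection to favor rare bigrams is a small change: weight each follower by the inverse of its count instead of the count itself. A hedged sketch of one way to do it (inverse-frequency weighting is my assumption about the mechanism--the post doesn't say exactly how the new code biases the draw):

```python
import random

# Same toy bigram model as before; illustrative only.
model = {
    "the": {"cat": 12, "dog": 7, "idea": 1},
}

def next_rare_word(model, word):
    """Pick a follower with probability proportional to 1/count,
    so infrequent bigrams are favored over common ones."""
    followers = model[word]
    words = list(followers)
    weights = [1.0 / followers[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]
```

Under these weights "the idea" is now the most likely continuation, which is the kind of unexpected variation the experiment is after.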