Tuesday, October 17, 2006
Accident and essence
I learn a little something from each Etc iteration. Etc's grandfather was a Python program working against a text repository built from Jane Austen's novels (all of them). It composed single-line poems from a sequence of slots, each representing a part of speech, that roughly approximated a sentence. It took about a half hour to compose a poem, basing its semantic selection on the distribution of adjacent bigrams. These poems often used the same word at different positions in a line, e.g., "The nurse saw the nurse." Lesson learned: A word's context includes itself--trivial now, but a revelation then. And in-memory language models suck.
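In miniature, that slot-filling idea looks something like the Python sketch below. The tiny corpus, the part-of-speech lookup, and the slot template are toy stand-ins, not the original program's data.

import random
from collections import defaultdict

corpus = "the nurse saw the doctor . the doctor saw the nurse . a nurse smiled .".split()

# Count adjacent bigrams: how often word b directly follows word a.
bigrams = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

# A toy part-of-speech lexicon standing in for the tagged Austen repository.
pos = {"the": "DET", "a": "DET", "nurse": "NOUN", "doctor": "NOUN",
       "saw": "VERB", "smiled": "VERB", ".": "PUNCT"}

def fill(slots):
    """Fill a sequence of POS slots, choosing each word from its bigram
    distribution given the previous word (falling back to any word of the
    right POS when no such bigram has been seen)."""
    line, prev = [], None
    for slot in slots:
        candidates = [w for w, tag in pos.items() if tag == slot]
        weights = [bigrams[prev][w] if prev else 1 for w in candidates]
        if sum(weights) == 0:
            weights = [1] * len(candidates)
        word = random.choices(candidates, weights)[0]
        line.append(word)
        prev = word
    return " ".join(line)

print(fill(["DET", "NOUN", "VERB", "DET", "NOUN"]))  # e.g. "the nurse saw the nurse"

Nothing stops the same word from being chosen twice, which is exactly how "The nurse saw the nurse" happens.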
Etc1 (C++) took that experience and added to it a relational database--much faster. It used a phrase structure grammar with LHS=>RHS expansion rules and built a semantic association model (its "vocabulary") from the Brown corpus. To enforce some kind of semantic cohesion, it used a bag-of-words approach, selecting from its vocabulary 1,000 words associated with each other in some way (synonyms, antonyms, context, etc.). These became the subset vocabulary from which a poem could draw its words. Context was again based on adjacent bigrams. It could compose a 20-line poem in about a minute (once a bag was created, which could be serialized for reuse). Some bags of words resulted in better poems than others. After a few hundred poems, Etc1 began to repeat itself--some bigrams appeared only once, and so Etc1, once given a particular word, had no choice but to use the other (a conditional certainty?). Lessons learned: Context counts. The 1,000,000-word Brown corpus was too small to lead to semantic variation. And adjacent bigrams are too restrictive.
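Roughly, the Etc1 combination of expansion rules and a bag of words works like the sketch below (Python again for brevity; the grammar, vocabulary, and bag are invented, not Etc1's C++ code or its Brown-derived data).

import random

# Phrase structure rules: each LHS expands to one of several RHS alternatives.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}

# Full vocabulary by part of speech, as the corpus-derived model might supply it.
vocabulary = {
    "DET": ["the", "a"],
    "ADJ": ["pale", "quiet", "sudden"],
    "N":   ["river", "window", "voice", "stone"],
    "V":   ["holds", "remembers", "crosses"],
}

# The bag of words: a subset of mutually associated terms one poem may draw from.
bag = {"the", "a", "pale", "river", "window", "holds", "remembers"}

def expand(symbol):
    """Recursively expand a symbol; terminals must come from the bag."""
    if symbol in grammar:                               # non-terminal: pick an RHS
        rhs = random.choice(grammar[symbol])
        return [w for part in rhs for w in expand(part)]
    allowed = [w for w in vocabulary[symbol] if w in bag] or vocabulary[symbol]
    return [random.choice(allowed)]                     # terminal: draw from the bag

print(" ".join(expand("S")))  # e.g. "the pale window remembers the river"

Serialize the bag and you can reuse it for the next poem; change the bag and you change the poem's semantic neighborhood.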
Etc2 (C#) used the British National Corpus (100,000,000 words) and a much more refined and controlled phrase structure grammar. Bigrams were redefined as pairs of words appearing together in a sentence. Slower (because of the massive amount of data), but certainly more varied. Lots of annoying usage and mechanical errors. Lots of oddball word combinations. After a while Etc2 started to repeat itself as well--not so much semantically as structurally. Too many participials. Too many "gets to the...." Lessons learned: Tend to the surface--it really does count. And variation exists as much in formal novelty as in quirky semantics.
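The looser Etc2 bigram--co-occurrence within a sentence rather than strict adjacency--amounts to something like this (two invented sample sentences standing in for the BNC):

from collections import defaultdict
from itertools import combinations

sentences = [
    "the nurse saw the doctor in the garden".split(),
    "a quiet voice crossed the garden".split(),
]

# Count every unordered pair of words that share a sentence.
cooccur = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for a, b in combinations(set(sent), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

# "garden" is now associated with "nurse" even though the two were never
# adjacent--wider variety, but also the source of the oddball combinations.
print(sorted(cooccur["garden"]))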
So....
The word distribution model as represented in Etc1 and Etc2 is a flawed concept. A model as monolith only ever speaks itself, and so can never be more than a briefly interesting thing: once its structure has been spoken, it has nothing else to say. Structure counts not only as an expressive medium, but as an expressing one.
Etc3's design (Java) does not access a language model or define a grammar. Rather, it defines a set of rules by which language models can be instantiated (a meta-model, I suppose) and another set of rules for importing a grammar. The only thing it knows how to do is realize a terminal. It is ignorant of the actual grammar it is using to find those terminals and oblivious to whatever language model it is using at any instant. In Etc3, grammar and vocabulary will be states, not drivers. They are nowhere coded and everywhere present.
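In sketch form (Python here rather than Java, with invented class and method names, not Etc3's actual API), the shape of the idea is something like this:

import random

class Engine:
    """Knows only how to walk a grammar and realize terminals; the grammar and
    the language model are handed to it as state at run time."""
    def __init__(self, grammar, model):
        self.grammar = grammar   # imported rules: symbol -> list of RHS expansions
        self.model = model       # instantiated language model: realizes terminals

    def generate(self, symbol):
        if symbol in self.grammar:                      # structure comes from state
            rhs = random.choice(self.grammar[symbol])
            return [w for part in rhs for w in self.generate(part)]
        return [self.model.realize(symbol)]             # the one thing it knows how to do

class ToyModel:
    """Stand-in language model; a real one would be instantiated from a corpus."""
    lexicon = {"DET": ["the"], "N": ["voice", "stone"], "V": ["remembers"]}
    def realize(self, terminal):
        return random.choice(self.lexicon[terminal])

grammar = {"S": [["NP", "VP"]], "NP": [["DET", "N"]], "VP": [["V", "NP"]]}
print(" ".join(Engine(grammar, ToyModel()).generate("S")))

Hand the engine a different grammar and a different model tomorrow and it will not know the difference.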
How cool is that!