Tuesday, December 26, 2006
LTAG, you're it!
Consider this recent poem from the ETC Web site:
Busy
Tooth is a skirt.
Acquaintance and the crowd.
Worn, dines Jim.
The meddling intellect allows short-term fluctuation.
Wines in three good years.
Considerates in mathematics.
Wines hand.
Considerates.
I am no major threat.
This sample illustrates several problems in aesthetic text generation. In the first line, tooth lacks a leading determiner. Obviously, a better (and more "correct") line would be a tooth is a skirt. The problem is that ETC2's phrase structure grammar, being context-free, can't distinguish between nouns that take determiners and those that don't. Courage is a skirt and A tooth is a skirt both work, but A courage is a skirt and Courage is skirt don't.
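Here is a minimal sketch of the overgeneration problem (a toy grammar of my own, not ETC2's): because the context-free NP rule offers both expansions to every noun, "a courage" and bare "tooth" come out just as readily as the grammatical alternatives.

```python
# Toy CFG, purely illustrative -- not ETC2's grammar. The NP rule cannot
# say which nouns need a determiner, so the generator happily produces
# strings like "a courage is skirt".
import random

cfg = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],      # both expansions apply to every noun
    "VP":  [["V", "NP"]],
    "Det": [["a"]],
    "V":   [["is"]],
    "N":   [["tooth"], ["skirt"], ["courage"]],
}

def generate(symbol="S"):
    """Expand a symbol with a random production; unknown symbols are terminals."""
    if symbol not in cfg:
        return [symbol]
    production = random.choice(cfg[symbol])
    return [word for part in production for word in generate(part)]

for _ in range(4):
    print(" ".join(generate()))        # e.g. "courage is a skirt", "a courage is skirt"
```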
The difference is that courage is an abstract noun, while tooth and skirt are concrete. ETC2 tries to normalize determiners at runtime by looking at the frequency distributions of the determiners associated with each noun. That leaves room for all kinds of errors. It won't work well if the parsing algorithm doesn't. It won't work well if the frequencies are skewed because some words' contexts are over-represented in the concordance. And since no parsing algorithm is flawless and since every body of text misrepresents the language as a whole ("all words are rare"), there will always be errors.
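For illustration, here is roughly what that runtime normalization might look like; the function names and tag set are my invention, not ETC2's code.

```python
# Rough sketch of runtime determiner normalization: tally which determiner
# (or none) each noun appears with in the concordance, then pick the most
# frequent one when generating. Illustrative only -- not ETC2's code.
from collections import Counter, defaultdict

det_counts = defaultdict(Counter)      # noun -> Counter({"a": n, "the": n, "": n})

def observe(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs from the POS tagger."""
    prev_word, prev_pos = "", ""
    for word, pos in tagged_sentence:
        if pos.startswith("NN"):
            det = prev_word.lower() if prev_pos == "DT" else ""
            det_counts[word.lower()][det] += 1
        prev_word, prev_pos = word, pos

def normalize_determiner(noun):
    counts = det_counts.get(noun.lower())
    if not counts:
        return ""                       # unseen noun: no evidence, no determiner
    return counts.most_common(1)[0][0]  # skewed counts give skewed answers

observe([("Courage", "NN"), ("is", "VBZ"), ("rare", "JJ")])
observe([("a", "DT"), ("tooth", "NN"), ("aches", "VBZ")])
print(normalize_determiner("tooth"), "|", normalize_determiner("courage"))
```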
And so Erica realizes that, at this stage of her development, she is no major threat. Making her a threat becomes possible through a lexicalized tree adjoining grammar (LTAG). I posted about basic TAGs last month. LTAG goes a step further and allows categories of words (right down to individual words) in the grammar, a trick that makes the formalism mildly context-sensitive. Take seem as an example. Ought to be easy, right? Not. It can't be used in an expansion such as VP -> V NP, because it's intransitive (copulative, actually). So just define separate word categories for transitive and intransitive verbs. Still not. Be is the ultimate intransitive. So She is beautiful and She is a friend both work. She seems beautiful is OK. But She seems a friend is not (or only marginally--the sentence really wants to be She seems to be a friend). More interesting: She is bleeding works, but She seems bleeding doesn't. Seem wants real adjectives, not participles, which suggests that is bleeding is really a progressive tense, not a predicate built on an inflection of to be at all.
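To make the lexicalization concrete, here is one toy way to encode lexicalized elementary trees (my illustration, not the ETC2 schema): anchoring a tree on seem lets the grammar license adjective complements while withholding the NP and participle complements that be allows.

```python
# A toy encoding of lexicalized elementary trees -- illustrative only, not
# the ETC2 schema. Because each tree is anchored on a word, "seem" can be
# restricted to adjective complements while "be" keeps NP and participle
# complements too.
from dataclasses import dataclass

@dataclass
class ElementaryTree:
    anchor: str                      # the word the tree is built around
    root: str                        # root category, e.g. "VP"
    complements: list                # substitution sites, marked with "!"
    requires_adjoining: bool = False

lexicon = {
    "be":   [ElementaryTree("be", "VP", ["AP!"]),       # "She is beautiful"
             ElementaryTree("be", "VP", ["NP!"]),       # "She is a friend"
             ElementaryTree("be", "VP", ["VP-ing!"])],  # "She is bleeding"
    "seem": [ElementaryTree("seem", "VP", ["AP!"], requires_adjoining=True)],
}

def trees_for(word, complement_cat):
    """Select only the lexicalized trees whose substitution site matches."""
    return [t for t in lexicon.get(word, []) if complement_cat + "!" in t.complements]

print([t.anchor for t in trees_for("seem", "NP")])   # []       -- *She seems a friend
print([t.anchor for t in trees_for("seem", "AP")])   # ['seem'] -- She seems beautiful
```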
LTAG offers a way past all of this. Just give seem its own node in the grammar and tag it as requiring adjoining. Then define substitution trees for seem and feed the result into the adjoining algorithm. And SHAZZAM, it works! (It really does--we got it working last week.)
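For readers who want to see the machinery, here is a bare-bones sketch of substitution and adjoining over nested (label, children) tuples; it is a hand-rolled illustration, not the production algorithm we got working last week.

```python
# Minimal substitution and adjunction over (label, children) tuples.
# Purely illustrative: the "!" substitution marker and "*" foot marker are
# my own conventions, not ETC2's.

def substitute(tree, site_label, subtree):
    """Plug a complete tree into the first leaf labeled site_label + '!'."""
    label, children = tree
    if not children and label == site_label + "!":
        return subtree
    return (label, [substitute(c, site_label, subtree) for c in children])

def _replace_foot(aux_tree, displaced):
    """The auxiliary tree's foot node (label ending in '*') receives the
    subtree that adjunction displaced."""
    label, children = aux_tree
    if not children and label.endswith("*"):
        return displaced
    return (label, [_replace_foot(c, displaced) for c in children])

def adjoin(tree, node_label, aux_tree):
    """Splice aux_tree in at the first node labeled node_label."""
    label, children = tree
    if label == node_label:
        return _replace_foot(aux_tree, (label, children))
    return (label, [adjoin(c, node_label, aux_tree) for c in children])

# Classic raising example: adjoin the "seems" auxiliary tree at the VP of
# "she to be a friend" to get "she seems to be a friend".
base = ("S", [("NP", [("she", [])]),
              ("VP", [("to", []), ("be", []), ("NP", [("a friend", [])])])])
aux = ("VP", [("seems", []), ("VP*", [])])       # VP* is the foot node
print(adjoin(base, "VP", aux))
```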
There are some complications in getting there, for sure. We have to capture more attributes for most word categories. We have to weight tree selections so that rarely used constructions don't become as legitimate as frequently used ones. And these tasks require more analytic intervention in the part-of-speech tagging process and a more complex database schema. But keeping the semantic model small makes this manageable.
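As an illustration of the weighting step (the numbers and tree names below are made up), selection could be as simple as frequency-proportional sampling:

```python
# Frequency-weighted tree selection -- the weights and tree names are
# invented for illustration; ETC2's actual weighting scheme may differ.
import random

weighted_trees = [
    ("VP -> be AP",   0.55),   # "She is beautiful"
    ("VP -> be NP",   0.40),   # "She is a friend"
    ("VP -> seem AP", 0.05),   # rarer construction, so it surfaces less often
]

def pick_tree(options):
    """Sample a tree in proportion to its corpus frequency."""
    trees, weights = zip(*options)
    return random.choices(trees, weights=weights, k=1)[0]

print(pick_tree(weighted_trees))
```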
The really neat thing about an LTAG implementation, however, is that it means we can deploy the monster and let its anomalies surface through use. When we find one, all we have to do is define whatever lexicalized trees are needed and slam them into the grammar (stored in the DB). No new code. No recompilation. No redeployment. Sweet!
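To show why that works, here is a sketch of grammar-as-data (sqlite3 and the table layout are my assumptions; the post doesn't name the DBMS): covering a new anomaly is an INSERT plus a reload, with no recompile and no redeploy.

```python
# Sketch of the deployment win: the lexicalized trees live in a database
# table, so adding one is just an INSERT. sqlite3 and this schema are
# assumptions for illustration, not the actual ETC2 setup.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE elementary_tree (
                 anchor TEXT, root TEXT, frame TEXT, requires_adjoining INTEGER)""")
db.execute("INSERT INTO elementary_tree VALUES ('seem', 'VP', 'AP!', 1)")

def load_grammar(conn):
    """Re-read the grammar at runtime; new trees take effect immediately."""
    rows = conn.execute("SELECT anchor, root, frame, requires_adjoining "
                        "FROM elementary_tree").fetchall()
    return {anchor: (root, frame, bool(adj)) for anchor, root, frame, adj in rows}

print(load_grammar(db))
```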