Tuesday, June 20, 2006

 

Why aesthetic language generation (ALG) is hard


A lot of the problems of natural language generation (NLG) have been solved. NLG systems are almost always concerned with specific application domains whose vocabularies are small and about which relatively few kinds of messages need to be articulated. But aesthetic language generation wants to range over an entire language (in etc's case, English). So we need statistics. A small domain language can be completely analyzed for its grammatical and idiomatic conventions, which can be identified and articulated in rules specific to that domain. But it's impossible to analyze all of English. So instead, we model it. Etc's model is essentially the distribution of the frequencies of various kinds of bigrams.
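To make that concrete: the simplest possible version of such a model just counts adjacent token pairs and normalizes. A toy sketch in Python (not etc's actual code, and the corpus is made up):

    from collections import Counter, defaultdict

    def bigram_model(tokens):
        # estimate P(next word | current word) from adjacent pairs
        pair_counts = Counter(zip(tokens, tokens[1:]))
        first_counts = Counter(tokens[:-1])
        model = defaultdict(dict)
        for (w1, w2), count in pair_counts.items():
            model[w1][w2] = count / first_counts[w1]
        return model

    corpus = "the rain rose and the river rose".split()
    print(bigram_model(corpus)["the"])   # {'rain': 0.5, 'river': 0.5}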

But "all words are rare." For the model to have any shot at reasonable depicting the langauge, it needs to be able to determine that various forms of a given word have the same base. For example, the model should identify rise and rising as the same word. Easy, you say, just stem it. Right. Sounds like a solved problem.

It isn't. Information retrieval relies heavily on stemming to compare a vector of search terms to some set of vectors in candidate documents. Rise in the search terms should line up with rising in the candidates. But the standard stemming routines would stem both rise and rising to ris. It doesn't matter that ris is not a word; IR doesn't care. It's only interested in the simplest thing that matches.
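Here's a deliberately naive suffix stripper in that IR spirit (a sketch, not any particular published algorithm), and it happily files rise and rising under the non-word ris:

    def naive_stem(word):
        # strip the longest matching suffix; the result need not be a word
        for suffix in ("ing", "ed", "es", "s", "e"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                return word[:-len(suffix)]
        return word

    print(naive_stem("rise"))    # ris
    print(naive_stem("rising"))  # ris, same key, so the vectors line up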

But not in ALG. Riddle me this, poet programmers: how would you code a general method to properly stem these participles? (One stab at an answer follows the list.)

rising
betting
kissing
dying
panicking
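
Here's one possible answer, a sketch and not etc's actual method: undo each spelling rule to generate candidate bases, then keep the first candidate found in a lexicon, with an exception list for the truly irregular cases. The tiny LEXICON and EXCEPTIONS below are hypothetical stand-ins for a real word list:

    LEXICON = {"rise", "bet", "kiss", "die", "panic"}
    EXCEPTIONS = {"dying": "die", "lying": "lie", "tying": "tie"}

    def departiciple(word):
        # recover the base verb from an -ing participle
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        if not word.endswith("ing"):
            return word
        stem = word[:-3]
        candidates = [stem, stem + "e"]          # kissing -> kiss, rising -> rise
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])         # betting -> bet
        if stem.endswith("ck"):
            candidates.append(stem[:-1])         # panicking -> panic
        for base in candidates:
            if base in LEXICON:
                return base
        return stem                              # give up; may be a non-word

    for w in ("rising", "betting", "kissing", "dying", "panicking"):
        print(w, "->", departiciple(w))

The catch is that the method now leans on a lexicon and an exception list, which is exactly where the hand labor comes back in.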

Just one of scores of problems a poetry machine has to solve.

Comments:
Stemming and morphological decomposition are really two different things, although most linguists talk about them interchangeably. From a computational linguistics point of view, traditional stemmers are like those you mention, similar to the Porter stemming algorithm: they create "root" forms which are not necessarily the real base form of the word, or a real word at all (e.g. stories -> stori), but a sufficient unit to help in document retrieval (e.g. stemming of search terms used in search engines).
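To see that behavior directly, NLTK ships a Porter implementation (this assumes NLTK is installed; exact outputs can vary a little between versions):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    for word in ("stories", "rising", "kissing"):
        print(word, "->", stemmer.stem(word))
    # stories -> stori: a retrieval key, not a word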

There are relatively simple heuristic rules that can be applied to get at the base form of the words you list. These rules, in conjunction with a lenghty exception list, can help create morphological analyzers which can produce pretty good results. One difficulty is in the disambiguation part, how to identify the proper form of a particular word, for example, being->to_be (verb) vs. being (noun).

WordNet has a morphological component which is pretty simple and does an OK job at getting to the "root". A Java version can be found here.
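The WordNet component is called morphy, and NLTK's WordNet interface is one easy way to try it (assuming NLTK and its WordNet data are installed):

    from nltk.corpus import wordnet as wn

    # morphy applies WordNet's suffix-detachment rules plus its exception lists
    print(wn.morphy("rising", wn.VERB))   # should print: rise
    print(wn.morphy("beings", wn.NOUN))   # should print: being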

I just found this link today: this software contains an English dictionary with words and their base forms, along with semantic and inflectional data.


I've written a basic morphological package for nouns and verbs for the GTR Language Workbench. The code for nouns transforms between singular and plural forms; verbs are more complex due to the various tenses, etc. Currently it can only transform from base verb forms to the present participle and past/past participle. More transitions are planned for the future. This code will be released in the next month or two.
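Since that code isn't released yet, here is only a guess at the kind of spelling rules such a base-to-participle transform needs; real English also needs stress information or an exception list (visit -> visiting, not visitting):

    VOWELS = set("aeiou")

    def present_participle(verb):
        # base form -> -ing form by common English spelling rules
        if verb.endswith("ie"):                                  # die -> dying
            return verb[:-2] + "ying"
        if verb.endswith("e") and not verb.endswith("ee"):       # rise -> rising
            return verb[:-1] + "ing"
        if verb.endswith("c"):                                   # panic -> panicking
            return verb + "king"
        if (len(verb) >= 3 and verb[-1] not in VOWELS and verb[-1] not in "wxy"
                and verb[-2] in VOWELS and verb[-3] not in VOWELS):
            return verb + verb[-1] + "ing"                       # bet -> betting
        return verb + "ing"                                      # kiss -> kissing

    for v in ("rise", "bet", "kiss", "die", "panic"):
        print(v, "->", present_participle(v))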

I've also been collecting language computing publications on the web relevant to aesthetic text generation. There are a few interesting pubs on morphology; you can find them here.
 