Monday, June 19, 2006

 

Does (the) language matter?


Last week I talked a little about starting a new etc version. This is etc's 4th iteration. I wrote the first version (as a course project) in Python and used Jane Austen's novels as the source text for the language model. Python lends itself quite well to NLP. It has really strong string handling and regular expression capabilities. And it has enormously powerful collections and indexing features. But it's slow. It's interpreted and dynamically typed (variables carry no declared type, which sounds crazy but is actually quite useful), both of which contribute to degraded performance.

This first etc could only compose single-line poems. And it took forever to load (about 30 minutes) because I rebuilt the language model each time and maintained it in memory. For all its flaws, however, it established the basic architecture for all the etc's to come: A statistical language model attached to an analytic transformational grammar.
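To make the "statistical language model held in memory" idea concrete, here is a minimal sketch in Python. It is not etc's actual code: the bigram approach, the function names `build_bigram_model` and `compose_line`, and the one-sentence sample are all assumptions made for illustration, and the grammar half of the architecture is left out entirely.

```python
import random
import re
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count, for each word, which words follow it in the source text."""
    words = re.findall(r"[a-z']+", text.lower())
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model

def compose_line(model, start, length=6, seed=None):
    """Walk the model from `start`, sampling each next word by frequency."""
    rng = random.Random(seed)
    line = [start]
    while len(line) < length:
        followers = model.get(line[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        choices, weights = zip(*followers.items())
        line.append(rng.choices(choices, weights=weights)[0])
    return " ".join(line)

# Hypothetical stand-in for the source text: one Austen sentence, not a corpus.
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

model = build_bigram_model(sample)
print(compose_line(model, "it", length=8, seed=1))
```

Rebuilding `model` from the full novels on every launch is the kind of work that would make startup slow; saving the counts to disk or a database once is the obvious alternative.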

For Version 2, for my thesis, I used C++ and stored the language model (based on the Brown Corpus and WordNet) in SQL Server. C++ was my native language at the time and SQL Server came free to me (as part of the University's site license). C++ makes for wicked-fast executables. This version showed that it was possible to develop software that generated cohesive compositional structures. (Most of Erica's published works were edited versions of these poems.) But the work was very rough: there was no stemming (which made the problem of word rarity even worse), and surface realization was primitive at best. And I learned that the performance problems were not in the code, but in the DB--etc is I/O bound, not CPU bound. But I was on the right track.
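Why missing stemming worsens word rarity can be shown with a toy example. The sketch below is an assumption-laden illustration, not what any etc version did: `crude_stem` is a hypothetical suffix-stripper invented here, far cruder than a real stemming algorithm.

```python
from collections import Counter

# Illustrative suffix list only; a real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "walk walks walked walking".split()

raw_counts = Counter(words)                             # four forms, one count each
stemmed_counts = Counter(crude_stem(w) for w in words)  # one form, counted four times

print(raw_counts)
print(stemmed_counts)
```

Without stemming, each inflected form looks like a separate rare word; with it, the evidence pools under a single entry.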

The current version's language model uses the British National Corpus and the grammar is in C#. C# because I wanted to learn it. The BNC solved problems with repetitiveness, but at the price of even worse performance and the stylistic problems I outlined in my last post. C# is an excellent language for aesthetic text generation. Excellent string handling and a ton of useful collection and indexing libraries. (And Microsoft's IDE is the gold standard.) But in spite of that I'm writing this latest version in Java, with MySQL as the DB. Lots of reasons for this.

My peers in electronic writing tend to shy away from MS, and sharing code (and thoughts embodied in code) is a lot more difficult when programmers are speaking different languages. I want to get better at Java. But, if the truth be told, MS is starting to annoy me. (Full disclosure: When I was a consultant, I wanted things to be hard; otherwise people wouldn't need consultants. The harder the better. And I was quite grateful to MS for their buggy code--again, folks needed competent consultants to build workarounds and the hourly rate just kept on going up. That's how I got my motorcycle.) But now that I'm retired from consulting, I want things to be easy (or easier). And though the non-Microsofts of the world haven't yet figured out what MS has (that the GUI is just about the only thing most users care about), there are fewer bugs and much less nonsense.

Java is not quite as good a language as C# (expected, since C# is a revised clone of Java and MS could correct weaknesses) and Sun's documentation is bad (Javadoc is a really flawed concept). But Sun's IDE is catching up to MS, and Java is as much a philosophy as a language, so I'm quite encouraged. And JDBC is a dream come true.

Java is interesting in its connotation. Whereas C# phrasing connotes lightness and agility, Java code (to me at least) connotes an odious, contemplative, and dark set of mysteries. Kind of cool.

What I aim to find out is whether Java leads to better poetry. Wouldn't that make Scott McNealy wish he hadn't quit quite so soon!

Comments:
I have seen more Python NLP projects around, but the more mature ones are either in C/C++ or Java. I work in Java and I like the language a lot: the tools are good (the Eclipse IDE is excellent), it has interesting features, and it is mature and stable.

I don't think there is anything inherent in Java in particular that makes it more suitable for NLP tasks than Python or C#.

Java has a nice collection of open source numerical computing, statistics, and machine learning toolkits, high quality visualization/graphing libraries, enterprise frameworks (enabling scalability), and two UI toolkits to choose from (SWT/Swing), which make it a more appealing choice as a base for a new linguistic computing application.

Two of the more exciting open source NLP apps (the GATE project and Stanford's NLP parser) are both in Java. Momentum for this language in particular just seems to be increasing.