Tuesday, November 25, 2003
Just to log a little bit about the state of the art here, so that people outside of this world can understand how far we still are from really getting there.
I was testing ANNIE today. It is an open-source natural language processing (NLP) system developed by the Natural Language Processing Research Group at the University of Sheffield. In theory, the demo you can run highlights the words of a given type. You can choose to highlight names, places, organizations, dates, addresses, money, and percent values in a given webpage.
I decided to run it on my blog, because it is fairly diverse text that contains different instances of each of the first four elements. It ran pretty well, with some exceptions. But some of these exceptions are pretty bad! If I had to analyze some text with this software, I know I would have to read the whole text myself to make sure that it hadn't made very strange mistakes or left out important elements. One example is the name Malcom Arnold: it only got "Arnold" as the name. It skipped Pamie's name altogether. It thinks that "Director-" is a place, and that "Concert" (from "Concert Chorale") and "independent" are organizations (the last one was very weird)!
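To make this kind of checking less manual, here is a minimal sketch in Python of how one could sort a tagger's output into exact, partial, missed, and spurious entities. This is not ANNIE's API or anything from the demo: the function name compare_entities and the tiny hand-labeled "expected" list are assumptions made up for illustration, using only the examples mentioned above.

```python
def compare_entities(expected, predicted):
    """Sort tagger output into exact, partial, missed, and spurious entities.

    Both arguments are lists of (text, label) pairs. This is an illustrative
    helper, not part of ANNIE or any other NLP toolkit.
    """
    report = {"exact": [], "partial": [], "missed": [], "spurious": []}
    matched = set()  # indexes of predictions already paired with an expected entity

    for text, label in expected:
        found = None
        for i, (p_text, p_label) in enumerate(predicted):
            if i in matched or p_label != label:
                continue
            if p_text == text:
                found = ("exact", i)
                break
            # A prediction that covers only part of the expected text
            if p_text in text and found is None:
                found = ("partial", i)
        if found is None:
            report["missed"].append((text, label))
        else:
            kind, i = found
            matched.add(i)
            report[kind].append((text, label))

    # Anything the tagger produced that matched nothing we expected
    report["spurious"] = [p for i, p in enumerate(predicted) if i not in matched]
    return report


# Hypothetical hand-labeled entities, limited to the names discussed in the post
expected = [("Malcom Arnold", "name"), ("Pamie", "name")]
# What the demo highlighted, according to the post
predicted = [("Arnold", "name"), ("Director-", "place"),
             ("Concert", "organization"), ("independent", "organization")]

report = compare_entities(expected, predicted)
for kind, items in report.items():
    print(kind, items)
```

With these inputs, "Malcom Arnold" lands under partial, "Pamie" under missed, and the other three highlights under spurious. Of course, somebody still has to write the expected list by hand, which is exactly the reading-the-whole-text problem.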
Bottom line: we are still far away from good NLP, even though lots of people have been studying it for a long time already. Some pretty good people got into this field too, but I guess language is just too hard. Why did our ancestors create such a complicated thing to use for communication?
posted by Michel |
7:12 AM