The way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons.

This reminds me a bit of what Peter said in some of his recent talks:
The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective.
Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Learning from big data is what Google's infrastructure was built to do. It is particularly obvious in their results in machine translation, but also impacts everything they do, from search to ad targeting to personalization.
Worry about the data first before you worry about the algorithm.
Having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys ... It's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days.
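Peter's "ten times more training data" point is easy to demonstrate for yourself. The sketch below is my own illustration, not anything from the interview: it starves a "better" classifier (logistic regression) of data while a simpler one (naive Bayes) sees progressively larger slices. The dataset (20 Newsgroups via scikit-learn), the models, and the slice sizes are all assumptions made for the example.

```python
# A rough illustration of the "more data beats a better algorithm" effect.
# Dataset, models, and slice sizes are illustrative assumptions only.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = CountVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# The "better" algorithm, starved of data: fit once on 500 examples.
lr = LogisticRegression(max_iter=1000).fit(X_train[:500], train.target[:500])
lr_acc = accuracy_score(test.target, lr.predict(X_test))

# The "worse" algorithm, fed progressively more data.
for n in (500, 2_000, 8_000, X_train.shape[0]):
    nb = MultinomialNB().fit(X_train[:n], train.target[:n])
    nb_acc = accuracy_score(test.target, nb.predict(X_test))
    print(f"n={n}: naive Bayes {nb_acc:.3f} vs. logistic regression@500 {lr_acc:.3f}")
```

In runs like this, the simpler model trained on the larger slices typically overtakes the capped one, which is the whole argument: worry about the data first.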
The rest of the interview is worth reading. Unfortunately, it is appended to yet another hype piece on vaporware from PowerSet, but just ignore the first part of the article and get to the good stuff at the end.
Update: Matthew Hurst has some interesting thoughts after reading Peter's interview:
The huge redundancy in ... documents suggests approaches to serving the user that don't require the perfect analysis of every document.

Update: I also liked the challenge Matthew Hurst described in this later post:
The basic [paradigm] of text mining ... the one document at a time pipeline ... is limiting. It fails to leverage redundancy ... [and assumes] that perfection is required at every step.
The key to cracking the problem open is the ability to measure, or estimate, the confidence in the results ... Given 10 different ways in which the same information is presented, one should simply pick the results which are associated with the most confident outcome - and possibly fix the other results in that light.
A more sophisticated search engine would be explicit about ambiguity (rather than let the user and documents figure this out for themselves) and would recognize ambiguity, take information from many sources to resolve it, and synthesize results.
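Matthew's confidence idea is simple to sketch. The toy code below is my own illustration under assumed types, not anything from his post: each redundant document yields an extraction with a confidence score, agreeing extractions pool their confidence, and the highest-scoring value wins, which also "fixes" the low-confidence outliers in that light.

```python
# A toy sketch of confidence-based resolution over redundant extractions.
# The Extraction type, resolve() helper, and example facts are all
# hypothetical, invented for illustration.
from collections import defaultdict
from typing import NamedTuple

class Extraction(NamedTuple):
    entity: str        # what the fact is about
    attribute: str     # which field was extracted
    value: str         # the extracted value
    confidence: float  # extractor's estimate, 0.0-1.0

def resolve(extractions):
    """Group redundant extractions of the same fact, sum confidence
    across documents that agree, and keep the best-supported value."""
    votes = defaultdict(float)
    for e in extractions:
        votes[(e.entity, e.attribute, e.value)] += e.confidence
    best = {}
    for (entity, attribute, value), score in votes.items():
        key = (entity, attribute)
        if key not in best or score > best[key][1]:
            best[key] = (value, score)
    return {key: value for key, (value, _) in best.items()}

# Ten documents present the same fact in different ways; one extractor
# run misparses it confidently. Pooled agreement outvotes the outlier.
docs = [Extraction("PowerSet", "headquarters", "San Francisco", 0.6)] * 9
docs += [Extraction("PowerSet", "headquarters", "San Jose", 0.9)]
print(resolve(docs))  # {('PowerSet', 'headquarters'): 'San Francisco'}
```

The design point is that no single document needs a perfect analysis: summing confidence across agreeing sources lets redundancy do the work that per-document precision cannot.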