The number of grammatical English sentences is theoretically infinite ... However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need.

The article talks in more detail about work at Google and elsewhere on extracting relationships from massive crawls of text, tables, and the deep web.
We're left with ... interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it .... What we need are methods to infer relationships between ... entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data.
Unlabeled data ... is so much more plentiful than labeled data ... With very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there.
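The idea of "tying together the words that are already there" can be illustrated with a toy sketch (my own illustration, not code from the article): from a small unlabeled corpus, simply count which words co-occur in the same sentence, and use those counts as a crude relatedness signal, with no labeled data at all.

```python
from collections import defaultdict
from itertools import combinations

# Toy unlabeled "corpus": raw sentences, no labels or annotations.
corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "france borders germany",
    "tourists visit paris and berlin",
]

# Count how often each pair of words appears in the same sentence.
cooccur = defaultdict(int)
for sentence in corpus:
    words = set(sentence.split())
    for a, b in combinations(sorted(words), 2):
        cooccur[(a, b)] += 1

def related(word, top=3):
    """Words most often seen alongside `word`: a crude relatedness signal."""
    scores = defaultdict(int)
    for (a, b), n in cooccur.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(related("paris"))
```

With billions of sentences instead of four, the same trivial counting starts to surface real associations (paris-france, paris-berlin), which is the article's point about the unreasonable effectiveness of sheer data volume.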
On a related note, Google announced some new features a couple of days ago, improved query suggestions and snippets, which Googler Ori Allon apparently described as scanning pages "in real-time ... after a query is entered" and identifying "conceptually and contextually related sites/pages" using "an 'understanding' of content and context." Many news articles are referring to this as a step toward semantic search.
Please see also my April 2008 post, "GoogleBot starts on the deep web", which discusses related work by Alon Halevy on mining data in tables and the deep web.
Please see also my post on the WSDM 2008 keynote by Oren Etzioni on semantic interpretation. His work is mentioned a few times by Halevy et al.
[IEEE article found via the Google Research Blog]