From the paper:
When ... scaling entity tracking to ... the Web, resolving semantic ambiguity becomes of central importance, as many surface forms turn out to be ambiguous.
Silviu cleverly uses the high-quality, semi-structured data available from Wikipedia for this task. In addition to pages describing different entities, from which contextual clues can be extracted, Wikipedia contains redirects for different surface forms of the same entity, list pages that categorize names, and disambiguation pages that show many of the different entities for a surface form.
For example, the surface form "Texas" is used to refer to more than twenty different named entities in Wikipedia.
In the context "former Texas quarterback James Street", Texas refers to the University of Texas at Austin; in the context "in 2000, Texas released a greatest hits album", Texas refers to the British pop band; in the context "Texas borders Oklahoma on the north", it refers to the U.S. state; while in the context "the characters in Texas include both real and fictional explorers", [it] ... refers to ... [a] novel.
Wikipedia contains much more than unstructured text. Exploiting the semi-structured data -- the redirect, list, and disambiguation pages -- gives this work its power.
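To make the idea concrete, here is a toy sketch in Python of how redirect and disambiguation pages could feed a surface-form lookup. It is not the method from the paper: the scoring is a plain word-overlap count that I swapped in for illustration, and the entity labels, keyword sets, and function names (`build_surface_forms`, `disambiguate`) are all made up for this example rather than taken from Wikipedia or the paper.

```python
from collections import defaultdict

# Hypothetical stand-ins for Wikipedia's semi-structured pages.
# Redirect pages: alternate surface form -> entity.
REDIRECTS = {"UT Austin": "University of Texas at Austin"}

# Disambiguation pages: surface form -> candidate entities.
DISAMBIGUATION = {
    "Texas": [
        "Texas (U.S. state)",
        "Texas (band)",
        "Texas (novel)",
        "University of Texas at Austin",
    ],
}

# Hand-picked keywords standing in for context mined from entity pages.
ENTITY_CONTEXT = {
    "Texas (U.S. state)": {"state", "borders", "oklahoma", "north"},
    "Texas (band)": {"band", "album", "released", "hits", "pop"},
    "Texas (novel)": {"novel", "characters", "fictional", "explorers"},
    "University of Texas at Austin": {"university", "quarterback", "austin"},
}


def build_surface_forms(redirects, disambiguation):
    """Merge redirects and disambiguation pages into surface form -> entities."""
    forms = defaultdict(set)
    for alias, target in redirects.items():
        forms[alias].add(target)
    for name, entities in disambiguation.items():
        forms[name].update(entities)
    return forms


def disambiguate(surface, context, forms, entity_context):
    """Pick the candidate whose keywords overlap most with the context words."""
    words = {w.strip('.,"').lower() for w in context.split()}
    candidates = sorted(forms.get(surface, ()))
    if not candidates:
        return None
    return max(candidates, key=lambda e: len(words & entity_context.get(e, set())))


FORMS = build_surface_forms(REDIRECTS, DISAMBIGUATION)
print(disambiguate("Texas", "Texas borders Oklahoma on the north", FORMS, ENTITY_CONTEXT))
```

Run on the four contexts quoted above, this toy picks a different entity for each one, but only because the keyword sets were chosen by hand to match them. A real system would have to extract the context from the entity pages themselves.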
For a quick overview of one of the many ways this kind of named entity data could be applied, do not miss the screenshot in Figure 3 on page 6 of the paper. It shows a prototype that annotates a web page with pop-ups on the proper names to disambiguate their meaning.
As fun as this paper is, what really excites me is that this is one of many recent research projects that are cleverly using Wikipedia to attack challenging problems. There is little doubt that, deep in the Wikipedia pages, there is much buried treasure, if we can just figure out how to look.