An extended excerpt:
The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements. The branding elements tend to be deliberately replicated across the pages of individual sites, creating a "noise level" of similarity among pages of the same site.The paper gives an example of a generating a spot signature where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".
SpotSigs provides a robust scheme for extracting characteristic signatures from Web documents, thus aiming to filter natural-language text passages out of noisy Web page components ... [It] is much more efficient, easier to implement, and less error-prone than ... layout analysis .... [and] we are able to show very competitive runtimes to the fastest known but more error-prone schemes like I-Match and Locality Sensitive Hashing (LSH).
The points (or "spots") in the page at which spot signatures are generated are all the locations where ... anchor words [occur]. We call the anchor words antecedents, which are typically chosen to be frequent within the corpus. The most obvious, largely domain-independent choices ... are stopwords like is, the, do, have, etc. ... which are distributed widely and uniformly within any snippet of natural-language text.
A spot signature ... consists of a chain of words that follow an antecedent word .... Spot signatures ... [focus on] the natural-language passages of Web documents and skip over advertisements, banners, and navigational components.
What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution. Rather than taking on the brutal task of tearing apart the page layout to discard ads, navigation, and other goo, they just noted that the most important part of the page tends to be natural language text. By starting all their signatures at stopwords, they naturally focus the algorithm on the important parts of the page. Very cool.
If you can't get enough of this topic, please see also my previous post, "Detecting near duplicates in big data", that discusses a couple papers out of Google that look at this problem.
Update: Martin Theobald posted a detailed summary of the paper on the Stanford InfoBlog.