Thursday, February 09, 2006

Web 2.0 and garbage in, garbage out

In a review of Zillow, Nicholas Carr makes some great points about services that republish bad data:
Entrepreneurs are launching all sorts of sites and services that are built on data that they're siphoning out of third-party sites and databases. Sometimes, the secondhand data is good; sometimes, it's not ... Unfortunately, to the user, the inaccuracies are invisible.

There's one line of thinking about the unreliability of web information that says, essentially, "get used to it." People are just going to have to become more sophisticated consumers of information.

That's nice in theory, but it doesn't wash in the real world. It's like selling wormy apples and telling customers that they're just going to have to become more sophisticated eaters of apples. Fruit buyers don't like worms, and information seekers don't like bad facts.

It's not enough to mash up a few data streams, remix them, throw some pretty AJAX on top, and spew the result onto the web. For a product to be useful, the data has to be clean, correct, and reliable.

To take an example from Findory: people might think that crawling RSS feeds is easy, but I've been amazed by the crap people throw into their feeds. I see entire HTML web pages dumped into the description field. I see JavaScript. I see all kinds of turds. The data is dirty, dirty. Most of the code in Findory's crawl is devoted to cleaning the data.
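To give a sense of what that cleanup looks like, here is a minimal sketch in Python of one small piece of it: stripping embedded markup and script bodies out of a feed's description field, keeping only the text. This is an illustration using the standard library's HTML parser, not Findory's actual code; the names are hypothetical.

```python
from html.parser import HTMLParser

class FeedCleaner(HTMLParser):
    """Strip tags from a feed description, dropping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []       # text fragments we want to keep
        self.skip_depth = 0   # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def clean_description(html):
    """Return the plain text of a description, with whitespace normalized."""
    cleaner = FeedCleaner()
    cleaner.feed(html)
    return " ".join(" ".join(cleaner.parts).split())

dirty = '<div><script>alert("x")</script><p>Real   summary text.</p></div>'
print(clean_description(dirty))  # → Real summary text.
```

Real crawl code has to handle much worse than this (broken markup, mis-declared encodings, truncated entities), which is exactly why cleaning ends up being most of the work.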

Dirty data is useless to users. If you're not going to put the effort in to make sure your data is good, people aren't going to put the effort in to use your website.


Anonymous said...

Greg, I think your example of Findory in this context is not on the mark. Carr is talking about "incorrect" data, not about "incorrectly formatted" data.

Greg Linden said...

Good point. Carr is talking about unreliable data, data that is not useful or trustworthy.

The Findory example is about processing and filtering data to increase its value.

So, you're right that the emphasis is off. The example I gave is more about eliminating noise than about eliminating unreliable data.