Monday, August 06, 2007

Google Search API for research

The Google Search API -- which, after deteriorating badly, was shut down about a year ago -- has been reborn in a version only available to researchers, the University Research Program for Google Search.

The promise of the API looks attractive at first. Google says:
The University Research Program for Google Search is designed to give university faculty and their research teams high-volume programmatic access to Google Search.

Our aim is to help bootstrap web research ... [Using] Google's search technology, you'll no longer have to operate your own crawl and indexing systems.
However, the limits on how the API can be used may be a problem. From the documentation:
Requests to the service MUST be throttled ... A time period of at least one second must be allowed between requests.
This makes some interesting types of research impossible with this API: anything that needs to fire off many queries quickly.

For example, let's say I am working on a technique for query expansion, so I want results not only for the search given, but also for tens of other related searches, which I will then combine. With a one-second delay between queries, my research prototype will take tens of seconds to respond, making it no longer interactive.
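To make that arithmetic concrete, here is a rough sketch of a query-expansion loop under the mandated throttle. Everything here is hypothetical: `search` stands in for whatever call the research API actually exposes, and `sleep` is injectable only so the delay can be simulated.

```python
import time

QUERY_DELAY = 1.0  # seconds; the documentation's minimum gap between requests


def expanded_search(search, original_query, related_queries, sleep=time.sleep):
    """Issue the original query plus each expansion, one second apart.

    `search` is a stand-in for the API call (hypothetical). With N related
    queries, the mandated sleeps alone add N seconds of wall-clock time.
    """
    results = [search(original_query)]
    for q in related_queries:
        sleep(QUERY_DELAY)  # required throttle before each follow-up request
        results.append(search(q))
    return results


def minimum_latency(num_related, delay=QUERY_DELAY):
    # Lower bound on response time from the throttle alone,
    # before any network or processing costs.
    return num_related * delay
```

With 30 related searches, `minimum_latency(30)` is already 30 seconds, which is why the prototype stops being interactive.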

Nor can I try out some natural language analysis for question answering where I first get the results for the search given, then look at the results, then fire off dozens of additional queries to learn more about what I found in those results.

I cannot even do something that uses conditional probabilities of finding two words together versus finding them apart on the Web as part of the analysis, since each of those estimates requires two queries to the search engine and many of them might be required.
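That kind of co-occurrence analysis can be sketched as a pointwise-mutual-information-style score over hit counts. This is only an illustration: `hit_count` is a hypothetical stand-in for a search API call returning an estimated result count, and `total_pages` is an assumed estimate of the size of the index.

```python
import math


def cooccurrence_score(hit_count, word_a, word_b, total_pages):
    """PMI-style association score for two words from Web hit counts.

    `hit_count(query)` is a hypothetical API call returning an estimated
    result count. Each word pair needs queries for the phrase and for each
    word alone, so scoring many candidate pairs means many throttled requests.
    """
    p_a = hit_count(word_a) / total_pages
    p_b = hit_count(word_b) / total_pages
    p_ab = hit_count(f'"{word_a} {word_b}"') / total_pages
    if p_a == 0 or p_b == 0 or p_ab == 0:
        return float("-inf")  # no evidence of co-occurrence
    return math.log(p_ab / (p_a * p_b))
```

At one request per second, scoring even a hundred candidate pairs this way takes minutes, which is the problem being described.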

It is good that Google is making tools available to researchers, but they may have to go further than a throttled search API. As is, many researchers trying to work at large scale still will have to build their own crawls and indexes.

By the way, it is not entirely fair to pick on Google's search API here. Other search APIs -- including the Yahoo Search API, the Microsoft Live Search API, and the Alexa Web Search Platform -- either charge a fee or impose similar throttling.

3 comments:

erik said...

Their policy sounds reasonable - all of your examples are solvable by querying ahead of time and caching results locally.

If a researcher has a requirement for his project to respond in real time for any given input, it sounds like he's no longer researching but instead demoing (or selling?), which seems like a fine time to move away from a free backend.

Also consider that if the average user searches ten times a day (just a guess), Google is committing to shoulder nearly 10,000 times the load of an average user for each researcher, which is not negligible considering the number of researchers out there who would be interested in such a service.

jeremy said...

I haven't had a chance to look at the APIs yet, but I would be curious to know whether you could get more than 10 results at a time as a result of your query. When you are testing query expansion, you want to know if you missed the target document by just a few docs (i.e. it was ranked 14th or 15th, rather than 9th) or if you missed it by hundreds of documents (it was ranked 104th) or if you missed it by thousands of documents (it was ranked 2243rd).

If you can only get 10 docs at a time, and there is a 1 second delay in between requests for the next 10 docs, it will take you forever to determine this sort of information.

And sometimes, you really do want to know that the target doc is 104th and not 2243rd, even if a user would never see it at 104th, because that can help you determine whether or not you are even close to being on the right track.
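jeremy's back-of-envelope concern can be made concrete. A small sketch, assuming (hypothetically) 10 results per page and the mandated one-second gap between page fetches:

```python
import math


def seconds_to_reach_rank(rank, page_size=10, delay=1.0):
    """Throttle-imposed delay to page down to a document at `rank`.

    Assumes `page_size` results per request and `delay` seconds between
    successive page requests; both are assumptions, not documented values.
    """
    pages = math.ceil(rank / page_size)
    return (pages - 1) * delay  # delays occur between pages, not before the first
```

A document at rank 104 needs 11 pages, so about 10 seconds of delays; one at rank 2243 needs 225 pages, about 224 seconds -- nearly four minutes per query just from throttling.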

burtonator said...

Hey Greg.

Good seeing you at lunch today.

BTW. We provide access to Spinn3r for researchers who need access to a blog crawler.

We already have a few researchers using it.