Monday, October 29, 2007

Google Tech Talk on similarities

Yury Lifshits gave a Google Tech talk, "Similarity Search: A Web Perspective", surveying various algorithms for finding similar items. Slides (PDF) from the talk are available.

In the talk, Yury mentions his tutorial, "Algorithms for Nearest Neighbor Search", which should be of interest to those who want to dive in deeper.

6 comments:

Yury L said...

Thanks for promo :-)

Yury
http://yury.name

jeremy said...

Yury, I wonder if you have any thoughts about the relative benefits of a better algorithm inside a given metric space, versus a better metric space.

That is, how much are similar searches improved by better algorithms, and how much are they improved by better features? My intuition is that the best algorithm, with poor features, will still be a poor performer. Any thoughts on this tradeoff?

Yury L said...

Jeremy, thanks for your question!

- my intuition says "yes, right metric si more important than right algorithm"

But:
- Choosing a metric mostly affects RESULT, i.e. who is announced to be nearest neighbor

- Algorithm affects only SPEED and SPACE.

Thus usually you first choose the metric that reflects your understanding of "meaningful results" and then start looking for some fast algorithm for that particular model

Greg Linden said...

Hi, Yuri. It is great to see this comment from you here that people should look for a fast algorithm for their problem.

One concern I had about your talk and tutorial was the lack of emphasis on doing similarity search at large scale. Some algorithms are much easier to scale to big data than others. And, as Peter Norvig has said repeatedly, using more data often seems to yield larger improvements than switching algorithms.

You did have a plea to Google to release some of their big data at the end of your talk. I agree that a major problem is that researchers simply do not have access to the big data available inside of Google and other companies.

In any case, I much enjoyed your talk and tutorial. Thanks for making it available.

jeremy said...

Yuri, I understand the distinction you are making. And I mostly buy it. But.. isn't the algorithm also sometimes responsible not only for speed and efficiency, but for constructing the meaningfulness of the results, based on the underlying metric space?

Let me give a quick example: Think support vector machines for classification. You have your underlying feature set, your underlying data. And that data is laid out in some metric space. But then the SVM comes along, and, through the kernel trick, remaps that data into a different, more semantically meaningful metric space. And then it draws the linear separator in that new metric space.

The algorithm side of the SVM does both the linear separating, as well as the metric remapping. It handles both the efficiency (speed) as well as the effectiveness (meaningfulness) of the results.

So again, if an algorithm can do both speed and meaningfulness, what is the tradeoff? Would users rather have better results in 0.75 seconds more time, or worse results in 0.75 seconds less time?

yury-lifshits said...

Hi Greg! I'll be interviewed by Live Labs this Friday. Hope to see you there :-)


Late answer to Jeremy:
ok, there is a research direction that address you question: it is called

Low distortion embeddings

The idea is to map data into a new space making picture "simpler" in some sense and yet consistent with original data. Google it!