Thursday, August 23, 2007

Marissa Mayer at SES 2007

Elinor Mills at CNet summarizes Googler Marissa Mayer's keynote speech at SES 2007.

Some excerpts from the article, focusing on personalization:
For general Web search, personalization is the future, Mayer said.

Ten to 15 years from now search sites will understand more about searchers, where they are located and what their personal preferences are, she predicted.

Mayer said one of the most important data points for improving search relevance based on personalization is the previous query, although Web history and address books could also be helpful "signals" to the search engine.

It is important that the ads are personalized too, she said ... "My philosophy is that the ads and the search results should match ... For me, search and ads are almost the same."
See also detailed notes on Marissa's keynote from Tamar Weinberg.

For more on personalized search, you might be interested in at least a couple of my posts on that topic, including "Personalized Search Primer" and "Effectiveness of personalized search".

For more on personalized advertising, please also see my posts "What to advertise when there is no commercial intent?" and "Is personalized advertising evil?"

Wednesday, August 22, 2007

Collective search versus personalization

Eric Auchard at Reuters reports on Ask CEO Jim Lanzone's keynote at the SES 2007 conference.

Jim argued that personalization doesn't work and then tried to contrast it with something he called "collective search". Some excerpts: ... aims to tap the collective search habits of its 50 million users to improve the relevancy of Web search.

[Jim Lanzone] said attempts at automated personalization often fail in practice to give users what they want.

Instead, Web search can be improved by understanding the aggregate behavior of different types of users.

This collective approach means users stand to benefit from what users with similar interests have gleaned from previous searches.

"Collective search is something that Ask really believes in," Lanzone said, adding that personalizing what different users see is only a small piece of further improving search.
If I could quote from my favorite movie, "You keep using that word. I do not think it means what you think it means."

I admit the term personalization may be poorly defined these days, but it is hard for me to see the distinction between personalized search and changing the search results based on "what users with similar interests have gleaned from previous searches."

In fact, I would think that is the very definition of personalization. Personalization changes what people see based on their past behavior and the past behavior of others.

Perhaps the distinction here is that collective search may change results even for people who have no history? For example, popular search results may get a higher ranking or search results that appear to be related after analyzing what people click on may be handled differently?

Yet even that is often still referred to as personalization. For example, one of Amazon.com's most successful and useful personalization features is similarities ("Customers who bought X also bought"). That feature is targeted to a specific page, not to a user's history, but it is still personalization.
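Under the hood, that similarities feature is at heart co-occurrence counting over purchase histories. A toy sketch (the data and names are invented for illustration, not Amazon's actual algorithm):

```python
from collections import defaultdict

def also_bought(orders):
    """orders: list of per-customer sets of item ids.
    Returns, for each item, the other items ranked by how often
    customers bought both."""
    co = defaultdict(lambda: defaultdict(int))
    for basket in orders:
        for x in basket:
            for y in basket:
                if x != y:
                    co[x][y] += 1
    return {x: sorted(ys, key=ys.get, reverse=True) for x, ys in co.items()}

orders = [{"book", "lamp"}, {"book", "lamp"}, {"book", "pen"}]
sims = also_bought(orders)  # sims["book"] ranks "lamp" above "pen"
```

Note that nothing here depends on the individual user's history; the recommendations are driven entirely by the aggregate behavior of other customers, which is exactly why the distinction from "collective search" is hard to see.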

Am I missing something more here? Is Jim suggesting collective search includes something that personalization does not? For example, does collective search include explicit sharing of search results across a social network (like Yahoo's struggling MyWeb)? Or something else?

For more details on Jim's interview, see also Tamar Weinberg's notes over at Search Engine Roundtable.

Monday, August 20, 2007

Personalization session at SES 2007

Barry Schwartz at Search Engine Roundtable posts notes from the SES 2007 session on "Personalization, User Data & Search".

Let me highlight this tidbit on the results of an eye tracking study:
The personalized [search results] ... doubled the performance (click throughs).
That reminds me of what Googler Marissa Mayer once said:
[Personalization is] one of the biggest relevance advances in the past few years.

Personalization doesn't affect all results, but when it does it makes results dramatically better.
Update: More details on the eyetracking study.

Update: Even more details on the eyetracking study.

Friday, August 17, 2007

Interviews on search in 2010

Gord Hotchkiss at Search Engine Land compiled an impressive group of interviews for his article, "Search In The Year 2010". He talked to usability guru Jakob Nielsen, Googler Marissa Mayer, Larry Cornett from Yahoo, Justin Osmer from Microsoft, Michael Ferguson from Ask, and search industry experts Chris Sherman, Greg Sterling, and Danny Sullivan.

From the introduction:
It was with a great deal of anticipation that I threw in front of them the same question: what will the search results page look like in 2010?

Here, aggregated and condensed, are their answers.
Go read the whole thing, but, as usual, I am going to focus here on the part on personalized search.

Chris Sherman, Jakob Nielsen, and Greg Sterling are skeptical about personalization. Personalization "is incredibly hard to do" because "language is so inherently ambiguous" and "you have to guess", not to mention "the so called creep factor."

Danny Sullivan, Justin Osmer, Michael Ferguson, and Larry Cornett are more optimistic. "We're getting close to a tipping point on personalization" where little or no effort ("a very low investment") yields "a lot of return" because searchers will "get a lot more out of [the] search experience" if the search engine knows more about them. Searchers' "clicks and their footsteps will walk to the experience that is most delightful and easy for them to use," though we should be careful not to "ask the users to do work." "Google is onto something with their personalized search results," and "people are misunderstanding how sophisticated it can be."

Oddly, Marissa Mayer did not say much on personalized search this time around. In the past, she has said ([1] [2] [3]) that "[personalization is] one of the biggest relevance advances in the past few years", "personalized search is something that holds a lot of promise", and personalization is key for building "the search engine of the future."

Similar to what Esther Dyson said in her interview with Charlie Rose, Chris Sherman agrees with Gord Hotchkiss that "Google is holding a significant portion ... of their personalization algorithm in reserve" because there is "caution" that they might "alienate the searcher." Chris goes on to say, "They probably have tons of stuff that they're not showing us."

On personalization being incredibly hard to do, please also see my March 2005 post, "Personalization is hard. So what?"

Esther Dyson on personalization and Google

Esther Dyson has an interview on the Charlie Rose show that covered several topics, including health, space travel, search, personalization, and social networking.

At 34:22, she talks about personalization, personalized search, and behavioral targeting. Some quotes:
The big issue for Google is going to be personalization.

You sort of see them dancing around this issue of, well, I could do much better search for you if I knew [more about you].

They are very, very concerned about the privacy issue. They're terrified that this is going to be a problem for them.

It's pretty clear Google wants someone else to go first with personalized search to get people to be more relaxed.
Esther goes on to say that she expects smaller companies to take the lead on personalized search and advertising. The implication appears to be that, once these startups nicely warm up the public on personalization, Google will launch all the personalization features they have been holding back, using their big data and big clusters to dominate the field.

The rest of her interview is interesting, especially if you have a strong interest in health, but also just for the tidbits on other technology companies. For example, at 33:38, after a discussion of social networking, she says, "Yahoo should have become Facebook."

See also my May 2007 post, "Esther Dyson on the future of search".

[Charlie Rose talk found via Adario Strange]

Image search to solve hard problems

Alyosha Efros from CMU gave a fun Google Tech Talk, "Using Data to 'Brute Force' Hard Problems in Vision and Graphics", with some clever ideas on using large image databases to solve hard problems.

The examples I enjoyed the most started about 20 minutes into the talk. The first, object insertion, looks at the problem of trying to add people or objects to a picture.

Rather than attempting to model the object to be inserted and then adjust the perspective, scale, and lighting, Alyosha suggests we change the problem to finding an appropriate object that already has the right perspective, scale, and lighting. That is, rather than take a specific picture of a man and try to adjust it, search a massive image database to find some picture of a man that is already properly adjusted.

This only works if you have a massive image database and good odds of finding a strong match, but we do now have massive image databases. These databases are only getting bigger with time, improving our odds of finding a good match and making this brute force approach appear even more promising.

Another example discussed in the talk, scene insertion, is a similar problem to object insertion, but at a larger scale. We are no longer just adding objects, but taking out whole chunks of pictures (e.g. construction equipment, roads, buildings, etc.) and creating a new picture by filling in the deleted data. In tools now available, this is done using texture fills, but that works poorly for large deletions.

Alyosha proposes a way of attacking the scene insertion problem where they search a large database for similar images. If they are trying to replace part of a picture that is mostly a beach scene, for example, they search the massive database to find similar beach scenes, then steal the missing chunk from those related scenes.
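The core of this brute force idea is plain nearest-neighbor search over some compact scene descriptor. A minimal sketch, assuming descriptors are short vectors (the talk describes its own, much richer descriptors):

```python
def best_match(query, database):
    """Brute-force nearest neighbor: return the index of the database
    descriptor closest (by squared Euclidean distance) to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(database)), key=lambda i: dist(query, database[i]))

# Tiny 2-d "descriptors" for illustration; real ones would summarize
# color, texture, and rough spatial layout of each scene.
scenes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
```

The algorithm could hardly be simpler; what makes it work is the scale of the database, which is exactly Alyosha's point.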

The talk is enjoyable and light with plenty of pictures of both good and bad examples of what happens when you try to apply this technique. Well worth watching.

See also my Oct 2006 post, "The advantages of big data and big clusters", where I quoted Googler Peter Norvig as saying, "Worry about the data first before you worry about the algorithm."

Friday, August 10, 2007

Effectiveness of personalized search

Zhicheng Dou, Ruihua Song, and Ji-Rong Wen from Microsoft Research had a great paper at WWW 2007, "A Large-scale Evaluation and Analysis of Personalized Search Strategies" (PDF).

There are three major conclusions in the paper: (1) Personalization only helps on some queries. (2) Both long-term and short-term history are important for personalization. (3) Profile-based techniques do not work as well as more fine-grained, click-based techniques.

On how personalization helps only on some queries, the key concept here is "click entropy", the amount of variation in the search results searchers click on.

For queries that are not ambiguous, the top result already may be ideal. If almost everyone clicks on that result, that query's click entropy is low. At least for these queries, there is little opportunity to improve the ranking using personalization (or, for that matter, using any other technique).

Thus, the authors conclude, "Personalized search has different effectiveness on different queries," and, "Click entropy can be used as a simple measurement on whether the query should be personalized."
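Click entropy is just the entropy of the distribution of clicks over results for a query. A small sketch of how it might be computed from a click log (the names and data here are mine, not the paper's):

```python
import math
from collections import Counter

def click_entropy(clicks):
    """clicks: list of result URLs clicked for one query, across users.
    Low entropy means nearly everyone clicks the same result."""
    counts = Counter(clicks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Navigational query: nearly everyone clicks the same result.
low = click_entropy(["a.com"] * 98 + ["b.com"] * 2)
# Ambiguous query: clicks spread evenly over four results.
high = click_entropy(["a.com", "b.com", "c.com", "d.com"] * 25)
```

Under the paper's heuristic, only queries like the second, with high click entropy, are worth personalizing.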

On long and short-term history, the question here is whether personalization should focus on what you are doing right now (the last few searches) or your general interests (everything you have ever done).

The authors conclude that "the incorporation of long-term interest and short-term context can gain better performance than solely using either of them."
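A toy illustration of a click-based re-ranker that blends long- and short-term history, in the spirit of the paper's conclusion (the lam weight and all the names are my own, not the paper's):

```python
def rerank(results, long_clicks, short_clicks, lam=0.7):
    """Boost results the user clicked before, weighing short-term
    (current session) history more heavily than long-term history.
    long_clicks/short_clicks: dicts mapping URL -> click count."""
    def boost(url):
        return lam * short_clicks.get(url, 0) + (1 - lam) * long_clicks.get(url, 0)
    # Python's sort is stable, so unclicked results keep their
    # original order while clicked results float upward.
    return sorted(results, key=boost, reverse=True)
```

A user with no history gets the unmodified ranking, which is one reason click-based techniques tend to be safer than profile-based ones.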

On profile-based versus click-based techniques, profile-based techniques record the high-level categories of interest for each searcher while click-based techniques modify the ranking of past and related clickthroughs.

The authors conclude that click-based techniques "work well", while profile-based techniques "improve the search accuracy on some queries, but they also harm many queries" and are "not as stable as click-based" techniques.

These conclusions are not far from what I advocate on this blog. I pick on Google's personalized search for primarily being a profile-based technique focused on long-term history ([1] [2] [3]) and argue for a fine-grained, click-based approach. This MSR paper judges click-based techniques more effective, but also suggests that combining click-based and profile-based algorithms and using both long and short-term history may yield the best results.

See also my Sept 2006 post, "Potential of web search personalization", where I talk about a KDD 2006 paper out of Yahoo Research that comes to some of the same conclusions as this MSR paper.

Sep Kamvar on Read/WriteWeb

Richard MacManus at Read/WriteWeb has a short interview with Google personalization guru Sep Kamvar.

Wednesday, August 08, 2007

Using Wikipedia to disambiguate names

Silviu Cucerzan at Microsoft Research recently published a paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data" (PDF), that is a great example of using the high-quality data available in Wikipedia to solve a difficult search problem, in this case, distinguishing between different meanings of the same name.

From the paper:
When ... scaling entity tracking to ... the Web, resolving semantic ambiguity becomes of central importance, as many surface forms turn out to be ambiguous.

For example, the surface form "Texas" is used to refer to more than twenty different named entities in Wikipedia.

In the context "former Texas quarterback James Street", Texas refers to the University of Texas at Austin; in the context "in 2000, Texas released a greatest hits album", Texas refers to the British pop band; in the context "Texas borders Oklahoma on the north", it refers to the U.S. state; while in the context "the characters in Texas include both real and fictional explorers", [it] ... refers to ... [a] novel.
Silviu cleverly uses the high-quality, semi-structured data available from Wikipedia for this task. In addition to pages describing different entities where contextual clues can be extracted (example), Wikipedia contains redirects for different surface forms of the same entity (example), list pages that categorize names (example), and disambiguation pages that show many of the different entities for a surface form (example).

Wikipedia contains much more than unstructured text. Exploiting the semi-structured data -- the redirect, list, and disambiguation pages -- gives this work its power.
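As a gross simplification of what such a disambiguator does (Silviu's actual method is considerably more sophisticated), imagine scoring each candidate entity by the overlap between words mined from its Wikipedia page and the words around the mention:

```python
def disambiguate(surface_form, context_words, candidates):
    """candidates: entity name -> set of context words mined from that
    entity's Wikipedia page. Pick the entity whose page context
    overlaps most with the words surrounding the mention."""
    context = set(w.lower() for w in context_words)
    return max(candidates, key=lambda e: len(candidates[e] & context))

# Invented context sets for two of the "Texas" entities:
candidates = {
    "Texas (U.S. state)": {"borders", "oklahoma", "austin", "quarterback"},
    "Texas (band)": {"album", "released", "hits", "pop", "single"},
}
entity = disambiguate("Texas",
                      "in 2000 Texas released a greatest hits album".split(),
                      candidates)
```

The clever part of the paper is not the scoring, which is standard, but where the candidate entities and their context sets come from: the redirect, list, and disambiguation pages.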

For a quick overview on one of many ways this kind of named entity data could be applied, do not miss the screenshot in Figure 3 on page 6 of the paper. It shows a prototype that annotates a web page with pop-ups for the proper names to disambiguate the meaning.

As fun as this paper is, what really excites me is that this is one of many recent research projects that are cleverly using Wikipedia to attack challenging problems. There is little doubt that, deep in the Wikipedia pages, there is much buried treasure, if we can just figure out how to look.

Ambient Findability talk

Peter Morville gave a Google Tech Talk, "Ambient Findability and The Future of Search", that is a fun, light discussion about helping people find the information they need.

The slides from the talk are available (PDF), but they make little sense without Peter talking behind them. If you are in a rush and just want to skim the slides, make sure to see slide 23 with the definition of ambient findability and slide 27 on information overload.

There are tidbits on personalized search (which Peter seems skeptical about) and "keeping found things found" (aka re-finding) in the Q&A starting at 55:42 in the video.

Tuesday, August 07, 2007

Read/WriteWeb: Personalized Search Primer

I have a guest article on Richard MacManus' Read/WriteWeb, "Personalized Search Primer - And Google's Approach".

From the introduction:
Google has received much attention, not all of it positive, for its efforts to personalize search.

In this article, I will briefly describe personalized search, why Google and other search engines are trying to do personalized search, the approach Google is taking toward personalized search, and other approaches to personalized search.
Go read the whole article.

Monday, August 06, 2007

Google Search API for research

The Google Search API -- which, after deteriorating badly, was shut down about a year ago -- has been reborn in a version only available to researchers, the University Research Program for Google Search.

The promise of the API looks attractive at first. Google says:
The University Research Program for Google Search is designed to give university faculty and their research teams high-volume programmatic access to Google Search.

Our aim is to help bootstrap web research ... [Using] Google's search technology, you'll no longer have to operate your own crawl and indexing systems.
However, the limits on how the API can be used may be a problem. From the documentation:
Requests to the service MUST be throttled ... A time period of at least one second must be allowed between requests.
This makes some interesting types of research impossible with this API: anything that needs to fire off many queries quickly.

For example, let's say I am working on a technique for query expansion, so I want results not only for the search given, but also for tens of other related searches, which I will then combine. With a one second delay between queries, my research prototype will take tens of seconds to respond, making it no longer interactive.

Nor can I try out some natural language analysis for question answering where I first get the results for the search given, then look at the results, then fire off dozens of additional queries to learn more about what I found in those results.

I cannot even use the conditional probabilities of finding two words together versus finding them apart on the Web as part of my analysis, since each word pair requires multiple queries to the search engine and many pairs might be needed.
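To make that concrete, here is the kind of word-pair statistic I mean, pointwise mutual information estimated from hit counts (the numbers below are hypothetical):

```python
import math

def pmi(hits_xy, hits_x, hits_y, total_pages):
    """Pointwise mutual information of two words, estimated from web
    hit counts: positive when the words co-occur more often than
    chance, negative when they repel each other."""
    p_xy = hits_xy / total_pages
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    return math.log2(p_xy / (p_x * p_y))

# Scoring each word pair takes three queries ("x y", "x", "y"), so
# scoring 1,000 pairs costs ~3,000 queries: nearly an hour at one
# query per second under the throttle.
```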

It is good that Google is making tools available to researchers, but they may have to go further than a throttled search API. As is, many researchers trying to work at large scale still will have to build their own crawls and indexes.

By the way, it is not entirely fair to pick on Google's search API here. Other search APIs -- including the Yahoo Search API, the Microsoft Live Search API, and the Alexa Web Search Platform -- either have a fee or similar throttling.

Friday, August 03, 2007

Google teasing too many lions?

Robert Cringely asks, "Is Google on crack?"

Inflammatory title, but Bob's point is that, while the telcos may be lumbering giants, taking on these powerful companies might just end with Google squashed underfoot:
Google made an unexpected reckless move in the wireless bandwidth market .... Google wants ... [to turn] what would have been yet another mobile phone system into a mobile Internet.

They don't know who they are messing with ... The wireless incumbents ... are mean and spiteful companies and WILL HAVE THEIR REVENGE.

The wireless carriers will spend whatever it takes to win ... because they don't want to change operational rules that have been very profitable for them over the years.

I'm all for tilting at windmills ... but Google has a lot at risk here and I think they are being foolish, even stupid.

What if Verizon, and AT&T, and Comcast, and half a dozen other huge broadband ISPs suddenly cut deals with some search company other than Google and your ISP-supplied browser and homepage no longer give such prominence to Google?

Yahoo ... fully supports Google's bold move, but you notice they didn't make it. Microsoft has been totally silent. Certainly Microsoft smells blood in the water and will be approaching all the outfits Google may have offended, trying to do exclusive search and ad deals with them.
Google seems to be biting off a lot these days. In addition to making Microsoft "hell-bent" to kill them, creating a spat with eBay, frightening the news media, angering the big movie and TV studios, and fighting Yahoo and the other search engines, now Google is threatening the telcos.

Google is teasing too many lions. These media and telco companies are massive. They are not beyond using market power to crush those they do not like.

It does not matter if Google's products are technically superior. It does not matter if Google's goals are noble. If all of these powers align against Google, Google will not survive.

Update: Apparently, Google likes to tease lions in multiple ways. The NYT writes, "Some believe another major goal of [Google's] phone project is to loosen the control of carriers over the software and services that are available on their networks."

Thursday, August 02, 2007

Self-optimizing ad systems

After talking about some of Google's early steps toward personalized advertising, Philipp Lenssen goes a little visionary, writing:
I wonder, with that massive amount of ads + searches Google has, if there’s some merit in allowing the software to figure it out for itself... evolutionary algorithms, self-learning style.

Search sessions are automatically grouped into general patterns, and then random ads are presented, and when an ad performs well, more ads from that ad segment will be displayed next time, and so on, causing a "survival of the fittest ad" environment.

Then when Google meets the press in 2012, they can tell the journalists, "We don’t have a clue anymore how our ads work, but click-throughs are higher than ever."
AdWords and AdSense already self-optimize depending on ad clickthrough rates, but Philipp is talking about something more here.

Rather than have ad systems target to specific keywords, let's make the ad systems target to micro-groups of intent, where intent is determined both by current actions and past behavior.

Rather than specify exactly who to target an ad to using keywords, the keywords merely would be hints to the ad engine, a starting place for a likely target audience. Ads submitted enter a great pool of experimentation where ads are shown to different audiences with different behavior, culled where they fail, reinforced where they succeed.
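In today's terms, that experimentation pool is a multi-armed bandit. A minimal epsilon-greedy sketch of the idea (my own illustration, not how AdWords actually works):

```python
import random

def pick_ad(stats, epsilon=0.1):
    """stats: ad id -> [clicks, impressions]. Usually show the ad
    with the best observed clickthrough rate, but with probability
    epsilon explore a random ad so new ads can prove themselves."""
    if random.random() < epsilon:
        return random.choice(list(stats))
    return max(stats, key=lambda ad: stats[ad][0] / max(stats[ad][1], 1))

def record(stats, ad, clicked):
    """Feed each impression back in; successful ads win more slots."""
    stats[ad][1] += 1
    stats[ad][0] += int(clicked)
```

In a real system, the stats would be kept per micro-audience rather than globally, so the same ad can win with one group of searchers and be culled for another.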

As much as I like this idea, ad systems appear to be headed in the opposite direction, with offerings from Microsoft and Yahoo touting the additional controls -- the knobs and levers -- they give to advertisers.

It is not surprising that advertisers want control over their ads, but, in the long-term, ad systems are most effective when they serve two audiences, consumers and advertisers. Consumers pay attention to useful and helpful ads. Advertisers want effective ads that consumers pay attention to and use.

Relevant ads are useful, helpful, and effective. While giving up control may be hard for advertisers, it is inevitable that they will have to do so. It is impossible for advertisers to manually tune their ads to millions of micro-audiences with subtle variations of intent. Only an automated solution can deliver that.

Eventually, we will see organic, self-organizing advertising systems. Advertisers merely will give their ads some guidance as they nudge them out the door. Then, they will sit back and watch as the ads find the audiences they seek.

Update: If you might enjoy a version of this for search relevance instead of ad relevance, you might also be interested in my earlier post, "The perils of tweaking Google by hand".

Stanford Summit panel on discovery

Dylan Tweney at Wired has some good notes on yesterday's panel on discovery, recommendations, and personalization at the Stanford Summit 2007.

Wednesday, August 01, 2007

Craig Mundie on AI and personalization

Microsoft Chief Research Officer Craig Mundie offered a few tidbits on artificial intelligence and personalization in the recent Microsoft Analyst Meeting.

Some excerpts:
Think if the computer was really much more personalized in terms of what it did for you.

It will become more humanistic - your ability to interact with the machine more the way you would interact with other people will accrue from this big increase in power.

It ... [will] adapt more to the environment and your needs and the things that are going on around you ... The way in which you will be able to interact with it will be significantly changed.

A computer and its software can move today from a tool to ... a great assistant.

[Assistants] think. They learn about you. They understand what you value. They understand what's important. They make decisions ... They speculate about what might be interesting.

One of the things I dream about personally is being able to move to where the computer is also able to speculate, to do things on the anticipation that it might turn out to be useful for you.

We've done this at the level of speculative execution ... but only for the purposes of trying to make the machine go faster ... Can ... software that is wildly more complex and sophisticated ... but well suited to this class of machine that will emerge in the next 5 to 10 years ... make [machines] qualitatively different and better ... [and] make the machine something that really borders on being your [assistant]?

If the machine actually moves to ... anticipate things and to attempt things on your behalf, then, they would be qualitatively different and more valuable, and I think we will see the [computing] revolution begin again.
An ambitious goal, to be sure, and one that may be well more than 5-10 years out. As I said earlier, this vision requires not only understanding intent and being able to reason about complicated plans, but also having a rich understanding of information acquired and dealing with uncertainty in information, actions, and goals.

While it is a problem we are a long way from solving, that only makes it that much more interesting to try. It certainly is true that even baby steps toward this goal could improve our ability to access and process information, drive increases in productivity, and spark a new computing revolution.

On how close Microsoft Research is to this goal, you might be interested in an April 2006 post where I talk about several Microsoft Research projects and then say that, if those projects can be combined and refined:
Microsoft may be able to build a task-focused, advanced user interface that organizes your information, pays attention to what you are doing to help you find what you need, and surfaces additional information and alerts only when it is important and relevant.
Nevertheless, that kind of assistant is only a first step toward the near AI-Complete vision Craig is describing. Yet, ambitious as it is, Larry Page said something similar a few months back, describing being able to answer any question about anything as the long-term goal for search.

On Craig's words about speculative execution, if you want a trip off into la-la land, you might be interested in my ramblings on what a future with massively multi-core CPUs on our desktop might look like in my past post, "The rise of wildly speculative execution".