Now, I think I may have found a new favorite. Three search gurus, Chris Manning, Prabhakar Raghavan (head of Yahoo Research), and Hinrich Schutze, just published a wonderful new book, "Introduction to Information Retrieval".
If you work in search or if you are just the kind of person that reads textbooks for fun, this one is a great one. It not only describes how to build a search engine (including crawling, indexing, ranking, classification, and clustering), but also offers the kind of opinionated wisdom you can only get from people who have had substantial experience using these techniques at large scale.
Let me try to pick out some excerpts to show at least a few examples of why I found this book so valuable. First, on the difference between perceived relevance and measured relevance:
The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of the results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned.This is a point that is often missed in information retrieval. It is extremely common to measure the quality of search results by having judges sit down and mark how relevant they think they are. That ignores speed, layout, the quality of nearby results, and the many other factors that go into perceived quality and impact user satisfaction.
One benefit of compression is immediately clear. We need less disk space.The authors go on to say that simple compression techniques might be better for some applications than more complicated algorithms because small additional reductions in data size may not be worth any substantial increase in decompression time.
There are two more subtle benefits of compression. The first is increased use of caching ... With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it ... Increased speed owing to caching -- rather than decreased space requirements -- is often the prime motivator for compression.
The second more subtle advantage of compression is faster transfer data from disk to memory ... We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.
On supporting phrase searches:
If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional posting lists. A combination strategy uses a phrase index, or just a biword index, for certain queries, and uses a positional index for other phrase queries. Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare.Another factor that might be important to consider when picking phrases to put in the phrase index is the caching layer. If [Britney Spears] is almost always cached when queried, it will be satisfied without hitting the posting lists. You probably want to use the frequency of the query hitting the posting lists, not the frequency of the query in the logs, when looking at what to put in a phrase index.
Adding Britney Spears as a phrase index may only give a speedup factor to that query of about three, because most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1,000. Hence, having the latter is more desirable, even if it is a relatively less common query.
On compression, skip lists, and other attempts to speed up access to the index:
Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you're running a search engine with everything in memory, the equation changes again.A nice, long excerpt on Naive Bayes as a classifier and why it works so well:
Naive Bayes is so called because the independence assumptions ... are indeed very naive for a model of natural language.On feature selection:
The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true. The pairs hong and kong or london and english ... are examples of highly dependent terms.
In addition, the multinomial model makes an assumption of positional independence [and] the Bernoulli model ignores positions in documents altogether because it only cares about absence or presence ... How can NB be a good text classifier when its model of natural language is so oversimplified?
The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good ... It does not matter how accurate the estimates are ... NB classifiers estimate badly, but often classify well.
Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features ... and concept drift.
[But,] NB's main strength is its efficiency ... It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) if its robustness to concept drift can be exploited.
[Many] have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data ... However, these differences may be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time ... and there may well be errors in the data (among other problems). Many practitioners have had the experience of being unable to build a fancy classifier for a certain problem that consistently performs better than NB.
Feature selection [picks] a subset of the terms occurring in the training set and [uses] only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient ... this is of particular important for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features ... [that] might produce ... overfitting. It may appear counterintuitive that at first that a seemingly weaker classifier is advantageous ... [but] we will see that weaker models are often preferable when limited training data are available.On the effectiveness of linear classifiers and why using non-linear classifiers is not an obvious win:
Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.
Mutual information and chi-squared represent rather different feature selection methods ... Because its criterion is significance, chi-squared selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy. Despite the differences, the classification accuracy of feature sets selected with chi-squared and MI does not seem to differ systematically ... As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good. Both methods do this.
For some problems, there exists a nonlinear classifier with zero classification error, but no such linear classifier. Does that mean we should always use a nonlinear classifier?Excellent advice on picking which classifier to use:
It is perhaps surprising that so many of the best-known text classification algorithms are linear. Some of these methods, in particular linear SVMs, regularized logistic regression and regularized linear regression, are among the most effective known methods ... Typical classes in text classification are complex and seem unlikely to be modeled well linearly. However, this intuition is misleading for the high dimensional spaces we typically encounter in text applications. With increased dimensionality, the likelihood of linear separability increases rapidly.
If you have fairly little [training] data and you are going to train a supervised classifier, then machine-learning theory says you should stick to a classifier with high bias ... There are theoretical and empirical results that Naive Bayes does well in such circumstances ... A very low bias model like nearest neighbor is probably contraindicated. Regardless, the quality of the model will be adversely affected by the limited training data ... Often, the practical answer is to work our how to get more labeled data as quickly as you can.A nice, intuitive explanation of why support vector machines (SVMs) work well:
If there is a reasonable amount of labeled data, then you are in a perfect position to use ... an SVM. However, if you are deploying a linear classifier such as an SVM, you should probably ... [overlay] a Boolean rule-based classifier over the machine-learning classifier ... If management gets on the phone and wants the classification of a particular document fixed right now, then this is much easier to do by hand-writing a rule than by working out how to adjust the weights of an SVM without destroying overall classification accuracy. This is one reason why ... decision trees, which produce user interpretable Boolean-like models, retain considerable popularity.
If a huge amount of data are available, then the choice of classifier probably has little effect on your results and the best choice may be unclear (cf. Banko and Brill 2001). It may be best to choose a classifier based on the scalability of training or even runtime efficiency.
For two-class, separable training data sets, there are a lot of possible linear separators ... Although some learning methods such as the perceptron algorithm find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point ... Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions ... A slight error in measurement or a slight document variation will not cause a misclassification.On clustering:
[Also,] by construction, an SVM classifier insists on a large margin around the decision boundary. Compared with a decision hyperplane, if you have to place a fat separator between the classes, you have fewer choices of where it can be put. As a result of this, the memory capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data has been increased.
The superlinear training time of traditional SVM algorithms makes them difficult or impossible to use on very large training sets. Alternative traditional SVM solutions algorithms which are linear in the number of training examples scale badly with a large number of features, which is another standard attribute of text problems ... [and it] remains much slower than simply counting terms as is done in the Naive Bayes model.
Nonlinear SVMs ... [are usually] impractical ... [and] it can often be cheaper to materialize the higher order features and train a linear SVM ... The general idea is to map the original feature space to some higher dimensional feature space where the training set is [linearly] separable ... The easy and efficient way of doing this ... [is] the kernel trick ... 90% of the work with kernels uses ... polynomial kernels and radial basis functions ... Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial) ... [For example,] pairs of words ... [require] a quadratic kernel. If triples of words give distinctive information, then we need to use a cubic kernel. A radial basis function allows you to have features that pick out circles (hyperspheres), although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms.
K-means is the most important flat clustering algorithm ... The ideal cluster in K-means is a sphere with the centroid as its center of gravity ... The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space to minimize RSS ... the squared distance of each [item] from its [closest] centroid.And there is much, much more. This is just a small sample of the insights in those pages. Definitely worth a read.
[We easily can get] suboptimal clustering ... from a bad choice of initial seeds ... Effective heuristics for seed selection include (i) excluding outliers ... (ii) trying out multiple starting points ... and (iii) obtaining seeds from ... hierarchical clustering [of] ... a small random sample.
Hierarchical clustering outputs ... a [useful tree] structure ... [and] does not require us to prespecify the number of clusters .... [but is at least] O(N^2) and therefore infeasible for large sets of 1,000,000 or more documents. For such large sets, HAC can only be used in combination with a flat clustering algorithm like K-means ... [which] combines the ... higher reliability of HAC with the efficiency of K-means.
By the way, the references sections at the end of each chapter offer excellent surveys of recent work on each topic. They are worth their weight in gold and should not be missed.
The authors made the full text of their book available online, but I'd recommend picking up a copy for your shelf too. It is too good of a book to just skim through online; it deserves a more thorough read.
Please see also my June 2006 post, "Foundations of Statistical NLP", which reviews a 1999 book by two of the same authors.