"Foundations of Statistical Natural Language Processing" by Chris Manning and Hinrich Schutze is a remarkable survey, not only in breadth, but also in its deep, critical analysis of each technique's strength and weaknesses.
I was particularly excited by the focus on practical approaches with massive data sets.
For example, when discussing clustering, the authors warn of efficiency issues with hierarchical clustering or EM algorithms and say that K-means "should probably be used first ... because its results are often sufficient."
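That advice is easy to act on; the core K-means loop fits in a few lines. Here's a minimal Python sketch (the toy points and the choice of k are invented for illustration, not from the book):

import random

def kmeans(points, k, iters=20):
    # Start from k randomly chosen points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
    return centroids, clusters

points = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 0.8), (8.8, 1.2)]
centroids, clusters = kmeans(points, k=3)
print(centroids)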
On text categorization, they talk about other techniques, but then point out that k nearest neighbor (kNN) is a "simple method that often performs well."
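kNN is similarly compact. Here's a bare-bones sketch of kNN text categorization using cosine similarity over bags of words (the tiny labeled corpus is invented for illustration):

from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, train, k=3):
    vec = Counter(doc.lower().split())
    # Rank training documents by similarity, keep the k nearest, vote by label.
    neighbors = sorted(train,
                       key=lambda ex: cosine(vec, Counter(ex[0].lower().split())),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [
    ("the team won the game in overtime", "sports"),
    ("the striker scored two goals", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("the president vetoed the legislation", "politics"),
]
print(knn_classify("two goals won the game", train, k=3))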
When discussing latent semantic indexing (LSI) and other forms of dimensionality reduction, they mention that pseudo feedback -- query expansion by adding terms from the top results of a search on the original query -- can be cheaper and more effective, depending on your needs.
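As a rough illustration of pseudo feedback, here's a toy sketch that expands a query with frequent terms from the top first-pass results. The corpus, the overlap-count scorer, and the cutoffs are all stand-ins, not a real retrieval system:

from collections import Counter

def first_pass_rank(query_terms, docs):
    # Toy scorer: count query-term occurrences (a real system would use tf-idf or similar).
    return sorted(docs, key=lambda d: sum(d.count(t) for t in query_terms), reverse=True)

def expand_query(query, docs, top_m=2, add_n=2):
    terms = query.lower().split()
    top_docs = first_pass_rank(terms, docs)[:top_m]
    # Pool the words of the top results; add the most common terms not already in the query.
    pool = Counter(w for d in top_docs for w in d if w not in terms)
    return terms + [w for w, _ in pool.most_common(add_n)]

docs = [
    "jaguar car dealer luxury sedan".split(),
    "jaguar sedan price review".split(),
    "jaguar habitat rainforest cat".split(),
]
print(expand_query("jaguar sedan", docs))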
They criticize hidden Markov models (HMMs) because of the large state spaces required to model many real-world problems, but discuss variants that try to mitigate this issue. They then follow up, when talking about part-of-speech tagging, by offering transformation-based (rule-based) approaches as a fast and effective alternative to HMMs.
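To give a flavor of the transformation-based approach, here's a toy sketch in the style of Brill tagging: start with each word's most frequent tag, then apply ordered contextual rewrite rules. The lexicon and rules below are invented for illustration, not taken from the book:

LEXICON = {"the": "DT", "fish": "NN", "swim": "VB", "to": "TO", "run": "NN"}

# Each rule: (from_tag, to_tag, condition on the previous tag).
RULES = [
    ("NN", "VB", lambda prev: prev == "TO"),  # "to run": retag noun as verb after "to"
    ("VB", "NN", lambda prev: prev == "DT"),  # "the swim": retag verb as noun after a determiner
]

def tag(words):
    # Initial state: each word's most frequent tag from the lexicon.
    tags = [LEXICON.get(w, "NN") for w in words]
    # Apply each transformation rule in order, left to right.
    for frm, to, cond in RULES:
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

print(tag("the fish swim to run".split()))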
The book is also full of examples of the ambiguity of language, especially in the sections on disambiguation, parsing, and machine translation. The authors tease you with what looks like a problem with a simple solution, then offer examples that tear apart your naive attempts at cleverness.
Though they focus on very large data sets, Manning and Schütze do not see massive data as the entire solution. At one point, when talking about N-gram models, they say:
One might hope that by collecting much more data that the problem of data sparseness would simply go away ... In practice it is never a general solution to the problem. While there are a limited number of frequent events in language, there is a seemingly never ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail.

Despite the huge amount of text out there, not everything that can be said already has been said.
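The tail shows up even in toy data. Here's a quick sketch (the two tiny "corpora" are invented stand-ins for real text) counting how many bigrams in held-out text never appeared in training:

def bigrams(text):
    words = text.lower().split()
    return list(zip(words, words[1:]))

train_text = "the cat sat on the mat and the dog sat on the rug"
held_out = "the cat sat on the sofa and the dog ran to the mat"

# Count held-out bigrams that were never observed in training.
seen = set(bigrams(train_text))
unseen = [b for b in bigrams(held_out) if b not in seen]
print(f"{len(unseen)} of {len(bigrams(held_out))} held-out bigrams are unseen")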
A great book. I only regret that I took so long to read it.
Update: Some good discussion in the comments for this post. Don't miss the link to Peter Norvig's book reviews.