Some extended excerpts:
At present, the most common method of accessing music is through textual metadata .... [such as] artist, album ... track title ... mood ... genre ... [and] style .... but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for.
The paper goes into much detail on these topics as well as covering other areas such as chord and key recognition, chorus detection, aligning melody and lyrics (for Karaoke), approximate string matching techniques for symbolic music data (such as matching noisy melody scores), and difficulties such as polyphonic music or scaling to massive music databases. There is also a nice pointer to publicly available tools for playing with these techniques if you are so inclined.
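To make "approximate string matching" on symbolic melodies a little more concrete, here is a minimal sketch of one common idea: compare melodies as sequences of pitch intervals (so transposition does not matter) using plain edit distance. This is my own illustration, not the paper's method; the function names and the choice of MIDI note numbers as input are assumptions.

```python
def intervals(pitches):
    """Convert MIDI note numbers to successive pitch intervals (transposition-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def melody_distance(query, candidate):
    """Hypothetical helper: edit distance between the interval sequences of two melodies."""
    return edit_distance(intervals(query), intervals(candidate))


if __name__ == "__main__":
    reference = [60, 62, 64, 65, 67]
    transposed = [p + 5 for p in reference]   # same tune, different key
    noisy = [60, 62, 63, 65, 67]              # one wrong note
    print(melody_distance(reference, transposed))  # 0
    print(melody_distance(reference, noisy))       # 2
```

Note that a single wrong note changes two intervals, which is why the noisy query scores 2 rather than 1. Real systems would likely also weight errors by interval size and account for rhythm.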
For example ... Shazam ... can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar ... Nayio ... allows one to sing a query and attempts to identify the work .... [In] Musicream ... icons representing pieces flow one after another ... [and] by dragging a disc in the flow, the user can easily pick out other similar pieces .... MusicRainbow ... [determines] similarity between artists ... computed from the audio-based similarity between music pieces ... [and] the artists are then summarized with word labels extracted from web pages related to the artists .... SoundBite ... uses a structural segmentation [of music tracks] to generate representative thumbnails for [recommendations] and search.
An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of music .... Surprisingly, it is not only difficult to extract melody from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony .... [Instead] low-level audio features and their aggregate representations [often] are used as the first stage ... to obtain a high-level representation of music.
Low-level audio features [include] frame-based segmentations (periodic sampling at 10ms - 1000ms intervals), beat-synchronous segmentations (features aligned to musical beat boundaries), and statistical measures that construct probability distributions out of features (bag of features models).
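As a rough sketch of what frame-based segmentation and a bag-of-features summary might look like, the snippet below cuts a mono signal into fixed-length frames, computes two simple features per frame (RMS energy and zero-crossing rate), and then summarizes a track by the mean and variance of those features. The particular features, the 25 ms frame and 10 ms hop, and all function names are my assumptions for illustration; the paper discusses richer features.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a mono signal into overlapping fixed-length frames."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    hop_len = max(1, int(sample_rate * hop_ms / 1000.0))
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


def frame_features(frame):
    """Return (RMS energy, zero-crossing rate) for one frame."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1) if len(frame) > 1 else 0.0
    return (rms, zcr)


def bag_of_features(feature_vectors):
    """Summarize frames by per-dimension mean and variance, ignoring frame order."""
    n = len(feature_vectors)
    dims = len(feature_vectors[0])
    means = [sum(v[d] for v in feature_vectors) / n for d in range(dims)]
    variances = [sum((v[d] - means[d]) ** 2 for v in feature_vectors) / n
                 for d in range(dims)]
    return means, variances


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]  # 1 s of A440
    feats = [frame_features(f) for f in frame_signal(tone, sr)]
    print(len(feats), bag_of_features(feats))
```

Because the summary throws away frame order, two tracks with the same overall texture but different structure look identical to it, which is the usual trade-off of bag-of-features models.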
Estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter ... [lets us] find musical pieces having similar tempo without using any metadata .... The basic approach ... is to detect onset times and use them as cues ... [and] maintain multiple hypotheses ... [in] ambiguous situations.
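The "detect onsets, then keep several tempo hypotheses" idea can be sketched in a few lines. The version below is a simplification of my own, not the paper's algorithm: onsets are approximated by frame-to-frame increases in energy, and candidate tempi are scored by autocorrelating that onset envelope. The `click_track` helper, the 60 to 180 BPM search range, and the hop size are all assumptions.

```python
def click_track(bpm, seconds, sample_rate, click_len=40):
    """Hypothetical test signal: a short unit-amplitude burst on every beat."""
    samples = [0.0] * int(seconds * sample_rate)
    period = int(round(sample_rate * 60.0 / bpm))
    for start in range(0, len(samples), period):
        for i in range(start, min(start + click_len, len(samples))):
            samples[i] = 1.0
    return samples


def onset_envelope(samples, hop=80):
    """Half-wave rectified increase in energy between consecutive frames."""
    energies = [sum(x * x for x in samples[i:i + hop])
                for i in range(0, len(samples) - hop + 1, hop)]
    return [max(0.0, b - a) for a, b in zip(energies, energies[1:])]


def tempo_hypotheses(envelope, frames_per_second, min_bpm=60, max_bpm=180, top_k=3):
    """Return up to top_k (bpm, score) pairs, best first, from envelope autocorrelation."""
    min_lag = max(1, int(round(frames_per_second * 60.0 / max_bpm)))
    max_lag = int(round(frames_per_second * 60.0 / min_bpm))
    scored = []
    for lag in range(min_lag, min(max_lag, len(envelope) - 1) + 1):
        score = sum(envelope[i] * envelope[i + lag]
                    for i in range(len(envelope) - lag))
        if score > 0.0:
            scored.append((score, 60.0 * frames_per_second / lag))
    scored.sort(reverse=True)
    return [(bpm, score) for score, bpm in scored[:top_k]]


if __name__ == "__main__":
    sr, hop = 8000, 80
    env = onset_envelope(click_track(120, 8, sr), hop)
    for bpm, score in tempo_hypotheses(env, sr / hop):
        print(bpm, score)  # 120.0 ranks first, 60.0 second
```

Even on a perfectly clean click track the half-tempo reading (60 BPM) shows up as a strong second hypothesis, which gives some feel for why keeping multiple hypotheses is useful in ambiguous situations.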
Melody forms the core of Western music and is a strong indicator for the identity of a musical piece ... Estimated melody ... [allows] retrieval based on similar singing voice timbres ... classification based on melodic similarities ... and query by humming .... Melody and bass lines are represented as a continuous temporal-trajectory representation of fundamental frequency (F0, perceived as pitch) or a series of musical notes .... [for] the most predominant harmonic structure ... within an intentionally limited frequency range.
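For a feel of what estimating F0 "within an intentionally limited frequency range" involves, here is a toy single-frame pitch estimator. It uses plain autocorrelation, which is a much simpler technique than the predominant-F0 estimation the paper describes for polyphonic audio, so treat it as an assumption-laden sketch that only works on clean monophonic input. The 80 to 800 Hz range and the `peak_ratio` heuristic (prefer the shortest lag whose peak is nearly as strong as the best one, to reduce octave errors) are my choices.

```python
import math


def estimate_f0(frame, sample_rate, fmin=80.0, fmax=800.0, peak_ratio=0.9):
    """Estimate the fundamental frequency of one frame, or None if no pitch is found."""
    min_lag = max(1, int(sample_rate / fmax))            # shortest period searched
    max_lag = min(len(frame) - 2, int(sample_rate / fmin))  # longest period searched
    if max_lag <= min_lag:
        return None
    scores = {}
    for lag in range(min_lag, max_lag + 1):
        n = len(frame) - lag
        scores[lag] = sum(frame[i] * frame[i + lag] for i in range(n)) / n
    best = max(scores.values())
    if best <= 0.0:
        return None  # silent or unpitched frame
    # Prefer the shortest lag that is a local peak close to the best score.
    for lag in range(min_lag + 1, max_lag):
        if (scores[lag] >= peak_ratio * best
                and scores[lag] >= scores[lag - 1]
                and scores[lag] >= scores[lag + 1]):
            return sample_rate / lag
    return sample_rate / max(scores, key=scores.get)


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(1000)]
    print(estimate_f0(tone, sr))  # 200.0
```

Running this over successive frames would give the kind of F0 trajectory the excerpt mentions. Restricting the lag range is what keeps the estimator from locking onto bass notes or high overtones outside the band of interest.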
Audio fingerprinting systems ... seek to identify specific recordings in new contexts ... to [for example] normalize large music content databases so that a plethora of versions of the same recording are not included in a user search and to relate user recommendation data to all versions of a source recording including radio edits, instrumental, remixes, and extended mix versions ... [Another example] is apocrypha ... [where] works are falsely attributed to an artist ... [possibly by an adversary after] some degree of signal transformation and distortion ... Audio shingling ... [of] features ... [for] sequences of 1 to 30 seconds duration ... [using] LSH [is often] employed in real-world systems.
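Audio shingling with LSH can be sketched roughly as follows: concatenate runs of consecutive feature vectors into "shingles", hash each shingle with a locality-sensitive hash, and look up queries by bucket. The random-hyperplane hash used here is one standard LSH family and is my assumption; the excerpt does not say which family real systems use. The class and function names are hypothetical.

```python
import random
from collections import defaultdict


def shingles(frames, width):
    """Concatenate each run of `width` consecutive feature vectors into one vector."""
    return [[x for frame in frames[i:i + width] for x in frame]
            for i in range(len(frames) - width + 1)]


class HyperplaneLSH:
    """Single-table LSH index using the signs of random projections as the hash key."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(set)

    def key(self, vector):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vector)) >= 0.0
                     for plane in self.planes)

    def add(self, vector, label):
        self.buckets[self.key(vector)].add(label)

    def query(self, vector):
        return set(self.buckets.get(self.key(vector), set()))


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for per-frame audio features of one track (10 frames, 4 dims each).
    track = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
    index = HyperplaneLSH(dim=3 * 4)
    for s in shingles(track, width=3):
        index.add(s, "track-a")
    louder = [2.0 * x for x in shingles(track, width=3)[0]]  # same audio, higher gain
    print(index.query(louder))  # {'track-a'}
```

Sign-of-projection hashes ignore overall scale, so a gain change still lands in the same bucket. Noisier distortions only collide with some probability, which is why practical systems typically use several hash tables and then verify candidates with a direct distance computation.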
By the way, for a look at an alternative to these kinds of automated analyses of music content, don't miss last Sunday's New York Times Magazine article, "The Song Decoders", describing Pandora's effort to manually add fine-grained mood, genre, and style categories to songs and articles and then use them for finding similar music.