The paper describes finding similar songs using very short audio snippets from the songs, a potential step toward building a music recommendation system. The similarity algorithm used is described as a variant of locality-sensitive hashing.
First, an excerpt from the paper describing LSH:
The general idea is to partition the feature vectors into subvectors and to hash each point into separate hash tables... Neighbors can be [found by] ... each hash casting votes for the entries of its indexed bin, and retaining the candidates that receive some minimum number of votes.

What is interesting (and a bit odd) about this work is that they used neural networks to learn the hash functions for LSH:
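To make the voting scheme concrete, here is a minimal sketch of that retrieval step. The function names and the rounding "hash" of each subvector are illustrative stand-ins, not the paper's actual hash functions:

```python
from collections import defaultdict

def build_tables(vectors, num_tables=4, sub_len=2):
    """Hash each vector into several tables, one per subvector.

    Each table keys on one contiguous subvector, quantized here by
    rounding (a stand-in for a real subvector hash).
    """
    tables = [defaultdict(set) for _ in range(num_tables)]
    for idx, v in enumerate(vectors):
        for t in range(num_tables):
            sub = tuple(round(x) for x in v[t * sub_len:(t + 1) * sub_len])
            tables[t][sub].add(idx)
    return tables

def query(tables, v, sub_len=2, min_votes=2):
    """Each table casts votes for the entries of its indexed bin;
    retain the candidates that receive at least `min_votes` votes."""
    votes = defaultdict(int)
    for t, table in enumerate(tables):
        sub = tuple(round(x) for x in v[t * sub_len:(t + 1) * sub_len])
        for idx in table.get(sub, ()):
            votes[idx] += 1
    return {idx for idx, n in votes.items() if n >= min_votes}
```

A query vector only needs to collide with a stored vector in a few of the tables, not all of them, which is what makes the scheme tolerant of small per-subvector differences.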
Our goal is to create a hash function that also groups "similar" points in the same bin, where similar is defined by the task. We call this a forgiving hash function in that it forgives differences that are small.

A curious detail is that assigning the target output bins randomly at the start of training worked poorly, so instead they let the target bins shift gradually during training, allowing the outputs of similar songs to drift together.
We train a neural network to take as input an audio spectrogram and to output a bin location where similar audio spectrograms will be hashed.
The primary difficulty in training arises in finding suitable target outputs for each network.

I have not quite been able to make up my mind about whether this is clever or a hack. On the one hand, reinitializing the target outputs to eliminate biases introduced by random initialization seems like a clever idea, perhaps even one that might have broad applicability. On the other hand, it seems like their learning model has a problem in that it does not automatically learn to shift the outputs from their initial settings, and the reordering step seems like a hack to force it to do so.
Every snippet of a song is labeled with the same target output. The target output for each song is assigned randomly.
The drawback of this randomized target assignment is that different songs that sound similar may have entirely different output representations (large Hamming distance between their target outputs). If we force the network to learn these artificial distinctions, we may hinder or entirely prevent the network from being able to correctly perform the mapping.
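The problem is easy to see in miniature: two randomly chosen binary target codes are expected to differ in about half their bits, no matter how similar the underlying songs sound. (This toy example is mine, not from the paper.)

```python
import random

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))

random.seed(0)
code_a = [random.randint(0, 1) for _ in range(32)]  # target for song A
code_b = [random.randint(0, 1) for _ in range(32)]  # target for song B

# For random 32-bit codes the expected Hamming distance is 16 bits,
# even if songs A and B are nearly identical to a listener.
```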
Instead of statically assigning the outputs, the target outputs shift throughout training ... We ... dynamically reassign the target outputs for each song to the [bin] ... that is closest to the network's response, aggregated over that song's snippets.
By letting the network adapt its outputs in this manner, the outputs across training examples can be effectively reordered to avoid forcing artificial distinctions .... Without reordering ... performance was barely above random.
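The reassignment step might look something like the following sketch. Everything here is my reading of the quoted description, not the authors' code: I assume the network emits a real-valued output vector per snippet, that candidate bins are represented as codes, and that "closest" means smallest L1 distance to the mean response over a song's snippets:

```python
import numpy as np

def reassign_targets(snippet_outputs, song_of_snippet, bin_codes):
    """Reassign each song's target to the bin code closest to the
    network's response, aggregated over that song's snippets.

    snippet_outputs: (n_snippets, code_len) array of network outputs
    song_of_snippet: length-n_snippets array of song ids
    bin_codes:       (n_bins, code_len) array of candidate target codes
    Returns a dict mapping song id -> index of its new target bin.
    """
    new_targets = {}
    for song in np.unique(song_of_snippet):
        mean_out = snippet_outputs[song_of_snippet == song].mean(axis=0)
        dists = np.abs(bin_codes - mean_out).sum(axis=1)  # L1 distance
        new_targets[song] = int(dists.argmin())
    return new_targets
```

Run after each training pass, this lets songs whose snippets already produce similar network responses converge on the same target bin instead of being pinned to their arbitrary initial assignments.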
In the end, I leave this paper confused. Is this a good approach? Or are there ways to solve this problem more directly?
Perhaps part of my confusion is that I do not understand what the authors are using as their similarity metric; it never seems to be stated explicitly. Is it that songs should be considered similar if the difference between their snippets is small? If so, is it clear that is what the NNet is learning?
Moreover, as much as I want to like the authors' idea, the evaluation only compares their approach to LSH, not to other classifiers. If the goal is to minimize the differences between snippets in the same bin while also minimizing the number of bins used for snippets of any given song, are there better tools for that task?