Keras: How To Aggregate Over Frame-level Predictions To Song-level Prediction
Solution 1:
In the dataset, each label might be a name (e.g. 'rock'). To use this with a neural network, it needs to be transformed to an integer (e.g. 2), and then to a one-hot encoding (e.g. [0,0,1]). So 'rock' == 2 == [0,0,1]. Your output predictions will be in this one-hot-encoded form: [ 0.1, 0.1, 0.9 ] means that class 2 was predicted, [ 0.9, 0.1, 0.1 ] means class 0, etc.
To do this in a reversible way, use sklearn.preprocessing.LabelBinarizer.
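As a minimal sketch of that round trip (the three-genre label set here is hypothetical, chosen so that 'rock' ends up as class 2), the label can be encoded with the binarizer and a prediction decoded back via argmax:

```python
import numpy
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
binarizer.fit(['blues', 'jazz', 'rock'])  # hypothetical three-genre label set

# classes_ is sorted alphabetically, so 'rock' gets class index 2
one_hot = binarizer.transform(['rock'])

# A prediction in one-hot-like form: the largest entry marks the predicted class
prediction = numpy.array([0.1, 0.1, 0.9])
predicted_index = int(numpy.argmax(prediction))
predicted_label = binarizer.classes_[predicted_index]

print(one_hot, predicted_index, predicted_label)
```

Looking the index up in binarizer.classes_ makes the mapping explicit; binarizer.inverse_transform does the same for a whole batch of predictions.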
There are several ways of combining frame-predictions into an overall prediction. The most common are, in increasing order of complexity:
- Majority voting on probabilities
- Mean/average voting on probabilities
- Averaging on log-odds of probabilities
- Sequence model on log-odds of probabilities
- Multiple-Instance Learning
Below is an example of the first three.
import numpy
from sklearn.preprocessing import LabelBinarizer
labels = [ 'rock', 'jazz', 'blues', 'metal' ]
binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)
print('labels\n', '\n'.join(labels))
print('y\n', y)
# Outputs from the frame-based classifier.
# The input would be all the frames in one song:
# frame_predictions = model.predict(frames)
frame_predictions = numpy.array([
    [ 0.5, 0.2, 0.3, 0.9 ],
    [ 0.9, 0.2, 0.3, 0.3 ],
    [ 0.5, 0.2, 0.3, 0.7 ],
    [ 0.1, 0.2, 0.3, 0.5 ],
    [ 0.9, 0.2, 0.3, 0.4 ],
])
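def vote_majority(p):
    # Count how often each class has the highest probability in a frame.
    # minlength keeps one entry per class even if the last classes never win.
    voted = numpy.bincount(numpy.argmax(p, axis=1), minlength=p.shape[1])
    normalized = voted / p.shape[0]
    return normalized

def vote_average(p):
    return numpy.mean(p, axis=0)
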
def vote_average_logits(p):
    logits = numpy.log(p / (1 - p))
    # Average over the frames (axis 0) to get one value per class.
    avg = numpy.mean(logits, axis=0)
    p = 1 / (1 + numpy.exp(-avg))
    return p

maj = vote_majority(frame_predictions)
mean = vote_average(frame_predictions)
mean_logits = vote_average_logits(frame_predictions)
genre_maj = binarizer.inverse_transform(numpy.array([maj]))
genre_mean = binarizer.inverse_transform(numpy.array([mean]))
genre_mean_logits = binarizer.inverse_transform(numpy.array([mean_logits]))
print('majority voting', maj, genre_maj)
print('mean voting', mean, genre_mean)
print('mean logits voting', mean_logits, genre_mean_logits)
Output (the mean logits values are rounded to three decimals)
labels:
rock
jazz
blues
metal
y:
[[0 0 0 1]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]]
majority voting: [0.4 0. 0. 0.6] ['rock']
mean voting: [0.58 0.2 0.3 0.56] ['blues']
mean logits voting: [0.608 0.2 0.3 0.589] ['blues']
A simple improvement over averaging probabilities is to compute the logits (log-odds) of the probabilities and average those. This accounts more properly for frames where a class is very likely or very unlikely. It can be seen as a form of Naive Bayes: the posterior probability is computed under the assumption that the frames are independent.
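To illustrate how the two kinds of averaging can disagree, here is a small sketch on made-up numbers (two classes, three frames). The function name average_log_odds and the eps clipping are my own additions, not part of the answer's code; the clipping only guards against probabilities of exactly 0 or 1, which would otherwise give infinite log-odds. Averaging log-odds is equivalent to taking the geometric mean of the odds, so one very confident frame counts for more than it does in a plain mean.

```python
import numpy

def average_log_odds(p, eps=1e-7):
    """Average per-class log-odds over frames (axis 0), then map back to probabilities."""
    p = numpy.clip(p, eps, 1 - eps)      # assumption: clip to avoid log(0) and division by zero
    logits = numpy.log(p / (1 - p))
    avg = logits.mean(axis=0)
    return 1 / (1 + numpy.exp(-avg))

# Made-up frame predictions: rows are frames, columns are classes.
# Class 0 has one very confident frame; class 1 is mildly preferred in every frame.
frames = numpy.array([
    [0.99, 0.60],
    [0.40, 0.60],
    [0.40, 0.60],
])

mean_probs = frames.mean(axis=0)
mean_log_odds = average_log_odds(frames)

print('mean of probabilities:', mean_probs, '-> class', numpy.argmax(mean_probs))
print('mean of log-odds:     ', mean_log_odds, '-> class', numpy.argmax(mean_log_odds))
```

On this input the plain mean narrowly prefers class 1, while the log-odds average prefers class 0 because of the single very confident frame.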
One can also perform voting by using a classifier trained on the frame-wise predictions, though this is not so commonly done and is complicated when the input length varies. A simple sequence model can be used, i.e. a Recurrent Neural Network (RNN) or a Hidden Markov Model (HMM).
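One way to sidestep the varying-length problem, sketched below, is to summarise each song's frame predictions as a fixed-length vector and train an ordinary classifier on that. Note that this is a simpler stand-in and not a sequence model: it ignores frame order. The data is synthetic, and song_features and the choice of mean and max statistics are assumptions made for illustration.

```python
import numpy
from sklearn.linear_model import LogisticRegression

rng = numpy.random.default_rng(0)

def song_features(frame_probs, eps=1e-6):
    """Summarise a (n_frames, n_classes) array as one fixed-length vector."""
    p = numpy.clip(frame_probs, eps, 1 - eps)
    logits = numpy.log(p / (1 - p))
    # Per-class mean and max of the log-odds: 2 * n_classes features per song
    return numpy.concatenate([logits.mean(axis=0), logits.max(axis=0)])

# Synthetic stand-in data: 40 songs, 4 classes, a varying number of frames per song.
n_classes = 4
songs, song_labels = [], []
for i in range(40):
    label = i % n_classes
    n_frames = int(rng.integers(5, 20))
    probs = rng.uniform(0.05, 0.5, size=(n_frames, n_classes))
    probs[:, label] += 0.4               # frames lean towards the true class
    songs.append(probs)
    song_labels.append(label)

X = numpy.stack([song_features(s) for s in songs])
clf = LogisticRegression(max_iter=1000).fit(X, song_labels)
song_predictions = clf.predict(X)
print(song_predictions[:8])
```

In practice the frame predictions would come from the frame-level model on held-out songs, so that the voting classifier is not trained on overconfident outputs.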
Another alternative is to use Multiple-Instance Learning, with GlobalAveragePooling over the frame-based classifications, to learn on whole songs at once.
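A rough sketch of that setup in Keras could look like the following. The feature size, layer sizes, and the fixed number of frames per song are hypothetical, and I am assuming the 1D variant of the pooling layer (GlobalAveragePooling1D) applied over the frame axis. The frame-level model is wrapped in TimeDistributed so that it is applied to every frame, and the pooled output is trained directly against song-level labels.

```python
import numpy
from tensorflow import keras
from tensorflow.keras import layers

n_frames, n_features, n_classes = 10, 128, 4   # hypothetical sizes

# Frame-level classifier: one feature vector in, class probabilities out
frame_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(n_classes, activation='softmax'),
])

# Song-level model: apply the frame model to every frame, then average over frames
song_input = keras.Input(shape=(n_frames, n_features))
frame_probs = layers.TimeDistributed(frame_model)(song_input)
song_probs = layers.GlobalAveragePooling1D()(frame_probs)
song_model = keras.Model(song_input, song_probs)
song_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Random stand-in input: 2 songs, each with n_frames frames
batch = numpy.random.rand(2, n_frames, n_features).astype('float32')
out = song_model.predict(batch, verbose=0)
print(out.shape)
```

Because the pooling is a mean over softmax outputs, each song-level output row is still a probability distribution, and song_model.fit can be called with one one-hot label per song.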