
How Do You Write A Django Model That Can Automatically Normalize Data?

I'm building a music recommendation engine that uses the lyrics of a track to figure out how closely songs are related to each other emotionally. I've used the tfidf algorithm(not

Solution 1:

There are two ways to trigger some action when a model is saved: override the save method, or write a post_save listener. I'll show the override method since it's a little simpler, and fits this use case nicely.

To get the max / min, you can use Django's queryset aggregation functions:

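from django.db import models
from django.db.models import Max, Min


class Party(models.Model):
    ...
    def save(self, *args, **kwargs):
        # Extremes among rows already stored (both are None if the table is empty).
        # This instance's unsaved tfidf is not part of the aggregate.
        bounds = Party.objects.aggregate(lo=Min('tfidf'), hi=Max('tfidf'))
        lo, hi = bounds['lo'], bounds['hi']
        if lo is None or hi == lo:
            # Nothing to scale against; assumes normalized_tfidf is nullable.
            self.normalized_tfidf = None
        else:
            self.normalized_tfidf = (self.tfidf - lo) / (hi - lo)
        super(Party, self).save(*args, **kwargs)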

Overriding default model methods like save is pretty straightforward; Django's documentation on overriding predefined model methods has more detail if you're interested.

Note that bulk operations on Party.tfidf (for example QuerySet.update()) do not call save(), and do not send post_save signals either. After a bulk update you would have to re-save every row yourself, which means a lot of DB writes and largely defeats the purpose of updating in bulk.
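To make the arithmetic in the save override concrete, and to show why stored values can go stale, here is a minimal pure-Python sketch with no Django involved. The helper name min_max_normalize and the sample scores are made up for illustration.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (v - min) / (max - min).

    Returns None for every entry when all values are equal,
    since the range is zero and the formula is undefined.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [None] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


scores = [2.0, 4.0, 6.0]  # hypothetical tfidf values
before = min_max_normalize(scores)

# Adding one new extreme value shifts the result for every existing entry,
# which is why values computed once at save time can become stale.
after = min_max_normalize(scores + [10.0])

print(before)  # [0.0, 0.5, 1.0]
print(after)   # [0.0, 0.25, 0.5, 1.0]
```

The second result shows that a single new row can change the normalized value of rows that were saved earlier, so those rows would need to be recomputed as well.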

Solution 2:

To prevent issues with stale data, etc., as mentioned by @klaws in the comments above, it may not be ideal to calculate the normalized value at the time a new song is added.

Instead, you could use a query that lets the database calculate the normalized value, whenever it is needed.

You'll need to import a few things from Django's expressions and aggregates:

from django.db.models import Window, F, Min, Max

Here's a simple example, applied to the OP's problem, assuming no grouping is needed:

def query_normalized_tfidf(party_queryset):
    w_min = Window(expression=Min('tfidf'))
    w_max = Window(expression=Max('tfidf'))
    return party_queryset.annotate(
        normalized_tfidf=(F('tfidf') - w_min) / (w_max - w_min))

The Window class allows us to continue annotating the individual objects instead of collapsing them into a single aggregate row, as explained in Django's docs on window functions.

Instead of using a separate query function, we could also add this to a custom model manager.

If you need the normalized values to be calculated with respect to certain groups (e.g. if the song had a genre), the above could be extended, and generalized, as follows:

def query_normalized_values(queryset, value_lookup, group_lookups=None):
    """
    a generalized version that normalizes data with respect to the
    extreme values within each group
    """
    partitions = None
    if group_lookups:
        partitions = [F(group_lookup) for group_lookup in group_lookups]
    w_min = Window(expression=Min(value_lookup), partition_by=partitions)
    w_max = Window(expression=Max(value_lookup), partition_by=partitions)
    return queryset.annotate(
        normalized=(F(value_lookup) - w_min) / (w_max - w_min))

This could be used as follows, assuming there would be a Party.genre field:

annotated_parties = query_normalized_values(
    queryset=Party.objects.all(), value_lookup='tfidf',
    group_lookups=['genre'])

This would normalize the tfidf values with respect to the extreme tfidf values within each genre.

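Note: In the special case where w_min equals w_max, the denominator is zero. Databases that return NULL for division by zero (e.g. SQLite) will give a "normalized value" of None; others (e.g. PostgreSQL) raise a division-by-zero error instead.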
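As a rough illustration of what the partitioned query computes, the following pure-Python sketch normalizes values within each group. It is not Django code: normalize_within_groups and the sample rows are hypothetical, and returning None for a zero range is a choice made here to mirror the None result mentioned in the note.

```python
from collections import defaultdict


def normalize_within_groups(rows, value_key, group_key):
    """Return a copy of rows with a 'normalized' entry added per row."""
    # First pass: collect the values of each group to find its min and max.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[value_key])
    bounds = {g: (min(vs), max(vs)) for g, vs in grouped.items()}

    # Second pass: scale each row against the extremes of its own group.
    result = []
    for row in rows:
        lo, hi = bounds[row[group_key]]
        span = hi - lo
        normalized = None if span == 0 else (row[value_key] - lo) / span
        result.append({**row, "normalized": normalized})
    return result


rows = [
    {"title": "a", "genre": "rock", "tfidf": 1.0},
    {"title": "b", "genre": "rock", "tfidf": 3.0},
    {"title": "c", "genre": "rock", "tfidf": 5.0},
    {"title": "d", "genre": "jazz", "tfidf": 7.0},
]
normalized_rows = normalize_within_groups(rows, "tfidf", "genre")
print([r["normalized"] for r in normalized_rows])  # [0.0, 0.5, 1.0, None]
```

The lone jazz entry has identical minimum and maximum, so it falls into the zero-range case and gets None rather than a number.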
