Skip to content Skip to sidebar Skip to footer

How To Merge Two Pandas Dataframes Based On A Similarity Function?

Given dataset 1 name,x,y st. peter,1,2 big university portland,3,4 and dataset 2 name,x,y saint peter3,4 uni portland,5,6 The goal is to merge on d1.merge(d2, on='name', how='le

Solution 1:

Did you look at fuzzywuzzy?

You might do something like:

import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2.name)

deffmatch(row): 
    minscore=95#or whatever score works for you
    choice,score = fwp.extractOne(row.name,choices)
    return choice if score > minscore elseNone

df1['df2_name'] = df1.apply(fmatch,axis=1)
merged = pd.merge(df1, 
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1','_df2'],
                  how = 'outer') # assuming you want to keep unmatched records

Caveat Emptor: I haven't tried to run this.

Solution 2:

Let's say you have that function which returns the best match if any, None otherwise:

defbest_match(s, candidates):
    ''' Return the item in candidates that best matches s.

    Will return None if a good enough match is not found.
    '''# Some code here.

Then you can join on the values returned by it, but you can do it in different ways that would lead to different output (so I think, I did not look much at this issue):

(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
 .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name'])))),
           on='name', how='left'))

Solution 3:

The simplest idea I can get now is to create special dataframe with distances between all names:

>>>from Levenshtein import distance>>>df1['dummy'] = 1>>>df2['dummy'] = 1>>>merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]>>>merger
                     name1         name2  x2  y2
0                st. peter   saint peter   3   4
1                st. peter  uni portland   5   6
2  big university portland   saint peter   3   4
3  big university portland  uni portland   5   6

>>>merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)>>>merger
                     name1         name2  x2  y2  res
0                st. peter   saint peter   3   4    4
1                st. peter  uni portland   5   6    9
2  big university portland   saint peter   3   4   18
3  big university portland  uni portland   5   6   11
>>>merger = merger[merger['res'] <= 5]>>>merger
       name1        name2  x2  y2  res
0  st. peter  saint peter   3   4    4

>>>del df1['dummy']>>>del merger['res']>>>pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
                      name  x  y      name1        name2  x2  y2
0                st. peter  1  2  st. peter  saint peter   3   4
1  big university portland  3  4        NaN          NaN NaN NaN

Post a Comment for "How To Merge Two Pandas Dataframes Based On A Similarity Function?"