
Nearest Neighbor Matching In Pandas

Given two DataFrames (t1, t2), both with a column 'x', how would I append a column to t1 with the ID of t2 whose 'x' value is the nearest to the 'x' value in t1? t1: id x 1 1.49

Solution 1:

Method 1: merge_asof

import pandas as pd

# merge_asof requires both frames to be sorted on the key column
df = pd.merge_asof(df1.sort_values('x'),
                   df2.sort_values('x'),
                   on='x',
                   direction='nearest',
                   suffixes=('', '_2'))

print(df)
   id     x  id_2
0   3  0.87     6
1   1  1.49     5
2   2  2.35     4
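
For reference, here is a minimal self-contained sketch of the same merge_asof call. The frame contents are an assumption, taken from the t1/t2 values used in the other answers on this page, so the result differs from the printout above.

```python
import pandas as pd

# Assumed sample data (same values as the t1/t2 frames in the other answers).
t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
t2 = pd.DataFrame({"id": [3, 4], "x": [2.36, 1.50]})

# Both inputs must be sorted on the key column before calling merge_asof.
matched = pd.merge_asof(
    t1.sort_values("x"),
    t2.sort_values("x"),
    on="x",
    direction="nearest",   # pick the closest x in t2, whether smaller or larger
    suffixes=("", "_2"),   # t2's id column becomes id_2
)
print(matched)
```

The x column in the result is the one from t1; the matched x from t2 is not kept.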

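Method 2: reindex

# method='nearest' needs a sorted (monotonic) index, hence sort_index()
df1['id2'] = df2.set_index('x')['id'].sort_index().reindex(df1['x'], method='nearest').to_numpy()
df1
   id     x  id2
0   1  1.49    4
1   2  2.35    3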
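
One thing to watch with the reindex approach: it fails if the 'x' values used as the index contain duplicates. The sketch below is one possible workaround, assuming it is acceptable to keep only the first id for each repeated x. The third row of t2 is made up for illustration.

```python
import pandas as pd

t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
# Hypothetical t2 in which ids 4 and 5 share the same x value.
t2 = pd.DataFrame({"id": [3, 4, 5], "x": [2.36, 1.50, 1.50]})

lookup = (
    t2.drop_duplicates(subset="x")   # keep the first id per x (an arbitrary choice)
      .set_index("x")["id"]
      .sort_index()                  # method="nearest" needs a monotonic index
)

# For each x in t1, take the id whose index label is closest.
t1["id2"] = lookup.reindex(t1["x"], method="nearest").to_numpy()
print(t1)
```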

Solution 2:

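Convert t1 and t2 to lists, sort each by x, and pair the rows up with zip(). Note that this matches rows by rank, not by distance, so it only gives the nearest neighbour when both frames have the same length and their sorted values line up one to one.

list1 = t1.values.tolist()   # rows of [id, x]
list2 = t2.values.tolist()

# sort both by x (the second element); ascending or descending, as long as both agree
list1.sort(key=lambda row: row[1])
list2.sort(key=lambda row: row[1])

# pair the ids of the rows that share the same rank
list3 = [(int(a[0]), int(b[0])) for a, b in zip(list1, list2)]

print(list3)

# after that the output should look like [(1, 4), (2, 3)]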
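
If the two lists are not the same length, pairing by rank no longer works. A rough list-based alternative is to search for the minimum distance explicitly, as sketched here. The row with id 5 is a made-up extra, and the search is O(len(list1) * len(list2)), so this is only meant for small inputs.

```python
# Rows are [id, x]; the last row of list2 is a hypothetical extra entry.
list1 = [[1, 1.49], [2, 2.35]]
list2 = [[3, 2.36], [4, 1.5], [5, 9.0]]

# For each row of list1, pick the row of list2 with the smallest |x1 - x2|.
pairs = [
    (id1, min(list2, key=lambda row: abs(row[1] - x1))[0])
    for id1, x1 in list1
]
print(pairs)  # [(1, 4), (2, 3)]
```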

Solution 3:

You can compute an array holding the distance from each element in t1 to each element in t2, and then take the argmin along each row to get the index of the closest t2 entry. This has the advantage that you can choose whatever distance function you like, and it does not require the DataFrames to be of equal length. It creates one intermediate array of size len(t1) * len(t2). A pandas builtin might be more memory-efficient, but this should be fast because all the work happens in NumPy's compiled code. If memory is a problem, you can process t1 in batches.

import numpy as np
import pandas as pd

t1 = pd.DataFrame({"id": [1, 2], "x": np.array([1.49, 2.35])})
t2 = pd.DataFrame({"id": [3, 4], "x": np.array([2.36, 1.5])})

Now comes the part doing the actual work. The .to_numpy() calls are important since otherwise pandas tries to align on the indices. The dist line uses broadcasting to create horizontal and vertical "repetitions" in a memory-efficient way; row i of dist holds the distances from the i-th x in t1 to every x in t2.

x1 = t1["x"].to_numpy()
x2 = t2["x"].to_numpy()
dist = np.abs(x1[:, np.newaxis] - x2[np.newaxis, :])  # shape (len(t1), len(t2))
closest_idx = np.argmin(dist, axis=1)                 # position in t2 for each row of t1
closest_id = t2["id"].to_numpy()[closest_idx]

output = pd.DataFrame({"id1": t1["id"], "id2": closest_id})
print(output)
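
The batching idea mentioned above could look roughly like this. It is only a sketch: nearest_ids, batch_size and dist are names made up here, and the default distance is the absolute difference used in this answer.

```python
import numpy as np
import pandas as pd


def nearest_ids(t1, t2, batch_size=1000, dist=lambda a, b: np.abs(a - b)):
    """For each row of t1, return the id of the t2 row closest in 'x'."""
    x1 = t1["x"].to_numpy()
    x2 = t2["x"].to_numpy()
    ids2 = t2["id"].to_numpy()
    out = np.empty(len(x1), dtype=ids2.dtype)
    for start in range(0, len(x1), batch_size):
        stop = start + batch_size
        # The intermediate array is at most batch_size x len(t2).
        d = dist(x1[start:stop, np.newaxis], x2[np.newaxis, :])
        out[start:stop] = ids2[np.argmin(d, axis=1)]
    return out


t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
t2 = pd.DataFrame({"id": [3, 4], "x": [2.36, 1.5]})

# batch_size=1 is only used here to exercise the loop on tiny data.
result = nearest_ids(t1, t2, batch_size=1)
print(result)
```

Any function that broadcasts over NumPy arrays can be passed as dist, for example a squared difference.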

Solution 4:

Alternatively, you can use round to 1 decimal of precision and merge on the rounded 'x'

t1 = {'id': [1, 2], 'x': [1.49,2.35]}
t2 = {'id': [3, 4], 'x': [2.36,1.5]}
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
df  = df1.round(1).merge(df2.round(1), on='x', suffixes=('','2')).drop(columns='x')
print(df)
   id  id2
0   1    4
1   2    3
  • add .drop(columns='x') to remove the binding column 'x' from the output.
  • add suffixes=('','2') to rename the column titles.
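
Keep in mind that rounding matches values that fall into the same bucket, which is not always the same as the nearest value. The small sketch below uses made-up numbers that are close together but round differently, and compares the result with merge_asof.

```python
import pandas as pd

# Hypothetical values: 0.02 apart, but they round to 1.4 and 1.5.
a = pd.DataFrame({"id": [1], "x": [1.44]})
b = pd.DataFrame({"id": [4], "x": [1.46]})

# Exact merge on the rounded key finds nothing.
by_rounding = a.round(1).merge(b.round(1), on="x", suffixes=("", "2"))

# merge_asof with direction="nearest" still pairs the two rows.
by_asof = pd.merge_asof(a, b, on="x", direction="nearest", suffixes=("", "2"))

print(len(by_rounding))         # 0
print(by_asof["id2"].tolist())  # [4]
```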
