Find The Nearest Point In Distance For All The Points In The Dataset - Python
Solution 1:
The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point.
In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)).
There is also the additional one-time cost of building the KD-tree, which requires O(N) time.
If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)).
Thus, for large enough N, the KD-tree will beat the brute force method.
See here for more on nearest neighbor algorithms (including KD-tree).
Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.
scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.
In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1loop, best of 3: 728 ms per loopFor N = 10000:
In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by
which using_kdtree is faster than orig will grow as N, the length of
data, grows.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt
R = 6367defusing_kdtree(data):
"Based on https://stackoverflow.com/q/43020919/190597"defdist_to_arclength(chord_length):
"""
https://en.wikipedia.org/wiki/Great-circle_distance
Convert Euclidean chord length to great circle arc length
"""
central_angle = 2*np.arcsin(chord_length/(2.0*R))
arclength = R*central_angle
return arclength
phi = np.deg2rad(data['Latitude'])
theta = np.deg2rad(data['Longitude'])
data['x'] = R * np.cos(phi) * np.cos(theta)
data['y'] = R * np.cos(phi) * np.sin(theta)
data['z'] = R * np.sin(phi)
tree = spatial.KDTree(data[['x', 'y','z']])
distance, index = tree.query(data[['x', 'y','z']], k=2)
return dist_to_arclength(distance[:, 1])
deforig(data):
defdistance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
c = 2 * asin(sqrt(a))
km = R * c
return km
shortest_distance = []
for i inrange(len(data)):
distance1 = []
for j inrange(len(data)):
if i == j: continue
distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
return shortest_distance
defusing_sklearn(data):
"""
Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
"""defdistance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = R * c
return km
points = data[['Longitude', 'Latitude']]
nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
return result
np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
'Longitude':np.random.uniform(0,360,size=N)})
expected = orig(data)
for func in [using_kdtree, using_sklearn]:
result = func(data)
assert np.allclose(expected, result)
Solution 2:
You can do this very efficiently by calling a library that implements smart algorithms for this, one example would be sklearn which has a NearestNeighbors method that does exactly this.
Example of the code modified to do this:
from sklearn.neighbors import NearestNeighbors
import numpy as np
defdistance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km
points = [[25.42, 55.47],
[25.39, 55.47],
[24.48, 54.38],
[24.51, 54.54]]
nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
which gives
>>>result
array([ 1.889697 , 1.889697 , 17.88530556, 17.88530556])
Solution 3:
You can use a dictionary to hash some calculations. Your code calculates the distance A to B many times (A and B being 2 arbitrary points in your dataset).
Either implement your own cache:
from math import radians, cos, sin, asin, sqrt
dist_cache = {}
defdistance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""try:
return dist_cache[(lon1, lat1, lon2, lat2)]
except KeyError:
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
dist_cache[(lon1, lat1, lon2, lat2)] = km
return km
Or use lru_cache:
from math import radians, cos, sin, asin, sqrt
from functools import lru_cache
@lru_cachedefdistance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
Post a Comment for "Find The Nearest Point In Distance For All The Points In The Dataset - Python"