
Efficient Way To Calculate Distance Matrix Given Latitude And Longitude Data In Python

I have latitude and longitude data, and I need to calculate a distance matrix between two arrays of locations. I used this to get the distance between two locations give...

Solution 1:

There are a lot of suboptimal things in the Haversine equations you are using. You can trim some of that and minimize the number of sines, cosines and square roots you need to calculate. The following is the best I have been able to come up with, and on my system it runs about 5x faster than Ophion's code (which does mostly the same as far as vectorization goes) on two random arrays of 1000 and 2000 elements:

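import numpy as np

def spherical_dist(pos1, pos2, r=3958.75):
    # pos1, pos2: arrays whose last axis holds (lat, lon) in degrees;
    # r is the sphere radius (3958.75 is the Earth's radius in miles).
    pos1 = pos1 * np.pi / 180
    pos2 = pos2 * np.pi / 180
    cos_lat1 = np.cos(pos1[..., 0])
    cos_lat2 = np.cos(pos2[..., 0])
    cos_lat_d = np.cos(pos1[..., 0] - pos2[..., 0])
    cos_lon_d = np.cos(pos1[..., 1] - pos2[..., 1])
    # Spherical law of cosines, with sin(lat1)*sin(lat2) rewritten as
    # cos(lat1 - lat2) - cos(lat1)*cos(lat2) so no sines are computed.
    return r * np.arccos(cos_lat_d - cos_lat1 * cos_lat2 * (1 - cos_lon_d))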

If you feed it your two arrays "as is" it will complain with a broadcasting error, because shapes (3, 2) and (4, 2) do not broadcast against each other, but that's not a bug, it's a feature. Basically, this function computes the distance on a sphere over the last dimension, and broadcasts over the rest. So you can get what you are after as:

>>> spherical_dist(locations_1[:, None], locations_2)
array([[ 186.13522573,  345.46610882,  566.23466349,  282.51056676],
       [ 187.96657622,  589.43369894,  555.55312473,  436.88855214],
       [ 149.5853537 ,  297.56950329,  440.81203371,  387.12153747]])

But it could also be used to calculate the distances between two lists of points, i.e.:

>>> spherical_dist(locations_1, locations_2[:-1])
array([ 186.13522573,  589.43369894,  440.81203371])

Or between two single points:

>>> spherical_dist(locations_1[0], locations_2[0])
186.1352257300577

This is inspired by how gufuncs work, and once you get used to it, I have found it to be a wonderful "swiss army knife" coding style that lets you reuse a single function in lots of different settings.
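As a sanity check, the simplified formula can be compared against the textbook haversine evaluated pair by pair. The following is only a sketch: `spherical_dist` is copied from the answer above, while `haversine_pair` is a hypothetical helper written for this comparison and is not part of the original answer.

```python
import numpy as np

def spherical_dist(pos1, pos2, r=3958.75):
    # Copied from the answer above: last axis is (lat, lon) in degrees.
    pos1 = pos1 * np.pi / 180
    pos2 = pos2 * np.pi / 180
    cos_lat1 = np.cos(pos1[..., 0])
    cos_lat2 = np.cos(pos2[..., 0])
    cos_lat_d = np.cos(pos1[..., 0] - pos2[..., 0])
    cos_lon_d = np.cos(pos1[..., 1] - pos2[..., 1])
    return r * np.arccos(cos_lat_d - cos_lat1 * cos_lat2 * (1 - cos_lon_d))

def haversine_pair(lat1, lon1, lat2, lon2, r=3958.75):
    # Hypothetical reference: textbook haversine for a single pair of points.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])

# (3, 1, 2) broadcast against (4, 2) gives one distance per pair: shape (3, 4).
matrix = spherical_dist(locations_1[:, None], locations_2)

# Slow double loop, used only as a reference.
reference = np.array([[haversine_pair(*p, *q) for q in locations_2]
                      for p in locations_1])

print(matrix.shape)                    # (3, 4)
print(np.allclose(matrix, reference))  # True
```

Both expressions are exact on a sphere, so they agree to floating-point precision for points that are neither nearly identical nor nearly antipodal.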

Solution 2:

This is simply vectorizing your code:

def new_get_distances(loc1, loc2):
    earth_radius = 3958.75  # miles

    # deg2rad returns new arrays, so the inputs are never modified.
    locs_1 = np.deg2rad(loc1)
    locs_2 = np.deg2rad(loc2)

    # Half-differences for every pair, shape (len(loc1), len(loc2)).
    lat_dif = locs_1[:, 0][:, None] / 2 - locs_2[:, 0] / 2
    lon_dif = locs_1[:, 1][:, None] / 2 - locs_2[:, 1] / 2

    # Work in place from here on to avoid allocating temporaries.
    np.sin(lat_dif, out=lat_dif)
    np.sin(lon_dif, out=lon_dif)

    np.power(lat_dif, 2, out=lat_dif)
    np.power(lon_dif, 2, out=lon_dif)

    lon_dif *= np.cos(locs_1[:, 0])[:, None] * np.cos(locs_2[:, 0])
    lon_dif += lat_dif  # lon_dif now holds the haversine term

    np.arctan2(np.power(lon_dif, .5), np.power(1 - lon_dif, .5), out=lon_dif)
    lon_dif *= 2 * earth_radius

    return lon_dif

locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])
# get_distances is the original function from the question
old = get_distances(locations_1, locations_2)
new = new_get_distances(locations_1, locations_2)

>>> np.allclose(old, new)
True

If we look at timings:

%timeit new_get_distances(locations_1, locations_2)
10000 loops, best of 3: 80.6 µs per loop

%timeit get_distances(locations_1, locations_2)
10000 loops, best of 3: 74.9 µs per loop

It is actually slower for a small example; however, let's look at a larger example:

locations_1 = np.random.rand(1000, 2)
locations_2 = np.random.rand(1000, 2)

%timeit get_distances(locations_1, locations_2)
1 loops, best of 3: 5.84 s per loop

%timeit new_get_distances(locations_1, locations_2)
10 loops, best of 3: 149 ms per loop

We now have a speedup of about 40x (5.84 s versus 149 ms). There is probably some more speed to squeeze out in a few places.

Edit: Made a few updates to cut out redundant places and make it clear that we are not altering the original location arrays.
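To illustrate the two points in that edit, here is a small sketch of how the `out=` argument reuses a buffer while the caller's arrays stay untouched. The arrays and the name `loc1_before` are made up for the demonstration; only `np.deg2rad`, broadcasting and `out=` are taken from the function above.

```python
import numpy as np

loc1 = np.array([[34.0, -81.0], [32.0, -87.0]])
loc2 = np.array([[33.0, -84.0], [39.0, -81.0], [40.0, -88.0]])
loc1_before = loc1.copy()  # kept only to show that loc1 is not modified

# deg2rad allocates new arrays, so later in-place work never touches the inputs.
locs_1 = np.deg2rad(loc1)
locs_2 = np.deg2rad(loc2)

# Broadcasting a (2, 1) column against a (3,) row gives a fresh (2, 3) array.
lat_dif = locs_1[:, 0][:, None] / 2 - locs_2[:, 0] / 2

# out= writes the result into the existing buffer instead of allocating one.
result = np.sin(lat_dif, out=lat_dif)

print(result is lat_dif)                  # True
print(lat_dif.shape)                      # (2, 3)
print(np.array_equal(loc1, loc1_before))  # True
```

For large inputs each avoided temporary is a full len(loc1) x len(loc2) array, which is where the in-place style can pay off.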

Solution 3:

It is more efficient to use meshgrid to replace the double for loop:

import numpy as np

earth_radius = 3958.75

def get_distances(locs_1, locs_2):
    # indexing="ij" gives every grid the shape (len(locs_1), len(locs_2)),
    # so element [i, j] pairs locs_1[i] with locs_2[j].
    lats1, lats2 = np.meshgrid(locs_1[:, 0], locs_2[:, 0], indexing="ij")
    lons1, lons2 = np.meshgrid(locs_1[:, 1], locs_2[:, 1], indexing="ij")

    lat_dif = np.radians(lats1 - lats2)
    long_dif = np.radians(lons1 - lons2)

    sin_d_lat = np.sin(lat_dif / 2.)
    sin_d_long = np.sin(long_dif / 2.)

    # cos(lat) must be evaluated per pair, so use the full latitude grids.
    step_1 = (sin_d_lat ** 2) + (sin_d_long ** 2) * np.cos(np.radians(lats1)) * np.cos(np.radians(lats2))
    step_2 = 2 * np.arctan2(np.sqrt(step_1), np.sqrt(1 - step_1))

    dist = step_2 * earth_radius

    return dist
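The axis order that np.meshgrid produces is easy to trip over. With the default indexing="xy" the grids have shape (len(second), len(first)), while indexing="ij" gives (len(first), len(second)). A minimal sketch, using made-up arrays `a` and `b` that are not part of the answer:

```python
import numpy as np

a = np.array([10, 20, 30])   # stands in for 3 values from the first set
b = np.array([1, 2, 3, 4])   # stands in for 4 values from the second set

xy_a, xy_b = np.meshgrid(a, b)                 # default indexing="xy"
ij_a, ij_b = np.meshgrid(a, b, indexing="ij")

print(xy_a.shape)  # (4, 3): rows follow b, columns follow a
print(ij_a.shape)  # (3, 4): rows follow a, columns follow b

# Plain broadcasting builds the same pairwise difference without the grids.
diff = a[:, None] - b
print(np.array_equal(ij_a - ij_b, diff))       # True
print(np.array_equal(xy_a - xy_b, diff.T))     # True
```

Whichever indexing is used, the result matrix inherits that orientation, so it is worth checking the output shape before comparing against another implementation.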

Solution 4:

Does the Haversine formula provide good enough accuracy for your use? It treats the Earth as a perfect sphere, so it can be off by quite a bit (errors of up to roughly 0.5% are commonly cited). I think you'd be able to get both accuracy and speed if you use proj.4, in particular its Python bindings, pyproj. Note that pyproj can work directly on numpy arrays of coordinates.
