Find the nearest point for every point in a dataset - Python

I have a dataset like the following:

Id     Latitude      Longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51         54.54

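For every point in the dataset, I want to find the distance to its nearest point. I found the following distance function on the internet:

from math import radians, cos, sin, asin, sqrt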
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km

I am using the following code:

shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

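However, this code loops over every entry for each entry, which takes n^2 iterations, so it is very slow. My dataset contains nearly 1 million records, and traversing all elements for every element becomes very expensive.

I would like to find a better way to get the nearest point for each row. Can anyone help me find a way to solve this in Python?

Thanks

Originally posted by haimen; translation licensed under CC BY-SA 4.0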

2 answers

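The brute-force method of finding the nearest of N points to a given point is O(N): you have to check every point. In contrast, if the N points are stored in a KD-tree, finding the nearest point is O(log(N)) on average. There is also the additional one-time cost of building the KD-tree, which typically takes O(N*log(N)) time.

If you need to repeat this process N times, the brute-force method is O(N**2) and the KD-tree method is O(N*log(N)). Therefore, for large enough N, the KD-tree will beat the brute-force method.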
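To get a feel for what this difference means at the question's scale, here is a rough back-of-the-envelope sketch. The helper names `brute_force_checks` and `kdtree_checks` are made up for illustration, and the estimates ignore constant factors and the tree-building cost, so they only indicate how the gap grows; they are not measurements.

```python
import math

def brute_force_checks(n):
    """Distance evaluations for all-pairs brute force: each of n points vs. the other n - 1."""
    return n * (n - 1)

def kdtree_checks(n):
    """Order-of-magnitude estimate only: n queries at roughly log2(n) steps each."""
    return n * math.log2(n)

for n in (100, 10_000, 1_000_000):
    ratio = brute_force_checks(n) / kdtree_checks(n)
    print(f"N={n:>9,}: brute force ~{brute_force_checks(n):.1e} evaluations, "
          f"KD-tree ~{kdtree_checks(n):.1e} steps, ratio ~{ratio:,.0f}x")
```

Real speedups depend heavily on the implementation, but the ratio grows roughly like N / log(N), which is why the brute-force loop becomes impractical around a million rows.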

For more information on nearest-neighbor algorithms (including KD-trees), see here.


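Below (in the function using_kdtree) is a way to compute the great-circle arc length to each point's nearest neighbor using scipy.spatial.KDTree.

scipy.spatial.KDTree uses the Euclidean distance between points, but there is a formula for converting the Euclidean chord distance between two points on a sphere into the great-circle arc length (given the radius of the sphere). So the idea is to convert the latitude/longitude data to Cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great-circle distance formula to obtain the desired result.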
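The chord-to-arc conversion can be checked on a single pair of points with the standard library alone. This is a minimal sketch: the helper names `to_cartesian`, `chord_to_arc` and `haversine` are chosen here for illustration, R = 6367 km is the radius used in the question, and the two sample points come from the question's table.

```python
from math import radians, sin, cos, asin, sqrt

R = 6367  # sphere radius in km, the same value the question's code uses

def to_cartesian(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to x, y, z on a sphere of radius R."""
    phi, theta = radians(lat_deg), radians(lon_deg)
    return (R * cos(phi) * cos(theta), R * cos(phi) * sin(theta), R * sin(phi))

def chord_to_arc(chord):
    """Convert a straight-line (chord) distance into a great-circle arc length."""
    return 2 * R * asin(chord / (2 * R))

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * R * asin(sqrt(a))

p = (25.42, 55.47)  # (latitude, longitude) of row 1 in the question's table
q = (24.48, 54.38)  # (latitude, longitude) of row 3

# Euclidean distance between the two Cartesian points
chord = sqrt(sum((a - b) ** 2 for a, b in zip(to_cartesian(*p), to_cartesian(*q))))
arc = chord_to_arc(chord)

print(chord, arc, haversine(*p, *q))
```

Because the arc length increases monotonically with the chord length, the neighbor that is nearest by Euclidean distance in Cartesian space is also nearest by great-circle distance, so the KD-tree search returns the right neighbor before any conversion is applied.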


Here are some benchmarks. With N = 100, using_kdtree is about 39 times faster than orig (the brute-force method).

In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop

In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop

In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop

For N = 10000:

In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop

In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop

In [7]: %timeit orig(data)
# untested; too slow

Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by which using_kdtree is faster than orig will grow as N, the length of data, grows.

import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a))
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance

def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km
    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
                     'Longitude':np.random.uniform(0,360,size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)

Originally posted by unutbu; translation licensed under CC BY-SA 3.0

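You can do this very efficiently by calling a library that implements smart algorithms for it. One example is sklearn, which has a NearestNeighbors class that does exactly this.

An example of the code, modified to do this:

from sklearn.neighbors import NearestNeighbors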
import numpy as np

def distance(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    lon1, lat1 = p1
    lon2, lat2 = p2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

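# NOTE: distance() unpacks each point as (lon, lat), but these rows follow the
# question's table order (Latitude, Longitude). The result shown below reflects
# this order; swap the two columns to get true great-circle distances.
points = [[25.42, 55.47],
          [25.39, 55.47],
          [24.48, 54.38],
          [24.51, 54.54]]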

nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)

distances, indices = nbrs.kneighbors(points)

result = distances[:, 1]

This gives:

>>> result
array([  1.889697  ,   1.889697  ,  17.88530556,  17.88530556])
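As a possible variation on the approach above, scikit-learn also ships a built-in `'haversine'` metric, which avoids calling a Python function for every pair of points. The sketch below assumes scikit-learn's `BallTree` class and treats each row as (latitude, longitude), matching the question's table; the built-in metric expects radians and returns distances on the unit sphere, so the result is multiplied by the radius. The variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import BallTree

R = 6367  # sphere radius in km, the same value used in the thread

# Rows are (latitude, longitude) in degrees, taken from the question's table.
points_deg = np.array([[25.42, 55.47],
                       [25.39, 55.47],
                       [24.48, 54.38],
                       [24.51, 54.54]])

# The 'haversine' metric expects (latitude, longitude) pairs in radians.
points_rad = np.radians(points_deg)
tree = BallTree(points_rad, metric='haversine')

# k=2 because the closest match of each point is the point itself (distance 0).
dist_rad, idx = tree.query(points_rad, k=2)

nearest_km = dist_rad[:, 1] * R   # unit-sphere distance scaled to km
nearest_idx = idx[:, 1]           # row index of each point's nearest neighbor

print(nearest_idx)
print(nearest_km)
```

If the dataset contains exact duplicate coordinates, column 1 may hold a duplicate at distance 0 rather than a distinct neighbor, so deduplicating first may be worthwhile.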

Originally posted by Jonas Adler; translation licensed under CC BY-SA 3.0
