# Pros and Cons of k-Nearest Neighbors

k-nearest neighbors is one of the simplest predictive models. It makes few mathematical assumptions and requires no heavy mathematical machinery. All it needs is:

• some notion of distance

• the assumption that points close to each other are similar (otherwise using neighbors to predict an outcome would make no sense)

k-nearest neighbors deliberately ignores a great deal of information: the prediction for each new data point depends only on the handful of points closest to it.

# Case Study: Favorite Programming Languages

```
cities = [([-86.75, 33.5666666666667], 'Python'),
          ([-88.25, 30.6833333333333], 'Python'),
          ([-112.016666666667, 33.4333333333333], 'Java'),
          ...]
```

## Visualizing the Data

```
plots = { "Java" : ([], []), "Python" : ([], []), "R" : ([], []) }
```

```
markers = { "Java" : "o", "Python" : "s", "R" : "^" }
colors  = { "Java" : "r", "Python" : "b", "R" : "g" }
```

```
for (longitude, latitude), language in cities:
    plots[language][0].append(longitude)
    plots[language][1].append(latitude)
```

```
In [1]: plots.items()
Out[1]:
[('Python', ([-86.75, -88.25, -118.15, ...],
             [33.5666666666667, 30.6833333333333, 33.8166666666667, ...])),
 ('R', ...),
 ('Java', ...)]
```

```
import matplotlib.pyplot as plt

for language, (x, y) in plots.items():
    plt.scatter(x, y, color=colors[language], marker=markers[language],
                label=language, zorder=10)

plt.legend(loc=0)                   # let matplotlib pick a location
plt.axis([-130, -60, 20, 55])       # set the axes
plt.title("most popular language")  # set the chart title
plt.show()
```

## A Python Implementation of k-Nearest Neighbors

```
In [2]: cities[0]
Out[2]: ([-86.75, 33.5666666666667], 'Python')
```

```
other_cities = [other_city for other_city in cities
                if other_city != cities[0]]
```

```
from linear_algebra import distance

by_distance = sorted(other_cities,
                     key=lambda point_label: distance(point_label[0], cities[0][0]))
```
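The `distance` function here comes from the book's own `linear_algebra` helper module, which is not shown in this article. A self-contained stand-in, assuming plain Euclidean distance on points given as coordinate lists:

```python
import math

def distance(v, w):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

print(distance([0, 0], [3, 4]))  # → 5.0 (the classic 3-4-5 right triangle)
```

For the city data this treats longitude and latitude as flat plane coordinates, which is the same simplification the case study itself makes.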

```
In [3]: k_nearest_labels = [label for _, label in by_distance[:1]]

In [4]: k_nearest_labels
Out[4]: ['Python']
```

```
In [5]: k_nearest_labels = [label for _, label in by_distance[:3]]

In [6]: k_nearest_labels
Out[6]: ['Python', 'R', 'Python']

In [7]: k_nearest_labels = [label for _, label in by_distance[:5]]

In [8]: k_nearest_labels
Out[8]: ['Python', 'R', 'Python', 'Java', 'R']

In [9]: k_nearest_labels = [label for _, label in by_distance[:7]]

In [10]: k_nearest_labels
Out[10]: ['Python', 'R', 'Python', 'Java', 'R', 'Python', 'Java']
```

```
from collections import Counter

def majority_vote(labels):
    """assumes labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)   # Counter maps each label to its number of occurrences
    winner, winner_count = vote_counts.most_common(1)[0]  # most_common(n) returns the n most frequent (label, count) pairs
    num_winners = len([count
                       for count in vote_counts.values()
                       if count == winner_count])  # how many labels are tied for the top count

    if num_winners == 1:
        return winner                     # unique winner, return it
    else:
        return majority_vote(labels[:-1]) # tie: drop the farthest point and try again
```
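The tie-breaking recursion can be checked on a small hand-made label list (hypothetical data, not from the case study): with two labels tied at two votes each, dropping the farthest label breaks the tie.

```python
from collections import Counter

def majority_vote(labels):
    """assumes labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])   # tie: drop the farthest label and retry

# 'Python' and 'R' are tied 2-2; dropping the farthest label ('R')
# leaves 'Python' ahead 2-1
print(majority_vote(['Python', 'R', 'Python', 'R']))  # → Python
```

Because each recursive call shortens the list by one, the recursion is guaranteed to terminate: at worst it reaches a single label, which always wins.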

```
def knn_classify(k, labeled_points, new_point):
    """k is the number of nearest points to use; labeled_points are (point, label)
    pairs, i.e. every point except the one to predict; new_point is the point
    whose label we want to predict"""

    # sort the labeled points from nearest to farthest
    by_distance = sorted(labeled_points,
                         key=lambda point_label: distance(point_label[0], new_point))

    # find the labels of the k nearest points
    k_nearest_labels = [label for _, label in by_distance[:k]]

    # and let them vote
    return majority_vote(k_nearest_labels)
```
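Putting the pieces together on a tiny hand-made dataset (the two clusters below are hypothetical, not the cities data) shows the classifier behaving as expected:

```python
import math
from collections import Counter

def distance(v, w):
    """Euclidean distance between coordinate lists."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def majority_vote(labels):
    """assumes labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([c for c in vote_counts.values() if c == winner_count])
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])

def knn_classify(k, labeled_points, new_point):
    by_distance = sorted(labeled_points,
                         key=lambda point_label: distance(point_label[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return majority_vote(k_nearest_labels)

# hypothetical mini-dataset: two well-separated clusters
points = [([0, 0], 'Python'), ([1, 0], 'Python'), ([0, 1], 'Python'),
          ([10, 10], 'Java'), ([11, 10], 'Java'), ([10, 11], 'Java')]

print(knn_classify(3, points, [1, 1]))      # near the first cluster  → Python
print(knn_classify(3, points, [10, 10.5]))  # near the second cluster → Java
```

With clusters this far apart any odd k up to 5 gives the same answers; the interesting cases, as the evaluation below shows, are points near the boundary between regions.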

```
for k in [1, 3, 5, 7]:
    num_correct = 0

    for location, actual_language in cities:

        other_cities = [other_city
                        for other_city in cities
                        if other_city != (location, actual_language)]

        predicted_language = knn_classify(k, other_cities, location)

        if predicted_language == actual_language:
            num_correct += 1

    print(k, "neighbor[s]:", num_correct, "correct out of", len(cities))
```

```
1 neighbor[s]: 40 correct out of 75
3 neighbor[s]: 44 correct out of 75
5 neighbor[s]: 41 correct out of 75
7 neighbor[s]: 35 correct out of 75
```

# References

Joel Grus, *Data Science from Scratch*, Chapter 12
