# [Qixi Special] Let k-means clustering tell you where to find a partner

## The k-means algorithm

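k-means originated as a vector quantization method in signal processing, and today it is more popular as a cluster analysis method in data mining. The goal of k-means clustering is to partition n points (each one an observation or an instance of a sample) into k clusters so that every point belongs to the cluster whose mean (the cluster center) is nearest to it; that nearest-mean rule is the clustering criterion. The problem amounts to partitioning the data space into Voronoi cells.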
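To make the nearest-mean rule concrete, here is a minimal pure-Python sketch of the usual iterative procedure (Lloyd's algorithm). It is not what scikit-learn runs internally; the function names, the random initialization from input points, and the toy coordinates are all assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate an assignment step and a center update until labels are stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers will too
        labels = new_labels
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_point(members)
    return labels, centers

# Two well-separated toy groups (made-up coordinates)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
```

Each final center is the mean of its cluster, and every point is closer to its own center than to the other one, which is exactly the Voronoi-cell picture described above.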

## Getting the data

``````
import pandas as pd

# Load the education data, one row per region
df = pd.read_csv('chnedu.csv')
``````

## Processing the data

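``````
# Share of the population aged 6 and over holding a bachelor's or graduate degree
df['highedu'] = (df['大学本科小计'] + df['研究生小计']) / df['6岁及以上人口合计']
# Male-to-female ratio of the population aged 6 and over
df['gender'] = df['6岁及以上人口男'] / df['6岁及以上人口女']
``````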
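To see what these two columns measure, the same formulas can be tried on a tiny hand-made table. The column names follow the ones used above; the two regions and all of their counts are invented for illustration and do not come from `chnedu.csv`.

```python
import pandas as pd

# Made-up counts for two hypothetical regions; not taken from chnedu.csv
toy = pd.DataFrame({
    '地区': ['A', 'B'],
    '6岁及以上人口合计': [1000, 2000],
    '6岁及以上人口男': [520, 1000],
    '6岁及以上人口女': [480, 1000],
    '大学本科小计': [80, 100],
    '研究生小计': [20, 20],
})

# Same formulas as in the article, applied to the toy table
toy['highedu'] = (toy['大学本科小计'] + toy['研究生小计']) / toy['6岁及以上人口合计']
toy['gender'] = toy['6岁及以上人口男'] / toy['6岁及以上人口女']

print(toy[['地区', 'highedu', 'gender']])
```

A `gender` value above 1 means more men than women in that region, and `highedu` is a fraction between 0 and 1.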

## Scatter plot

``````
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 7))
ax1 = fig.add_subplot(111)
ax1.scatter(df['gender'], df['highedu'], c='r', marker='o')
plt.show()
``````

``````
# Label each point with its region name (run this before plt.show())
for i, txt in enumerate(df['地区']):
    ax1.annotate(txt, (df['gender'][i], df['highedu'][i]))
``````

``````
# Redefine gender as the male-to-female ratio among bachelor's and graduate degree holders
df['gender'] = (df['大学本科男'] + df['研究生男']) / (df['大学本科女'] + df['研究生女'])
``````
## k-means clustering

``````
from sklearn.cluster import KMeans

# Group the regions into 4 clusters, using the two ratios as features
kmeans = KMeans(n_clusters=4)
X_clustered = kmeans.fit_predict(df[['gender', 'highedu']])

# Map each cluster index to a matplotlib color
LABEL_COLOR_MAP = {0: 'r', 1: 'g', 2: 'b', 3: 'm'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

ax1.scatter(df['gender'], df['highedu'], c=label_color, marker='o')
``````
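After fitting, the `KMeans` object also exposes the learned centers through `cluster_centers_`. The sketch below shows this on six made-up `(gender, highedu)` pairs, not on the real data. Because k-means starts from randomly chosen centers, the numbering of the clusters (and so the colors above) can change between runs; passing `random_state` is one way to pin that down. The values `n_init=10` and `random_state=0` are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (gender, highedu) pairs forming two obvious groups
X = np.array([[1.00, 0.05], [1.02, 0.06], [0.98, 0.04],
              [1.30, 0.30], [1.32, 0.31], [1.28, 0.29]])

# Fix the seed so the cluster numbering is reproducible across runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index for each row
print(kmeans.cluster_centers_)  # one (gender, highedu) center per cluster
```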

## Polishing the look

``````
import seaborn as sns

sns.set()  # apply seaborn's default theme to matplotlib figures
``````

``````
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

sns.set()

# Load the data and derive the two features
df = pd.read_csv('chnedu.csv')
df['highedu'] = (df['大学本科小计'] + df['研究生小计']) / df['6岁及以上人口合计']
df['gender'] = (df['大学本科男'] + df['研究生男']) / (df['大学本科女'] + df['研究生女'])

# Use a font with Chinese glyphs so the region names render
plt.rcParams['font.sans-serif'] = ['SimHei']

fig = plt.figure(figsize=(9, 7))
ax1 = fig.add_subplot(111)

# Cluster the regions into 5 groups
kmeans = KMeans(n_clusters=5)
X_clustered = kmeans.fit_predict(df[['gender', 'highedu']])

# Map each cluster index to a matplotlib color
LABEL_COLOR_MAP = {0: 'r', 1: 'g', 2: 'b', 3: 'm', 4: 'k'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

ax1.scatter(df['gender'], df['highedu'], c=label_color, marker='o')

# Label each point with its region name
for i, txt in enumerate(df['地区']):
    ax1.annotate(txt, (df['gender'][i], df['highedu'][i]))

plt.show()
``````

## Making it interactive

``````
import pandas as pd
from sklearn.cluster import KMeans
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

# Enable plotly rendering inside a Jupyter notebook
init_notebook_mode(connected=True)

# Load the data and derive the two features
df = pd.read_csv('chnedu.csv')
df['highedu'] = (df['大学本科小计'] + df['研究生小计']) / df['6岁及以上人口合计']
df['gender'] = (df['大学本科男'] + df['研究生男']) / (df['大学本科女'] + df['研究生女'])

# Set up k-means with 4 clusters
kmeans = KMeans(n_clusters=4)
# Compute cluster centers and predict the cluster index of each region
df['cluster'] = kmeans.fit_predict(df[['gender', 'highedu']])

# Build one scatter trace per cluster so each gets its own color
data = []
for i in range(4):
    subset = df.loc[df['cluster'] == i]
    trace = go.Scatter(
        x=subset['gender'],
        y=subset['highedu'],
        text=subset['地区'],  # region name shown on hover
        mode='markers'
    )
    data.append(trace)

iplot(data, filename='basic-scatter')
``````
