# Squared Euclidean Distance and Sum of Squared Errors in Spark KMeans, with Source Code Analysis

1. Euclidean distance
d(x,y) = √( (x[1]-y[1])^2 + (x[2]-y[2])^2 + … + (x[n]-y[n])^2 )
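2. Squared Euclidean distance
By default, Spark KMeans uses the squared Euclidean distance as its distance formula. The squared Euclidean distance is simply the Euclidean distance squared, that is, the same sum without the final square root:
d(x,y) = (x[1]-y[1])^2 + (x[2]-y[2])^2 + … + (x[n]-y[n])^2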
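
The two formulas can be tried out in a few lines of plain Scala. This is only a minimal sketch for illustration: `DistanceSketch`, `squaredDistance`, and `distance` are hypothetical names introduced here, not Spark APIs, and vectors are represented as plain `Array[Double]`.

```scala
object DistanceSketch {
  /** Squared Euclidean distance: the sum of squared coordinate differences. */
  def squaredDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < x.length) {
      val diff = x(i) - y(i)
      sum += diff * diff
      i += 1
    }
    sum
  }

  /** Euclidean distance: the square root of the squared distance. */
  def distance(x: Array[Double], y: Array[Double]): Double =
    math.sqrt(squaredDistance(x, y))

  def main(args: Array[String]): Unit = {
    val x = Array(0.0, 0.0)
    val y = Array(3.0, 4.0)
    println(squaredDistance(x, y)) // 25.0
    println(distance(x, y))        // 5.0
  }
}
```

Because the square root is monotonically increasing, the center that minimizes the squared distance also minimizes the Euclidean distance, so dropping the root does not change which center is nearest.
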
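3. Sum of squared errors (SSE)
Spark KMeans uses the sum of squared errors as its evaluation metric: the sum, over all points, of the squared distance from each point to its nearest cluster center.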
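
Written as a formula, the SSE can be sketched as follows. The symbols are assumptions introduced here for illustration: x_1, …, x_m are the data points, c_1, …, c_k are the cluster centers, and n is the dimension of the vectors.

$$
\mathrm{SSE} = \sum_{i=1}^{m} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2,
\qquad
\lVert x - c \rVert^2 = \sum_{l=1}^{n} \left( x[l] - c[l] \right)^2
$$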

4. Relevant Spark code

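``````
/**
 * Return the K-means cost (sum of squared distances of points to their nearest center) for this
 * model on the given data.
 */
@Since("0.8.0")
def computeCost(data: RDD[Vector]): Double = {
  // bcCentersWithNorm is the broadcast of the cluster centers with their norms;
  // the line that creates it is not shown in this excerpt.
  val cost = data.map(p =>
    distanceMeasureInstance.pointCost(bcCentersWithNorm.value, new VectorWithNorm(p)))
    .sum() // sum, over all points, of the distance to the nearest cluster center
  bcCentersWithNorm.destroy()
  cost
}
``````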

``````
/**
 * @return the index of the center closest to the given point, as well as the cost.
 */
def findClosest(
    centers: Array[VectorWithNorm],
    point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  var i = 0
  while (i < centers.length) {
    val center = centers(i)
    val currentDistance = distance(center, point) // the squared Euclidean distance is used here
    if (currentDistance < bestDistance) {
      bestDistance = currentDistance
      bestIndex = i
    }
    i += 1
  }
  (bestIndex, bestDistance)
}

/**
 * @return the k-means cost of the given point against the given cluster centers.
 */
def pointCost(
    centers: Array[VectorWithNorm],
    point: VectorWithNorm): Double = {
  findClosest(centers, point)._2
}
``````
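
To see how these pieces fit together without a Spark cluster, the same logic can be mimicked on plain Scala collections. This is a rough sketch under stated assumptions, not Spark's implementation: `SseSketch` and its `squaredDistance` helper are hypothetical, vectors are plain `Array[Double]` instead of `VectorWithNorm`, and a local `Seq` stands in for the `RDD`. The method names `findClosest` and `computeCost` are reused only to mirror the excerpts above.

```scala
object SseSketch {
  /** Squared Euclidean distance between two vectors of equal length (hypothetical helper). */
  def squaredDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  /** Index of the nearest center and the squared distance to it. */
  def findClosest(centers: Array[Array[Double]], point: Array[Double]): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    while (i < centers.length) {
      val currentDistance = squaredDistance(centers(i), point)
      if (currentDistance < bestDistance) {
        bestDistance = currentDistance
        bestIndex = i
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  /** SSE: the sum of each point's squared distance to its nearest center. */
  def computeCost(data: Seq[Array[Double]], centers: Array[Array[Double]]): Double =
    data.map(p => findClosest(centers, p)._2).sum

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    val data = Seq(Array(1.0, 0.0), Array(0.0, 2.0), Array(9.0, 10.0))
    println(computeCost(data, centers)) // 1 + 4 + 1 = 6.0
  }
}
```

In the example, the first two points are nearest to the center at the origin (squared distances 1 and 4) and the third is nearest to the center at (10, 10) (squared distance 1), so the SSE is 6.
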