Spark's gradient descent is implemented in the runMiniBatchSGD method of GradientDescent.

1. Step one: sample the data and compute the gradient

Sampling is done with the RDD.sample method; the gradient is then computed over that sample.

// sample draws a subset of the data
// the gradient is aggregated over that subset
// intro to RDD.aggregate: https://www.jianshu.com/p/15739e95a46e
// difference between aggregate and treeAggregate: https://www.cnblogs.com/drawwindows/p/5762392.html
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
          seqOp = (c, v) => {
            // c: (grad, loss, count), v: (label, features)
            // compute accumulates this example's gradient into c._1 in place
            // and returns the example's loss
            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
            (c._1, c._2 + l, c._3 + 1)
          },
          combOp = (c1, c2) => {
            // c: (grad, loss, count)
            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
          })
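To make the seqOp/combOp roles concrete, here is a minimal, self-contained sketch (toy data and a hypothetical object name, not the MLlib code) that computes a mean with the same accumulate-per-partition-then-merge pattern; treeAggregate merges partition results in a tree rather than all at once on the driver:

import org.apache.spark.{SparkConf, SparkContext}

object TreeAggregateDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0))

    // The accumulator is (sum, count); seqOp folds one record into a
    // partition's accumulator, combOp merges accumulators from partitions.
    val (sum, count) = data.treeAggregate((0.0, 0L))(
      seqOp = (c, v) => (c._1 + v, c._2 + 1),
      combOp = (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2))

    println(s"mean = ${sum / count}")  // mean = 4.5
    sc.stop()
  }
}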

MLlib ships several gradient computation classes. Linear regression, for example, uses LeastSquaresGradient; the various classes are covered in section A below.

2. Step two: update the weights

        val update = updater.compute(
          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
          stepSize, i, regParam)
        weights = update._1
        regVal = update._2

Different models plug in different Updater implementations for the weight update; see section B below.
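For SimpleUpdater (no regularization), the update reduces to w := w - (stepSize / sqrt(iter)) * gradientSum / miniBatchSize. A minimal Breeze sketch of just that arithmetic, with made-up values:

import breeze.linalg.DenseVector

object UpdateStepSketch {
  def main(args: Array[String]): Unit = {
    val weights = DenseVector(0.5, -0.2)
    val gradientSum = DenseVector(4.0, 2.0)   // summed over the mini-batch
    val miniBatchSize = 4.0
    val stepSize = 1.0
    val iter = 4                              // current iteration, 1-based

    // Average the gradient over the batch, then take a decayed step.
    val avgGradient = gradientSum / miniBatchSize
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val newWeights = weights - avgGradient * thisIterStepSize

    println(newWeights)  // DenseVector(0.0, -0.45)
  }
}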

A. Gradient computation classes

All concrete gradient computation classes extend Gradient.

LeastSquaresGradient implements the gradient computation for least-squares linear regression.

  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    // summary of BLAS in Spark: https://blog.csdn.net/sunbow0/article/details/45505227
    // dot product: the prediction f_w(x) = w.x
    val diff = dot(data, weights) - label
    val loss = diff * diff / 2.0
    val gradient = data.copy
    // scale a vector by a constant in place: x = a * x
    scal(diff, gradient) // gradient = diff * x = (f_w(x) - label) * x
    (gradient, loss)
  }
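In other words: with prediction f_w(x) = w.x, the loss is (f_w(x) - y)^2 / 2 and the gradient is (f_w(x) - y) * x. A dependency-free sketch of the same arithmetic on toy data (hypothetical object name, not the MLlib API):

object LeastSquaresGradientSketch {
  // Plain-Scala version of the same math:
  // loss = (w.x - y)^2 / 2, gradient = (w.x - y) * x
  def compute(data: Array[Double], label: Double,
              weights: Array[Double]): (Array[Double], Double) = {
    val diff = data.zip(weights).map { case (x, w) => x * w }.sum - label
    val loss = diff * diff / 2.0
    val gradient = data.map(_ * diff)
    (gradient, loss)
  }

  def main(args: Array[String]): Unit = {
    // x = (1, 2), w = (0.5, 0.5), y = 2  =>  w.x = 1.5, diff = -0.5
    val (g, l) = compute(Array(1.0, 2.0), 2.0, Array(0.5, 0.5))
    println(g.mkString(", "))  // -0.5, -1.0
    println(l)                 // 0.125
  }
}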

HingeGradient maximizes the classification margin; it is used, for example, in binary SVM.

 override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
   val dotProduct = dot(data, weights)
   // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
   // Therefore the gradient is -(2y - 1)*x
   val labelScaled = 2 * label - 1.0
   if (1.0 > labelScaled * dotProduct) {
     val gradient = data.copy
     scal(-labelScaled, gradient)
     (gradient, 1.0 - labelScaled * dotProduct)
   } else {
     (Vectors.sparse(weights.size, Array.empty, Array.empty), 0.0)
   }
 } 
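A dependency-free sketch of the same hinge logic (hypothetical object name, toy data): with the {0, 1} label rescaled to y' = 2*label - 1, a point inside the margin contributes gradient -y' * x, while a point classified correctly with margin contributes a zero gradient:

object HingeGradientSketch {
  // Plain-Scala version: loss = max(0, 1 - y' * (w.x)) with y' = 2*label - 1;
  // gradient = -y' * x inside the margin, the zero vector otherwise.
  def compute(data: Array[Double], label: Double,
              weights: Array[Double]): (Array[Double], Double) = {
    val dotProduct = data.zip(weights).map { case (x, w) => x * w }.sum
    val labelScaled = 2 * label - 1.0
    if (1.0 > labelScaled * dotProduct)
      (data.map(_ * -labelScaled), 1.0 - labelScaled * dotProduct)
    else
      (Array.fill(weights.length)(0.0), 0.0)
  }

  def main(args: Array[String]): Unit = {
    // label = 1 (y' = +1), w.x = 0.5 < 1: inside the margin, gradient = -x
    println(compute(Array(1.0, 1.0), 1.0, Array(0.25, 0.25))._1.mkString(", "))  // -1.0, -1.0
    // label = 1, w.x = 2 >= 1: classified with margin, zero gradient
    println(compute(Array(2.0, 2.0), 1.0, Array(0.5, 0.5))._1.mkString(", "))    // 0.0, 0.0
  }
}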

B. Parameter updates

class SimpleUpdater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    // thisIterStepSize is the learning rate for this iteration,
    // decayed as stepSize / sqrt(iter)
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector
    // axpy computes y += a * x, where x and y are same-length vectors and a is a scalar
    brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights)

    (Vectors.fromBreeze(brzWeights), 0)
  }
}
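A small usage sketch of SimpleUpdater, with made-up vectors and constants, showing the 1/sqrt(iter) decay; regParam is ignored by this updater, so the returned regVal is 0:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SimpleUpdater

object SimpleUpdaterDemo {
  def main(args: Array[String]): Unit = {
    val updater = new SimpleUpdater()
    val weights = Vectors.dense(1.0, 1.0)
    val gradient = Vectors.dense(0.4, -0.2)

    // Same gradient at iterations 1 and 4: the effective step size is
    // stepSize / sqrt(iter), so the second update moves half as far.
    val (w1, regVal1) = updater.compute(weights, gradient, 1.0, 1, 0.0)
    val (w4, regVal4) = updater.compute(weights, gradient, 1.0, 4, 0.0)

    println(w1)       // [0.6,1.2]
    println(w4)       // [0.8,1.1]
    println(regVal1)  // 0.0 -- SimpleUpdater applies no regularization
  }
}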
