Summarized from http://spark.apache.org/docs/...

Hyper-parameter Tuning

Spark MLlib provides two model selection tools for hyper-parameter tuning: CrossValidator (k-fold cross-validation) and TrainValidationSplit (a single train/validation split).

Example: using CrossValidator with a Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# the whole Pipeline (tokenizer -> hashingTF -> lr) is the Estimator being tuned

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
# 3 choices of numFeatures x 2 choices of regParam = 6 combinations in total
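A quick check (not in the original docs): build() returns a plain Python list of param maps, one per combination.

assert len(paramGrid) == 6  # 3 numFeatures values x 2 regParam values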

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
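As in the Spark docs, the fitted CrossValidatorModel can then be applied to held-out data; `test` below is an assumed DataFrame with the same "text" column as `training`:

# cvModel uses the best PipelineModel found during the search
prediction = cvModel.transform(test)
prediction.select("text", "probability", "prediction").show()

bestModel = cvModel.bestModel  # the winning PipelineModel itself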

CrossValidator splits the data into numFolds disjoint folds (numFolds=2 here). Each fold serves once as the test set while the remaining folds form the training set, and every parameter combination is scored by its metric averaged over the folds.
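This also shows the cost of the search: numFolds x 6 combinations = 12 models are trained before the best setting is refit on the full dataset. The per-combination averages are exposed on the fitted model via its avgMetrics attribute (a short sketch):

# avgMetrics[i] is the evaluator metric (areaUnderROC by default) for
# paramGrid[i], averaged over the numFolds train/test splits
for pm, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in pm.items()}, metric)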

Example: using TrainValidationSplit

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# the Spark docs use a regression task here: LinearRegression with its own grid
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation
                           trainRatio=0.8)

Unlike CrossValidator, TrainValidationSplit evaluates each parameter combination only once, on a single train/validation split, so it is cheaper but less reliable when the dataset is small.
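Fitting and using it mirrors CrossValidator; following the Spark docs, `train` and `test` are assumed DataFrames for the regression task:

tvsModel = tvs.fit(train)
# apply the model that performed best on the validation split to test data
tvsModel.transform(test).select("features", "label", "prediction").show()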
