Summarized from http://spark.apache.org/docs/...

Hyper-parameter Tuning

Spark MLlib provides two model selection tools for hyper-parameter tuning: CrossValidator (k-fold cross-validation) and TrainValidationSplit (a single train/validation split).

Example: using CrossValidator with a Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# the whole Pipeline (tokenizer -> hashingTF -> lr) is the Estimator being tuned

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
# 3 choices of numFeatures x 2 choices of regParam = 6 combinations in total
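A quick check (not in the original docs): build() returns a plain Python list of param maps, one per combination.

assert len(paramGrid) == 6  # 3 numFeatures values x 2 regParam values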

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
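As in the Spark docs, the fitted CrossValidatorModel can then be applied to held-out data; `test` below is an assumed DataFrame with the same "text" column as `training`:

# cvModel uses the best PipelineModel found during the search
prediction = cvModel.transform(test)
prediction.select("text", "probability", "prediction").show()

bestModel = cvModel.bestModel  # the winning PipelineModel itself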

CrossValidator splits the data into numFolds disjoint folds (numFolds=2 here). Each fold serves once as the test set while the remaining folds form the training set, and every parameter combination is scored by its metric averaged over the folds.
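This also shows the cost of the search: numFolds x 6 combinations = 12 models are trained before the best setting is refit on the full dataset. The per-combination averages are exposed on the fitted model via its avgMetrics attribute (a short sketch):

# avgMetrics[i] is the evaluator metric (areaUnderROC by default) for
# paramGrid[i], averaged over the numFolds train/test splits
for pm, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in pm.items()}, metric)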

Example: using TrainValidationSplit

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# the Spark docs use a regression task here: LinearRegression with its own grid
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation
                           trainRatio=0.8)

Unlike CrossValidator, TrainValidationSplit evaluates each parameter combination only once, on a single train/validation split, so it is cheaper but less reliable when the dataset is small.
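Fitting and using it mirrors CrossValidator; following the Spark docs, `train` and `test` are assumed DataFrames for the regression task:

tvsModel = tvs.fit(train)
# apply the model that performed best on the validation split to test data
tvsModel.transform(test).select("features", "label", "prediction").show()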
