Understanding the Spark API Examples

Background: when learning Spark, reading the official documentation is a must. This post summarizes my understanding of the examples at http://spark.apache.org/examp...

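The Spark API examples cover the following:

RDD API: the two kinds of operations on data, transformations and actions

DataFrame API: converting an RDD into a DataFrame, or reading a database table into a DataFrame, and then applying relational operations

Machine learning API: training and prediction with logistic regression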

RDD example: count how many times each space-separated word occurs, and save the result to a file:

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");
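
To see what the flatMap, mapToPair and reduceByKey chain computes without a cluster, here is a minimal local sketch using only java.util.stream. It is a single-JVM analogue, not Spark code; the class name LocalWordCount and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local analogue of the Spark word count above (no Spark involved).
class LocalWordCount {

    // Splits every line on single spaces and sums a count of 1 per word.
    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // one stream of words per line, flattened (plays the role of flatMap)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toMap(
                        word -> word,     // key, like the first field of the Tuple2
                        word -> 1,        // initial count of 1, like mapToPair
                        Integer::sum,     // merge counts, like reduceByKey((a, b) -> a + b)
                        TreeMap::new));   // sorted map, only to make the output readable
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(countWords(lines)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

As in the Spark version, splitting on a single space means consecutive spaces produce empty "words" that get counted too.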

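RDD example: throw NUM_SAMPLES sample points with random x and y in [0, 1), and count the fraction that land inside the unit circle. That fraction approximates the area of a quarter circle, π/4, so multiplying it by 4 gives an estimate of π: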

List<Integer> l = new ArrayList<>(NUM_SAMPLES);
for (int i = 0; i < NUM_SAMPLES; i++) {
  l.add(i);
}

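// parallelize creates a parallel (distributed) collection; its arguments are
// the data and, optionally, the number of partitions (omitted here)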
long count = sc.parallelize(l).filter(i -> {
  double x = Math.random();
  double y = Math.random();
  return x*x + y*y < 1;
}).count();
System.out.println("Pi is roughly " + 4.0 * count / NUM_SAMPLES);
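
The same estimate can be tried locally before running it on a cluster. The sketch below is plain Java, not Spark; the class name LocalPiEstimate and the fixed seed are assumptions, added so that repeated runs give the same result.

```java
import java.util.Random;

// Hypothetical single-JVM version of the Monte Carlo estimate above.
class LocalPiEstimate {

    // Draws numSamples points in the unit square and returns 4 * (fraction inside the unit circle).
    static double estimatePi(int numSamples, long seed) {
        Random random = new Random(seed); // seeded for reproducibility
        long inside = 0;
        for (int i = 0; i < numSamples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y < 1) {
                inside++; // same test as the filter in the Spark example
            }
        }
        return 4.0 * inside / numSamples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

The error of this kind of estimate shrinks only with the square root of the sample count, which is why the Spark example spreads a large NUM_SAMPLES across partitions.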

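DataFrame example: in Spark, a DataFrame is a distributed collection organized into named columns, on which various relational operations can be performed.
The example below reads a log file as an RDD, wraps each line in a Row so that it forms a single column named "line", converts the RDD into a DataFrame using a StructType schema, and then filters the rows whose line column contains ERROR and counts them.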

// Creates a DataFrame having a single column named "line"
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
List<StructField> fields = Arrays.asList(
  DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);

DataFrame errors = df.filter(col("line").like("%ERROR%"));
// Counts all the errors
errors.count();
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count();
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect();
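
The like("%ERROR%") calls above use SQL LIKE patterns, where % matches any run of characters and _ matches exactly one character. As a rough illustration of that matching rule, here is a small plain-Java sketch. It is not Spark's implementation: the class LocalLikeFilter and its helpers are hypothetical, and escape characters are deliberately not handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical local illustration of SQL LIKE matching (no escape-character support).
class LocalLikeFilter {

    // Translates a LIKE pattern into a regex: % -> .*, _ -> ., anything else literal.
    static boolean like(String value, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // matches() requires the whole value to match, as LIKE does
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(value).matches();
    }

    // Keeps only the lines that match the LIKE pattern.
    static List<String> filterLike(List<String> lines, String pattern) {
        return lines.stream()
                .filter(line -> like(line, pattern))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "INFO started",
                "ERROR MySQL connection refused",
                "ERROR disk full");
        List<String> errors = filterLike(log, "%ERROR%");
        System.out.println(errors.size());                 // prints 2
        System.out.println(filterLike(errors, "%MySQL%")); // prints [ERROR MySQL connection refused]
    }
}
```

For a pattern of the form %text%, the match is simply a case-sensitive substring test, which is all the log example needs.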

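DataFrame example: use sqlContext to read a table from a MySQL database, which returns a DataFrame; group the rows with groupBy("age") to count the number of people of each age; finally, save the result in JSON format.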

// Creates a DataFrame based on a table named "people"
// stored in a MySQL database.
String url =
  "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword";
DataFrame df = sqlContext
  .read()
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .load();

// Prints the schema of this DataFrame.
df.printSchema();

// Counts people by age
DataFrame countsByAge = df.groupBy("age").count();
countsByAge.show();

// Saves countsByAge to S3 in the JSON format.
countsByAge.write().format("json").save("s3a://...");
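
What groupBy("age").count() produces can be pictured with a small in-memory analogue. The sketch below uses java.util.stream rather than Spark; the class name LocalCountByAge and the sample ages are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local analogue of df.groupBy("age").count().
class LocalCountByAge {

    // Groups equal ages together and counts the members of each group.
    static Map<Integer, Long> countByAge(List<Integer> ages) {
        return ages.stream().collect(
                Collectors.groupingBy(
                        age -> age,              // grouping key, like groupBy("age")
                        TreeMap::new,            // sorted map, only for readable output
                        Collectors.counting())); // rows per group, like count()
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(30, 25, 30, 41, 25, 30);
        System.out.println(countByAge(ages)); // prints {25=2, 30=3, 41=1}
    }
}
```

In the Spark version the result is itself a DataFrame with an age column and a count column, which is why it can be shown and written out like any other DataFrame.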

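Machine learning API: MLlib, Spark's machine learning library, provides:

many distributed ML algorithms, covering tasks such as feature extraction, classification, regression, clustering, and recommendation
utilities such as ML Pipelines for building workflows, CrossValidator for tuning parameters, and model persistence for saving and loading models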

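The example below takes an RDD of data and converts it into a DataFrame with label and features columns, feeds it to a logistic regression (LR) model for training, and, once training has finished, predicts the label of each point.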

// Every record of this DataFrame contains the label and
// features represented by a vector.
StructType schema = new StructType(new StructField[]{
  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
  new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
DataFrame df = jsql.createDataFrame(data, schema);

// Set parameters for the algorithm.
// Here, we limit the number of iterations to 10.
LogisticRegression lr = new LogisticRegression().setMaxIter(10);

// Fit the model to the data.
LogisticRegressionModel model = lr.fit(df);

// Inspect the model: get the feature weights.
Vector weights = model.weights();

// Given a dataset, predict each point's label, and show the results.
model.transform(df).show();
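
To make the fit and transform steps less of a black box, here is a toy logistic regression in plain Java. It is only a sketch under stated assumptions: it trains with plain batch gradient descent and no regularization, which is not how Spark's LogisticRegression is implemented, and the class LocalLogisticRegression, the learning rate, and the sample data are all hypothetical.

```java
// Hypothetical toy logistic regression (batch gradient descent, no regularization).
class LocalLogisticRegression {
    double[] weights;   // one weight per feature
    double intercept;

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Estimated probability that x belongs to class 1.
    double probability(double[] x) {
        double z = intercept;
        for (int j = 0; j < x.length; j++) {
            z += weights[j] * x[j];
        }
        return sigmoid(z);
    }

    // Predicted label: 1.0 if the probability is at least 0.5, else 0.0.
    double predict(double[] x) {
        return probability(x) >= 0.5 ? 1.0 : 0.0;
    }

    // Runs maxIter gradient-descent steps on the mean log-loss.
    static LocalLogisticRegression fit(double[][] features, double[] labels,
                                       int maxIter, double learningRate) {
        int n = features.length;
        int d = features[0].length;
        LocalLogisticRegression model = new LocalLogisticRegression();
        model.weights = new double[d];
        for (int iter = 0; iter < maxIter; iter++) {
            double[] gradW = new double[d];
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double error = model.probability(features[i]) - labels[i];
                for (int j = 0; j < d; j++) {
                    gradW[j] += error * features[i][j];
                }
                gradB += error;
            }
            for (int j = 0; j < d; j++) {
                model.weights[j] -= learningRate * gradW[j] / n;
            }
            model.intercept -= learningRate * gradB / n;
        }
        return model;
    }

    public static void main(String[] args) {
        double[][] features = {{-2.0}, {-1.0}, {1.0}, {2.0}};
        double[] labels = {0.0, 0.0, 1.0, 1.0};
        // 10 iterations, echoing setMaxIter(10) in the Spark example
        LocalLogisticRegression model = fit(features, labels, 10, 0.5);
        for (double[] x : features) {
            System.out.println(x[0] + " -> " + model.predict(x));
        }
    }
}
```

Here fit plays the role of lr.fit(df), the weights array corresponds to what the example reads from the model, and calling predict on every row is the local counterpart of model.transform(df).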


时光格
