Simple Example

This simple example is a streaming word count: it counts words in text received over a TCP socket.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create DataFrame with input lines from connection to localhost:9999
lines = spark.readStream.format("socket")\
    .option("host", "localhost").option("port", 9999).load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

Then start a write stream to output the running counts:

# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()
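To try the program locally, something must be listening on localhost:9999 and sending lines of text. The Spark documentation uses `nc -lk 9999` for this; as a sketch, the same role can be played by a tiny Python server (this helper and its names are my own illustration, not part of the Spark API):

```python
# A stand-in for `nc -lk 9999`: a one-shot line server that feeds text
# into the socket source used by the streaming program above.
import socket

def serve_lines(lines, host="localhost", port=9999):
    """Accept one connection and send each line, newline-terminated."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))

if __name__ == "__main__":
    serve_lines(["w1 w1 w2", "w2 w1"])
```

Run this (or `nc`) first, then submit the streaming program, which connects as a client and reads the lines.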

Every data item arriving on the stream is treated as a new row appended to the Input Table. For example, sending the input

w1 w1 w2
w2 w1

to the above program eventually yields the counts w1: 3, w2: 2.
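As a minimal sketch of this behavior, here is the same running count in plain Python (no Spark): each arriving micro-batch of lines updates a counts table, and in "complete" output mode the whole table is re-emitted each time. The helper name `update` is illustrative, not a Spark API:

```python
# Plain-Python sketch of a running word count in "complete" output mode:
# each micro-batch folds into the running counts, which are then emitted.
from collections import Counter

def update(counts, batch_lines):
    """Fold one micro-batch of input lines into the running counts."""
    for line in batch_lines:
        counts.update(line.split(" "))
    return counts

counts = Counter()
update(counts, ["w1 w1 w2"])   # after batch 1: w1: 2, w2: 1
update(counts, ["w2 w1"])      # after batch 2: w1: 3, w2: 2
print(dict(counts))            # → {'w1': 3, 'w2': 2}
```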

Note that the Result Table is materialized, but the unbounded Input Table does not actually exist: each data item is discarded after it has been used to update the result.

DataFrame API

...

