Simple Example
This simple example is a streaming word count over text received from a network socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create a streaming DataFrame of input lines from a connection to localhost:9999
lines = spark.readStream.format("socket")\
    .option("host", "localhost")\
    .option("port", 9999)\
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate the running word count
wordCounts = words.groupBy("word").count()
Then, start a write stream to output the running counts:
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
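To try it out, run a Netcat server in one terminal with nc -lk 9999, start the script in another (for example via spark-submit), and type lines into the Netcat session; each line you type becomes a new row in the streaming DataFrame.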
Every data item that arrives on the stream is treated as a new row appended to the Input Table. For example, sending the input
w1 w1 w2
w2 w1
to the program above yields the running counts w1: 3 and w2: 2.
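In "complete" output mode the console sink reprints the entire result table after each trigger. If the two lines above arrive in separate micro-batches, the output after the second line would look roughly like this (exact formatting varies by Spark version):

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|  w1|    3|
|  w2|    2|
+----+-----+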
Finally, note that the Result Table is materialized, while the unbounded Input Table never actually exists: each data item is discarded after it has been used to update the result.
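Because wordCounts is a streaming aggregation, the query also supports the "update" output mode, which emits only the rows whose counts changed in the latest trigger rather than reprinting the whole table. A minimal sketch of that variant:

# "update" mode: emit only rows updated since the last trigger
query = wordCounts.writeStream\
    .outputMode("update")\
    .format("console")\
    .start()
query.awaitTermination()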
DataFrame API
...