Simple Example

This simple example is a streaming word count: it counts words in text received over a TCP socket.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create DataFrame with input lines from connection to localhost:9999
lines = spark.readStream.format("socket")\
    .option("host", "localhost").option("port", 9999).load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

Then start a write stream to output the running counts:

# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()
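To try the program locally, something must be listening on localhost:9999 and sending lines of text. The Spark documentation uses `nc -lk 9999` for this; as a sketch, the same role can be played by a tiny Python server (this helper and its names are my own illustration, not part of the Spark API):

```python
# A stand-in for `nc -lk 9999`: a one-shot line server that feeds text
# into the socket source used by the streaming program above.
import socket

def serve_lines(lines, host="localhost", port=9999):
    """Accept one connection and send each line, newline-terminated."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))

if __name__ == "__main__":
    serve_lines(["w1 w1 w2", "w2 w1"])
```

Run this (or `nc`) first, then submit the streaming program, which connects as a client and reads the lines.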

Every data item arriving on the stream is treated as a new row appended to the Input Table. For example, sending the input

w1 w1 w2
w2 w1

to the above program eventually yields the counts w1: 3, w2: 2.
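As a minimal sketch of this behavior, here is the same running count in plain Python (no Spark): each arriving micro-batch of lines updates a counts table, and in "complete" output mode the whole table is re-emitted each time. The helper name `update` is illustrative, not a Spark API:

```python
# Plain-Python sketch of a running word count in "complete" output mode:
# each micro-batch folds into the running counts, which are then emitted.
from collections import Counter

def update(counts, batch_lines):
    """Fold one micro-batch of input lines into the running counts."""
    for line in batch_lines:
        counts.update(line.split(" "))
    return counts

counts = Counter()
update(counts, ["w1 w1 w2"])   # after batch 1: w1: 2, w2: 1
update(counts, ["w2 w1"])      # after batch 2: w1: 3, w2: 2
print(dict(counts))            # → {'w1': 3, 'w2': 2}
```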

Note that the Result Table is materialized, but the unbounded Input Table does not actually exist: each data item is discarded after it has been used to update the result.

DataFrame API

...

