In PySpark二:常见数据格式及如何读写 (Part 2: common data formats and how to read and write them), we covered how to write data out to files; Spark can also write to tables.
Tables come in two kinds. For a Spark managed table, both the metadata (the schema definition and so on) and the data are maintained by Spark, so dropping the table deletes both. For an external table, Spark maintains only the metadata while the data is maintained by the user; dropping the table removes just the metadata and leaves the data in place.
If the data is shared with other systems, an external table is the recommended choice.
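To make the difference concrete, here is a minimal sketch (the table names demo_managed_tbl / demo_external_tbl and the /tmp path are made up for illustration): saving without a path gives a managed table, adding a path option gives an external one, and DROP TABLE deletes the data only in the managed case.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TableKinds").getOrCreate()
df = spark.range(5).toDF("id")

# Managed table: Spark owns both the metadata and the data files
# (kept under spark.sql.warehouse.dir, ./spark-warehouse by default).
df.write.mode("overwrite").saveAsTable("demo_managed_tbl")

# External table: the "path" option points Spark at a user-controlled location,
# so Spark only tracks the metadata.
df.write.mode("overwrite").option("path", "/tmp/demo_external_data/").saveAsTable("demo_external_tbl")

# Dropping the managed table deletes metadata and data; dropping the external
# table deletes only the metadata, the files in /tmp/demo_external_data/ remain.
spark.sql("DROP TABLE demo_managed_tbl")
spark.sql("DROP TABLE demo_external_tbl")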
In PySpark四: 周五定投还是周一定投?是个问题 (Part 4: invest on Fridays or on Mondays? That is the question), we computed the returns of Monday versus Friday fixed investments from the available data. Below we flesh out that code and use Spark tables to persist the results.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofweek, lit, sum
from pyspark.sql import DataFrame

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("FixedInvestment") \
        .getOrCreate()

    # Load the Parquet files from the fund/parquet/ folder
    parquetDF = spark.read \
        .format("parquet") \
        .load("data/fund/parquet/*.parquet")
    parquetDF.show(5)
    print(parquetDF.schema.json())

    # Get distinct fund IDs
    fund_ids = parquetDF.select("fundid").distinct().collect()

    result_list = []
    for fund in fund_ids:
        fundid = fund["fundid"]
        print(f"Processing fundid: {fundid}")

        # Filter data for the current fund
        fundDF = parquetDF.filter(col("fundid") == fundid)

        # Logic for Monday investment (dayofweek: 1 = Sunday, 2 = Monday, ..., 6 = Friday)
        mondayDF = fundDF.filter(dayofweek(col("Date")) == 2)
        mondayDF = mondayDF.withColumn("investment", lit(100))
        mondayDF = mondayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # Apply 1.5% fee
        # Latest accumulative price, used to value every purchase at the end of the period
        finalPrice = fundDF.orderBy(col("Date").desc()).select("accumulativePrice").first()["accumulativePrice"]
        mondayDF = mondayDF.withColumn("earn", col("unitsPurchased") * (lit(finalPrice) - col("accumulativePrice")))
        mondayResultDF = mondayDF.groupBy().agg(
            sum("investment").alias("totalInvestment"),
            sum("unitsPurchased").alias("totalUnits"),
            sum("earn").alias("totalEarn")
        )
        mondayResultDF = mondayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
        mondayResultDF = mondayResultDF.withColumn("fundid", lit(fundid))
        mondayResultDF = mondayResultDF.withColumn("day", lit("Monday"))
        result_list.append(mondayResultDF)
        print(f"Monday Results for fundid: {fundid}")
        mondayResultDF.show()

        # Logic for Friday investment
        fridayDF = fundDF.filter(dayofweek(col("Date")) == 6)
        fridayDF = fridayDF.withColumn("investment", lit(100))
        fridayDF = fridayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # Apply 1.5% fee
        fridayDF = fridayDF.withColumn("earn", col("unitsPurchased") * (lit(finalPrice) - col("accumulativePrice")))
        fridayResultDF = fridayDF.groupBy().agg(
            sum("investment").alias("totalInvestment"),
            sum("unitsPurchased").alias("totalUnits"),
            sum("earn").alias("totalEarn")
        )
        fridayResultDF = fridayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
        fridayResultDF = fridayResultDF.withColumn("fundid", lit(fundid))
        fridayResultDF = fridayResultDF.withColumn("day", lit("Friday"))
        result_list.append(fridayResultDF)
        print(f"Friday Results for fundid: {fundid}")
        fridayResultDF.show()

    # Combine all results into a single DataFrame
    resultDF = result_list[0]
    for df in result_list[1:]:
        resultDF = resultDF.union(df)
    print("Final Result DataFrame:")
    resultDF.show()

    # Sink data to a Spark external table: the "path" option is what makes it external
    # sample for sinking to S3:
    # .option("path", "s3a://your-bucket-name/external_result/") \
    # sample for sinking to an Azure blob container:
    # .option("path", "wasbs://your-container-name@your-storage-account-name.blob.core.windows.net/external_result/") \
    resultDF.write \
        .mode("overwrite") \
        .option("path", "/external_result/") \
        .saveAsTable("ext_result_data_tbl")

    # Load data from the external Spark table
    tableDF = spark.sql("SELECT * FROM ext_result_data_tbl")
    print("Data loaded from external Spark table:")
    tableDF.show(5)
    print("External table schema:" + tableDF.schema.json())

    # Sink data to a Spark managed table inside a dedicated database
    spark.sql("CREATE DATABASE IF NOT EXISTS AIRLINE_DB")
    spark.catalog.setCurrentDatabase("AIRLINE_DB")
    resultDF.write \
        .mode("overwrite") \
        .saveAsTable("result_data_tbl")

    # Load data from the Spark managed table
    spark.sql("USE AIRLINE_DB")
    tableDF = spark.sql("SELECT * FROM result_data_tbl")
    print("Data loaded from Spark table:")
    tableDF.show(5)
    print("table schema:" + tableDF.schema.json())

    spark.stop()
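Before the spark.stop() call, you could also ask the catalog which kind of table each write produced; a quick sketch using the table and database names from the code above:

# tableType is EXTERNAL for ext_result_data_tbl and MANAGED for result_data_tbl
for t in spark.catalog.listTables("default"):
    print(t.name, t.tableType)
for t in spark.catalog.listTables("AIRLINE_DB"):
    print(t.name, t.tableType)

# DESCRIBE TABLE EXTENDED also reports the Type and the Location (data directory) of each table
spark.sql("DESCRIBE TABLE EXTENDED default.ext_result_data_tbl").show(50, truncate=False)
spark.sql("DESCRIBE TABLE EXTENDED AIRLINE_DB.result_data_tbl").show(50, truncate=False)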
Spark managed table directory (under spark-warehouse by default):
External table directory: this can be a local directory, or a remote location such as AWS S3 or an Azure storage container.
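Because the external table's data lives at the user-supplied /external_result/ path, dropping the table leaves those files behind, and they can still be read directly as parquet (the default format saveAsTable writes when no other format is specified). A small sketch, again to be run before spark.stop():

# Dropping the external table removes only its metadata from the catalog ...
spark.sql("DROP TABLE IF EXISTS ext_result_data_tbl")

# ... the data files under /external_result/ are untouched and can still be read as plain parquet
remainingDF = spark.read.parquet("/external_result/")
remainingDF.show(5)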
Code: https://gitee.com/yanghang1977/pyspark/blob/master/FixedInves...