In PySpark Part 2: Common Data Formats and How to Read and Write Them, we covered how to write data out to files. Spark also supports tables.
Tables come in two kinds. For a Spark managed table, Spark maintains both the metadata (e.g. the schema definition) and the data, and dropping the table deletes both. The other kind is the external table: Spark maintains only the metadata while the data is managed by the user, so dropping the table removes just the metadata and leaves the data untouched.
If the data is shared by multiple systems, an external table is the recommended choice.
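
As a minimal sketch of the difference (df, the table names, and the path below are placeholders, not part of this article's example): on the write side, the only change is whether a path option is supplied.

# Managed table: data lives under Spark's warehouse directory;
# dropping the table deletes both metadata and data.
df.write.saveAsTable("managed_tbl")

# External table: the path points at user-managed storage;
# dropping the table deletes only the metadata.
df.write.option("path", "/user/managed/dir").saveAsTable("external_tbl")

spark.sql("DROP TABLE managed_tbl")   # data files are removed
spark.sql("DROP TABLE external_tbl")  # data files are kept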

In PySpark Part 4: Invest on Friday or on Monday? That Is the Question, we computed the results of Monday versus Friday fixed investment from the available data. Below we complete that code and use Spark tables to persist the results.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofweek, lit, sum

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("FixedInvestment") \
        .getOrCreate()

    # Load the Parquet files from the fund/parquet/ folder
    parquetDF = spark.read \
        .format("parquet") \
        .load("data/fund/parquet/*.parquet")

    parquetDF.show(5)
    print(parquetDF.schema.json())

    # Get distinct fund IDs
    fund_ids = parquetDF.select("fundid").distinct().collect()

    result_list = []

    for fund in fund_ids:
        fundid = fund["fundid"]
        print(f"Processing fundid: {fundid}")

        # Filter data for the current fund
        fundDF = parquetDF.filter(col("fundid") == fundid)

        # Logic for Monday investment
        mondayDF = fundDF.filter(dayofweek(col("Date")) == 2)
        mondayDF = mondayDF.withColumn("investment", lit(100))
        mondayDF = mondayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # Apply 1.5% fee
        # Latest accumulative NAV for this fund; used to value every purchased unit
        finalPrice = fundDF.orderBy(col("Date").desc()).select("accumulativePrice").first()["accumulativePrice"]
        mondayDF = mondayDF.withColumn("earn", col("unitsPurchased")*(lit(finalPrice) - col("accumulativePrice")))
        mondayResultDF = mondayDF.groupBy().agg(
            sum("investment").alias("totalInvestment"),
            sum("unitsPurchased").alias("totalUnits"),
            sum("earn").alias("totalEarn")
        )
        mondayResultDF = mondayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
        mondayResultDF = mondayResultDF.withColumn("fundid", lit(fundid))
        mondayResultDF = mondayResultDF.withColumn("day", lit("Monday"))
        result_list.append(mondayResultDF)
     
        print(f"Monday Results for fundid: {fundid}")
        mondayResultDF.show()

        # Logic for Friday investment
        fridayDF = fundDF.filter(dayofweek(col("Date")) == 6)
        fridayDF = fridayDF.withColumn("investment", lit(100))
        fridayDF = fridayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # Apply 1.5% fee
        fridayDF = fridayDF.withColumn("earn", col("unitsPurchased")*(lit(finalPrice) - col("accumulativePrice")))
        fridayResultDF = fridayDF.groupBy().agg(
            sum("investment").alias("totalInvestment"),
            sum("unitsPurchased").alias("totalUnits"),
            sum("earn").alias("totalEarn")
        )
        
        fridayResultDF = fridayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
        fridayResultDF = fridayResultDF.withColumn("fundid", lit(fundid))
        fridayResultDF = fridayResultDF.withColumn("day", lit("Friday"))
        result_list.append(fridayResultDF)
      
        print(f"Friday Results for fundid: {fundid}")
        fridayResultDF.show()

    # Combine all results into a single DataFrame
    resultDF = result_list[0]
    for df in result_list[1:]:
        resultDF = resultDF.union(df)

    print("Final Result DataFrame:")
    resultDF.show()

    # Sink data to a Spark external table
    # Example for sinking to S3:
    # .option("path", "s3a://your-bucket-name/external_result/") \
    # Example for sinking to an Azure blob container:
    # .option("path", "wasbs://your-container-name@your-storage-account-name.blob.core.windows.net/external_result/") \
    resultDF.write \
        .mode("overwrite") \
        .option("path", "/external_result/") \
        .saveAsTable("ext_result_data_tbl")

    # Load data from the external Spark table
    tableDF = spark.sql("SELECT * FROM ext_result_data_tbl")
    print("Data loaded from external Spark table:")
    tableDF.show(5)

    print("External table schema:" + tableDF.schema.json())

    # Create (if needed) and switch to a database for the managed table
    spark.sql("CREATE DATABASE IF NOT EXISTS AIRLINE_DB")
    spark.catalog.setCurrentDatabase("AIRLINE_DB")

    # sink data to spark managed table
    resultDF.write \
        .mode("overwrite") \
        .saveAsTable("result_data_tbl")

    # Load data from the Spark table
    spark.sql("USE AIRLINE_DB")
    tableDF = spark.sql("SELECT * FROM result_data_tbl")
    print("Data loaded from Spark table:")
    tableDF.show(5)
    print("table schema:" + tableDF.schema.json())

    spark.stop()

Spark managed table directory (screenshot omitted): by default, Spark writes managed tables under the application's spark-warehouse/ directory.

External table directory: this can be a local path, or a remote location such as AWS S3 or an Azure blob container.
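
For reference, assuming the default warehouse location and the paths used in the script above, the on-disk layout looks roughly like this (exact part-file names will differ):

spark-warehouse/
    airline_db.db/
        result_data_tbl/            # managed table: deleted when the table is dropped
            part-00000-....snappy.parquet
/external_result/                   # external table: files survive DROP TABLE
    part-00000-....snappy.parquet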

Code: https://gitee.com/yanghang1977/pyspark/blob/master/FixedInves...

