In https://segmentfault.com/a/1190000046426373 we walked through a simple ETL example that loads several CSV files, converts their format, adds a few aggregate fields, and saves the result to Parquet files. This article uses that data for a simple demonstration: computing the return of a fixed weekly investment made every Monday versus every Friday, to see which day works out better. The data set is small and only meant for demonstration, so the result has no practical investment significance.

The Parquet data has two price fields: price, the fund's unit price on that day, and accumulativePrice, the accumulated price. Because a fund may pay out dividends from time to time, the simplest way to compute the profit is to take the difference between the latest accumulativePrice and the accumulativePrice on the purchase date.
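As a quick illustration of this profit calculation (all numbers below are made up for the example, not taken from the data set):

# Hypothetical numbers illustrating the accumulativePrice-based profit calculation:
# invest 100 with a 1.5% front-load fee when price is 2.0 and
# accumulativePrice is 3.0; the latest accumulativePrice is 3.3.
investment = 100
fee_rate = 0.015
price_at_purchase = 2.0       # made-up value
acc_at_purchase = 3.0         # made-up value
acc_now = 3.3                 # made-up value

units = investment * (1 - fee_rate) / price_at_purchase  # 98.5 / 2.0 = 49.25 units
earn = units * (acc_now - acc_at_purchase)               # 49.25 * 0.3 = 14.775
rate = earn / investment * 100                           # 14.775%
print(units, earn, rate)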

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofweek, lit, sum as spark_sum  # alias avoids shadowing the built-in sum

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("FixedInvestment") \
        .getOrCreate()

    # Load the Parquet files from the fund/parquet/ folder
    parquetDF = spark.read \
        .format("parquet") \
        .load("data/fund/parquet/*.parquet")

    # Peek at the data and print its schema
    parquetDF.show(5)
    print(parquetDF.schema.json())

    # Get distinct fund IDs
    fund_ids = parquetDF.select("fundid").distinct().collect()

    for fund in fund_ids:
        fundid = fund["fundid"]
        print(f"Processing fundid: {fundid}")

        # Filter data for the current fund
        fundDF = parquetDF.filter(col("fundid") == fundid)

        # Latest accumulativePrice for this fund; it is used to value both the
        # Monday and the Friday strategy, so compute it once here
        finalPrice = fundDF.orderBy(col("Date").desc()).select("accumulativePrice").first()["accumulativePrice"]

        # Logic for Monday investment (Spark's dayofweek: 1 = Sunday, so 2 = Monday)
        mondayDF = fundDF.filter(dayofweek(col("Date")) == 2)
        mondayDF = mondayDF.withColumn("investment", lit(100))  # invest 100 on each Monday
        mondayDF = mondayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # 1.5% front-load fee deducted before units are bought
        mondayDF = mondayDF.withColumn("earn", col("unitsPurchased") * (lit(finalPrice) - col("accumulativePrice")))
        mondayResultDF = mondayDF.groupBy().agg(
            spark_sum("investment").alias("totalInvestment"),
            spark_sum("unitsPurchased").alias("totalUnits"),
            spark_sum("earn").alias("totalEarn")
        )
        mondayResultDF = mondayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
     
        print(f"Monday Results for fundid: {fundid}")
        mondayResultDF.show()

        # Logic for Friday investment (dayofweek == 6 is Friday)
        fridayDF = fundDF.filter(dayofweek(col("Date")) == 6)
        fridayDF = fridayDF.withColumn("investment", lit(100))  # invest 100 on each Friday
        fridayDF = fridayDF.withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price"))  # 1.5% front-load fee deducted before units are bought
        fridayDF = fridayDF.withColumn("earn", col("unitsPurchased") * (lit(finalPrice) - col("accumulativePrice")))
        fridayResultDF = fridayDF.groupBy().agg(
            spark_sum("investment").alias("totalInvestment"),
            spark_sum("unitsPurchased").alias("totalUnits"),
            spark_sum("earn").alias("totalEarn")
        )
        
        fridayResultDF = fridayResultDF.withColumn(
            "investmentRate", col("totalEarn") / col("totalInvestment") * 100
        )
      
        print(f"Friday Results for fundid: {fundid}")
        fridayResultDF.show()

    spark.stop()
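The Monday and Friday branches in the script are identical except for the dayofweek value, so the per-weekday logic could be pulled into a helper function. A minimal sketch of that refactoring, reusing the finalPrice computed per fund (compute_weekday_result is a hypothetical name, not part of the original script):

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, dayofweek, lit, sum as spark_sum

def compute_weekday_result(fundDF: DataFrame, weekday: int, finalPrice: float) -> DataFrame:
    # weekday follows Spark's dayofweek numbering: 2 = Monday, 6 = Friday
    df = fundDF.filter(dayofweek(col("Date")) == weekday) \
        .withColumn("investment", lit(100)) \
        .withColumn("unitsPurchased", (col("investment") * (1 - 0.015)) / col("price")) \
        .withColumn("earn", col("unitsPurchased") * (lit(finalPrice) - col("accumulativePrice")))
    result = df.groupBy().agg(
        spark_sum("investment").alias("totalInvestment"),
        spark_sum("unitsPurchased").alias("totalUnits"),
        spark_sum("earn").alias("totalEarn")
    )
    return result.withColumn("investmentRate", col("totalEarn") / col("totalInvestment") * 100)

# Inside the per-fund loop the two strategies then become:
# mondayResultDF = compute_weekday_result(fundDF, 2, finalPrice)
# fridayResultDF = compute_weekday_result(fundDF, 6, finalPrice)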

Run output:
[screenshot of the per-fund results omitted]

Code: https://gitee.com/yanghang1977/pyspark/blob/master/FixedInves...

