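Preface

  • At the time of writing, the latest AWS EMR release is 6.15, which ships with Python 3.7. This post tries uploading and using Python 3.11 instead.

Packaging the Python environment

  • Tech stack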

    Ubuntu        22.04 (x86)
    Linux kernel  5.15
    Python        3.11.5
    pyspark       3.4.1
    conda         23.10.0
    conda-pack    0.7.1
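  • AWS officially recommends compiling and installing the Python environment on Amazon Linux 2. In testing, a virtual environment created with Miniconda on Ubuntu also works.
  • Install Miniconda on Ubuntu, then use conda to install Python 3.11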

    conda create -n EMRServerless python=3.11
  • Activate the EMRServerless environment and install pyspark 3.4.1 (the Spark version in EMR 6.15 is 3.4.1). This step turned out to be unnecessary: AWS uses its own pyspark build.

    conda activate EMRServerless
    pip3 install pyspark==3.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
  • Switch back to the base environment and install conda-pack

    conda activate base
    pip3 install conda-pack -i https://pypi.tuna.tsinghua.edu.cn/simple
  • Use conda-pack to produce the archive py311.tar.gz

    conda pack --prefix /home/qbit/miniconda3/envs/EMRServerless -o py311.tar.gz
  • After unpacking the archive, you can see that the largest directory inside it is .../lib/python3.11/site-packages/pyspark/jars
  • Upload py311.tar.gz to S3 at the following path

    s3://your-bucket/usr/qbit/py311_env/py311.tar.gz
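
The observation about the largest directory can be checked without unpacking the archive. The following is a minimal sketch using only the standard library; the function name `largest_dirs` and its `depth` and `top` parameters are illustrative and not part of conda-pack.

```python
import tarfile
from collections import Counter


def largest_dirs(archive_path: str, depth: int = 5, top: int = 3) -> list[tuple[str, int]]:
    """Return the `top` directory prefixes (at most `depth` components) by total file size."""
    sizes: Counter = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks and other non-regular entries
            parts = member.name.split("/")[:-1]  # drop the file name itself
            prefix = "/".join(parts[:depth]) or "."  # "." stands for the archive root
            sizes[prefix] += member.size
    return sizes.most_common(top)


# Hypothetical usage against the archive built above:
# for path, size in largest_dirs("py311.tar.gz"):
#     print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```

With the default `depth=5`, files under lib/python3.11/site-packages/pyspark/jars are grouped under that prefix, which makes it easy to see how much of the archive the unused pyspark install accounts for.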

Installing and testing the AWS CLI

  • Install the AWS CLI and do the basic configuration
[profile emr-serverless]
aws_access_key_id = ACCESS-KEY-ID-OF-IAM-USER
aws_secret_access_key = SECRET-ACCESS-KEY-ID-OF-IAM-USER
region = cn-northwest-1
output = json
  • Check that the installation works

    aws emr-serverless help
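
Because the configuration above lives in a named profile rather than in [default], later commands only pick it up when the profile is selected, for example through the AWS_PROFILE environment variable or the --profile option. As a quick sanity check, the sketch below parses config text and reports which keys are missing from a profile. The helper name `missing_profile_keys` and the list of required keys are assumptions made for illustration.

```python
import configparser

# Keys this sketch treats as required; "output" is optional.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "region")


def missing_profile_keys(config_text: str, profile: str) -> list[str]:
    """Return the required keys that are absent or empty in `[profile <profile>]`."""
    # interpolation=None so that a "%" inside a value is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = f"profile {profile}"
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]


# Hypothetical usage:
# import pathlib
# text = (pathlib.Path.home() / ".aws" / "config").read_text()
# print(missing_profile_keys(text, "emr-serverless"))
```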

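Creating the AWS IAM role

  • In the current directory on the local machine, create the file emr-serverless-trust-policy.json with the content below. This trust policy is required to run Serverless jobs.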
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EMRServerlessTrustPolicy",
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      }
    }
  ]
}
  • Create the role using the policy file above
aws iam create-role \
    --role-name EMRServerlessS3RuntimeRole \
    --assume-role-policy-document file://emr-serverless-trust-policy.json
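  • In the current directory on the local machine, create the file emr-qbit-access-policy.json with the content below. This policy grants access to S3. Replace your-bucket with your own bucket name, and replace aws in the ARNs of the official example with aws-cn.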
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullAccessToOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws-cn:s3:::your-bucket",
        "arn:aws-cn:s3:::your-bucket/*"
      ]
    }
  ]
}
  • Create the S3 access policy, and save the Arn from the response for later use
aws iam create-policy \
    --policy-name EMRServerlessS3AccessPolicy \
    --policy-document file://emr-qbit-access-policy.json
  • Attach the S3 access policy to the Serverless role. The arn in the command is the arn of the S3 access policy
aws iam attach-role-policy \
    --role-name EMRServerlessS3RuntimeRole \
    --policy-arn arn:aws-cn:iam::XXX:policy/EMRServerlessS3AccessPolicy
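
The two JSON files in this section differ from the official examples only in the bucket name and the ARN partition (aws-cn instead of aws). A minimal sketch that generates both files from those two inputs might look like this; the function names are illustrative, and the statements are copied from the files shown above.

```python
import json


def build_trust_policy() -> dict:
    """Trust policy that lets the EMR Serverless service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EMRServerlessTrustPolicy",
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "emr-serverless.amazonaws.com"},
            }
        ],
    }


def build_s3_access_policy(bucket: str, partition: str = "aws-cn") -> dict:
    """S3 access policy for one bucket; use partition="aws" outside the China regions."""
    bucket_arn = f"arn:{partition}:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "FullAccessToOutputBucket",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                ],
                # The bucket itself (for ListBucket) and every object in it
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def write_policy(policy: dict, path: str) -> None:
    """Write a policy document to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(policy, f, indent=2)


# Hypothetical usage:
# write_policy(build_trust_policy(), "emr-serverless-trust-policy.json")
# write_policy(build_s3_access_policy("your-bucket"), "emr-qbit-access-policy.json")
```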

Running WordCount

class EMRServerless:
    def create_application(self, name: str, release_label: str, wait: bool = True):
        """ Create the application """

    def start_application(self, wait: bool = True) -> None:
        """ Start the application """

    def stop_application(self, wait: bool = True) -> None:
        """ Stop the application """

    def delete_application(self) -> None:
        """ Delete the application """

    def run_spark_job(
        self,
        script_location: str,
        job_role_arn: str,
        arguments: list,
        s3_bucket_name: str,
        wait: bool = True,
    ) -> str:
        """ Submit a Spark job """

    def get_job_run(self, job_run_id: str) -> dict:
        """ Get information about a job run """

    def fetch_driver_log(
        self, s3_bucket_name: str, job_run_id: str, log_type: str = "stdout"
    ) -> str:
        """ Fetch the driver log """
  • Take a closer look at the run_spark_job function, paying attention to the sparkSubmitParameters argument
def run_spark_job(
    self,
    script_location: str,
    job_role_arn: str,
    arguments: list,
    s3_bucket_name: str,
    wait: bool = True,
) -> str:
    response = self.client.start_job_run(
        applicationId=self.application_id,
        executionRoleArn=job_role_arn,
        jobDriver={
            "sparkSubmit": {
                "entryPoint": script_location,
                "entryPointArguments": arguments,
                "sparkSubmitParameters": (
                    "--conf spark.driver.cores=1 "
                    "--conf spark.driver.memory=4g "
                    "--conf spark.executor.instances=1 "
                    "--conf spark.executor.cores=1 "
                    "--conf spark.executor.memory=4g "
                    "--conf spark.archives=s3://your-bucket/usr/qbit/py311_env/py311.tar.gz#py311 " 
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/home/hadoop/py311/bin/python " 
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/home/hadoop/py311/bin/python " 
                    "--conf spark.executorEnv.PYSPARK_PYTHON=/home/hadoop/py311/bin/python"
                )
            }
        },
        configurationOverrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": f"s3://{s3_bucket_name}/{self.s3_log_prefix}"
                }
            }
        },
    )
    job_run_id = response.get("jobRunId")

    job_done = False
    while wait and not job_done:
        jr_response = self.get_job_run(job_run_id)
        job_done = jr_response.get("state") in [
            "SUCCESS",
            "FAILED",
            "CANCELLING",
            "CANCELLED",
        ]

    return job_run_id
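
Two parts of run_spark_job can be factored into small helpers. The first builds the Python-related part of sparkSubmitParameters: the `#py311` suffix on spark.archives is the directory name under which Spark unpacks the archive, and the three PYSPARK_* settings point at the interpreter inside that directory. The second polls for a finished state with a pause between checks and an overall timeout, whereas the loop above calls the API back to back. This is a sketch under assumptions: both helper names, the `extra_conf` parameter, and the default interval and timeout are illustrative; the /home/hadoop prefix and the list of finished states are taken from the function above.

```python
import time
from typing import Callable, Collection, Optional

# States after which the loop above stops polling
JOB_DONE_STATES = frozenset({"SUCCESS", "FAILED", "CANCELLING", "CANCELLED"})


def build_spark_submit_parameters(
    archive_uri: str,
    env_alias: str = "py311",
    extra_conf: Optional[dict] = None,
) -> str:
    """Build the --conf flags that make the driver and executors use the packed Python."""
    # Path used in the function above: the archive is unpacked under /home/hadoop/<alias>
    python_path = f"/home/hadoop/{env_alias}/bin/python"
    conf = {
        "spark.archives": f"{archive_uri}#{env_alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_path,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }
    conf.update(extra_conf or {})  # e.g. {"spark.executor.memory": "4g"}
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def wait_for_state(
    get_state: Callable[[], str],
    done_states: Collection[str] = JOB_DONE_STATES,
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_state() until it returns one of done_states, sleeping between calls."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s} seconds")
        sleep(interval_s)


# Hypothetical usage inside run_spark_job:
# final_state = wait_for_state(lambda: self.get_job_run(job_run_id).get("state"))
```

The `sleep` parameter exists only so the helper can be exercised without real delays; in normal use the default `time.sleep` is enough.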
  • Upload the text file to s3://your-bucket/usr/qbit/input/word.txt
  • The code of s3://your-bucket/usr/qbit/wordcount.py is as follows
import sys

import pyspark
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("WordCountExample").getOrCreate()

# Read the text file into an RDD
text_file_path = "s3://your-bucket/usr/qbit/input/word.txt"
text_rdd = spark.sparkContext.textFile(text_file_path)

# Use flatMap to split each line of text into words
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Use map to turn each word into a (word, 1) key-value pair
word_count_rdd = words_rdd.map(lambda word: (word, 1))

# Use reduceByKey to sum the counts for each word
word_counts = word_count_rdd.reduceByKey(lambda a, b: a + b)

# Save the result
word_counts.saveAsTextFile("s3://your-bucket/usr/qbit/output/")

# Stop the SparkSession
spark.stop()

# Print the Python and pyspark versions
print(f"Python version: {sys.version}")
print(f"pyspark version: {pyspark.__version__}")
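
To see what the flatMap, map and reduceByKey steps compute without starting Spark, the same pipeline can be mirrored on a plain list of lines. This is only a local illustration, not how Spark executes the job. The `drop_empty` flag is an addition that does not exist in wordcount.py; it is there because `line.split(" ")` yields empty strings wherever two spaces are adjacent, and those get counted as a word.

```python
def word_count(lines, drop_empty: bool = False) -> dict:
    """Pure-Python mirror of the flatMap -> map -> reduceByKey pipeline."""
    # flatMap: split every line on single spaces and flatten into one stream of words
    words = (word for line in lines for word in line.split(" "))
    if drop_empty:
        words = (word for word in words if word)
    # map: pair every word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share the same key
    counts: dict = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts


print(word_count(["a b  a"]))                   # the double space yields an empty-string key
print(word_count(["a b  a"], drop_empty=True))
```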
  • Install boto3
pip3 install boto3 -i https://pypi.tuna.tsinghua.edu.cn/simple
  • Call emr_serverless.py to run wordcount.py
python emr_serverless.py \
    --job-role-arn arn:aws-cn:iam::XXX:role/EMRServerlessS3RuntimeRole \
    --s3-bucket zt-hadoop-cn-northwest-1
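
The post does not show how emr_serverless.py reads these two options. A minimal argparse sketch that accepts the same flags could look like the following; the help texts and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two options used in the command above."""
    parser = argparse.ArgumentParser(description="Run a PySpark script on EMR Serverless")
    parser.add_argument(
        "--job-role-arn",
        required=True,
        help="ARN of the IAM role the job runs as",
    )
    parser.add_argument(
        "--s3-bucket",
        required=True,
        help="Name of the S3 bucket used for logs",
    )
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Hypothetical usage in the script's entry point:
# args = parse_args()
# print(args.job_role_arn, args.s3_bucket)
```

argparse turns the dashes in option names into underscores, so the values are available as `args.job_role_arn` and `args.s3_bucket`.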
  • Check the wordcount output under s3://your-bucket/usr/qbit/output/
  • The version info in the results shows that Python is the uploaded interpreter, while pyspark is AWS's own build
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
pyspark version: 3.4.1+amzn.2
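
The two version lines are printed by the driver, so they end up in the driver's stdout log under the logUri configured in run_spark_job. The sketch below reads such a log with an S3 client. The key layout (applications/<application id>/jobs/<job run id>/SPARK_DRIVER/<log type>.gz under the log prefix) is stated here as an assumption about how EMR Serverless lays out its S3 logs, and both function names are illustrative. The client is passed in, so any object with a compatible `get_object` method works, for example `boto3.client("s3")`.

```python
import gzip


def driver_log_key(
    log_prefix: str, application_id: str, job_run_id: str, log_type: str = "stdout"
) -> str:
    """Build the S3 key of a driver log (assumed EMR Serverless log layout)."""
    return (
        f"{log_prefix}/applications/{application_id}"
        f"/jobs/{job_run_id}/SPARK_DRIVER/{log_type}.gz"
    )


def read_gzip_log(s3_client, bucket: str, key: str) -> str:
    """Download a gzip-compressed log object and return it as text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return gzip.decompress(response["Body"].read()).decode("utf-8")


# Hypothetical usage (requires boto3 and real identifiers):
# import boto3
# key = driver_log_key("logs", "<application id>", "<job run id>")
# print(read_gzip_log(boto3.client("s3"), "your-bucket", key))
```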

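Using third-party libraries

  • Third-party libraries have transitive dependencies, so they cannot simply be zipped and imported. Instead, install them into the conda environment, then pack and upload them together with it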
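
One way to catch a library that did not make it into the packed environment is to check for it at the top of the job script, before any Spark work starts. The following is a minimal sketch; the helper name `missing_modules` is illustrative, and it only handles top-level module names (no dots).

```python
import importlib.util


def missing_modules(names) -> list:
    """Return the top-level module names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical usage at the top of a job script:
# missing = missing_modules(["numpy", "pandas"])
# if missing:
#     raise SystemExit(f"not in the packed environment: {missing}")
```

This only checks the interpreter that runs the driver script. Since the executors use the same unpacked archive, a pass on the driver is a reasonable first signal, although it does not prove the executors see the same files.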

Using your own code

  • Pack your own code folder into a util.zip file and include it with the following parameter
--conf spark.submit.pyFiles=s3://your-bucket/qbit/util.zip
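
Files listed in spark.submit.pyFiles are added to the Python search path, so the layout inside the zip decides which import statements work. The sketch below assumes the code folder is a package directory named util containing an __init__.py, and writes it so that util/ sits at the root of the zip, which allows `import util` in the job. The function name `zip_package` is illustrative.

```python
import os
import zipfile


def zip_package(package_dir: str, zip_path: str) -> list:
    """Zip the .py files of a package so that the package folder sits at the zip root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(package_dir):
            # Skip bytecode caches; sort for a stable archive order
            dirs[:] = sorted(d for d in dirs if d != "__pycache__")
            for name in sorted(files):
                if not name.endswith(".py"):
                    continue
                full_path = os.path.join(root, name)
                # Archive names are relative to the parent, e.g. "util/helpers.py"
                arcname = os.path.relpath(full_path, parent).replace(os.sep, "/")
                zf.write(full_path, arcname)
                written.append(arcname)
    return written


# Hypothetical usage:
# print(zip_package("util", "util.zip"))
```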

This article is from qbit snap