Amazon Elastic MapReduce (Amazon EMR) is a managed cluster platform provided by Amazon Web Services. With Amazon EMR, users can easily stand up a cluster to run big data frameworks such as Apache Spark, Hive, Flink, and Presto. Because Amazon EMR is highly configurable and scalable, users can flexibly tailor it to their needs, meeting production requirements while reducing the operation and maintenance costs of the underlying infrastructure.
While building its data warehouse, the FreeWheel big data team has accumulated a great deal of practical and operational experience with Amazon EMR. This article takes a hands-on perspective and describes how the FreeWheel Transformer team works with Amazon EMR in its ETL pipelines, in the hope that it will be useful to others.
A typical Spark on Amazon EMR cluster architecture
Let's first look at what a typical Amazon EMR cluster consists of. Amazon EMR uses Yarn by default to manage cluster resources, and divides nodes into three groups: Master, Core, and Task (implemented through Yarn node labels).
Master node
The Master node of Amazon EMR manages the entire cluster. A cluster must have at least one Master node (starting from Amazon EMR 5.23, if you need Master node HA, you can set the instance count to 3; depending on the application, different HA schemes will then apply).
Take running Hadoop HDFS and Yarn as an example: the Yarn ResourceManager and the HDFS NameNode run on the Master. Under the high-availability scheme, the Yarn ResourceManager runs on all three master nodes in Active/Standby mode. If the master node hosting the Active ResourceManager fails, Amazon EMR initiates an automatic failover process and a master node with a Standby ResourceManager takes over all operations; the current service state can be obtained with `yarn rmadmin -getAllServiceState`. For details about Yarn ResourceManager HA, you can refer to ResourceManager HA.
Similar to the ResourceManager, the NameNode runs on two of the three master nodes, in Active and Standby states respectively. If the master node hosting the Active NameNode fails, Amazon EMR initiates the HDFS failover process: the NameNode in the Standby state becomes Active and takes over all HDFS operations in the cluster. The current NameNode state can be obtained with the `hdfs haadmin -getAllServiceState` command. For details of HDFS HA, please refer to HDFS HA.
- ResourceManager HA
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
- HDFS HA
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
Core node
The Core node group is optional; HDFS DataNodes run on it. When the number of core nodes is less than 4, Amazon EMR sets the HDFS replication factor to 1 by default; when it is less than 10, the replication factor is 2; otherwise it is 3. If you need a custom setting, modify the dfs.replication configuration in hdfs-site.xml when starting Amazon EMR. One point to note: on an already running Amazon EMR cluster, scaling the core nodes triggers HDFS data re-balancing, which is a time-consuming operation, so frequent core node scaling is not recommended. Moreover, for a cluster with 3 replicas, shrinking the number of core nodes below 3 will not succeed.
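The default replication rule described above can be summarized in a tiny sketch (the thresholds are those stated above; the function name is ours, for illustration only):

```python
def default_hdfs_replication(core_nodes: int) -> int:
    """Default dfs.replication that Amazon EMR derives from the core node count."""
    if core_nodes < 4:
        return 1   # fewer than 4 core nodes -> single replica
    if core_nodes < 10:
        return 2   # 4 to 9 core nodes -> two replicas
    return 3       # 10 or more core nodes -> three replicas
```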
Task node
Task nodes are ordinary worker nodes, well suited to scenarios where the workload must be scaled frequently. A NodeManager runs on each task node, and once started it joins Yarn's cluster management. For scenarios that require frequent scaling, scaling only the task nodes is the more practical and efficient approach.
Instance fleets (InstanceFleet)
Both Core and Task nodes can be configured as instance fleets. Instance fleets are a very useful feature, especially for Spot instances. When certain popular events occur, the interruption rate of popular instance types can become very high, and relying on a single instance type easily fails to meet demand. Current instance interruption rates can be checked on the Spot Advisor. In the following chapters, I will describe in detail how we use instance fleets.
Amazon EMR version
When creating an Amazon EMR cluster, pay extra attention to the release version, since different releases support different application versions. Starting from Amazon EMR 6.1.0, the combination of Spark 3.0.0 + Hadoop 3.2.1 is supported, and the latest Amazon EMR 6.2.0 release already supports the stable Spark 3.0.1. If you need to run Spark 2.x, you can choose an Amazon EMR 5.x release or Amazon EMR 6.0.0.
Beyond the application version differences, starting from Amazon EMR 6.0.0 the system is based on Amazon Linux 2 and Corretto JDK 8. Compared with the previous Amazon Linux 1, the biggest differences are new system tools such as systemctl and an optimized Amazon Linux LTS kernel. In addition, Amazon Corretto JDK 8 provides a Java SE certified compatible JDK, including long-term support, performance enhancements, and security fixes. For the release notes of Amazon EMR 6.x, please refer to the Amazon EMR release note.
In addition, Amazon Web Services has recently added support for Amazon EMR on Amazon EKS, which lets us create Amazon EMR clusters in Amazon EKS more flexibly and at lower cost. Our team is currently investigating it, and I will talk about it in a follow-up article.
- Amazon EMR release note
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html
Amazon EMR practice in the FreeWheel batch ETL pipeline
In the FreeWheel batch ETL pipeline there are two important components named Optimus and JetFire, after the well-known Optimus Prime and Jetfire (the FreeWheel Transformer team names its products after Transformers characters). Together they form a Spark-based data modeling and processing framework built by FreeWheel. Optimus targets the construction of the data warehouse layer: it takes the logs generated by the advertising system and collected by the front-end modules, extracts and transforms them according to business logic, performs unified modeling and extensive business enrichment, converts the raw data into Context Data convenient for downstream analysis, and finally uses Spark SQL to generate wide tables and persist them. JetFire is a more flexible, general-purpose Spark-based ETL framework that lets users quickly and conveniently implement data processing requirements on top of the wide tables. All of these pipelines are orchestrated and scheduled by Airflow.
Workload characteristics
Optimus-based batch ETL pipeline
- The ETL pipeline runs one batch per hour. Because customers are distributed across time zones around the world, data must be published for the previous 24 hours of a customer's time zone shortly after midnight in that time zone; in addition, some products require hourly data with no delay, so each hourly batch must complete within 30 minutes;
- The data volume is unstable. The density of customers differs across time zones, traffic varies by hour, regional events occur, and internal upstream modules may be delayed, all of which make the hourly data volume uneven. Although the 24-hour distribution follows the same general trend every day, it cannot be predicted precisely; only when a batch is about to be processed can its data volume be obtained from upstream;
- There is no need to persist data on HDFS during the intermediate stages of the ETL. The requirement on HDFS is to support access from Spark tasks and the Amazon EMR cluster, and to guarantee the transactional nature of a batch of tasks. Data that needs to be persisted is loaded into ClickHouse by subsequent modules and simultaneously published to Amazon S3 to be managed by Hive;
- Spark tasks have varying cluster resource requirements. Optimus performs a large amount of computation (data serialization and deserialization, metric calculation, data sorting and aggregation, HyperLogLog calculation, etc.) and caching (data reused in the DAG during hierarchical modeling), and its resource needs differ by stage: the path from loading data into memory through transformation and modeling is compute-intensive and needs a lot of CPU; at the same time, because some dataframes are reused by higher-level models and need to be cached, a large amount of memory is needed; finally, in the data extraction and aggregation stage that serves many concurrent Spark SQL queries, both network and CPU become major bottlenecks. We have done very fine-grained tuning of the Spark applications; for details, please refer to the articles Spark practice one and Spark practice two. The cluster resource usage during one batch can be seen in the following three figures.
- Spark practice one
https://mp.weixin.qq.com/s/McZZXOUSUCt3iJJQTfmE4g
- Spark practice two
https://mp.weixin.qq.com/s/E1SNTIQdxiRroi5y7QR3ng
JetFire-based DataFeed pipeline
- Tasks in Datafeed run on different schedules with different input data volumes, so the load submitted to the Amazon EMR cluster cannot be predicted in advance. In this scenario, Amazon EMR acts more like a shared computing platform, and it is difficult to plan overall resources from the perspective of any single application.
- Considering that the Optimus-based ETL pipeline is a common dependency of many downstream applications and must be highly stable, we want to reduce overhead as much as possible while still giving it a dedicated Amazon EMR cluster. The Datafeed pipeline, by contrast, consists of many fragmented, mostly lightweight tasks that need to complete quickly.
Let's walk through how we addressed the above requirements.
Amazon EMR cluster configuration
Based on the introduction to Amazon EMR clusters above and our actual application requirements, we run a long term Amazon EMR cluster composed of a single On-Demand Master node, a single On-Demand Core node, and dynamically scaled Task nodes in an InstanceFleet mixing Spot and On-Demand instances.
Long term vs. temporary Amazon EMR clusters
The reason we do not create a new Amazon EMR cluster every time we need one is that, in addition to instance provisioning time, Amazon EMR must also execute a fairly long bootstrap script and start many services before the cluster is ready. As introduced above, compared with Task nodes, Master and Core nodes not only run Hadoop services but also need to download and configure Spark, Ganglia, and other environments, plus additional services (determined by the applications the user selects). For hourly scheduled tasks, the time overhead of such frequent creation and destruction cannot be ignored, so we chose to create and maintain a long term Amazon EMR cluster and control the load by scaling the worker nodes. For longer scheduling intervals, such as daily or even weekly jobs, creating a temporary cluster each time Amazon EMR is needed is the more reasonable solution.
Having chosen a long term cluster, we need to pay attention to its stability; nevertheless, we chose a solution without HA. Since both Master and Core are On-Demand instances, the cluster will not crash except under extreme circumstances. For those extreme cases, considering that neither pipeline needs data persistence, if the Amazon EMR cluster can be rebuilt and the tasks restored within minutes, the ROI is higher than keeping the HA solution's permanently running extra master and core nodes, while production needs are still met. So, all things considered, we chose single Master and single Core nodes; with Terraform scripts and Airflow scheduling, we can quickly rebuild the Amazon EMR cluster and rerun tasks in the unlikely event of a cluster accident, thereby restoring the pipeline.
It's also worth mentioning that in the early days, Terraform did not support creating Amazon EMR clusters with InstanceFleet; a cluster could only be created with the command line or the boto3 library. Since there was no state file for tracking, unified management (destruction, modification, etc.) through Terraform was also impossible, and we had to implement our own stateful management scripts for Amazon EMR. The latest version of Terraform now supports Amazon EMR with InstanceFleet. However, in practice, for clusters that use InstanceFleet, many settings can only be specified at creation time: the Master/Core/Task instance configuration, the InstanceFleet instance type selection, some customized Hadoop and Spark configuration (this can still be changed after startup by modifying the corresponding xml or conf files and restarting the affected services, but that approach is not friendly to a production environment), security groups, and so on. If you need to change any of these, you still have to destroy the cluster and recreate it with the new configuration. Note that clusters using InstanceGroup can instead be modified through Reconfiguration.
- Reconfiguration
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
Spot instance and AZ selection
Considering cost, we prioritize Spot instances for Task nodes and use On-Demand instances as a supplement. Spot instances may fail to provision when resources are scarce, and even once obtained they may be interrupted and reclaimed. We set multiple AZs when creating the Amazon EMR cluster to increase the probability of successfully obtaining instances. However, the multi-AZ configuration only takes effect at creation time: Amazon EMR picks the AZ with the most resources at that moment, and after creation the cluster is fixed to that AZ, even if its resources later become scarce while other candidate AZs have plenty. Since the cluster's Master and Core already live in the current AZ, instances cannot be requested from other AZs; given the communication overhead of a Spark cluster, the performance impact of crossing AZs cannot be ignored, so this design is quite reasonable. Therefore, the multi-AZ setting is very useful when Amazon EMR is recreated each time, but of limited use for a long term cluster. After creating a long term cluster, to reduce the impact of losing Spot instances, we use an InstanceFleet mixing Spot and On-Demand instances; this is described in detail in the Task node scaling strategy section.
Master and Core node instance type selection
For the Master and Core node instance types: since the Core node runs the HDFS DataNode and we expect high storage throughput and IOPS when running Spark tasks, we chose the storage-optimized i series. In addition, since the Spark driver runs on the core node by default, and for code paths that run concurrently in Spark the driver's CPU core count determines the upper limit of concurrency, you need to select an instance type with an appropriate CPU for the application's requirements. One more thing worth mentioning: starting from Amazon EMR 6.x, the Yarn label feature is disabled by default. However, considering that the core node is On-Demand while the worker nodes are Spot instances that scale frequently, running the driver on the On-Demand core node is more stable, so we still enable Yarn labels and let the driver run on the core node.
This can be turned on with the following configuration:

yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'CORE'
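For example, when creating the cluster, these two properties can be supplied as an Amazon EMR configuration classification (a sketch; only the two properties above are set, everything else is default):

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.node-labels.enabled": "true",
      "yarn.node-labels.am.default-node-label-expression": "CORE"
    }
  }
]
```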
For the Master node, besides the memory overhead of the NameNode, there is the memory overhead of the Spark history server and the ResourceManager. Compared with the worker nodes, resources are not very tight, so choosing a reasonably sized instance is sufficient.
Task node scaling strategy
To ensure that each hour's data, whatever its volume, can be processed within the time limit, cluster capacity must be adjusted according to the current hour's input data volume. In scaling the Task nodes, we considered the following aspects.
InstanceFleet
To reduce cost, we prefer Spot instances over On-Demand ones. However, when capacity is tight or large events occur, popular instance types become scarce and may even be reclaimed frequently. When nodes in the cluster are reclaimed, Spark can, if resources allow, reschedule and shuffle that portion of data onto the remaining available workers, but the data movement and the skew introduced into downstream DAG tasks degrade performance, and insufficient resources can even cause the task to fail outright. A longer-running Spark job in turn faces a greater chance of Spot reclamation. We therefore use InstanceFleet to reduce dependence on any single instance type. An InstanceFleet can be configured with a mix of instance types and lifecycle types (Spot/On-Demand) for Amazon EMR's Master, Core, and Task nodes, so that the target capacity can still be met when instance resources are tight.
An InstanceFleet can currently hold up to 15 instance types. The instance families commonly used to run Spark are C, R, and M. Apart from the C family's higher CPU frequency, the main difference between these families is the vCPU/memory ratio: the R family suits tasks with higher memory usage, the C family suits compute-heavy tasks, and the M family sits in between.
Different instance types have different capacities. In an InstanceFleet, each type can be assigned a unit value according to its capacity; by default, the unit value equals the number of vCores. In practice we use two approaches:
- One is to use the defaults: for example, c5.12xlarge counts as 48 units and c5.24xlarge as 96. After estimating the resources the cluster needs, the required CPU can be converted directly into a number of units to request. This works when all instance types configured in the InstanceFleet have the same CPU/memory ratio, or when some types have a higher memory ratio than the target and you are willing to accept some wasted memory. If a type with a lower memory ratio is selected, it may be unable to start as many executors as expected due to insufficient memory, leaving the Spark application stuck in the ACCEPTED state for lack of resources;
- The other is to configure units based on the Spark application's executors, using the greatest common divisor of the maximum number of executors that each selected instance type can host as one unit. The limitation is that the cluster setup becomes bound to the Spark application's configuration: cluster resources are used more efficiently, but whenever the application configuration changes, the cluster must be redeployed. The degree of customization is high, which has both advantages and disadvantages.
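The second approach can be sketched as follows. The instance shapes and executor sizes below are illustrative assumptions, not our production values; the idea is simply to take the GCD of per-type executor counts as one unit:

```python
from math import gcd
from functools import reduce

# Hypothetical instance catalogue: name -> (vCPU, memory in GiB).
INSTANCES = {
    "c5.12xlarge": (48, 96),
    "r5.12xlarge": (48, 384),
    "m5.12xlarge": (48, 192),
}

def executors_per_instance(vcpu, mem_gib, exec_cores=4, exec_mem_gib=24):
    """Max executors of a given shape that fit on one instance
    (bounded by whichever of CPU or memory runs out first)."""
    return min(vcpu // exec_cores, mem_gib // exec_mem_gib)

# Use the GCD of the per-type executor counts as one InstanceFleet "unit",
# so every type's WeightedCapacity is an integer multiple of the unit.
counts = [executors_per_instance(v, m) for v, m in INSTANCES.values()]
unit = reduce(gcd, counts)
```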
During long term Amazon EMR operation and maintenance, we sometimes need to change Amazon EBS storage. For Master and Core nodes (on instance types that do not rely on NVMe SSDs, i.e. not the i3 series), we can use Elastic Volumes; for details, refer to the article Dynamically scale up storage on Amazon EMR clusters. However, for Task nodes that scale frequently, dynamically resizing Amazon EBS does not pay off. In this scenario, the most convenient option would be for Amazon EMR to allow modifying the InstanceFleet configuration so that it takes effect at the next scale-out; at present, to change the Task node InstanceFleet configuration we can only rebuild the cluster.
- Dynamically scale up storage on Amazon EMR clusters
https://aws.amazon.com/cn/blogs/big-data/dynamically-scale-up-storage-on-amazon-emr-clusters/
InstanceFleet currently has some limitations, such as:
We cannot actively set the provisioning priority of different instance types. In some scenarios we would like to add more instance type options, but only as supplements when the primary types are unavailable; this is currently not achievable. The official explanation is that an internal algorithm computes the units we need and considers overall cost and instance resource availability when provisioning machines.
In addition, when viewing the status of an InstanceFleet on the Amazon console, it is hard to find the machines currently in a given state, because the fleet keeps a history of up to 1,000 machines; once that is exceeded, the oldest records are discarded, and the page must be refreshed repeatedly to retrieve the full list. This is not very friendly for users debugging visually on the web. For users of the AWS CLI, listing the fleet's instances and filtering by state (for example with `aws emr list-instances`) returns the desired result.
The following is the configuration of the InstanceFleet used:
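(The original configuration snippet was not preserved in this version of the article; the following is an illustrative sketch of what a Task node InstanceFleet definition looks like in the Amazon EMR API, with instance types, capacities, and weights chosen purely for illustration.)

```json
{
  "Name": "task-fleet",
  "InstanceFleetType": "TASK",
  "TargetSpotCapacity": 96,
  "TargetOnDemandCapacity": 0,
  "LaunchSpecifications": {
    "SpotSpecification": {
      "TimeoutDurationMinutes": 10,
      "TimeoutAction": "SWITCH_TO_ON_DEMAND"
    }
  },
  "InstanceTypeConfigs": [
    { "InstanceType": "c5.12xlarge", "WeightedCapacity": 48 },
    { "InstanceType": "r5.12xlarge", "WeightedCapacity": 48 },
    { "InstanceType": "m5.12xlarge", "WeightedCapacity": 48 }
  ]
}
```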
Cluster scaling strategy
For different application scenarios we currently use two scaling strategies: one is active scaling, where the task scheduler scales the cluster according to the workload; the other is passive scaling, where Amazon EMR scales by monitoring cluster metrics.
Active scaling method
For the Optimus-based ETL pipeline, we can fit historical data to find the relationship between data volume and the resources Spark needs per batch, and through continuous feedback the estimate of the cluster resources required for a given data volume becomes increasingly accurate. Since the Amazon EMR cluster Optimus runs on is dedicated, we can scale the cluster precisely to the target capacity before submitting the Spark job. The figure below shows the Task node scaling workflow under Airflow scheduling.
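A minimal sketch of this fitting idea, using a simple least-squares line from input size to required memory (the historical numbers and headroom factor below are made up for illustration; our production model and feedback loop are more involved):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical history: (batch input in GB, peak cluster memory used in GB).
history = [(100, 420), (200, 800), (400, 1610), (800, 3200)]
a, b = fit_line([h[0] for h in history], [h[1] for h in history])

def estimate_capacity(input_gb, headroom=1.2):
    """Target cluster memory for a batch, with a safety headroom factor."""
    return (a + b * input_gb) * headroom
```

As new batches complete, their (input, usage) pairs are appended to the history and the line is refit, which is the feedback loop mentioned above.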
Passive scaling method
The Datafeed pipeline is in fact a collection of tasks with different schedule intervals, different resource requirements, and different SLAs. Since the tasks are independent of one another, Amazon EMR here is equivalent to a shared computing resource pool, and it is difficult to manage cluster capacity from the dimension of each individual task's schedule. We therefore adopted Amazon EMR managed scaling. With this feature enabled, Amazon EMR continuously evaluates cluster metrics, makes scaling decisions, and resizes the cluster dynamically; it works for clusters composed of instance groups or instance fleets. In addition, we set up different Yarn queues, combined with Yarn's capacity scheduler, to coarsely isolate tasks with different resource requirements and priorities so that overall cluster resources are used as sensibly as possible. In actual use we have repeatedly encountered situations where automatic scaling stopped working and had to be triggered manually before it would resume; the cause is currently unknown.
- Amazon EMR managed scaling
https://docs.aws.amazon.com/zh_cn/emr/latest/ManagementGuide/emr-managed-scaling.html
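The queue setup can be expressed as a capacity-scheduler configuration classification at cluster creation. The queue names and capacity percentages below are hypothetical, just to show the shape:

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues": "default,light,heavy",
      "yarn.scheduler.capacity.root.default.capacity": "20",
      "yarn.scheduler.capacity.root.light.capacity": "30",
      "yarn.scheduler.capacity.root.heavy.capacity": "50"
    }
  }
]
```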
The two schemes each have pros and cons. The first suits a dedicated cluster with fully predictable capacity needs: we can actively set the cluster to the desired size before submitting the task, which is fast and accurate. The second suits Amazon EMR as a shared computing platform, where a single task submitter cannot see the full picture of current and future submissions and thus cannot compute the capacity Amazon EMR should scale to; the cluster must instead adjust dynamically based on cluster metrics after tasks are submitted. This lags in timeliness, behaves less well for complex DAGs with many intermediate steps, and is unfriendly to Spark applications sensitive to shuffle and data skew.
In response, the Transformer team is developing Cybertron, a framework that runs on top of Amazon EMR and Yarn and approaches the problem from the perspective of applications and policies. We hope this service can manage multiple Amazon EMR clusters' resources sensibly from a more global viewpoint, free the application side from having to track cluster resources, and integrate the configuration of Yarn's scheduler and queues, so that resource and task scheduling can be controlled from a higher vantage point.
Mixed use of Spot and On-Demand instances
In practice, even with InstanceFleet, extreme conditions can leave resources unavailable, and we have to fall back to On-Demand instances to fill the gap. As shown in the flowchart above, when we have waited a certain amount of time for the cluster to scale and the target still cannot be met, we request On-Demand instances to make up the shortfall; in the InstanceFleet scenario, the On-Demand types are the same as the Spot types. It is worth noting that although Amazon EMR's provisioning timeout action can be configured as "After xx minutes, switch to On-Demand instances", this policy only takes effect when the fleet is first provisioned; it does not apply to Spot requests in a running cluster. We therefore still need to actively request On-Demand instances based on the InstanceFleet's actual state.
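The wait-then-fallback logic can be sketched like this. The callbacks are hypothetical stand-ins for the real cluster API (e.g. boto3 InstanceFleet modification calls); only the control flow reflects what is described above:

```python
import time

def ensure_capacity(target_units, get_units, add_spot, add_ondemand,
                    timeout_s=600, poll_s=30, sleep=time.sleep):
    """Request Spot capacity first; if the target is not reached within
    the timeout, top up the shortfall with On-Demand instances.

    get_units/add_spot/add_ondemand are caller-supplied callbacks wrapping
    the actual cluster API; sleep is injectable for testing.
    """
    add_spot(target_units - get_units())      # initial Spot request
    waited = 0
    while get_units() < target_units and waited < timeout_s:
        sleep(poll_s)                          # poll until fulfilled or timed out
        waited += poll_s
    shortfall = target_units - get_units()
    if shortfall > 0:
        add_ondemand(shortfall)                # fill the remaining gap
    return get_units()
```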
For more extreme scenarios, such as Black Friday or the US general election, we preemptively switch all requested instances to On-Demand, to avoid the extra time cost of repeatedly failing to obtain Spot capacity. In addition, if a running Spark job is interrupted by Spot reclamation, we raise the On-Demand ratio according to the current cluster state under Airflow's scheduling, so the job can recover quickly.
Results
The figure below shows a cluster using the active scaling scheme; the cluster scales with the data volume over 24 hours. The red column is the cluster's memory capacity, and the stacked blue + green area is the memory actually used. In practice, we can actively scale out to the desired capacity before each task is submitted; each batch makes full use of the cluster's resources and finishes its data processing within the SLA, and the cluster scales in immediately once it is no longer needed, meeting task requirements while reducing cost.
The figure below shows a cluster using the Amazon EMR managed scaling solution. As more Spark tasks are submitted, the cluster load rises; the cluster scales out twice according to the load, and finally the Task nodes are scaled back to 0 after all tasks complete and the cluster has been idle for a while.
HDFS dependency
In our scenario, Amazon EMR serves purely as a computing resource, and the HDFS on it only needs to support the Spark applications. Within a batch, generated data is first written to HDFS, after which a publisher moves and persists it to Amazon S3. As for why we do not write directly to Amazon S3: this is mainly due to the characteristics of the business data and of Amazon S3's consistency model at the time (Amazon S3 has since introduced strong read-after-write consistency, which can simplify a lot of system design; see Strong Read-After-Write Consistency). From a system design perspective, we also do not want the upstream and downstream of the data flow to be too tightly coupled by data format, which is not only hard to maintain but also lengthens the Spark application's run time, raising cost in disguise; in the Spot scenario, a longer run time also means a greater chance of interruption.
- Strong Read-After-Write Consistency
https://aws.amazon.com/cn/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
Summary
As our use of Amazon EMR deepens, we keep "squeezing" the Amazon EMR bill while still meeting product needs. We hope this account of our Amazon EMR experience inspires readers. Thanks also to the Amazon EMR team for their support throughout this process.
In addition, the Amazon EMR team keeps improving the service and launching new features based on customers' actual needs. We are currently investigating and trialing the latest Amazon EMR on Amazon EKS, hoping for more flexible usage, faster scaling, and lower overhead; we will share our progress in subsequent articles.
Author of this article
Peng Kang
FreeWheel Senior Software Engineer
He graduated from the Institute of Computing Technology, Chinese Academy of Sciences, and currently works on the Comcast FreeWheel data product team, mainly responsible for building the advertising data platform and data warehouse.