The latest big data practices of NetEase Shufan, Cloud Music, Intel, and Youzan (PPT download + video playback)
At the big data session of the NetEase Shufan Technology Salon, jointly organized by NetEase Shufan and Intel, five experts shared their teams' latest practices on Serverless Spark, ClickHouse, Spark/Flink acceleration, data warehousing, and data products: Yao Qin, big data expert at NetEase Shufan and Apache Spark Committer; Chen Qi, head of OLAP in the Youzan Infrastructure Group; Xu Cheng, senior software development project manager at Intel and Apache Hive Committer; Lei Jianbo, data expert at NetEase Cloud Music; and Gu Ping, big data product expert at NetEase Shufan.
Kyuubi: Open source enterprise-level Serverless Spark framework
Yao Qin, big data expert at NetEase Shufan and Apache Spark Committer, shared the motivation behind the open source project Kyuubi, its key design points, and how it is used at NetEase. Kyuubi is a distributed JDBC service that follows HiveServer2's RPC implementation. First, by adding multi-tenancy capabilities to Spark, it becomes an ideal platform for migrating Hive QL to Spark SQL. Second, the entire SQL pipeline, from the compiler (compilation and optimization) to the runtime (execution), is implemented by Spark, which yields very strong performance. On top of this framework, NetEase Shufan combined advanced features of Kyuubi and Spark and began its journey toward Serverless Spark (Spark as a service).
Because Kyuubi encapsulates Spark's high-level APIs and exposes them through a client/server (C/S) architecture, users do not need to be aware of Spark-related concepts and frameworks and can focus on their own business and data. This lets more people and more businesses work with big data directly.
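The client/server split and multi-tenancy described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not Kyuubi's actual code: the names Gateway, Engine, open_session and submit are hypothetical, and the rule "one engine per user, reused across that user's sessions" is just one possible isolation policy chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Engine:
    """Hypothetical stand-in for a Spark application owned by one tenant."""
    engine_id: int
    owner: str
    statements: list = field(default_factory=list)

    def execute(self, sql: str) -> str:
        # A real engine would compile and run the SQL with Spark;
        # this toy version only records the statement.
        self.statements.append(sql)
        return f"engine-{self.engine_id} ran for {self.owner}: {sql}"


class Gateway:
    """Toy server side: clients send SQL, the gateway picks the engine."""

    def __init__(self) -> None:
        self._engines = {}      # user name -> Engine
        self._ids = count(1)    # engine id generator

    def open_session(self, user: str) -> Engine:
        # Assumed policy: start an engine on a user's first request,
        # then reuse it for every later session of the same user.
        if user not in self._engines:
            self._engines[user] = Engine(next(self._ids), user)
        return self._engines[user]

    def submit(self, user: str, sql: str) -> str:
        # The client never touches Spark objects, only user name and SQL.
        return self.open_session(user).execute(sql)


gateway = Gateway()
first = gateway.submit("alice", "SELECT 1")
second = gateway.submit("alice", "SELECT 2")
other = gateway.submit("bob", "SELECT 1")
```

In this sketch both of alice's statements land on the same engine while bob gets his own, so tenants are isolated without either client knowing how or where the engines run.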
Within NetEase, Kyuubi has helped the NetEase media business migrate its Hive QL tasks smoothly to Spark SQL. While saving 50% of computing resources, the migration reduced overall run time by 70% and improved overall performance by 727%. In addition, the team is helping business lines migrate Spark jobs from YARN clusters to Kubernetes.
Video playback: https://www.bilibili.com/video/BV1164y197iz
Kyuubi open source address: https://github.com/NetEase/kyuubi
ClickHouse use and optimization at Youzan
Chen Qi, head of OLAP in the Youzan Infrastructure Group, introduced the use and optimization of ClickHouse at Youzan from three angles: 1) ClickHouse's development, platform construction, and application scenarios at Youzan, including its rollout and optimization in scenarios such as DMP, SCRM, and CDP. 2) Offline read/write separation for hundreds of billions of records: a temporary cluster is built on K8s for offline writes, which separates offline reads from writes and addresses the business problem of write-heavy, read-light workloads. 3) A POC of a new self-developed database that tries to combine Doris and ClickHouse to address the pain points of both.
According to Chen Qi, ClickHouse is not a distributed database in the traditional sense. It is more "manual" overall: in many places, such as writes and materialized views, users have to design their own processes around it. ClickHouse also lacks automatic rebalancing, which makes scaling out and in particularly complicated to operate. By contrast, Apache Doris is more like a distributed database and addresses some of these pain points, for example with automatic balancing and Shuffle Join support, but so far its single-table performance, maturity, and stability do not match ClickHouse's.
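Why the lack of automatic rebalancing complicates scaling can be seen in a toy simulation. The sketch below is an assumption-laden model, not ClickHouse internals: rows are placed by key modulo the shard count, and the row counts (900 before expansion, 400 after) are made up for illustration.

```python
from collections import Counter


def shard_for(key: int, shard_count: int) -> int:
    """Simplified sharding rule: key modulo the current number of shards."""
    return key % shard_count


def write_rows(placement: Counter, keys, shard_count: int) -> None:
    """Record where each newly written row lands; existing rows never move."""
    for key in keys:
        placement[shard_for(key, shard_count)] += 1


def rows_to_move(rows_per_shard: list) -> list:
    """Surplus (+) or deficit (-) of each shard relative to an even split."""
    target = sum(rows_per_shard) // len(rows_per_shard)
    return [rows - target for rows in rows_per_shard]


placement = Counter()
write_rows(placement, range(0, 900), shard_count=3)     # cluster has 3 shards
write_rows(placement, range(900, 1300), shard_count=4)  # a 4th shard is added

rows_per_shard = [placement[shard] for shard in range(4)]
imbalance = rows_to_move(rows_per_shard)
```

In this model the new shard only receives its share of new writes, so it ends up with a quarter of the data held by each old shard. Evening things out means moving the surplus rows by hand, which is the kind of operational work an automatic rebalancer would otherwise take care of.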
Youzan therefore tried replacing Apache Doris's Impala-based operators with ClickHouse's high-performance operators, with the aim of building a better distributed OLAP database in the future. Judging by the POC results, the approach is feasible.
Video playback: https://www.bilibili.com/video/BV1h64y1t7EQ
Accelerate big data analysis with Intel Optane PMEM technology
Xu Cheng, software development engineering manager at Intel and Apache Hive Committer, shared how to use Intel's open source project Optimized Analytics Package (OAP) to accelerate Spark and Flink. He pointed out that the existing Spark framework still has room for performance improvement in areas such as memory management and the Shuffle implementation. He also explained how to make better use of new hardware, for example using Intel Optane PMEM (persistent memory) and taking advantage of its persistence, in-place erase, byte addressability, and low latency, which opens up many further optimization points in Spark.
Xu Cheng focused on the OAP Analytic Cache features, including use of Arrow's high-performance modules, Spark/Flink cache awareness, disaggregated cache, Filter/Project/Aggregation pushdown, and support for the high-performance compression accelerator QAT. Taking Spark cache awareness as an example, OAP extends the existing Spark data source scan to identify cached hot data blocks, uses a cache location provider to deliver scheduling-level cache awareness, and supports a variety of cache location providers for different usage scenarios.
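The idea of scheduling-level cache awareness can be sketched in a few lines. This is a minimal illustration, not OAP's API: the CacheLocationProvider interface, the InMemoryProvider class, the place_task function, and the host and block names are all hypothetical.

```python
from typing import Dict, List, Optional, Protocol


class CacheLocationProvider(Protocol):
    """Hypothetical interface: report which hosts hold a cached block."""

    def hosts_for(self, block_id: str) -> List[str]: ...


class InMemoryProvider:
    """One possible provider, backed by a plain dict (illustrative only)."""

    def __init__(self, cached: Dict[str, List[str]]) -> None:
        self._cached = cached

    def hosts_for(self, block_id: str) -> List[str]:
        return self._cached.get(block_id, [])


def place_task(
    block_id: str,
    provider: CacheLocationProvider,
    free_hosts: List[str],
) -> Optional[str]:
    """Prefer a free host that already caches the block, else any free host."""
    for host in provider.hosts_for(block_id):
        if host in free_hosts:
            return host
    return free_hosts[0] if free_hosts else None


provider = InMemoryProvider({"block-a": ["host2"]})
cached_hit = place_task("block-a", provider, ["host1", "host2"])
cache_miss = place_task("block-b", provider, ["host1", "host2"])
cached_host_busy = place_task("block-a", provider, ["host1"])
no_capacity = place_task("block-a", provider, [])
```

Keeping the provider behind an interface is what would let different backends (for example a local cache or a disaggregated one) be swapped in without touching the placement logic.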
Video playback: https://www.bilibili.com/video/BV1zb4y1C7BG
OAP open source address: https://github.com/oap-project/
NetEase Cloud Music's road to building a data warehouse
Lei Jianbo, data expert at NetEase Cloud Music, explained that NetEase Cloud Music is using a standardized, shared, self-service unified data warehouse system to lower the barrier to using data, improve decision-making and data utilization, and drive business growth with data. He shared NetEase Cloud Music's practice and thinking in meeting these challenges, and the results achieved, from two angles: traffic data governance and data asset accumulation.
In traffic data governance, event tracking (instrumentation) is a major pain point: tracking formats vary widely, the steps before instrumentation lack standardization and requirement review, client-side tracking lacks sound technical design and engineering conventions, and most aggregated traffic requests require a new JIRA ticket. NetEase Cloud Music addresses this by establishing tracking specifications beforehand, rebuilding the tracking process during implementation, and promoting grayscale audits afterwards. In this process, NetEase Cloud Music and NetEase Shufan jointly built systems such as the easyTracker event tracking management platform and the easyFetch self-service data access platform to ensure standardized tracking and self-service traffic data.
Video playback: https://www.bilibili.com/video/BV1To4y1C7i7
NetEase data product practice
Gu Ping, big data product expert at NetEase Shufan, shared the data product practice of NetEase Yanxuan, where he built the data product system and the data middle platform from 0 to 1. The Yanxuan business is moving toward a dual-engine model of "data middle platform support + data product driven", releasing the value of data to support the exploration of innovative businesses. Drawing on Yanxuan's business practice, Gu Ping described the ideas and steps for building a data product system covering marketing and supply chain, and shared related experience with the supporting data middle platform and data governance.
To support Yanxuan's "brand + platform" operating model, its data products cover three levels (digital operation, digital management, and digital supply) and comprise four major products: the commodity data operation platform, the marketing data operation platform, the mobile data workbench, and the supply chain data operation platform. The mobile data workbench was the first data product Yanxuan developed. It is aimed mainly at management, which helped drive the construction of the data product system from the top down. Gu Ping said that data products can be connected to business systems to provide anomaly monitoring, diagnosis, and decision-making suggestions, but that they cannot be realized without the support of the data middle platform. Drawing on NetEase's well-known capabilities, Yanxuan has built its data system efficiently and with high quality.
Video playback: https://www.bilibili.com/video/BV1Bb4y1C75t