Getui Transparent Storage Practice
Column-oriented storage is the mainstream choice for analytical data in big data scenarios. Compared with row-oriented storage, it reads only the columns a query needs, and because the values within a column are homogeneous, they encode and compress better. Getui is currently migrating its core data to columnar formats such as Parquet for higher I/O performance and lower storage costs.
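To make the columnar idea concrete, here is a minimal sketch assuming a local Spark session; the paths, columns, and data are hypothetical. It writes a small DataFrame as Parquet, then reads back a single column, which a columnar reader can satisfy without touching the rest of each row.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-demo")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Write a tiny, hypothetical dataset in columnar Parquet format.
    val df = Seq(("u1", 23, "cn"), ("u2", 31, "us")).toDF("uid", "age", "region")
    df.write.mode(SaveMode.Overwrite).parquet("/tmp/demo/users_parquet")

    // Selecting one column scans only that column's chunks, not whole rows,
    // and benefits from the homogeneous encoding within the column.
    spark.read.parquet("/tmp/demo/users_parquet").select("age").show()

    spark.stop()
  }
}
```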
Xiao De, a senior data R&D engineer on the cost-reduction and efficiency team of the Getui data department, walked through the full course of Getui's transparent storage optimization, focusing on the concept of transparent storage, the file reading process, and how transparent storage is implemented.
Q&A from the session:
Q1: How do you quantify and evaluate the benefits of transparent storage?
A: We quantify along two dimensions. The first is efficiency: whether data is used more efficiently, for example how quickly users (data analysts) can turn around business requests; measured across a mix of task types, transparent storage shortens run times by about 30%. The second is cost: whether resource consumption drops, which can be evaluated quantitatively through CPU core usage time and memory usage time.
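As a rough illustration of the cost dimension, the sketch below folds two YARN-style aggregate counters (vcore-seconds and MB-seconds) into a single comparable score. The weights and numbers are hypothetical, for illustration only, not Getui's figures.

```scala
// Hypothetical before/after comparison using YARN-style aggregate counters.
final case class TaskCost(vcoreSeconds: Long, mbSeconds: Long)

object CostCompare {
  // Assumption: a weighted sum is one simple way to fold the two counters
  // into one number; real accounting may price CPU and memory differently.
  def score(c: TaskCost, vcoreWeight: Double = 1.0, mbWeight: Double = 0.001): Double =
    c.vcoreSeconds * vcoreWeight + c.mbSeconds * mbWeight

  def main(args: Array[String]): Unit = {
    val before = TaskCost(vcoreSeconds = 7200, mbSeconds = 29491200L) // text format
    val after  = TaskCost(vcoreSeconds = 5000, mbSeconds = 20000000L) // Parquet
    val saving = 1.0 - score(after) / score(before)
    println(f"resource saving: ${saving * 100}%.1f%%")
  }
}
```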
Q2: How does transparent storage stay compatible with historical projects and switch their data formats?
A: For compatibility, we extended Hadoop's read/write API to automatically recognize and switch storage formats. For switching, we extended the Hadoop and Spark submission commands with hooks and introduced whitelists and blacklists of the data formats to switch, so that the relevant information is available when a task starts. Together, these let historical projects switch data formats without noticing.
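This is not Getui's actual code, but a minimal sketch of the whitelist/blacklist idea: one resolution point decides a path's storage format before any reader or writer is constructed, so callers keep a single API and never see the switch. All names and paths are hypothetical.

```scala
object FormatResolver {
  sealed trait Format
  case object TextFile extends Format
  case object Parquet  extends Format

  // Assumption: in practice these lists would come from a config service
  // consulted by the submission-time hook, not from hard-coded constants.
  private val whitelist = Set("/warehouse/profile", "/warehouse/label")
  private val blacklist = Set("/warehouse/raw_logs")

  // Blacklist wins, whitelist opts a path into the new format,
  // everything else stays on the legacy format.
  def resolve(path: String): Format =
    if (blacklist.exists(path.startsWith)) TextFile
    else if (whitelist.exists(path.startsWith)) Parquet
    else TextFile

  def main(args: Array[String]): Unit = {
    println(resolve("/warehouse/profile/dt=20230101"))  // Parquet
    println(resolve("/warehouse/raw_logs/dt=20230101")) // TextFile
  }
}
```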
The practical road of tag storage on the DIOS data intelligence platform
Relying on massive data resources and strong modeling capabilities, Getui has built more than 3,000 data tags into a rich, layered, multi-dimensional portrait tag system, which underpins data-insight services for industry customers such as refined APP operation and audience targeting for advertising.
Because business-side tag combinations are complex and varied, a hard problem we had to overcome was how, while computing over large-scale data to construct tags, to speed up tag computation and achieve second-level crowd selection and insight.
A senior data R&D engineer on Getui's data intelligence platform team drew on the development practice of DIOS to analyze in depth the core technical means for effectively improving the efficiency of tag storage and crowd selection.
Q&A from the session:
Q1: What is the difference between Spark's shuffle and Hadoop's (MapReduce) shuffle?
A: Functionally, there is almost no difference between the two. Both partition the data on the map side (in sorted-with-aggregation and non-aggregating variants), and then pull that data on the reduce side, or in the next scheduled stage in Spark's case, completing the transfer of data from the map side to the reduce side.
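As a concrete Spark-side illustration (a minimal sketch using the standard RDD API), the `reduceByKey` below partitions and locally combines records on the map side; the next stage then pulls its partition of the map outputs over the network.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Map side: each (word, 1) pair is routed to a partition by the
    // HashPartitioner and pre-aggregated locally; the reduce stage pulls
    // its partition of the map outputs and finishes the aggregation.
    val counts = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(new HashPartitioner(2), _ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```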
Q2: The live session mentioned that ClickHouse does not handle high concurrency well. Why is that? What should we pay attention to when writing to a ClickHouse cluster?
A: ClickHouse is fast because its engine processes queries in parallel at the lowest level; by default a single query uses half of the server's CPU cores, so high-concurrency scenarios are not a good fit. If you must support high concurrency, we recommend adding rate limiting at the query layer.
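ClickHouse also exposes server-side knobs such as `max_concurrent_queries` and the per-query `max_threads` setting, but a query-layer limiter keeps the policy in one place. Below is a minimal sketch of such a throttle, assuming the ClickHouse JDBC driver is on the classpath; the host, database, and permit count are placeholders.

```scala
import java.sql.DriverManager
import java.util.concurrent.Semaphore

object ThrottledClickHouse {
  // Assumption: allow at most 8 queries in flight; tune to the cluster.
  private val permits = new Semaphore(8)

  // Callers beyond the limit block here instead of piling onto ClickHouse.
  def query(sql: String): Unit = {
    permits.acquire()
    try {
      val conn = DriverManager.getConnection("jdbc:clickhouse://ch-host:8123/default")
      try {
        val rs = conn.createStatement().executeQuery(sql)
        while (rs.next()) println(rs.getString(1))
      } finally conn.close()
    } finally permits.release()
  }
}
```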
Improve IT resource efficiency and significantly reduce total IT investment
An effective way to reduce enterprise IT costs is to greatly improve the utilization efficiency of IT resources. A McKinsey research report shows that the average daily utilization of servers worldwide is usually below 10%, and a Flexera report shows that after migrating to the cloud, enterprises waste an average of 30% of their cloud spend. So how can IT resource efficiency be improved and total IT investment reduced?
Dr. Yang Shaohua from Beilian Zhuguan shared the core technical means that can effectively improve IT resource efficiency, such as big data task optimization and online/offline co-location.
Q&A from the session:
Q1: How do you implement online/offline co-location?
A: Implementations differ from company to company. We do it in three steps: first, schedule offline tasks onto online machines through k8s; second, use an Agent to dynamically adjust the quotas of online and offline resources; third, use kernel isolation facilities to isolate and intervene when necessary, for example throttling the resources of offline tasks in an emergency, which places some requirements on the machine's kernel version.
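As a minimal sketch of that emergency-throttling step, the snippet below writes a CPU quota into a cgroup v2 control file. It assumes cgroup v2 is mounted at /sys/fs/cgroup, a group named `offline` already exists, and the process has the privileges to write there; a production agent would also watch load and manage the memory and I/O controllers.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object OfflineThrottle {
  // Assumption: a pre-created cgroup v2 group holding all offline tasks.
  private val cpuMax = Paths.get("/sys/fs/cgroup/offline/cpu.max")

  // Cap the whole group at quotaUs microseconds of CPU per periodUs window.
  def setCpuQuota(quotaUs: Long, periodUs: Long = 100000L): Unit =
    Files.write(cpuMax, s"$quotaUs $periodUs".getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.WRITE)

  def main(args: Array[String]): Unit =
    setCpuQuota(quotaUs = 50000L) // emergency: limit offline tasks to half a core
}
```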
Q2: For Spark/Flink on k8s, should you introduce third-party scheduling plug-ins such as YuniKorn or Volcano, or develop similar components in-house to handle compute resource allocation and management?
A: Our solution is mainly Yarn on k8s, with Spark/Flink running on Yarn. The main consideration is how intrusive the change is for customer integrations: in most cases, the upper-level data development platforms are still connected to Yarn.