This article was first published on the Nebula Graph Community public account
This article is compiled from BIGO's talk at an nMeetup session and introduces BIGO's understanding and exploration of data management construction over the past year. The core of BIGO data management is a metadata platform that supports upper-layer data management and applications, including data maps, data modeling, data governance, and permission management. This article focuses on the following five topics:
- OneMeta infrastructure;
- Graph engine: replacing JanusGraph with Nebula Graph;
- Data asset platform applications;
- Ad hoc query;
- Future plans.
BIGO is an independent company established from the mobile new products division of JOYY (the parent company of YY Live). It is committed to building world-leading community-based live video applications and brands. Its flagship product, BIGO LIVE, topped Thailand's free app chart just one month after launch.
Data management platform
The picture above is an abstract view of the BIGO data asset management platform. As shown, the metadata platform stores technical metadata, business metadata, data lineage, data measurement, normative models, and permission data. On top of the metadata platform sits the application layer, which includes the data map, cost, data access, permission, governance, and modeling applications.
OneMeta Infrastructure
Business background
Currently BIGO data management faces the following problems:
- Metadata is messy and non-standard, with no unified search and management platform;
- Data has no lineage, so each development platform is a data island;
- Without business metadata, it is difficult for business teams to query data and unify definitions;
- Data permissions are managed coarsely, with a primitive application and approval process.
In short: no standards, no data connections, no business metadata, and no fine-grained permission control. To solve these problems, BIGO built the OneMeta metadata management platform.
The platform capabilities of OneMeta are as follows:
- Stores and manages global metadata in real time, building a unified catalog of personal and team data assets across the company.
- Powers applications such as the data map, data search, data governance, lineage, permission management, and standard models.
- Supports storing metadata and lineage from Hive, HDFS, Oozie, ClickHouse, Baina, Spark SQL, Kyuubi, Kafka, and more.
- Accurately measures business metadata: regularly updates information such as operation counts, hot/cold status, and business ownership of each piece of metadata.
Metadata Platform Architecture
The BIGO metadata platform is built mainly on Apache Atlas, Vesoft Nebula Graph, Yandex ClickHouse, and BIGO's internally developed DataCollector, and can be divided into four layers.
The top layer (blue) is the data collection sources and the DataCollector service. Technical metadata from each platform is collected in real time through hooks, while business metadata is updated hourly or daily by the DataCollector service. The second layer (orange) is the message queue and API layer, which channels data into Atlas. The third layer (green) is Atlas, the core metadata management layer: all metadata, attribute information, and lineage are managed in Atlas, and this layer also provides interfaces for applications to call. The bottom layer (purple) is the storage layer, which mainly uses Nebula Graph, Elasticsearch, and ClickHouse: the main metadata is stored in Nebula Graph, data that requires full-text indexing is synced from Nebula Graph to Elasticsearch, and historical trends or aggregated data are read from ClickHouse.
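The hook-to-Kafka flow described above can be sketched as follows. The event shape, field names, and topic are illustrative assumptions, not Atlas' or BIGO's actual hook message format.

```python
import json
import time

# A sketch of the kind of metadata event a platform hook might publish
# to Kafka for Atlas to consume. Field names are assumptions.
def build_hive_table_event(db, table, columns):
    event = {
        "typeName": "hive_table",               # entity type in Atlas
        "qualifiedName": f"{db}.{table}@prod",  # unique key for the entity
        "attributes": {"name": table, "db": db, "columns": columns},
        "eventTime": int(time.time() * 1000),   # collection timestamp (ms)
    }
    return json.dumps(event)

# In production this JSON would be sent to the Kafka topic that Atlas
# consumes, e.g. producer.send(topic, payload.encode("utf-8")).
payload = build_hive_table_event("ods", "user_login", ["uid", "ts"])
```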
Apache Atlas optimization
This part explains the optimizations BIGO has made on top of open source Atlas, mainly in the following aspects:
- Built auditing capabilities via Spring Boot aspects.
- Introduced Micrometer and Prometheus for monitoring and alerting.
- Extracted the graph engine dependency and added support for reading and writing the distributed Nebula Graph engine, adding 30,000+ lines of code.
- Added rate limiting and black/white list functions to control burst traffic and malicious access and keep the system stable.
- Added a periodic task to clean up stale Process data and avoid data bloat.
- Added a graceful shutdown mechanism to Atlas to prevent losing consumed messages on restart.
- Rebuilt the lineage DAG display, improving the visual experience and avoiding slow rendering of large graphs.
- Added support for lineage-related scheduling-engine workflows, solving the crucial "where data comes from and where it goes" part of data lineage.
- Hook expansion: added metadata collection for Oozie, Kyuubi, Baina, ClickHouse, and Kafka.
- Fixed several bugs in the native version's code.
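One of the items above, access rate control with black/white lists, can be sketched as a sliding-window limiter. The class name, data structures, and limits below are invented for illustration and are not BIGO's implementation.

```python
import time
from collections import deque

# Sliding-window sketch of rate limiting plus black/white lists.
class AccessGuard:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = {}          # caller -> deque of request timestamps
        self.blacklist = set()
        self.whitelist = set()

    def allow(self, caller, now=None):
        if caller in self.blacklist:
            return False        # malicious callers are always rejected
        if caller in self.whitelist:
            return True         # trusted callers bypass rate limiting
        if now is None:
            now = time.monotonic()
        q = self.hits.setdefault(caller, deque())
        while q and now - q[0] > self.window:
            q.popleft()         # drop hits outside the sliding window
        if len(q) >= self.max_requests:
            return False        # burst traffic over the limit
        q.append(now)
        return True
```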
Data Collector features
An important component of the BIGO metadata platform is DataCollector, a data collection service. Its main job is to periodically collect and update business metadata that serves upper-layer applications such as data measurement: for example, the daily access counts and visitors of Hive tables, the storage size of HDFS paths, the business line each piece of metadata belongs to, its hot/cold status, and its actual owner. DataCollector also handles data cleanup (lifecycle TTL) and synchronizes metadata from the data access layer (Baina).
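The hot/cold judgment mentioned above could, for example, be derived from recent access counts. A minimal sketch; the 30-day window and thresholds are assumptions, not BIGO's actual rules.

```python
# Classify a table as hot/warm/cold from its recent daily access counts.
# Window size and thresholds are illustrative assumptions.
def classify_temperature(daily_access_counts):
    recent = daily_access_counts[-30:]   # last 30 days of counts
    total = sum(recent)
    if total == 0:
        return "cold"                    # untouched data
    if total >= 100:
        return "hot"                     # frequently accessed data
    return "warm"
```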
Graph Engine Replacement
Atlas' native graph engine is JanusGraph. In use, we found the following defects. First, the built-in JanusGraph engine that Atlas relies on is a single point of failure and becomes a computational bottleneck as concurrency grows. Second, JanusGraph relies on Solr to build indexes; although it claims Solr can be replaced with Elasticsearch, this causes many problems in practice, and BIGO has no in-house Solr expertise, which adds labor cost. Third, JanusGraph's search performance is poor at large data volumes, and it has hard-to-fix bugs where data occasionally cannot be found. Fourth, JanusGraph lacks strong open source community support and an internal support team, so maintenance costs are high.
Now for the advantages of replacing JanusGraph with Nebula Graph. First, our business and operations engineers verified through testing that Nebula Graph's graph exploration performance is more than N times that of JanusGraph. Second, Nebula Graph is a distributed graph database: both computation and storage scale horizontally, and it supports high concurrency. In addition, the Nebula Graph open source community is active and the product iterates continuously, supporting data volumes of hundreds of billions of vertices and trillions of edges. Finally, there is a team inside BIGO that cooperates on supporting, maintaining, and developing the Nebula Graph platform.
Challenges & Solutions for Graph Engine Replacement
Although we settled on replacing JanusGraph with Nebula Graph during selection, the actual replacement still posed certain challenges.
First, in terms of the data model, Nebula Graph is a strong-schema database; to replace the weakly typed JanusGraph, the concepts of Tag and Edge have to be weakened. Second, the two differ in data type support: native Nebula Graph has limited support for complex types such as MAP and LIST. Another issue is index design: in Nebula Graph, an index does not accelerate queries but is a prerequisite for LOOKUP-style searches. In addition, Nebula Graph itself does not support transactions, which added a lot of work for us. The last point is the change in usage habits: Nebula Graph uses its own query language, nGQL, whereas JanusGraph supports queries through the Java API and Gremlin.
How were these problems solved? For strong/weak type conversion, BIGO modified the core Atlas code, adding parameters that dynamically determine the DDL data type: when writing data or executing a query, the data type used by the nGQL operation is decided by specific parameters. For data type support, the Atlas business layer customizes data serialization to support complex types. For native index search, standalone and composite indexes are created automatically at system initialization to solve Atlas search. For transactions, a semi-transactional interface was added to the Atlas business layer to reduce the probability of data errors in the Nebula Graph storage layer.
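The custom serialization for complex types can be sketched as flattening MAP/LIST values into JSON strings before writing to Nebula Graph. The encoding convention below is an assumption for illustration, not the actual Atlas-layer format.

```python
import json

# Flatten complex values (dict/list) to JSON strings; pass primitives
# through unchanged. Encoding convention is an illustrative assumption.
def serialize_attribute(value):
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

# Best-effort inverse: decode JSON strings back to dict/list, otherwise
# return the raw value as-is.
def deserialize_attribute(raw):
    if isinstance(raw, str):
        try:
            decoded = json.loads(raw)
            if isinstance(decoded, (dict, list)):
                return decoded
        except ValueError:
            pass
    return raw
```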
Retrofit of Atlas and Nebula Graph
Here we focus on the changes BIGO made to Atlas and Nebula Graph while replacing the graph engine.
During the Atlas retrofit, BIGO added 30,000+ lines of code to decouple Atlas from the native JanusGraph engine and support reading and writing the distributed Nebula Graph engine. By reworking the full-text index, the Atlas layer can run multi-condition filter queries concurrently and intersect the results, improving search speed. Atlas' multi-attribute updates were also made concurrent, accelerating metadata writes. To prevent the loss of consumed messages, BIGO added a graceful shutdown mechanism to the Atlas layer so that restarts do not drop messages. In addition, Atlas supports complex data types through custom (de)serialization methods. As for the transaction support mentioned above, Vertex#openTransaction/Vertex#commit interfaces were added to the Atlas layer to support semi-transactions, reducing errors caused by Nebula Graph's lack of transaction rollback. Finally, large numbers of standalone indexes were merged into composite indexes at the Atlas layer, and system initialization was sped up by creating default indexes and attributes.
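The semi-transaction idea can be sketched as buffering writes and flushing them together on commit, so a failure before commit leaves storage untouched; since Nebula Graph cannot roll back, a failure mid-flush can still leave partial data, and the approach only narrows the error window. The class and method names below are illustrative, not the actual Atlas interface.

```python
# Buffer writes locally; nothing reaches storage until commit().
class SemiTransaction:
    def __init__(self, execute):
        self._execute = execute   # callable that runs one nGQL statement
        self._buffer = []

    def add(self, ngql):
        self._buffer.append(ngql)

    def commit(self):
        # Flush buffered statements in order; an exception stops the
        # flush, so errors before commit cannot corrupt storage, but a
        # mid-flush failure can still leave partial writes.
        for stmt in self._buffer:
            self._execute(stmt)
        self._buffer.clear()
```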
BIGO also modified Nebula Graph itself. The first change was making the LOOKUP clause support concurrent execution; in testing, the latency of scanning one million rows dropped from 8s to 1s. Paginated LOOKUP queries from Elasticsearch are also supported. The rest of the work focused on Elasticsearch: BIGO added support for updating and deleting data in the backend Elasticsearch, and the Listener now supports commit snapshots and full data updates. The REBUILD operation is supported for full-text indexes, and the REBUILD INDEX privilege is granted to the admin user. BIGO also added the ability to create and delete full-text indexes independently, avoiding writing all columns to Elasticsearch and inflating its storage. Finally, Nebula Graph's periodic compaction was tuned to reduce performance fluctuations at the upper layer.
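Making a large LOOKUP-style scan concurrent follows a generic pattern: split the scan into parts, run the parts in parallel, and merge the results. The sketch below shows only the pattern; the partitioning scheme is illustrative and says nothing about Nebula Graph's internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Run scan_part(i) for each partition i in parallel and merge results.
# scan_part is a caller-supplied function returning a list of rows.
def concurrent_scan(scan_part, num_parts=4):
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        # map() preserves partition order in the returned iterator
        results = pool.map(scan_part, range(num_parts))
    merged = []
    for part in results:
        merged.extend(part)
    return merged
```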
Search performance after replacement
The figure above shows BIGO's search performance after replacing JanusGraph with Nebula Graph. The P99 latency exceeds 2s because there is always some very large search that slows things down. After the replacement, search speed improved more than 5x, with results that used to take 5s now returning in under 1s. The occasional missing-data problem is gone, the system no longer requires separate index maintenance, and it supports high concurrency and very large data volumes.
Data Asset Platform Application
I covered the underlying unified metadata platform architecture earlier; here is more detail. As shown in the figure above, the lower layer is the unified metadata platform and the upper layer is the product application layer. The lower left is the real-time metadata storage module: data sources such as Hive, Kyuubi, Oozie, Baina, ClickHouse, and HDFS write to the unified metadata platform via hooks through the Kafka message queue. The core component of the platform is Atlas, which relies on ClickHouse and Nebula Graph. The platform also depends on BIGO's internal OneSQL platform, a unified SQL query engine, while Ranger handles permission control.
The data asset platform connects to the application layer above, including the REST interface, data map, real-time lineage, ad hoc query, data warehouse modeling, visual table creation, offboarding handover, permission management, and other applications.
The following sections focus on these applications.
Data map
The picture above shows the data map's search view (in part). It supports search and discovery across global metadata (Hive, HDFS, ClickHouse, Baina; more data sources are being added), result sorting and downloading, filtering, and advanced search. In the search interface, the filter conditions are on the left, the search box is at the top, and search results are displayed below.
The picture above shows the data map's details view (in part). After locating specific metadata through search, you can click through to view basic information such as technical details, data measurement, business ownership, lifecycle, and historical trends. Taking the Hive metadata details in the figure as an example, the page is divided into basic information at the top, with detailed information, lineage, and data preview below.
The basic information shows 107 queries yesterday, a data size of 1.27 TiB, and 3.8 TiB of occupied space... In addition, the data lifecycle and business line can be managed through the [Edit] operation.
The detailed information (lower left) lists field information: Hive table fields, field types, and field descriptions for product and operations use. The lower right shows partition field information: if the data is a partitioned table, this module displays the partition fields; if it is a full table, no partition fields are shown. The lineage and data preview functions are not described in detail here.
Real-time lineage
In the real-time lineage module, BIGO rebuilt the directed acyclic graph (DAG) display, added a data table view, implemented lazy loading, association, and workflow search, and shows workflow execution state in real time in the business layer's lineage graph.
The lineage module supports two views: chart and visualization. The figure above uses the visualization mode. In the visual view, you can select nodes among the upstream and downstream nodes, and a [Show Process] toggle displays or hides the workflow process. For example, if table b is generated from table a by a workflow, turning [Show Process] on displays the generation process, and turning it off hides the process data.
The figure above shows the core lineage module, displaying the upstream and downstream of a piece of metadata. The floating depth and filter options on the left let you choose the number of upstream and downstream layers (depth) centered on the metadata. For example, the figure selects two layers of upstream and downstream nodes around the tiki_core_india... data.
Data governance
The data governance section mainly shows TTL management, which is used to manage the lifecycle of tables.
The picture above is a screenshot of the TTL management part of data governance. From the data map details described earlier, clicking the [Edit] button manages the data's TTL lifecycle. Besides lifecycle management, BIGO data governance has other functions not detailed here.
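The TTL cleanup behind this page can be sketched generically as a sweep that selects partitions older than the configured lifecycle; the function and parameter names below are illustrative.

```python
import datetime

# Given each partition's date and a TTL in days, return the partitions
# that have exceeded their lifecycle and should be cleaned up.
def expired_partitions(partition_dates, ttl_days, today):
    cutoff = today - datetime.timedelta(days=ttl_days)
    return sorted(p for p, d in partition_dates.items() if d < cutoff)
```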
Data modeling
For data modeling, the unified metadata platform provides SQL scripts to create table models for interactive use by data warehouse developers and data analysts.
Figure: SQL model data
Figure: data modeling portal
Monitoring dashboard
BIGO's internal monitoring dashboard displays company data in real time, including total resources, each business line's share of resources, trends, and popular resources, helping teams and business lines optimize costs.
Figure: screenshot of the anonymized dashboard
In addition to the applications above, the data asset management platform also includes template access, permission management, offboarding handover, group management, data preview, favorites and downloads, and more.
Ad hoc query
Business background
BIGO originally used Cloudera HUE as its ad hoc query platform, but for various reasons HUE has long failed to meet internal query needs, and users frequently complain that it is hard to use and unstable. The main reasons:
- The code is outdated and essentially abandoned upstream;
- For historical reasons, at least six BIGO employees have taken over HUE in turn, accumulating a lot of internal code;
- HUE is very cumbersome to operate and maintain;
- The editing window does not meet users' actual needs;
- BIGO has built a unified SQL router internally, so users no longer need to choose an execution engine.
For these reasons, we decided to develop a new SQL query platform, unifying metadata management and data query on the asset management platform and adding the following features:
- Build a unified SQL router (OneSQL) that automatically routes SQL to the backend Spark SQL / Hive / Presto / Flink SQL engine for execution;
- Provide a new query entry for users to execute SQL, and standardize DDL statements to facilitate data governance, permission management, and cost control;
- Design a new editing window with multi-tab interaction based on product research;
- Provide table creation and data loading through SQL or through a visual interface;
- Automatically adapt to mobile devices such as tablets and phones, making it convenient for colleagues at home and abroad;
- Add daily user access auditing and comprehensive monitoring and alerting;
- Support querying ClickHouse data from the unified portal (planned).
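A rule-based router in the spirit of OneSQL might look like the sketch below; the routing rules are invented for illustration and are not BIGO's actual routing logic.

```python
# Route a SQL statement to a backend engine by simple rules.
# All rules here are illustrative assumptions.
def route_sql(sql):
    s = sql.strip().lower()
    if s.startswith(("create", "alter", "drop")):
        return "hive"        # DDL goes to Hive for governance
    if "stream" in s:
        return "flinksql"    # streaming statements go to Flink SQL
    if len(s) < 200 and "join" not in s:
        return "presto"      # short interactive queries
    return "sparksql"        # heavy batch queries by default
```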
Future outlook
Future planning for the data platform focuses on three aspects: metadata construction, product enhancement, and business empowerment.
For metadata construction, the plan is to cover metadata from all platforms at the access, computing, scheduling, and storage layers. We are also planning computing resource governance and one-stop task development (Python/Jar/Shell/SQL).
Product enhancement is divided into four parts: governance, cost, efficiency, and application. Governance strengthens data governance capabilities and automatically manages unhealthy data. Cost helps every team in the company close the loop on cost analysis and provides cost optimization. Efficiency improves users' productivity through standardized table creation, accurate measurement, query optimization, and lineage construction. Application further improves the integrated ad hoc query and distributed scheduling functions to enhance user experience.
For business empowerment, intelligent attribution diagnosis will enable business teams to analyze problems automatically. In addition, templated data retrieval will be iteratively optimized, enabling business teams to easily tap more data value.
That concludes this sharing of BIGO's data platform practice.