This article was first published on the Nebula Graph Community public account
This article is compiled from BIGO's talk at an nMeetup session and introduces BIGO's understanding and exploration of data management construction over the past year. The core of BIGO data management is a metadata platform that supports upper-layer data management and applications, including data maps, data modeling, data governance, and permission management. This article focuses on the following five topics:
- OneMeta infrastructure;
- Graph engine: replacing JanusGraph with Nebula Graph;
- Data asset platform applications;
- Ad hoc query;
- Future plans.
BIGO is an independent company established from the mobile new products division of JOYY (the parent company of YY Live). It is committed to building world-leading community-based live video applications and brands. Its flagship product, BIGO LIVE, topped Thailand's free app chart just one month after launch.
Data management platform
The picture above is an abstract view of the BIGO data asset management platform. As shown, the metadata platform stores technical metadata, business metadata, data lineage, data measurement, normative models, and permission data. On top of the metadata platform sits the application layer, which includes the data map, cost, data access, permission, governance, and modeling applications.
OneMeta Infrastructure
Business background
Currently BIGO data management faces the following problems:
- Metadata is messy and non-standard, with no unified search and management platform;
- Data has no lineage, so each development platform is a data island;
- Without business metadata, it is difficult for business teams to query data and unify definitions;
- Data permissions are managed coarsely, with a primitive application and approval process.
In short: no standards, no data connections, no business metadata, and no fine-grained permission control. To solve these problems, BIGO built the OneMeta metadata management platform.
The platform capabilities of OneMeta are as follows:
- Stores and manages global metadata in real time, building a unified catalog of personal and team data assets across the company.
- Powers applications such as the data map, data search, data governance, lineage, permission management, and standard models.
- Supports storing metadata and lineage from Hive, HDFS, Oozie, ClickHouse, Baina, Spark SQL, Kyuubi, Kafka, and more.
- Accurately measures business metadata: regularly updates information such as operation counts, hot/cold status, and business ownership of each piece of metadata.
Metadata Platform Architecture
The BIGO metadata platform is built mainly on Apache Atlas, Vesoft Nebula Graph, Yandex ClickHouse, and BIGO's internally developed DataCollector, and can be divided into four layers.
The top layer (blue) is the data collection sources and the DataCollector service. Technical metadata from each platform is collected in real time through hooks, while business metadata is updated hourly or daily by the DataCollector service. The second layer (orange) is the message queue and API layer, which channels data into Atlas. The third layer (green) is Atlas, the core metadata management layer: all metadata, attribute information, and lineage are managed in Atlas, and this layer also provides interfaces for applications to call. The bottom layer (purple) is the storage layer, which mainly uses Nebula Graph, Elasticsearch, and ClickHouse: the main metadata is stored in Nebula Graph, data that requires full-text indexing is synced from Nebula Graph to Elasticsearch, and historical trends or aggregated data are read from ClickHouse.
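The hook-to-Kafka flow described above can be sketched as follows. The event shape, field names, and topic are illustrative assumptions, not Atlas' or BIGO's actual hook message format.

```python
import json
import time

# A sketch of the kind of metadata event a platform hook might publish
# to Kafka for Atlas to consume. Field names are assumptions.
def build_hive_table_event(db, table, columns):
    event = {
        "typeName": "hive_table",               # entity type in Atlas
        "qualifiedName": f"{db}.{table}@prod",  # unique key for the entity
        "attributes": {"name": table, "db": db, "columns": columns},
        "eventTime": int(time.time() * 1000),   # collection timestamp (ms)
    }
    return json.dumps(event)

# In production this JSON would be sent to the Kafka topic that Atlas
# consumes, e.g. producer.send(topic, payload.encode("utf-8")).
payload = build_hive_table_event("ods", "user_login", ["uid", "ts"])
```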
Apache Atlas optimization
This part explains the optimizations BIGO has made on top of open source Atlas, mainly in the following aspects:
- Built auditing capabilities via Spring Boot aspects.
- Introduced Micrometer and Prometheus for monitoring and alerting.
- Extracted the graph engine dependency and added support for reading and writing the distributed Nebula Graph engine, adding 30,000+ lines of code.
- Added rate limiting and black/white list functions to control burst traffic and malicious access and keep the system stable.
- Added a periodic task to clean up stale Process data and avoid data bloat.
- Added a graceful shutdown mechanism to Atlas to prevent losing consumed messages on restart.
- Rebuilt the lineage DAG display, improving the visual experience and avoiding slow rendering of large graphs.
- Added support for lineage-related scheduling-engine workflows, solving the crucial "where data comes from and where it goes" part of data lineage.
- Hook expansion: added metadata collection for Oozie, Kyuubi, Baina, ClickHouse, and Kafka.
- Fixed several bugs in the native version's code.
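One of the items above, access rate control with black/white lists, can be sketched as a sliding-window limiter. The class name, data structures, and limits below are invented for illustration and are not BIGO's implementation.

```python
import time
from collections import deque

# Sliding-window sketch of rate limiting plus black/white lists.
class AccessGuard:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = {}          # caller -> deque of request timestamps
        self.blacklist = set()
        self.whitelist = set()

    def allow(self, caller, now=None):
        if caller in self.blacklist:
            return False        # malicious callers are always rejected
        if caller in self.whitelist:
            return True         # trusted callers bypass rate limiting
        if now is None:
            now = time.monotonic()
        q = self.hits.setdefault(caller, deque())
        while q and now - q[0] > self.window:
            q.popleft()         # drop hits outside the sliding window
        if len(q) >= self.max_requests:
            return False        # burst traffic over the limit
        q.append(now)
        return True
```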
Data Collector features
An important component of the BIGO metadata platform is DataCollector, a data collection service. Its main job is to periodically collect and update business metadata that serves upper-layer applications such as data measurement: for example, the daily access counts and visitors of Hive tables, the storage size of HDFS paths, the business line each piece of metadata belongs to, its hot/cold status, and its actual owner. DataCollector also handles data cleanup (lifecycle TTL) and synchronizes metadata from the data access layer (Baina).
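The hot/cold judgment mentioned above could, for example, be derived from recent access counts. A minimal sketch; the 30-day window and thresholds are assumptions, not BIGO's actual rules.

```python
# Classify a table as hot/warm/cold from its recent daily access counts.
# Window size and thresholds are illustrative assumptions.
def classify_temperature(daily_access_counts):
    recent = daily_access_counts[-30:]   # last 30 days of counts
    total = sum(recent)
    if total == 0:
        return "cold"                    # untouched data
    if total >= 100:
        return "hot"                     # frequently accessed data
    return "warm"
```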
Graph Engine Replacement
Atlas' native graph engine is JanusGraph. In use, we found the following defects. First, the built-in JanusGraph engine that Atlas relies on is a single point of failure and becomes a computational bottleneck as concurrency grows. Second, JanusGraph relies on Solr to build indexes; although it claims Solr can be replaced with Elasticsearch, this causes many problems in practice, and BIGO has no in-house Solr expertise, which adds labor cost. Third, JanusGraph's search performance is poor at large data volumes, and it has hard-to-fix bugs where data occasionally cannot be found. Fourth, JanusGraph lacks strong open source community support and an internal support team, so maintenance costs are high.
Now for the advantages of replacing JanusGraph with Nebula Graph. First, our business and operations engineers verified through testing that Nebula Graph's graph exploration performance is more than N times that of JanusGraph. Second, Nebula Graph is a distributed graph database: both computation and storage scale horizontally, and it supports high concurrency. In addition, the Nebula Graph open source community is active and the product iterates continuously, supporting data volumes of hundreds of billions of vertices and trillions of edges. Finally, there is a team inside BIGO that cooperates on supporting, maintaining, and developing the Nebula Graph platform.
Challenges & Solutions for Graph Engine Replacement
Although we settled on replacing JanusGraph with Nebula Graph during selection, the actual replacement still posed certain challenges.
First, in terms of the data model, Nebula Graph is a strong-schema database; to replace the weakly typed JanusGraph, the concepts of Tag and Edge have to be weakened. Second, the two differ in data type support: native Nebula Graph has limited support for complex types such as MAP and LIST. Another issue is index design: in Nebula Graph, an index does not accelerate queries but is a prerequisite for LOOKUP-style searches. In addition, Nebula Graph itself does not support transactions, which added a lot of work for us. The last point is the change in usage habits: Nebula Graph uses its own query language, nGQL, whereas JanusGraph supports queries through the Java API and Gremlin.
How were these problems solved? For strong/weak type conversion, BIGO modified the core Atlas code, adding parameters that dynamically determine the DDL data type: when writing data or executing a query, the data type used by the nGQL operation is decided by specific parameters. For data type support, the Atlas business layer customizes data serialization to support complex types. For native index search, standalone and composite indexes are created automatically at system initialization to solve Atlas search. For transactions, a semi-transactional interface was added to the Atlas business layer to reduce the probability of data errors in the Nebula Graph storage layer.
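The custom serialization for complex types can be sketched as flattening MAP/LIST values into JSON strings before writing to Nebula Graph. The encoding convention below is an assumption for illustration, not the actual Atlas-layer format.

```python
import json

# Flatten complex values (dict/list) to JSON strings; pass primitives
# through unchanged. Encoding convention is an illustrative assumption.
def serialize_attribute(value):
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

# Best-effort inverse: decode JSON strings back to dict/list, otherwise
# return the raw value as-is.
def deserialize_attribute(raw):
    if isinstance(raw, str):
        try:
            decoded = json.loads(raw)
            if isinstance(decoded, (dict, list)):
                return decoded
        except ValueError:
            pass
    return raw
```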
Retrofit of Atlas and Nebula Graph
Here we focus on the changes BIGO made to Atlas and Nebula Graph while replacing the graph engine.
During the Atlas retrofit, BIGO added 30,000+ lines of code to decouple Atlas from the native JanusGraph engine and support reading and writing the distributed Nebula Graph engine. By reworking the full-text index, the Atlas layer can run multi-condition filter queries concurrently and intersect the results, improving search speed. Atlas' multi-attribute updates were also made concurrent, accelerating metadata writes. To prevent the loss of consumed messages, BIGO added a graceful shutdown mechanism to the Atlas layer so that restarts do not drop messages. In addition, Atlas supports complex data types through custom (de)serialization methods. As for the transaction support mentioned above, Vertex#openTransaction/Vertex#commit interfaces were added to the Atlas layer to support semi-transactions, reducing errors caused by Nebula Graph's lack of transaction rollback. Finally, large numbers of standalone indexes were merged into composite indexes at the Atlas layer, and system initialization was sped up by creating default indexes and attributes.
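The semi-transaction idea can be sketched as buffering writes and flushing them together on commit, so a failure before commit leaves storage untouched; since Nebula Graph cannot roll back, a failure mid-flush can still leave partial data, and the approach only narrows the error window. The class and method names below are illustrative, not the actual Atlas interface.

```python
# Buffer writes locally; nothing reaches storage until commit().
class SemiTransaction:
    def __init__(self, execute):
        self._execute = execute   # callable that runs one nGQL statement
        self._buffer = []

    def add(self, ngql):
        self._buffer.append(ngql)

    def commit(self):
        # Flush buffered statements in order; an exception stops the
        # flush, so errors before commit cannot corrupt storage, but a
        # mid-flush failure can still leave partial writes.
        for stmt in self._buffer:
            self._execute(stmt)
        self._buffer.clear()
```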
BIGO also modified Nebula Graph itself. The first change was making the LOOKUP clause support concurrent execution; in testing, the latency of scanning one million rows dropped from 8s to 1s. Paginated LOOKUP queries from Elasticsearch are also supported. The rest of the work focused on Elasticsearch: BIGO added support for updating and deleting data in the backend Elasticsearch, and the Listener now supports commit snapshots and full data updates. The REBUILD operation is supported for full-text indexes, and the REBUILD INDEX privilege is granted to the admin user. BIGO also added the ability to create and delete full-text indexes independently, avoiding writing all columns to Elasticsearch and inflating its storage. Finally, Nebula Graph's periodic compaction was tuned to reduce performance fluctuations at the upper layer.
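Making a large LOOKUP-style scan concurrent follows a generic pattern: split the scan into parts, run the parts in parallel, and merge the results. The sketch below shows only the pattern; the partitioning scheme is illustrative and says nothing about Nebula Graph's internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Run scan_part(i) for each partition i in parallel and merge results.
# scan_part is a caller-supplied function returning a list of rows.
def concurrent_scan(scan_part, num_parts=4):
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        # map() preserves partition order in the returned iterator
        results = pool.map(scan_part, range(num_parts))
    merged = []
    for part in results:
        merged.extend(part)
    return merged
```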
Search performance after replacement
The figure above shows BIGO's search performance after replacing JanusGraph with Nebula Graph. The P99 latency exceeds 2s because there is always some very large search that slows things down. After the replacement, search speed improved more than 5x, with results that used to take 5s now returning in under 1s. The occasional missing-data problem is gone, the system no longer requires separate index maintenance, and it supports high concurrency and very large data volumes.
Data Asset Platform Application
I covered the underlying unified metadata platform architecture earlier; here is more detail. As shown in the figure above, the lower layer is the unified metadata platform and the upper layer is the product application layer. The lower left is the real-time metadata storage module: data sources such as Hive, Kyuubi, Oozie, Baina, ClickHouse, and HDFS write to the unified metadata platform via hooks through the Kafka message queue. The core component of the platform is Atlas, which relies on ClickHouse and Nebula Graph. The platform also depends on BIGO's internal OneSQL platform, a unified SQL query engine, while Ranger handles permission control.
The data asset platform connects to the application layer above, including the REST interface, data map, real-time lineage, ad hoc query, data warehouse modeling, visual table creation, offboarding handover, permission management, and other applications.
The following sections focus on these applications.
Data map
The picture above shows the data map's search view (in part). It supports search and discovery across global metadata (Hive, HDFS, ClickHouse, Baina; more data sources are being added), result sorting and downloading, filtering, and advanced search. In the search interface, the filter conditions are on the left, the search box is at the top, and search results are displayed below.
The picture above shows the data map's details view (in part). After locating specific metadata through search, you can click through to view basic information such as technical details, data measurement, business ownership, lifecycle, and historical trends. Taking the Hive metadata details in the figure as an example, the page is divided into basic information at the top, with detailed information, lineage, and data preview below.
The basic information shows 107 queries yesterday, a data size of 1.27 TiB, and 3.8 TiB of occupied space... In addition, the data lifecycle and business line can be managed through the [Edit] operation.
The detailed information (lower left) lists field information: Hive table fields, field types, and field descriptions for product and operations use. The lower right shows partition field information: if the data is a partitioned table, this module displays the partition fields; if it is a full table, no partition fields are shown. The lineage and data preview functions are not described in detail here.
Real-time lineage
In the real-time lineage module, BIGO rebuilt the directed acyclic graph (DAG) display, added a data table view, implemented lazy loading, association, and workflow search, and shows workflow execution state in real time in the business layer's lineage graph.
The lineage module supports two views: chart and visualization. The figure above uses the visualization mode. In the visual view, you can select nodes among the upstream and downstream nodes, and a [Show Process] toggle displays or hides the workflow process. For example, if table b is generated from table a by a workflow, turning [Show Process] on displays the generation process, and turning it off hides the process data.
The figure above shows the core lineage module, displaying the upstream and downstream of a piece of metadata. The floating depth and filter options on the left let you choose the number of upstream and downstream layers (depth) centered on the metadata. For example, the figure selects two layers of upstream and downstream nodes around the tiki_core_india... data.
Data governance
The data governance section mainly shows TTL management, which is used to manage the lifecycle of tables.
The picture above is a screenshot of the TTL management part of data governance. From the data map details described earlier, clicking the [Edit] button manages the data's TTL lifecycle. Besides lifecycle management, BIGO data governance has other functions not detailed here.
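The TTL cleanup behind this page can be sketched generically as a sweep that selects partitions older than the configured lifecycle; the function and parameter names below are illustrative.

```python
import datetime

# Given each partition's date and a TTL in days, return the partitions
# that have exceeded their lifecycle and should be cleaned up.
def expired_partitions(partition_dates, ttl_days, today):
    cutoff = today - datetime.timedelta(days=ttl_days)
    return sorted(p for p, d in partition_dates.items() if d < cutoff)
```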
Data modeling
For data modeling, the unified metadata platform provides SQL scripts to create table models for interactive use by data warehouse developers and data analysts.
Figure: SQL model data
Figure: data modeling portal
Monitoring dashboard
BIGO's internal monitoring dashboard displays company data in real time, including total resources, each business line's share of resources, trends, and popular resources, helping teams and business lines optimize costs.
Figure: screenshot of the anonymized dashboard
In addition to the applications above, the data asset management platform also includes template access, permission management, offboarding handover, group management, data preview, favorites and downloads, and more.
Ad hoc query
Business background
BIGO originally used Cloudera HUE as its ad hoc query platform, but for various reasons HUE has long failed to meet internal query needs, and users frequently complain that it is hard to use and unstable. The main reasons:
- The code is outdated and essentially abandoned upstream;
- For historical reasons, at least six BIGO employees have taken over HUE in turn, accumulating a lot of internal code;
- HUE is very cumbersome to operate and maintain;
- The editing window does not meet users' actual needs;
- BIGO has built a unified SQL router internally, so users no longer need to choose an execution engine.
For these reasons, we decided to develop a new SQL query platform, unifying metadata management and data query on the asset management platform and adding the following features:
- Build a unified SQL router (OneSQL) that automatically routes SQL to the backend Spark SQL / Hive / Presto / Flink SQL engine for execution;
- Provide a new query entry for users to execute SQL, and standardize DDL statements to facilitate data governance, permission management, and cost control;
- Design a new editing window with multi-tab interaction based on product research;
- Provide table creation and data loading through SQL or through a visual interface;
- Automatically adapt to mobile devices such as tablets and phones, making it convenient for colleagues at home and abroad;
- Add daily user access auditing and comprehensive monitoring and alerting;
- Support querying ClickHouse data from the unified portal (planned).
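A rule-based router in the spirit of OneSQL might look like the sketch below; the routing rules are invented for illustration and are not BIGO's actual routing logic.

```python
# Route a SQL statement to a backend engine by simple rules.
# All rules here are illustrative assumptions.
def route_sql(sql):
    s = sql.strip().lower()
    if s.startswith(("create", "alter", "drop")):
        return "hive"        # DDL goes to Hive for governance
    if "stream" in s:
        return "flinksql"    # streaming statements go to Flink SQL
    if len(s) < 200 and "join" not in s:
        return "presto"      # short interactive queries
    return "sparksql"        # heavy batch queries by default
```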
Future outlook
Future planning for the data platform focuses on three aspects: metadata construction, product enhancement, and business empowerment.
For metadata construction, the plan is to cover metadata from all platforms at the access, computing, scheduling, and storage layers. We are also planning computing resource governance and one-stop task development (Python/Jar/Shell/SQL).
Product enhancement is divided into four parts: governance, cost, efficiency, and application. Governance strengthens data governance capabilities and automatically manages unhealthy data. Cost helps every team in the company close the loop on cost analysis and provides cost optimization. Efficiency improves users' productivity through standardized table creation, accurate measurement, query optimization, and lineage construction. Application further improves the integrated ad hoc query and distributed scheduling functions to enhance user experience.
For business empowerment, intelligent attribution diagnosis will enable business teams to analyze problems automatically. In addition, templated data retrieval will be iteratively optimized, enabling business teams to easily tap more data value.
That concludes this sharing of BIGO's data platform practice.