Datav: Building a Large-Screen Data Visualization System from Scratch


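 Contents
1. Current Situation Analysis
2. Datav design and key node implementation
   2.1 Overall Architecture Design
   2.2 How to improve the efficiency of collaboration between roles
   2.3 How to Support Metadata Calculation
   2.4 How to support quick page configuration
   2.5 The page component is directly connected to the data source
   2.6 Support component linkage and filter query
3. Benefits from Datav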

As Shopee's business data keeps growing, tabular views alone can no longer meet day-to-day analysis needs, and chart-rich Dashboards have become especially important. However, as any front-end developer knows, building Dashboard pages by hand consumes a great deal of manpower and time. When the volume of requests is high, it may be impossible to respond to business needs in time.
A platform that generates these Dashboard pages automatically would save much of that manpower and time and improve efficiency significantly. This article shares how we built such a large-screen data visualization system from scratch.

1. Current Situation Analysis

Let's look at some data first. Our team receives an average of 3-4 Dashboard-related requirements per quarter, and the average project cycle for each requirement is about 40 days. We currently maintain 20+ pages, each with roughly 50 chart components. The Dashboard of another platform (Stella), which is due to be refactored, has 25+ pages and involves even more chart components, roughly 100+.

Most of these Dashboard pages have complex content and complex interactions. With the traditional development approach, launching one page takes about 50+ person-days across PM, Dev, and QA.

Beyond manpower, let's also look at the process of developing a Dashboard.

This is a typical development process. Across the whole process, the four most time-consuming parts are data synchronization, API data aggregation, page development, and joint debugging, which together take about 70% of the time. A platform that solves these problems would go a long way toward relieving the manpower bottleneck, shortening the R&D process, and improving R&D efficiency.

There are many similar platforms on the market, and we compared a number of them side by side. Weighing how closely each one fits our business scenarios against development investment and benefits, we finally decided to develop our own platform, "Data Visualization" (hereinafter referred to as "Datav").

The roles we want this platform to take on are as follows:

The Datav platform has two main goals:

Shorten the project cycle from 40 days to 20 days;
Reduce labor costs: FE drops from 10 person-days to 3 person-days, PM drops from 15 person-days to 5 person-days, and BE and QA no longer need to participate.

2. Datav design and key node implementation

In order to achieve the above two goals, we abstract the functions to be implemented by Datav into five key points:

Reshape the entire project process and improve collaboration between PM and development;
Support simple metadata calculation and more flexible data query;
Support page quick configuration;
Support direct connection of page components to data sources;
Support component linkage and filter query.

2.1 Overall Architecture Design

Next, we will introduce the implementation of each key point one by one. The following figure is our overall architecture design.

The entire Datav platform consists of five very important subsystems and modules:

Designer: the core of the Datav platform and its hardest part. It supports page layout configuration, page interaction configuration, and component data configuration, as well as the configuration of code fragments, so it can also be regarded as a low-code platform.
Admin: Datav's operations management platform, covering data calculation, work management, component status management, page publishing, page permissions, and other general platform management functions.
UI Components: the most basic module of the whole platform. On top of an open source chart library, we define a standard DSL protocol, which is the protocol components follow to plug into Designer. There are currently 50+ components, and the number is still growing.
Datav Server: a Node service that mainly provides permission verification, data aggregation, and dynamic SQL generation.
Datasource Access Server: a service dedicated to connecting to different data sources, such as MySQL, ClickHouse, Elasticsearch, and Presto, with a separate connection client for each.
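To make the idea of "one connection client per data source" more concrete, here is a minimal TypeScript sketch. Everything in it (the `DataSourceClient` interface, the `DataSourceRegistry` class, the fake client) is a hypothetical illustration rather than Datav's actual code, and queries are synchronous only to keep the example short.

```typescript
// Hypothetical shape of a query result row; not Datav's real type.
type Row = Record<string, string | number>;

// Every data source is wrapped behind the same minimal interface.
interface DataSourceClient {
  readonly kind: string;
  query(sql: string): Row[];
}

// Maps a data source kind ("mysql", "clickhouse", ...) to its client.
class DataSourceRegistry {
  private clients: Record<string, DataSourceClient> = {};

  register(client: DataSourceClient): void {
    this.clients[client.kind] = client;
  }

  query(kind: string, sql: string): Row[] {
    const client = this.clients[kind];
    if (!client) {
      throw new Error("No client registered for data source: " + kind);
    }
    return client.query(sql);
  }
}

// In-memory stand-in used only for illustration. A real client would
// send the SQL to the database over the network instead of echoing it.
class FakeClickHouseClient implements DataSourceClient {
  readonly kind = "clickhouse";

  query(sql: string): Row[] {
    return [{ sql: sql, rows_scanned: 0 }];
  }
}

const registry = new DataSourceRegistry();
registry.register(new FakeClickHouseClient());
const demoRows = registry.query("clickhouse", "SELECT 1");
```

With this shape, supporting a new data source only means registering one more client; callers keep using `registry.query`.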

As the architecture diagram shows, the Datav platform connects directly to various data sources and ultimately produces a URL that can be easily embedded in any platform. The next step is to support source code generation, so that consumers can edit the generated page further.

2.2 How to improve the efficiency of collaboration between roles

Before solving this problem, we communicated with various roles many times, and analyzed the pain points and costs of each role in the project:

The pain point on the PM side is drawing prototypes. It takes about 10 days to draw the prototype for each requirement, and the result is still a static picture. After aligning the requirements with the developers, the PM has to modify some of the static data in it and then hold the PRD review;
BE and FE spend most of their effort on page development, API development, data synchronization, and joint debugging of pages.

Within the traditional development process, these steps are normal and already form the shortest development path. To solve the problem, we need to reshape the whole project process so that all roles can take part in configuration. We therefore redefined the development process of a Dashboard project together, as shown below.

PM can configure the prototype page directly on Datav;
Data Dev automatically synchronizes data calculation results to ClickHouse;
FE connects to the data source directly through Datav and reuses the prototype page configured by the PM in the first step, refining its configuration;
Datav generates the URL of the final page, and the PM can open the URL directly for testing.

The new process is very effective. The entire project cycle has been halved, from 8 weeks to 4 weeks, BE and QA support is no longer required, and FE only needs to invest 3 days on average.

2.3 How to Support Metadata Calculation

2.3.1 Basic knowledge

Supporting metadata calculation is a complex and heavyweight feature. Data is the cornerstone of every system, and no platform or business can do without the flow of data. That flow is also very complicated, especially when the amount of data is large.

Let's first look at what happens as data is generated and circulated, and which stages it passes through between being generated on the client and appearing as aggregated data on the Dashboard.

The figure above is the architecture diagram of a conventional big data platform, which clearly describes the entire process from data generation to data application. Some of the specialized terms may be unfamiliar to FE, but here we only need to understand one thing: the data generated by users has to go through a series of processing steps before it becomes really valuable.

Here's a more understandable flow chart to illustrate:

As the figure shows, data preparation has four key stages: data collection, data preprocessing, data modeling, and data service. Each stage involves a series of tasks. Data Dev engineers can use the data warehouse and related tools to quickly produce the data the business needs. Datav does not cover this part; the Datav process starts after the data service stage.

Although the data produced by the data warehouse has already gone through a series of calculations, it is often not yet the data the page needs to display. In our experience, a Dashboard page usually contains a lot of aggregation logic, which means the data produced by the data service has to be aggregated further. For example, indicators such as year-on-year and month-on-month changes need to be calculated.

2.3.2 Datav opens the data direct connection channel

The business side needs to aggregate data, which means back-end developers must provide APIs to the front end. Because of environment isolation and permissions, the data produced by the data warehouse cannot be used by the business team directly. The business team therefore goes through a particularly roundabout process to use this data, which makes the whole process more complex and the link longer.

In this process, the BE Dev stage has to provide two services: a data synchronization service and an API service. By rough estimate, this stage takes about 30% of the whole project timeline, and the work is repetitive.

Therefore, the first problem Datav solves is opening a direct data connection channel, and the solution adopted is to connect directly to ClickHouse.

Datav provides a direct data source service that connects to the ClickHouse cluster provided by the Data Infra team. In this way, most Dashboard pages can connect to the data source directly and no longer rely on APIs provided by the BE team, which shortens the entire project cycle by around 30%.

As mentioned above, the API is mainly used for data aggregation and data logic calculation, so how does Datav support these functions? This is the second big challenge Datav faces: supporting data aggregation calculations and adding logical fields.

2.3.3 Support metadata calculation

Next, a simple example illustrates how Datav replaces the API for some data aggregation calculations. Suppose there is a sales table tab_sales (offline data computed by the data warehouse) with the following content:

date	category	name	order_count	pay_succ_orders
20220701	clothing	A Brand T-Shirt	500	200
20220701	clothing	B Brand T-Shirt	1000	500
20220701	digital	A brand mobile phone	1000	600
20220701	digital	B brand mobile phone	1500	800

Now there is a requirement: to calculate the payment success rate of each category.

With the previous approach, the API first queries the raw data by category using SQL:

After getting the raw data, the API performs the calculation and returns the final data structure (on the right) to the front end:

Just imagine, if Datav can generate such a SQL, and the query result can also return the same data structure, is it possible to not rely on the API interface? With this question in mind, we made a lot of attempts.

Our conclusion is that in most scenarios the API is not needed at all. As in the example above, we can get the same data structure by changing the SQL:
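As an illustration of the two paths for the tab_sales example, consider the TypeScript sketch below. The SQL strings and the names `aggregateInApi`, `rawQuery`, and `aggregatedQuery` are assumptions made for this sketch, not the statements Datav actually generates; the table is simulated in memory so that the aggregation can be checked.

```typescript
// Rows of the example table tab_sales from the text above.
interface SalesRow {
  date: string;
  category: string;
  name: string;
  order_count: number;
  pay_succ_orders: number;
}

const tabSales: SalesRow[] = [
  { date: "20220701", category: "clothing", name: "A Brand T-Shirt", order_count: 500, pay_succ_orders: 200 },
  { date: "20220701", category: "clothing", name: "B Brand T-Shirt", order_count: 1000, pay_succ_orders: 500 },
  { date: "20220701", category: "digital", name: "A brand mobile phone", order_count: 1000, pay_succ_orders: 600 },
  { date: "20220701", category: "digital", name: "B brand mobile phone", order_count: 1500, pay_succ_orders: 800 },
];

interface CategoryRate {
  category: string;
  pay_succ_rate: number;
}

// Traditional path (assumed SQL): the API fetches raw rows,
// then aggregates them in application code.
const rawQuery = "SELECT category, order_count, pay_succ_orders FROM tab_sales";

function aggregateInApi(rows: SalesRow[]): CategoryRate[] {
  const totals: Record<string, { orders: number; paid: number }> = {};
  for (const row of rows) {
    const t = totals[row.category] || (totals[row.category] = { orders: 0, paid: 0 });
    t.orders += row.order_count;
    t.paid += row.pay_succ_orders;
  }
  return Object.keys(totals).map((category) => ({
    category: category,
    pay_succ_rate: totals[category].paid / totals[category].orders,
  }));
}

// Direct path (assumed SQL): the aggregation moves into the query,
// so the result already has the shape the front end needs.
const aggregatedQuery =
  "SELECT category, sum(pay_succ_orders) / sum(order_count) AS pay_succ_rate " +
  "FROM tab_sales GROUP BY category";

// clothing: 700 / 1500, digital: 1400 / 2500
const rates = aggregateInApi(tabSales);
```

Both paths yield one row per category with a `pay_succ_rate` column; the second simply lets the database do the work that the API layer used to do.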

2.3.4 Datav data management

To solve the two major problems above, namely how to open a direct data connection channel and how to support logical calculation on metadata, Datav has built a data management module.

Data management is an important module. Before configuring a page, the first things to think about are where the page's data comes from, which table feeds each component on the page, and whether any of that data is calculated from other fields.

Therefore, the data management module consists of three major parts: data source management, data field editing (field aliases, new fields, calculations between fields, field formatting, field display permissions, and so on), and data subject management (multi-table associations that produce logical wide tables, and custom SQL queries that produce data subjects). The process is as follows:

1) Data source management

Currently, the data source supports direct connection to offline data sources (MySQL, ClickHouse), and will soon support direct connection to real-time data sources (Kafka, Elasticsearch, Prometheus, etc.).

2) Data field editing

Field editing provides a series of advanced functions such as alias settings for tables and table fields, table field permission settings, new logical fields, field logical calculations, and field formatting.
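A small TypeScript sketch of how field aliases and logical fields could be turned into a SQL select list is shown below. The `FieldDef` shape and both function names are hypothetical; expressions are assumed to come from trusted configuration and are not validated here.

```typescript
// Hypothetical description of one configured field.
interface FieldDef {
  name: string;        // physical column name, or the name of a new logical field
  alias?: string;      // optional display alias
  expression?: string; // SQL expression, present only for logical fields
}

// Turn one field definition into an item of the SELECT list.
function toSelectItem(field: FieldDef): string {
  const source = field.expression ? "(" + field.expression + ")" : field.name;
  const outputName = field.alias || field.name;
  return source === outputName ? source : source + " AS " + outputName;
}

function toSelectList(fields: FieldDef[]): string {
  return fields.map(toSelectItem).join(", ");
}

const fields: FieldDef[] = [
  { name: "category" },
  { name: "order_count", alias: "orders" },
  { name: "pay_succ_rate", expression: "pay_succ_orders / order_count" },
];

// "category, order_count AS orders, (pay_succ_orders / order_count) AS pay_succ_rate"
const selectList = toSelectList(fields);
```

The point of the sketch is that a logical field is only metadata: nothing is stored for it, and its expression is expanded into the query when a component asks for data.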

3) Data subject management

Data subjects currently support visual configuration and custom SQL. For now, visual configuration only supports single-table settings; multi-table associations that form logical wide tables will be supported later.

The multi-table association function is shown in the figure:

The custom SQL query is shown in the figure:

2.4 How to support quick page configuration

The core of supporting rapid page configuration is the implementation of Designer. As with other low-code platforms in the industry, the position and attributes of page components are described in a common intermediate protocol (DSL); at runtime, a parsing engine parses the DSL dynamically and renders it into the page.
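The sketch below shows, in TypeScript, what "describe the page in a DSL, then parse and render it at runtime" can look like in its simplest form. The DSL shape (`PageDsl`, `ComponentNode`), the component types, and the HTML output are all assumptions for illustration and do not reflect Datav's real protocol; values are not HTML-escaped, which a real renderer would need to do.

```typescript
// Hypothetical DSL node: type, position, and attributes of one component.
interface ComponentNode {
  type: string; // e.g. "text" or "bar-chart"
  position: { x: number; y: number; width: number; height: number };
  props: Record<string, string | number>;
}

interface PageDsl {
  layout: "absolute" | "flex";
  components: ComponentNode[];
}

type Renderer = (props: Record<string, string | number>) => string;

// Registry of renderers keyed by component type.
const renderers: Record<string, Renderer> = {
  "text": (props) => "<span>" + String(props.content) + "</span>",
  "bar-chart": (props) => '<div class="bar-chart" data-title="' + String(props.title) + '"></div>',
};

// The "parsing engine": walk the DSL and render each node with its renderer.
function renderPage(dsl: PageDsl): string {
  const body = dsl.components.map((node) => {
    const render = renderers[node.type];
    if (!render) {
      throw new Error("Unknown component type: " + node.type);
    }
    const p = node.position;
    const size = "width:" + p.width + "px;height:" + p.height + "px";
    const style = dsl.layout === "absolute"
      ? "position:absolute;left:" + p.x + "px;top:" + p.y + "px;" + size
      : size;
    return '<div style="' + style + '">' + render(node.props) + "</div>";
  });
  return '<div class="page">' + body.join("") + "</div>";
}

const demoPage: PageDsl = {
  layout: "absolute",
  components: [
    { type: "text", position: { x: 10, y: 20, width: 200, height: 40 }, props: { content: "Sales" } },
  ],
};
const html = renderPage(demoPage);
```

Because the page is plain data, the same DSL can be produced by Designer, stored, versioned, and rendered later without regenerating any code.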

The architecture is as follows:

To further improve collaboration between UI designers and FE, we have reserved room for some advanced features.

For example, a Figma plug-in would let UI designers use our components when designing a page, generate the page's DSL through the plug-in, and pass it to the Datav parsing engine for rendering;
A bolder idea is to recognize the components in a picture through machine learning, generate the page's DSL automatically, and hand it to the Datav parsing engine for rendering.

Both options are still in the design stage and have not yet been implemented.

Designer has many functions and is fairly complex, so we won't go into detail here. So far we have implemented two page layout modes: absolute layout and flex layout. Choosing the layout mode that suits the scenario makes page configuration very efficient.

Flex layout: suitable for pages with a simple, clear structure, such as Admin form pages.

Absolute layout: suitable for pages with a complex structure and irregular component placement, especially large-screen pages.

2.5 The page component is directly connected to the data source

The key points for a component to connect directly to a data source include:

Understand dimensions and indicators (metrics);
Understand how to generate an SQL statement.

Next, let's use two examples to illustrate this intuitively:

This is a chart with two dimensions and one indicator. One dimension is the date, on the X-axis; the other dimension is the category of each column; the indicator is the value on the Y-axis.

If we want to generate such a chart, our SQL statement should be written like this:

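 SELECT [dimension1], [dimension2], [indicator] FROM [table_xxx] GROUP BY [dimension1], [dimension2]

The SQL statement shows the rule: dimension attributes go after GROUP BY (and are also selected, so that each row can be placed on the chart), while the indicator attribute goes after SELECT, usually wrapped in an aggregate function.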

Let's look at another example:

Similarly, this is a chart with one dimension and one indicator, and the corresponding SQL is as follows:

 SELECT [dimension1], [indicator] FROM [table_xxx] GROUP BY [dimension1]

With dimensions, indicators, and the SQL generation rules understood, we implemented a Data-Connector component based on this idea. It lets users configure dimensions, indicators, paging, sorting, and other fields, and finally generates the corresponding SQL statement from this configuration.

The corresponding generated SQL statement is as follows:

 SELECT field1 FROM `Demo-Table` GROUP BY date_day ORDER BY field3 ASC LIMIT 1000 OFFSET 100

In this way, the component connects directly to the data source and displays the corresponding dynamic data.
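The generation rule can be sketched in TypeScript as follows. `ConnectorConfig`, `buildQuery`, and the identifier check are hypothetical and only illustrate the idea of turning a dimension/indicator configuration into SQL; this is not the Data-Connector's actual implementation.

```typescript
// Hypothetical configuration produced by a Data-Connector-like panel.
interface Metric {
  field: string;
  aggregate?: "sum" | "avg" | "count" | "min" | "max";
  alias?: string;
}

interface ConnectorConfig {
  table: string;
  dimensions: string[];
  metrics: Metric[];
  orderBy?: { field: string; direction: "ASC" | "DESC" };
  limit?: number;
  offset?: number;
}

// Accept only simple identifiers so configuration values cannot inject SQL.
function ident(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
    throw new Error("Invalid identifier: " + name);
  }
  return name;
}

function buildQuery(config: ConnectorConfig): string {
  const dims = config.dimensions.map(ident);
  const metrics = config.metrics.map((m) => {
    const column = ident(m.field);
    const expr = m.aggregate ? m.aggregate + "(" + column + ")" : column;
    return m.alias ? expr + " AS " + ident(m.alias) : expr;
  });

  // Dimensions go into both SELECT and GROUP BY; indicators only into SELECT.
  const parts = ["SELECT " + dims.concat(metrics).join(", "), "FROM " + ident(config.table)];
  if (dims.length > 0) {
    parts.push("GROUP BY " + dims.join(", "));
  }
  // ORDER BY must come before LIMIT / OFFSET.
  if (config.orderBy) {
    parts.push("ORDER BY " + ident(config.orderBy.field) + " " + config.orderBy.direction);
  }
  if (config.limit !== undefined) {
    parts.push("LIMIT " + Math.floor(config.limit));
    if (config.offset !== undefined) {
      parts.push("OFFSET " + Math.floor(config.offset));
    }
  }
  return parts.join(" ");
}

// Two dimensions and one indicator, as in the first chart example.
const demoSql = buildQuery({
  table: "tab_sales",
  dimensions: ["date", "category"],
  metrics: [{ field: "order_count", aggregate: "sum", alias: "orders" }],
  orderBy: { field: "date", direction: "ASC" },
  limit: 1000,
  offset: 100,
});
```

For this configuration, `demoSql` is `SELECT date, category, sum(order_count) AS orders FROM tab_sales GROUP BY date, category ORDER BY date ASC LIMIT 1000 OFFSET 100`.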

2.6 Support component linkage and filter query

In most scenarios, the page we configure is not a static page. It requires dynamic data and various interactions, the most common of which is filtering.

Interaction is a difficult point for page configuration, because there are many interactions, and some interactions are also very complicated. Datav currently only supports some interactions, such as component data filtering, button click events, pop-up windows, tab switching and other common functions.

Let's look at an example first. Why must we support interaction between components?

This is the data that we query by using the component to directly connect to the data source. You will find that the amount of data is very large.

At this time, it may cause the component to crash or the page to freeze. Therefore, the best way to solve this problem is to support data filtering, that is, to support one component to filter the data of another component, as shown in the following figure.

This is what we want to achieve: a filter component to control the amount of data in the chart component. So how to achieve it?

We have also investigated similar solutions in the industry. The most common one is to support writing code and implement it using the publish-subscribe design pattern. The implementation principle is roughly as follows:

The advantage of this approach is obvious: it is simple. But there are also some disadvantages:

It requires writing JS code, which is only friendly to front-end developers;
If there are many associated components, code maintenance will become very complicated;
The efficiency of page configuration will also be greatly reduced.
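To make the trade-off concrete, a hand-written publish-subscribe linkage might look like the TypeScript sketch below. The `EventBus` class and the topic name are hypothetical and only illustrate the general pattern, not code from any specific platform.

```typescript
type Handler<T> = (payload: T) => void;

// Minimal publish-subscribe bus.
class EventBus<T> {
  private handlers: Record<string, Handler<T>[]> = {};

  // Register a handler and return a function that removes it again.
  subscribe(topic: string, handler: Handler<T>): () => void {
    const list = this.handlers[topic] || (this.handlers[topic] = []);
    list.push(handler);
    return () => {
      const index = list.indexOf(handler);
      if (index >= 0) {
        list.splice(index, 1);
      }
    };
  }

  // Deliver the payload to every handler subscribed to the topic.
  publish(topic: string, payload: T): void {
    const list = (this.handlers[topic] || []).slice();
    for (const handler of list) {
      handler(payload);
    }
  }
}

// Glue code that someone has to write by hand for every linked pair:
// the filter publishes its value, the chart subscribes and reacts.
const bus = new EventBus<{ category: string }>();
const chartState = { lastFilter: "" };

const unsubscribe = bus.subscribe("filter:category", (payload) => {
  chartState.lastFilter = payload.category; // a real chart would refetch here
});

bus.publish("filter:category", { category: "digital" });
```

Each new filter-to-chart relationship adds another topic and another handler, which is where the maintenance cost mentioned above comes from.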

Therefore, to solve these problems, Datav implements component linkage through visual configuration, and still supports writing code for complex cases. The implementation principle is not complicated and can be summarized in four steps:

Establish the relationship between the filter component and the display component, and the relationship between the filter component and the filter parameter;
Monitor changes in filtering parameters;
Once a filter parameter changes, notify all associated display components and pass the new value to them;
Notify the utility function to pull new data and re-render.
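The four steps above can be sketched in TypeScript roughly as follows. The names `Linkage`, `LinkageManager`, and `fetchRows` are assumptions made for this sketch, and data fetching is synchronous and in-memory to keep it short; a real implementation would query the data source asynchronously.

```typescript
type Params = Record<string, string>;
type Row = Record<string, string | number>;

// Step 1: relationships are plain configuration, not hand-written code.
interface Linkage {
  filterId: string;    // the filter component
  param: string;       // the filter parameter it controls
  targetIds: string[]; // the display components it affects
}

interface DisplayComponent {
  id: string;
  rows: Row[];
}

class LinkageManager {
  private params: Params = {};

  constructor(
    private linkages: Linkage[],
    private components: Record<string, DisplayComponent>,
    private fetchRows: (componentId: string, params: Params) => Row[],
  ) {}

  // Steps 2 and 3: detect a parameter change and notify linked components.
  setFilterValue(filterId: string, value: string): void {
    for (const link of this.linkages) {
      if (link.filterId !== filterId) continue;
      if (this.params[link.param] === value) continue; // value unchanged
      this.params[link.param] = value;
      for (const targetId of link.targetIds) {
        this.refresh(targetId);
      }
    }
  }

  // Step 4: pull new data with the current parameters and "re-render".
  private refresh(componentId: string): void {
    const component = this.components[componentId];
    if (!component) return;
    component.rows = this.fetchRows(componentId, { ...this.params });
  }
}

const allRows: Row[] = [
  { category: "clothing", order_count: 1500 },
  { category: "digital", order_count: 2500 },
];
const chart: DisplayComponent = { id: "chart-1", rows: allRows };

const manager = new LinkageManager(
  [{ filterId: "select-1", param: "category", targetIds: ["chart-1"] }],
  { "chart-1": chart },
  (_componentId, params) => allRows.filter((row) => row.category === params.category),
);

manager.setFilterValue("select-1", "digital");
```

After the call, `chart.rows` holds only the "digital" row. Adding another linked chart means adding its id to `targetIds` in the configuration rather than writing a new handler.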

The schematic diagram is shown in the figure:

The Datav platform is fairly large and has many features, and the design of many of them could fill an article of its own. This article focuses on the key problems Datav solves. We hope it gives you a sense of what kind of platform Datav is, what problems it can solve, and which business scenarios it suits. We have therefore not expanded on the technical details; those will be covered in future articles.

3. Benefits from Datav

The benefits brought by the Datav project can be considered in terms of process optimization, development efficiency, and infrastructure.

3.1 Process optimization

The whole project cycle has been reduced from about 40 days to about 20 days, a reduction of about 50%. Manpower consumption has also dropped, and BE and QA no longer necessarily need to participate.

This is because PMs can configure the page prototype directly on the Datav platform. Once the requirements are settled, the prototype configuration is handed straight to FE, who add the interaction and data-related configuration, after which the page can be tested directly.

3.2 Development efficiency

From the perspective of front-end development efficiency, the average cost has dropped from 10 person-days to 3 person-days, an efficiency gain of 200%+.

From the perspective of the whole R&D stage, the total manpower required in the past was: (Data Dev) 10 person-days + (BE Dev) 13 person-days + (FE Dev) 10 person-days = 33 person-days.

Now the total manpower is: (Data Dev) 10 person-days + (FE Dev) 3 person-days = 13 person-days. Going from 33 person-days to 13 person-days is still an efficiency gain of 150%+. Datav is therefore also of great significance for improving R&D efficiency.

3.3 Infrastructure

In addition to the quantitative benefits mentioned above, Datav also brings many less visible benefits, for example in team infrastructure:

Established component development specifications;
Established a standard component library and component platform;
Built up some standard Node middleware, such as a logger;
Built up a set of standard automation scripts for component creation, automatic compilation, automatic document generation, code specification checks, and so on;
A more fine-grained permission management system (under construction).

Authors of this article
Shopee Digital Purchase & Local Services Front End Team.
