Financial Data Intelligence Summit | With the explosive growth of data scale, how can companies make precise decisions? Practical sharing on data operations with the cloud-native data warehouse

Introduction: At the 2021 Alibaba Cloud Financial Data Intelligence Summit session "The Dark Horse of Growth in Cloud Native Driving Digital and Intelligent Operations", Wei Chuangxian, senior technical expert of Alibaba Cloud Database, started from the perspective of the data value chain and explained how the cloud-native data warehouse supports data-driven operations, full-link marketing, and Alibaba Group's Double 11 business, and presented best-practice cases and application scenarios for financial customers. This article is compiled from the speech recording and slides.


Wei Chuangxian, Senior Technical Expert of Alibaba Cloud Database

1. Background and Trends

(1) Alibaba's 15 years of cloud computing practice

截屏2021-07-20 下午5.32.40.png

Alibaba's fifteen years of cloud-native development can be roughly divided into three stages.

The first stage, from 2006 to 2015, was the move of the application architecture to Internet technology, which took cloud native from 0 to 1. In the earliest days, Alibaba built middleware for Taobao, which was the earliest prototype of the cloud. At that time, we were running on Oracle databases and IBM minicomputers. But Alibaba found a problem: with Taobao's traffic growing, the Oracle machines could no longer meet business needs, and within three months we would no longer be able to store our data. This was a very serious problem, so Alibaba launched its "de-IOE" plan, moving away from IBM minicomputers, Oracle databases, and EMC storage.

At that point Alibaba's business was doing very well, but there were many technical challenges. Alibaba therefore established Alibaba Cloud in 2009 and developed its own Feitian operating system, starting the era of cloudification. Taobao and Tmall were merged to build a shared business platform, and the three core middleware systems were launched.

The Feitian operating system (known in English as Apsara) is a distributed operating system. Above its basic common modules sit two core services: Pangu and Fuxi. Pangu is the storage management service, and Fuxi is the resource scheduling service. Storage and resource allocation for applications running on the Feitian kernel are managed by Pangu and Fuxi. Feitian's core services fall into four groups: computing, storage, database, and network.

To help developers build cloud applications easily, Feitian provides a rich set of connection and orchestration services for linking and organizing these core services, including notifications, queues, resource orchestration, distributed transaction management, and so on.

The top layer of Feitian is the cloud marketplace, the first software trading and delivery platform created by Alibaba Cloud. It is like the "App Store" of cloud computing: users can activate "software + cloud computing resources" with one click on the Alibaba Cloud website. The marketplace lists thousands of products and supports software and services delivered as images, containers, orchestration templates, APIs, SaaS, services, and downloads.

This is the basic framework of the earliest cloud, and it is also a cloud-native architecture.

Starting in 2011, we began working on container scheduling and started containerizing the group's online business. By 2013, the self-developed Feitian operating system fully supported the group's business.

In 2015, Alibaba Cloud's cloud-native technology was no longer used only for Alibaba's internal business and began to be offered commercially. That is the first stage.

The second stage, from 2016 to 2019, is when the core systems became fully cloud native.

Starting in 2017, cloud-native technology was adopted not only for online workloads but also for all offline workloads. The Double 11 Shopping Festival generates a huge amount of transaction data, and the back-end analysis and post-processing of that data are all handled offline. We unified the underlying resource pools for online and offline workloads on cloud native to support million-scale e-commerce transactions.

By 2019, 100% of Alibaba's core systems had moved to the cloud. This was very difficult, because Alibaba's business volume is so large that an ordinary system cannot support it.

The third stage, from 2020 to the present, is the comprehensive upgrade to next-generation cloud-native technology. Alibaba established a cloud native technology committee, and cloud native was elevated to Alibaba's new technology strategy. Alibaba's core systems now rely entirely on cloud-native products to support the big promotions. Alibaba Cloud's cloud-native technology has been fully upgraded, and the serverless era has begun.

(2) Alibaba Cloud's assertion on cloud computing

How does Alibaba view cloud computing? What is the difference between cloud computing and traditional technology?

Consider a village where every household has to dig its own well. Each family decides how big a well to dig based on the size of the household, roughly how much water it needs, whether it expects guests, and so on. If more guests than expected arrive, or there is a drought, the water may not be enough. Besides the cost of digging the well, its daily maintenance is also expensive.

Mapped to an enterprise, this means the enterprise builds its own IT foundation: it buys data center space from an operator and buys a few servers to support its services. If those machines later sit idle, the company has still paid a large amount of money, which is very costly.

The cloud solves this problem by pooling resources through virtualization. In terms of the well example, it builds a water plant. A water plant differs from a well in two ways. First, its supply is very large: even if 100 guests arrive, it can meet the demand. Second, there is no large up-front cost for digging a well; you are charged according to the water you use. Even if the pipe is connected, you pay nothing as long as you use no water.

This brings companies two major benefits. First, when a company needs to act quickly, it does not have to spend time "digging wells"; the service works out of the box. Second, the initial investment is very low.

That is the benefit of the cloud. So what is cloud native?

Cloud native is a standard service for which we do not need to plan many things in advance. For example, if I want to carry out a digital transformation, my needs are simple: I need someone to provide the service, allocate as much as I ask for, and spare me the advance preparation. As my business grows, the underlying infrastructure grows with it, with very good elasticity. This greatly reduces the enterprise's cost and effort, lets it focus on what it does best, and greatly improves efficiency.

Through the above examples, the following points are very easy to understand.

截屏2021-07-20 下午5.34.22.png

First, we believe that containers + K8s will become the new interface to cloud computing; this is the future trend.

Second, the entire software life cycle will change. Software used to have a very long life cycle, but with cloud-native technology it can iterate faster and faster. This comes from extending downward into the integration of software and hardware, and upward into the modernization of application architecture.

Finally, the digital upgrade of enterprises will accelerate. Digital transformation used to be very complicated: buying machines, databases, and applications could take three to five years. Today an enterprise's digital transformation can be completed in just a few months.

(3) Industry trends: data production/processing is undergoing qualitative changes

From the perspective of industry trends, what changes will happen to data in the future, and what changes will it bring to applications?

截屏2021-07-20 下午5.35.04.png

First, we believe that data will grow explosively in scale. The global data volume in 2020 was about 40 ZB. What does 40 ZB mean? If each movie is 1 GB, 40 ZB is about 40 trillion movies, which works out to several thousand movies for every person in the world.

In addition, we predict that the global data volume in 2025 will be 430% of the 2020 volume; the global data volume grows every year.
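
As a rough sanity check of these magnitudes, the sketch below converts 40 ZB into 1 GB movies and applies the 430% growth figure. The world population value is an assumption added for illustration and is not a figure from the talk; the decimal definitions of GB and ZB are also assumed.

```python
# Decimal (SI) unit sizes in bytes.
GB = 10**9
ZB = 10**21

data_2020_zb = 40                  # global data volume in 2020, from the talk
movie_size_bytes = 1 * GB          # one movie, as assumed in the talk
world_population = 7_800_000_000   # assumption for illustration only

# How many 1 GB movies fit into 40 ZB, in total and per person.
movies = data_2020_zb * ZB // movie_size_bytes
movies_per_person = movies / world_population

# 2025 volume if it is 430% of the 2020 volume.
data_2025_zb = data_2020_zb * 430 / 100

print(f"{movies:,} movies in total")
print(f"about {movies_per_person:,.0f} movies per person")
print(f"about {data_2025_zb:.0f} ZB in 2025")
```

Under these assumptions the 2025 figure comes out at roughly 172 ZB.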

The second trend is real-time data production and processing. We used to look at a report perhaps once a month. With big data, we can look at yesterday's data every day. Data is now becoming more and more real-time and can be served with second-level latency. Take marketing as an example. During the Double 11 Shopping Festival, when a merchant finds that a store promotion is not working, it can adjust its advertising or placement strategy within a minute or a few minutes to get a better result. If the data came back only daily, the merchant would see it on November 12, by which time the promotion would have lost most of its effect. Real-time data therefore plays a very important role in scenarios like this, and real-time data will also bring real-time applications.

The third trend is intelligent data production and processing. Today unstructured data accounts for 80% of all data, mainly text, graphics, images, audio, and video. In the currently popular live-streaming field in particular, intelligent processing of unstructured data reveals the audience's preferences and other information, which helps the business grow. Unstructured data also continues to grow at 55% per year, and it will become a very important source for data analysis.

The fourth trend is the accelerating move of data to the cloud. We believe this move is unstoppable, just as gasoline cars will eventually be replaced by electric cars. It is estimated that by 2025, 49% of data will be stored in the cloud, and that by 2023, 75% of databases will be on the cloud.

(4) Industry trends: Cloud computing accelerates the evolution of database systems

Another industry trend cannot be ignored: cloud computing accelerates the evolution of database systems.

截屏2021-07-20 下午5.35.51.png

First, let's look at the history of databases. Databases emerged as early as the 1980s and 1990s. At that time they were mainly commercial databases, such as Oracle and IBM DB2, and some of these still hold a share of the market today.

In the 1990s, open source databases such as PostgreSQL and MySQL began to emerge. MySQL is used more in China, and PostgreSQL is used more abroad. After the 1990s, data volumes grew larger and larger. While the volume was small, PostgreSQL or MySQL on a single machine could solve the problem. With the explosive growth in data volume, storing and analyzing large amounts of data had to be handled with distributed systems or minicomputers.

Why is data analysis so important?

Take Snowflake, a data warehouse company. Its market value reached 100 billion US dollars when it first went public and is now about 70 billion US dollars. For a company with only one product, this is a very high valuation. Why is it so high?

Some time ago I talked with a teacher. He said that for today's companies, especially Internet companies in e-commerce or live streaming, the biggest cost used to be people, with salaries as the main expense. Now the biggest expense is information and data. To plan its future, a company needs a large amount of data to analyze what customers want and need most and where the industry is heading. Companies therefore buy a lot of data and do a lot of data analysis, and the cost of this has exceeded the cost of personnel. This is why a company that makes only a data warehouse can be worth 70 billion US dollars.

After 2000, everyone began to use Hadoop and Spark. In 2010, cloud-native, integrated, distributed products such as AWS and AnalyticDB began to appear.

(5) Industry trends: Data warehouses are accelerating the evolution from Big Data to Cloud-Native + Fast Data

截屏2021-07-20 下午5.37.10.png

The figure above shows the evolution of the data warehouse: from offline to online, then to integrated online and offline processing, and then to distributed systems. Functions have moved from statistics to AI; data types from structured data to the multi-model integration of structured and unstructured data; workloads from OLAP to HTAP; hardware to integrated software and hardware; and delivery from On-Premise to Cloud-Native + Serverless.

At each step of this evolution, various products have provided the support.

(6) Evolution of database system architecture

截屏2021-07-20 下午5.37.55.png

The picture above shows the evolution of the database system architecture. Put simply, it went from one factory with one person working, to one factory with ten people working, and then to multiple factories with many people working. That is the whole development history of the data warehouse: it has gone from a single machine to a distributed system, and one copy of the data is used by many people.

The development of databases mirrors how human work is organized. At first, a shop can be run by a husband and wife, one responsible for production and the other for sales. As it grows and gets more customers, it is still one shop but may have ten employees. Later the business grows further, recruits 100,000 employees, and operates at 10 sites. That is the distributed cloud-native data warehouse.

(7) Industry trends: key technologies of cloud-native databases

截屏2021-07-20 下午5.38.32.png

The figure above lists the key technologies of cloud-native databases.

Here is a brief look at two of them. The first is cloud native. What does cloud native mean? If a user buys a database, the charge is lower when business volume is low or when the database goes unused on public holidays, and higher when business volume is high. Charging on demand and by usage is a requirement for our data warehouse.

The other is security and trustworthiness. For example, Alibaba has an investment department. If it invests 5 million in company A and 1 million in company B, this information is highly confidential and cannot be disclosed. If it is managed by employees, those employees may leave, and once they have left it is hard to hold them accountable legally. The question is how to keep such highly confidential information fully encrypted so that even the DBA with the highest privileges cannot view it; only then is it secure and trustworthy. This is discussed in detail later.

2. Cloud native and big data applications

(1) Challenges faced by the business

截屏2021-07-20 下午5.39.09.png

The business faces many challenges, mainly in four areas.

First, the data is scattered and inconsistent. There are many data sources, and bringing the data together is a big challenge.

Second, the system is extremely complex, with 40+ systems or components. It may be built on Hadoop, but it now needs many systems or components: HDFS at the bottom, YARN and HBase above it, and then many others such as Hive and Flink, which makes it very complicated.

Third, analysis is not real-time. In a traditional big data architecture, the data is only available at T+1.

Finally, the learning cost is high. The different technologies release new versions very quickly, so they are expensive to learn.

(2) Cloud-native data warehouse + cloud-native data lake to build a new generation of data storage and processing solutions

截屏2021-07-20 下午5.39.49.png

Alibaba Cloud therefore adopted the simplest possible architecture, covering the whole stack with one or two products so that users can solve a variety of problems with SQL. For example, raw data in OSS and the data produced by each production system can be analyzed in one place.

(3) Cloud-native data warehouse: cloud-native

截屏2021-07-20 下午5.40.04.png

The cloud-native character of a cloud-native data warehouse shows mainly in storage: if there is only one record, only enough storage for one record is allocated, and as the data grows, more storage is allocated automatically.

Computing works the same way. If there is no computing or analysis demand, no resources are allocated; resources for computation or analysis are allocated only when demand arrives. The entire system is pay-as-you-go, with elastic resources.
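
To make the pay-as-you-go point concrete, here is a minimal sketch that compares provisioning for the peak hour with paying only for what each hour uses. The demand profile, the unit price, and both function names are hypothetical values chosen for illustration; they do not reflect any real Alibaba Cloud pricing model.

```python
def fixed_capacity_cost(hourly_demand, unit_price):
    """Cost when capacity for the peak hour is provisioned for every hour."""
    peak = max(hourly_demand)
    return peak * unit_price * len(hourly_demand)


def pay_as_you_go_cost(hourly_demand, unit_price):
    """Cost when only the capacity used in each hour is billed."""
    return sum(units * unit_price for units in hourly_demand)


# Hypothetical demand in compute units per hour: idle at first, one short peak.
hourly_demand = [0, 0, 0, 2, 4, 10, 4, 2]
unit_price = 1.0  # hypothetical price per compute unit per hour

fixed = fixed_capacity_cost(hourly_demand, unit_price)
elastic = pay_as_you_go_cost(hourly_demand, unit_price)
print(f"fixed capacity: {fixed}, pay-as-you-go: {elastic}")
```

The gap between the two numbers is the cost of idle capacity, which is the "well that nobody drinks from" in the earlier analogy.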

(4) Cloud-native data warehouse: integration of database and big data

截屏2021-07-20 下午5.40.52.png

The figure above lists the key technologies in the cloud-native data warehouse. The first is hybrid row-column storage, which supports both high-throughput writes and high-concurrency queries.

The second is mixed workloads: the same system can run both ETL and queries.

There is also intelligent indexing. A very important part of using a database is understanding the business and the indexes, and knowing what affects queries and what affects writes. We want to make this smarter so that users do not have to manage these things themselves.
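
The difference between row and column layouts can be sketched in a few lines. This is only an illustration of the general idea, with a hypothetical orders table and helper names; it is not how the cloud-native data warehouse implements hybrid storage internally.

```python
# Row layout: one record per order, convenient for writes and point lookups.
row_store = [
    {"order_id": 1, "customer": "a", "amount": 30},
    {"order_id": 2, "customer": "b", "amount": 50},
    {"order_id": 3, "customer": "a", "amount": 80},
    {"order_id": 4, "customer": "c", "amount": 500},
]


def to_columns(rows):
    """Pivot row records into one list per field (column layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def lookup_order(rows, order_id):
    """Point lookup: the row layout returns the whole record at once."""
    return next(row for row in rows if row["order_id"] == order_id)


def total_amount(columns):
    """Aggregate: the column layout only touches the 'amount' column."""
    return sum(columns["amount"])


column_store = to_columns(row_store)
print(lookup_order(row_store, 3))
print(total_amount(column_store))
```

A hybrid design aims to get both behaviors from one system, so that writes and point reads stay cheap while analytical scans read only the columns they need.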

(5) A new generation of data warehouse solutions

截屏2021-07-20 下午5.42.05.png

The figure above is the architecture diagram of the new-generation data warehouse solution. The bottom layer is the data warehouse, and above it is the data warehouse model. Alibaba has built many models for the Taobao index, data insight, and so on, including linking all information through a single ID; this information is gathered into a model. Above the model is a data construction and management engine, which handles data warehouse planning, code development, data asset management, data services, and so on.

The top layer is business empowerment, which has many applications, including regulatory reporting, business decision-making, risk warning, and marketing and operations.

(6) Data security on the cloud

截屏2021-07-20 下午5.43.51.png

Let's expand on data security on the cloud. Every company has top-secret data, and this data faces many security risks, such as unauthorized operations by administrators or users, theft of data backups, and malicious modification of data. The goals are therefore as follows. Data is encrypted throughout storage, query, and sharing, and no one (including administrators) can obtain the plaintext. Log integrity is guaranteed in an untrusted environment, and no one (including administrators) can tamper with the log files. Query results are guaranteed correct in an untrusted environment, and no one (including administrators) can tamper with them.

The earlier solution was very simple: encrypt the data when it is written to the database. For example, the value 123 is scrambled by encryption into something like 213 or 312. This seems like a good method, but what is wrong with it? It cannot support queries. Suppose we want to find transactions above 50 yuan. Once 50 is encrypted it is no longer 50; it may become 500, and the original 500 may be encrypted to 50, so the query cannot be run. The database becomes pure storage and cannot be used for analysis or queries.
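
The range-query problem can be reproduced with a toy example. The sketch below uses XOR with a secret number as a stand-in for encryption; this is an assumption made purely for illustration and is not a real cipher or the scheme any product uses. The point it shows carries over to real encryption: the order of the stored values has nothing to do with the order of the original amounts.

```python
SECRET_KEY = 0b101010101  # hypothetical key for the toy scrambling


def toy_encrypt(amount):
    """Toy stand-in for encryption: XOR with a secret key (NOT secure)."""
    return amount ^ SECRET_KEY


def toy_decrypt(token):
    """XOR is its own inverse, so the same operation restores the amount."""
    return token ^ SECRET_KEY


amounts = [30, 50, 80, 500, 600]            # transaction amounts in yuan
stored = [toy_encrypt(a) for a in amounts]  # what the database would hold

threshold = 50

# The query we want: transactions above 50 yuan, evaluated on plaintext.
plaintext_hits = [a for a in amounts if a > threshold]

# The same comparison evaluated directly on the stored, scrambled values.
ciphertext_hits = [toy_decrypt(t) for t in stored if t > toy_encrypt(threshold)]

# Equality still works with a deterministic scheme, but ranges do not.
equality_hits = [toy_decrypt(t) for t in stored if t == toy_encrypt(500)]

print("plaintext range query :", plaintext_hits)
print("ciphertext range query:", ciphertext_hits)
print("ciphertext equality   :", equality_hits)
```

Comparing the scrambled values returns a different set of transactions from the plaintext query, which is why a database that only stores ciphertext cannot answer analytical queries on its own.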

(7) Encrypted data in the cloud will never be leaked

截屏2021-07-20 下午5.44.13.png

Is there a way to analyze the data while keeping it confidential, even with the original SQL?

The core is the hardware we use. With ApsaraDB RDS (PostgreSQL edition) + Shenlong bare metal servers (security chip with TEE technology), the key can be stored in the hardware in advance, and all computation and logic then run inside the encrypted hardware. Because the whole process is protected by the encrypted hardware, even if someone copies all of the system's memory, the copied data is all encrypted. Even if operations staff obtain the top-secret data, there is no risk of leakage.

3. Best practices

Let’s take a look at a few best practices:

DMP: Full Link Marketing

截屏2021-07-20 下午5.45.24.png

DMP stands for Data Management Platform; it is also called a data marketing platform.

What is the core of marketing? It is finding people, specifically the group of people you care about most. The industry term for this is audience selection (literally "circling people").

When do we need to select an audience? For example, today we want to find people who are interested in cloud native to discuss cloud native. Finding those people is audience selection.

Another example comes from Tmall and Taobao reports. Some time before Double 11, a merchant may judge that a certain customer is likely to buy a dress or a bag this year and is therefore a potential customer, so it pushes coupons and similar offers to that customer.

The most important thing here is positioning the audience precisely, so that groups of people can be told apart accurately. About 800 million people in China shop online. The core task is to push messages to the people who are interested in a particular item.

Alibaba uses the data warehouse to select audiences. The first step is to find a seed audience of several million people whom we consider high-quality customers, for example people who spend more than 5,000 yuan, or more than 10,000 yuan, on Taobao every month. Once this audience has been extracted, the second step is to cluster it.

Clustering means dividing the millions of people into several sub-groups, each with its own preference. For example, one group likes to buy cosmetics, another likes digital products, and another likes to buy books. After this split there may be, say, 100,000 people who love to buy cosmetics, but most of them may have bought cosmetics already, so there is a high probability that they will not buy this time.

Therefore, we need to find the people among the 800 million consumers who are really likely to buy cosmetics. How?

We convert each customer's consumption behavior and purchase history into a vector for an AI model. If two customers' purchase behaviors are similar, the distance between their vectors is very small. With that, our approach is simple. For example, we take the people who are interested in digital products as seeds and search the 800 million for vectors close to them. If 10 million people have vectors closest to the seeds, we send digital-product advertisements or coupons to those 10 million people. This is how the method is used for marketing.
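
The seed-and-lookalike idea can be sketched with a brute-force nearest-neighbor search. Everything here is hypothetical: the three-dimensional vectors (share of spending on cosmetics, digital products, and books), the customer IDs, and the function names. A production system would use learned embeddings and an approximate vector index rather than an exact scan.

```python
import math


def centroid(vectors):
    """Average the seed vectors into a single reference point."""
    n = len(vectors)
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / n for i in range(dims))


def find_lookalikes(seed_vectors, candidates, k):
    """Return the k candidate IDs whose vectors are closest to the seed centroid."""
    center = centroid(seed_vectors)
    ranked = sorted(candidates, key=lambda cid: math.dist(candidates[cid], center))
    return ranked[:k]


# Hypothetical seed customers known to be interested in digital products.
digital_seeds = [(0.1, 0.9, 0.0), (0.0, 0.8, 0.2)]

# Hypothetical candidate pool: (cosmetics, digital, books) spending shares.
candidates = {
    "u1": (0.1, 0.8, 0.1),
    "u2": (0.7, 0.2, 0.1),
    "u3": (0.2, 0.7, 0.1),
    "u4": (0.1, 0.1, 0.8),
    "u5": (0.0, 1.0, 0.0),
}

print(find_lookalikes(digital_seeds, candidates, k=3))
```

The customers returned are the ones who would receive the digital-product coupons; the cosmetics and book buyers stay out of the push list because their vectors are far from the seeds.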

The core of this process has several parts.

The first is clustering the audience: dividing people into groups and knowing each person's transaction history. The data must support multi-dimensional analysis along any dimension.

The second is being able to run targeted analysis over all the data in the data warehouse.

The third is approximate vector search after clustering, to find the audiences similar to each cluster's vector and push messages to them.

All of this is currently implemented on AnalyticDB.

One more requirement is ad-hoc queries. For example, we want to find people who are interested in digital products and did not buy an iPhone 12 last year, because they may buy one this year. Or we find someone who bought an iPhone 12 and AirPods together last year, and we think there is a high probability that they will buy an Apple keyboard or an Apple computer. We need to run many kinds of transaction queries on these people in order to pinpoint the target audience.
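
The two ad-hoc questions above can be written as ordinary SQL. The sketch below runs them against an in-memory SQLite database, which stands in for the data warehouse only so that the example is runnable; the table names, columns, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, interest TEXT);
CREATE TABLE purchases (customer_id INTEGER, product TEXT, year INTEGER);
INSERT INTO customers VALUES
    (1, 'digital'), (2, 'digital'), (3, 'cosmetics'), (4, 'digital');
INSERT INTO purchases VALUES
    (1, 'iPhone 12', 2020), (1, 'AirPods', 2020),
    (2, 'AirPods', 2020), (3, 'lipstick', 2020);
""")

# Interested in digital products, but no iPhone 12 purchase in 2020.
no_iphone_yet = [row[0] for row in conn.execute("""
    SELECT c.id
    FROM customers AS c
    WHERE c.interest = 'digital'
      AND NOT EXISTS (
          SELECT 1 FROM purchases AS p
          WHERE p.customer_id = c.id
            AND p.product = 'iPhone 12'
            AND p.year = 2020)
    ORDER BY c.id
""")]

# Bought both an iPhone 12 and AirPods in 2020.
bought_both = [row[0] for row in conn.execute("""
    SELECT customer_id
    FROM purchases
    WHERE year = 2020 AND product IN ('iPhone 12', 'AirPods')
    GROUP BY customer_id
    HAVING COUNT(DISTINCT product) = 2
    ORDER BY customer_id
""")]

print("may buy an iPhone 12:", no_iphone_yet)
print("may buy a keyboard or computer:", bought_both)
```

Because the conditions change from campaign to campaign, queries like these cannot be precomputed, which is why the warehouse has to answer arbitrary filters quickly.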

Refined advertising management

截屏2021-07-20 下午5.46.08.png

Business challenges:

1) Keyword search events must be ingested in real time at high concurrency;

2) All users query conversion rates through the dashboard at the same time, so complex queries run at high QPS;

3) Fast response times are required to avoid missing the best window for price adjustment.

Business value:

1) Unified management of keywords across multiple sites and stores;

2) Handles tens of thousands of concurrent writes per second (TPS);

3) Real-time analysis of massive data, with intelligent price adjustment by time period;

4) Fast identification and analysis of keywords to maximize revenue.
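
As a small illustration of the dashboard metric mentioned above, the sketch below computes the conversion rate per keyword and hour from a list of search events. The event format, the sample data, and the function name are hypothetical; a real deployment would run this aggregation inside the data warehouse over a continuous stream of events.

```python
from collections import defaultdict


def conversion_rates(events):
    """Return {(keyword, hour): conversions / searches} for all events."""
    searches = defaultdict(int)
    conversions = defaultdict(int)
    for keyword, hour, converted in events:
        searches[(keyword, hour)] += 1
        if converted:
            conversions[(keyword, hour)] += 1
    return {key: conversions[key] / searches[key] for key in searches}


# Hypothetical events: (keyword, hour of day, whether the search led to an order).
events = [
    ("dress", 20, True), ("dress", 20, False),
    ("dress", 20, False), ("dress", 20, True),
    ("bag", 20, False), ("bag", 20, False),
    ("bag", 21, True), ("bag", 21, False),
]

for (keyword, hour), rate in sorted(conversion_rates(events).items()):
    print(f"{keyword} @ {hour}:00 -> {rate:.0%}")
```

A keyword whose rate drops in a given hour is a candidate for a price or bid adjustment in that time period.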

Online e-commerce

截屏2021-07-20 下午5.46.53.png

Business challenges:

1) Analysis on the traditional MySQL database had hit its limit, and complex reports over tens or hundreds of millions of rows could not return;

2) Complex reports need to return within seconds;

3) Compatibility with the MySQL ecosystem is required;

4) The business is growing rapidly, with differing requirements for computing and storage.

Business value:

1) RDS + AnalyticDB form a joint HTAP solution that isolates business workloads from analysis;

2) 2-10 times improvement in analysis performance;

3) Distributed architecture with horizontal scaling and flexible configuration, supporting different data volumes and access volumes.

Since 2020 we have been in the stage where next-generation cloud-native technology is being fully upgraded: the serverless era. Alibaba has established the Cloud Native Technology Committee, and cloud native has been elevated to Alibaba's new technology strategy. In the future, the cloud-native data warehouse will gain more new functions to solve more of the industry's core pain points, so stay tuned.


Alibaba Cloud Developer (阿里云开发者)
