
This article introduces OceanBase's implementation practice in Red Elephant Yunteng's big data scenarios and the thinking behind it, in the hope of helping enterprise users who are evaluating OceanBase to select and deploy it quickly.

Author: Tong Xiaojun

Chairman and CTO of Red Elephant Yunteng (REDOOP), the first Cloudera CCDH-certified engineer in China, and former co-chairman of the China Hadoop Summit.

Red Elephant Yunteng (Redoop) is a big data software vendor focused on the Apache Hadoop ecosystem. Its main product, the Redoop big data platform (Redoop Enterprise V9.0), consists of three parts: CRF for data access, CRH for data storage, and CRS for data analysis. Red Elephant's core business is building a series of Hadoop-centered solutions to serve Chinese customers. Hadoop itself, as an Apache Foundation project, comes without commercial support and services, and Red Elephant provides user-facing commercial solutions on top of it. The Hadoop distribution is Red Elephant's starting point and also our career: we have been promoting and popularizing big data and Hadoop all along. In this respect, Hadoop and OceanBase complement each other well. We also found that the big data Hadoop ecosystem lacks a solution that supports distributed transactions and multi-center deployment across two regions.

We participate in the OceanBase community with an attitude of active involvement and willingness to experiment, because we were among the first to try Hadoop. When Hadoop was born, its architecture was very simple and it was inconvenient to use, but while Hadoop was still in its 0.x versions we began trying it, driven by business needs and data scale. When OceanBase was open sourced this time, I found it amazing. If I had to sum it up in one phrase, it would be "concise and beautiful". Personally, I think technical people should have a stance. What is a stance? Our time is precious, and we should spend it on genuinely good work.

"OceanBase, as a distributed database, shows simplicity and beauty in its architecture, which allows us to see new opportunities." - Tong Xiaojun, CTO, Red Elephant Yunteng

1. Why did we choose OceanBase?

There are five reasons:

First, a change of identity. OceanBase set out on the open source, open-community route on June 1 this year, which gave us a sense of participation. If it were not open source, first, we would have no chance to get this version; second, we would not know how to contribute; and besides, the value of our contribution could not be recognized by the community. Open source lets us change from bystanders, or users, into participants, which is a very important change of identity. On the other hand, we may not have a sufficient budget for the commercial version, yet we still want to use OceanBase's technology. With the open source route available, we can use it boldly.

Second, technology selection. We have clear requirements for a database. Hadoop is a natively distributed, highly available system with multiple replicas by design, but it is relatively weak across data centers. So when choosing a database, we require not only distributed, highly available features, but also linear scalability. These are our selection criteria, and OceanBase meets them.

Third, compatibility, which is a very critical feature. OceanBase has good compatibility with MySQL, and many of our applications can be ported to OceanBase without changing much code. One application was migrated without modifying a single line of code and went straight into use, so high compatibility is a key reason we chose OceanBase.

Fourth, technical support. Although the OceanBase solution is very mature inside Alibaba and Ant, OceanBase is a new product for us. During selection and testing, we worried about hitting technical problems we could not solve, such as the business failing to access the database normally or other exceptions, which would be hard to explain to our customers. Throughout our use of OceanBase, the community team has given us great support: when we run into problems, our engineers report them in the user group, and the community's technical team responds and answers promptly, so we can use it with confidence.

Fifth, and another key point in choosing OceanBase: it is easy to deploy, concise, and easy to use.

When I first looked at OceanBase, I found it has only one core component, OBServer. There are peripheral components, of course, but the core is a single component. Hadoop, by contrast, has too many components: as an entry-level user, it takes a long time to understand what each of them does before you can master a Hadoop system. OceanBase implements many of Hadoop's capabilities, plus many that Hadoop lacks, so simplicity is beauty.

2. In which scenarios does Red Elephant Yunteng use OceanBase?

Red Elephant mainly works on distributed big data business scenarios, which follow two routes: batch processing and stream processing. Hadoop is good at back-end processing, that is, large-scale data processing such as ETL cleaning, but not at serving the user side. When facing users, for example with reports or connected applications, Hadoop is largely powerless; it is built for large-scale batch processing.

For real-time-oriented scenarios there are solutions such as HBase, but they are still heavyweight, so we needed something lightweight. We used MySQL before; now we use OceanBase to replace the MySQL cluster that serves business reports. When a data job completes and its results are stored in a result database, OceanBase takes on the role of serving the application side. So in big data scenarios such as online services over massive data, storage of ETL cleaning results, lightweight OLAP analysis reports, and metadata database services, the OceanBase solution can be considered. The main usage scenarios of distributed databases in big data business are as follows:

image.png
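For the "storage of ETL cleaning results" scenario above, here is a minimal sketch of loading result rows into an OceanBase result table over its MySQL-compatible protocol. The table and column names are illustrative, not from the project; the point is that one idempotent multi-row upsert keeps the result table correct even when a job is re-run.

```python
# Hedged sketch: pushing ETL results into an OceanBase (MySQL-protocol)
# result table. Table/column names below are hypothetical examples.

def build_upsert(table, rows):
    """Build one multi-row INSERT ... ON DUPLICATE KEY UPDATE statement.

    `rows` is a list of dicts sharing the same keys; the first key is
    assumed to be the primary key of the result table.
    """
    cols = list(rows[0])
    placeholders = ", ".join(
        ["(" + ", ".join(["%s"] * len(cols)) + ")"] * len(rows)
    )
    updates = ", ".join(f"{c} = VALUES({c})" for c in cols[1:])
    sql = (f"INSERT INTO {table} ({', '.join(cols)}) "
           f"VALUES {placeholders} ON DUPLICATE KEY UPDATE {updates}")
    # Flatten row values in column order for the driver's placeholders
    params = [row[c] for row in rows for c in cols]
    return sql, params

rows = [
    {"station_id": "pv-001", "day": "2021-12-01", "kwh": 812.4},
    {"station_id": "pv-002", "day": "2021-12-01", "kwh": 790.1},
]
sql, params = build_upsert("daily_generation", rows)
# With a live cluster, this would be executed through any MySQL driver,
# e.g. cursor.execute(sql, params).
```

Because OceanBase is MySQL-compatible, the exact same statement works unchanged whether the result database is MySQL or an OceanBase tenant.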

The Hive metadata that the current Hadoop ecosystem relies on is stored in MySQL, and when business volume grows there is no particularly good storage solution for it. For example, a customer on one of our business lines demanded that the MySQL holding Hive metadata also be highly available. What should we do? Build a MySQL cluster and fall back into the same trap? A simple primary/standby setup does not give real high availability. So there are many such scenarios across our Hadoop technology stack: the first is replacing MySQL, and the second is handling query requests from multiple front-end businesses.

"New energy power big data is launched on OceanBase"

Using OceanBase's strengths to make up for Hadoop's shortcomings is what Red Elephant is trying to do: combining the two ecosystems to serve customers better. Take a look at Red Elephant's new energy case in the power industry, which combines Hadoop and OceanBase. It is a big data platform that includes Hadoop, Hive, Druid, Spark, Flink, and other components.

image.png

The distributions you may have seen on the market are very large; our goal at Red Elephant is to make ours small. I have personally thought about it: cut to the extreme, what can be left? We keep Hadoop at the core, turn the rest into plug-ins, and find partners to build the periphery. This is a piece of advice I want to give to entrepreneurs: rather than piling up a bunch of open source components, do one component well; that is more valuable. This is also our route: do Hadoop well, and work with partners to complete the commercial solutions for the rest.

In the energy big data scenario there is a business flow: data from photovoltaic sensors is continuously written into Kafka; a Flink job consumes the Kafka data and writes the processed results back into Kafka; Apache Druid then consumes that data and serves external queries. This is an industrial time-series data processing scenario.

image.png

Where does OceanBase come into this process? The point-table data is transmitted as a comma-separated, CSV-like stream, and the fields of this stream can change, so we need a database to store these changes. In Flink we have to fetch the metadata field descriptions and attach the table-structure information. Therefore, between the raw Kafka data and the data Druid consumes there is a data-completion step, and the description information for the indicators is stored in OceanBase.

image.png
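The data-completion step described above can be sketched as follows. This is a simplified illustration: the point schema here is an inline dict, whereas in the real pipeline the field descriptions would be read from the metadata tables in OceanBase and the completion would run inside the Flink job.

```python
# Hedged sketch of "data completion": bare comma-separated sensor values
# are joined with field descriptions before being handed to Druid.
# The point id and field names below are illustrative.
import json

# In production this mapping would be loaded from OceanBase over a MySQL
# driver; it is inlined here so the sketch is self-contained.
POINT_SCHEMA = {
    "pv-001": ["ts", "voltage", "current", "temperature"],
}

def complete_record(point_id, csv_line):
    """Turn a bare CSV line into a named JSON event Druid can ingest."""
    fields = POINT_SCHEMA[point_id]
    values = csv_line.split(",")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    event = dict(zip(fields, values))
    event["point_id"] = point_id
    return json.dumps(event, sort_keys=True)

print(complete_record("pv-001", "1638316800,612.3,8.1,41.5"))
```

The length check matters in practice: when the point table changes, the schema in OceanBase must be updated first, otherwise mismatched records are rejected instead of being silently mislabeled.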

We also have businesses that use various databases. In the past we would load that data into Hadoop and build a Hive table for the business to use; the whole process was complicated and tiring for users. Now we write the data directly into OceanBase and serve it from there. The architecture is very simple, and it solves customer problems very conveniently.

In the past, Red Elephant liked to add things, making systems very complicated; now we do things by subtraction. We have a communications administration project in Xinjiang with 600 Hadoop nodes. Originally, the data from China Telecom and China Unicom was loaded into Kafka, computed with Spark, and then written into Hive; the processing flow was too long. After simplification, we push the data directly into Hadoop. Sometimes subtraction works better, and now we design our work and architectures by subtracting.

image.png

This Hadoop cluster deployment architecture has 4 management nodes on top and data nodes below. OceanBase consists of 6 nodes; each node has 36 cores and 128 GB of memory, with a 2 TB data disk and a log disk. This is the solution currently in production, reviewed by the OceanBase team.

Our cluster deployment has 3 zones with 2 machines per zone, 6 machines in total. On this infrastructure, OceanBase, PGSQL, and HDMS work together to support various businesses, covering application scenarios such as data interfaces, system applications, and report visualization.
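As an illustration only (unit sizes and names are hypothetical, not the project's actual settings), a tenant spanning the three zones above, with two resource units per zone to match the two servers, could be created roughly like this in OceanBase's MySQL mode:

```sql
-- Hypothetical sizing; adjust to the 36-core / 128 GB servers actually deployed
CREATE RESOURCE UNIT unit_biz MAX_CPU 16, MAX_MEMORY '32G', MAX_IOPS 10000,
       MAX_DISK_SIZE '1T', MAX_SESSION_NUM 10000,
       MIN_CPU = 16, MIN_MEMORY = '32G', MIN_IOPS = 10000;

-- UNIT_NUM 2 places two units in every zone, one per server
CREATE RESOURCE POOL pool_biz UNIT 'unit_biz', UNIT_NUM 2,
       ZONE_LIST ('zone1', 'zone2', 'zone3');

CREATE TENANT biz_tenant RESOURCE_POOL_LIST = ('pool_biz'),
       PRIMARY_ZONE = 'zone1';
```

With three zones, the tenant's data automatically gets one replica per zone, and Paxos keeps the service available if one zone is lost.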

"Hadoop + OceanBase lights up new energy big data"

This is our current cluster, still a small one: 3 Kafka nodes, 8 Hadoop nodes (management plus data), and 6 OceanBase nodes, supporting 100,000 collection points. Those 100,000 points are collected every day, adding about 0.5 TB of new data. Although the data volume is not huge, we have a 600-node Hadoop cluster behind it. OceanBase's role in the business will become more and more important, and its core capabilities, such as elastic scaling and HTAP, will be fully exercised.

image.png

The investment in this project is very large. It is a photovoltaic energy-storage platform deployed by the national energy authority, a very meaningful project that represents our new technologies and new products in the photovoltaic industry.

3. What can be improved from the user's point of view?

Based on Red Elephant Yunteng's practice, from a user's point of view I would like to offer five suggestions to OceanBase for reference.

  1. JBOD (just a bunch of disks) multi-drive mount support.

Hadoop machines are all JBOD setups with many disks; a machine may have 12 disks of 4 TB each. But OceanBase points at a single disk, so what about the remaining 11? Turning them into a RAID volume does not match our hybrid deployment architecture. If JBOD multi-drive mount support were available, migrating Hadoop businesses to OceanBase would be more convenient, and machine resources could be reused more easily. The OceanBase team has indicated that this requirement is under internal evaluation.

  2. Initial data file disk usage.

After deploying OceanBase, we found disk usage at 90%. This is because OceanBase pre-creates a large data file at deployment, reserving the space to make reads and writes efficient. I can understand this, but operations, of course, cannot. After we deployed OceanBase, our operations engineer asked, "Mr. Tong, no data has been written yet, so why is the disk full?" Later, we showed operations the actual usage by extracting indicator data from OceanBase. We would also like the initial file to grow incrementally. The OceanBase team has indicated that this requirement is under internal evaluation.
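The 90% figure matches OceanBase's default `datafile_disk_percentage` of 90. Until incremental growth is available, the pre-allocation can be capped at deployment time. A hedged sketch of an OBD cluster configuration fragment (values are illustrative; check the current obd documentation for the exact keys):

```yaml
# Fragment of an obd deployment config (illustrative)
oceanbase-ce:
  global:
    home_path: /home/admin/observer
    # Default is 90, which is why the disk looked "full" right after deploy
    datafile_disk_percentage: 50
```

Lowering the percentage trades away pre-reserved space for a less alarming disk-usage graph; the reserved file still does not represent written data.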

  3. Startup, configuration management, and monitoring of individual components.

With the first open source version of OceanBase, we had to install the deployment environment from the command line and configure it through configuration files. Starting and stopping a single component was not particularly convenient, since operations applied to the whole cluster. In some scenarios we need to start or stop one component on one specific machine precisely, rather than editing configuration files or performing more complex operations; per-machine, per-component configuration management is a real challenge for management tools. In the latest OceanBase Community Edition 3.1.2, OceanBase Cloud Platform (OCP) is now open source. OCP is an enterprise-level database management platform centered on OceanBase that easily solves the problems above. It provides full-lifecycle management for OceanBase clusters, tenants, and other components, as well as management of related resources (hosts, networks, software packages, etc.), making cluster management more efficient, reducing enterprises' IT operation and maintenance costs, and it is very user-friendly.

  4. ODBC driver support.

Our business sites also use ODBC. OceanBase currently supports JDBC but not ODBC, while MySQL can be used through ODBC, so in actual business scenarios data still has to be written back to MySQL. In industrial scenarios there are many Windows machines and other reasons to use the ODBC driver. We hope OceanBase will support ODBC as well as JDBC. The OceanBase team has indicated that this requirement is already on the roadmap.

  5. Latin1 character set support.

We wanted to migrate Hive metadata and found that the Latin1 character set is not supported. Currently we adapt by changing the application; it would be better if OceanBase supported it later. The OceanBase team has indicated that this requirement is under internal evaluation.
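One application-side workaround amounts to converting the metastore tables to a character set OceanBase does support before migration. A hedged sketch (TBLS and COLUMNS_V2 are tables from a stock Hive metastore schema; verify against your schema version):

```sql
-- Run against the source MySQL metastore before migrating to OceanBase
ALTER DATABASE hive CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
ALTER TABLE TBLS CONVERT TO CHARACTER SET utf8mb4;
ALTER TABLE COLUMNS_V2 CONVERT TO CHARACTER SET utf8mb4;
-- ...repeat for the remaining metastore tables
```

Since Hive metastore content is mostly ASCII identifiers, such a conversion is normally lossless, but it should still be rehearsed on a copy first.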

4. Ecosystem integration, complementary advantages

  1. Hive Metadata on OceanBase.

Our current Hadoop metadata solution is still based on MySQL, so it has a single point of failure and limited capacity. We plan to replace MySQL with OceanBase and apply this to the communications administration project; it is still in the testing stage.
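Because OceanBase speaks the MySQL protocol, the switch is mostly a matter of pointing the metastore's JDBC connection at an OceanBase MySQL-mode tenant. A hedged hive-site.xml sketch, where the host, port, database, and tenant names are placeholders rather than the project's real values:

```xml
<!-- hive-site.xml fragment: metastore backed by an OceanBase tenant -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://obproxy-host:2883/hive_meta</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive@hive_tenant</value>
</property>
```

The standard MySQL JDBC driver is reused unchanged; only the URL and the user (which carries the tenant name in OceanBase) differ from a plain MySQL setup.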

  2. OceanBase backup to NFS on HDFS.

This is about OceanBase's backup. OceanBase itself has backup and recovery functions that everyone can use. Meanwhile, data is stored in three replicas by default, and the Paxos protocol ensures strong data consistency and naturally guarantees high service availability. But what we are thinking about is: while OceanBase solves our problems, could it also make use of Hadoop? That would combine the advantages of both products into a very good solution.

You can feel it in the following scenario: our company's largest customer is China Aerospace, with dozens of PB stored and managed on Hadoop, covering the data of dozens of satellites. Our responsibility is great and we cannot take it lightly. Since the system went live, we have not had a large-scale data loss caused by machine failure in ten years; even when a few small blocks are lost, we can retrieve them from the file system. Hadoop gives us a good backup mechanism, so data can be stored on it safely without fear of loss. My small suggestion, therefore, is to try using Hadoop as OceanBase's backup component.
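One way to try this today, without any new OceanBase feature, is to expose HDFS through the Hadoop NFS Gateway and point OceanBase's backup destination at the mount. A hedged operational sketch (hosts and paths are placeholders; the backup statements follow OceanBase CE 3.x and should be checked against the current documentation):

```shell
# 1. Mount HDFS via the Hadoop NFS Gateway (the gateway only speaks NFSv3)
mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /mnt/hdfs
mkdir -p /mnt/hdfs/ob_backup

# 2. Point OceanBase's backup destination at the mounted HDFS directory
mysql -h127.0.0.1 -P2881 -uroot@sys -e \
  "ALTER SYSTEM SET backup_dest = 'file:///mnt/hdfs/ob_backup'"

# 3. Take a full backup into HDFS
mysql -h127.0.0.1 -P2881 -uroot@sys -e "ALTER SYSTEM BACKUP DATABASE"
```

From OceanBase's side this is an ordinary file-system destination; HDFS's own replication then protects the backup set the same way it protects the satellite data mentioned above.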

Written at the end

From open source, back to open source: this is my suggestion to everyone. Many of us use open source products, and some may be founders of open source software. We are beneficiaries of open source, so we should also contribute to it.

Share my understanding of open source:

First, open source is a process of establishing standards, and first-class companies make standards. Second, open source is a process of establishing connections: when your software is widely used, you build connections with many companies, and business opportunities follow naturally. Participating in the Hadoop open source community to popularize and promote Hadoop is the key reason Red Elephant Yunteng grew. We hope to grow together with the OceanBase open source community and contribute our strength to OceanBase open source.

