Interviewer: SegmentFault
Guests: Baidu PALO team (Yang Zhengguo, Miao Ling, Li Haopeng, Zhu Xiaoli, Gong Zheng, Zhang Zhiqiang, Zhong Yi, Zhang Dongjin, and others)

On June 16, 2022, the Apache Software Foundation announced in a blog post that Doris had graduated and become an Apache Top-Level Project (TLP). (Related reading: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces81 ).

Fourteen years have passed since Doris was born in 2008. Over that time it has gone through birth, promotion, development, open-sourcing, donation, and prosperity, along with many hardships and setbacks.

SegmentFault conducted an exclusive interview with the Baidu PALO team to bring you the little-known stories and the twists and turns behind this journey.

The following content is compiled from the interview transcript.

Born for data analysis: Doris's past and present

SegmentFault: Looking back at the 14-year development history of Doris, what project milestones has Doris experienced?

Baidu PALO team: Doris was originally created inside Baidu to meet the company's need for highly concurrent, near-real-time online reporting. Without our quite noticing, fourteen years have gone by since then.

• In 2008, Doris was born at Baidu, positioned as a high-performance analytical database. It greatly improved the timeliness of data analysis in Baidu's Phoenix Nest business.
• In 2009, Doris began to support other reporting systems within Baidu and helped Baidu Statistics become the leading Chinese-language website analytics tool in China.
• In 2012, Doris grew into Baidu's first company-level OLAP analysis platform and was officially renamed PALO (OLAP spelled backwards).
• In 2013, PALO was upgraded to a new-generation MPP distributed architecture and a high-performance data model, and its core technical indicators improved greatly.
• In 2017, PALO was officially open-sourced.
• In 2018, Baidu donated the core engine of PALO to the Apache Software Foundation under the name Apache Doris, and the Baidu PALO team began to fully promote the development of the Doris community.
• In 2020, through the joint efforts of the Baidu PALO team and community partners, the Apache Doris community entered a period of fast-track development.
• In 2021, Doris's core capabilities and industry influence both grew significantly. It was named an "OSCAR Peak Open Source Project and Community" of 2021 by the China Academy of Information and Communications Technology, and was certified as one of the first official members of the Trusted Open Source Community (TWOS).
• In 2022, Baidu officially completed the trademark donation and helped Apache Doris graduate to become a top-level project of the Apache Software Foundation.


SegmentFault: We saw that Doris once changed its name to PALO (the reverse of OLAP). Is there any special meaning behind this?

Baidu PALO team: When it was born in 2008, the project was named Doris within Baidu. Around 2012-2013, Doris underwent a major architectural revision and upgrade. Its positioning at the time was "solving high-concurrency, highly real-time OLAP scenarios", so the team reversed "OLAP", and "PALO" was born. In the end, PALO became the name of Baidu's commercial data warehouse product based on Apache Doris.

In 2018, the Baidu PALO team worked with the Baidu colleagues in charge of open source to donate the PALO project to the Apache Software Foundation (ASF). For brand and trademark reasons, the open source project was ultimately named Apache Doris. The Doris brand and trademark were donated to the ASF, and Doris officially became an ASF incubating project.


SegmentFault: As a high-performance analytical database within Baidu, why did you consider making the product open source?

Baidu PALO team: By the time we designed the new version of Doris in 2013, it had already been proven inside Baidu in complex, high-concurrency, high-pressure scenarios. We hoped to open source it one day so that more people could use it and benefit from it, and so that open source could in turn accelerate its growth. In designing the new version, we therefore removed the dependencies on Baidu's internal closed-source libraries and internal systems so that the whole system could run independently. The Baidu PALO team firmly believes that infrastructure software will inevitably follow the open source route in the future, and that only open source can sustain a product's vitality and iteration speed.

When we first open-sourced the project, our goal was to build Doris into one of the best open source data warehouses in the world, so we chose to do so at the Apache Software Foundation. As everyone knows, the Apache Software Foundation's projects in the big data field, such as Hadoop and Spark, are very influential.

Since the project was open-sourced, the number of Stars and Contributors of Apache Doris has grown several-fold, in some cases dozens of times over. Growth in the community's core indicators has accelerated significantly since 2020 in particular, which indirectly shows how much outside interest there is in Apache Doris. The Doris project continues to gain traction and recognition.

The figure below is the data from Star History. You can see that Apache Doris has grown from the initial 240 Stars to 4,500 Stars. It can also be seen from the figure that the growth trend has accelerated significantly in the past two years.

The figure below is the data from api7. It can be seen that the number of Apache Doris Contributors has increased from about 10 at the beginning to 330+ now. The growth trend is gradually accelerating, and the number of monthly active contributors has reached 80.



SegmentFault: In the past two years, capital has been pouring into open source in China. In your opinion, what kind of software is suited to open source, and what kind is suited to closed-source development?

Baidu PALO team: "Open source" has indeed been popular with investors in the past two years. In addition, the national 14th Five-Year Plan supports open source, which has pushed its popularity even higher. The track that Doris is on in particular is known as the "Golden Track", and more than a dozen start-ups have been founded around it in the past two years.

Capital does not pursue and favor open source out of sentiment or fondness. Investors care about the commercialization prospects behind open source. What capital brings to open source is resources, which can make open source projects develop faster and products mature earlier. The commercial products built on open source are where investors see the real value. To keep winning the favor of capital, the commercial products behind an open source project must address the needs and pain points of paying users, offer features and competitiveness that differentiate them from the open source product, and maintain a healthy symbiotic relationship with it.

"Open source" emphasizes building and sharing together, and a group of people can go further than one. We therefore believe that products requiring large investment and a long conversion cycle are better suited to open source. It is against this background that Apache Doris took the open source route. Through open source, the Doris project has received far more investment than any single company could provide, and has been used and polished by a large number of users. As a result, Doris's product strength and influence have grown by leaps and bounds.


SegmentFault: From internal tools to the extremely fast and easy-to-use MPP open source database, what changes has Doris experienced at the product level?

Baidu PALO team: Doris's product positioning is "born for data analysis". Its changes at the product level have followed the changes in data analysis scenarios and have gone through four stages of development:

The first stage is business intelligence analysis, which focuses on describing and analyzing the business through data. This is the stage in which Doris was born. In 2008, Doris was created in Baidu's Phoenix Nest business specifically to solve the reporting problem. By 2012, Doris had become Baidu's first company-level OLAP platform, serving the reporting needs of the entire company.

The second stage is massive data analysis. As enterprise informatization and digitization became widespread, data volumes increased significantly, and databases had to handle ten times as much data as before. In 2013, Doris completed its move to an MPP engine, using distributed capabilities to greatly improve data processing capacity and efficiency. The volume of data it could process rose to the TB-to-PB range, and query latency also improved.

The third stage is real-time data analysis. In this stage, enterprises pay more attention to real-time and predictive analysis. They want to speed up the traditional T+1 analysis model on massive data and obtain minute-level or even second-level analysis capabilities. This is also the core user pain point that Doris has addressed in recent years. Since it was open-sourced in 2017, Doris has continuously improved performance by optimizing core modules such as the storage engine, query engine, and query optimizer. The recent vectorized engine in particular has accelerated performance by 5-10 times, enabling millisecond-level queries on massive data.

The fourth stage is data analysis for everyone. In this stage, data analysis is no longer the exclusive domain of data warehouse maintainers and data analysts. Many roles in the enterprise have both the need and the ability to analyze data on their own, and they expect to analyze and use data quickly to obtain value. This is the requirement Doris is currently responding to. On the one hand, in terms of query performance, Doris must support higher concurrency and throughput and provide more diverse query capabilities to handle more varied analysis needs. On the other hand, Doris needs to provide low-cost, low-threshold ways to query and analyze data, such as further improving its self-operation and maintenance capabilities and providing an out-of-the-box data analysis experience, so that users can focus on the business and the data itself without spending too much energy on deployment, operation, and maintenance at the bottom of the system. This is reflected more fully in Palo, Baidu's commercial version. We make full use of the cloud's elasticity and containerization capabilities to provide users with cluster hosting services at almost zero operation and maintenance cost. Both the scaling of storage and computing resources and the cloud-native monitoring and tuning capabilities help users move quickly into the era of data analysis for everyone at very low cost and with a very low barrier to entry.

The barrel effect: Doris is strong because every plank is balanced

SegmentFault: Doris performs very well in terms of performance, feature richness and ease of use, so in your opinion, what is the strongest competitiveness of Doris?

Baidu PALO team: The strongest competitive advantage of Apache Doris is that it can adapt to the business needs of all scenarios. Whether the scenario involves aggregation, detail data, ad hoc queries, single tables, or multiple tables, Doris supports it well. Beyond that, Doris does very well in performance, feature richness, and ease of use. On performance, Doris has been polished for many years and performs excellently in a wide range of scenarios, especially since the recent launch of the vectorized version, which improved performance several-fold.

A Doris cluster can easily be scaled out and in dynamically. When a node fails, data is migrated automatically. None of this depends on external systems or affects upper-level business systems, so cluster operation and maintenance is very simple. In addition, Doris supports a very rich standard SQL syntax and can complete a variety of complex query tasks. Beyond traditional AP scenarios, Doris also provides high-performance streaming writes and high-concurrency online access, and can be used to build data services in a modern HSAP architecture. Doris is therefore a very comprehensive, mature, and easy-to-use system. Like a wooden barrel whose planks are all the same height, Doris is well balanced on every side and so naturally holds more water.


SegmentFault: People often discuss the importance of performance and ease of use. How does Doris balance the relationship between performance and ease of use?

Baidu PALO team: Performance is one of the most important indicators of the Apache Doris kernel, and it is also an important criterion by which the industry evaluates database products. Doris's pursuit of performance has therefore never stopped. The vectorized execution engine and the new optimizer that we are continuously developing are important means of improving performance. At the same time, simplicity, ease of use, stability, and reliability remain a constant focus for Doris. Doris is well known and widely recognized for its minimal operation and maintenance burden and its rich functionality.

The ultimate performance can help users deal with complex and demanding business scenarios; simplicity and ease of use can make the entire system less expensive to build and more efficient in operation and maintenance. Therefore, performance and ease of use are not in conflict, nor are they opposed to each other. Our pursuit of ultimate performance at the kernel level does not affect our ability to provide users with simple and easy-to-use functions through sophisticated design. A lot of work is done where it can be perceived, so that each feature can meet the needs of users at different levels.


SegmentFault: What are the core customer problems that Doris has solved at present? What are the more common application scenarios? Is there a mature business use case?

Baidu PALO team: At its core, Doris solves customers' need for highly timely analysis of massive data. By building up its own product capabilities, Doris has achieved higher performance, more comprehensive functionality, simpler operation and maintenance, and a richer ecosystem across the data analysis process.

Combined with past experience, we have summarized four major application scenarios of Doris:

• Traditional data warehouse acceleration: On top of an existing traditional stand-alone database or offline data warehouse, high-performance analytical database capabilities are added to accelerate the query capabilities of traditional data warehouses.
• Real-time data warehouse construction: Build a data warehouse based on real-time data, and support end-to-end real-time data analysis, including efficient real-time data writing and high-performance real-time data analysis.
• Multi-source federated query: Provide a unified query entry across multiple data sources, and meet the diverse query needs of business personnel with a unified one-stop query capability.
• Interactive data analysis and online data services: Provide a high-concurrency and high-efficiency online query experience, and efficiently support business-side reports, large-screen dashboards, or ad hoc queries with extremely low construction costs.

Palo, a commercial data warehouse product built by Baidu based on Apache Doris, has been implemented in various industries since it began to provide services in 2017, and has accumulated a large number of commercial use cases.

To date, nearly 100 companies are using Palo's commercial services. For example, in a project for a leading financial payment company, we focused on strengthening Palo's multi-tenancy capability to meet the data and resource isolation requirements of the company's provincial and municipal subsidiaries. In a project for a leading acoustic component manufacturer, we made full use of Palo's ability to write and query time-series data to support the analysis and monitoring of equipment data on the production floor. In cooperation with a leading Internet Q&A community, we focused on optimizing bitmaps and their related usage for the customer's advertising and user-profiling needs, which gave strong support to the construction of the customer's content data middle platform.


SegmentFault: In the past two years, some external commercial companies based on Apache Doris have begun to emerge. What does the PALO team think about the relationship between open source and business? How do you view the competition and cooperation between them?

Baidu PALO team: We are very happy to see more and more companies taking part in building the Apache Doris project. It shows that the project has been recognized by users, the community, and the capital market, and that our original decision to donate Doris to the Apache Software Foundation was the right one. As the community continues to develop, we hope that more and more companies will join in and work with Baidu PALO to build a prosperous and powerful open source software ecosystem.

Of course, we have also seen some bad, commercially driven behavior that has harmed the community in the past. As the founding team of Apache Doris, we will therefore firmly uphold order in the community and make sure that participants abide by the Apache Way code of conduct, so as to promote the healthy and sustainable development of the community.

At present, a large number of underlying technology products adopt the open source model, and customers are increasingly accepting of it. Whether a product is open source is becoming an important basis for many customers' purchasing decisions. At the same time, the open source community can help us build a solid user base and broad, positive brand recognition, which helps our commercialization. For future technology products, open source may therefore become a must. This "must" will not necessarily harm the business model. It can promote commercial success.

Doing open source at a big tech company: what you gain is more "poetry and distant horizons"

SegmentFault: Doris has gone through nearly 4 years from donation, incubation to graduation. Did the team encounter some difficulties during this period, and how did they solve them? Do you have any suggestions for projects that are just open source and still in incubation?

Baidu PALO team: It has been nearly 4 years from the donation of Apache Doris to its graduation today. There have indeed been many twists and turns along the way. Here are a few points we would like to share.

The first difficulty was presenting value internally and resolving resource conflicts within the team. Open source enabled the rapid development of the Doris project, but it also brought additional workload. The team has to maintain two code bases (the open source product Doris and the commercial product Palo) and balance its investment between the two. At the same time, it has to explain to the company the value of these two lines of work and the relationship between them: the direct value, the indirect value, and even the intangible value hidden behind them.

The most important step in solving this problem is for the team to redefine "benefits". The benefits of open source are not only reflected in performance reviews and promotions. Open source also helps the team build its influence in the community and even across the industry. The growth that comes from this process is faster and greater than what comes from developing a closed-source product inside one company. As the saying goes: "The sea is wide for fish to leap, the sky is high for birds to fly." Open source provides that "sea" and "sky".

While doing a good job in open source, the team should also plan the development of commercial products, bring commercial benefits to the company, and present the positive effect of open source on commercialization to the company, so that it will continue to receive the support of the company and form a positive cycle.

The second difficulty is that, due to inexperience, we took some detours. At first, everyone had a limited understanding of open source. We lacked everything and had to explore everything from scratch, from preparing materials to building up awareness and experience, stumbling forward step by step. From building the Apache Doris official website, to running the official public account, to developing the channel ecosystem, to sorting out the relationship between open source and commercialization, we went through many hardships and challenges. Fortunately, everyone persevered together.

In the early stage of content building, if we were not careful, it was easy to cross the boundary of "open source" and mix in "commercialization". Fortunately, our Apache mentors helped us point out problems and correct them in time.

On this point, our team's advice is to work out at the start of the project what the team's ultimate goal in doing open source is. "Begin with the end in mind", so that the path does not swing too much during execution and resource investment does not drift too far off course. At the same time, strictly distinguish between open source and commercialization (including product form and operating model), try not to act out of self-interest in the community, and communicate more with the project mentors. They are experienced and relatively neutral. Listen to them and you can't go wrong.

The third difficulty is the interference of some external commercial factors with the project. As an Apache open source project, Doris does not exclude the use and participation of commercial companies. However, some bad behavior driven purely by commercial interests does not conform to the Apache Way and harms the community. For projects that are considering open source incubation, in addition to choosing the license and the incubating organization early on, we recommend protecting the project name and brand. Work such as trademark registration must be done up front. If you encounter problems such as infringement during incubation, communicate promptly with the project mentors and the company's legal department, and use legal means if necessary to protect the healthy development of the community.


SegmentFault: How do you understand the Apache Way?

Baidu PALO team: To understand The Apache Way, you can refer to an article by Sally on the official blog of the Apache Software Foundation. If you are interested, you can read the original text directly:

https://blogs.apache.org/foundation/entry/the-apache-way-to-sustainable

The focus of the Apache Software Foundation's work is not to produce software, but to guide the community that produces software. We can understand this guiding method as the Apache Way, which is an open source community development guidebook that is constantly improving and growing in practice. It enables individuals or organizations to understand how large-scale open source software works well in a highly competitive market.

The core principle of The Apache Way is "community over code". It places more emphasis on people and the ecosystem. Only a healthy community can produce excellent code. After all, people are the core productive force, and with excellent developers who follow the rules, there is no need to worry about a lack of excellent code. A healthy community can always correct problems in the code, but an unhealthy community has difficulty even keeping up normal maintenance of the code base. The Apache Way is the "legal framework" that protects a community so that it is still healthy and prosperous 20 years later. Those who violate it must accept "punishment".

The Apache Way is fully inclusive, open, transparent and consensus-based. It ensures participant neutrality from commercial companies to prevent undue influence (or control) from a single company. It ensures that any individual with a valuable contribution has the right to be empowered, and while community membership inevitably changes over time, it ensures the sustainability of the project.

Having graduated as a top-level project, Apache Doris sets sail for the sea of stars

SegmentFault: Graduation from the Apache incubator means that Apache Doris will start a new journey. Looking forward to the future, what is the development plan (community, product, business) of Doris?

Baidu PALO team: Graduation means a new beginning and a new responsibility. We will, as always, fully support and contribute to the community, share with it the product capabilities we have gained in practice, and work with community partners to make Apache Doris more complete, so that more people can experience its excellent capabilities.

In terms of product technology, we will continue to polish Doris's core capabilities and maintain its leading position in core technical indicators. On performance, we will comprehensively polish or rebuild existing core components such as the query layer, execution layer, and storage layer, especially vectorization and the optimizer, which attract the most attention in the community. The vectorized execution engine will completely eliminate the row-based in-memory format. At the same time, we will implement a new CBO optimizer and finer-grained, richer statistics, which will push Doris's performance further toward the extreme. On stability and observability, we will focus on making up for Doris's shortcomings by strengthening its profiling, troubleshooting, and fine-grained resource monitoring and control capabilities. We will also continue to contribute to the community the stability problems we encounter in large-scale production environments, along with their solutions, which in turn will help Doris become more stable. We will also continue to improve Doris's important functions and ecosystem integration, including built-in support for complex types, optimization of UDF/UDAF, better integration with the Hadoop/Spark ecosystem, enhanced data lake and federated query capabilities, improvements to the management and control platform, and other functions most requested by the community, helping Doris become more powerful and easier to use. Beyond that, Doris still has a lot to improve. We are very willing to listen to users, and we welcome everyone to raise issues and help Doris develop better together with us.

In terms of community building, Apache Doris has passed the early embryonic stage and is entering a period of rapid development. On the one hand, we need to further strengthen our investment in community operations so that Doris becomes known to more contributors, developers, and users, and strive to build a diverse, prosperous, and international community. On the other hand, as the community expands and commercial demands grow, we will establish or improve the community's rules, regulations, and codes of conduct, so that all parties can participate under the guidance of the Apache Way and the Doris community can develop in a healthy and orderly way.

On the commercial side, we are also continuously improving our commercial product "PALO Data Warehouse", built around Apache Doris. Compared with the open source Doris engine, the PALO data warehouse will provide large-scale production-level stability, complete enterprise-level features, an easy-to-use control and access platform, seamless upgrades, and professional technical support from senior experts. In addition, our newly launched PALO Cloud product will provide complete cloud-native capabilities as well as multi-cloud and cross-cloud capabilities, helping users embrace the cloud and multi-cloud era. At the same time, PALO Cloud capabilities such as offline integration and lake-warehouse integration will help users build a new generation of data-centric, unified lake warehouses for all scenarios, giving enterprises a unified view of and access to their data and truly unlocking its value, in keeping with the meaning behind the name PALO: "playing with OLAP".

