Interview guests | Zhang Hua, Ma Qi, Zhang Chengyuan, Chen Chunhui
In February 2017, Liu Qiangdong, then the CEO of JD.com, announced the company's strategy for the next 12 years: technological transformation. In his speech that year, he mentioned cloud computing first, followed by big data, artificial intelligence and genetic technology. In moving from e-commerce company to technology provider, JD.com has shown no lack of courage or confidence. Its financial reports show that since the technology strategy upgrade began, the JD.com group has invested nearly 75 billion yuan in technology.
As an important part of JD.com's technology strategy, cloud computing can now stand on its own. In a little more than four years, JD Cloud has provided technical services to more than 1,500 large enterprises and 1.52 million small, medium and micro enterprises. So how strong is JD.com in cloud computing today?
The evolution of JD Cloud: big promotions and cloud computing reinforce each other
"Five or ten years ago, everyone who entered the field of cloud computing started from the infrastructure and developed independently. We all grew up in the same environment and simply chose different growth paths."
Since it was proposed in 2006, cloud computing has become a must for the world's top Internet companies. With its outstanding cloud computing business, Amazon has become the world's largest e-commerce company by market value. On the soil of China, the development of cloud computing is also closely related to the "e-commerce business" that heavily uses IT infrastructure.
From 2008 to 2012, local cloud computing in China was in its initial stage of development, and China's e-commerce was in its golden period. And then to an average of 500,000 a day. In 2011, the instantaneous traffic peak exceeded 100,000 orders per second. During shopping festivals and large-scale promotions, e-commerce platforms compete on whether their back-end systems and the corresponding IT resources can be expanded quickly enough to cope with traffic peaks.
At that time, JD.com's systems used a centralized architecture, and service resources could not be added at short notice. In 2011, JD.com's servers went down during a huge book promotion. In the last half hour of the event, the shopping cart and ordering pages were either slow to open or did not open at all, preventing many users from placing orders. It is rumored in the industry that the R&D colleagues in charge were invited to "drink tea" by the company's leaders, and JD.com had to apologize publicly on Weibo.
Business pressure created the opportunity for the cloud-based transformation of JD.com's technology architecture. Around 2012, the IT system was rebuilt on a distributed architecture and moved from physical machines to virtualization, so that IT resources could be adjusted flexibly. It later moved further, to a microservices architecture.
In the past ten years, JD.com has grown rapidly. Ten years ago, JD.com had 2,000 employees and revenue of 4 billion yuan. This year it has 370,000 employees and revenue of more than 700 billion yuan. As the business grew, so did the demand for infrastructure. After 2014, JD.com carried out an overall assessment and redesign of its technical architecture and cluster construction. With Docker on the rise at the time, JD.com migrated applications from physical machines to Docker, adopted an OpenStack + nova-docker architecture that managed containers in the same way as virtual machines, and developed JDOS1.0, its first-generation container engine platform.
Some of JD.com's core applications, such as seckill (flash sales), courier order details and global purchases, were all deployed on JDOS1.0. In the June 18 promotion of 2015, the large-scale Docker container and KVM virtual machine clusters run by JD.com withstood that year's traffic test.
Building JD Hybrid Cloud on containerization practice in complex scenarios
2015 was a watershed in the development of JD.com's technology. A head of technology R&D at JD.com said that before then, technology had always served the development of the business, but afterwards technology began to drive it. To cope better with complex business scenarios, JD.com restructured its technology organization that year, separating the technology department from the business departments to form an independent technology system. The technology department thus gained unprecedented independence. In addition to the application R&D team serving the mall business, R&D teams for cloud, big data, AI and other technologies began independent research and development for the first time, laying the groundwork for technology output and technological transformation. The development of JD Cloud entered the fast lane after that.
JDOS1.0 was JD.com's first use of Docker container technology, and its scheduling method was relatively simple: it could only filter and place workloads according to whether a physical machine's remaining resources met the requirements. This put a ceiling on application performance and platform utilization that could not be raised any further. In 2016, as the number of containers grew to between 100,000 and 150,000, JD.com rebuilt the storage and networking of JDOS1.0 around Kubernetes and opened up the whole CI/CD process, from source code to image to online deployment (JDOS2.0).
On the database side, JD.com went from its early use of Oracle and SQL Server products to removing Oracle and SQL Server completely in favor of open source and self-developed database products such as MySQL. JD Cloud Database began to provide external services in 2016, and more than ten cloud database products are currently available.
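The simple filter-style scheduling attributed to JDOS1.0 above can be pictured with a short sketch. This is only an illustration under stated assumptions, not JD.com's implementation: the `Host` fields, the host names and the "first host that fits" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """A physical machine with its remaining capacity (hypothetical fields)."""
    name: str
    free_cpu: int      # free CPU cores
    free_mem_gb: int   # free memory in GB


def filter_hosts(hosts: List[Host], need_cpu: int, need_mem_gb: int) -> List[Host]:
    """Keep only the hosts whose remaining resources satisfy the request."""
    return [h for h in hosts
            if h.free_cpu >= need_cpu and h.free_mem_gb >= need_mem_gb]


def schedule(hosts: List[Host], need_cpu: int, need_mem_gb: int) -> Optional[Host]:
    """Place the container on the first host that passes the filter.

    There is no scoring step here, which is why this style of scheduling
    cannot optimize for performance or overall utilization.
    """
    candidates = filter_hosts(hosts, need_cpu, need_mem_gb)
    return candidates[0] if candidates else None


# Hypothetical inventory used only for illustration.
hosts = [Host("pm-1", 2, 4), Host("pm-2", 8, 32), Host("pm-3", 16, 64)]
chosen = schedule(hosts, need_cpu=4, need_mem_gb=16)
```

A scheduler of this kind answers only "does it fit?". Schedulers such as the one in Kubernetes add a scoring phase after filtering, which is one way to raise the ceiling the article describes.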
Container technology is the cornerstone of all platform services, and JD.com is one of the most thoroughly containerized companies. Its internal infrastructure was containerized and pooled into shared resources, and middleware systems based on open source were built on top, forming the foundation of JD.com's private cloud. Combined with the public cloud platform planned in 2015, this formed JD.com's hybrid cloud, which was officially opened to the public the following year.
In 2017, JD.com rebuilt the related technology stacks on Kubernetes, upgraded its technology across the board, deployed databases, big data and other services through Kubernetes, and built the "Archimedes" scheduling system on top of containerization, described as the industry's earliest Kubernetes-based unified scheduling system for hybrid cloud. The group's core business also migrated to the cloud step by step. After several years of 618 and 11.11 promotions, the hybrid cloud PaaS platform has been tempered and has matured, and server computing power can be used to the full. In 2019, JD.com relied on its existing infrastructure and purchased no physical servers for the whole year, saving billions in IT costs.
Reducing complexity and building the Cloud Ship platform
As long-serving JD.com employees recall, JD.com's early setup was very small and simple, with only three systems: a trading website, a supply chain management system and a financial system. Running a promotion was very simple then: "At the morning meeting we would decide on an activity for the day, such as a lottery or a prize wheel. After the discussion, R&D would go and develop it until 4 or 5 o'clock in the afternoon, test that it worked, and put it online at 7 or 8 o'clock in the evening."
Today, the systems behind JD Cloud's underlying infrastructure have become huge and complex. Taking the big promotions alone as an example, they span 3 cloud vendors, 4 major regions, nearly 50 large data centers, nearly 60 city clouds and 77 offline data centers worldwide, supporting hundreds of thousands of smart devices and serving nearly 500 million users. Public cloud environments, private cloud environments, edge nodes and machine-room servers coexist, along with terminals and delivery vehicles out on the road, and this large-scale hybrid IT estate supports every big promotion at JD.com. Running a promotion is therefore much harder now, and for events such as 618 resources need to be expanded rapidly to 135% of the usual level. JD Cloud began extensive optimization work in 2020 and developed its own hybrid cloud operating system, "Cloud Ship" (Yunjian). Cloud Ship provides a unified interface for scheduling IT infrastructure, which hides the underlying complexity and is friendlier to external users.
This hybrid cloud operating system can simultaneously schedule more than 10 million cores of system resources, manage more than 2 million Pods online, and support cloud-native practice in the most complex scenarios. Zhang Chengyuan, technical director of JD.com's JD Cloud Business Group and head of the database R&D department, explained: "For users, Yunjian hides all the IaaS infrastructure underneath. The entire Yunjian operating system can be installed on just a few dozen cores, and all services such as databases and middleware can be used like software plug-ins and installed on demand."
JD.com's "moat" business, logistics, on the road to the cloud
JD Logistics is the core asset of JD Group. From 14 years ago, when Liu Qiangdong set up the logistics department against the public opinion, to the independent listing this year, JD Logistics has become an important "moat" of JD.com. Data shows that from 2018 to 2020, JD Logistics' revenue was 37.9 billion, 49.8 billion, and 73.4 billion respectively. Especially in 2020, the revenue achieved an explosive growth of 43.2%.
In 2018, JD.com began planning hybrid cloud services. Around 2019, migrating to the cloud has become an important technology strategy for the entire JD.com, which mainly refers to the shift from private cloud to public cloud. As an important business group, JD Logistics has relatively rich scenarios, and its cloud technology and experience can feed back the entire group. Driven by the company's policy and its own needs, logistics migration to the cloud has become an important part of JD's cloud migration strategy.
Migrating logistics to the cloud is a process of multi-department cooperation. The logistics department grasps the rhythm of cloud migration for the entire business. JD Cloud provides cloud infrastructure as required, and the two departments communicate about progress every half month.
In the newly designed logistics cloud infrastructure, the previously tightly coupled Docker, JimDB, ES (Elasticsearch) and DB (database) components are placed in separate public, business and data subnets within a VPC. The first step in going to the cloud is therefore to solve the network problem.
Logistics is not a pure Internet business, and its infrastructure topology is far more complex than that of today's leading Internet companies. The JD Logistics system manages about 1,300 warehouses across the country and is closely tied to physical logistics, so many systems run on local physical machines in different regions. Different VPC subnets have to be distributed across different physical machine rooms (AZs). JD Cloud needs to draw up a specific network plan, provide a completely isolated VPC environment, and refine the network configuration of each service.
Going to the cloud will transform logistics from heavy assets to light assets, so the team deployed CMDB to manage hybrid cloud assets and synchronize billing information. In order to ensure the controllability of the entire cloud migration process, the team carried out resource monitoring and performance monitoring. In addition, the team has also developed many self-service operation and maintenance tools and data synchronization platform "Data Honeycomb" to adapt to the cloud architecture, while using some traditional tools, such as J-one and UDBA, to reduce the learning cost of R&D personnel.
Migration to the Cloud: "Stuck" in Dependence on People
"Before the migration, I had no idea, because each business system is completely different, and I can't predict what difficulties will be encountered."
Going to the cloud is a big project, especially because the logistics business is extremely complex and its modules depend on one another. Business applications, databases and middleware involving millions of cores need to be migrated to the cloud. "During the operation, every order, transaction, payment and package involved must not go wrong. This is a very big technical challenge. It can be said that the entire process of migrating to the cloud is tantamount to changing the engine of a plane flying at high speed," Zhang Chengyuan said.
After getting ready, JD Logistics did not go to the cloud immediately, but made a small-scale migration of non-core businesses to verify the availability of various components and continue to improve the migration tools. This "experimental" phase lasted for more than half a year, after which the logistics system ushered in a large-scale cloud deployment.
There are basically two types of cloud migration methods: cloud transformation and direct reconstruction of the original system. For logistics systems, cloud transformation accounts for more.
Ma Qi, head of the engineering efficiency team of JD Logistics Technology Development Department, mentioned that some of the systems that need to be transformed are still relatively costly to change. "After all, the so-called cloud-native concept is popular in the industry now, but when some logistics businesses were originally built, nobody was thinking about running them on the cloud in the future."
This is particularly evident in the logistics system. Compared with ordinary Internet companies that have only one or two machine rooms, the infrastructure of a logistics company is localized: JD.com has thousands of logistics warehouses across the country and many local databases, and its systems had been deployed on physical machines that were extremely powerful for their time. As an extreme example, some databases were specified with very high demands on physical machine performance, whereas in the cloud era they can be split up and distributed across different cloud hosts. "Don't think of a machine on the cloud as a physical machine," Ma Qi emphasized. Migrating these systems to the cloud means huge changes for them.
After sorting out the architecture, JD.com divided the services to be migrated into two types: stateful and stateless. Services deployed in Docker, for example, can be treated as stateless. Once deployed, they can be verified extensively in much the same way as an ordinary release, and a grayscale test that shows no abnormal indicators is the easiest check. For stateful services, such as databases, Redis and ES, the entire state has to be migrated, which is more complicated and takes more effort and cost.
For the migration from physical machines to Docker containers, each team can run stress tests and calculate the difference in QPS before and after the migration. Gradual replacement has little impact on the system and keeps it highly available and stable.
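The before-and-after QPS comparison can be expressed as a small calculation. The sketch below is illustrative only: the request counts, the 60-second window and the 10% tolerance are assumed figures, not numbers from JD Logistics.

```python
def qps(requests_completed: int, duration_seconds: float) -> float:
    """Queries per second measured over one stress-test run."""
    return requests_completed / duration_seconds


def qps_change_percent(before: float, after: float) -> float:
    """Relative change in QPS after migration, in percent (negative = slower)."""
    return (after - before) / before * 100.0


def within_tolerance(before: float, after: float, max_drop_percent: float) -> bool:
    """True if the post-migration QPS has not dropped by more than the tolerance."""
    return qps_change_percent(before, after) >= -max_drop_percent


# Assumed numbers for illustration: same test plan run on both environments.
qps_physical = qps(requests_completed=120_000, duration_seconds=60)
qps_container = qps(requests_completed=114_000, duration_seconds=60)
change = qps_change_percent(qps_physical, qps_container)
acceptable = within_tolerance(qps_physical, qps_container, max_drop_percent=10.0)
```

Running the same test plan against both environments keeps the comparison fair, and a tolerance agreed in advance gives a clear pass or fail for each gradual replacement step.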
Middleware layer migration is a relatively big technical challenge faced by the team. On the one hand, public cloud products will have very standard Open APIs, while some previous internal products basically only consider business needs. On the other hand, the versions of various middleware are different, including between cloud and on-premises.
One of the more difficult technical points was the migration of Redis. Some teams used the cache middleware built for the retail business, which is fundamentally different from public cloud cache middleware. JD.com's internal JimDB distributed cache product differs from the distributed Redis product on the cloud: the former is more private and customized, and its heavily adapted nature does not sit well with public cloud products. In migrating to public cloud Redis clusters, the logistics, public cloud and Redis development teams all had to deal with this difference. In the end, after several rounds of discussion, the team developed an SDK compatible with both JimDB clusters and public cloud Redis clusters, so that migration can be seamless and requires only changes to dependencies, URLs and the like.
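One common way to build such a compatibility layer is a thin client that exposes one interface and picks a backend from the connection URL. The sketch below shows only that idea. Everything in it is hypothetical: the `jimdb://` and `redis://` schemes, the class names and the in-memory stand-in backends are assumptions for illustration and do not describe the real JimDB or JD Cloud Redis APIs.

```python
from urllib.parse import urlparse


class InMemoryBackend:
    """Stand-in for a real cache client; keeps values in a local dict."""

    def __init__(self, label: str):
        self.label = label
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


# Hypothetical mapping from URL scheme to a backend factory. A real SDK
# would construct the actual JimDB or Redis client here.
BACKENDS = {
    "jimdb": lambda url: InMemoryBackend("jimdb"),
    "redis": lambda url: InMemoryBackend("cloud-redis"),
}


class CacheClient:
    """Single interface for application code; the URL decides the backend."""

    def __init__(self, url: str):
        scheme = urlparse(url).scheme
        if scheme not in BACKENDS:
            raise ValueError(f"unsupported cache URL scheme: {scheme!r}")
        self._backend = BACKENDS[scheme](url)

    @property
    def backend_label(self) -> str:
        return self._backend.label

    def get(self, key):
        return self._backend.get(key)

    def set(self, key, value):
        self._backend.set(key, value)


# Application code stays the same; only the configured URL changes.
before_migration = CacheClient("jimdb://example-cluster")
after_migration = CacheClient("redis://example-cluster:6379")
after_migration.set("order:1", "packed")
```

With this shape, switching a service from the private cache to the cloud cache becomes a configuration change rather than a code change, which matches the "modify dependencies and URLs" outcome described above.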
During the entire migration process, the "bottleneck" fell on the database team. After all, compared with the amount of data in the Redis cache, the data migration of hundreds of gigabytes or terabytes of data would be much more complicated.
JD.com started moving away from SQL Server and Oracle in about 2014, and before going to the cloud most of the business was already using MySQL. Although most of the local databases in the logistics system are MySQL, their version numbers still differ from those on the cloud, or some features are not enabled. When the MySQL version on the public cloud is higher, the cross-version gap prevents the new RDS cluster from simply being attached as a replica of the existing database.
In addition, when the lower version is directly upgraded to the higher version, some problems may occur, such as the change of the data type of the variable, which leads to the change of the time stamp precision. These all require DBA to assist in solving, and DBA is also responsible for a lot of deployment, monitoring, backup and so on.
Migrating the local databases in thousands of logistics warehouses initially relied heavily on the DBA team. At that time, the logistics DBA team had only seven or eight people. On top of their intense daily workload, they had to migrate thousands of databases. Fires were breaking out all over the project, and the team was on call almost around the clock. It was quickly overwhelmed, which seriously affected the progress of the entire cloud migration.
Zhang Hua, Chief Architect of JD Logistics Technology Development Department, recalled: "Before, we thought more about whether the system architecture and the team's technical capabilities could adapt to the new operation and maintenance methods brought by the cloud. I did not expect DBA manpower to become the bottleneck, and we could not recruit enough people at short notice." The DBA team therefore had to pause its work and develop automated tools to replace manual labor for repetitive tasks such as migration and verification. In the end, DBA tools and other tooling were used to copy the data to the RDS cluster, after which a time window was found for switching the domain name.
Compared with infrastructure migration, application instance migration is much easier. Application traffic can be sent to private cloud and public cloud groups at the same time. After stable operation, the private cloud group can be removed.
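The dual-group approach can be sketched as percentage-based routing, where a configurable share of requests goes to the public cloud group and the share is raised as confidence grows. This is a minimal sketch under assumptions: the group names, the use of a request ID as the routing key and the CRC32 bucketing are illustrative choices, not details from the source.

```python
import zlib


def route(request_id: str, public_percent: int) -> str:
    """Return which group serves the request.

    The request ID is hashed into one of 100 buckets with CRC32, so the
    same request ID always lands in the same bucket. Buckets below
    `public_percent` go to the public cloud group.
    """
    if not 0 <= public_percent <= 100:
        raise ValueError("public_percent must be between 0 and 100")
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return "public" if bucket < public_percent else "private"


# Start with all traffic on the private cloud group, then raise the share
# step by step; at 100 the private group receives nothing and can be removed.
request_ids = [f"req-{i}" for i in range(1000)]
at_zero = {route(r, 0) for r in request_ids}
at_full = {route(r, 100) for r in request_ids}
```

Because the bucket for a given ID is fixed, raising the percentage only ever moves requests from the private group to the public group, never back, which makes each step easy to observe and to reverse by lowering the number again.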
JD.com also divides its systems into three levels according to the logistics business they serve: zero-level, first-level and second-level. The impact of these three levels on the business decreases in that order. Systems that affect order placement are zero-level, while statistical analysis tasks that run only once a day are second-level. Migration generally starts with the second-level systems, which have the lowest impact on the business, followed by the first-level and then the zero-level systems. System boundaries are drawn according to the business, and the migration is carried out in steps. It is also necessary to consider whether an application migration can be rolled back if something goes wrong. Zero-level systems additionally go through comparison testing and grayscale switching, with an active-active phase lasting from three days to a month. Only after the new architecture has been verified is the old one taken offline. Ma Qi suggested that developers should test thoroughly after migrating to the cloud.
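The level-based ordering can be written down as a simple sort. The system names below are hypothetical examples chosen to match the descriptions in the text; only the rule itself (second-level first, zero-level last) comes from the article.

```python
from typing import List, Tuple

# (system name, level): 0 = highest business impact, 2 = lowest.
System = Tuple[str, int]


def migration_order(systems: List[System]) -> List[System]:
    """Order systems so the lowest-impact level (2) migrates first."""
    return sorted(systems, key=lambda s: s[1], reverse=True)


def needs_active_active(system: System) -> bool:
    """Zero-level systems get comparison tests and an active-active phase."""
    return system[1] == 0


# Hypothetical inventory for illustration.
inventory = [
    ("order-placement", 0),
    ("warehouse-dispatch", 1),
    ("daily-statistics", 2),
]
plan = migration_order(inventory)
```

Starting with the systems whose failure hurts least lets the team exercise the migration tools and rollback path before the systems that affect order placement are touched.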
Logistics is the user of cloud resources, and as the supplier, Chen Chunhui, the architect of JD.com's technology delivery department, summed up the work that public cloud-related departments need to do during migration from four aspects:
The first is to provide high availability to ensure that the physical computer rooms (AZ) of different storage systems are connected to the computer rooms of public clouds in different regions such as East China and South China, and ensure the high availability of Docker, database, middleware and other systems from the computer room level;
The second is to ensure high performance and ensure the utilization of the machine. For example, the CPU should not be lower than the threshold of 40-50%, so that the maximum performance of machines, containers, databases, etc. can play a role;
The third is to ensure high security. Combined with VPC subnets, ACL security policies, database auditing, and WAF and DDOS protection are implemented to ensure high business security;
The fourth is to provide high operation and maintenance capabilities, use cloud resources to improve logistics-side operation and maintenance capabilities, calculate resource usage by department, and provide refined billing.
Logistics on the cloud: more than tens of millions in cost savings
After about two years of hard work by the team, logistics has become the first department of JD.com to move fully to the cloud. Core business systems such as the JD Logistics order platform now run stably on JD Cloud, and daily order volume on the cloud has reached tens of millions. What is the standard for being "fully on the cloud"? The logistics team thought about this question for a long time. In the end, it concluded that the application should be the unit of measure: what matters is whether all of an application's resources, such as Redis, databases and ES, are on the cloud. If they all are, the application is fully on the cloud. This has now become JD.com's standard for going to the cloud.
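The "fully on the cloud" standard amounts to a per-application check over its resources. The sketch below illustrates it. The application and resource names and the "cloud"/"local" labels are assumptions for illustration, and treating an application with no listed resources as not verifiable is a design choice of this sketch, not something the source states.

```python
from typing import Dict


def fully_on_cloud(resources: Dict[str, str]) -> bool:
    """True only if every resource the application depends on is on the cloud.

    `resources` maps a resource name (for example "redis", "database", "es")
    to where it currently runs: "cloud" or "local". An empty mapping returns
    False here, because nothing has been verified for that application.
    """
    if not resources:
        return False
    return all(location == "cloud" for location in resources.values())


# Hypothetical applications for illustration.
order_platform = {"redis": "cloud", "database": "cloud", "es": "cloud"}
warehouse_app = {"redis": "cloud", "database": "local", "es": "cloud"}

order_done = fully_on_cloud(order_platform)
warehouse_done = fully_on_cloud(warehouse_app)
```

Measuring by application rather than by machine means an application whose code runs on the cloud but whose database is still in a local machine room does not count as migrated.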
The biggest change that the move to the cloud has brought to the logistics department is that it no longer has to spend too much energy on infrastructure.
Compared with ordinary Internet companies, the logistics system has more local computer rooms, which means that the resource utilization flexibility of the logistics system is very small. With development, the physical computer and computing resources required by the logistics department will only increase, and the waste of resources will also increase. At the same time, the logistics department also needs manpower statistics and spends a lot of energy to maintain the stability of many small computer rooms.
Before going to the cloud, most of the infrastructure of the logistics department was used by the retail department, and its own infrastructure was relatively immature, and some maintenance work was also done by the retail team. The logistics team spends a lot of energy on ensuring adequate resources.
After going to the cloud, the flexibility of the resources that can be used by the logistics department has been greatly increased, and the resource utilization rate has also been greatly improved, which has brought cost savings of tens of millions to the department. The automated billing method also gives the R&D team a more intuitive cost concept.
In Ma Qi's view, migrating to the cloud is not about moving physical machines to the cloud, but making the entire system and applications suitable for the cloud, so as to get the most benefits from migrating to the cloud. "If the enterprise has the ability and resources, the sooner it can go to the cloud, the better."
The first big promotion after the logistics on the cloud
This year's 6.18 big promotion on JD.com is the first time that it has accepted the challenge of traffic peaks since the logistics went to the cloud.
"I don't know what to do if I don't resist a big promotion." Zhang Hua said. JD 618 can be called one of the most complex business scenarios in the world, covering multiple business forms from retail, logistics, finance, and health. Before the big promotion every year, the important responsible persons in each BG/BU of JD.com will form a preparation committee, and the key task is to ensure the stability of the system in the event of a surge in traffic.
In the past two years, the logistics team has introduced full-link stress testing, which tests traffic across the entire process from the user placing an order to every participating system completing its task. Within it, the flow in which "orders are sent to the supply chain system, passed down to the logistics system, and then passed down by the logistics system to the specific warehouse" is the core of the whole link and the focus of the transformation. The stress test results are also an important basis for JD Cloud's capacity planning.
A year ago, the logistics team developed a tool for "troubleshooting drills" to sort out various systems, identify weak points and high availability areas, check for leaks, and further strengthen the robustness of the system.
Currently, JD.com has millions of microservice applications, and troubleshooting is challenging. By modeling the fault experience accumulated over the years, JD.com has developed a fault analysis system for automatic screening. Locating a fault that spans multiple departments used to take 20 to 30 minutes. Now it can be done in one or two minutes.
The remote multi-active architecture also ensures the stability of the service. Once a node has a problem, the traffic will be cut to other nodes, and the entire service will not be affected.
The JD Logistics system also has performance targets during the promotion period. For example, a system whose CPU utilization is below 50% is judged not to be running at high performance and is flagged for problems such as low load.
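A utilization target like this can be checked mechanically. The sketch below flags services whose CPU utilization falls under the threshold. The 50% figure comes from the text, while the service names, the sample values and the idea of checking one average value per service are assumptions for illustration.

```python
from typing import Dict, List


def flag_low_utilization(cpu_by_service: Dict[str, float],
                         threshold: float = 50.0) -> List[str]:
    """Return the names of services whose CPU utilization (%) is below the threshold."""
    return sorted(name for name, cpu in cpu_by_service.items() if cpu < threshold)


# Hypothetical average CPU utilization (%) during a promotion window.
samples = {"order-platform": 62.5, "sorting-center": 35.0, "billing": 50.0}
under_loaded = flag_low_utilization(samples)
```

A service flagged this way is a candidate for a smaller allocation, so that the freed capacity can be returned to the shared pool.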
The strong backing of stability is sufficient resources. This year, JD 618 has expanded its resources by 135% compared to usual. After going to the cloud, it becomes possible to promote "daily", and the technical team does not have to consume too much energy for this.
JD Cloud used to evaluate the amount of resources on a yearly basis, but now it is evaluated on a quarterly basis, and even in terms of supply chain, it can be refined to months or weeks, and the use of batches will not cause too much backlog of resources. In addition to meeting daily needs, the resource pool on the cloud will generally have surplus to deal with the large-scale promotion traffic exceeding the budget or other emergencies.
JD.com has more than 500,000 addresses in a single VPC, and has a large-scale network management cluster with a network of more than 100 gigabytes of nodes, carrying TB-level dedicated line traffic. The cloud ship, which manages more than 10 million core resources, supports the rapid expansion of the system during the promotion period. In addition, Yunjian's IT infrastructure scheduling capabilities allow logistics, retail, health and other systems to run on a unified scheduling platform, making the overall system highly flexible. The data shows that during the 618 period, the utilization rate of JD.com's entire system resources increased by 3 times, and the unit order cost decreased by 30%.
JD Cloud prepares for 11.11
This year's special challenge for Double Eleven
Stockpiling resources and ensuring stability are the routine safeguards for any big promotion, but this year's Double Eleven is a bit special.
Since the beginning of October, notices of "orderly electricity consumption" have been issued in many places across the country, and the IDCs of various companies face the risk of power rationing and outages. JD Logistics has sorting centers all over the country, each with local equipment. If the power goes out somewhere, resuming operations quickly is a big test.
In order to prevent accidents, JD Cloud has done a lot of security work to ensure that the IDC has a backup power supply after a power outage. For the application layer, the core system has been deployed in two computer rooms. After a computer room is powered off, the other computer room can carry traffic and continue to operate.
In addition, this year JD.com moved the start of Double Eleven forward from midnight to 8:00 pm. The resulting pulse-like traffic peak also poses a severe challenge to the system. Building on the Cloud Ship hybrid cloud operating system and on technology for co-locating online and offline workloads, JD Cloud allocates and schedules resources flexibly across platforms, shaving peaks and filling valleys so that resource use is staggered and balanced, and responds stably to the traffic pulse at 8 o'clock in the evening.
"There is no so-called big move or secret recipe. As long as the specific little things are done well, the big promotion can be stabilized. But how to make the whole process more efficient and shorter is the challenge." Zhang Hua said. To this end, in addition to upgrading infrastructure, process, standardization, and tooling are also very important. In particular, it will be more important to make the preparations for the big promotion routine.
Closing thoughts
"Going to the cloud is a trend. Ten years ago, it might be worth discussing whether to go to the cloud or not, but today it should not be discussed any more. The cloud is a must." Zhang Hua said.
According to statistics from the Academy of Information and Communications Technology, the overall cloud computing market in China reached 209.1 billion yuan in 2020, a growth rate of 56.6%. Within that, the public cloud market reached 127.7 billion yuan, an increase of 85.2% over 2019. As cloud computing plays an increasingly important role in the digital transformation of enterprises, enterprises are expected to keep increasing their infrastructure investment in the short term. This is undoubtedly a great opportunity for JD Cloud as it moves into industry.
Just like what JD.com employees said, "JD.com has been talking about technology, technology, and technology all these years, and I really feel a lot of changes." JD.com has paid more attention to process, tool, and cloud than before. It took only five years to build JD Cloud, but the story of JD Cloud is far from over, and there is still a long way to go in the future.
Interview guests:
Zhang Hua, Chief Architect of JD Logistics Technology Development Department, has been responsible for a number of major company-level projects, including cloud logistics, JD 618, and 11.11 preparations.
Ma Qi, head of the Engineering Efficiency Team in the Middle-Platform Technology Department of the JD Logistics Technology Development Department, is responsible for the management, operation and maintenance of logistics computing resources. In 2021, the team improved the overall utilization of logistics computing resources through technical means, saving logistics tens of millions in costs.
Zhang Chengyuan, technical director of JD.com's JD Cloud Business Group and head of the database R&D department, led the team that built the JD Cloud database product line from 0 to 1 and from 1 to N. He also took on the group's cloud migration work, organizing and coordinating its promotion and implementation across departments and teams.
Chen Chunhui, architect in the Technology Delivery Department of JD.com. The team is responsible for the cloud migration of the group's logistics and retail businesses, provides technical support and architecture optimization during migration, works with the group's cloud customers on their architecture before major events such as 618 and 11.11 to improve the stability of their business, and provides dedicated support during the promotion period.