Global technology is rooted in global business. After five stages of evolution, it has gradually developed into a relatively independent technology system within the Alibaba Group. This article will first focus on the challenges and technical practices of the global infrastructure layer.
1. Business development history
1.1 Business Background
Alibaba Group's globalization began with the company's founding in 1999: its first business unit, Alibaba.com, was a global business. The AliExpress beta launched in 2009, on the company's 10th anniversary, marking the move of Alibaba's globalization into the to-C (consumer-facing) era. In 2016, the group set the goal that "in the next 20 years, Alibaba Group will serve 2 billion consumers, create 100 million employment opportunities, and help 10 million small and medium-sized enterprises make profits." The company also successively acquired overseas e-commerce companies such as Lazada (6 countries in Southeast Asia), Daraz (5 countries in South Asia), and Trendyol (Turkey), officially opening the era of globalization as one of the group's three major strategies (consumption, cloud computing, and globalization).
After the acquisition of globalization-related businesses in 2016, Alibaba Group's global business layout took shape:
- The above figure shows the countries/regions currently covered by Alibaba's globalization business. The key countries/regions span three continents: Asia, Europe, and America. Differences in business demands lead to clear differences in technical solutions. No single end-to-end technical solution can perfectly support every country/region, but combining and customizing differentiated layers has proved feasible, which puts requirements on our [system standardization];
- The era of extensive growth has passed. In the era of refined operations, handling user experience and regulatory compliance by deploying technical solutions closer to users is the basis for building a local experience, which in turn puts requirements on our [system lightweight];
- As the digital age deepens, digitalization and intelligence are changing every aspect of human society more and more profoundly. As a global business, whether our users come from developed or developing countries, our goal is always to let digitalization and intelligence help them live a better life, which also puts requirements on our [system intelligence].
1.2 Iterative process of global technology system
In response to the above-mentioned business demands, the global technology system was officially incubated from the group's technology system and gradually developed into a relatively independent technology system within the Alibaba Group through five stages of evolution.
1. Phase 1: based on the systems of the domestic Taobao, Tmall, Sotui, and other teams, a complete new e-commerce core system supporting Lazada was built in 6 months.
2. Phase 2: this e-commerce core system was customized to build a complete new e-commerce system supporting Daraz.
3. Phase 3: the e-commerce core and the AE system were deeply integrated, and proven system solutions from Taobao, Tmall, and other teams were introduced at the same time, forming the prototype of an international middle platform that supports both local and cross-border transaction modes.
4. Phase 4: based on this integrated version, Lazada, Daraz, and Tmall Taobao Overseas were merged in, completing the 4-in-1 consolidation of the international middle-platform technology branches and finally forming the new globalization architecture of one middle platform supporting N sites.
5. Phase 5: the open-source strategy for the international middle platform began to be implemented. After more than a year, the open-sourcing of the entire middle-platform link was completed in November 2021, forming a closed iteration loop between the global businesses and the middle platform.
6. Phase 6: the future is here, so stay tuned.
Next, we will use a series of articles to clarify the challenges and responses of the global technology system. In this article, we will first share with you the challenges and technical practices of the global infrastructure layer.
2. Challenges Facing the Global Infrastructure Level
From the perspective of both the users an e-commerce website serves (buyers and sellers) and the operation of the website itself, beyond basic requirements for site access such as performance and availability, there are also requirements such as global deployment, legal compliance, and data isolation. These requirements bring new challenges to our infrastructure construction. Here are some examples:
· Global deployment: Whether for user experience or regulatory compliance, globally deployed infrastructure is a basic capability that a global business must build. It also directly shapes many aspects of the global technology system. At the same time, building and maintaining globally deployed infrastructure is itself a huge challenge.
· Performance: Performance here means the processing latency of a user request: the shorter the delay between the user sending a request and receiving the response, the better. Global Internet services face an inherent latency challenge because physical distances are longer: the data center may be in the United States while the user is in Australia. Our test data shows that the average network RTT for US users accessing US-hosted services is under 10 ms, while the RTT for Russian users accessing a data center in the western United States ranges from 150 ms to 300 ms. This directly makes the user's full-screen loading time 1 second longer, and 1 second is enough to lower the conversion rate or even lose users.
· Availability: Serving global users also brings cost challenges, which in turn become availability challenges. If availability is ensured only from a local perspective, each locality needs dual data centers for high availability, but idle resources in other regions' data centers then cannot be used and the overall cost becomes very high. Our 7×24 availability requirement is defined from a global perspective, so if remote disaster recovery can be achieved on a global scale, availability for users can be ensured within an acceptable cost.
· Data consistency: The challenge is how to keep data consistent when it is shared by users in multiple places around the world, all of whom can read and write it. For example, in the global buying and global selling scenario, the buyer creates an order in their local data center and the seller maintains the order in theirs. If it is the same order and the buyer and seller are in different data centers, how is read/write consistency across locations ensured? Disaster recovery between global data centers also produces reads and writes in multiple places; how is data consistency ensured then?
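As a rough illustration of the performance point above, the extra loading time can be approximated as the RTT difference multiplied by the number of round trips a page needs before it renders. This is only a back-of-the-envelope sketch: the figure of 5 sequential round trips is an assumption made for illustration and is not from the test data, and `extra_load_time_ms` is a hypothetical helper.

```python
def extra_load_time_ms(local_rtt_ms, remote_rtt_ms, sequential_round_trips):
    """Extra latency added when each sequential round trip goes to a remote site."""
    return (remote_rtt_ms - local_rtt_ms) * sequential_round_trips


# Assumption: about 5 sequential round trips before the first screen renders
# (for example DNS, TCP, TLS, the HTML document, and one critical resource).
ROUND_TRIPS = 5

low = extra_load_time_ms(10, 150, ROUND_TRIPS)   # 700 ms
high = extra_load_time_ms(10, 300, ROUND_TRIPS)  # 1450 ms
print(f"extra load time: {low} ms to {high} ms")
```

Under that assumption, the RTT range quoted above lands in the same order of magnitude as the observed 1-second increase.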
In addition, as the global deployment architecture has been upgraded in recent years, existing data centers have gradually been migrated to cloud data centers. New business expansion and compliance-driven deployment architectures all use the cloud as their infrastructure, and the global business already runs on the cloud. The cloud also provides richer, more flexible, and virtually unlimited infrastructure capabilities. On cloud infrastructure, we have put into practice a multi-mode deployment and disaster recovery architecture suited to overseas business to address user experience, availability, and data consistency; we have put data compliance solutions in place to meet business compliance requirements; and the cloud-native architecture concept defines how to use cloud products to reshape the software development process.
Combining the challenges faced by globalization, the following will explain in detail the practice of globalization and compliance from three perspectives: overseas deployment and disaster recovery architecture, data compliance, and cloud-native architecture practice.
3. Cloud-based practice of going overseas
3.1 Overseas deployment and disaster recovery in practice
3.1.1 Alibaba Cloud Infrastructure
· IaaS layer: Relying on Alibaba Cloud's consistent global infrastructure, we have built an overseas digital business infrastructure spanning 6 major regions, 13 physical data centers, and 17 logical data centers (AZs) worldwide, enjoying elastic resource capabilities without having to deploy and maintain data centers in multiple countries ourselves.
· PaaS layer: Relying on the global deployment of Alibaba Cloud's middleware and cloud products, we can address the technical challenges of globalization layer by layer, from top to bottom.
3.1.2 Global Deployment Architecture
Based on the two business models, local-to-local and cross-border, we have two deployment architectures: remote (cross-region) multi-active and same-city dual-active. At the same time, we often need to deploy services for multiple countries and sites in one regional data center, which leads to a multi-tenant architecture. The following introduces our practice in remote multi-active, same-city dual-active, and single-region multi-tenancy.
Regionalization and remote multi-active
The core requirement of AliExpress is global buying and global selling. In addition, the network latency of nearest-site access and disaster recovery scenarios must be considered. Under the constraints of multi-region deployment and this core requirement, the general principle of regionalized deployment is clear: unlike the local-site model of Amazon and Lazada, data consistency must be guaranteed across regions. For example, when a buyer and a seller from different regions trade, shared data must stay consistent; and when remote disaster recovery migrates a user's home region, services in different regions must still see that user consistently.
· Network layer: DNS resolves the user to the nearest data center (IDC), and the request reaches that data center's unified access layer.
· Access layer: A unified routing layer must be bridged in to apply a strongly consistent correction of user attribution. That is, the access layer calls the routing service, looks up which region the user belongs to, and schedules the request across data centers so that the user is redirected to the right one.
· Service layer: For data requiring strong consistency, such as payment and transaction data, the service layer must backstop the user attribution decided by the unified routing layer. That is, if the unified routing layer routes a request incorrectly, the MSE layer must redirect the service call across data centers to the region the user actually belongs to. In addition, to keep shared data consistent, cross-data-center service calls for centralized reads and writes need to be supported. In short, the MSE layer must be able to make cross-data-center calls based on user attribution or directed to the central data center.
· Database layer: We implemented a write-prohibition function by extending the database with a plug-in. It is the last line of defense against user-attribution errors and for strong data consistency: if the user's home region is inconsistent with the region actually serving the call, the write is rejected, which avoids dirty writes across regions.
· Data synchronization layer: Two-way data synchronization between the central data center and the regional data centers ensures data consistency for remote disaster recovery and avoids data loss after a user's region changes.
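The interplay between access-layer correction and the database write guard can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: `RoutingService`, `RegionDatabase`, `handle_write`, and the region names are hypothetical, and real routing, service calls, and the database plug-in are far more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class WriteForbiddenError(Exception):
    """Raised when a write lands in a region that does not own the user."""


@dataclass
class RoutingService:
    # Hypothetical stand-in for the unified routing layer: user id -> home region.
    home_region: Dict[str, str]

    def region_of(self, user_id: str) -> str:
        return self.home_region[user_id]


@dataclass
class RegionDatabase:
    region: str
    routing: RoutingService
    rows: Dict[str, str] = field(default_factory=dict)

    def write(self, user_id: str, key: str, value: str) -> None:
        # Last line of defense: reject writes for users owned by another region.
        owner = self.routing.region_of(user_id)
        if owner != self.region:
            raise WriteForbiddenError(
                f"user {user_id} belongs to {owner}, not {self.region}"
            )
        self.rows[key] = value


def handle_write(entry_region: str, user_id: str, key: str, value: str,
                 routing: RoutingService,
                 databases: Dict[str, RegionDatabase]) -> Tuple[str, bool]:
    """Access-layer correction: forward the request to the user's home region.

    Returns the region that served the write and whether it was rerouted.
    """
    target = routing.region_of(user_id)
    databases[target].write(user_id, key, value)
    return target, target != entry_region


routing = RoutingService({"buyer-1": "eu", "seller-9": "us"})
databases = {name: RegionDatabase(name, routing) for name in ("eu", "us")}

# DNS sent buyer-1 to the "us" entry point; the access layer corrects this to "eu".
served_by = handle_write("us", "buyer-1", "order-1001", "CREATED", routing, databases)
```

If a request bypasses the correction and tries to write in the wrong region, `RegionDatabase.write` raises `WriteForbiddenError` instead of producing dirty data.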
Same-city dual-active
Unlike AliExpress's global buying and selling business, the Lazada/Daraz business focuses on local markets in Southeast and South Asia and adopts the Local to Local model of local buying and selling. It therefore uses a same-city dual-active deployment architecture.
In same-city dual-active disaster recovery, as the name suggests, two IDCs in one city back each other up. The goal is to switch quickly to the other IDC when one fails, keeping the service available. A dual-unit deployment architecture is used, and unitization keeps traffic within each unit in a self-contained closed loop. The database uses the RDS three-node Enterprise Edition for high availability. Once a failure requiring disaster recovery is detected, ingress traffic, the unified access layer, and so on can be switched quickly to the other IDC.
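The ingress switch described above can be sketched as a small weight table that drains an IDC after repeated failed health probes. The class name, the 50/50 starting weights, and the threshold of three consecutive failures are assumptions made for illustration only.

```python
class DualActiveIngress:
    """Hypothetical ingress weights for two IDCs in the same city."""

    def __init__(self, idc_a, idc_b, failure_threshold=3):
        self.weights = {idc_a: 50, idc_b: 50}   # both IDCs serve traffic
        self.failures = {idc_a: 0, idc_b: 0}    # consecutive failed probes
        self.failure_threshold = failure_threshold

    def report_probe(self, idc, healthy):
        """Record one health probe result and drain the IDC if it keeps failing."""
        if healthy:
            self.failures[idc] = 0
            return
        self.failures[idc] += 1
        if self.failures[idc] >= self.failure_threshold:
            self._drain(idc)

    def _drain(self, failed):
        peer = next(name for name in self.weights if name != failed)
        if self.weights[peer] == 0:
            # The peer is already drained; keep the last serving IDC online.
            return
        self.weights[failed] = 0
        self.weights[peer] = 100


ingress = DualActiveIngress("idc-a", "idc-b")
for _ in range(3):
    ingress.report_probe("idc-a", healthy=False)
print(ingress.weights)  # {'idc-a': 0, 'idc-b': 100}
```

Requiring several consecutive failures before switching is one simple way to avoid flapping on a single lost probe; a production system would combine more signals.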
Multi-tenant architecture
The basic characteristics of global business are multiple regions, multiple currencies, and multiple languages. To implement business strategies at a fine granularity, business operating units can be determined along these dimensions. Data standardization requires a tenant concept oriented to business operating regions, and the technical architecture needs to provide a unified, multi-region tenant architecture standard. Business volume and business form differ across operating units, so the deployment architecture offers both physical and logical isolation of tenants, and the technical architecture needs to provide configuration isolation, data isolation, and traffic isolation. The tenant definition must follow a unified tenant model. Based on a unified tenant, a mapping can be established between operating units and the technical architecture, so that R&D activities such as development, testing, deployment, and O&M can be carried out per operating unit. This reduces coupling between operating units during R&D and improves their independence.
Based on the multi-tenant architecture design concept, the internal working principle of the running state is divided into the following core parts:
· [Traffic coloring] Identify the request on the client side, determine which tenant the traffic belongs to, and color the traffic accordingly
· [Precise addressing] Based on traffic coloring and the service routing capability of the access gateway layer, route precisely to the physical cluster where the tenant resides
· [Link pass-through] Within a single service instance in the cluster, tenant information must be passed through, including during synchronous and asynchronous interactions with upstream and downstream services
· [Resource isolation] During the execution of internal business logic, any operation on a resource (configuration, data, traffic, and so on) must take isolation into account
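The four runtime parts above can be sketched in a few lines using Python's standard `contextvars` module to carry the tenant through a request. Everything named here is hypothetical: the host names, tenant ids, cluster names, the `x-tenant-id` header, and the per-tenant configuration are illustrative assumptions, not the platform's actual model.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical lookup tables; real systems would load these from configuration.
TENANT_BY_HOST = {"sg.shop.example": "SG", "pk.shop.example": "PK"}
CLUSTER_BY_TENANT = {"SG": "cluster-sea-1", "PK": "cluster-sa-1"}
CONFIG_BY_TENANT = {"SG": {"currency": "SGD"}, "PK": {"currency": "PKR"}}

_current_tenant = contextvars.ContextVar("current_tenant")


def color_traffic(headers):
    """Traffic coloring: reuse an upstream tenant tag, else derive it from the host."""
    return headers.get("x-tenant-id") or TENANT_BY_HOST[headers["host"]]


def select_cluster(tenant):
    """Addressing: map the colored traffic to the tenant's physical cluster."""
    return CLUSTER_BY_TENANT[tenant]


@contextmanager
def tenant_scope(tenant):
    """Pass-through inside one service instance for the duration of a request."""
    token = _current_tenant.set(tenant)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def current_tenant():
    # Raises LookupError when no tenant is set, so code fails closed.
    return _current_tenant.get()


def outbound_headers():
    """Pass-through to downstream calls: propagate the tenant tag."""
    return {"x-tenant-id": current_tenant()}


def tenant_config(key):
    """Resource isolation: configuration is always read through the tenant."""
    return CONFIG_BY_TENANT[current_tenant()][key]


tenant = color_traffic({"host": "sg.shop.example"})
with tenant_scope(tenant):
    print(select_cluster(tenant), tenant_config("currency"), outbound_headers())
```

Reading the tenant without a default makes any code path that forgot to establish a tenant fail immediately rather than silently touching another tenant's resources.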
3.1.3 Global Disaster Recovery Solution
· Region-level and network unavailability: An entire data center is unavailable: the public network entrance cannot reach the physical data center, or the physical data centers cannot communicate with each other.
· Service-level unavailability: Public and internal network connectivity is normal, but the service is unavailable.
· Data-layer unavailability: The DB or cache is unavailable.
· Network disaster recovery: Apart from the user's first network hop (if the user's local access network fails, we have essentially no room to act), for hops 2 through N we can attempt disaster recovery by building network operator switching (switching between multiple CDN vendors), data center link switching (switching between Regions), and data center entrance operator switching (performed by the IDC network team).
· Access layer disaster recovery: After traffic reaches the Alibaba Cloud data center and enters the internal gateway routing layer, real-time traffic correction is applied along dimensions such as user granularity and API granularity, and takes effect within seconds. As long as the network and gateway products themselves are healthy, access-layer disaster recovery is the solution most often applied and drilled in day-to-day operations.
· Service layer disaster recovery: For strongly centralized services, such as inventory, marketing, and other services whose deductions happen in a single region, disaster recovery capabilities also need to be built.
· Data layer disaster recovery: For the multi-active architecture, while keeping a single master for each piece of data, ensure that no dirty data is written during disaster recovery. For compliance scenarios, take into account that some regions may not hold sensitive data, and provide compliant disaster recovery in limited scenarios.
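Access-layer traffic correction at user and API granularity can be pictured as an ordered rule list evaluated per request. The sketch below is an assumption-laden illustration: `CorrectionRule`, `resolve_region`, the first-match-wins ordering, and the sample paths are hypothetical and not taken from the gateway's actual rule model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CorrectionRule:
    """Hypothetical correction rule; a field left as None matches anything."""
    target_region: str
    user_id: Optional[str] = None
    api_prefix: Optional[str] = None

    def matches(self, user_id: str, api: str) -> bool:
        user_ok = self.user_id is None or self.user_id == user_id
        api_ok = self.api_prefix is None or api.startswith(self.api_prefix)
        return user_ok and api_ok


def resolve_region(rules: List[CorrectionRule], user_id: str, api: str,
                   home_region: str) -> str:
    """Return the region that should serve the request (first matching rule wins)."""
    for rule in rules:
        if rule.matches(user_id, api):
            return rule.target_region
    return home_region


# During a drill: move all trade APIs to "eu", and one test user entirely to "sg".
rules = [
    CorrectionRule(target_region="eu", api_prefix="/trade/"),
    CorrectionRule(target_region="sg", user_id="u42"),
]
print(resolve_region(rules, "u42", "/item/detail", home_region="us"))  # sg
```

Because the rules are plain data, pushing a new list to the gateways is all that is needed to shift traffic, which is what makes second-level switching plausible.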
3.2 Global data compliance in practice
3.2.1 Introduction to Global Compliance Field
For Internet e-commerce platforms, the overall risk and compliance field is very broad; the differences between the risk and compliance fields are roughly as shown in the figure above. Compliance generally involves data compliance, intellectual property infringement, product content security, interactive content security, technology export compliance, app compliance, and so on, which are also the current focus of regulators. Apart from data compliance, compliance issues are mainly concentrated in individual business scenarios. For example, intellectual property infringement and product content security mainly arise in the commodity domain, while interactive content security mainly arises in scenarios such as buyer-seller communication and live streaming.
The focus of compliance work is data compliance, which runs through almost every scenario of an e-commerce platform: anything that involves data processing can raise a data compliance issue. At the same time, because data compliance is so sensitive, for the platform it is a risk that can shut the business down.
3.2.2 Data compliance requirements and deployment architecture
According to the scope within which personal data is enclosed, cross-border business architectures generally fall into three types: the regionalized architecture scheme, the privacy data enclosure scheme, and the personal data enclosure scheme. This business adopts the independent, self-contained unit scheme.
3.2.3 Local Storage Solutions
At the data compliance level, there is often a direct regulatory requirement that data be stored locally (either the data must not leave the country, or a local copy must be retained for inspection). Some sensitive businesses face even stricter requirements: public cloud may not be permitted at all, or public cloud resources may be required to be highly secure and independent. To this end, we need the ability to build a complete infrastructure ourselves to meet the needs of building compliant sites.
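The two local-storage requirements mentioned above (data must not leave, or a local copy must be kept) can be expressed as a small policy lookup that decides where a record may be written. This is a sketch under assumptions: the `Residency` enum, the placeholder country codes "AA" and "BB", and the store names are hypothetical and do not describe any real country's rules.

```python
from enum import Enum


class Residency(Enum):
    NONE = "none"              # no local-storage requirement
    LOCAL_COPY = "local_copy"  # a copy must be retained in-country
    NO_EXIT = "no_exit"        # data must not leave the country


# Placeholder country codes; actual requirements come from legal review.
POLICY_BY_COUNTRY = {"AA": Residency.NO_EXIT, "BB": Residency.LOCAL_COPY}


def storage_targets(country, regional_store="regional-store"):
    """Return the stores a user's record may be written to, in write order."""
    policy = POLICY_BY_COUNTRY.get(country, Residency.NONE)
    local_store = f"local-{country.lower()}"
    if policy is Residency.NO_EXIT:
        return [local_store]
    if policy is Residency.LOCAL_COPY:
        return [local_store, regional_store]
    return [regional_store]


for code in ("AA", "BB", "CC"):
    print(code, storage_targets(code))
```

Keeping the policy in data rather than in code paths makes it easier to audit and to update when regulations change.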
3.3 Application Architecture Cloud Native
Cloud-native technologies enable organizations to build and run elastically scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Representative cloud-native technologies include containers, service meshes, microservices, immutable infrastructure, and declarative APIs. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes to the system frequently and predictably.
For global technology R&D, running the business on the cloud is not enough. We also need to start from the challenges and pain points of our own business R&D, and combine cloud-native technology and the related architectural concepts to solve the efficiency problems of business R&D and O&M themselves.
3.3.1 Challenges faced by traditional application architecture
Figure: Traditional application architecture pattern
The above diagram describes the software delivery process under the traditional application architecture. Throughout this process, the application is the object in the development state, the carrier in the delivery state, and the container in the running state. All capabilities the software depends on are declaratively referenced in the application source code, and the software is delivered as a whole through a unified build. This process can be called rich-application (Fat Application) delivery.
Because the global platform originally evolved from actual business systems (transaction, marketing, payment, ...), it is no exception and inherited the traditional application architecture model. However, as the platform itself evolved and the businesses on it diversified, both the organizational structure and the business architecture came under great pressure, mainly in the form of the following three challenges:
Unsustainable application architecture: Under the rich-application delivery model, the software production process always has a single point: the application. As the content an application carries grows larger and more complex, the application becomes the key factor limiting R&D efficiency and the biggest threat to the sustainability of the entire international platform architecture.
Uncertainty in R&D delivery: In the layered R&D model, the global platform and the businesses differ in purpose and pace of change. Absorbing that difference makes the application gradually bloated and eroded, which brings great uncertainty and unpredictability to daily R&D iteration.
Lack of standards for O&M capabilities: As application complexity grows, so do the O&M capabilities needed to match it. The DevOps concept advocated today has produced many related products and tools, but their standards are not unified, so they are scattered and complicated and lack a unified entry point, which keeps driving up O&M effort and the cost of understanding.
In response to the above challenges, cloud native technology provides us with new solutions to the problem:
Container orchestration technology: With cloud-native container orchestration, the traditional software delivery process evolves into the combined delivery of orchestrated containers. Single-application delivery is split into multiple modules that are orchestrated and delivered flexibly, driving the evolution of the global application delivery system.
Image-based deliverables: The application is no longer the only object of R&D; instead, an image-based R&D system is built. The immutability of images guarantees the determinism of what is delivered, and platform capabilities are themselves packaged as images with an independent, stable R&D system.
Unified O&M standards: With cloud-native concepts such as IaC, OAM, and GitOps, a unified model converges and defines cloud-native application O&M standards. The SRE role of the business organization is redefined, with a unified view for querying, analyzing, and measuring the state of application O&M capabilities and resource usage.
3.3.2 Global Cloud Native Architecture Practice
3.3.2.1 Application Architecture Based on Cloud Native
Combined with the cloud-native problem-solving ideas mentioned above, we abstract the overall globalized R&D delivery process to support a broader globalized application architecture upgrade. In this process, we have also fully combined the advanced technologies in cloud native and applied them to globalized scenarios:
· IaC: Provides a unified declaration paradigm for R&D infrastructure. To better decouple the platform from business dependencies and reduce the cognitive cost of the platform, we defined layered abstraction standards for the IaC of site applications and defined infrastructure standards around globalization scenarios. Specifications, log collection, probes, hooks, and release policies are unified and converged to reduce the cost for businesses of adopting IaC.
· OAM: Provides a unified application model definition. Relying on OAM's separation of concerns between development and O&M, its platform independence and high extensibility, and its modular application deployment and O&M, we defined application-oriented standards for business and platform, so that application developers, operators, and application infrastructure are better connected and cloud-native application delivery and management processes are more consistent.
· GitOps: Provides continuous delivery capabilities for business R&D. Based on the declarative concept of cloud-native GitOps, externally dependent components, from their capabilities to their O&M controls, can be integrated and declared in a single project. A team then only needs to declare and define the capabilities it depends on according to the unified GitOps standard, and the delivery and control of those component capabilities is handed over to the underlying GitOps engine, improving the integrity and sustainability of the whole software system.
· ACK: Provides a unified resource scheduling engine. Based on Alibaba Cloud's ACK container service, we use its container orchestration, resource scheduling, and automated O&M capabilities to deliver different business module functions to different environments and, together with upper-layer traffic scheduling, achieve on-demand deployment and on-demand scheduling of business.
· Container orchestration: The global application architecture was upgraded again through ACK's flexible container orchestration. In the R&D state, business logic is completely isolated from infrastructure, platform capabilities, and shared rich clients; in the running state, business processes and O&M processes are relatively completely isolated through lightweight containers, which improves overall R&D delivery efficiency and the stability of the business.
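The layered IaC abstraction mentioned in the list above can be illustrated as a merge of declarations, where platform defaults are overridden by site-level and then application-level settings. The layer contents and key names below are invented for illustration and are not the actual IaC schema.

```python
import copy


def merge_layers(*layers):
    """Merge declaration layers left to right; later layers override earlier ones."""
    result = {}
    for layer in layers:
        _deep_merge(result, layer)
    return result


def _deep_merge(base, override):
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_merge(base[key], value)
        else:
            # Copy so that merging never mutates the input layers.
            base[key] = copy.deepcopy(value)


# Hypothetical layers: platform standard, site customization, application needs.
PLATFORM_DEFAULTS = {
    "spec": {"cpu": 2, "memory_gb": 4},
    "logs": {"paths": ["/var/log/app/*.log"]},
    "probes": {"readiness": "/health/ready", "liveness": "/health/live"},
    "release": {"strategy": "rolling", "batches": 3},
}
SITE_LAYER = {"release": {"batches": 5}}
APP_LAYER = {"spec": {"cpu": 4}}

declaration = merge_layers(PLATFORM_DEFAULTS, SITE_LAYER, APP_LAYER)
print(declaration["spec"], declaration["release"])
```

With this shape, an application only states what differs from the standard, which is one way to keep the cost of adopting IaC low.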
Figure: Application architecture integration in the container state
The emphasis here is on the practice of container orchestration. In the process of upgrading the global application architecture, three types of containers emerged, as shown in the figure above:
· Infrastructure container (Base Container): includes the infrastructure capabilities that applications depend on, such as the O&M container and the gateway container;
· Temporary container (Temporary Container): has no life cycle of its own. Its job is to deliver its own build artifacts to the main application container and business containers through a shared directory under the Pod, completing the integration and use of the capability. Platform containers are mainly of this type;
· Business container (Business Container): like the main application container, it has a complete life cycle and communicates with the main application over gRPC. Rich-client containers such as category and multi-language are mainly of this type.
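One way to picture the three container types is as a Kubernetes Pod manifest, built here as a plain Python dictionary. Mapping the temporary container onto a Kubernetes init container with an `emptyDir` volume is an assumption made for this sketch, as are all container names, image names, paths, and the gRPC port; the source does not state how the platform actually wires them.

```python
import json

SHARED_DIR = "/shared"  # hypothetical mount path for the Pod-level shared directory


def build_pod(app_image, platform_image, business_image, ops_image):
    shared_mount = {"name": "shared", "mountPath": SHARED_DIR}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "site-app"},
        "spec": {
            "volumes": [{"name": "shared", "emptyDir": {}}],
            # Temporary container: copies platform artifacts, then exits.
            "initContainers": [{
                "name": "platform-artifacts",
                "image": platform_image,
                "command": ["sh", "-c", f"cp -r /artifacts/. {SHARED_DIR}/"],
                "volumeMounts": [shared_mount],
            }],
            "containers": [
                # Main application container: loads artifacts from the shared dir.
                {"name": "main-app", "image": app_image,
                 "volumeMounts": [shared_mount]},
                # Business container: full life cycle, reached by the main app over gRPC.
                {"name": "business-category", "image": business_image,
                 "ports": [{"containerPort": 50051}],
                 "volumeMounts": [shared_mount]},
                # Infrastructure (base) container: operations tooling.
                {"name": "ops-agent", "image": ops_image},
            ],
        },
    }


pod = build_pod("app:1.0", "platform:1.0", "category:1.0", "ops:1.0")
print(json.dumps(pod, indent=2))
```

Because each part ships as its own image, the platform, business, and operations teams can release on different cadences while still landing in the same Pod.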
3.3.2.2 Cloud-native O&M system
The operation and maintenance system in the global application architecture
Along with the application architecture upgrade, we have also upgraded the application O&M system. With the cloud-native architecture and declarative references based on the IaC standard, the use of the various application O&M capabilities is unified, and efficiency is improved through the powerful capabilities of the infrastructure, including but not limited to:
· Application release: intelligent release decisions, in-place upgrades, rolling upgrades, batch releases
· Elastic capacity: automatic scaling, scheduled scaling, CPUShare
· Batch O&M: in-place restart of application containers, container replacement, log cleanup, JavaDump
· Lightweight containers: independent O&M containers, sidecar orchestration
· Multi-container delivery and deployment: port conflicts, process conflicts, shared file directories
· Observability and stability: application life cycle, startup exception diagnosis, GUI-based (white-screen) operations, container-level monitoring
Figure: Cloud resources BaaS
In this O&M system, we also introduced cloud-native BaaS capabilities. BaaS provides a complete desired-state-oriented solution with separation of concerns (Application, Infrastructure). It connects the production, metering and billing, identity authorization, and consumption processes of Alibaba Cloud and middleware resources, and uses IaC as the entry point to provide an end-to-end experience. With BaaS, SREs can measure and manage applications' cloud resource usage in a unified way, while R&D engineers can use all kinds of resources consistently and declaratively, which greatly reduces the cost of use.
Figure: Java process life cycle normalization
To improve applications' self-healing ability in the cloud-native environment, we also unified the life cycle specification of Java applications in K8s Pods. The stages of application startup, running (liveness and readiness), and service shutdown are standardized and exposed to businesses through the IaC model and an SDK, so that the Java application life cycle is bound consistently to the container life cycle.
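The binding between application life cycle and container life cycle can be sketched as a small state machine whose state drives the liveness and readiness answers given to Kubernetes probes. It is written in Python for brevity although the applications in question are Java; the phase names and allowed transitions are assumptions for illustration, not the actual specification or SDK.

```python
from enum import Enum


class Phase(Enum):
    STARTING = "starting"   # process is up, still initializing
    READY = "ready"         # serving traffic
    DRAINING = "draining"   # shutdown requested, finishing in-flight work
    STOPPED = "stopped"


# Hypothetical allowed transitions between phases.
ALLOWED = {
    Phase.STARTING: {Phase.READY, Phase.STOPPED},
    Phase.READY: {Phase.DRAINING},
    Phase.DRAINING: {Phase.STOPPED},
    Phase.STOPPED: set(),
}


class AppLifecycle:
    def __init__(self):
        self.phase = Phase.STARTING

    def transition(self, target):
        if target not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase.value} -> {target.value}")
        self.phase = target

    def liveness(self):
        """Liveness probe: the container should be restarted only once stopped."""
        return self.phase is not Phase.STOPPED

    def readiness(self):
        """Readiness probe: receive traffic only while fully started and not draining."""
        return self.phase is Phase.READY


app = AppLifecycle()
print(app.liveness(), app.readiness())  # True False
app.transition(Phase.READY)
print(app.liveness(), app.readiness())  # True True
```

Separating "alive" from "ready" lets the orchestrator stop sending traffic during startup and draining without restarting a process that is behaving correctly.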
4. Summary and Outlook
Technology serves business, and global technology is rooted in global business. Toward the mission of "making it easy to do business anywhere", we still have a lot to do. Likewise, despite years of construction, the global technology system still has many imperfections and many technical challenges to overcome, and we are still on the way.