
System availability describes the degree to which an application system can be effectively accessed.

Basic Strategy

We all know that hardware failure is the norm, and the main purpose of the high-availability architecture design of the website is to ensure that services are still available and data is still stored and accessible when the server hardware fails.

The main means of achieving such a high-availability architecture are redundant backup and failover of data and services. Once some servers go down, services are switched to other available servers; if a disk is damaged, data is read from a backup disk.

A typical website design usually follows the basic layered architecture model of application layer, service layer, and data layer. Each layer is relatively independent. The application layer is mainly responsible for specific business logic processing; the service layer is responsible for providing reusable services; the data layer is responsible for data storage and access.

At the application layer, a group of servers can be formed into a cluster that provides external services through a load balancing device. When the load balancing device detects, via heartbeat checks or similar means, that a server is unavailable, it removes that server from the cluster list and distributes requests to the other available servers, so the cluster as a whole remains available and the application achieves high availability.

The service layer is similar to the application layer: high availability is also achieved through clustering, except that these servers are accessed by the application layer through a distributed service invocation framework. The framework implements software load balancing in the application layer's client program and performs heartbeat checks on service providers through a service registry; when a service is found to be unavailable, the registry immediately notifies the client program to update its service access list and remove the unavailable server.
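The registry-driven flow above can be sketched as a toy model (the lease length, method names, and server names here are illustrative, not from any real framework):

```python
class ServiceRegistry:
    """Toy service registry: providers renew a lease via heartbeats, and the
    client-side access list excludes any provider whose lease has expired."""
    def __init__(self, lease_seconds=30):
        self.lease = lease_seconds
        self._last_beat = {}          # provider -> timestamp of last heartbeat

    def heartbeat(self, provider, now):
        self._last_beat[provider] = now

    def available(self, now):
        # What the registry would push to client programs as the access list.
        return sorted(p for p, t in self._last_beat.items() if now - t < self.lease)

reg = ServiceRegistry(lease_seconds=30)
reg.heartbeat("svc-1", now=0)
reg.heartbeat("svc-2", now=0)
reg.heartbeat("svc-1", now=25)        # svc-2 stops sending heartbeats
print(reg.available(now=40))          # ['svc-1'] - svc-2 dropped from the list
```

A real registry (e.g. ZooKeeper- or Nacos-based) would push the updated list to clients instead of having them poll, but the lease-expiry idea is the same.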

Servers at the data layer are a special case. To ensure that data is not lost when a server goes down and that data access is not interrupted, data must be synchronously replicated on write, storing it on multiple servers for redundant backup. When a data server goes down, the application switches access to a server holding the backup data.

High-availability application layer

The application layer mainly deals with the business logic of the website application, so it is sometimes called the business logic layer. User requests are handled by this layer, and in most cases the requests are stateless.

Load balancing

Load balancing, as the name implies, is mainly used when business volume and data volume are high. When a single server cannot bear the full load, load balancing distributes traffic and data across a cluster of multiple servers to improve overall load-handling capacity.

While the servers in a cluster are available, the load balancing server distributes user requests to any of them for processing. When a server goes down and the load balancing server's heartbeat checks find it unresponsive, that server is removed from the list and its requests are sent to other servers; since handling a request on any server yields the same result, the cluster keeps serving correctly.
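A minimal sketch of this mechanism, with the heartbeat simulated by a health-check callback (the `LoadBalancer` class and server names are illustrative, not from any specific product):

```python
class LoadBalancer:
    """Round-robin balancer that skips servers whose heartbeat check fails."""
    def __init__(self, servers, health_check):
        self.servers = list(servers)
        self.health_check = health_check   # stands in for heartbeat detection
        self._cycle = 0

    def pick(self):
        # Rebuild the live list each time: a dead server drops out of rotation.
        alive = [s for s in self.servers if self.health_check(s)]
        if not alive:
            raise RuntimeError("no available servers")
        server = alive[self._cycle % len(alive)]
        self._cycle += 1
        return server

# Server "b" is down, so requests only ever reach "a" and "c".
down = {"b"}
lb = LoadBalancer(["a", "b", "c"], health_check=lambda s: s not in down)
picks = [lb.pick() for _ in range(4)]
print(picks)   # ['a', 'c', 'a', 'c']
```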

Session cluster management

In practice, however, business is often stateful. For example, a transactional e-commerce website needs a shopping cart to record the user's purchases. In a cluster that uses load balancing, since the load balancing server may distribute requests to any application server, ensuring that each request still obtains the correct session is much more complicated than on a single machine.

In a cluster environment, Session management mainly has the following methods.

Session replication

Session replication is a server cluster Session management mechanism frequently used in early application systems. Its principle is to synchronize Session objects between the servers in the cluster, so every server holds all users' Session information; when a server needs a Session, it simply reads it from the local machine.

Although this solution is simple and reading Session information locally is fast, it works only when the cluster is relatively small. When the cluster is large, Session replication requires a great deal of communication between servers, consuming substantial server and network resources until the system is overwhelmed.

Session binding

Session binding can be implemented using the load balancer's source-address hashing: the load balancing server always distributes requests from the same IP to the same server. In this way, throughout the session, all of a user's requests are processed on the same server, ensuring the Session can always be found there.
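Source-address hashing can be sketched in a few lines (the function name and IP are illustrative; real load balancers such as Nginx's `ip_hash` do the equivalent internally):

```python
import hashlib

def route_by_source_ip(client_ip, servers):
    """Source-address hash: the same client IP always maps to the same server,
    so that server always holds the client's Session."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["app-1", "app-2", "app-3"]
first = route_by_source_ip("203.0.113.7", servers)
# Repeated requests from the same IP are pinned to one server.
assert all(route_by_source_ip("203.0.113.7", servers) == first for _ in range(10))
```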

But the session binding scheme obviously does not meet our requirements for high availability: if the server goes down, the Sessions on it no longer exist, and when user requests are switched to another machine, business processing cannot be completed because the Session is missing.

Session server

The current mainstream solution is to use a Session server. An independently deployed Session server (or cluster) manages Sessions centrally, and the application server accesses it every time it reads or writes a Session.

This solution effectively splits the application server's state out, yielding stateless application servers and stateful Session servers, so that each can be architected for its own characteristics.

For the stateful Session server, a relatively simple approach is to use a distributed cache such as Redis and wrap it to meet the Session storage and access requirements.
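A sketch of such a wrapper, with a plain dict standing in for Redis so the example is self-contained (in production, the same save/load calls would go to a Redis cluster with key TTLs; the class and method names are made up for illustration):

```python
import json

class SessionStore:
    """Session-server sketch: centralized session storage with expiry.
    A dict stands in here for Redis."""
    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self._data = {}   # session_id -> (expires_at, serialized session)

    def save(self, session_id, session, now):
        self._data[session_id] = (now + self.ttl, json.dumps(session))

    def load(self, session_id, now):
        entry = self._data.get(session_id)
        if entry is None or now >= entry[0]:
            return None            # missing or expired
        return json.loads(entry[1])

store = SessionStore(ttl_seconds=1800)
store.save("sid-42", {"cart": ["book"]}, now=0)
# Any application server in the cluster can now read the same session.
print(store.load("sid-42", now=100))     # {'cart': ['book']}
print(store.load("sid-42", now=2000))    # None - expired
```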

High-availability service layer

The service module provides basic public services for business products. These services are usually deployed independently and called remotely by specific applications. Services are stateless, so a load-balancing-style failover strategy can be used to achieve high availability. In addition, the following high-availability strategies are used in practice.

Hierarchical management

In operation and maintenance, servers are managed hierarchically: core applications and services are given better hardware first and a higher priority in operations response. For example, letting users pay for purchases promptly matters more than whether they can review products, so the order and payment services have higher priority than the review service.

At the same time, services are isolated as necessary at deployment to avoid cascading failures: low-priority services are isolated by deploying them in different containers or virtual machines, high-priority services are deployed on different physical machines, and core services and data may even be deployed in data centers in different regions.

Timeout setting

Due to server downtime, thread deadlock, and similar problems, an application's call to a service may get no response, leaving user requests unanswered for a long time. This also ties up application resources and hinders transferring requests promptly to a healthy server.

A service call timeout can be set in the application. Once it expires, the communication framework throws an exception, and the application, following its service scheduling policy, can choose to retry or to transfer the request to another server providing the same service.
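The timeout-then-failover policy can be sketched as follows (`invoke` stands in for whatever transport the invocation framework uses; the function and server names are illustrative):

```python
def call_with_failover(servers, invoke, timeout=0.5):
    """Try each server in turn; on timeout or connection error, fail over to
    the next provider of the same service."""
    last_err = None
    for server in servers:
        try:
            return invoke(server, timeout)
        except (TimeoutError, ConnectionError) as exc:
            last_err = exc           # per the scheduling policy: move on
    raise RuntimeError("all providers failed") from last_err

def fake_invoke(server, timeout):
    # Simulated transport: svc-a never answers within the timeout.
    if server == "svc-a":
        raise TimeoutError("svc-a did not respond in time")
    return f"handled by {server}"

print(call_with_failover(["svc-a", "svc-b"], fake_invoke))  # handled by svc-b
```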

Asynchronous call

The application can call services asynchronously, for example through a message queue, so that the failure of one service does not fail the whole request. Suppose a new user registration is submitted and the application must call three services: write the user information to the database, send an account registration success email, and activate the corresponding permissions. With synchronous calls, if the mail queue is blocked and the email cannot be sent, the other two services will not execute and the registration will fail.

With an asynchronous call, the application sends the registration information to the message queue server and immediately returns a success response. Recording the registration in the database, sending the success email, and activating permissions each run as message consumer tasks that fetch the registration information from the queue and execute asynchronously. Even if the mail queue is blocked and the email cannot be sent, the other services are unaffected; registration completes smoothly, and the confirmation email simply arrives later.
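An in-process sketch of this fan-out, using the standard library's queues and threads in place of a real message broker (the topic names "db", "mail", and "perm" are made up for illustration):

```python
import queue
import threading

# One queue per downstream task: DB write, welcome mail, permission activation.
topics = {name: queue.Queue() for name in ("db", "mail", "perm")}
done = []

def worker(name, q):
    # Each consumer task processes the registration message independently,
    # so a slow mail service cannot block the database write.
    user = q.get()
    done.append(f"{name} handled {user}")

threads = [threading.Thread(target=worker, args=(n, q)) for n, q in topics.items()]
for t in threads:
    t.start()

# The web request only publishes the message and returns success immediately.
for q in topics.values():
    q.put("alice")
for t in threads:
    t.join()
print(sorted(done))
```

With a real broker (RabbitMQ, Kafka, etc.) the queues would be durable, so messages survive even if a consumer is down when they are published.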

Of course, not all service calls can be made asynchronously. For calls such as fetching user information, asynchrony would only lengthen the response time, which is not worth it; and for operations that must confirm the call succeeded before proceeding, asynchronous calls are unsuitable.

Service downgrade

During peak traffic, heavy concurrent calls may degrade service performance and, in severe cases, bring services down. To keep core applications and functions running, services need to be downgraded. Downgrading takes two forms: denial of service and shutting functions down.

Denial of service: reject calls from low-priority applications to reduce concurrent service calls and protect core applications, or randomly reject a portion of requests to save resources so that the remaining requests can succeed.

Shutdown of functions: close unimportant services, or unimportant functions within a service, to save system overhead and free resources for important services and functions. For example, during the busiest hours of the annual "Double Eleven" promotion, Taobao shuts down non-core services such as "product reviews" and "confirm receipt" to ensure that core transaction services complete smoothly.
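A downgrade switch can be as simple as a flag checked per request (the `handle` function, the core-application set, and the status codes are illustrative assumptions):

```python
def handle(request, degraded=False, core_apps=frozenset({"order", "payment"})):
    """Degradation sketch: when the switch is on, only core applications get
    through; calls from non-core applications are refused to protect capacity."""
    if degraded and request["app"] not in core_apps:
        return {"status": 503, "reason": "service degraded"}
    return {"status": 200, "body": f"ok for {request['app']}"}

print(handle({"app": "payment"}, degraded=True))   # served even under degradation
print(handle({"app": "review"},  degraded=True))   # refused until the peak passes
```

In a real system the `degraded` flag would come from a configuration center so operators can flip it cluster-wide without redeploying.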

Idempotent design

After an application's service call fails, it resends the request to another server, but the "failure" may not be real. For example, the service may have processed the request successfully while the response was lost to a network fault. The application then resubmits the request and the service is invoked twice, with serious consequences if the service is, say, a funds transfer.

Repeated service calls are unavoidable, and the application layer need not determine whether a call really failed: as long as no success response arrives, it may treat the call as failed and retry. The service layer must therefore guarantee that repeating a call produces the same result as calling it once, i.e. the service must be idempotent.
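One common way to make the transfer example idempotent is to attach a unique request ID and record processed IDs, so a retry returns the original result instead of debiting again (the class, field names, and balances are illustrative):

```python
class TransferService:
    """Idempotency sketch: a processed-ID record turns duplicate calls into
    replays of the first call's result."""
    def __init__(self):
        self.balances = {"alice": 100, "bob": 0}
        self._processed = {}   # request_id -> result of the first execution

    def transfer(self, request_id, src, dst, amount):
        if request_id in self._processed:
            # Duplicate call (e.g. a retry after a lost response): no second debit.
            return self._processed[request_id]
        self.balances[src] -= amount
        self.balances[dst] += amount
        result = {"ok": True, "request_id": request_id}
        self._processed[request_id] = result
        return result

svc = TransferService()
svc.transfer("req-1", "alice", "bob", 30)
svc.transfer("req-1", "alice", "bob", 30)   # retried after a lost response
print(svc.balances)                          # {'alice': 70, 'bob': 30} - debited once
```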

High-availability data layer

Unlike highly available applications and services, data layer servers each store different data, so when one goes down, data access requests cannot simply be switched to any other machine in the cluster.

The means of ensuring highly available data storage are mainly data backup and failover mechanisms. Data backup ensures there are multiple copies of the data, so the failure of any one copy does not cause permanent data loss; the failover mechanism ensures that when one copy becomes inaccessible, the system quickly switches to another copy and remains available.

The cache service, the "patron saint" of the data layer, also needs a measure of high availability. For a single machine going down in a cache server cluster: if the cluster is large, the fraction of cache data lost and the resulting change in database load are small, so the impact on the overall system is also small.

A simple way to enlarge the cache server cluster is to have all applications share one distributed cache cluster, with each application merely requesting cache resources from it. Each application's cache is then spread over multiple servers by logical or physical partitioning, so the cache invalidation caused by any one server going down affects only a small fraction of the application's cached data and does not unduly impact application performance or database load.

CAP principle

Before discussing high-availability data service architecture, one topic must be addressed: to ensure high data availability, systems usually sacrifice another important property, data consistency.

The high-availability data layer has the following meanings:

  • Data Persistence: Ensure that the data can be stored persistently, and there will be no data loss under various circumstances.
  • Data availability: Even if there is a damaged copy, it is necessary to ensure that the data is accessible.
  • Data consistency: The content of multiple copies of data needs to remain the same.

The CAP principle holds that a storage system providing data services cannot simultaneously satisfy the three conditions of data consistency (Consistency), data availability (Availability), and partition tolerance (Partition Tolerance: the system can scale across network partitions).


In large-scale applications, data always grows rapidly, so scalability, i.e. partition tolerance, is essential. As scale grows, machines multiply and network and server failures become frequent; to keep the application usable, data availability must be guaranteed. Large-scale websites therefore usually strengthen the availability (A) and scalability (P) of the distributed storage system while giving up consistency (C) to some extent.

Generally speaking, data inconsistency arises under highly concurrent writes or when the cluster is unstable (failure recovery, cluster expansion, and so on). The application system must understand that the distributed data processing system may serve inconsistent data, and compensate for or correct it to avoid incorrect application data.

The CAP principle is of great significance to the design of scalable distributed systems. Since strong data consistency is hard to satisfy, systems usually weigh cost, technology, business scenarios, and other conditions, and combine the storage system with monitoring and error-correction functions in the application and other services, to keep data consistent from the user's point of view and ensure that end users access correct data.

Data backup

Data backup is an old and effective data protection method, and early backups were mainly cold backups. The advantages of cold backup are simplicity and low cost, in both money and technical difficulty. Its disadvantage is that it cannot guarantee the data is up to date: because data is copied periodically, the backup is older than the live data, and if the system's data is lost, every update since the last backup point is lost permanently. Nor can it guarantee availability: restoring from cold-backup storage takes a long time, during which the data cannot be accessed and the system is unavailable.

Therefore, while cold backup, as a traditional data protection method, is still used in daily website operations, the website's real-time online business also needs hot backup to provide better data availability. Hot backup comes in two modes: asynchronous and synchronous.

Asynchronous mode means the writes to the multiple copies complete asynchronously. Normally the application connects only to the primary storage server. On a write, the primary's write-agent module writes the data to local storage, immediately returns a success response, and then synchronizes the write to the secondary storage server on an asynchronous thread.

Synchronous mode means the writes to the multiple copies complete synchronously: by the time the application receives a write-success response from the data service, every copy has been written. When the application receives a write-failure response, however, some or all copies may nonetheless have been written successfully (a network or system fault merely prevented the success response from returning).
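The two modes can be contrasted in a small sketch, with plain dicts standing in for storage servers and an explicit `flush()` standing in for the background replication thread (all names here are illustrative):

```python
class PrimaryStore:
    """Hot-backup sketch: synchronous mode acknowledges only after every replica
    has the write; asynchronous mode acknowledges after the local write and
    replicates afterwards."""
    def __init__(self, replicas, synchronous):
        self.local = {}               # the primary's own storage
        self.replicas = replicas      # secondary servers (dicts)
        self.synchronous = synchronous
        self.pending = []             # async mode: writes awaiting replication

    def write(self, key, value):
        self.local[key] = value
        if self.synchronous:
            # Synchronous: every copy is written before the success response.
            for r in self.replicas:
                r[key] = value
        else:
            # Asynchronous: respond now, replicate later (see flush()).
            self.pending.append((key, value))
        return "ok"

    def flush(self):
        # What the asynchronous replication thread would do in the background.
        for key, value in self.pending:
            for r in self.replicas:
                r[key] = value
        self.pending.clear()

sync_replica, async_replica = {}, {}
PrimaryStore([sync_replica], synchronous=True).write("k", 1)
store = PrimaryStore([async_replica], synchronous=False)
store.write("k", 1)      # success returned while async_replica is still empty
store.flush()            # now the copy catches up
```

The trade-off is visible in the code: synchronous mode narrows the window for data loss at the cost of write latency, while asynchronous mode returns quickly but can lose the `pending` writes if the primary dies before flushing.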

Further reading: [Distributed-Key Points] Data Replication

Failover

If any server in the data server cluster goes down, all of the application's reads and writes against that server must be rerouted to other servers so data access does not fail. This process is called failover, and it consists of three parts: failure confirmation, access transfer, and data recovery.

Judging that a server is down is the first step of failover. The system confirms downtime in two ways: heartbeat detection and failure reports from applications. For an application's failure report, the control center still sends another heartbeat check to confirm, avoiding a mistaken verdict that the server is down.

After confirming that a data storage server is down, data reads and writes must be rerouted to other servers. For fully peer storage servers (several servers storing exactly the same data), when one goes down the application simply switches to a peer according to its configuration. If the storage is not peer-replicated, the route must be recomputed to select a storage server.
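Route recomputation for partitioned storage can be sketched as a hash that walks past any server confirmed down (the `route` function and server names are made up; real systems typically use consistent hashing to limit data movement):

```python
import zlib

def route(key, servers, down=frozenset()):
    """Hash the key to a preferred server, then walk the list past any
    server confirmed down."""
    start = zlib.crc32(key.encode()) % len(servers)
    for i in range(len(servers)):
        candidate = servers[(start + i) % len(servers)]
        if candidate not in down:
            return candidate
    raise RuntimeError("no data servers available")

servers = ["d1", "d2", "d3"]
primary = route("user:42", servers)
failover = route("user:42", servers, down={primary})
assert failover != primary   # the same key now resolves to a healthy server
```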

Because a server is down, the number of stored copies of its data drops and must be restored to the system's configured value; otherwise, if another server goes down, data may become inaccessible (all servers holding its copies down) or even be lost permanently. The system therefore copies data from healthy servers to restore the number of copies to the set value.


Author: 与昊