
Scalable architecture

People often confuse a system's scalability with its extensibility. Let's clarify these two concepts first.

Scalability: the system can increase (or reduce) its capacity to compute and process transactions by increasing (or decreasing) the scale of its own resources. If this increase or decrease is proportional, the system is said to be linearly scalable. In website architecture, scalability usually means using clusters: adding servers to improve the overall transaction throughput of the system.

Extensibility: the ability of the system to keep expanding or improving with minimal impact on what already exists. It manifests as a stable infrastructure that does not require frequent changes, little dependence and coupling between applications, and the ability to respond quickly to changing requirements. It is the open-closed principle applied at the level of system architecture design: when new functions are added, the structure and code of the existing system need not be modified.

Basic strategy

The basic strategy for designing an extensible architecture is modularization; on this basis, reduce the coupling between modules and improve their reusability.

Layering and segmentation are important means of modular design: they divide the software into several low-coupling, independent component modules, which are then aggregated into a complete system by means of message passing and dependency calls.

In large-scale applications, these modules are deployed in a distributed manner: independent modules run on independent servers (clusters), which physically separates the coupling between modules, further reducing coupling and improving reusability. For distributed deployment of modules, the main aggregation methods are distributed message queues and distributed services.

Distributed message queue

If there are no direct calls between modules, adding or modifying a module has minimal impact on the others, and the extensibility of the system is better.

Event-driven architecture

Event-driven architecture (EDA) maintains loose coupling by transmitting event messages between low-coupling modules, and modules cooperate by communicating through these event messages. The typical EDA example is the producer-consumer model common in operating systems. There are many concrete implementations; the most commonly used is the distributed message queue.

The message queue works on the publish-subscribe model: the message sender publishes a message, and one or more message receivers subscribe to it. The message sender is the message source; after processing its data, it sends the message to the distributed message queue, and the message receiver obtains the message from the queue and continues processing it.

After the message receiver filters, processes, and repackages the message, it constructs a new message type and sends it on, where other message receivers that subscribe to it continue processing. An event (message object) can therefore drive a whole chain of business processing.

Because the message sender does not need to wait for the message receiver to process the data before returning, the system has better response latency. At the same time, during peaks in website traffic, messages can be temporarily stored in the message queue, and message receivers can control the rate at which they process messages according to their own load capacity, reducing the pressure on back-end storage such as databases.
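
As a minimal illustration of this asynchronous decoupling, the sketch below uses a single in-process BlockingQueue in Java: the producer returns as soon as the message is enqueued, and the consumer drains the queue at its own pace. A real distributed message queue additionally provides remote access, persistence, and clustering; the class name and queue size here are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// In-process sketch of the producer-consumer model behind event-driven architecture.
public class ProducerConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Consumer: processes messages at its own pace, decoupled from the producer.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take();   // blocks until a message arrives
                    System.out.println("consumed: " + message);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: enqueues the message and returns immediately,
        // without waiting for the consumer to finish processing it.
        for (int i = 0; i < 5; i++) {
            queue.put("event-" + i);
            System.out.println("produced: event-" + i);
        }
        Thread.sleep(500);                           // give the consumer time to drain
    }
}
```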

Principle

A queue is a first-in, first-out (FIFO) data structure, and a distributed message queue can be seen as this data structure deployed on an independent server. Applications access the distributed message queue through a remote access interface to read and write messages, thereby implementing distributed asynchronous calls. The basic principle is shown in the figure.

[Figure: basic principle of a distributed message queue]

The message producer application pushes a message to the message queue server through the remote access interface. After writing the message to its local memory queue, the message queue server immediately returns a success response to the producer. It then finds the message consumer applications that subscribe to the message according to the subscription list, and pushes the message to the consumer programs through the remote communication interface in first-in, first-out (FIFO) order.

At present, there are many open-source and commercial distributed message queue products; the better-known ones include RabbitMQ and Kafka.

In terms of scalability, since data on the message queue server is processed and removed almost immediately, the server is close to stateless, and scaling it is relatively simple: add the new server to the distributed message queue cluster and notify the producer servers of the changed list of message queue servers.

In terms of availability, to avoid problems caused by slow consumers or insufficient memory on the distributed message queue server, the server writes messages to disk when the memory queue is full; after the message push module has processed the messages in the memory queue, it loads the content from disk back into the memory queue to continue processing.

To avoid message loss when a message queue server goes down, a message that has been successfully sent to the message queue is also kept on the message producer server, and is deleted only after it has actually been processed by the message consumer server. If a message queue server goes down, the producer server selects another server in the distributed message queue cluster to publish messages to.

Distributed services

Using distributed services is another important means of reducing system coupling. Whereas the distributed message queue decomposes system coupling through the message object, with different subsystems processing the same message, distributed services decompose system coupling through interfaces, with different subsystems making service calls against the same interface description.

Large monolithic applications bring many problems: compilation and deployment are difficult, branch management is chaotic, and functions are hard to extend and maintain. The solution is to split the application and deploy the modules independently to reduce system coupling. Splitting can be vertical or horizontal.

Vertical splitting: split a large application into multiple small applications. If a new business is relatively independent, design and deploy it directly as an independent web application system.

Horizontal splitting: split out reusable business logic and deploy it independently as distributed services. New businesses only need to call these distributed services, without depending on specific module code, to quickly build an application system. When the business logic changes, as long as the interface stays the same, the business programs and other modules are not affected, as shown in the figure.

[Figure: horizontal splitting with reusable business deployed as distributed services]

Vertical splitting is relatively simple: by sorting out the business, the less related parts are separated out into independent web applications. Horizontal splitting requires not only identifying reusable business logic, designing service interfaces, and standardizing service dependencies, but also a complete distributed service governance framework.
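
As a rough illustration of depending only on an interface description rather than on module code, here is a hypothetical Java sketch; the AccountService name and its method are invented for illustration, and in practice the interface would be exposed through a distributed service (RPC) framework.

```java
// Reusable business logic published behind an interface; callers never see the implementation.
public interface AccountService {
    boolean debit(long accountId, long amountInCents);
}

// A calling application holds a reference typed only by the interface; the concrete
// implementation can change or be redeployed independently as long as the interface is stable.
class OrderApplication {
    private final AccountService accountService;

    OrderApplication(AccountService accountService) {
        this.accountService = accountService;   // injected local stub or remote proxy
    }

    void placeOrder(long accountId, long priceInCents) {
        if (accountService.debit(accountId, priceInCents)) {
            // continue order processing ...
        }
    }
}
```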

Security architecture

Since the birth of the Internet, security threats have accompanied the development of application systems, and web attacks and information leaks of all kinds have never stopped.

Common web attacks and defense

XSS attack

XSS (Cross-Site Scripting) refers to an attack in which hackers tamper with web pages and inject malicious HTML scripts, so that when a user browses the page, the script controls the user's browser to perform malicious operations.

There are two common types of XSS attack. One is the reflected type, where the attacker tricks the user into clicking a link with a malicious script embedded in it. The other is the persistent (stored) type, where the hacker submits a request containing a malicious script that is saved in the attacked web site's database; when users browse the page, the malicious script is served as part of the otherwise normal page and the attack is carried out.

XSS attackers generally achieve their goal by embedding malicious scripts in requests, and such scripts do not appear in ordinary user input. If input is filtered and sanitized by escaping dangerous HTML characters, for example ">" to "&gt;" and "<" to "&lt;", most attacks can be prevented.
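
A minimal sketch of such escaping in Java is shown below; it only covers the characters mentioned above plus the ampersand and quotes, and real applications would normally rely on a vetted encoding library rather than hand-rolled code.

```java
// Escape dangerous HTML characters in user input before rendering it into a page.
public class HtmlEscaper {
    public static String escape(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;
        System.out.println(escape("<script>alert('xss')</script>"));
    }
}
```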

SQL injection attack

The principle of SQL injection is as follows: the attacker injects malicious SQL commands into the HTTP request; when the server uses the request parameters to construct database SQL commands, the malicious SQL is built into them and executed in the database.

To defend against SQL injection, the first step is to prevent attackers from guessing key database information such as table names. Filtering and sanitizing request parameters is also a simple, crude, and effective method. Finally, the best defense against SQL injection is to bind parameters using precompiled (prepared) statements.
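
The sketch below shows parameter binding with a precompiled statement using JDBC's PreparedStatement; the table and column names are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// User input is bound as a parameter, so it is treated as data rather than SQL text.
public class UserDao {
    public boolean userExists(Connection conn, String name) throws SQLException {
        String sql = "SELECT id FROM users WHERE name = ?";   // placeholder, not string concatenation
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);                            // malicious input stays a plain string value
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```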

CSRF attack

In a CSRF (Cross-Site Request Forgery) attack, the attacker forges a request as the user through a cross-site request, without the user's knowledge. Its core is exploiting the browser cookie or server session strategy to steal the user's identity.

Correspondingly, defending against CSRF mainly comes down to identifying the requester. The main methods are as follows:

  • Form Token: add a random number as a token to the page form; the token in the response page is different each time. A request submitted from the legitimate page will contain the token value, while a forged request cannot obtain it. The server checks the token value in the request parameters to decide whether the submitter is legitimate (see the sketch after this list).
  • Verification code: when the request is submitted, the user is required to enter a verification code, to prevent requests being forged without the user's knowledge. Since entering a verification code hurts the user experience, use it only where necessary, for example on key pages such as payment transactions.
  • Referer check: the Referer field of the HTTP request header records the source of the request, so the origin of the request can be verified by checking it.
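
Here is a minimal sketch of the form token idea in Java, with the session reduced to plain strings for illustration; a real application would store the token in the user's session and embed it in the rendered form.

```java
import java.security.SecureRandom;
import java.util.Base64;

// Generate a random per-form token and check it when the form is submitted.
public class CsrfToken {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Called when the form page is rendered; the value is stored in the session
    // and embedded in the form as a hidden field.
    public static String newToken() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Called on submit; a forged cross-site request cannot know the session's token.
    public static boolean isValid(String sessionToken, String requestToken) {
        return sessionToken != null && sessionToken.equals(requestToken);
    }
}
```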

Information encryption and key security

In order to protect the sensitive data of the system, it is usually necessary to encrypt the information. Information encryption technology can be divided into three categories: one-way hash encryption, symmetric encryption and asymmetric encryption.

One-way hash encryption

One-way hash encryption performs a hash calculation on input of any length to obtain a fixed-length output; the calculation is one-way and irreversible.

Although the ciphertext of a one-way hash cannot be algorithmically reversed to recover the plaintext, people tend to choose passwords with certain patterns, so passwords can still be guessed and cracked with means such as rainbow tables (tables of common passwords and their corresponding ciphertexts). To strengthen the security of one-way hashing, a salt is added to the hash calculation; the salt acts like an encryption key and increases the difficulty of cracking.

Commonly used one-way hash algorithms include MD5, SHA, and so on. Another feature of one-way hash algorithms is that any small change in the input produces a completely different output; this property is sometimes used to generate message digests or to compute highly dispersed random numbers.
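
A minimal sketch of salted one-way hashing with the JDK's MessageDigest is shown below; note that production password storage would normally prefer a deliberately slow algorithm such as bcrypt, scrypt, or Argon2, which goes beyond what is described here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Salted one-way hash: a random salt is mixed into the input before hashing,
// so identical passwords produce different ciphertexts and rainbow tables fail.
public class SaltedHash {
    public static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    public static byte[] hash(String password, byte[] salt) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(salt);                                        // mix the salt in first
        return digest.digest(password.getBytes(StandardCharsets.UTF_8));
    }
}
```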

Symmetric encryption

Symmetric encryption means that encryption and decryption use the same key (or keys that can be calculated from each other). Symmetric encryption is usually used where information needs to be securely exchanged or stored, such as cookie encryption and communication encryption.

The advantages of symmetric encryption are a simple algorithm, high encryption and decryption efficiency, and low system overhead, making it suitable for encrypting large amounts of data. The disadvantage is that encryption and decryption use the same key: securely exchanging keys over remote communication is a hard problem, and if the key is leaked, all encrypted information loses its secrecy. Commonly used symmetric encryption algorithms include DES and the RC family.
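
A minimal sketch of symmetric encryption with AES via the JDK's javax.crypto API follows; key handling, cipher mode, and padding are simplified for illustration, and an authenticated mode such as AES/GCM with a random IV would be preferable in practice.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

// The same secret key both encrypts and decrypts the data.
public class SymmetricDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();                 // the shared secret key

        Cipher cipher = Cipher.getInstance("AES");            // provider default mode/padding
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] cipherText = cipher.doFinal("cookie-value".getBytes(StandardCharsets.UTF_8));

        cipher.init(Cipher.DECRYPT_MODE, key);                // the same key decrypts
        byte[] plainText = cipher.doFinal(cipherText);
        System.out.println(new String(plainText, StandardCharsets.UTF_8));
    }
}
```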

Asymmetric encryption

Unlike symmetric encryption, asymmetric encryption and decryption do not use the same key. One key is made public and is called the public key; the other is known only to its owner and is called the private key. Information encrypted with the public key can only be decrypted with the private key, and conversely, information encrypted with the private key can only be decrypted with the public key. In theory, the private key cannot be derived from the public key. Asymmetric encryption is usually used for secure information transmission, digital signatures, and similar purposes.

Information sender A obtains receiver B's public key through a public channel, encrypts the information with it, and then sends the ciphertext to B over an insecure transmission channel. After receiving the ciphertext, B uses his own private key to decrypt it and obtain the original plaintext. Even if the ciphertext is stolen in transit, the thief cannot recover the plaintext without the decryption key.

The digital signature process is the reverse: the signer encrypts the information with his own private key and sends it to the other party, and the receiver uses the signer's public key to decrypt it and obtain the original plaintext. Since only the signer holds the private key, the information is non-repudiable and has the nature of a signature.

In practical applications, symmetric and asymmetric encryption are often combined: asymmetric encryption is first used to securely transmit a symmetric key, and then symmetric encryption is used to encrypt, decrypt, and exchange the information.
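
The sketch below illustrates this mixed approach with the JDK crypto API: RSA protects a one-off AES key in transit, and the AES key then encrypts the bulk data. Key sizes and padding choices are illustrative only.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

// Hybrid encryption: RSA securely transmits the AES key, AES handles the bulk data.
public class HybridDemo {
    public static void main(String[] args) throws Exception {
        // Receiver B generates an RSA key pair and publishes the public key.
        KeyPairGenerator rsaGen = KeyPairGenerator.getInstance("RSA");
        rsaGen.initialize(2048);
        KeyPair rsaPair = rsaGen.generateKeyPair();

        // Sender A generates a one-off AES key and encrypts it with B's public key.
        KeyGenerator aesGen = KeyGenerator.getInstance("AES");
        aesGen.init(128);
        SecretKey aesKey = aesGen.generateKey();

        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, rsaPair.getPublic());
        byte[] wrappedKey = rsa.doFinal(aesKey.getEncoded());

        // B recovers the AES key with its private key; both sides now share it.
        rsa.init(Cipher.DECRYPT_MODE, rsaPair.getPrivate());
        SecretKey recovered = new SecretKeySpec(rsa.doFinal(wrappedKey), "AES");

        // Bulk data is exchanged with the cheaper symmetric cipher.
        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.ENCRYPT_MODE, aesKey);
        byte[] cipherText = aes.doFinal("hello".getBytes());
        aes.init(Cipher.DECRYPT_MODE, recovered);
        System.out.println(new String(aes.doFinal(cipherText)));
    }
}
```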

Commonly used asymmetric encryption algorithms include RSA. The digital certificate the browser uses in HTTPS transmission is essentially an asymmetric-encryption public key certified by an authority.

Key security management

An important prerequisite for the security of the encryption techniques described above is the security of the keys. Whether it is the salt used in one-way hashing, the key for symmetric encryption, or the private key for asymmetric encryption, once these keys are leaked, all information encrypted with them loses its secrecy.

In practice, there are two ways to improve key security.

One solution is to put the keys and algorithms on an independent server, or even build them into a dedicated hardware facility, which provides encryption and decryption services to the outside; the application system encrypts and decrypts data by calling this service. Because keys and algorithms are deployed independently and maintained by dedicated staff, the probability of key leakage is greatly reduced. However, this solution is expensive, and every encryption or decryption requires a remote service call, so the overhead is also relatively high.

The other solution is to put the encryption and decryption algorithms in the application system and store the keys on an independent server. To improve key security, each key is actually stored split into several pieces, kept on different storage media; this takes key security into account while also improving encryption and decryption performance, as shown in the figure.

[Figure: key split into pieces and stored on separate media, with algorithms in the application]

Information filtering and anti-spam

On today's Internet, advertisements and spam are everywhere, so good information filtering and anti-spam handling is essential. Commonly used information filtering and anti-spam methods are as follows.

Text matching

The system generally maintains a list of sensitive words. If information posted by a user contains sensitive words from the list, it is sanitized or rejected for publication. So how can we quickly determine whether user information contains sensitive words? If there are few sensitive words and the text submitted by users is short, regular expression matching can be used directly. However, regular expressions are generally inefficient; when there are many sensitive words, user posts are long, and the site's concurrency is high, a more suitable method is needed.

There are many published algorithms in this area; most are variants of the Trie tree, with good space and time complexity, such as the double-array Trie algorithm. The essence of the Trie algorithm is to define a finite state machine and perform state transitions based on the input data. The double-array Trie optimizes the Trie: it uses two sparse arrays to store the tree structure, with the base array storing the Trie nodes and the check array performing state checks. The size of the double-array Trie needs to be determined from the business scenario and experience, to avoid arrays that are too large or that have too many conflicts.
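
For illustration, here is a sketch of sensitive-word matching with a plain (non-double-array) Trie in Java; it scans each starting position of the input against the tree and stops at the first match.

```java
import java.util.HashMap;
import java.util.Map;

// Plain Trie for sensitive-word matching: words are inserted once, then each
// position of the input text is walked through the tree looking for a match.
public class SensitiveWordTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWordEnd;
    }

    private final Node root = new Node();

    public void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWordEnd = true;
    }

    public boolean containsSensitiveWord(String text) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;                 // no sensitive word starts with this prefix
                }
                if (node.isWordEnd) {
                    return true;           // matched a sensitive word starting at position i
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SensitiveWordTrie trie = new SensitiveWordTrie();
        trie.addWord("spam");
        System.out.println(trie.containsSensitiveWord("this is spam mail"));  // true
    }
}
```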

Another, simpler implementation is to construct a multi-level hash table for text matching.

Classification algorithm

A better way to identify advertisements, spam, and similar content is to use a classification algorithm. Taking anti-spam as an example: first, batches of pre-classified mail samples are fed into the classification algorithm for training to obtain a spam classification model; then the classification algorithm, combined with the model, is used to identify the mail to be processed. A relatively simple and practical classification algorithm is the Bayesian classification algorithm.
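
A minimal naive Bayes sketch is shown below; tokenization, smoothing, and class priors are reduced to the bare minimum, so it only illustrates the train-then-classify idea rather than a production-quality filter.

```java
import java.util.HashMap;
import java.util.Map;

// Train per-class word counts, then score a new mail by summing log probabilities.
public class NaiveBayesSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamTotal = 0, hamTotal = 0;

    public void train(String mail, boolean isSpam) {
        for (String word : mail.toLowerCase().split("\\s+")) {
            if (isSpam) { spamCounts.merge(word, 1, Integer::sum); spamTotal++; }
            else        { hamCounts.merge(word, 1, Integer::sum);  hamTotal++;  }
        }
    }

    public boolean isSpam(String mail) {
        double spamScore = 0, hamScore = 0;
        for (String word : mail.toLowerCase().split("\\s+")) {
            // +1 smoothing avoids zero probabilities for unseen words.
            spamScore += Math.log((spamCounts.getOrDefault(word, 0) + 1.0) / (spamTotal + 2.0));
            hamScore  += Math.log((hamCounts.getOrDefault(word, 0) + 1.0) / (hamTotal + 2.0));
        }
        return spamScore > hamScore;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("win free money now", true);
        nb.train("meeting schedule for tomorrow", false);
        System.out.println(nb.isSpam("free money"));   // likely true with this tiny sample
    }
}
```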

Besides anti-spam, classification algorithms can also be used to classify information automatically. Portals can use them to automatically classify collected news articles and distribute them to different channels, and email service providers can use them to improve the relevance of personalized advertisements pushed according to email content.

Blacklist

For spam, besides using classification algorithms to classify and identify content, blacklist technology can also be used: reported spam email addresses are put into a blacklist, and the sender of each email is looked up in it; if the lookup succeeds, the mail is filtered out.

The blacklist can be implemented with a hash table. This is simple to implement and has low time complexity, which is sufficient for typical scenarios. But when the blacklist is very large, the hash table needs a lot of memory. In scenarios where filtering does not need to be completely accurate, a Bloom filter can be used instead of the hash table.
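
Here is a minimal Bloom filter sketch for a sender blacklist; the hash functions are simple seeded string hashes chosen for illustration, and the filter may report false positives but never false negatives.

```java
import java.util.BitSet;

// Bloom filter: k hash functions set k bits per added value; a lookup that finds
// any unset bit is definitely absent, while all-set bits may be a false positive.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    private int hash(String value, int seed) {
        int h = seed;
        for (int i = 0; i < value.length(); i++) {
            h = h * 31 + value.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    public void add(String value) {
        for (int seed = 0; seed < hashCount; seed++) {
            bits.set(hash(value, seed));
        }
    }

    public boolean mightContain(String value) {
        for (int seed = 0; seed < hashCount; seed++) {
            if (!bits.get(hash(value, seed))) {
                return false;              // definitely not in the blacklist
            }
        }
        return true;                       // possibly in the blacklist (may be a false positive)
    }

    public static void main(String[] args) {
        BloomFilterSketch blacklist = new BloomFilterSketch(1 << 20, 3);
        blacklist.add("spammer@example.com");
        System.out.println(blacklist.mightContain("spammer@example.com"));  // true
        System.out.println(blacklist.mightContain("friend@example.com"));   // almost certainly false
    }
}
```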

