

By Hu, Senior Engineer at Aurora

1. Background

In UMS 5.1, product requirements call for SMS signatures to support uploading local images and for emails to support uploading attachments, and later scenarios will require large-scale file storage. We therefore need to build our own file server for the private cloud, and it must also be compatible with customers' file servers (note: customer file servers are generally compatible with the S3 protocol).

2. File Server Research

After research, comparison, and discussion within the team, we finally chose minIO.

1. Introduction to minIO

minIO is an object storage service written in Go and released under the Apache License v2.0. It is compatible with the Amazon S3 cloud storage API and is well suited to storing large volumes of unstructured data such as images, videos, log files, backups, and container/VM images; a single object can range from a few KB up to 5 TB. minIO is a very lightweight service that can easily be combined with other applications, much like NodeJS, Redis, or MySQL.

2. Advantages of minIO

Compatible with Amazon S3

minIO implements the Amazon S3 v2/v4 API.

Data protection

minIO uses erasure coding to protect against hardware failure. Even if up to half of the drives are lost, data can still be recovered.

Highly available

In a distributed deployment, the minIO server can tolerate up to (N/2) - 1 node failures.

Lambda computation

The minIO server can trigger Lambda-style functions through its AWS SNS/SQS-compatible event notification service. Supported targets include message queues such as Kafka and AMQP, and databases such as Elasticsearch, Redis, and MySQL.

Encryption and tamper-proofing

minIO provides confidentiality, integrity, and authenticity guarantees for encrypted data with minimal performance overhead. Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC.

Pluggable back-end storage

In addition to minIO's own file system, it also supports DAS, JBOD, NAS, Google Cloud Storage, and Azure Blob Storage as back ends.

SDK support

Thanks to its lightweight design, minIO is supported by SDKs for languages such as Java, Python, and Go.

Consistency

In both distributed and standalone modes, all read and write operations in minIO strictly follow the read-after-write consistency model.

3. minIO architecture diagram

minIO adopts a decentralized, shared-nothing architecture. Object data is spread across multiple drives on different nodes while a unified namespace is exposed externally; load balancing across the servers is achieved with a Web load balancer or DNS round-robin.

4. minIO storage mechanism

4.1 Basic concepts

Hard Disk (Drive): the disk that stores the data, passed in as a parameter when minIO starts.
Set: a group of Drives. A distributed deployment is automatically divided into one or more Sets according to cluster size; the Drives in each Set are distributed across different nodes, and each object is stored within a single Set.
Bucket: the logical location where file objects are stored; to the client it is equivalent to a top-level folder.

4.2 Erasure Code

minIO uses erasure codes and checksums to protect data from hardware failure and silent data corruption. Even if half of the drives (N/2) are lost, the data can still be recovered.

What is an erasure code? It is a mathematical algorithm for recovering lost or damaged data. minIO uses Reed-Solomon code for its erasure coding, splitting an object into N/2 data blocks and N/2 parity blocks. With 12 disks, for example, an object is divided into 6 data blocks and 6 parity blocks; you can lose any 6 disks (whether they hold data blocks or parity blocks) and still recover the data from the remaining ones.

4.3 Brief Analysis of Reed-Solomon Code Data Recovery Principle

RS coding uses the word as its encoding/decoding unit: large data blocks are split into words of length w (usually 8 or 16 bits), and each word is encoded and decoded; block-level encoding works the same way as word-level encoding. In the following, words are used as the example, and the variables Di and Ci each represent one word. The input data is viewed as a vector D = (D1, D2, ..., Dn), and the encoded data as a vector (D1, D2, ..., Dn, C1, C2, ..., Cm); RS encoding can then be viewed as the matrix operation shown in Figure 1. The leftmost part of Figure 1 is the encoding matrix (also called the generator matrix or distribution matrix), which must have the property that every n×n submatrix is invertible. To make data storage convenient, the upper part of the encoding matrix is the identity matrix (n rows, n columns), and the lower part is an m×n matrix, which can be either a Vandermonde matrix or a Cauchy matrix.

RS can tolerate the loss of at most m blocks. The data recovery process is as follows:

(1) Suppose D1, D4, and C2 are lost. Delete the rows corresponding to the missing data/parity blocks from the encoding matrix. (Figures 2, 3)

(2) Since B' is invertible, with inverse (B')^-1, we have B' × (B')^-1 = I, the identity matrix. Multiply both sides of the equation by the inverse of B'. (Figures 4, 5)

(3) This yields the following formula for recovering the original data D, as shown in the figure below:

(4) Re-encode D to regenerate the lost parity blocks.
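Since the original figures are not reproduced here, the encoding and recovery steps above can be summarized in matrix form (a sketch using the symbols from the text):

```latex
% Encoding: the (n+m) x n generator matrix (identity I on top, an
% m x n Vandermonde/Cauchy block B below) maps the data vector D
% to the codeword (D_1, ..., D_n, C_1, ..., C_m).
\begin{pmatrix} I \\ B \end{pmatrix} D
  = \begin{pmatrix} D_1 \\ \vdots \\ D_n \\ C_1 \\ \vdots \\ C_m \end{pmatrix}
% Recovery: after losing up to m rows, keep any n surviving rows of
% the generator matrix (B') and the matching surviving words (E'):
\qquad B' D = E' \;\Rightarrow\; D = (B')^{-1} E'
```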

4.4 Run minIO in erasure code mode

minIO is started with 12 drives and automatically shards objects across them; the command is as follows:
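The start command was shown as an image in the original; a typical invocation (a sketch assuming minIO's `{1...12}` expansion syntax and local data paths) is:

```shell
# Start a single minIO server in erasure code mode across 12 drives;
# each object is split into 6 data shards and 6 parity shards.
minio server /data{1...12}
```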

4.5 Storage format

When a data object is stored in a minIO cluster, it is first erasure-sharded and then scattered across the drives. Specifically: minIO automatically creates a number of erasure groups in the cluster, each containing a set of drives, usually 4 to 16; the object is sharded, by default into an equal number of data shards and parity shards; a hash algorithm then determines which erasure group the object belongs to, and the data and parity shards are stored on the drives of that group.
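The "hash to an erasure group" placement step can be illustrated with a toy Go sketch (not minIO's actual internal algorithm; the CRC-based hash and function name here are illustrative): the object key is hashed deterministically and reduced modulo the number of erasure groups, so every node computes the same placement.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// chooseErasureSet maps an object key to one of numSets erasure groups
// deterministically, so every node computes the same placement without
// coordination. This is an illustrative stand-in for minIO's placement hash.
func chooseErasureSet(bucket, object string, numSets int) int {
	key := bucket + "/" + object
	return int(crc32.ChecksumIEEE([]byte(key))) % numSets
}

func main() {
	// The same key always lands in the same erasure group.
	fmt.Println(chooseErasureSet("MyBucket", "MyObject", 4))
	fmt.Println(chooseErasureSet("MyBucket", "MyObject", 4))
}
```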

As shown in the figure above, suppose an erasure group in a minIO cluster contains 4 drives, a data object is named MyObject, its bucket is named MyBucket, and the hash calculation maps it to the erasure group consisting of Disks 1-4. Under the data path of each of Disks 1-4, a MyBucket/MyObject subpath is then created containing two files: xl.json, which stores the metadata, and part.1, the first shard of MyObject on that disk. Here xl refers to minIO's default storage format for data objects.

5. Basic use of the minIO Golang SDK

The following file upload example can be run directly; the file is uploaded to minIO's official public server.
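The original code block was an image; the following sketch, based on the official minio-go v7 upload example, shows the same idea (the bucket and file names are placeholders, and running it requires network access to play.min.io, minIO's public test server, whose demo credentials are published in the minio-go documentation):

```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	ctx := context.Background()

	// Connect to minIO's public play server with its published demo keys.
	client, err := minio.New("play.min.io", &minio.Options{
		Creds: credentials.NewStaticV4(
			"Q3AM3UQ867SPQQA43P2F",
			"zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatalln(err)
	}

	bucket := "mymusic" // placeholder bucket name
	if err := client.MakeBucket(ctx, bucket, minio.MakeBucketOptions{}); err != nil {
		// The bucket may already exist; only fail if it truly doesn't.
		exists, errCheck := client.BucketExists(ctx, bucket)
		if errCheck != nil || !exists {
			log.Fatalln(err)
		}
	}

	// Upload a local file; FPutObject streams it to the server.
	info, err := client.FPutObject(ctx, bucket, "golden-oldies.zip",
		"/tmp/golden-oldies.zip",
		minio.PutObjectOptions{ContentType: "application/zip"})
	if err != nil {
		log.Fatalln(err)
	}
	log.Printf("uploaded golden-oldies.zip, size %d\n", info.Size)
}
```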


3. Practical Application

1. Application system architecture

In the overall architecture, modules communicate with each other over HTTP. The role of each module is as follows:

(1) The Web/API server provides authentication and authorization for the UMS system, i.e., it verifies the legitimacy of requests from the Web client or from developers calling the API;

(2) The file management server provides an interface for operating the minIO server. For the current business needs of the UMS system, it only needs to obtain a presigned URL for uploading a file, set its expiration time, set the bucket's external access policy, create buckets, and generate download URLs. So what is a presigned URL? It is a URL that the object owner creates with their own security credentials to grant time-limited permission to upload or download an object, so that the object can be shared with other users. Note: even private objects can be shared with others via a presigned URL, and the maximum validity period of a presigned URL is 7 days.

The file management server obtains the upload presigned URL directly through the official minIO API (you could, of course, also implement presigning yourself). In addition, since the maximum lifetime of a download presigned URL is 7 days, which does not meet the UMS system's business requirements, the file management server implements its own method for generating download URLs whose expiration time can be set arbitrarily; the prerequisite is that the bucket's external access policy is set to public. The client can therefore upload a file directly to the minIO server using the upload presigned URL, and download it directly using the download URL.

(3) The minIO cluster stores the physical files. The cluster adopts a decentralized shared-nothing architecture in which all nodes are peers, so connecting to any node gives access to the whole cluster; an Nginx reverse proxy is placed in front of the minIO cluster, and communication between minIO nodes uses RPC. In addition to the SDK mentioned above, minIO server management is also officially available via the command line and a web console, as follows:

Enter the Nginx proxy IP and port, or the IP and port of any node in the minIO cluster, into a browser, then log in with the minIO account name and password. The interface looks like this:

2. Specific interaction logic

First, the client requests a credential (presigned URL) for uploading a file from the business server (Web Server/API Server). The business server responds with an upload URL and a download URL. The client uses the upload URL to upload the file to the file server, and uses the download URL as the file parameter in subsequent backend requests. For example, when sending an email that supports uploading local images, the image uploaded to the backend can be referenced by its download URL.

The advantages of this scheme are as follows:

The client uploads files directly to the minIO server without passing through the business server, reducing load on the business server and improving availability
The database stores only the file's download URL, reducing the amount of data in the database
Supports very large file uploads, e.g., 3 GB and above; with sufficient hardware, a single file on the minIO server can be up to 5 TB
No limit on the number of uploaded files
Solves the problem of same-name files overwriting each other
Can be adapted to any S3-compatible file server to meet different customers' requirements

4. minIO Distributed Deployment

1. minIO distributed deployment architecture

1.1 Architecture overview

The minIO cluster adopts a decentralized shared-nothing architecture in which all nodes are peers; connecting to any node gives access to the whole cluster. This fully peer-to-peer design is not the most common distributed cluster architecture: in most distributed storage clusters today, nodes are divided into multiple roles, such as access nodes that handle external application requests, management nodes that store metadata, and actual data storage nodes. minIO is different: every node in the cluster takes on all of these roles at once, integrating metadata storage, data storage, and application access, achieving true decentralization and complete equality among nodes. The advantage is that it avoids complex scheduling within the cluster, as well as the failure risk and performance bottleneck of a central node.

The minIO cluster in the figure below adds Nginx proxy:

Deploying a minIO cluster requires only one command, but every node in the cluster must execute that same command.
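The command was shown as an image in the original; a typical 4-node invocation (a sketch with placeholder IPs and paths), run identically on each of the four nodes, looks like this:

```shell
# Run the SAME command on all 4 nodes (192.168.1.11 .. 192.168.1.14).
export MINIO_ACCESS_KEY=minioadmin   # must be identical on every node
export MINIO_SECRET_KEY=minioadmin
minio server http://192.168.1.{11...14}/data
```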

Officially, it is recommended that the node IP addresses be consecutive.

1.2 minIO expansion plan

First of all, minIO's minimalist design means that a distributed minIO cluster does not support expanding by adding a single node with automatic rebalancing, because the data rebalancing and erasure group redivision triggered by a single added node would impose complex scheduling on the whole cluster and hurt maintainability. Instead, minIO provides peer-to-peer expansion: the number of nodes and disks added must equal that of the original cluster (or a multiple of it).

For example, if the original cluster has 4 nodes and 4 disks, then 4 nodes and 4 disks (or a multiple thereof) must be added during expansion so that the system keeps the same data-redundancy SLA; this greatly reduces the complexity of expansion. Continuing the example, after expansion the minIO cluster does not rebalance data across all 8 nodes; instead, the original 4 nodes are treated as one zone and the new 4 nodes as another. When a new object is uploaded, the cluster chooses the storage zone according to each zone's ratio of free space, and within that zone the target erasure group is still determined by the hash algorithm. After a peer-to-peer expansion, the cluster can be expanded again by the same rules, but for safety the number of nodes should generally not exceed 32.

minIO supports expanding an existing cluster by specifying a new cluster (in erasure code mode) on the command line, as follows:
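The expansion command was shown as an image in the original; based on minIO's zone-expansion syntax, the 1024-disk example in the text would look roughly like this (hostnames are placeholders):

```shell
# Original cluster: 32 nodes x 32 drives each = 1024 disks.
# To expand, append a second, equally sized zone (another 1024 disks)
# and run the SAME command on every node of both zones, then restart.
minio server http://host{1...32}/export{1...32} \
             http://host{33...64}/export{1...32}
```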

The cluster has now been expanded by 1024 disks, for a total of 2048. New object upload requests are automatically routed to the least-used cluster. With this expansion strategy you can grow the cluster as needed; restart the cluster after reconfiguring and the change takes effect immediately, without affecting the existing cluster. In the command above, the original cluster can be regarded as one zone and the newly added cluster as another; new objects are placed in a zone according to each zone's proportion of free space, and within each zone the location is determined by a deterministic hash algorithm.

Note: each zone you add must have the same erasure-code set size as the original zone in order to keep the same data-redundancy SLA. For example, if the first zone has 8 disks, you can expand the cluster with zones of 16, 32, or 1024 disks; you only need to ensure that the new zone's disk count is a multiple of the original zone's.

The advantages and disadvantages of peer-to-peer expansion are as follows:

Advantages: The configuration operation is simple and easy, and the expansion can be completed with a single command.

Disadvantages: ① expansion requires a restart; ② expansion is limited, and the number of cluster nodes generally should not exceed 32, because the minIO cluster guarantees strong consistency through distributed locks, and maintaining strong consistency across too many nodes causes performance problems.

However, if the initial storage volume is not large and a brief cluster shutdown and restart has little impact on the business, peer-to-peer expansion is a good choice.

2. Precautions

All nodes in a distributed minIO deployment must share the same access key and secret key, i.e., username and password
The disk directories where distributed minIO stores data must be empty
For production environments, distributed minIO officially recommends at least 4 nodes: with N nodes, at least N/2 nodes must be available for reads, and at least N/2 + 1 for writes
All distributed minIO nodes must have synchronized clocks and identical machine configurations
Distributed minIO stores a data file on each disk to ensure data reliability and safety

3. Specific implementation steps

Many tutorials online deploy a minIO cluster with a single script, which is unfriendly in a real production environment, because minIO requires every node in the cluster to execute the same command to start successfully. The best approach is therefore to deploy the minIO cluster with Ansible.

3.1 Install Ansible

3.2 Deploy the minIO cluster using Ansible

The core of the Ansible playbook is as follows; readers can search online for further details:
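The playbook itself was shown as an image in the original; a minimal sketch of what such a playbook might contain (host group, paths, IPs, and credentials are placeholders):

```yaml
# inventory: group [minio] lists the 4 node IPs
- hosts: minio
  become: yes
  tasks:
    - name: Copy the minio binary to every node
      copy:
        src: ./minio
        dest: /usr/local/bin/minio
        mode: "0755"

    - name: Create the data directory (must be empty)
      file:
        path: /data
        state: directory

    - name: Start minIO with the same command on every node
      shell: |
        export MINIO_ACCESS_KEY=minioadmin
        export MINIO_SECRET_KEY=minioadmin
        nohup minio server http://192.168.1.{11...14}/data \
          > /var/log/minio.log 2>&1 &
```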

3.3 Configure Nginx proxy cluster

The contents of the Nginx configuration file are as follows:
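The configuration was shown as an image in the original; a minimal sketch of an Nginx reverse proxy in front of the 4 nodes (IPs and ports are placeholders) might look like:

```nginx
upstream minio_cluster {
    # Any node can serve any request; round-robin across all 4.
    server 192.168.1.11:9000;
    server 192.168.1.12:9000;
    server 192.168.1.13:9000;
    server 192.168.1.14:9000;
}

server {
    listen 9001;

    location / {
        proxy_pass http://minio_cluster;
        proxy_set_header Host $http_host;
        # Allow large file uploads through the proxy.
        client_max_body_size 0;
    }
}
```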

3.4 Verify that the minIO cluster is successfully deployed

In a browser, enter the address of the server running Nginx plus the listening port from the Nginx configuration to access the file server's web page. A successful deployment looks like this:

5. Conclusion

The above covers the development and deployment of the UMS private cloud file service, and the solution has been verified in practice. If you want to build an S3-compatible file server, this article should serve as a useful reference. Of course, given time constraints and the modest initial storage volume, the solution still has room for optimization. For example, to implement a dynamic expansion mechanism you can use the official federation approach, but that requires introducing etcd and more machines. In short, decide according to the specific business scenario: as with buying shoes, bigger is not always better; the best fit is best.


About JIGUANG: Aurora Mobile (www.jiguang.cn) is a leading mobile big data service provider in China. Its core team members come from companies such as Tencent, Morgan Stanley, Douban, Teradata, and China Mobile. Since its founding in 2011, the company has focused on providing app developers with stable and efficient push notification, analytics, and instant messaging services.