
1. Business background

Many user-facing Internet services maintain a copy of user data on the server side, and the Quick App Center is no exception. The Quick App Center lets users bookmark quick apps, records each user's favorites list on the server, and associates the favorited quick app package names with the user through the account identifier OpenID.

To keep the favorites list in the Quick App Center consistent with the favorite state shown in the Quick App Menubar, we also record the binding between the account identifier OpenID and the client's local identifier local\_identifier. The Menubar is hosted by the Quick App Engine and is independent of the Quick App Center: it cannot obtain the user's account identity from the account system and only knows the client-local identifier local\_identifier, so the mapping between the two is the only way to keep the two states in sync.

Concretely, a synchronization is triggered when the user launches the Quick App Center: the client submits the OpenID and its local identifier to the server for binding. The server-side binding logic checks whether the OpenID already exists; if not, it inserts a new row, otherwise it updates the local\_identifier field of the existing row (a user may log in to the same vivo account on two different phones). Subsequent business flows can then look up the local\_identifier for an OpenID, and vice versa.

Some time after the code went live, however, we found many duplicate OpenID records in the t_account table. According to the binding logic above, this should be impossible. Fortunately, the duplicates did not affect the update and query scenarios, because our SQL includes LIMIT 1, so update and query operations for a given OpenID only ever touch the record with the smallest id.

2. Problem analysis and positioning

Although the redundant data had no impact on the actual business, such an obvious data anomaly is certainly intolerable, so we set out to track down the cause.

The natural starting point was the data itself. A rough look at the t_account table showed that about 3% of OpenIDs had duplicates; in other words, the duplicate inserts were occasional, and most requests were handled correctly as expected. We re-read the code and confirmed there was no obvious logic error in the implementation.

Looking at the data more closely, we picked several duplicated OpenIDs and queried their records. The number of duplicates varied: some OpenIDs were duplicated only once, others several times. But one detail stood out as particularly valuable: the rows sharing an OpenID had exactly the same creation time, and their auto-increment ids were consecutive.

So we guessed the problem was caused by concurrent requests! We simulated concurrent client calls to the interface and did reproduce the duplicate inserts, which further supported the guess. But the client logic clearly synchronizes only once per user, at startup, so why would concurrent requests with the same OpenID appear at all?

In fact, code does not run in the ideal world we imagine. Real systems face unstable factors such as the network environment and server load, and these can make a client request "fail". The failure may not be a real one: the request may simply take longer than the timeout the client has set and so be judged a failure, prompting the client to send it again through its retry mechanism. The same request can thus be submitted multiple times, and those requests may be held up somewhere along the way (for example, when the server's worker threads are overloaded and requests pile up in a buffer queue); once the pressure eases, the queued requests may all be processed concurrently within a short window.

This is a classic concurrency conflict, which can be abstracted simply as: how to avoid writing duplicate data under concurrency. Many common business scenarios face the same problem, for example, forbidding two users from registering the same username.

Generally, the most intuitive way to handle this kind of problem is to query first, and insert only if the data is not already in the database.

From the perspective of a single request, this flow is fine. But when requests are concurrent, request A and request B both issue the query first, both find that the row does not exist, and both then perform the insert, producing the conflict.
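This race can be reproduced in miniature with plain Java threads. In the sketch below (our own illustration, not the service's real code), a CountDownLatch forces both "requests" past the existence check before either one inserts, so both inserts go through:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class CheckThenInsertRace {
    // A stand-in "table": a synchronized list of inserted open_ids.
    static final List<String> table = Collections.synchronizedList(new ArrayList<>());

    // Simulates two concurrent submits; returns how many rows end up holding openId.
    public static int race(String openId) {
        table.clear();
        CountDownLatch bothChecked = new CountDownLatch(2);
        Runnable submit = () -> {
            boolean exists = table.contains(openId); // step 1: query
            bothChecked.countDown();
            try {
                bothChecked.await();                 // force both checks before any insert
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (!exists) {
                table.add(openId);                   // step 2: insert (both threads get here)
            }
        };
        Thread a = new Thread(submit);
        Thread b = new Thread(submit);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Collections.frequency(table, openId);
    }

    public static void main(String[] args) {
        System.out.println("rows for openId-1: " + race("openId-1")); // prints 2: a duplicate row
    }
}
```

In production the latch's role is played by request queueing and scheduling; the point is only that nothing makes check-plus-insert atomic.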

3. Explore feasible solutions

With the problem located, the next step was to find a fix. In this situation there are broadly two options: let the database solve it, or solve it in the application.

3.1 Database-level handling: a unique index

With MySQL and the InnoDB storage engine, we can use a unique index to guarantee that values in a column are unique. In the t\_account table we did not create a unique index on the open\_id column at the outset. To add one now, we can use the following ALTER TABLE statement.

ALTER TABLE t_account ADD UNIQUE uk_open_id( open_id );

Once the unique index is added to the open\_id column, when the concurrency above occurs, one of request A and request B completes its insert first, and the other receives an error like the one shown below. This guarantees that only one record per OpenID ends up in the t\_account table.

Error Code: 1062. Duplicate entry 'xxx' for key 'uk_open_id'
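The first-writer-wins behavior of a unique index can be imitated in memory with Java's ConcurrentHashMap.putIfAbsent, which likewise admits exactly one writer per key atomically. This is only an analogy to illustrate the semantics, not actual database code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class UniqueIndexSketch {
    // open_id -> local_identifier; putIfAbsent plays the role of the unique index:
    // it admits exactly one writer per key, atomically.
    static final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();

    // Returns true if the insert won, false if it hit the "duplicate key" case.
    public static boolean insert(String openId, String localIdentifier) {
        return table.putIfAbsent(openId, localIdentifier) == null;
    }

    public static void main(String[] args) {
        System.out.println(insert("open-1", "device-A")); // true : first insert succeeds
        System.out.println(insert("open-1", "device-B")); // false: the duplicate is rejected
    }
}
```

An application using the unique index would catch the duplicate-key error and fall back to the update branch, just as a caller here would react to `false`.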

3.2 Application-level handling: distributed locks

The other option is not to rely on the underlying database for a uniqueness guarantee, but to avoid the conflict in the application's own code. Application-level protection is actually the more general solution; after all, we cannot assume that every persistence component the system uses can enforce data uniqueness.

How is this done? In short, by serializing the parallel behavior. We hit the duplicate-insert problem because "check whether the data exists" and "insert the data" are two separate steps. Since they are not atomic, two different requests can both pass the check at the same time. If we combine the two actions into a single atomic operation, the conflict disappears, and that atomicity is achieved by locking the code block.

For the Java language, the most familiar locking mechanism is the synchronized keyword.

public synchronized void submit(String openId, String localIdentifier){
    Account account = accountDao.find(openId);
    if (account == null) {
        // insert
    }
    else {
        // update
    }
}

However, things are not that simple. Our program is deployed on multiple nodes, not just one server, so the concurrency here is not only between threads but between processes. A lock at the Java language level cannot solve that synchronization problem; what we need is a distributed lock.

3.3 The trade-off between the two solutions

From the analysis above, both solutions look feasible, but in the end we chose the distributed lock. Why not the first solution, which only requires adding an index?

Because the open_id column already contains duplicates in production, adding the unique index directly would simply fail. To add it, we would first have to clean up the existing duplicate data. But the program keeps running and may keep producing new duplicates. Could we pick a window when user traffic is quiet, clean up, and create the unique index before any new duplicate is inserted? In principle yes, but such a fix requires coordination among operations, DBAs, and development, and given the traffic pattern the only suitable window is in the small hours of the morning, when traffic is dead quiet. Even with such strict measures, there is no 100% guarantee that no new duplicate slips in between the cleanup and the index creation. So the unique-index fix, attractive at first glance, is rather troublesome to actually carry out.

In fact, the right moment to create a unique index is during the initial design of the system, which would have prevented duplicate data entirely. But what's done is done, and in the current situation the distributed lock was the more operable choice: we could first ship the lock-based fix to block new duplicates, then clean up the existing duplicate data, needing only one code change and one release. Of course, once the problem is fully resolved, we can revisit adding a unique index to the table.

Next, let's look at how to implement the distributed-lock-based solution, starting with a review of distributed locks themselves.

4. An overview of distributed locks

4.1 What characteristics does a distributed lock need to have?

  • In a distributed system, only one thread on one machine may hold the lock at any given time;
  • Lock acquisition and release must be highly available;
  • Lock acquisition and release must be high-performance;
  • The lock should be reentrant;
  • The lock needs an expiry (failure) mechanism to prevent deadlock;
  • The lock should support both blocking and non-blocking acquisition.
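The core of these requirements can be summarized as a small Java interface. The in-memory implementation below is only a single-process stand-in to illustrate the contract (names are our own, not from any library), and it covers only mutual exclusion and expiry, not reentrancy or blocking; a real implementation keeps this state in Redis or ZooKeeper so that all processes on all machines see it:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// The contract a distributed lock client might expose (illustrative names).
interface DistributedLock {
    boolean tryLock(String key, long ttlMillis); // non-blocking acquire with an expiry
    void unlock(String key);
}

public class InMemoryLock implements DistributedLock {
    // key -> expiry timestamp (millis). Shared state lives here in a real store.
    private final ConcurrentMap<String, Long> expiries = new ConcurrentHashMap<>();

    @Override
    public boolean tryLock(String key, long ttlMillis) {
        long now = System.currentTimeMillis();
        boolean[] acquired = {false};
        // Atomically take the lock if it is absent or expired
        // (the "lock failure mechanism" that prevents deadlock).
        expiries.compute(key, (k, expiry) -> {
            if (expiry == null || expiry <= now) {
                acquired[0] = true;
                return now + ttlMillis;
            }
            return expiry;
        });
        return acquired[0];
    }

    @Override
    public void unlock(String key) {
        expiries.remove(key);
    }

    public static void main(String[] args) {
        DistributedLock lock = new InMemoryLock();
        System.out.println(lock.tryLock("openId-1", 60_000)); // true : acquired
        System.out.println(lock.tryLock("openId-1", 60_000)); // false: already held
        lock.unlock("openId-1");
        System.out.println(lock.tryLock("openId-1", 60_000)); // true : re-acquired
    }
}
```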

4.2 What are the ways to implement distributed locks?

There are three main ways to implement a distributed lock:

  • Based on a database;
  • Based on ZooKeeper;
  • Based on Redis.

4.2.1 Implementation based on database

The database-based approach creates a dedicated lock table and implements locking and unlocking by manipulating its rows. Taking MySQL as an example, we can create a table like the following, with a unique index constraint on method_name:

CREATE TABLE `myLock` (
 `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
 `method_name` varchar(100) NOT NULL DEFAULT '' COMMENT 'name of the locked method',
 `value` varchar(1024) NOT NULL COMMENT 'lock info',
 PRIMARY KEY (`id`),
 UNIQUE KEY `uidx_method_name` (`method_name`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='methods currently locked';

Then, we can lock and unlock by inserting data and deleting data:

# lock
insert into myLock(method_name, value) values ('m1', '1');
 
# unlock
delete from myLock where method_name = 'm1';

Although the database-based implementation is simple, there are some obvious problems:

  • There is no lock expiry: if unlocking fails, the lock row stays in the database forever, causing a deadlock;
  • The lock is not reentrant, because the table cannot tell whether the requester is the thread that currently holds the lock;
  • The database is a single point of failure: if it goes down, the whole lock mechanism breaks.

4.2.2 Implementation based on Zookeeper

ZooKeeper is an open-source component that provides consistency services for distributed applications. Internally it is a hierarchical, file-system-like tree of nodes, and node names within the same directory must be unique.

There are four types of ZooKeeper nodes (Znodes):

  • Persistent node (the node survives session disconnect)
  • Persistent sequential node
  • Ephemeral node (the node is deleted when the session ends)
  • Ephemeral sequential node

When a Znode is created as a sequential node, ZooKeeper forms its path by appending a 10-digit sequence number to the requested name. For example, if a Znode with path /mynode is created as a sequential node, ZooKeeper changes the path to /mynode0000000001 and sets the next sequence number, which is maintained by the parent node, to 0000000002. If two sequential nodes are created concurrently, ZooKeeper never assigns the same number to both.

Based on the characteristics of ZooKeeper, distributed locks can be implemented as follows:

  • Create a directory node mylock;
  • When thread A wants the lock, it creates an ephemeral sequential node under mylock;
  • Thread A lists all children of mylock and looks for a sibling with a smaller sequence number; if none exists, its own number is the smallest and it holds the lock;
  • Thread B lists all the nodes, finds it is not the smallest, and sets a watch on the node with the next-smaller sequence number;
  • When thread A finishes and deletes its node, thread B receives the watch event, checks whether it is now the smallest node, and if so takes the lock.
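The "smallest sequence number holds the lock" scheme above is essentially a ticket lock, and its core can be sketched in plain Java (our own single-process illustration; no real ZooKeeper is involved): an AtomicLong hands out "sequence node" numbers, and a thread holds the lock while its number is the smallest one still alive:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequenceLockSketch {
    // nextSeq plays the parent znode handing out sequence suffixes;
    // nowServing is the smallest live "node", i.e. the current lock holder.
    private final AtomicLong nextSeq = new AtomicLong();
    private final AtomicLong nowServing = new AtomicLong();

    public void lock() {
        long mySeq = nextSeq.getAndIncrement();  // create my sequence node
        while (nowServing.get() != mySeq) {      // not the smallest yet
            Thread.onSpinWait();                 // real ZK: watch the next-smaller node instead
        }
    }

    public void unlock() {
        nowServing.incrementAndGet();            // delete my node; the successor proceeds
    }

    // Run several workers incrementing a shared counter under the lock.
    public static int demo(int threads, int incrementsEach) {
        SequenceLockSketch lock = new SequenceLockSketch();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsEach; j++) {
                    lock.lock();
                    counter[0]++;                // safe: exactly one holder at a time
                    lock.unlock();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // prints 4000: no lost updates
    }
}
```

Unlike this spin loop, ZooKeeper clients block on a watch event, which is why the scheme scales across processes.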

Because the nodes are ephemeral, the lock is released even if the holder crashes unexpectedly, so deadlock is avoided. The queue-and-watch mechanism also gives us blocking behavior, and reentrancy can be achieved by carrying a thread identifier in the Znode. Meanwhile, the high availability of a ZooKeeper cluster guarantees the availability of the lock itself. However, because nodes must be created and deleted frequently, the ZooKeeper approach does not perform as well as the Redis approach.

4.2.3 Implementation based on Redis

Redis is an open-source key-value store. It is memory-based, extremely fast, and often used as a cache.

The core idea of a Redis-based distributed lock is: try to set a specific key; if the set succeeds (the key did not exist before), that is equivalent to acquiring the lock. An expiration time is set on the key at the same time, so that a thread which exits before releasing the lock cannot cause a deadlock. When the synchronized task finishes, the thread releases the lock explicitly with a delete command.

One point deserves special attention: how to set the key and its expiration. Some implementations use two commands, setnx followed by expire, but that is flawed: if the thread executes setnx to take the lock and then crashes before executing expire, the lock can never be released. We could merge the two commands into a Lua script so that they commit atomically.

In fact, the set command alone can perform setnx plus expiry in a single command, completing the lock operation:

SET key value [EX seconds] [PX milliseconds] NX

The unlock operation only needs:

DEL key

5. The solution based on a Redis distributed lock

In this case, we adopted a distributed lock based on Redis.

5.1 Java implementation of distributed locks

Since the project uses the Jedis client and the production Redis deployment runs in cluster mode, we wrapped a RedisLock class around redis.clients.jedis.JedisCluster to provide lock and unlock methods.

public class RedisLock {
 
    private static final String LOCK_SUCCESS = "OK";
    private static final String LOCK_VALUE = "lock";
    private static final int EXPIRE_SECONDS = 3;
 
    @Autowired
    protected JedisCluster jedisCluster;
 
    public boolean lock(String openId) {
        String redisKey = this.formatRedisKey(openId);
        String ok = jedisCluster.set(redisKey, LOCK_VALUE, "NX", "EX", EXPIRE_SECONDS);
        return LOCK_SUCCESS.equals(ok);
    }
 
    public void unlock(String openId) {
        String redisKey = this.formatRedisKey(openId);
        jedisCluster.del(redisKey);
    }
 
    private String formatRedisKey(String openId){
        return "keyPrefix:" + openId;
    }
}

In this implementation we set an expiration of 3 seconds: the locked task is just a simple database query and insert, and the server and database are deployed in the same data center, so under normal circumstances 3 seconds is ample for the code to run.

Admittedly, the above is a rough-and-ready Redis distributed lock. It does not consider reentrancy, nor the possibility of the lock being released by another process by mistake, but it already meets the needs of this business scenario. For a more general scenario, one could put an identifier of the current process into the value and check it when locking and releasing, yielding a safer and more reliable Redis lock implementation.
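The "value carries the owner's identifier" refinement mentioned above can be sketched in memory. Java's ConcurrentMap.remove(key, value) is an atomic compare-and-delete, playing the role that a GET-then-DEL Lua script plays in Redis (an illustrative analogy, not the project's production code):

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TokenLockSketch {
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

    // Acquire: analogue of SET key token NX. Returns the token on success, null otherwise.
    public String lock(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    // Release: delete only if we still own the lock. remove(key, value) is the
    // atomic compare-and-delete that the check-and-DEL Lua script provides in Redis.
    public boolean unlock(String key, String token) {
        return token != null && store.remove(key, token);
    }

    public static void main(String[] args) {
        TokenLockSketch locks = new TokenLockSketch();
        String mine = locks.lock("openId-1");
        System.out.println(locks.unlock("openId-1", "stale-token")); // false: wrong owner
        System.out.println(locks.unlock("openId-1", mine));          // true : owner releases
    }
}
```

With this check, a thread whose lock expired cannot accidentally delete a lock that another process has since acquired.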

Of course, frameworks such as Redisson provide quite complete Redis distributed lock implementations, and in business scenarios with stricter requirements I recommend using such a framework directly. Since this article focuses on the thought process of locating and solving the problem, it does not go into the implementation details of Redisson's distributed lock; interested readers will find plenty of material online.

5.2 Improved code logic

Now we can use the wrapped RedisLock to improve the original code.

public class AccountService {
 
    @Autowired
    private RedisLock redisLock;

    @Autowired
    private AccountDao accountDao;
 
    public void submit(String openId, String localIdentifier) {
        if (!redisLock.lock(openId)) {
            // If concurrent requests share the same openId and this thread fails to grab the lock, drop the request
            return;
        }
 
        // Lock acquired: run the user-data synchronization logic
        try {
            Account account = accountDao.find(openId);
            if (account == null) {
                // insert
            } else {
                // update
            }
        } finally {
            // Release the lock
            redisLock.unlock(openId);
        }
    }
}

5.3 Data cleaning

Finally, a brief note on the cleanup. The volume of duplicate data was far too large to handle by hand, so we wrote a scheduled task that ran once a minute and cleaned up 1000 duplicate OpenIDs per run, avoiding the database load a burst of queries and deletes would cause. Once we confirmed the duplicates were fully cleaned up, we stopped the scheduled task and removed the code in the next release.
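The per-batch cleanup rule, keep the row with the smallest id for each OpenID and delete the rest, can be expressed in a few lines of Java (an illustration operating on in-memory rows rather than the real table):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateCleanupSketch {
    // Returns the ids to delete: every row whose open_id also appears on a row
    // with a smaller id (the surviving row is the one LIMIT 1 already targets).
    public static List<Long> idsToDelete(Map<Long, String> rows) {
        Map<String, Long> keep = new HashMap<>();
        rows.forEach((id, openId) ->
                keep.merge(openId, id, Math::min));   // smallest id per open_id survives
        return rows.keySet().stream()
                .filter(id -> !id.equals(keep.get(rows.get(id))))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, String> rows = new HashMap<>();
        rows.put(1L, "open-A");
        rows.put(2L, "open-A"); // duplicate of open-A
        rows.put(3L, "open-B");
        rows.put(4L, "open-A"); // duplicate of open-A
        System.out.println(idsToDelete(rows)); // prints [2, 4]
    }
}
```

The scheduled task would run this selection against one batch of duplicated OpenIDs at a time and issue the corresponding deletes.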

6. Summary

Problems of all kinds are inevitable in day-to-day development. We have to learn to analyze them step by step to find the root cause, then look for feasible solutions within our own knowledge, and carefully weigh the pros and cons of each, so that in the end the problem is solved efficiently.


vivo Internet Technology