Requirements
Logs are essential for troubleshooting online issues. Many problems are sporadic: a user on a given system version and device can reproduce them, but developers performing the same operations on the same device cannot. These problems still have to be solved, and logs are the way to track them down. The cause may be in the backend or in client-side logic; logging at key points lets us locate it quickly.

Suppose we have one million daily active users and 1% of them hit a problem. Even if it is not a crash but a business or playback issue, that is 10,000 affected users, which is a lot. And most users will not contact customer service after hitting a problem; they simply switch to another platform.

Although we currently have Kibana network monitoring, it can only tell us whether a network request went wrong, whether the user reached the server at a certain time, and whether the data the server returned is correct. If the problem lies in business logic, the client has to record its own logs.
Status Quo
Our project already had a log system, but it had problems from both a business and a technical point of view. From the business side, the existing system required the user to manually export the logs and send them to customer service, which is an unnecessary interruption, and most users will not agree to such a request or go through the export. From the technical side, the existing system's code was messy and its performance so poor that we dared not keep logging continuously online, because it would cause the player to stutter.

Moreover, the existing log system only recorded actively in the debug environment and was disabled online. After a problem occurred online, the user had to turn logging on manually, and it recorded for only three minutes. Precisely because of these problems, nobody was keen to use the logs, and troubleshooting online issues remained difficult.
Design Approach

In response to these problems, I planned a new log system to replace the existing one. Its positioning is very simple: it purely records business logs. Crash reports and analytics events are not recorded; they can be added as future extensions. The system records three kinds of logs: business logs, network logs, and player logs.
For log collection we adopt an active retrieval strategy: enter the user's uid on the log platform, and a retrieval instruction is issued to the specified device through a long-lived connection. After the client receives the instruction, it filters the logs by the given conditions, writes them to separate files split by day, compresses them, and uploads them to the backend.

On the log platform you can then search by the specified conditions and download the files to view the logs. To make it easy for developers to read, the logs retrieved from the database are written to a .txt file, and that file is what gets uploaded.
API design
The API should be simple enough that, to the business layer, calling it feels just like calling NSLog. So the API is implemented with macro definitions; it is called the same way as NSLog, which keeps usage very simple.
#if DEBUG
#define SVLogDebug(frmt, ...) [[SVLogManager sharedInstance] mobileLogContent:(frmt), ##__VA_ARGS__]
#else
#define SVLogDebug(frmt, ...) NSLog(frmt, ##__VA_ARGS__)
#endif
There are three types of logs: business logs, player logs, and network logs. Each corresponds to a different macro, and each macro writes a different type value to the database, so user logs can be filtered by type. A usage sketch follows the list.

- Business log: SVLogDebug
- Player log: SVLogDebugPlayer
- Network log: SVLogDebugQUIC
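Calling the macros then looks just like NSLog; a minimal sketch, with vid as an illustrative variable:

NSString *vid = @"123456";
// Recorded in the database as a business log.
SVLogDebug(@"start playback, vid: %@", vid);
// Recorded in the database as a player log.
SVLogDebugPlayer(@"first frame rendered, vid: %@", vid);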
Eviction Strategy

Writing to the database is not the whole story; we also need an eviction strategy. The strategy has to balance the number of logs kept against their freshness: keep as many logs as possible for troubleshooting, without taking up too much disk space. Therefore, once logs have been uploaded, the uploaded logs are deleted. In addition, there are two eviction rules, with a cleanup sketch after this list.

- Logs are kept for at most three days; logs older than three days are deleted. This check runs on a background thread after the application starts.
- The logs have a maximum size threshold; when it is exceeded, logs are deleted in chronological order, oldest first. We set the threshold to 200MB, and in practice the logs generally stay below it.
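A minimal sketch of the startup cleanup, assuming the WCDB model binding shown later in the table-design section; the table name, manager, and directory-size check are illustrative, and the deleteObjectsFromTable:where: call follows WCDB 1.x's Objective-C API:

- (void)cleanExpiredLogs {
    // Run off the main thread after launch.
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
        // Rule 1: drop logs older than three days.
        NSTimeInterval deadline = [[NSDate date] timeIntervalSince1970] - 3 * 24 * 60 * 60;
        [self.database deleteObjectsFromTable:kSVLogTableName
                                        where:SVLogModel.createTime < deadline];
        // Rule 2: if the log database still exceeds 200MB, delete the oldest
        // logs first until it is back under the threshold (size check omitted).
    });
}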
Record Basic Information

Some key information is also very important when troubleshooting, such as the user's network environment at the time and certain configuration items, since these factors affect how the code executes. So we also record some of the user's configuration information and network environment to aid troubleshooting, but nothing private such as the user's latitude and longitude.
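A small sketch of the kind of environment snapshot recorded alongside the logs, using public UIKit/Foundation APIs; the network field is illustrative and would come from a reachability wrapper of your own:

#import <UIKit/UIKit.h>

NSDictionary *baseInfo = @{
    @"appVersion"    : [[NSBundle mainBundle] objectForInfoDictionaryKey:@"CFBundleShortVersionString"] ?: @"",
    @"systemVersion" : [[UIDevice currentDevice] systemVersion],
    @"deviceModel"   : [[UIDevice currentDevice] model],
    @"network"       : @"wifi",  // illustrative; supplied by a reachability wrapper
};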
Database
Old Plan

The previous log solution was based on DDLog, which has serious performance problems. It writes logs by creating a txt file in the sandbox and writing to it locally through a file handle. After each write it seeks the handle to the end of the file, and the next write continues from there. The log content is written as NSData, which amounts to frequent local file write operations, with one or more handle objects kept alive in memory.

Another problem with this approach is that, because binary data is written directly into a txt file, there is no way to filter the logs or perform similar operations, so extensibility is very poor. That is why we planned to implement the new log solution with a database.
Solution Selection

I compared the mainstream iOS databases, and WCDB has the best overall performance, better than FMDB in several respects. And because it is implemented in C++, code execution carries no extra cost from Objective-C message sending and forwarding.
According to the statistics on WCDB's official website comparing WCDB with FMDB: FMDB is a thin wrapper around SQLite, so it is not much different from using SQLite directly, while WCDB is built on sqlcipher and its overall performance is higher than FMDB's. The following performance comparison comes from WCDB's official documentation.
- A single read: WCDB is about 5% slower than FMDB, measured with continuous reads in a for loop.
- A single write: WCDB is 28% faster than FMDB, measured with continuous writes in a for loop.
- Batch writes show the most obvious gap: WCDB is 180% faster than FMDB, with one batch task writing a batch of data.
The data shows that WCDB is much faster than FMDB at write operations, and writing is exactly what local logging does most, so WCDB fits our needs and is the most suitable choice for the new database. Moreover, the exposure-tracking module in our project already uses WCDB, which proves the solution is feasible and performs well.
Table Design

The table design of our database is very simple, with just the following four fields; different log types are distinguished by type. If you want a new log type, you can extend it in the project. Because a database backs it, extensibility is good. A sketch of the corresponding model binding follows the list.

- index: primary key, used for indexing.
- content: log content, the recorded log text.
- createTime: creation time, when the log was stored in the database.
- type: log type, used to distinguish the three kinds of logs.
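A minimal sketch of the table's model binding, following WCDB 1.x's Objective-C ORM macros (the class name and .mm implementation file are illustrative):

// SVLogModel.h
@interface SVLogModel : NSObject <WCTTableCoding>
@property (nonatomic, assign) NSInteger index;            // primary key
@property (nonatomic, copy) NSString *content;            // log content
@property (nonatomic, assign) NSTimeInterval createTime;  // stored as a timestamp
@property (nonatomic, assign) NSInteger type;             // business / player / network
WCDB_PROPERTY(index)
WCDB_PROPERTY(content)
WCDB_PROPERTY(createTime)
WCDB_PROPERTY(type)
@end

// SVLogModel.mm
WCDB_IMPLEMENTATION(SVLogModel)
WCDB_SYNTHESIZE(SVLogModel, index)
WCDB_SYNTHESIZE(SVLogModel, content)
WCDB_SYNTHESIZE(SVLogModel, createTime)
WCDB_SYNTHESIZE(SVLogModel, type)
WCDB_PRIMARY_AUTO_INCREMENT(SVLogModel, index)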
Database Optimization

We are a video application, involving major features such as playback, download, and upload, all of which record a large number of logs to aid online troubleshooting. So keeping the database from growing too large was something I paid close attention to when designing the log system.

Based on the expected log volume, I ran extensive tests on the playback, download, and upload modules: playing for one day and two nights, downloading a 40-episode TV series, and uploading several high-definition videos, for a cumulative total of about 50,000 log records. I found the database folder had grown past 200MB, which is already quite large, so the database needed optimization.
Looking inside the database folder, there are three files: db, shm, and wal. The problem was mainly that the database's wal log file was too large, while the db file itself was not. So we need to call sqlite3_wal_checkpoint to merge the wal contents into the main database, which shrinks the wal and shm files. But WCDB does not expose a direct checkpoint API; after some investigation I found that closing the database triggers a checkpoint.
So on application exit I listen for the termination notification and do the handling as late as possible. This ensures no logs are missed while still closing the database when the program exits. After verification, the optimized database takes up very little disk space: with 143,987 log records, the database file is 34.8MB, the compressed log is 1.4MB, and the decompressed log is 13.6MB.
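A minimal sketch of this flush-on-exit step, assuming WCDB 1.x's WCTDatabase close API; the observer setup is illustrative:

- (void)setupTerminateObserver {
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(onWillTerminate)
                                                 name:UIApplicationWillTerminateNotification
                                               object:nil];
}

- (void)onWillTerminate {
    // Closing the database triggers a checkpoint, merging wal back into db.
    [self.database close];
}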
wal mode

While we are here, let me talk about wal mode to build a deeper understanding of the database. SQLite introduced wal mode in version 3.7, but it is off by default; the iOS version of WCDB turns wal mode on automatically and applies some optimizations.
The wal file exists to optimize concurrent operations under multithreading. Without it, in the traditional delete mode, database reads and writes are mutually exclusive: to avoid reading half-written data, a read waits until the write completes. The wal file solves concurrent reading and writing, and the shm file is an index into the wal file.

SQLite most commonly uses the delete and wal modes, each with its own strengths. delete mode reads and writes db-pages directly; reads and writes hit the same file, so they are mutually exclusive and concurrency is not supported. wal mode appends new db-pages, so writes are faster and concurrency is supported: reads simply must not touch the db-page that is currently being written.
Since the db-pages that delete mode operates on are discrete, batch writes perform much worse in delete mode, which is why WCDB's batch-write performance is better. A read in wal mode has to read both the db and wal files, which affects read performance to some extent, so wal mode's query performance is worse than delete mode's.
When using wal mode, the number of db-pages in the wal file has to be kept under control; if the page count grows too large, the file size gets out of hand. The wal file does not grow without bound: SQLite's checkpoint mechanism merges the wal file's pages back into the db file. However, that synchronization blocks query operations, so checkpoints cannot run too frequently. WCDB sets a threshold of 1000: once the wal reaches 1000 pages, a checkpoint is executed.

This 1000 is an empirical value from the WeChat team. Too large a threshold hurts read/write performance and takes too much disk space; too small a threshold makes checkpoints run frequently, blocking reads and writes.
#define SQLITE_DEFAULT_WAL_AUTOCHECKPOINT 1000
sqlite3_wal_autocheckpoint(db, SQLITE_DEFAULT_WAL_AUTOCHECKPOINT);

int sqlite3_wal_autocheckpoint(sqlite3 *db, int nFrame){
#ifdef SQLITE_OMIT_WAL
  UNUSED_PARAMETER(db);
  UNUSED_PARAMETER(nFrame);
#else
#ifdef SQLITE_ENABLE_API_ARMOR
  if( !sqlite3SafetyCheckOk(db) ) return SQLITE_MISUSE_BKPT;
#endif
  if( nFrame>0 ){
    sqlite3_wal_hook(db, sqlite3WalDefaultHook, SQLITE_INT_TO_PTR(nFrame));
  }else{
    sqlite3_wal_hook(db, 0, 0);
  }
#endif
  return SQLITE_OK;
}
You can also set a size limit for the journal file. The default is -1, meaning no limit; journalSizeLimit means the part beyond the limit will be overwritten. Try not to modify this value, as it may corrupt the wal file.
i64 sqlite3PagerJournalSizeLimit(Pager *pPager, i64 iLimit){
  if( iLimit>=-1 ){
    pPager->journalSizeLimit = iLimit;
    sqlite3WalLimit(pPager->pWal, iLimit);
  }
  return pPager->journalSizeLimit;
}
Issuing Instructions

Log Platform

Log reporting should be imperceptible to the user: logs should upload automatically without the user's active cooperation. Nor do all users' logs need to be reported; only the logs of users who hit problems are of interest, which also avoids wasting storage resources on the server side. To address both points we built a log platform, which notifies the client to upload logs by issuing upload instructions.

Our log platform is fairly simple: enter a uid to issue an upload instruction to the specified user, and after the client uploads its logs you can also query them by uid. As shown in the figure above, when issuing an instruction you can select the log types and the time interval; after receiving the instruction, the client filters by these parameters, and if nothing is selected the default parameters are used. The same three parameters can also be used when searching.

The log platform is backed by a service. When you click the button to issue an upload instruction, the service delivers a json payload through the long-lived connection channel. The json contains the parameters above and can be extended with other fields later. Logs are uploaded in units of days, so here you can also search by day and click download to preview the log content directly.
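A sketch of how the client might parse such an instruction; the field names and the export helper are assumptions for illustration, not the actual protocol:

// Example payload:
// {"uid":"10086","types":[1,2],"startTime":1619002800,"endTime":1619182800}
- (void)handleUploadInstruction:(NSData *)payload {
    NSDictionary *cmd = [NSJSONSerialization JSONObjectWithData:payload
                                                        options:0
                                                          error:nil];
    NSArray *types = cmd[@"types"] ?: @[ @1, @2, @3 ]; // default: all three log types
    NSTimeInterval start = [cmd[@"startTime"] doubleValue];
    NSTimeInterval end = [cmd[@"endTime"] doubleValue];
    // Query matching logs, write them to per-day txt files, zip, then upload.
    [self exportLogsWithTypes:types startTime:start endTime:end];
}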
Long Connection Channel

For instruction delivery we use the existing long-lived connection. When a user reports a problem, we record the user's uid, and if engineering needs logs to troubleshoot, we issue an instruction through the log platform.

The log platform sends the instruction to the long-connection service backend, and the service delivers it through the long-connection channel. When the instruction reaches the client, the client replies with an ack message telling the channel the instruction was received, and the channel removes the instruction from its queue. If the user does not have the App open at that moment, the instruction is delivered the next time the App starts and establishes a connection with the channel.

Unfinished upload instructions stay in the queue, but for no more than three days, because an instruction older than three days has lost its timeliness; by then the problem has probably been resolved through other means.
Silent Push

If the user has the App open, the log instruction can be delivered through the long-connection channel. But there is another scenario, and it is the most common one: how do we report logs when the App is not running? At the time we also looked into push-based log retrieval; Meituan's scheme included a silent push strategy. However, our investigation found that silent push is of little practical use and only covers a few very narrow scenarios, for example when the App has been killed by the system or is suspended in the background, which are rare for us. We also talked with the push team, who reported that silent pushes have delivery-rate problems, so we did not adopt silent push as a strategy.
Log Upload

Multipart Upload

When designing this part, since the backend cannot display logs directly and they can only be downloaded as files, we agreed with the backend to upload log files in units of days. For example, if the retrieval window starts at 19:00 on April 21 and ends at 21:00 on April 23, the logs are split into three files, one per day: the first day's file contains only logs from 19:00 onward, and the last day's file contains only logs up to 21:00.

Each log file is compressed into a zip file before upload. On the one hand this improves the upload success rate, which drops as file size grows; on the other hand it serves as file segmentation. We observed that after compression each zip stays within 500kb. 500kb is an empirical value we previously used for uploading video slices; it balances upload success rate against the number of slices.

Log files are named with a timestamp, accurate to the minute, to make the server's time filtering easy. The upload is submitted as a form, and the corresponding local logs are deleted once the upload completes. On failure we retry: each segment is uploaded at most three times, and if all three attempts fail, this upload fails and will be retried at some later time.
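A minimal sketch of the per-segment retry policy; uploadData:completion: and removeLocalSegmentForData: stand in for our actual network and file helpers (hypothetical names):

- (void)uploadSegment:(NSData *)zipData attempt:(NSInteger)attempt {
    [self uploadData:zipData completion:^(BOOL success) {
        if (success) {
            // Delete the corresponding local log after a successful upload.
            [self removeLocalSegmentForData:zipData];
        } else if (attempt < 3) {
            // Each segment is retried up to three times.
            [self uploadSegment:zipData attempt:attempt + 1];
        }
        // After three failures, give up; a later retrieval will upload again.
    }];
}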
Security

To keep log data safe, we transmit the upload request over https, but that alone is not enough: the plaintext can still be obtained by intercepting the SSL session. So the transmitted content itself also needs to be encrypted. When choosing an encryption strategy, we adopted symmetric encryption for performance reasons.

However, the symmetric key is delivered via asymmetric encryption, and each upload instruction corresponds to a unique key. The client first compresses the file, encrypts the compressed file, and uploads the ciphertext in segments. After the server receives the encrypted file, it decrypts it with the key to get the zip and decompresses it.
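A sketch of the client-side symmetric encryption step using CommonCrypto's AES; the article does not specify the cipher or mode, so AES-128-CBC with PKCS7 padding is an assumption:

#import <CommonCrypto/CommonCryptor.h>

// Encrypt zipped log data with the key issued for this upload instruction.
static NSData *SVEncryptLog(NSData *zipData, NSData *key, NSData *iv) {
    size_t outLength = zipData.length + kCCBlockSizeAES128;
    NSMutableData *cipher = [NSMutableData dataWithLength:outLength];
    size_t numBytesEncrypted = 0;
    CCCryptorStatus status = CCCrypt(kCCEncrypt, kCCAlgorithmAES, kCCOptionPKCS7Padding,
                                     key.bytes, kCCKeySizeAES128, iv.bytes,
                                     zipData.bytes, zipData.length,
                                     cipher.mutableBytes, outLength, &numBytesEncrypted);
    if (status != kCCSuccess) return nil;
    cipher.length = numBytesEncrypted;
    return cipher;
}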
Proactive Reporting

After the new log system went online, we found the retrieval success rate was only 40%, because some users could no longer be reached after the App was removed. Analyzing how users report problems to us, there are two main paths: one is that the user files feedback from the system settings and, after talking to customer service, engineering steps in to troubleshoot; the other is that users complain in the App Store review area and operations follows up.

Both paths are suitable for log retrieval, but the first one has a specific trigger condition: the user taps into the feedback interface. So for users reporting problems through that path, we added proactive reporting: when the user taps feedback, the client proactively uploads the last three days of logs, with the current time as the end point. This raised the log-reporting success rate to 90%. With the higher success rate, more people are encouraged to adopt the log module, making online troubleshooting easier.
Manual Export

Logs can also be uploaded through manual export: enter the debugging page through some entry point, select the relevant log on the debugging page, and bring up the system share panel to share it through the channel of your choice. Before the new log system, this was exactly how users manually exported logs to customer service; you can imagine how disruptive that was for users.

The manual export path still exists, but it is now only used in the debug stage for developers to export logs and troubleshoot; online, no manual action by the user is required.
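A minimal sketch of the share step on the debugging page, using the system share sheet; logZipPath is an illustrative path to the exported archive:

NSURL *logURL = [NSURL fileURLWithPath:logZipPath];
UIActivityViewController *shareVC =
    [[UIActivityViewController alloc] initWithActivityItems:@[ logURL ]
                                      applicationActivities:nil];
[self presentViewController:shareVC animated:YES completion:nil];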