
Why do we need application monitoring?

Have you ever had this experience:

A few weeks after a new product feature went live, a customer filed a ticket reporting a problem. After investigating, the developers confirmed that it was a bug, and one that had been generating dirty data. In the end, fixing the bug and releasing the fix took more than a day, while writing the repair script and cleaning up the data took a week.

When the bug is discovered

The earlier a bug is found, the lower the cost of fixing it.

By monitoring the application and automatically reporting errors and anomalies, application monitoring helps us find hidden problems as early as possible, improving both product quality and R&D efficiency.

A logging system is not the same as an application monitoring system

Some may say: program errors and exceptions already show up in our logging system. Why do we need a dedicated application monitoring system?

Indeed, logs record the running process and exception details at length. However, that information tends to be duplicated, irrelevant, or disconnected from related events. Moreover, logs are mainly consulted when developers troubleshoot a known problem and are rarely used for proactive monitoring and alerting, so a large number of error messages go unnoticed and unhandled.

Sentry: A popular application monitoring product

Sentry is an open source application monitoring product built with Python, JavaScript, HTML, and CSS. It has 29k stars on GitHub, the most of any open source project in the application monitoring field. Its official website claims that 1 million developers and 70,000 organizations use Sentry. Besides the open source product, the company behind it offers a paid SaaS service: sentry.io. In 2021 the company announced a 60 million U.S. dollar Series D round, which brought Sentry's total funding to 127 million U.S. dollars at a post-money valuation of 1 billion U.S. dollars. It is certainly a product worth watching.

Sentry has the following important features:

  • Polished product experience and comprehensive functionality

    sentry-demo

  • Low integration effort

    Official and community-maintained SDKs cover the mainstream development languages and frameworks, so integration is easy; most can be set up in a few dozen lines of code.

    SDKs

  • Focus on errors, exceptions, and crashes

    You can view the specific error information and call stack, and quickly locate the problem code.

    js-error

  • Rich contextual information

    The SDK automatically reports basic information and also supports custom information to aid troubleshooting.

    context

  • Automatic merging of duplicate issues

    Repeated errors are automatically merged and counted, so developers do not have to hunt for clues to a bug in a pile of redundant information.

    issues

  • Proactive email alerts

    No need to wait for a customer to raise the alarm before troubleshooting starts.

Disadvantages of self-deployed Sentry

  • Many deployment dependencies

    Using the official GitHub repository, which is based on Docker and Docker Compose, Sentry can indeed be deployed with one command and used out of the box. Still, seeing the 30 containers listed in front of you gives you pause.

    containers

  • High availability must be ensured

    As noted above, Sentry relies on many components, such as ZooKeeper, Nginx, Redis, Memcached, Kafka, PostgreSQL, and ClickHouse. Operating and maintaining them yourself while keeping them highly available is not easy.

Avoid an avalanche caused by Sentry

Introducing a new technology or tool always adds some complexity and operational risk to the system.

We once had a serious incident: a service handling an average of 30 million API requests per day failed, and a flood of error events hit the Sentry server, severely delaying its responses. Its Redis queue nearly ran out of memory, and Nginx answered every request with 504 Gateway Timeout. Because the failing service sent its requests to Sentry synchronously and without a timeout, those HTTP requests blocked, which dragged the service down even further.

To avoid this kind of problem, we follow these practices:

  • Ensure high availability of the Sentry server

    This is the most important point, but in practice we have not done it well. Our self-deployed Sentry is currently a single point with no clustering or redundancy. Achieving high availability would cost more money, possibly more than paying for the Sentry SaaS service. And because Sentry does not officially operate in China, HTTP requests to overseas servers are slow, so the official SaaS service is not necessarily a good choice either.

  • Set a timeout

    When using the Sentry SDK, be sure to set a timeout for requests sent to the Sentry server. We recommend a value below 3 seconds.

  • Set sample_rate

    When using the Sentry SDK, you can set a sampling rate: 0.00 sends no events and 1.00 sends all events. We recommend starting with a small value and then adjusting it according to the application's page views (PV). Sampling has one downside: sporadic errors may never be reported and so go undetected.

  • Trip the fuse in time

    If the Sentry server is overwhelmed, stop the application from sending it further requests. For example, you can manually set the sampling rate to 0.00.

  • Send requests asynchronously (async)

    If the SDK supports sending requests asynchronously, use that mode to avoid synchronous blocking.

  • Isolate the production Sentry

    Our operations colleagues deployed two isolated Sentry instances: a trial instance used by applications in the development, test, and pre-release environments, and a formal instance used by applications in the production and private-deployment environments. When we want to try out a Sentry feature or adjust its configuration, we do it on the trial instance first, and only after confirming there is no problem do we change the production instance. This keeps the production Sentry stable.

  • Buffer concurrent requests to Sentry with a queue

    Suppose the application's request volume and concurrency are huge and, during a serious failure, every request produces an error. Even with a low sampling rate in the SDK (for example, 0.01), the concurrent requests to Sentry may still exceed what it can handle. To avoid this, we tried the following in our highest-traffic service: we added a queue, push the service's error events onto it first, and run a small number of consumer processes that drain the queue and report the errors to the Sentry server at a slow, steady pace. The application also handles the overflow case: when the queue is full, error events are simply discarded, so normal business is unaffected. In practice this buffering approach has proved very effective, but it also increases the work of integrating Sentry, so weigh it for your own case.
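Several of the practices above (sampling, tripping the fuse, and buffering through a bounded queue that discards events when full) can be combined in one small wrapper. The following is only a minimal sketch under stated assumptions, not the code we actually run: `BufferedReporter`, its parameters, and the `send` callable are hypothetical names, the queue lives in the application process and is drained by a thread rather than by separate consumer processes, and `send` stands for whatever actually delivers one event to Sentry (ideally with a short timeout).

```python
import queue
import random
import threading
import time


class BufferedReporter:
    """Hypothetical helper: samples, buffers, and forwards error events."""

    def __init__(self, send, sample_rate=1.0, max_queue=1000,
                 failure_threshold=5, cooldown=30.0):
        self._send = send                    # callable that delivers one event
        self._sample_rate = sample_rate      # 0.0 sends nothing, 1.0 sends all
        self._queue = queue.Queue(maxsize=max_queue)
        self._failure_threshold = failure_threshold
        self._cooldown = cooldown            # seconds the fuse stays open
        self._failures = 0                   # consecutive delivery failures
        self._open_until = 0.0               # fuse is open until this time
        self.dropped = 0                     # approximate count of discarded events

    def capture(self, event):
        """Called on the request path; never blocks on the network."""
        if random.random() >= self._sample_rate:
            return False                     # sampled out
        if time.monotonic() < self._open_until:
            self.dropped += 1                # fuse is open: discard
            return False
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1                # queue is full: discard
            return False

    def drain_once(self):
        """Deliver one queued event; returns False if the queue was empty."""
        try:
            event = self._queue.get_nowait()
        except queue.Empty:
            return False
        try:
            self._send(event)
            self._failures = 0
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Too many consecutive failures: stop accepting events for a while.
                self._open_until = time.monotonic() + self._cooldown
                self._failures = 0
        return True

    def start_worker(self, idle_sleep=0.1):
        """Start one background consumer that drains the queue."""
        def loop():
            while True:
                if not self.drain_once():
                    time.sleep(idle_sleep)

        worker = threading.Thread(target=loop, daemon=True)
        worker.start()
        return worker
```

In this sketch the request path only ever calls `capture`, which returns immediately, so a slow or unavailable Sentry server cannot block business requests; the cost is that events are lost whenever the queue is full or the fuse is open.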

Sentry tips

  • Report the environment

    filter-env

    In each environment, configure the SDK with a distinct identifier, such as Development, Test, Release, Production, or Privatisation, so that problems are easy to identify and filter.

  • Custom tags

    The SDK automatically reports some basic tags, and we can add custom tags (for example tenant, project, and other business information) to help troubleshoot problems.

    tags

    Tags can be used for filtering:

    filter-tag

    Tags can be used for statistics:

    tag-stats

  • Automatically mark issues as resolved

    Some bugs have been fixed and released, but developers rarely remember to mark them as resolved in Sentry by hand. Other problems need no action at all, such as exceptions from third-party services, and nobody is likely to mark those manually either. With the "Auto Resolve" feature, an issue that has not recurred for a long time is automatically marked as resolved, which is very convenient.

    auto-resolve

  • Merge issues

    Sentry automatically identifies and merges most duplicate issues, but there are occasional exceptions. For example, if the error message contains random content, Sentry may treat each occurrence as a different error and not merge them, so the same problem keeps triggering alert emails, which is very annoying. "Fingerprint Rules" let you force a fingerprint onto similar errors so that they are merged.

    fingerprint-rules

  • Identify and deal with real problems to avoid "crying wolf"

    Don't throw exceptions or log errors at every turn. Take "The format of the file you uploaded is incorrect. Please upload the file in the correct format as required." This is really a normal business reminder. Reporting it to Sentry as an error makes little sense, because nobody will act on it. If such "noise" accumulates, it dulls developers' sensitivity to real problems. When the inbox fills with false "wolf" alarms every day, we may fail to act when the wolf really comes, and an incident follows. When you hear a false cry of "wolf", the right response is to silence the source rather than cover your ears. For example, change the code so that it does not throw an exception, or downgrade the error to a warning. In short, keep it from being reported to Sentry so that it does not interfere with identifying real problems.
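The last two tips can also be approximated on the client side, before an event ever leaves the application. The sketch below is an assumption-laden illustration rather than a replacement for the "Fingerprint Rules" setting described above: `before_send(event, hint)` follows the shape of the hook that the Sentry Python SDK accepts through its `before_send` option, while `IGNORED_TYPES`, `InvalidUploadFormat`, and `normalize` are hypothetical names, and the event layout is assumed to carry exceptions under `event["exception"]["values"]`.

```python
import re

# Hypothetical: exception type names the team treats as business reminders.
IGNORED_TYPES = {"InvalidUploadFormat"}

# Long hexadecimal ids and runs of digits are treated as "random content".
_VOLATILE = re.compile(r"\b[0-9a-f]{8,}\b|\d+")


def normalize(message):
    """Replace numbers and long hex ids with a fixed placeholder."""
    return _VOLATILE.sub("<n>", message)


def before_send(event, hint):
    """Drop business reminders and give similar errors one fingerprint."""
    values = event.get("exception", {}).get("values") or []
    if values:
        last = values[-1]  # assumed to be the exception that was raised last
        if last.get("type") in IGNORED_TYPES:
            return None    # returning None discards the event
        event["fingerprint"] = [
            last.get("type") or "",
            normalize(last.get("value") or ""),
        ]
    return event
```

With this hook, two errors that differ only in an order number or a request id share one fingerprint and are grouped together, and events of the ignored types never reach Sentry. Fixing the code so that it no longer raises such exceptions, as suggested above, remains the cleaner option.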

Sentry also provides many other useful features, such as "performance analysis", "breadcrumbs", and "suspect commits", which are worth exploring and using.

Our department's use of Sentry over the past six months

  • 9 applications or services have been integrated
  • Dozens of hidden problems identified in total (after deduplication)
  • 3 services have reached zero open issues
  • Twice identified anomalies at a third-party service provider and reported them promptly to the provider
  • Twice caught a failed release in time and handled it urgently

Summary

Application monitoring lets us proactively discover hidden problems and improve product quality. Sentry is a popular open source application monitoring product with rich and useful features. In adopting it, we also took a number of measures to avoid the negative impact of introducing it. We have accumulated some experience along the way and achieved good results, which we share here. I wish you all fewer bugs and higher efficiency.


Mr_Jing