The decentralized nature of the microservice architecture lets each business team easily build highly autonomous services. The projects I work on today are split into dozens of microservices, owned by different teams, and these services call each other. The benefits of microservices go without saying, but they also bring problems, such as:
- QA reports that an interface owned by my service is failing, yet often the bug is not in my code but in a downstream service it depends on, so I have to go find the team responsible for that service. When an interface fails, how do you immediately locate which service has the problem?
- The leader asks to improve the response time of an interface, but that interface also calls several downstream service interfaces. How do you locate the performance bottleneck?
Scenarios like these keep coming up. As the number of microservices grows, the importance of full-link logs becomes more and more apparent. The most common solution at present is Spring Cloud Sleuth. This article explains how to use it, analyzes how it is implemented, and then explores custom solutions built on the same principles.
1. Sleuth
1.1. Concept
Spring Cloud Sleuth tracks a request across the microservices it touches and records the latency, exceptions, etc. of each hop. The information recorded by Sleuth needs a system to collect, store, and query it. There are many tools on the market, such as Zipkin, SkyWalking, and Pinpoint. Taking the simplest, Zipkin, as an example: the tracing information collected by Spring Cloud Sleuth can be sent to Zipkin, which provides a graphical interface for viewing the call relationships between services and how long each call takes.
Let's first look at some basic concepts in Spring Cloud Sleuth.
- service : the service name; it defaults to spring.application.name.
- span : literally a "span". For example, service1 calls service2; that round trip counts as one span. Each span is identified by a unique spanId.
- trace : literally a "trace". It consists of a series of spans. A trace usually starts when the first service receives a request, at which point a unique traceId is generated. All subsequent spans reuse the same traceId, so the entire call chain can be identified unambiguously.
- annotation : marks an event during tracing and records when it occurred. By recording when several events happen, the elapsed time between any two of them is known. Spring Cloud Sleuth records the following events by default:
(1) Client Sent : the time the client sends the request; this is also the starting point of a span.
(2) Server Received : the time the server receives the request; subtracting the Client Sent time gives the network latency from client to server.
(3) Server Sent : the time the server finishes processing and sends the response; subtracting the Server Received time gives the server-side business-logic processing time.
(4) Client Received : the time the client receives the server's response, completing the request; subtracting the Client Sent time gives the total time from initiating the request to receiving the response.
(5) Tag : key/value information that can be attached to a span.
1.2. Logs
The traceId and spanId from Spring Cloud Sleuth's trace context are automatically added to SLF4J's MDC. So if the logging framework you use follows the SLF4J specification, you can use the traceId and spanId variables in your logs.
Spring's default logging framework is Logback. Taking logback-spring.xml as an example, add the variables %X{traceId} and %X{spanId} to the pattern to read the traceId and spanId values from the MDC.
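For example, a minimal console pattern carrying both values might look like this (a sketch; the %X{traceId:-} form prints an empty string when the key is absent from the MDC):
<appender name="console" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level [%X{traceId:-},%X{spanId:-}] %thread %logger{39} : %msg%n</pattern>
    </encoder>
</appender>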
X-B3-* has been removed
Note that some examples online use %X{X-B3-TraceId} and %X{X-B3-SpanId} instead of the values above. After stepping into this pit myself, I found that since Spring Cloud Sleuth 3.0.0 (the version line matching Spring Cloud 2020.0.0), the X-B3-* style has been removed. You may still see those names elsewhere, but they are no longer in the MDC.
LOG_LEVEL_PATTERN
Some readers will say: I introduced Sleuth without creating a logback-spring.xml or configuring values such as traceId and spanId, or I created logback-spring.xml but did not put the traceId and spanId variables in the pattern. Why are they still printed in the log, like this:
2022-02-20 23:01:28.662 INFO [a-service,d0a7a4be9e9f3ab4,cb4cb09a83e0c553] 20570 --- [nio-8081-exec-1] o.s.web.servlet.DispatcherServlet : Initializing Servlet 'dispatcherServlet'
The reason is that when the project has no logback-spring.xml, the default pattern contains the LOG_LEVEL_PATTERN variable, similar to the following logback-spring.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <include resource="org/springframework/boot/logging/logback/defaults.xml"/>
    <property name="CONSOLE_LOG_PATTERN"
              value="%d{yyyy-MM-dd HH:mm:ss.SSS} %clr(${LOG_LEVEL_PATTERN:-%5p}) %magenta(${PID:- }) --- [%15.15t] %cyan(%-40.40logger{39}) : %m%n${LOG_EXCEPTION_CONVERSION_WORD:-%wEx}"/>
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>DEBUG</level>
        </filter>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>${CONSOLE_LOG_PATTERN}</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>
    <root level="INFO">
        <appender-ref ref="console"/>
    </root>
</configuration>
In the Sleuth framework, an extension class replaces the LOG_LEVEL_PATTERN variable with the value %5p [${spring.zipkin.service.name:${spring.application.name:}},%X{traceId:-},%X{spanId:-}]. Therefore, even if %X{traceId} and the like are never configured, they are still displayed.
TraceEnvironmentPostProcessor.java
public void postProcessEnvironment(ConfigurableEnvironment environment, SpringApplication application) {
    Map<String, Object> map = new HashMap<>();
    if (Boolean.parseBoolean(environment.getProperty("spring.sleuth.enabled", "true"))) {
        map.put("logging.pattern.level",
                "%5p [${spring.zipkin.service.name:${spring.application.name:}},%X{traceId:-},%X{spanId:-}]");
    }
    this.addOrReplace(environment.getPropertySources(), map);
}
2. MDC
MDC (Mapped Diagnostic Context) translates as "mapped diagnostic context": a request ID (requestId) mapped into the log, which can serve as a keyword (context) for us to locate (diagnose) problems.
2.1. Basics
With the MDC tool, as long as put and remove calls are planted in an interceptor or aspect, all logs of a request can be quickly filtered by its unique requestId when a problem needs locating. In addition, when a client request arrives, you can store the client id, IP, request parameters, and other information in the MDC and reference them in logback.xml; that information is then automatically attached to every log entry.
The basic principle of the MDC class is very simple: it internally holds a ThreadLocal instance to store context data and exposes a few core methods, such as put/get/clear, to operate on it. A key stored in that ThreadLocal map can be declared in logback.xml by putting %X{key} in the layout, which prints the value saved in the MDC under that key.
For example, define the filter below:
public class TraceFilter implements Filter {

    private static final Logger log = LoggerFactory.getLogger(TraceFilter.class);

    @Override
    public void init(FilterConfig filterConfig) throws ServletException {
        log.info("<<< TraceFilter initialized >>>");
    }

    @Override
    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException {
        try {
            final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest;
            String requestURI = httpServletRequest.getRequestURI();
            MDC.put("requestURI", requestURI);
            // pass the request down the filter chain
            filterChain.doFilter(servletRequest, servletResponse);
        } finally {
            // clear this request's MDC content
            MDC.clear();
        }
    }

    @Override
    public void destroy() {
        log.info("<<< TraceFilter destroyed >>>");
        MDC.clear();
    }
}
Configure logback-spring.xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<layout>
<Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %highlight(%-5level) [%X{requestURI}] %thread %logger Line:%-3L - %msg%n </Pattern>
</layout>
</appender>
There are some issues to watch out for when using MDC, and they are usually caused by ThreadLocal: for example, the MDC data must be cleared before a thread exits, and when MDC is used with a thread pool, the data must be cleared before the worker thread finishes its task; the MDC.clear() method does this.
SLF4J's MDC mechanism is implemented on ThreadLocal internally, so a request ID set via MDC.put() is valid only in the current thread, and MDC data set in the parent thread cannot be read in its child threads (or in a thread pool). How, then, does the parent thread pass MDC data to a child thread? The official advice:
- Before the parent thread creates a child thread, call MDC.getCopyOfContextMap() to take out the MDC content and pass it to the child thread.
- Before executing its task, the child thread calls MDC.setContextMap() to install the parent thread's MDC content.
For example, with the following thread pool:
public class ExecutorMDCTest {

    private static final AtomicInteger num = new AtomicInteger(1);

    static ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(
            5,
            5 + 1,
            60L,
            TimeUnit.SECONDS,
            new LinkedBlockingDeque<>(1_000), new ThreadFactory() {
        @Override
        public Thread newThread(Runnable r) {
            return new Thread(r, "test-thread" + num.getAndIncrement());
        }
    });

    public static void main(String[] args) throws Exception {
        Logger log = LoggerFactory.getLogger(ExecutorMDCTest.class);
        MDC.put("bl-traceid", "988f54f4bdd34920a53d908827a9fa59");
        Map<String, String> copyOfContextMap = MDC.getCopyOfContextMap();
        // thread pool
        threadPoolExecutor.execute(() -> {
            // this log line loses the MDC data
            log.info("test log 4: {}", "data");
            MDC.setContextMap(copyOfContextMap);
            log.info("after setContextMap - test log 5: {}", "data");
        });
    }
}
Configure logback-spring.xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<layout>
<Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %highlight(%-5level) [%X{bl-traceid}] %thread %logger Line:%-3L - %msg%n </Pattern>
</layout>
</appender>
2.2. MDC in SLF4J
The internal implementation of the SLF4J MDC is simple: a singleton resolves the concrete MDC implementation, and the public methods validate their arguments before delegating to an MDCAdapter. MDCAdapter is an interface; when the logging framework is Logback, its implementation is LogbackMDCAdapter, so that remains the core implementation class. The code is listed below:
public class LogbackMDCAdapter implements MDCAdapter {

    final ThreadLocal<Map<String, String>> copyOnThreadLocal = new ThreadLocal<>();

    private static final int WRITE_OPERATION = 1;
    private static final int MAP_COPY_OPERATION = 2;

    final ThreadLocal<Integer> lastOperation = new ThreadLocal<>();

    public LogbackMDCAdapter() {
    }

    private Integer getAndSetLastOperation(int op) {
        Integer lastOp = this.lastOperation.get();
        this.lastOperation.set(op);
        return lastOp;
    }

    private boolean wasLastOpReadOrNull(Integer lastOp) {
        return lastOp == null || lastOp == MAP_COPY_OPERATION;
    }

    private Map<String, String> duplicateAndInsertNewMap(Map<String, String> oldMap) {
        Map<String, String> newMap = Collections.synchronizedMap(new HashMap<>());
        if (oldMap != null) {
            synchronized (oldMap) {
                newMap.putAll(oldMap);
            }
        }
        this.copyOnThreadLocal.set(newMap);
        return newMap;
    }

    public void put(String key, String val) throws IllegalArgumentException {
        if (key == null) {
            throw new IllegalArgumentException("key cannot be null");
        } else {
            Map<String, String> oldMap = this.copyOnThreadLocal.get();
            Integer lastOp = this.getAndSetLastOperation(WRITE_OPERATION);
            if (!this.wasLastOpReadOrNull(lastOp) && oldMap != null) {
                oldMap.put(key, val);
            } else {
                Map<String, String> newMap = this.duplicateAndInsertNewMap(oldMap);
                newMap.put(key, val);
            }
        }
    }

    public void remove(String key) {
        if (key != null) {
            Map<String, String> oldMap = this.copyOnThreadLocal.get();
            if (oldMap != null) {
                Integer lastOp = this.getAndSetLastOperation(WRITE_OPERATION);
                if (this.wasLastOpReadOrNull(lastOp)) {
                    Map<String, String> newMap = this.duplicateAndInsertNewMap(oldMap);
                    newMap.remove(key);
                } else {
                    oldMap.remove(key);
                }
            }
        }
    }

    public void clear() {
        this.lastOperation.set(WRITE_OPERATION);
        this.copyOnThreadLocal.remove();
    }

    public String get(String key) {
        Map<String, String> map = this.copyOnThreadLocal.get();
        return map != null && key != null ? map.get(key) : null;
    }

    public Map<String, String> getPropertyMap() {
        this.lastOperation.set(MAP_COPY_OPERATION);
        return this.copyOnThreadLocal.get();
    }

    public Set<String> getKeys() {
        Map<String, String> map = this.getPropertyMap();
        return map != null ? map.keySet() : null;
    }

    public Map<String, String> getCopyOfContextMap() {
        Map<String, String> hashMap = this.copyOnThreadLocal.get();
        return hashMap == null ? null : new HashMap<>(hashMap);
    }

    public void setContextMap(Map<String, String> contextMap) {
        this.lastOperation.set(WRITE_OPERATION);
        Map<String, String> newMap = Collections.synchronizedMap(new HashMap<>());
        newMap.putAll(contextMap);
        this.copyOnThreadLocal.set(newMap);
    }
}
It is worth noting that in early versions of the LogbackMDCAdapter class, copyOnThreadLocal was defined as an InheritableThreadLocal, and many materials online still describe it that way. Compared with ThreadLocal, InheritableThreadLocal has this characteristic: InheritableThreadLocal<T> extends ThreadLocal<T>, and its data (a ThreadLocal.ThreadLocalMap) is stored in the Thread's inheritableThreadLocals field. When the current (parent) thread starts a new thread and its inheritableThreadLocals field is populated, the child thread copies the values the parent holds there.
However, with InheritableThreadLocal defined directly in LogbackMDCAdapter, every child thread makes a copy by default, which hurts performance. In the latest versions it has been changed back to ThreadLocal.
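To see the difference in isolation, here is a minimal sketch using only the JDK (not Sleuth or Logback code). Note that the copy happens when the child thread is created, which is exactly why InheritableThreadLocal does not help with thread pools, whose threads are created once and then reused:
public class InheritableThreadLocalDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadLocal<String> plain = new ThreadLocal<>();
        ThreadLocal<String> inheritable = new InheritableThreadLocal<>();
        plain.set("parent-value");
        inheritable.set("parent-value");

        Thread child = new Thread(() -> {
            // the plain ThreadLocal value is invisible in the child thread
            System.out.println("plain: " + plain.get());             // null
            // the InheritableThreadLocal value was copied at thread creation
            System.out.println("inheritable: " + inheritable.get()); // parent-value
        });
        child.start();
        child.join();
    }
}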
3. Sleuth principles
3.1. HTTP tracing
To implement request link tracing, when a request enters the distributed system, the tracing framework only needs to create a unique identifier for it and ensure the identifier circulates through the system until the request returns. That identifier is the traceId; through it, the logs of different service calls can be strung together.
To measure the latency of each processing unit (application service), the start and end of processing within a service are also marked with a unique identifier: the spanId. Every span has two nodes, start and end; from the timestamps of those two points, the span's latency can be calculated.
In an HTTP call chain, the traceId is kept flowing by passing it (and related fields) in the HTTP request headers of each service-to-service call. You can try it yourself by printing all request headers in the downstream method of an ordinary Feign call:
@GetMapping("/hello")
public String hello(@RequestParam String name, @RequestHeader Map<String, String> headers) {
log.info("headers:{}", headers);
}
The log reads as follows and contains the parameters x-b3-traceid, x-b3-spanid, x-b3-parentspanid, and x-b3-sampled=1:
headers:{x-b3-traceid=6dd57dc2ad55c58f, x-b3-spanid=b5e6de658b261ac7, x-b3-parentspanid=6dd57dc2ad55c58f, x-b3-sampled=1, accept=*/*, user-agent=Java/1.8.0_202, host=localhost:8081, connection=keep-alive}
When the downstream service receives the call, Sleuth's LazyTracingFilter obtains the span data from the headers of the HTTP request. If attributes such as x-b3-traceid and x-b3-spanid are present, the previous node in the call chain has already created the span and passed it down, so that span data is used directly; otherwise, a new span is created.
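This is not Sleuth's actual implementation, but a minimal sketch of that decision (the class name SimpleTraceFilter and the id generation are made up for illustration; the header names follow the B3 convention shown above):
import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class SimpleTraceFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        // reuse the upstream trace id if the previous node passed one down
        String traceId = request.getHeader("x-b3-traceid");
        if (traceId == null || traceId.isEmpty()) {
            // otherwise this service is the entry point: start a new trace
            traceId = UUID.randomUUID().toString().replace("-", "").substring(0, 16);
        }
        // every hop gets a fresh span id
        String spanId = UUID.randomUUID().toString().replace("-", "").substring(0, 16);
        MDC.put("traceId", traceId);
        MDC.put("spanId", spanId);
        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.clear();
        }
    }
}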
3.2. Multi-threaded tracing
As mentioned earlier, MDC is implemented on ThreadLocal, which means the traceId, spanId, and other data stored by the current thread are not passed on to child threads. That clearly breaks our expectation of tracing the complete link by a single traceId.
Still based on the previous example, log from an asynchronous task in the downstream method:
@Resource
private Executor asyncExecutor;

@GetMapping("/hello")
public String hello(@RequestParam String name, @RequestHeader Map<String, String> headers) {
    log.info("headers:{}", headers);
    CompletableFuture.runAsync(() -> this.syncPrint(), asyncExecutor);
    log.info("exec:{}", asyncExecutor.toString());
    return "hello " + name;
}

void syncPrint() {
    log.info("async print");
}

// ... bean defined in a configuration class
@Bean("asyncExecutor")
public Executor getAsyncExecutor() {
    ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
    threadPoolTaskExecutor.setCorePoolSize(7);
    threadPoolTaskExecutor.setMaxPoolSize(42);
    threadPoolTaskExecutor.setQueueCapacity(11);
    threadPoolTaskExecutor.setThreadNamePrefix("MyExecutor-");
    threadPoolTaskExecutor.initialize();
    return threadPoolTaskExecutor;
}
The result printed is:
2022-02-28 23:14:40.711 INFO [a-service,14b328adfec294a8,739f0082a7cbaa19] 23400 --- [nio-8081-exec-1] p.k.aservice.controller.HelloController : headers:{x-b3-traceid=14b328adfec294a8, x-b3-spanid=739f0082a7cbaa19, x-b3-parentspanid=14b328adfec294a8, x-b3-sampled=1, accept=*/*, user-agent=Java/1.8.0_202, host=localhost:8081, connection=keep-alive}
2022-02-28 23:14:40.718 INFO [a-service,14b328adfec294a8,739f0082a7cbaa19] 23400 --- [nio-8081-exec-1] p.k.aservice.controller.HelloController : exec:org.springframework.cloud.sleuth.instrument.async.LazyTraceThreadPoolTaskExecutor@72864083
2022-02-28 23:14:40.720 INFO [a-service,14b328adfec294a8,060fcae6ba359b01] 23400 --- [ MyExecutor-1] p.k.aservice.controller.HelloController : async print
In fact, I ran a few more experiments, and the final printed result was the same:
- If the asyncExecutor bean is not registered in the code, Spring's default thread pool is used directly, and the printed result is still the same.
- If the code does not use CompletableFuture.runAsync to run the asynchronous method, but instead puts the method in another class and calls it through @Async, the printed result is still the same.
To find out why, start from the result printed by asyncExecutor.toString(): org.springframework.cloud.sleuth.instrument.async.LazyTraceThreadPoolTaskExecutor@72864083. Searching the source code shows that Sleuth registers an auto-configuration that proxies the thread pool beans in the Spring container as LazyTraceThreadPoolTaskExecutor, and this framework wrapper around the custom thread pool is what propagates the traceId and related information.
The proxying is switched by the configuration parameter spring.sleuth.async.enabled, whose default value is true, so it is on by default. You can also set spring.sleuth.async.ignored-beans for custom thread pool beans that you do not want proxied this way, as shown below.
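For example, in application.properties (the bean names here are hypothetical):
# proxying of thread pool beans is on by default
spring.sleuth.async.enabled=true
# comma-separated bean names that should NOT be wrapped
spring.sleuth.async.ignored-beans=myPlainExecutor,reportExecutor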
However, if the asynchronous call in the code above is changed to CompletableFuture.runAsync(() -> this.syncPrint()); (with no executor argument), the printed result is:
2022-02-28 23:14:40.715 INFO [a-service,,] 23400 --- [onPool-worker-1] p.k.aservice.controller.HelloController : async print
The reason is that no thread pool bean from the Spring container is used here; the JDK's ForkJoinPool.commonPool() is used by default, and it is not proxied.
Passing the context with a custom thread pool
If you don't use the framework's LazyTraceThreadPoolTaskExecutor and have turned the switch off in the configuration file, how can a custom thread pool achieve the same propagation? You can use the thread pool's decorator mechanism (TaskDecorator):
@Bean("customExecutor")
public Executor getAsyncExecutor() {
final RejectedExecutionHandler rejectedHandler = new ThreadPoolExecutor.CallerRunsPolicy() {
@Override
public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
log.warn("LOG:线程池容量不够,考虑增加线程数量,但更推荐将线程消耗数量大的程序使用单独的线程池");
super.rejectedExecution(r, e);
}
};
ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
threadPoolTaskExecutor.setCorePoolSize(properties.getCorePoolSize());
threadPoolTaskExecutor.setMaxPoolSize(properties.getMaxPoolSize());
threadPoolTaskExecutor.setQueueCapacity(properties.getQueueCapacity);
threadPoolTaskExecutor.setKeepAliveSeconds(properties.getKeepAliveSeconds());
threadPoolTaskExecutor.setRejectedExecutionHandler(rejectedHandler);
threadPoolTaskExecutor.setThreadNamePrefix("Custom Executor-");
threadPoolTaskExecutor.setTaskDecorator(runnable -> {
try {
Optional<RequestAttributes> requestAttributesOptional = ofNullable(RequestContextHolder.getRequestAttributes());
Optional<Map<String, String>> contextMapOptional = ofNullable(MDC.getCopyOfContextMap());
return () -> {
try {
requestAttributesOptional.ifPresent(RequestContextHolder::setRequestAttributes);
contextMapOptional.ifPresent(MDC::setContextMap);
runnable.run();
} finally {
MDC.clear();
RequestContextHolder.resetRequestAttributes();
}
};
} catch (Exception e) {
return runnable;
}
});
return threadPoolTaskExecutor;
}
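Submitted from a request thread, MDC entries such as traceId then survive into the task; for example (a hypothetical caller inside a controller):
@Resource
private Executor customExecutor;

// inside a handler method: the decorator copies the MDC and request attributes
CompletableFuture.runAsync(() -> log.info("still has traceId: {}", MDC.get("traceId")), customExecutor);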
3.3. Using it with Zipkin
Generally speaking, the Zipkin server collects the Sleuth data from clients and provides a visual interface for querying and displaying it.
Here we simply start a Zipkin server with Docker:
docker run -d \
--name zipkin \
--restart=on-failure:3 \
-p 9411:9411 \
openzipkin/zipkin
Introduce the Zipkin client dependency in pom.xml:
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
Configure the zipkin server address in application.properties:
spring.zipkin.base-url=http://localhost:9411
That's it. After starting a project that includes Sleuth, visit http://localhost:9411 in the browser to open the Zipkin interface, where you can query the link logs.
Of course, a real Zipkin deployment is not this simple: it needs to sample the Sleuth data, and it is often combined with an MQ. Those details are not the focus of this article, and few companies choose Zipkin these days, but two of the relevant knobs are shown below.
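For reference, the corresponding Sleuth/Zipkin configuration properties (values are illustrative; the rabbit sender additionally requires a RabbitMQ dependency and connection settings):
# report all traces; production systems usually sample (the default rate is 0.1)
spring.sleuth.sampler.probability=1.0
# ship spans to Zipkin through RabbitMQ instead of HTTP
spring.zipkin.sender.type=rabbit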
4. Supplementary solution ideas
Sleuth's solution for full-link logging is fairly clear. Looking back at the opening questions:
- To troubleshoot a problem now, the first step is to find the request's traceId; with it I can pull every log I want from ELK. When the failing log turns out to be a call to another service, I can go to the team that owns the downstream service with the traceId directly.
- During performance optimization, the graphical interface shows the processing time of each span, so the performance bottleneck can be located at a glance.
However, real projects have scenarios that native Sleuth does not reach. We can imitate its implementation and add some supplements:
MQ link logs
In some business scenarios, I want to connect MQ producers and consumers, so that the producer's traceId also leads to the link log of the message being consumed.
The approach should be similar to HTTP: when the producer sends a message, it puts the traceId and related data into the message headers; when the consumer receives the message, it reads them from the headers and keeps propagating the same traceId, as sketched below.
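A sketch of that idea with Spring AMQP, assuming a configured RabbitTemplate (the exchange, routing key, and queue names are made up):
// producer side: copy the current traceId into a message header
public void send(RabbitTemplate rabbitTemplate, String payload) {
    String traceId = MDC.get("traceId");
    rabbitTemplate.convertAndSend("demo-exchange", "demo-key", payload, message -> {
        message.getMessageProperties().setHeader("traceId", traceId);
        return message;
    });
}

// consumer side: restore the traceId into the MDC before handling the message
@RabbitListener(queues = "demo-queue")
public void receive(String payload, @Header(name = "traceId", required = false) String traceId) {
    try {
        MDC.put("traceId", traceId != null ? traceId : "");
        log.info("consumed: {}", payload);
    } finally {
        MDC.clear();
    }
}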
Exception response body
Many projects handle exceptions uniformly with an ExceptionHandler, and the response body generally carries key information such as a status code and a common error message. But since this is exception handling, the whole point is troubleshooting, so the traceId should be included as well; for example:
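A minimal sketch, assuming the traceId is available in the MDC under the key used earlier in this article:
import java.util.HashMap;
import java.util.Map;
import org.slf4j.MDC;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(Exception.class)
    public ResponseEntity<Map<String, Object>> handle(Exception e) {
        Map<String, Object> body = new HashMap<>();
        body.put("code", 500);
        body.put("message", e.getMessage());
        // return the traceId so the caller can hand it straight to the owning team
        body.put("traceId", MDC.get("traceId"));
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(body);
    }
}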
Custom link tracing
Based on the implementations of Sleuth and Zipkin, could we build a customized full-link logging solution ourselves?
- It need not be as fine-grained as Sleuth (for example, thread-level spans); recording coarse-grained units such as a single microservice request or one MQ consumption is enough.
- Pass the traceId through the request header, intercept each service's requests and responses, apply AOP where services are called, capture exception logs, and upload the log information to a service responsible for link log collection (see the sketch after this list).
- Develop a dedicated microservice for link logs. On one hand it collects the logs uploaded from the instrumentation points of all microservices; on the other hand it runs custom analysis: like Zipkin, it can query the full-link log by traceId; it can also power a performance dashboard that surfaces slow links, failing links, the most frequently called services, and so on. When a problem crosses a threshold, it can even notify operations staff.
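As referenced in the second point above, a minimal interceptor sketch under these assumptions (the header name X-Trace-Id and the upload step are hypothetical; a real version would report asynchronously):
import java.util.UUID;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.web.servlet.HandlerInterceptor;

public class CustomTraceInterceptor implements HandlerInterceptor {

    private static final String TRACE_HEADER = "X-Trace-Id";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        // reuse the upstream traceId, or start a new trace at the system entry point
        String traceId = request.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString().replace("-", "");
        }
        MDC.put("traceId", traceId);
        // echo it back so callers can correlate their own logs
        response.setHeader(TRACE_HEADER, traceId);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) {
        // a real implementation would upload {traceId, uri, status, exception} to the log-collection service here
        MDC.clear();
    }
}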