Best Practices of JD Cloud Distributed Link Tracing in Financial Scenarios

Microservices have been one of the most popular software architecture concepts in recent years and, together with containers and DevOps, form the technical foundation of cloud native. Microservices grew out of market demand for rapid product delivery. By adopting agile practices such as automated testing and continuous integration, organizations become more efficient and software becomes more reusable, which in turn paves the way for the evolution toward a middle-platform architecture. Many Internet companies in China and abroad have reaped the technical dividends.

However, many enterprises have not achieved the expected results after introducing a microservice architecture. The second law of thermodynamics tells us that an isolated system evolves toward increasing entropy, and by analogy a software system drifts toward increasing complexity. If services are split too finely, the complexity of each single service falls, but the complexity of the whole system rises sharply: in theory, n services can have up to n×(n-1)/2 pairwise relationships, which grows quadratically with n. Microservices move complexity from inside a system to between systems. Teams can then fall into chaos, and delivery slows down instead of speeding up.
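To make the n×(n-1)/2 figure concrete, here is a minimal Python sketch. The function name `max_pairwise_links` is ours, chosen for illustration, and the number is an upper bound that assumes every service may talk to every other service.

```python
def max_pairwise_links(n: int) -> int:
    """Upper bound on distinct service-to-service relationships among n services."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Splitting a system 10x finer multiplies the possible relationships ~100x.
    for n in (5, 10, 50, 100):
        print(f"{n:>3} services -> up to {max_pairwise_links(n)} relationships")
```

Going from 10 to 100 services raises the bound from 45 to 4,950 relationships, which is why call relationships quickly become impossible to track by hand.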

How can we escape this "entropy increase" dilemma and truly enjoy the dividends of microservices? On the one hand, a series of DevOps tools and methods is needed so that the organizational structure matches the software architecture and the new technology serves the team instead of the team becoming a slave to its tools. On the other hand, distributed link tracing technology is needed to gain full control of the calling relationships between microservices.

JD Cloud SGM (Service Governance And Monitoring) handles analysis and queries over JD's trillion-scale call chain data every day and safeguards traffic during Double Eleven and 618, so that every transaction can be traced and no fault has anywhere to hide, all with very high stability and very low resource consumption. The financial industry has always been a pioneer in adopting IT. JD Cloud has delivered SGM and its practices to a large number of banks and consumer finance companies, supporting the industry's digital transformation with good results. However, we also found that some customers do not get smooth results. One reason is the customer's technical environment, especially at traditional financial institutions: their stacks lean toward monolithic applications and closed commercial software, over which the customers themselves have relatively little control. In consumer finance, where the stack is closer to that of Internet companies, results are noticeably better.

Beyond the technical environment, another factor is whether the organization has the basic ability to build a complete monitoring system and how well it understands monitoring as a whole. Any monitoring product addresses one aspect of the monitoring system, not all of it, and every single product has blind spots and limitations. For example, it is unrealistic to expect NPM (Network Performance Monitoring) to offer the flexibility of application-level monitoring in APM (Application Performance Monitoring).

Against this backdrop of technology integration, JD Cloud worked with a leading consumer finance company on best practices for cloud native and full-link tracing in consumer finance. The company is a licensed consumer finance institution whose inclusive-finance app has more than 100 million registered users, and its back end is supported by heavyweight business systems such as risk control and intelligent customer service. Front-end users are highly active and traffic is heavy. There are many app server-side and back-end business systems, and the scenarios are complex. The system operations and technical O&M teams are under great pressure.

In any organization, monitoring should be a comprehensive, layered, and systematic platform built from multiple cooperating tools. Within that system, SGM full-link monitoring sits in the middle-to-upper layer of business monitoring. It focuses on application-level performance monitoring, service call relationship monitoring, and traffic monitoring. It is oriented toward service interfaces and methods, and its extension point is multi-dimensional business monitoring built on method-level monitoring. The lower layers still need system-level monitoring, basic host monitoring, network monitoring, database monitoring (oriented toward physical resources and the management of the database itself), log monitoring, and other components. Coarse-grained and fine-grained monitoring are combined so that monitored objects are organized from top to bottom. Many users cannot make full use of SGM and often have blind spots in their monitoring.

The consumer finance company had built a basic monitoring system, FASTX, which integrates monitoring and alarm modules for the network, host, and device layers. It had also built a full-link monitoring system through secondary development of the open source framework Pinpoint to provide application-level link monitoring. However, because of Pinpoint's large performance overhead, narrow monitoring scope, coarse granularity, inability to start and stop monitoring items flexibly, and lack of rich monitoring indicators and business monitoring, the results of the Pinpoint-based application monitoring were not very satisfactory.

Step 1: Unified management, consistent experience

The consumer finance company has its own management systems for users, applications, and devices, including independent alarm channel management, the NCMDB basic information platform, and AD domain control. New products must be integrated into this environment. SGM's authentication and alarm modules are pluggable and can be connected through the OpenAPI, so that the company's user management, authentication, and alarm systems work closely with SGM. This smooths over differences between systems and creates a unified working environment. It lowers the onboarding threshold for business applications and, by integrating users and alarms into the existing technical system, keeps the user experience consistent.

Step 2: Onboard in batches for quick results

The consumer finance company has many internal applications, so the two parties onboarded them tier by tier and batch by batch according to each application's technical framework. SGM requires no intrusive changes to business application code, is simple to connect, and supports common open source frameworks. After review, the applications were onboarded in three batches.

● The first batch is mainly consumer-facing (C-side) app applications. Their back-end services are mostly Java Spring Cloud applications. The monitored items are the app back-end services, which are the most sensitive to response time and user experience, so they were onboarded first.
● The second batch is mainly basic service systems, mostly written in Java.
● The third batch is mainly large back-end business management applications and big data applications. Java and Python coexist here, and these systems go live gradually, following each system's iteration rhythm.

Results:

● The first batch of systems was onboarded and launched in production within one week.
● 70% of applications were onboarded within one month.
● Most applications were onboarded within three months. Nearly 700 applications were connected, about 66,000 methods were monitored in real time, and peak monitoring TPS reached 160,000. The early onboarding schedule was well controlled and the onboarding cost was low, meeting the monitoring and management objectives set by management.

Step 3: Target the pain points and lead with strengths

New products are hard to promote in the early stage. Resistance from business teams and the need to change existing habits are both obstacles, especially when a self-developed link monitoring system is already available internally.

SGM has a large number of features, and there is no need to roll out all of them at the start. Based on the characteristics of the company's existing Pinpoint link monitoring system, we therefore recommended an optimal feature adoption path to the business teams. After two rounds of dedicated training and coaching, the business teams implemented a four-layer, fine-grained monitoring system (application, service, method, instance), defined the return codes of key methods and custom business fields, built usable business success rate indicators, and, with our help, focused on key alarm items and alarm strategies.

Once a business team connects to SGM, it quickly gets the four-layer application-service-method-instance monitoring system without much manual configuration. We then guided the teams to identify the core methods that need monitoring and to watch business success rate indicators, which led them into the main line of SGM's core functions: call query, call chain, time-consuming analysis, and linked log query. This played a crucial role in the early stage of introducing SGM. Acceptance by the application teams and extensive use of the accumulated data kept the monitoring system running healthily. SGM thus passed smoothly through the most difficult period of its introduction, laying a solid foundation for deeper use later.

Step 4: Proceed step by step toward full rollout

After the first stage of promotion was complete and the business teams had accepted the product, how could the business teams, the monitoring team, and the system O&M team get more out of the same monitoring platform? The two parties agreed on a promotion approach: build on in-depth use, fully tap the value of monitoring data, and formulate promotion strategies from the perspectives of development, application O&M, layered indicator monitoring for application operations, and large-screen situational awareness, forming an actionable plan to roll out SGM monitoring comprehensively.

Users gained real benefits and gave positive feedback as they used the product in depth. At the same time, drawing on the company's own business scenarios and technical characteristics, they reported several problems to the SGM product team that had never come up inside JD.com, including a Kafka JMXClient conflict and a failure to extract custom business fields from Tomcat Request information after Recycle. This feedback has pushed SGM to grow together with its customers and to mature across more financial scenarios.

Over a long period of serving internal applications and external customers, we have summarized several best-practice scenarios for distributed link tracing. They give teams a bird's-eye view of the whole system and help unlock the agility of the microservice architecture:

1. Problem solving for R&D troubleshooting

1. Typical question: How to accurately locate a fault?

Business applications frequently run into performance problems, traffic fluctuates often, and sudden anomalies are hard to troubleshoot. No snapshot of the runtime environment exists at the moment a fault occurs, so afterwards troubleshooting depends on system logs and the skills of individual team members, and there is no effective, reusable method. For the company's technical team, which is committed to its service SLAs, locating problems accurately and shortening troubleshooting time is a huge challenge.

Solution: Thanks to the real-time log collection and efficient processing of the SGM full-link monitoring system, as soon as a monitored application starts to behave abnormally, SGM's built-in alarm module pushes alarm information to the application's stakeholders. The alarm reports method time consumption, average response time, call frequency, JVM monitoring data, and the multi-dimensional TP9XX/AVG/MAX performance indicators. The alarm also groups links to the related troubleshooting clues so that business engineers can step into the investigation easily. From this alarm entry point, SGM's troubleshooting tools are chained together, including call query, time-consuming details, call chain, topology map, performance distribution along the topology call chain, JVM GC analysis, network connections, and the JVM memory toolbox. The whole troubleshooting process is smooth, and the operations are simple and effective.
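For readers unfamiliar with the TP9XX/AVG/MAX family of indicators, the following is a minimal Python sketch of how such figures can be derived from raw method latencies and compared with alarm thresholds. It is not SGM's implementation: the function names, the nearest-rank percentile definition, and the threshold values are all assumptions made for illustration.

```python
def percentile(samples, per_mille):
    """Nearest-rank percentile; per_mille=990 means TP99, 999 means TP999."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rank errors.
    rank = max(1, (per_mille * len(ordered) + 999) // 1000)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Aggregate one monitoring window of method latencies (milliseconds)."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "max": max(latencies_ms),
        "tp90": percentile(latencies_ms, 900),
        "tp99": percentile(latencies_ms, 990),
        "tp999": percentile(latencies_ms, 999),
    }


def breached(summary, thresholds):
    """Return the names of indicators that exceed their (hypothetical) limits."""
    return sorted(name for name, limit in thresholds.items() if summary[name] > limit)


if __name__ == "__main__":
    window = list(range(1, 101))  # 100 calls taking 1..100 ms
    stats = summarize(window)
    print(stats)
    print(breached(stats, {"tp99": 80, "avg": 60}))
```

Tail indicators such as TP99 expose slow calls that an average hides: in the sample window the average is 50.5 ms, while TP99 is 99 ms and would trip an 80 ms threshold.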

Effect: SGM's built-in modules, combined with the proactive alarm module, form a standardized set of troubleshooting steps and tools that makes troubleshooting fast.

2. Typical question: How to deal with underlying I/O problems?

While an application system is running, errors often occur at the underlying I/O level, involving relational databases, NoSQL databases, caches, logging frameworks, MQ frameworks, and so on. These high-frequency problems are often buried in log files, where they are easily overlooked and eventually lead to production incidents.

Solution: Guide users to make good use of the SGM alarm module. SGM ships with built-in detection rules and thresholds for the various kinds of underlying I/O anomalies. Once connected, an application gets standard detection and alarm capabilities and can respond calmly to production anomalies.

Effect: Underlying I/O problems are handled separately and given a higher alarm level. Business applications can build a layered understanding of monitoring, identify the source of a problem, tune alarm strategies in time, and move from passive to proactive. This improves early warning and response for underlying I/O problems, which can then be handled quickly with the SGM troubleshooting toolbox.

3. Typical question: How to analyze service time?

In a microservice architecture, monitoring how call time is distributed is difficult. Besides the cost of the service itself, the total cost of a distributed call includes network overhead, latency between data centers, network packet loss, server thread pool blocking, the effects of circuit breaking, rate limiting, and similar measures on the service link, and GC on both the server and the client. The company's technical architecture is mainly based on Spring Cloud microservices. How call time is distributed, and how to quickly determine which service is responsible when a problem occurs, are the technical team's biggest concerns.

Solution: Combining monitoring of the underlying hosts with SGM link tracing produces a global view of call time, enabling accurate statistics and problem location for service time under the cross-host communication typical of microservices. SGM provides monitoring at the application, service, method, and instance levels. For any single call, it can also trace back to the call source, follow the status of upstream and downstream services, observe fluctuations in service performance, pinpoint the problem service promptly, and bring in the service owner for joint troubleshooting.
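The idea of attributing call time to the responsible service can be sketched in a few lines of Python. This is an illustration only, not SGM's algorithm: the span fields (`id`, `parent_id`, `service`, `duration_ms`) and service names are hypothetical, and the sketch assumes child calls run synchronously and one after another, so a span's own time is its duration minus the durations of its direct children.

```python
from collections import defaultdict


def self_times(spans):
    """Own time of each span: duration minus the time spent in direct child calls."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    # Clamp at zero in case clock skew makes children look longer than the parent.
    return {
        span["id"]: max(0, span["duration_ms"] - child_total[span["id"]])
        for span in spans
    }


def slowest_service(spans):
    """Return (service, own_time_ms) for the service that consumed the most time."""
    own = self_times(spans)
    per_service = defaultdict(int)
    for span in spans:
        per_service[span["service"]] += own[span["id"]]
    return max(per_service.items(), key=lambda item: item[1])


if __name__ == "__main__":
    trace = [
        {"id": "s1", "parent_id": None, "service": "app-gateway", "duration_ms": 120},
        {"id": "s2", "parent_id": "s1", "service": "order-service", "duration_ms": 100},
        {"id": "s3", "parent_id": "s2", "service": "risk-service", "duration_ms": 70},
        {"id": "s4", "parent_id": "s2", "service": "mysql", "duration_ms": 20},
    ]
    print(self_times(trace))
    print(slowest_service(trace))
```

In this sample trace the gateway's 120 ms response looks slow, but 70 ms of it is spent inside `risk-service`, which is the service whose owner should be brought into the investigation.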

Effect: With SGM, the dependencies between services are clearly visible and the time distribution of service calls can be measured accurately. The company's team can quickly clarify business dependencies, locate problem services, and troubleshoot upstream and downstream services together, simply and efficiently.

2. Problem solving for architecture governance

4. Typical question: How to use runtime data for service governance?

The business grows fast, and new applications keep emerging. The actual running state of services may have drifted from the original architecture plan. Traditionally, service governance is based on architecture documents. With iteration accelerating and changes frequent, how can service dependency problems be discovered quickly, and how can governance rely on real runtime data? This is what the company's team cares about most.

Solution: SGM's answer is to use the logs collected in real time from application systems, that is, monitoring log data, to assist service governance. It draws on the information carried along the entire call link to form a layered global view that exposes the real call relationships between services, call frequency, call intensity, and fluctuations in upstream and downstream traffic. SGM provides full-link call analysis with hierarchical drill-down and drill-up, call source analysis, call topology, topology performance monitoring, and real-time call topology graphs. With these, teams can remove service bottlenecks, split up unreasonable service modules, and merge services that are too scattered, in a continuous loop of exploring, adjusting, observing how the data changes, and optimizing again.
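As a rough illustration of how a call topology can be recovered from runtime data instead of architecture documents, the Python sketch below aggregates trace spans into caller-to-callee edges with call counts. The span fields and service names are hypothetical, and this is a simplified sketch of the general technique, not SGM's internal data model.

```python
from collections import Counter


def build_topology(spans):
    """Count cross-service calls as (caller, callee) edges from parent/child spans."""
    service_of = {span["id"]: span["service"] for span in spans}
    edges = Counter()
    for span in spans:
        parent = span.get("parent_id")
        # Skip root spans, unknown parents, and calls within the same service.
        if parent in service_of and service_of[parent] != span["service"]:
            edges[(service_of[parent], span["service"])] += 1
    return edges


def callers_of(edges, service):
    """Upstream services that call the given service (call source analysis)."""
    return sorted({caller for (caller, callee) in edges if callee == service})


if __name__ == "__main__":
    spans = [
        {"id": "a1", "parent_id": None, "service": "app-gateway"},
        {"id": "a2", "parent_id": "a1", "service": "order-service"},
        {"id": "a3", "parent_id": "a2", "service": "risk-service"},
        {"id": "b1", "parent_id": None, "service": "admin-portal"},
        {"id": "b2", "parent_id": "b1", "service": "risk-service"},
        {"id": "b3", "parent_id": "b1", "service": "risk-service"},
    ]
    topology = build_topology(spans)
    for (caller, callee), count in sorted(topology.items()):
        print(f"{caller} -> {callee}: {count} calls")
    print(callers_of(topology, "risk-service"))
```

Edges that appear in the runtime topology but not in the architecture plan, or that carry unexpectedly heavy traffic, are natural starting points for governance.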

Effect: The company has applied this continuously and in depth to internal service governance. Through ongoing exploration, the technical team has formed effective governance ideas and plans and has built an SRE business-level indicator evaluation system on top of SGM. Based on SGM's monitoring data, this system effectively tracks the service status and business indicators of each application and meets management's requirement for greater technical visibility through data.

5. Typical question: How to evaluate availability and failure rates?

How should application health, business success rate, and system availability be evaluated? Most of the company's internal applications use the request status code to judge whether the business is working normally. This is relatively coarse and cannot identify problems accurately at the method level. Each application also understands business health differently. How to unify the definitions and hide the differences has become an important issue for architecture governance.

Solution: Build a unified and trustworthy monitoring system for availability and failure rate (success rate). By default SGM provides a conventional set of identification code rules to mark the health of monitored objects, and it also provides an entry point for custom business rules. SGM has a three-layer identification code mechanism: global, application-level, and method-level. By monitoring the call chains of running applications in real time, it captures exceptions that occur during execution and produces real-time availability results for the system. Unified result tags hide the differences between the return codes of specific methods, and the dynamic monitoring results of method-level return codes are combined with availability indicators to build a success rate detection system across five dimensions: method, service, application, instance, and data center. The company's technical team can objectively evaluate an application's real-time health through its success rate and availability, and can check whether the business is meeting its goals through the classification of return codes. Beyond failure rate and availability, SGM adds performance indicator fluctuations, logs, and capacity data to build a multi-dimensional, application-oriented health evaluation system.
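A layered return code scheme of this kind can be sketched as follows. The rule tables, application and method names, return codes, and the precedence order (method-level over application-level over global) are all assumptions for illustration. The article does not describe how SGM resolves conflicts between layers.

```python
# Hypothetical rule tables: each maps a raw return code to a unified result tag.
GLOBAL_RULES = {"0": "success", "500": "failure"}
APP_RULES = {"loan-app": {"E1001": "failure", "W2001": "success"}}
METHOD_RULES = {("loan-app", "applyLoan"): {"E1001": "success"}}


def classify(app, method, code):
    """Resolve a return code to a unified tag; the most specific layer wins (assumed)."""
    layers = (
        METHOD_RULES.get((app, method), {}),
        APP_RULES.get(app, {}),
        GLOBAL_RULES,
    )
    for rules in layers:
        if code in rules:
            return rules[code]
    return "unknown"


def success_rate(calls):
    """Success rate over (app, method, code) calls, ignoring unclassified codes."""
    outcomes = [classify(app, method, code) for app, method, code in calls]
    counted = [outcome for outcome in outcomes if outcome != "unknown"]
    if not counted:
        return None
    return counted.count("success") / len(counted)


if __name__ == "__main__":
    calls = [
        ("loan-app", "applyLoan", "0"),
        ("loan-app", "applyLoan", "E1001"),   # expected business rejection
        ("loan-app", "queryLimit", "E1001"),  # a real failure for this method
        ("loan-app", "queryLimit", "500"),
    ]
    print(success_rate(calls))
```

The same code, `E1001`, counts as a success for one method and a failure for another, yet both feed one comparable success rate. This is the sense in which unified result tags hide differences between method return codes.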

Effect: In the company's practice, each business team's understanding of application health, and of how monitored methods and return codes are defined, evolved from confusion, through acceptance and governance, to clarity and order. A deep understanding of the essence of method monitoring and return code identification has strongly supported the wide adoption of SGM at the company.

6. Typical question: How to do business capacity assessment?

The call volume of each of the company's businesses fluctuates greatly, and the size of the fluctuation varies widely from one business to another. Business capacity evaluation had no reliable starting point or supporting data. Balancing resource utilization against service availability and user experience often troubles technical teams.

Solution: SGM's real-time application capacity evaluation and its water level chart are the best means of describing application resources and service availability. While collecting application monitoring logs, the SGM back end also uses dedicated calculation methods to evaluate application capacity in real time. The evaluation is accumulated layer by layer, from method to service to application, and capacity changes are finally fed back in real time through the water level chart.

Effect: Changes in application capacity matter not only to the architecture team but even more to the monitoring staff on the application side. They must watch response time to protect user experience while also watching resource utilization to control cost. With the real-time capacity evaluation module, SGM has effectively helped the company's team strike a balance between user experience and resource utilization, and it has won praise from users.

3. Problem solving for application operation and maintenance

7. Typical question: How to achieve effective alarms?

Items that should raise an alarm do not, or alarms contain a large amount of repeated information. Useful and repeated information are mixed together, which distracts monitoring staff.

Solution: The SGM alarm module provides proactive monitoring and intelligent alerting across a massive number of interfaces. Application alarms need to be comprehensive and accurate, but configuring them one by one is a heavy workload, so SGM offers a range of options. Baseline-based alarms are a distinctive feature of the alarm module. SGM provides alarms in three dimensions for applications: global, application, and method. For services, it provides alarms tied to specific service monitoring charts, so that the alarm needs of different groups are covered overall. The alarm module also has root cause analysis: it intelligently correlates continuously fluctuating alarms, merges suspected root causes, and pushes the resulting root cause alarm to the people responsible for the application.
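Two of the ideas above, baseline-based alarms and merging repeated alarms, can be illustrated with a small Python sketch. This is not SGM's algorithm: the baseline here is a simple mean plus `k` standard deviations over recent history, and the merge window, field names, and rule names are hypothetical. True root cause analysis is considerably more involved than this deduplication.

```python
from statistics import mean, pstdev


def breaches_baseline(history, current, k=3.0):
    """True if the current value exceeds the historical mean by k standard deviations."""
    return current > mean(history) + k * pstdev(history)


def merge_alarms(alarms, window_s=60):
    """Merge repeats of the same (app, method, rule) that arrive within window_s seconds."""
    merged = []
    open_index = {}  # alarm key -> index of its most recent merged record
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        key = (alarm["app"], alarm["method"], alarm["rule"])
        idx = open_index.get(key)
        if idx is not None and alarm["ts"] - merged[idx]["last_ts"] <= window_s:
            merged[idx]["count"] += 1
            merged[idx]["last_ts"] = alarm["ts"]
        else:
            open_index[key] = len(merged)
            merged.append({**alarm, "count": 1, "last_ts": alarm["ts"]})
    return merged


if __name__ == "__main__":
    tp99_history = [100, 102, 98, 101, 99]  # recent TP99 values in ms
    print(breaches_baseline(tp99_history, 150))
    print(breaches_baseline(tp99_history, 103))

    raw = [
        {"ts": 0, "app": "loan-app", "method": "applyLoan", "rule": "tp99"},
        {"ts": 10, "app": "loan-app", "method": "queryLimit", "rule": "fail_rate"},
        {"ts": 30, "app": "loan-app", "method": "applyLoan", "rule": "tp99"},
        {"ts": 50, "app": "loan-app", "method": "applyLoan", "rule": "tp99"},
        {"ts": 200, "app": "loan-app", "method": "applyLoan", "rule": "tp99"},
    ]
    for record in merge_alarms(raw):
        print(record["method"], record["rule"], record["count"])
```

A baseline removes the need to hand-pick a fixed threshold for every method, and merging keeps one notification per ongoing problem instead of one per occurrence.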

Effect: At the company, the alarm module is used very frequently, and many daily production incidents and problem handling processes are triggered by alarms. With the SGM troubleshooting toolbox (call query, call source, call chain route, time-consuming details, call chain drill-down, call chain drill-up, performance indicator fluctuation charts, linked MDC log query, and extraction of custom business fields), problems are located quickly and handled in time.

8. Typical question: How to convert monitoring data into business language?

In the early stage of monitoring, the company's technical team tried to collect monitoring data with open source Pinpoint. Constrained by Pinpoint's own architecture and lacking rich chart customization and visualization modules, the monitoring data was never put to effective use. With the business growing rapidly, consumer-side user traffic keeps rising and the technical team is under great pressure. Keeping services continuously available requires comprehensive, multi-perspective observability data and visual monitoring charts.

Solution: SGM's built-in charts include call volume, TP, AVG, and MAX performance indicator charts, failure rate, availability, monitoring radar charts, the application dashboard, and other modules. With them, an application system can quickly build a basic environment for monitoring situational awareness. Going further, the company customized large-screen application monitoring, classified monitoring views, process monitoring, period-over-period monitoring, ratio monitoring, and multi-dimensional performance indicator monitoring for key methods, according to the characteristics of its own products.

Effect: One sign that users are using a product in depth is that they keep finding new ways to apply its data. The company not only uses the alarm module to handle abnormal scenarios and diagnose faults, but also makes in-depth use of the data SGM produces for visual displays and customized reports. Its technical team is also developing several monitoring and response systems within its SRE framework through the SGM OpenAPI, putting SGM monitoring data to work for significant business value. Because the company had built a basic monitoring system early on and had explored link monitoring through secondary development of Pinpoint, its team understands service link monitoring deeply. During the promotion of SGM the two sides therefore found much common ground, and with SGM's stronger features layered on top, the result has genuinely benefited business growth.

As a consumer finance company with a pure Internet technology background, it has widely adopted the agile architectures common at Internet companies, and its own digital technology requirements are relatively mature, which makes it a typical case for illustrating best practices in SGM distributed full-link tracing. SGM is an important product in JD Cloud's distributed financial middle-platform product matrix. It has been tempered in both internal and external scenarios and, compared with open source and commercial alternatives, reflects a deep accumulated understanding of financial scenarios. In future articles we will share more about the implementation principles and technical highlights of SGM link tracing and explore how observability applies in more cloud-native scenarios, such as service mesh and other new technologies.

JD Cloud Developers