Abstract: Making full use of the computing and storage capabilities of edge nodes, and combining hot and cold data to explore data value cost-effectively, has gradually become the mainstream in the APM field.

Author: Xia Ming (Ya Hai)

The call chain records the complete status and flow of each request, making it a huge treasure trove of data. However, the cost and performance problems caused by its sheer volume are challenges that no one who applies Tracing in practice can avoid. How to record the most valuable traces and their associated data on demand, at the lowest cost, is the main topic of this article. Its core keywords are: edge computing and hot/cold data separation. If you are struggling with the high cost of storing the full call chain, or with data that cannot be found after sampling and charts that are inaccurate, please read on patiently; I believe it will bring you some inspiration.


Edge computing, recording more valuable data

Edge computing, as the name implies, means performing data computation at the edge node; following the current trend, you could also call it "shifting computation to the left". When network bandwidth is limited and transmission overhead and global data hotspots are hard to resolve, edge computing is an effective way to find the optimal balance between cost and value.

In the Tracing field, the most common form of edge computing is filtering and analyzing data inside the user's process. In a public cloud environment, data processing within the user's cluster or private network also counts as edge computing; it saves a great deal of public network transmission cost and spreads out the pressure of global data computation.

In addition, from a data perspective, edge computing can on the one hand screen out the more valuable data, and on the other hand refine the deeper value of the data through processing, so that the most valuable data is recorded at the lowest cost.

Screening for more valuable data

The value of trace data is unevenly distributed. According to incomplete statistics, the actual query rate of call chains is less than one in a million. Storing the full data set not only wastes enormous cost, but also significantly affects the performance and stability of the entire data pipeline. Two common screening strategies are listed below.

  • Tag-based sampling according to the characteristics of the trace data: for example, full sampling of error/slow calls, keeping the first N requests per second for specific services, custom sampling for specific business scenarios, and so on (a minimal sampler sketch is given after this list). The following figure shows the Alibaba Cloud ARMS custom sampling configuration page. Users can freely customize the storage strategy according to their own needs; the actual storage cost is usually less than 5% of the original data.

[Figure: Alibaba Cloud ARMS custom sampling configuration page]

  • Automatically retaining the associated data of the scene in abnormal scenarios: when diagnosing the root cause of a problem, in addition to the call chain we also need to make a comprehensive judgment based on related information such as logs, exception stacks, local method execution time, and memory snapshots. If all of this associated information were recorded for every request, the system would most likely be overwhelmed. Therefore, whether a Tracing product can automatically retain snapshots of abnormal scenes through edge computing is an important criterion for judging its quality. Capabilities such as slow-call thread analysis and HeapDump on memory anomalies are shown in the figures below.

[Figure: slow-call thread analysis]

[Figure: HeapDump on memory anomalies]
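To make the tag-based screening described above more concrete, here is a minimal, self-contained Java sketch of an edge-side sampling decision (all class and method names are hypothetical and not part of any ARMS API): error and slow spans are always kept, each service keeps at most the first N spans per second, and everything else is dropped before it leaves the process.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical edge-side sampler: keep error/slow spans, plus the first N spans per service per second. */
public class TagBasedSampler {
    private final long slowThresholdMillis;
    private final int maxPerServicePerSecond;
    // serviceName -> counter for the current one-second window
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    public TagBasedSampler(long slowThresholdMillis, int maxPerServicePerSecond) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.maxPerServicePerSecond = maxPerServicePerSecond;
    }

    /** Decide at the edge whether a finished span should be reported. */
    public boolean shouldReport(String service, boolean isError, long durationMillis) {
        if (isError || durationMillis >= slowThresholdMillis) {
            return true;                                // error/slow calls: full sampling
        }
        long second = System.currentTimeMillis() / 1000;
        Window w = windows.computeIfAbsent(service, s -> new Window(second));
        synchronized (w) {
            if (w.second != second) {                   // new one-second window, reset the counter
                w.second = second;
                w.count.set(0);
            }
            return w.count.incrementAndGet() <= maxPerServicePerSecond;
        }
    }

    private static final class Window {
        long second;
        final AtomicLong count = new AtomicLong();
        Window(long second) { this.second = second; }
    }

    public static void main(String[] args) {
        TagBasedSampler sampler = new TagBasedSampler(3000, 100);
        System.out.println(sampler.shouldReport("order-service", false, 5200)); // slow  -> true
        System.out.println(sampler.shouldReport("order-service", true, 20));    // error -> true
        System.out.println(sampler.shouldReport("order-service", false, 15));   // within first N/s -> true
    }
}
```

In a real agent this decision would run inside the instrumentation, before the span is serialized and sent, which is exactly where the bandwidth and storage savings come from.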

Whatever the screening strategy, the core idea is the same: compute on the data at the edge node, discard useless or low-value data, and retain abnormal scenes or high-value data that meet specific conditions. This selective reporting strategy based on data value is far more cost-effective than full data reporting, and may well become the mainstream direction of Tracing in the future.

Refining the value of data

In addition to data screening, data processing at edge nodes, such as pre-aggregation and compression, can also effectively save transmission and storage costs while meeting user needs.

  • Pre-aggregated statistics: the biggest benefit of pre-aggregation on the client side is that it greatly reduces the amount of data reported without losing accuracy. For example, even with 1% sampling of the call chain, it can still provide accurate service overview and upstream/downstream monitoring and alerting capabilities (see the sketch after this list).
  • Data compression: compressing and encoding repetitive long texts (such as exception stacks and SQL statements) can also effectively reduce network overhead, and the effect is even better when combined with fuzzing of non-key fields.
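As a rough illustration of client-side pre-aggregation (a sketch under assumed names, not the ARMS implementation), the following aggregates every finished call into per-interface request, error, and latency counters inside the process, so that each reporting interval only a handful of metric lines leave the node instead of every raw span.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical client-side pre-aggregation: per-interface request count, error count, and total latency. */
public class EdgePreAggregator {
    private static final class Stat {
        final LongAdder requests = new LongAdder();
        final LongAdder errors = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
    }

    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    /** Record one finished call; this runs for every request, whether or not the span itself is sampled. */
    public void record(String iface, boolean isError, long durationMillis) {
        Stat s = stats.computeIfAbsent(iface, k -> new Stat());
        s.requests.increment();
        if (isError) s.errors.increment();
        s.totalMillis.add(durationMillis);
    }

    /** Flush the aggregated metrics; only these few lines are reported upstream each interval. */
    public void flush() {
        stats.forEach((iface, s) -> System.out.printf(
                "iface=%s requests=%d errors=%d avgMs=%.1f%n",
                iface, s.requests.sum(), s.errors.sum(),
                s.requests.sum() == 0 ? 0.0 : (double) s.totalMillis.sum() / s.requests.sum()));
        stats.clear();
    }

    public static void main(String[] args) {
        EdgePreAggregator agg = new EdgePreAggregator();
        agg.record("/api/order", false, 120);
        agg.record("/api/order", true, 310);
        agg.record("/api/user", false, 45);
        agg.flush(); // 2 metric lines instead of 3 raw spans
    }
}
```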

Hot and cold data separation: meeting personalized post-aggregation analysis needs at low cost

Edge computing can satisfy most pre-aggregation analysis scenarios, but it cannot satisfy diversified post-aggregation analysis requirements. For example, a business may need to count which interfaces take more than 3 seconds and how their sources are distributed; such personalized post-aggregation analysis rules cannot be enumerated in advance. And when we cannot predefine the analysis rules, it seems the only option is the extremely expensive full storage of raw data. Is there really no room for optimization? There is. Below we introduce a low-cost way to solve the post-aggregation analysis problem: hot/cold data separation.
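To make "personalized post-aggregation analysis" concrete, the sketch below (with a hypothetical Span record, not a real query API) answers exactly the example above: which interfaces took more than 3 seconds, grouped by caller source. Such a question can only be answered from raw detail data unless someone happened to pre-aggregate that exact rule in advance.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical ad-hoc post-aggregation over raw spans: interfaces slower than 3s, grouped by interface and source. */
public class PostAggregationQuery {
    record Span(String iface, String source, long durationMillis) {}

    public static Map<String, Map<String, Long>> slowCallDistribution(List<Span> spans, long thresholdMillis) {
        return spans.stream()
                .filter(s -> s.durationMillis() > thresholdMillis)                     // "duration > 3s" rule
                .collect(Collectors.groupingBy(Span::iface,                            // group by interface
                         Collectors.groupingBy(Span::source, Collectors.counting()))); // then by caller source
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("/api/checkout", "app", 4200),
                new Span("/api/checkout", "web", 3500),
                new Span("/api/checkout", "app", 800),
                new Span("/api/search", "web", 5100));
        System.out.println(slowCallDistribution(spans, 3000));
        // e.g. {/api/checkout={app=1, web=1}, /api/search={web=1}} (map ordering may vary)
    }
}
```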

A brief introduction to the hot/cold data separation scheme

The premise of hot/cold data separation is that users' query behavior satisfies the principle of temporal locality. Put simply, the most recent data is queried most often, while the probability of querying cold data is small. For example, because problem diagnosis is time-sensitive, more than 50% of trace query and analysis happens within 30 minutes, and queries for data older than 7 days are usually concentrated on error and slow call chains. With this theoretical foundation established, let us discuss how to realize hot/cold data separation.

First, hot data is time-sensitive. If only the hot data from the most recent period is recorded, the required storage space drops dramatically. In addition, in a public cloud environment, data from different users is naturally isolated, so computing and storing hot data inside the user's VPC offers better cost performance.

Second, queries against cold data are targeted: cold data that meets diagnostic needs can be filtered out through different sampling strategies for persistent storage, for example error/slow sampling or sampling of specific business scenarios. Because cold data has a long retention period and high stability requirements, unified management within the Region can be considered.
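Putting the two previous paragraphs together, here is a minimal sketch of the hot/cold write path (the storage interfaces and retention choices are illustrative assumptions, not a specific product's design): every span is written to a short-retention hot store inside the user's VPC, and only spans matching the cold-data sampling rules are also forwarded to long-term regional storage.

```java
/** Hypothetical hot/cold write path: full data to a short-lived hot store, sampled data to a long-lived cold store. */
public class HotColdRouter {
    interface SpanStore { void write(String traceId, String payload); }

    private final SpanStore hotStore;   // e.g. in-VPC storage, retention measured in minutes to hours
    private final SpanStore coldStore;  // e.g. regional storage, retention measured in days to months
    private final long slowThresholdMillis;

    public HotColdRouter(SpanStore hotStore, SpanStore coldStore, long slowThresholdMillis) {
        this.hotStore = hotStore;
        this.coldStore = coldStore;
        this.slowThresholdMillis = slowThresholdMillis;
    }

    public void onSpanFinished(String traceId, String payload, boolean isError,
                               long durationMillis, boolean matchesBusinessRule) {
        hotStore.write(traceId, payload);                       // hot: full data, short retention, low cost
        if (isError || durationMillis >= slowThresholdMillis || matchesBusinessRule) {
            coldStore.write(traceId, payload);                  // cold: precisely sampled, long retention
        }
    }

    public static void main(String[] args) {
        SpanStore hot = (id, p) -> System.out.println("HOT  <- " + id);
        SpanStore cold = (id, p) -> System.out.println("COLD <- " + id);
        HotColdRouter router = new HotColdRouter(hot, cold, 3000);
        router.onSpanFinished("t-1", "...", false, 120, false);  // hot only
        router.onSpanFinished("t-2", "...", true, 80, false);    // hot + cold (error)
    }
}
```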

In summary, hot data has a short retention period and low cost, yet it can support real-time post-aggregation analysis over the full data set; cold data is precisely sampled, so its total volume drops dramatically, usually to only 1% to 10% of the original, while still meeting the diagnostic needs of most scenarios. The combination of the two achieves the optimal balance between cost and experience. Leading APM products at home and abroad, such as ARMS, Datadog, and Lightstep, all adopt a storage solution that separates hot and cold data.


Real-time full analysis of hot data

Trace detail data contains the most complete and richest call information. The service panels, upstream/downstream dependencies, and application topology views most commonly used in the APM field are all statistics derived from trace details. Post-aggregation analysis based on trace detail data can locate problems more effectively according to each user's individual needs. However, the biggest challenge of post-aggregation analysis is that the statistics must be computed over the full data set; otherwise sampling skew will cause the final conclusions to deviate far from reality.

Alibaba Cloud ARMS, the only Chinese cloud vendor selected in the 2021 Gartner APM Magic Quadrant, provides full analysis of hot data within the last 30 minutes, supporting filtering and aggregation under arbitrary combinations of conditions, as shown in the following figure:

[Figure: ARMS real-time full analysis of hot data with combined filter and aggregation conditions]

Persistent sampling and analysis of cold data

Persistently storing the full call chain is very expensive, yet the actual query rate for call chains older than 30 minutes is less than one in a million, and most queries are concentrated on error/slow call chains or chains that match specific business characteristics; anyone who regularly troubleshoots trace problems will recognize this. Therefore, we should persist only the small number of call chains that match precise sampling rules, greatly reducing the cost of cold data persistent storage.

So how can precise sampling be achieved? The methods commonly used in the industry fall into two types: head-based sampling and tail-based sampling. Head sampling is generally performed at edge nodes such as the client agent, for example rate-limited sampling or fixed-ratio sampling per interface or service; tail sampling is usually filtered from the full hot data, for example full sampling of error and slow calls.
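The difference between the two can be summarized in a short sketch (names are illustrative): a head-based decision is made when the trace starts, before its outcome is known, while a tail-based decision is made only after the whole trace has finished, when error and latency information is already available.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative contrast between head-based and tail-based sampling decisions. */
public class SamplingStrategies {

    /** Head-based: decided at the edge when the trace starts, e.g. a fixed ratio per interface/service. */
    public static boolean headSample(double ratio) {
        return ThreadLocalRandom.current().nextDouble() < ratio;   // the outcome of the request is unknown yet
    }

    /** Tail-based: decided after the trace completes, typically over the full hot data. */
    public static boolean tailSample(boolean isError, long durationMillis, long slowThresholdMillis) {
        return isError || durationMillis >= slowThresholdMillis;   // e.g. keep all error/slow traces
    }

    public static void main(String[] args) {
        System.out.println("head (10% ratio): " + headSample(0.10));
        System.out.println("tail (error):     " + tailSample(true, 50, 3000));   // true
        System.out.println("tail (fast, ok):  " + tailSample(false, 50, 3000));  // false
    }
}
```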

An ideal sampling strategy should store only the data that actually needs to be queried. APM products need to provide flexible sampling strategy configuration capabilities along with best practices, so that users can adapt them to their own business scenarios.

Concluding remarks

As more and more enterprises and applications move to the cloud and the scale of public cloud clusters grows explosively, "cost" will be a key factor in how enterprises use the cloud. In the cloud-native era, making full use of the computing and storage capabilities of edge nodes and combining hot and cold data to explore data value cost-effectively has gradually become the mainstream in the APM field. The traditional approach of reporting, storing, and then analyzing the full data set will face ever greater challenges. What the future holds, let us wait and see.


