Authors: Shen Bin, You Jing
Business Scenarios and Challenges
AIA is a life insurance group listed on the Hong Kong Stock Exchange, covering 18 markets. As of December 31, 2021, its total assets were $340 billion.
AIA established a branch in Shanghai in 1992. It was one of the first non-local insurance institutions to be issued a personal life insurance business license after the reform and opening up, and the first insurance company to introduce the insurance salesperson system into China. In June 2020, AIA was approved to convert the Shanghai Branch of AIA Insurance Co., Ltd. into AIA Life Insurance Co., Ltd. In July 2020, AIA officially became the first wholly foreign-owned life insurance company in Mainland China. The AIA Youxiang App won the Best InsurTech Platform award in 2021.
Business Features and Architecture
To live up to AIA's slogan of "Healthier, Longer, Better Lives", we refactored many applications into microservices during cloud migration so that they could keep up with rapidly changing business and performance requirements. We also turned the core package program that previously ran on AS400 into a microservice. This service-oriented transformation improved availability. In addition, we containerized the applications and run them on K8s for elastic scaling and self-healing.
The above transformation has led to an increase in the complexity of the application system. Therefore, observing the operation of microservices and K8s has become a major challenge.
At the same time, some outsourced applications have no source code and are not suitable for microservice transformation, but we still containerized them and deployed them to K8s. Other applications were, for various reasons, not suitable for cloud migration and remained in the IDC data center. As a result, calls between services involve complex paths such as cloud-to-cloud calls and calls between on-premises systems and the cloud.
Moving to the cloud brought us a real improvement in SLA, but it also made access links and deployment more complex. How to better observe applications has become an unavoidable challenge.
Observability building pain points and challenges
Building a good observability system involves the following pain points:
- Increased observation complexity: Although cloud-native microservices bring high availability, they also make the system more complex and harder to observe. The underwriting pass rate, delivery success rate, and daily/monthly active users are scattered across business modules. The business needs a global view of the important business nodes throughout the policy life cycle, as well as visibility into the R&D state.
- Difficulty in technology selection: For historical reasons, AIA's internal applications use different technology stacks and versions, which makes it hard to choose observability technology and to trace call chains.
- Difficulty in unified observation: AIA is a financial company. Development and application operations are completely separated, and their logs are stored and maintained separately as well. As a result, this data cannot be presented on the same dashboard.
- Indicator governance: There are many indicators at the IaaS, PaaS, and application layers, and a single database alone may expose more than 200. Bringing them down to a number that is easy to understand and track requires continuous review and pruning.
- Fast fault location: In the IDC era, applications had no intuitive way to check whether their own resources were sufficient. Commercial APM tools exist, but they are expensive and not cost-effective. Because only a small number of applications had APM installed, the call chain was incomplete when a problem occurred, and faults could not be located quickly.
Observability building process and planning
The construction of the observable system is mainly divided into four stages: investigation and analysis, scheme design, transformation and implementation, and online verification.
A good observable system needs to meet at least five requirements:
- Service resource tracking: Aggregate CPU, memory, network, disk, and I/O metrics for the nodes on which a service runs, so that abnormal indicators are easy to spot when a problem occurs.
- Top view of services: Rank services by call volume, request time, and hotspots, so that application teams can easily see which APIs are hot and which have a high request volume, and can plan their service resources accordingly.
- Call chain tracing: Correlate the upstream and downstream of a service, preferably non-intrusively, and link traces to logs broadly so that the failing link can be identified.
- Call duration distribution: Observe the time spent upstream and downstream of a service, including asynchronous calls, so that when a request is slow it is easy to tell whether the time is spent in the service itself or in a service it depends on.
- Database association: Help application teams see the SQL statements associated with an API, slow SQL, slow key queries in Redis, slow queries in MongoDB, and similar operations.
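As a rough illustration of the "top view of services" requirement, the following Python sketch ranks APIs by call volume and reports their average and maximum request time. The record format and the API paths are hypothetical. In practice this data would come from the APM or tracing backend rather than from an in-memory list.

```python
from collections import defaultdict


def top_apis(requests, n=3):
    """Return the n most-called APIs as (api, calls, avg_ms, max_ms) tuples."""
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0, "max_ms": 0.0})
    for api, duration_ms in requests:
        entry = stats[api]
        entry["calls"] += 1
        entry["total_ms"] += duration_ms
        entry["max_ms"] = max(entry["max_ms"], duration_ms)
    # Sort by call volume, highest first.
    ranked = sorted(stats.items(), key=lambda item: item[1]["calls"], reverse=True)
    return [
        (api, s["calls"], s["total_ms"] / s["calls"], s["max_ms"])
        for api, s in ranked[:n]
    ]


# Hypothetical sample: (API path, request time in milliseconds).
sample_requests = [
    ("/policy/quote", 120.0),
    ("/policy/quote", 80.0),
    ("/claim/submit", 300.0),
    ("/policy/quote", 100.0),
]

for api, calls, avg_ms, max_ms in top_apis(sample_requests):
    print(f"{api}: calls={calls} avg={avg_ms:.1f}ms max={max_ms:.1f}ms")
```

Ranking by call volume is only one possible ordering. Sorting by average or maximum request time instead would surface slow APIs rather than busy ones.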
Practice and implementation
Observability overall design idea
To meet the needs of business development, AIA needed to upgrade to a cloud-native technology architecture. Alibaba Cloud and AIA therefore cooperated closely on application containerization and observability. Based on the business situation and the monitoring pain points, and after dozens of discussions and walkthroughs, we settled on two key design ideas:
First, design the observable system from the top down, based on business value. Work downward from business monitoring to application monitoring to resource monitoring. With a bottom-up design, when a problem occurs the team wastes a lot of time and effort troubleshooting issues that never affect the customer, or the customer notices the problem before the monitoring system does. It is therefore necessary to design business monitoring for user experience and core transactions first.
Second, link tracing and application performance monitoring need to be combined with the business. For example, an application's API is translated into business-readable terms, such as the processing time and processing volume of the interface that is called when an insurance policy takes effect, together with the other services that the interface calls or depends on, so that the problem can be narrowed down. The application diagnostic tool Arthas, JVM tuning tools, application logs, and resource-level monitoring are then combined to determine whether the cause is a code issue or an underlying resource issue. Detecting the incident, locating its cause, and then confirming the problem itself improves both fault discovery and problem location.
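One way to picture the translation of API interfaces into business-readable terms is a simple lookup table applied to per-API statistics. The sketch below is only an illustration: the API paths, the business step names, and the shape of the statistics are all hypothetical and not taken from AIA's system.

```python
# Hypothetical mapping from technical API names to business-readable steps.
BUSINESS_NAMES = {
    "POST /api/policy/activate": "Policy takes effect",
    "POST /api/underwriting/check": "Underwriting check",
}


def to_business_view(api_stats):
    """Relabel per-API statistics with business step names."""
    view = []
    for api, stats in api_stats.items():
        calls = stats["calls"]
        view.append({
            # Fall back to the raw API name when no business label is known.
            "business_step": BUSINESS_NAMES.get(api, api),
            "calls": calls,
            "avg_ms": stats["total_ms"] / calls if calls else 0.0,
            "depends_on": sorted(stats.get("depends_on", [])),
        })
    return view


# Hypothetical statistics, as they might be exported from a tracing backend.
sample_stats = {
    "POST /api/policy/activate": {
        "calls": 4,
        "total_ms": 1000.0,
        "depends_on": ["payment-service", "document-service"],
    },
    "GET /api/health": {"calls": 10, "total_ms": 50.0},
}

for row in to_business_view(sample_stats):
    print(row)
```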
After confirming the top-down observable system, the next step is to define the scope of observable indicators.
Design of monitoring indicators for the whole life cycle
Observable indicators cover not only the running state but also the R&D state, so that they form a monitoring indicator system for the entire application life cycle.
After the cloud-native transformation, AIA's CI/CD pipeline is automated through Jenkins. To improve R&D efficiency, measurable indicators need to be abstracted, such as the number of daily application builds, build time, build success rate, deployment frequency, and deployment success rate, along with the basic metadata from which these indicators are derived.
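The R&D-state indicators named above can be derived from a list of build records. The following Python sketch shows one possible calculation of daily build count, build success rate, and average build time. The `BuildRecord` fields are assumptions made for illustration, not the actual schema used with Jenkins.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BuildRecord:
    # Hypothetical fields; a real collector would define its own schema.
    app: str
    day: date
    duration_s: float
    succeeded: bool


def daily_build_metrics(records, day):
    """Compute build count, success rate, and average duration for one day."""
    todays = [r for r in records if r.day == day]
    if not todays:
        return {"builds": 0, "success_rate": None, "avg_duration_s": None}
    successes = sum(1 for r in todays if r.succeeded)
    return {
        "builds": len(todays),
        "success_rate": successes / len(todays),
        "avg_duration_s": sum(r.duration_s for r in todays) / len(todays),
    }


sample_day = date(2022, 1, 10)
sample_records = [
    BuildRecord("policy-service", sample_day, 100.0, True),
    BuildRecord("policy-service", sample_day, 140.0, False),
    BuildRecord("claim-service", sample_day, 80.0, True),
    BuildRecord("claim-service", sample_day, 120.0, True),
    BuildRecord("claim-service", date(2022, 1, 11), 90.0, True),
]

print(daily_build_metrics(sample_records, sample_day))
```

Deployment frequency and deployment success rate could be computed the same way from deployment records.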
The running state is divided into three layers: system (resource) layer monitoring, application layer monitoring, and business layer monitoring, in increasing order of importance. The system layer mainly covers K8s cluster nodes, disks, networks, running Pods, and core cloud products. The application layer mainly covers application health, status codes, and performance indicators such as JVM and GC. The business layer mainly monitors core business indicators such as PV, UV, the number of insureds, the insured amount, and the number of signed orders. This layer directly determines the success or failure of the monitoring system design, because it best reflects business value.
The big picture of the observability architecture
The above picture shows the structure of AIA's observability system. The overall design idea is divided into three layers:
The first layer is the acquisition layer. To fit AIA's technical architecture and construction requirements, we chose to write a CI/CD pipeline data collector in Java. When developers use Jenkins to build or deploy applications, the collector stores all build and deployment data in the database. It also attaches associated tags when collecting data so that metadata can be shared. For example, the name of the application built by the pipeline must be the same as its K8s service name, so that when a build fails, the faulty application can be found quickly.
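The collector described here is written in Java and is not shown in this article. Purely to illustrate the idea of storing pipeline events keyed by a name shared with K8s, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the real database. The table layout, the function names, and the service names are all hypothetical.

```python
import sqlite3


def open_store():
    """Create an in-memory table for pipeline events (stand-in for the real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pipeline_events ("
        "app TEXT NOT NULL, stage TEXT NOT NULL, "
        "status TEXT NOT NULL, duration_s REAL NOT NULL)"
    )
    return conn


def record_event(conn, app, stage, status, duration_s, k8s_services):
    """Store one build or deploy event, enforcing the shared-name convention."""
    if app not in k8s_services:
        # The pipeline application name must match a K8s service name.
        raise ValueError(f"{app!r} does not match any K8s service name")
    conn.execute(
        "INSERT INTO pipeline_events VALUES (?, ?, ?, ?)",
        (app, stage, status, duration_s),
    )
    conn.commit()


def failed_apps(conn):
    """Return the names of applications that have at least one failed event."""
    rows = conn.execute(
        "SELECT DISTINCT app FROM pipeline_events "
        "WHERE status = 'FAILURE' ORDER BY app"
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical service names, as they might be listed from the cluster.
services = {"policy-service", "claim-service"}
store = open_store()
record_event(store, "policy-service", "build", "SUCCESS", 42.0, services)
record_event(store, "claim-service", "build", "FAILURE", 13.5, services)
print(failed_apps(store))
```

Because the stored `app` value equals the K8s service name, a failed build can be joined directly with runtime data for the same service.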
For APM probes, the community generally uses non-intrusive bytecode enhancement. However, because of the complexity of AIA's architecture, the SkyWalking probe could not fully cover AIA's scenarios. AIA also has high requirements for in-depth performance diagnosis and wanted to integrate capabilities such as Alibaba's open source Arthas and memory dumps. APM probes also affect application performance, so we finally chose the ARMS Agent, which has been tested at large scale during Double 11.
Monitoring indicators for cloud products, middleware, and clusters are mainly collected through Prometheus. Application logs are mainly collected by a DaemonSet, which uses fewer resources and is simpler to engineer than a Sidecar.
The second layer is the storage layer. R&D metadata and pipeline build data are stored in MySQL because they are small in volume and structured. Metrics data is stored in Alibaba Cloud's Prometheus product, and logs and call chain tracing data are stored in Alibaba Cloud's SLS product. As the business grows, a large amount of data will be generated, and these two products can ensure the stability, scalability, and high availability of the monitoring system. Both products are also serverless and billed on a pay-as-you-go basis, so no disk space is wasted.
The third layer is the unified display layer, which aggregates and displays data through Grafana. At that time, Alibaba Cloud had not yet launched its hosted version of Grafana, so we built our own. Version 8.0 or above is recommended. To ensure high availability, multiple instances need to be deployed and their configuration data stored centrally in a database. Then, based on the monitoring indicators designed earlier, select the corresponding data source, write the query statements, and use Grafana's rich charts for unified display.
Business monitoring is implemented through statistical analysis of the business logs and application logs collected in SLS. SLS has a rich SQL query function, and the statements are convenient to write. The results are then integrated into Grafana through the SLS Grafana plug-in, so that the final business statistics are displayed on Grafana dashboards.
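The kind of statistical analysis involved can be sketched locally. The Python example below computes PV, UV, and the number of signed orders from JSON log lines. It is a stand-in for the aggregation that an SLS SQL query would perform and does not use the SLS API. The log fields `user_id` and `event`, and the event name `order_signed`, are assumptions made for illustration.

```python
import json


def business_stats(log_lines):
    """Compute PV, UV, and signed-order count from JSON log lines."""
    page_views = 0
    users = set()
    signed_orders = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not valid JSON.
        if not isinstance(event, dict):
            continue
        page_views += 1
        if "user_id" in event:
            users.add(event["user_id"])
        if event.get("event") == "order_signed":
            signed_orders += 1
    return {"pv": page_views, "uv": len(users), "signed_orders": signed_orders}


# Hypothetical business log lines.
sample_logs = [
    '{"user_id": "u1", "event": "page_view"}',
    '{"user_id": "u2", "event": "page_view"}',
    '{"user_id": "u1", "event": "order_signed"}',
    "not a json line",
]

print(business_stats(sample_logs))
```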
Unified monitoring platform
The picture above shows the results. Large, medium, and small screens together provide command decision-making, R&D dashboards and application performance display, alarm push, and multi-dimensional monitoring capabilities.
The large screen on the left displays core indicators, including general indicators such as container cluster resource utilization, service Pod health, and connectivity, to support company decision-making.
The middle screen on the upper right mainly displays the pipeline's R&D efficiency indicators, application performance indicators, and the global call chain, helping R&D personnel work more efficiently and locate problems faster.
The small screen on the lower right sets alarm thresholds by comparison with historical data. When an anomaly occurs, an alarm is pushed to computers and mobile devices through DingTalk or SMS, helping operations and maintenance personnel find and handle the problem in time.
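The article does not say how the historical comparison is done. One common approach, shown here only as an assumption, is to set the threshold at the historical mean plus a multiple of the standard deviation. The function names, the multiplier `k`, and the sample values are hypothetical.

```python
from statistics import mean, stdev


def threshold_from_history(history, k=3.0):
    """Return mean + k * sample standard deviation of past values."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    return mean(history) + k * stdev(history)


def is_anomalous(value, history, k=3.0):
    """Flag a value that exceeds the threshold derived from history."""
    return value > threshold_from_history(history, k)


# Hypothetical response times in milliseconds from previous periods.
history_ms = [100, 102, 98, 101, 99]

print(f"threshold: {threshold_from_history(history_ms):.2f} ms")
print("120 ms anomalous:", is_anomalous(120, history_ms))
print("103 ms anomalous:", is_anomalous(103, history_ms))
```

A larger `k` produces fewer alarms at the cost of slower detection. A real alarm rule would also need to account for daily and weekly traffic patterns.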