Author
Yuhuliu, Tencent R&D engineer, focusing on storage, big data, and cloud native technologies.
Summary
As the medical information business grew rapidly, it came to comprise dozens of businesses and thousands of services covering different scenarios, users, and channels. To meet users' diverse needs efficiently, the medical technology team migrated to the cloud on TKE and adopted the Coding DevOps platform and cloud observability technology to improve R&D efficiency and reduce operation and maintenance costs. This article describes our practice and experience during the migration, along with some of the considerations and choices behind it.
Business background
- stage1: Medical information mainly comprises core services such as medical classics, doctors, and medicine. Medical classics provides medical content and disseminates medical knowledge; doctors connects doctors with patients; medicine serves a large number of pharmaceutical companies. Early on, we built a large number of back-end services on the TAF platform, which let us establish the initial business quickly. Because there were many businesses, and many of them needed to run in multiple regions, we ended up deploying multiple business clusters on the TAF platform. At this stage, release, operation and maintenance, and troubleshooting were entirely manual, and efficiency was low.
Business on the cloud
- stage2: As the business scaled rapidly, traditional development and operations practices increasingly constrained business iteration in terms of agility, resources, and efficiency. With the company's self-developed cloud project advancing, embracing cloud native became the clear choice for the business: K8s meets diverse business needs and schedules different resources flexibly, and the existing mature DevOps platform supports agile iteration. The medical back-end team therefore began migrating all services to the cloud.
- Before going to the cloud, several issues had to be considered:
1: With so many services, how should the code be managed?
2: How can problems be located and diagnosed quickly after going to the cloud?
3: How should the monitoring and alerting platform be chosen?
4: How should the base image be chosen?
About service code management
Use git for version control, create project groups by business, give each service its own code repository, and apply the same naming convention to all repository names.
About troubleshooting
The research cloud already offers mature ELK services. A business only needs to write its logs to an agreed directory; after filebeat collects them, the ETL logic imports them into Elasticsearch. This approach also supports collecting front-end and back-end service logs at the same time, the technology is relatively mature, and existing component capabilities are reused. Adding a traceid to each request makes it easy to locate a problem along the whole call chain.
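As an illustration of the traceid idea, here is a minimal sketch of a service writing one JSON log line per event with the request's trace id attached. The field names, the `current_trace_id` variable, and the `handle_request` helper are assumptions made for this example and are not taken from our actual services; in a container, the handler would write to the directory that filebeat collects.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Trace id of the request currently being handled (hypothetical name).
current_trace_id = ContextVar("current_trace_id", default="-")


class TraceIdJsonFormatter(logging.Formatter):
    """Render each record as one JSON line that carries the trace id."""

    def format(self, record):
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "service": record.name,
                "traceid": current_trace_id.get(),
                "message": record.getMessage(),
            }
        )


def build_logger(service, stream=None):
    """Create a logger for one service; stream=None means stderr."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(TraceIdJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace id if present, otherwise start a new trace."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("request handled")
    finally:
        # Restore the previous value so ids never leak between requests.
        current_trace_id.reset(token)
    return trace_id
```

Because every service in the chain logs the same traceid, a single Elasticsearch query on that field returns the full path of one request.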
About the monitoring and alerting platform
CSIG provides a CMS platform based on log monitoring. After business logs are imported into the CMS, monitoring and alerts can be configured on the reported logs, and each business can define its own monitoring dimensions and metrics. We use dimensions such as caller, callee, and interface name, and metrics such as call volume, latency, and failure rate, which covers our monitoring and alerting needs. Log-based monitoring reuses the same data collection pipeline, so the system architecture stays unified and simple.
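To make the dimensions and metrics concrete, the sketch below aggregates call records by (caller, callee, interface) and derives call volume, average latency, and failure rate. It is not how CMS is implemented; the record fields (`caller`, `callee`, `interface`, `cost_ms`, `code`) and the 5% threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallStats:
    calls: int = 0
    failures: int = 0
    total_ms: float = 0.0

    @property
    def failure_rate(self):
        return self.failures / self.calls if self.calls else 0.0

    @property
    def avg_ms(self):
        return self.total_ms / self.calls if self.calls else 0.0


def aggregate(records):
    """Group call records by the (caller, callee, interface) dimensions."""
    stats = defaultdict(CallStats)
    for record in records:
        key = (record["caller"], record["callee"], record["interface"])
        entry = stats[key]
        entry.calls += 1
        entry.total_ms += record["cost_ms"]
        if record["code"] != 0:  # assumption: a non-zero code means failure
            entry.failures += 1
    return dict(stats)


def alerts(stats, max_failure_rate=0.05):
    """Return the dimension keys whose failure rate exceeds the threshold."""
    return sorted(k for k, s in stats.items() if s.failure_rate > max_failure_rate)
```

Since the input is the same log stream that is used for troubleshooting, no separate metrics collection path is needed, which is the simplicity the log-based approach buys.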
About the base image
To help businesses move to the cloud quickly at the initial stage, with a unified way of starting services and collecting and reporting data, the base image needs to be prepared in advance: create the required directories and provide scripts and tools for quick onboarding. We provide base images for different languages and versions that bundle supervisord and filebeat, and use supervisord to start filebeat and the business service.
DevOps
- stage2: While moving to the cloud, we worked with our QA colleagues to turn the formerly manual steps of the development process into pipelines, which improved iteration efficiency and standardized the process; unit tests and automated dial tests improved service stability. With a unified pipeline, development and deployment time dropped from hours to minutes.
We mainly use the Coding platform here. To distinguish environments, we set up four pipeline templates (development, testing, pre-release, and official) and introduced a merge mechanism that adds a manual code review stage.
In the merge stage: an MR hook automatically polls the code review result, so that code proceeds to the next step only after the review passes (requirements may differ between teams).
In the CI stage: code quality analysis improves code standardization, and unit testing ensures service quality.
In the CD stage: manual approval and automated dial testing improve service stability.
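The merge-stage gate described above can be sketched as a small polling loop. This is only an illustration under stated assumptions: `fetch_status` is a hypothetical callable standing in for whatever query the MR hook makes, and the status strings `"approved"` and `"rejected"` are invented for the example rather than taken from the Coding platform.

```python
import time


def wait_for_review(fetch_status, interval_s=30.0, timeout_s=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll the review result until it is decided or the timeout expires.

    fetch_status is a hypothetical callable returning "approved",
    "rejected", or any other string while the review is still pending.
    Returns True if the pipeline may continue, False if it must stop.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "approved":
            return True
        if status == "rejected":
            return False
        if clock() >= deadline:
            raise TimeoutError("code review was not completed in time")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real waiting; a pipeline step would call it with the defaults and fail the build on `False` or on timeout.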
Improved resource utilization
- stage3: After the business as a whole moved to the cloud, many businesses required multi-region deployment (Guangzhou, Nanjing, Tianjin, Hong Kong), and each service needed four environments (development, testing, pre-release, and official). Our first inventory after the migration counted 3000+ distinct workloads. Because traffic to the different businesses was highly uncertain, resources were initially allocated for the ideal case, which caused a lot of waste.
To improve overall resource utilization, we carried out a series of optimizations, roughly following these guidelines:
HPA scales business containers out and in dynamically. If traffic is still arriving while a container is stopping, or is routed to a container before it has finished starting, requests fail. The preStop hook and readiness detection therefore need to be enabled on TKE in advance.
1: Stop gracefully: wait for the Polaris and CL5 routing caches to expire before the process stops;
Entry: TKE -> workload -> specific business -> update workload
If service discovery uses CL5, a preStop wait of 70s is recommended; with Polaris, 10s is sufficient
2: Readiness and liveness detection: traffic is routed to the container only after the process has started;
Entry: TKE -> workload -> specific business -> update workload, then configure detection methods and intervals to suit each business.
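We configure these settings in the TKE console, but the two steps above correspond to standard Kubernetes pod fields, which the following sketch builds as a plain dictionary. The preStop waits (70s for CL5, 10s for Polaris) come from the recommendation above; the health check path, port, probe intervals, and the extra 30s of grace period are assumptions for illustration only.

```python
import json

# preStop waits recommended above, keyed by service discovery system.
PRESTOP_SLEEP_S = {"cl5": 70, "polaris": 10}


def graceful_pod_spec(name, image, discovery, health_path="/healthz", port=8080):
    """Build a pod spec fragment with a preStop wait and health probes.

    health_path and port are hypothetical; use the service's real endpoint.
    """
    wait_s = PRESTOP_SLEEP_S[discovery]
    probe = {"httpGet": {"path": health_path, "port": port}}
    return {
        # The preStop hook runs inside the termination grace period, so the
        # period must exceed the wait (the Kubernetes default is 30s).
        "terminationGracePeriodSeconds": wait_s + 30,
        "containers": [
            {
                "name": name,
                "image": image,
                # Keep serving while routing caches expire (needs `sleep` in the image).
                "lifecycle": {"preStop": {"exec": {"command": ["sleep", str(wait_s)]}}},
                # No traffic is routed until this probe succeeds.
                "readinessProbe": dict(probe, periodSeconds=5, failureThreshold=3),
                # The container is restarted if this probe keeps failing.
                "livenessProbe": dict(probe, initialDelaySeconds=30, periodSeconds=10),
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(graceful_pod_spec("doctor-api", "example/doctor-api:1.0", "cl5"), indent=2))
```

The point worth noting is the grace period: with a 70s preStop wait and the default 30s grace period, the container would be killed before the wait finishes, so the two values have to be raised together.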
These adjustments greatly improved our resource utilization. TKE's elastic scaling largely solves the shortage of resources during local traffic peaks while keeping business access normal, which avoids wasted resources and improves service stability. However, the large number of environments still causes some loss.
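For readers unfamiliar with how HPA decides on a replica count, the sketch below follows the formula given in the Kubernetes documentation: the desired count is the current count scaled by the ratio of the observed metric to the target, rounded up, with no change while the ratio is within a tolerance (0.1 by default). The metric values and replica bounds in the usage note are made-up examples, not our production settings.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count HPA would aim for, per the documented formula.

    desired = ceil(current * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0,
    then clamped to [min_replicas, max_replicas].
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: do not scale
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 4 replicas at 90% average CPU against a 60% target, `desired_replicas(4, 90, 60, 2, 10)` gives 6; when load falls to 20%, 10 replicas shrink to 4. This is how resources follow traffic instead of being sized for the ideal case.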
Observability Technology
- stage4: In the initial stage we relied on a log-based approach (log/metric/tracing), which met the early goals of migrating quickly and troubleshooting more efficiently. As the business grew, however, log streams consumed more and more resources, log backlogs became the norm at peak times, CMS alerts could arrive half an hour after the actual incident, and the cost of maintaining ELK rose sharply. Cloud native observability became necessary. We adopted the observability solution recommended by Coding application management, which collects business data through a unified coding-sidecar:
Monitoring: cloud monitoring middle platform
Log: CLS
Tracing: APM
With these platforms integrated, we find, locate, and resolve problems much more efficiently, and the operation and maintenance cost of the business has dropped considerably. Monitoring and tracing also surfaced many latent system problems, and fixing them improved service quality.
End
Finally, I would like to thank all the developers for their hard work during the migration to the cloud, and the R&D leaders for their strong support.