Author

Lv Pengzhao is a senior operations engineer at Tencent. He was previously responsible for operating the storage access layer of QQ Album and the Small World business, and is now in charge of operations for the AI business.

Background

QQ Album is a mature QQ product that provides users with image storage and sharing. Since its launch, it has offered stable and fast image upload and download services. QQ Album was the first platform in the social business group to adopt TKE, and over the past year or so it has developed a set of TKE cloud migration practices that fit its own needs.

Business scenario

After QQ Album was fully refactored and migrated to the cloud, the new architecture is as follows:

Problems

Now that almost all of the album's modules have been containerized, a number of problems have surfaced in day-to-day use.

Resource utilization improvement and CICD optimization

1. Improve resource utilization

Migrating reused clusters and dedicated clusters to shared clusters

A reused cluster is a K8s cluster built from old machines, so it suffers considerable performance loss and frequent, serious resource contention. A dedicated cluster wastes a certain amount of resources because the specifications of its host machines are not high enough. For these two reasons, migrating the workloads of both reused clusters and dedicated clusters to shared clusters is the better choice. After the migration, resource utilization rose noticeably and the number of error codes dropped noticeably.

At the same time, HPA is used to scale capacity in and out elastically. Driven by CPU utilization, it releases resources during off-peak hours and automatically adds capacity during peak hours, which saves cost.
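The scaling decision can be illustrated with the replica formula that the Kubernetes HPA documents for a single metric: desired replicas = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance (0.1 by default). The sketch below is only an illustration of that arithmetic; the function name, the replica bounds, and the sample numbers are assumptions, not values from the album's actual HPA configuration.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for one CPU utilization metric.

    All parameter names and defaults here are illustrative assumptions.
    """
    ratio = current_cpu_percent / target_cpu_percent
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Peak hours: utilization above target, so the workload scales out.
    print(desired_replicas(10, 80, 50, max_replicas=50))
    # Off-peak hours: utilization below target, so replicas are released.
    print(desired_replicas(10, 20, 50))
```

Because utilization is measured against the pod's CPU request, the same real load produces a different scaling decision when the request value changes.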

2. Deployment strategy optimization

The original scheme built each workload in a namespace tied to a single availability zone, which did not account for disaster recovery when that availability zone fails. We therefore created workloads in multiple availability zones of a cluster and also created disaster recovery workloads in other regions, so that when a fault occurs at the machine room or region level, traffic can automatically switch to other machine rooms or regions.

3. Colorful Stone configuration

Every application uses Colorful Stone for configuration management. In the TKE scenario, a configuration change modifies the annotations field in the workload YAML, and the Downward API injects the annotation values as a volume into the container's Colorful Stone directory. Inside the container, configuration-reload.sh is called to apply the change. Because this process only modifies annotations, it does not cause the pod or container to be rebuilt, and the entire configuration change for a pod takes only a few seconds.
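As a rough illustration of the in-container side, the sketch below detects that the file projected by the Downward API has changed and then triggers a reload. It is a minimal sketch under assumptions: the article does not describe how configuration-reload.sh is actually invoked, and the function names and the polling approach are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return a SHA-256 digest of the file's current contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def poll_once(annotations_path, last_digest, on_change):
    """Call on_change() if the projected file changed; return the new digest.

    annotations_path is wherever the Downward API volume is mounted
    (a hypothetical path such as /etc/podinfo/annotations).
    """
    current = file_digest(annotations_path)
    if current != last_digest:
        on_change()
    return current
```

In a real container, `on_change` could be something like `lambda: subprocess.run(["sh", "configuration-reload.sh"], check=True)`, with `poll_once` called in a loop every few seconds. Since only file contents are compared, the pod keeps running throughout, which matches the no-rebuild behavior described above.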

4. ClickHouse log query

As the volume of album business logs grew, log storage costs grew with it, so we migrated the logs to ClickHouse. With an acceptable level of impact, ClickHouse needs only 30%-50% of the resources that ES required.

5. Tget dial testing

For business monitoring, we use Tget dial testing to set up dial-test alarms for both the OC domain name and the origin site domain name, which speeds up our response to VIP blocking or network problems.

6. One-stop access to Zhiyan

The album is being connected to Zhiyan for one-stop development. It uses Zhiyan's full R&D life cycle management service, from requirements through development, testing, release, and launch to online operation, to make the album's CICD more efficient.

Operational Development Capability

1. Self-service machine removal from alarm analysis on mobile

By using Zhiyan's alarm analysis callback interface, operators can manually unbind a machine from Polaris directly in the enterprise WeChat interface.

2. Quality Score Analysis

New error codes that are not normally monitored can raise the abnormal failure rate. The quality score analysis tool lets operations staff quickly find out which error code is currently causing the quality degradation.
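The idea behind such a tool can be sketched as ranking error codes by how much each one adds to the failure rate compared with a baseline window. This is only an illustrative sketch: the article does not describe the tool's internals, and the function name, the input shape, and the sample error codes are all hypothetical.

```python
def rank_error_codes(baseline_counts, current_counts, total_requests):
    """Rank error codes by their added share of the failure rate.

    baseline_counts / current_counts map an error code to its occurrence
    count in a normal window and in the current window (hypothetical inputs).
    Returns (code, added_failure_rate) pairs, largest contributor first.
    """
    added = {}
    for code in set(baseline_counts) | set(current_counts):
        delta = current_counts.get(code, 0) - baseline_counts.get(code, 0)
        if delta > 0:
            added[code] = delta / total_requests
    return sorted(added.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    baseline = {"-1001": 50, "-2003": 10}
    current = {"-1001": 55, "-2003": 10, "-7777": 400}
    for code, rate in rank_error_codes(baseline, current, total_requests=100_000):
        print(code, f"{rate:.4%}")
```

A code that is absent from the baseline, like the made-up "-7777" above, ranks first even though no alarm was ever configured for it.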

Cloud Native Maturity Score

1. Business stability comes first

While improving the cloud native score, we found that the bottleneck of many modules is not CPU but traffic or memory. However, the current cloud native scoring method only takes CPU into account. Business stability must come before the score, so when formulating the HPA scale-out and scale-in strategy, all of these dimensions need to be considered, not CPU alone.
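When an HPA is configured with several metrics, Kubernetes computes a desired replica count for each metric and uses the largest one, so the bottleneck dimension decides the result. The sketch below illustrates that rule; the metric names and numbers are invented for the example and do not come from the album's real workloads.

```python
import math


def desired_replicas_multi(current_replicas, metrics):
    """Pick the replica count demanded by the most constrained metric.

    metrics maps a metric name to (current_value, target_value).
    Returns (desired_replicas, limiting_metric_name).
    """
    proposals = {
        name: math.ceil(current_replicas * current / target)
        for name, (current, target) in metrics.items()
    }
    limiting = max(proposals, key=proposals.get)
    return proposals[limiting], limiting


if __name__ == "__main__":
    # Hypothetical module whose CPU is idle but whose traffic is saturated.
    sample = {
        "cpu_percent": (30, 60),
        "memory_percent": (85, 70),
        "network_mbps": (900, 600),
    }
    print(desired_replicas_multi(10, sample))
```

With CPU as the only metric, this hypothetical module would be scaled in to 5 replicas even though its traffic already calls for more than the current 10.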

2. Reduce the request value

For traffic-bound modules such as http, preupload and proxy, the workload's request value can be reduced appropriately, which raises CPU utilization immediately. This must be combined with stress testing to confirm that CPU does not become a bottleneck after the request is reduced.
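The effect is simple arithmetic: utilization is measured as usage divided by request, so the same usage against a smaller request gives a higher percentage. The sketch below shows that calculation plus a check against a stress-test peak. The headroom factor and all numbers are assumptions chosen for illustration, not recommendations from the article.

```python
def cpu_utilization(usage_cores, request_cores):
    """CPU utilization as a fraction of the requested cores."""
    return usage_cores / request_cores


def request_is_safe(new_request_cores, stress_peak_cores, headroom=0.2):
    """Check that the reduced request still covers the stress-test peak.

    headroom is a hypothetical safety margin (20% above the observed peak).
    """
    return new_request_cores >= stress_peak_cores * (1 + headroom)


if __name__ == "__main__":
    usage = 0.5  # average cores actually used by a hypothetical pod
    print(cpu_utilization(usage, 2.0))  # before: request of 2 cores
    print(cpu_utilization(usage, 1.0))  # after: request halved to 1 core
    print(request_is_safe(1.0, stress_peak_cores=0.7))
```

Halving the request doubles the reported utilization without changing what the pod actually does, which is why the stress-test check matters.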

Summary

With the album's TKE workloads moved from other platforms to shared clusters, and with the deployment optimizations and operations development capabilities described above, the results are as follows:

  1. Cloud native maturity has increased significantly.
  2. The album platform now uses a cumulative 50,000+ cores on TKE.
  3. The Zhiyan CICD process has been connected end to end, significantly improving the efficiency of daily development and operations.
