Author: Zhuang
Review and proofreading: Campfire
Editing & Typesetting: Wen Yan

It took us just 7 days to go from first learning about SAE to a full launch. The core application, the API gateway, went live within 3 days; after verification, 100% of its traffic was switched to SAE on day 5; and the remaining 30 systems were quickly migrated on days 6 and 7. The whole process went very smoothly. Since adopting SAE, our operations efficiency has improved by 70%, costs have dropped by more than 40%, and scale-out efficiency has improved more than tenfold. These are the tangible changes SAE has brought us.
- Zhuang Xulin, CTO of Pumpkin Film

Founded in 2015, Pumpkin Film is a streaming platform that has grown very rapidly in China over the past two years. Its ad-free, purely subscription-based business model first earned it a following among movie fans. It then relied on strong community interaction (AI-powered recommendations, film review discussions, online "cloud viewing" in screening rooms, and so on) to grow its membership quickly and gain share in the streaming market. Next, it plans to gradually expand into a diversified video platform with documentaries, various kinds of self-produced programs, and more.

In the Internet industry, traffic and product life cycles can change dramatically as the market shifts, which places higher demands on a company's ability to innovate and to experiment at low cost. Pumpkin Film's overall application architecture has kept evolving with the rapid growth of the business. Today I will share this journey in three parts:

Pain points: A review of Pumpkin Film's business, architecture, and pain points at the time.

Selection: Our thinking and decisions during technology selection, and why we ultimately chose SAE.

Actual combat: How we put this into practice step by step and moved the entire platform, with hundreds of servers and more than 30 systems, fully onto serverless in just 7 days.

Pain points

From day one, Pumpkin Film's application architecture has been built on Alibaba Cloud; we are a typical company that was "born and raised on the cloud". The bottom layer runs on Alibaba Cloud ECS, and the infrastructure, middleware, databases, big data services, and cloud security all use Alibaba Cloud products to get the most value out of the cloud. On top of these basic services sits our self-developed capability center, which builds on algorithms and video enhancement to provide services such as membership, adaptive bitrate, search, movie reviews, and screening rooms. These services reach users through SLB global scheduling and secure access via WAF. The top layer serves multiple terminals, covering essentially every device type on the market: mobile phones, tablets, web pages, various clients, and in-vehicle devices.

1.png

Pumpkin Film's initial application architecture

However, as the business kept growing, the ECS-based operations architecture gradually exposed a number of problems, mainly:

1) Scaling out is too slow: During traffic peaks, new machines have to be purchased on the spot and then deployed one by one, which is very time-consuming and makes it impossible to guarantee the system SLA.

2) Releases are slow and error-prone: Frequent releases are the norm for Internet companies, but deploying to hundreds of servers one by one is very slow, and a small slip causes errors. We also tried scripted deployment, and it did run smoothly at first. But as the number of server groups grew and the scripts kept being modified, it became very hard to locate the problem whenever a run got stuck midway.

3) High system maintenance cost: Operating a traditional cluster is tedious and demands a lot from staff: they must be proficient in Lua, Ansible scripts, and the like, and also understand cloud network configuration, monitoring, and operations. In the early days the company had no full-time operations staff, so this work consumed a lot of developer effort and was very painful.

4) Capacity planning is hard and resource utilization is low: In the streaming industry, peaks generally occur at noon or in the evening and traffic is relatively low at other times, but it is hard to provision capacity accurately. We generally kept a fixed fleet of servers sized for the peak over long periods, so resource utilization was relatively low.

5) Assigning permissions is cumbersome: In an enterprise with multiple tenants, permission isolation is often a real headache. Especially when new hires join or teams debug jointly, configuring user groups, RAM permissions, and login and connection methods for new machines is very tedious, and the account administrator often becomes a bottleneck.
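To make the capacity problem in point 4 concrete, here is a minimal sketch of how utilization falls when a fleet is sized for the peak and kept fixed all day. The hourly demand figures are invented purely for illustration (a noon peak and an evening peak); they are not Pumpkin Film's real traffic.

```python
def fixed_fleet_utilization(hourly_demand, fleet_size):
    """Average share of a fixed fleet that is actually needed over the period."""
    return sum(hourly_demand) / (fleet_size * len(hourly_demand))


# Hypothetical demand per hour, expressed as "servers needed" (illustrative only):
# quiet hours need 10 servers, the noon and evening peaks need 40.
hourly_demand = [10] * 11 + [40] * 2 + [10] * 6 + [40] * 3 + [10] * 2

# Sizing the fleet for the peak means paying for 40 servers around the clock.
peak_sized_fleet = max(hourly_demand)
utilization = fixed_fleet_utilization(hourly_demand, peak_sized_fleet)
print(f"fleet of {peak_sized_fleet} servers, average utilization {utilization:.1%}")
```

With a demand curve shaped like this, most of the fixed fleet sits idle outside the two peaks, which is the waste that elastic scaling is meant to remove.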

A hit movie accelerates Pumpkin Film's thinking on a technology upgrade

I believe many companies face the same problems we did, and that these problems hold back their growth. However, developers all have a certain inertia: as long as nothing goes wrong, they keep muddling along. What really made us determined to upgrade our technology was a movie released in 2019, and we have it to thank.

I received a call from a colleague that morning saying the business was under heavy load. I said, "That's impossible. Traffic is usually low in the morning." He said, "I don't know, but all kinds of services have started raising alarms. I've activated the contingency plan and I keep buying machines." Later I learned that new user registrations had exceeded 800,000 in one hour (more than 5 times the usual peak), which was both a huge challenge and a huge opportunity for Pumpkin Film. Soon the servers went down: the API gateway, the entry point for all traffic, could not hold up, and the back-end services and databases became abnormal.

Everyone was on edge as we began an emergency scale-out of the entire link: buying ECS instances, uploading scripts to the new machines, running the scripts, scaling the database, and so on. Throughout the process users were affected intermittently, and some could not access the service at all. It took a full 4 hours before we finally recovered completely.

Because all of the platform's users are paying customers, our customer service lines were busy from morning to night. Users kept complaining that the service had been unusable that morning and demanded compensation.

2.png

So while a sudden surge like this is good training for the team, it is a fairly large loss for the company. We compensated every user who opened the app that day by giving them free access, which was also a loss at the business level. In the end, though, thanks to this movie, Pumpkin Film's new registrations kept climbing and business growth was obvious. But looking back at the whole operations process, those 4 hours were far too nerve-racking, and we did not want to go through that again.

Selection

To address the problems above, we started thinking about how to transform our architecture. At the time there were two internal proposals, but both had drawbacks:

Solution 1: Deeper script optimization. This could solve some repetitive operations work, but the maintenance cost was too high, and it is very hard to hire operations engineers who can write really good scripts. We had been using scripts all along, but there was no way to achieve full automation: for an emergency scale-out, ECS instances still had to be purchased manually.

Solution 2: Self-built K8s. This could solve high-density deployment, greatly reduce costs, and scale application instances automatically, but its blast radius is larger than that of ECS, which worried us somewhat. Most importantly, the learning cost of K8s is too high. It is easy to set up an environment and get something running, but using it seriously in production requires a dedicated team, which clearly could not be put together in the short term.

3.png

Later, after an introduction from our Alibaba Cloud colleagues, a third solution soon emerged: using SAE, which became our final choice.

Solution 3: Alibaba Cloud Serverless Application Engine (SAE for short). Our first impression of SAE was that it is easy to get started with and saves time and effort: without any code changes, a WAR/JAR package can be uploaded and deployed directly, and there are no machines to buy or maintain, which saves a lot of development time. Moreover, SAE is backed by a very large elastic resource pool: you can use as much as you need, whenever you need it, which suits Pumpkin Film's business scenarios very well.

4.png
SAE first impression

Actual combat

ROUND 1: CI/CD pipeline to speed up iteration

Before officially migrating the business, the first thing we did was set up a CI/CD pipeline based on Travis CI + SAE to improve release efficiency. Previously, when we pushed code to GitHub, Travis CI automatically ran the integration build and unit tests; once the tests passed, the build artifact was uploaded to our private OSS and then deployed to ECS. With SAE, we only needed to change the "deploy to ECS" step into "deploy to SAE", which is very simple and has no impact on the development side. When deploying an application, you can also choose among release strategies such as single-batch, multi-batch, and canary, and stop and roll back immediately if something goes wrong, which is very efficient.

5.png
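The release strategies mentioned above (single batch, multiple batches, canary, with stop and rollback on failure) can be sketched roughly as follows. This is a generic illustration of the idea, not SAE's implementation: the function names and the `deploy` and `healthy` callbacks are hypothetical stand-ins for whatever the platform actually does.

```python
from typing import Callable, Dict, List


def plan_batches(instances: List[str], batch_count: int) -> List[List[str]]:
    """Split instances, in order, into batch_count batches of near-equal size."""
    if batch_count < 1:
        raise ValueError("batch_count must be at least 1")
    if not instances:
        return []
    batch_count = min(batch_count, len(instances))
    base, extra = divmod(len(instances), batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        size = base + (1 if i < extra else 0)  # earlier batches absorb the remainder
        batches.append(instances[start:start + size])
        start += size
    return batches


def plan_canary(instances: List[str], canary_count: int, batch_count: int) -> List[List[str]]:
    """Release to a small canary group first, then the rest in batches."""
    canary, rest = instances[:canary_count], instances[canary_count:]
    return [canary] + plan_batches(rest, batch_count)


def release(batches: List[List[str]],
            deploy: Callable[[str], None],
            healthy: Callable[[str], bool]) -> Dict[str, object]:
    """Deploy batch by batch; stop at the first unhealthy batch and report
    which instances need rolling back (most recently deployed first)."""
    done: List[str] = []
    for batch in batches:
        for instance in batch:
            deploy(instance)
            done.append(instance)
        if not all(healthy(instance) for instance in batch):
            return {"status": "rolled_back", "rolled_back": list(reversed(done))}
    return {"status": "ok", "released": done}
```

A single-batch release is simply `plan_batches(instances, 1)`. The point of the canary and multi-batch variants is that a bad build is caught after touching only part of the fleet.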

ROUND 2: Launching the first application, the API gateway

The next step was to choose the first application to migrate for real. We made a bold decision: migrate the API gateway first. The API gateway is our core application and the one under the most load. Why did we choose it?

First, it is deployed across the country. Second, it runs on a large number of ECS clusters, so we only needed to use the scheduling system to shift part of the traffic to SAE; if SAE proved unstable, the traffic could be switched back to ECS instantly with almost no impact on users. Third, as the entry point for all traffic, the API gateway sees a lot of burst traffic, which matches SAE's elasticity and let us test as thoroughly as possible whether SAE suited our business.

When we first went to production we were still very worried. To guard against accidents, we decided to run the original ECS instances and the SAE instances side by side, so that if either side had a problem we could switch traffic immediately, with the ECS instances then serving as the disaster recovery link.

6.png
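The side-by-side setup described above amounts to a weighted traffic split with an instant fail-back. The sketch below only illustrates that idea; the `TrafficSwitch` class is hypothetical and says nothing about how Pumpkin Film's scheduling system or SLB is actually configured.

```python
import random


class TrafficSwitch:
    """Routes each request to SAE with probability sae_weight, otherwise to ECS."""

    def __init__(self, sae_weight: float = 0.0, rng: random.Random = None):
        self._rng = rng or random.Random()
        self.set_sae_weight(sae_weight)

    def set_sae_weight(self, weight: float) -> None:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0.0 and 1.0")
        self.sae_weight = weight

    def route(self) -> str:
        # random() returns a value in [0.0, 1.0), so weight 0.0 never picks SAE
        # and weight 1.0 always does.
        return "SAE" if self._rng.random() < self.sae_weight else "ECS"

    def fail_back_to_ecs(self) -> None:
        """Emergency switch: send all traffic back to the ECS fleet."""
        self.set_sae_weight(0.0)


# Example: start by sending roughly 10% of requests to SAE.
switch = TrafficSwitch(sae_weight=0.1)
backend = switch.route()
```

Raising the weight step by step to 1.0 corresponds to the gradual cutover, and `fail_back_to_ecs()` is the escape hatch that keeps the old fleet useful as a disaster recovery path.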

ROUND 3: API gateway scales out and in automatically to cope with traffic surges

Stable operation under normal traffic is not enough to prove that SAE is reliable. We therefore focused on verifying SAE's elasticity under traffic surges in both the test and production environments.

We ran a systematic stress test at 5 times the traffic of the last hit movie, and configured the CPU, memory, QPS, and RT thresholds measured in the stress test as SAE's elastic scaling rules. We then watched the application's metrics in real time on the SAE console and found them all normal. SAE really did scale out automatically within seconds at peak times and scale in on demand during off-peak periods. As shown in the figure below, SAE saves about 40% in hardware cost compared with our previous approach of keeping ECS instances permanently provisioned.

7.png
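As a rough illustration of threshold-based scaling on CPU, memory, QPS, and RT, the sketch below uses a simple proportional rule: the metric furthest above its threshold decides how many instances are wanted. This is a generic technique chosen for the example, not SAE's actual scaling algorithm, and the threshold values and metric names are hypothetical.

```python
import math
from typing import Dict


def desired_instances(current: int,
                      observed: Dict[str, float],
                      thresholds: Dict[str, float],
                      min_instances: int,
                      max_instances: int) -> int:
    """Return the instance count that would bring the most stressed metric
    back to its threshold, clamped to [min_instances, max_instances]."""
    # A ratio above 1.0 means the metric is over its threshold.
    pressure = max(observed[name] / limit for name, limit in thresholds.items())
    target = math.ceil(current * pressure)
    return max(min_instances, min(max_instances, target))


# Hypothetical thresholds, as if taken from a stress test (per-instance averages).
THRESHOLDS = {"cpu_percent": 60.0, "memory_percent": 80.0, "qps": 1000.0, "rt_ms": 400.0}

peak = {"cpu_percent": 90.0, "memory_percent": 50.0, "qps": 1000.0, "rt_ms": 200.0}
off_peak = {"cpu_percent": 15.0, "memory_percent": 20.0, "qps": 100.0, "rt_ms": 100.0}

scale_out_to = desired_instances(4, peak, THRESHOLDS, min_instances=2, max_instances=15)
scale_in_to = desired_instances(4, off_peak, THRESHOLDS, min_instances=2, max_instances=15)
```

The minimum keeps a safety floor during troughs, and the maximum caps cost if a metric misbehaves; both limits are assumptions of this sketch.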

With that, our first application, the API gateway, was migrated successfully, and the old ECS instances were taken completely offline. With its stable and efficient performance, Alibaba Cloud SAE showed us that our earlier worries were unnecessary, so we went on to migrate the other business lines one after another.

ROUND 4: Out-of-the-box full link monitoring & diagnosis capabilities

During the migration we occasionally ran into applications in an abnormal state. SAE's built-in ARMS monitoring gave us great support in analyzing, troubleshooting, and resolving online problems, saving a lot of troubleshooting time. In SAE you can view the topology of application call relationships and drill down to slow SQL, slow services, and method call stacks, and from there locate code-level problems.

Beyond that, SAE also adopted our suggestion and now provides TopN application reports across various dimensions. One person can easily operate hundreds of applications, and it is clear at a glance which applications currently have the biggest problems and deserve the most attention.

8.png

ROUND 5: [enterprise-level features] permission isolation & approval

SAE also helped us solve a long-standing problem: permission isolation and approval.

Let's look at the comparison chart. In the old ECS mode, when another team needed access to an application, we had to configure user groups and grant RAM permissions to different people at machine granularity. If operations and deployment were involved, we also had to modify script configurations and set up the user name, password, and operation logging for the new machine on the jump server. With many people and many machines, permission configuration became very tedious. Moreover, operations actions went through no approval, so the risks were uncontrollable: developers held the machines' user names and passwords, and releases were fairly arbitrary.

With SAE, everything becomes simple. Permissions are granted at application granularity, and each application only needs to be configured once, which saves worry and effort. SAE also provides an operations approval process based on main accounts and sub-accounts: after a sub-account initiates an operation on a resource, the main account must approve it before execution continues; otherwise SAE aborts the task. This effectively reduces the quality risk that arbitrary releases bring to production.

9.png
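The approval flow between sub-accounts and the main account can be pictured as a small state machine. The sketch below is only a model of the behaviour described above; `ChangeRequest` and `ApprovalGate` are hypothetical names, not part of any SAE or RAM API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChangeRequest:
    sub_account: str
    app: str
    action: str
    status: str = "pending"  # pending -> approved | aborted


class ApprovalGate:
    """Operations started by a sub-account run only after main-account approval."""

    def __init__(self, main_account: str):
        self.main_account = main_account

    def submit(self, sub_account: str, app: str, action: str) -> ChangeRequest:
        return ChangeRequest(sub_account=sub_account, app=app, action=action)

    def review(self, request: ChangeRequest, reviewer: str, approve: bool) -> ChangeRequest:
        if reviewer != self.main_account:
            raise PermissionError("only the main account can review requests")
        request.status = "approved" if approve else "aborted"
        return request

    def execute(self, request: ChangeRequest, run: Callable[[ChangeRequest], None]) -> bool:
        """Run the operation only if it was approved; otherwise do nothing."""
        if request.status != "approved":
            return False
        run(request)
        return True
```

Because permissions and approvals hang off the application rather than individual machines, adding a person or a team is one change per application instead of one change per server.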

ROUND 6: Landing completed

Through continuous trial and verification with the SAE platform, by day 7 all of our applications had gone fully serverless: ALL ON SAE. The entire migration was smooth, with no transformation cost and zero failures, and it required only 1 to 2 R&D engineers.

We analyzed the overall value SAE brings to Pumpkin Film, which can be summarized in several points:

1) Faster scaling: We no longer have to worry about insufficient capacity at peaks or waste in troughs. SAE scales automatically and adjusts the number of instances accordingly.

2) Faster releases: The CI/CD pipeline improves release efficiency, and the Cloud Toolkit plug-in enables one-click deployment from a local machine to SAE in the cloud.

3) Less operations worry: Operations-free does not mean no operations at all. For us, by the time we receive an alarm, log on to the console, and get ready to start repairs, the work is basically already done. The whole process is faster than manual operation.

4) Faster troubleshooting: SAE's built-in monitoring capabilities save us a lot of time in troubleshooting.

By our calculation, compared with our previous traditional server model, development efficiency has increased by 70%, cost has dropped by more than 40%, and scale-out efficiency has increased more than tenfold.

10.png

Summary & Expectation

Finally, we would like to share some lessons learned and pitfalls we hit along the way.

1) Multi-zone deployment: All of our applications used to be configured for a single zone only (zone A), and we paid for it. Later, at the SAE team's suggestion, we switched them all to multi-zone deployment for disaster recovery. We strongly recommend paying attention to this.

2) Batch/canary release strategy: Multi-instance applications must be released in batches or as a canary (gray) release so that an abnormal situation does not affect the whole business, and every release must be fully tested.

3) Health checks: An application with a custom health check script must have that script verified beforehand, so that a problem in the script itself does not keep the application from ever starting.

4) Set scaling thresholds sensibly: Scaling thresholds need repeated testing and should be set only after stress-testing the system. Adjust them when necessary; it is better to scale out a few extra instances than to have an online failure.

5) Configure SLS logs and ARMS alarms: We recommend configuring SLS logging and ARMS alerting, which greatly help with locating problems afterwards.
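For point 3, one way to pre-check a custom health check script is to run it locally against a known-good instance before wiring it into the platform. The sketch below assumes the check is a command whose exit code 0 means healthy; that convention, the `precheck` helper, and the timeout value are assumptions for illustration, not something the SAE documentation is quoted on here.

```python
import subprocess
import sys
from typing import List, Tuple


def precheck(command: List[str], timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run a health check command once and report whether it would pass.

    Assumption: exit code 0 means healthy; a timeout or a command that
    cannot be started counts as a failure of the script itself.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except OSError as exc:
        return False, f"could not start: {exc}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"


# Example with a stand-in command; replace it with the real health check script.
passed, reason = precheck([sys.executable, "-c", "import sys; sys.exit(0)"])
```

If the pre-check fails while the application is known to be healthy, the script is the problem, and it is far cheaper to find that out here than through an application that never becomes ready.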

We also have high expectations for SAE. For example, we would like Java cold start time to be optimized, since some of our applications take 1-2 minutes to start (we later learned that SAE has already implemented this). I also hope SAE will go a step further and offer users a complete serverless architecture: not only the application layer but also databases, networking, and so on, so that we can focus entirely on business development. This may be hard to implement and will take time, but we have great confidence in SAE.

11.png

Finally, I sincerely thank Alibaba Cloud SAE for its cooperation and support in Pumpkin Film's growth. Since adopting SAE, we have not had a single large-scale failure. We have also gained a great deal of experience along the way, which lets us deliver services to users quickly.

Pumpkin Film will, as always, bring the best film resources and the best possible viewing experience to movie fans, and contribute more positive energy to society. I also wish Alibaba Cloud the courage to keep dreaming and innovating, to reach new achievements, and to serve more companies around the world!


