Author: Feng
Jobs are a common workload for Internet services. In scenarios such as AI training, live streaming (video transcoding), data cleaning (ETL), and scheduled inspections, the core pain points are whether the task platform can start tasks quickly and at high concurrency, deliver high utilization of offline computing resources, and connect to a rich upstream and downstream ecosystem. Function Compute is an event-driven, fully managed computing service whose execution model is a natural fit for this kind of job workload. It fully addresses the pain points above and helps move "tasks" to serverless on the cloud.
Function Compute and serverless jobs
What capabilities should a job system have?
In the job scenarios described above, a task processing system should provide the following capabilities:
- Task triggering: flexible triggering methods, such as manual triggering by the client, event-source triggering, and scheduled (timer) triggering (a minimal handler sketch follows this list);
- Task orchestration: the ability to orchestrate complex task flows and manage the relationships between subtasks, such as branching, parallelism, and looping;
- Task scheduling and state management: task priority scheduling, multi-tenant isolation, concurrency control and rate limiting for tasks, plus the ability to track task state and control task execution;
- Resource scheduling: solving the problem of where tasks run, including support for multiple runtimes, control of cold-start latency for compute resources, and mixing of online and offline workloads, with the ultimate goal of higher resource utilization;
- Task observability: viewing and auditing task execution history, and access to task execution logs;
- Upstream and downstream ecosystem: the task scheduling system should integrate naturally with upstream and downstream systems, for example with the Kafka/ETL ecosystem and messaging services.
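To make the mapping from a "job" to Function Compute concrete, here is a minimal sketch of a scheduled-inspection task written against the FC Python runtime. The `handler(event, context)` entry point is the standard FC signature; the timer-event fields and the `run_inspection` helper are illustrative assumptions rather than details from this article.

```python
# Minimal sketch of a scheduled-inspection task on Function Compute (Python runtime).
# The handler(event, context) signature is the FC Python entry point; the timer-trigger
# event fields (triggerName, triggerTime, payload) and run_inspection() are assumptions.
import json
import logging

def run_inspection(payload):
    # Placeholder for the actual job logic (e.g. a scheduled database inspection).
    logging.info("running inspection with payload: %s", payload)
    return {"status": "ok"}

def handler(event, context):
    evt = json.loads(event)  # timer triggers deliver a JSON event
    logging.info("trigger %s fired at %s", evt.get("triggerName"), evt.get("triggerTime"))
    result = run_inspection(evt.get("payload"))
    return json.dumps(result)
```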
Alibaba Cloud Function Compute Serverless Jobs
A panoramic view of Function Compute's job capabilities is shown in the figure below:
Figure 1: A panoramic view of Function Compute's job capabilities
Comparison of Job Capabilities of Common Task Scheduling Systems in the Industry
Table 1: Comparison of common task scheduling system capabilities
In general, task scheduling systems such as the batch computing products of some cloud vendors and open-source Kubernetes Jobs scale at the granularity of instances and lack the ability to manage and orchestrate large numbers of tasks, so they are better suited to low-concurrency, heavyweight, long-running workloads (such as genomics computing or large-scale machine learning training). Meanwhile, some open-source process execution engines and big data processing systems lack elasticity, multi-tenant isolation, high-concurrency management, visualization, and similar capabilities. As an operations-free serverless platform, Function Compute combines the advantages of these different systems, and the elasticity inherent to serverless is a good match for the high-concurrency peaks and valleys that job workloads typically exhibit.
Recommended best practices & customer cases
AI training & inference
Core requirements of this scenario:
- Supports real-time inference and offline training at the same time, where real-time inference places requirements on cold-start latency;
- Workloads show pronounced peaks and valleys, involve heavy computation and high concurrency, and require essentially no coordination between compute instances;
- Training generally needs to run custom libraries packaged in a container image.
Case 1: NetEase Cloud Music - Audio and Video Processing Platform
NetEase Cloud Music's "Discover" and "Share" features rely on analyzing and extracting basic features from music. Running these recommendation algorithms and data analyses requires very large amounts of compute to process the original music files. After evolving through asynchronous processing, priority and queue optimization, algorithm cluster virtualization, an algorithm image framework, and cloud native, NetEase Cloud Music's offline audio and video processing platform chose Function Compute as the infrastructure of its video platform, effectively solving the operational difficulty and poor elasticity caused by an ever-growing compute scale.
Case 2: Database Autonomous Service - Database Inspection Platform
Alibaba Group's internal database inspection platform is mainly used to analyze and optimize SQL queries and logs. The platform's workload falls into two main kinds of tasks: offline training and online analysis. The online analysis business runs at a scale of tens of thousands of cores, and the offline business also consumes millions of hours of execution time per day. Because the timing of online analysis and offline training is uncertain, it is hard to raise the overall resource utilization of the cluster, and substantial elastic compute power is needed when business peaks arrive. The business ultimately built its database inspection platform on Function Compute to serve both daily online AI inference and offline model training tasks.
Case 3: Focus Media - Serverless Image Processing
In the advertising business, running deep learning algorithms for image processing, comparison, and recognition is fairly common. Such workloads typically have diverse data sources, uncertain per-instance processing time, pronounced peaks and valleys, and high requirements on task observability. Running them on self-purchased machines not only means dealing with machine operations and resource utilization, but also makes it hard to adapt to the variety of image sources and to launch the service quickly.
Function Compute's support for multiple event-source triggers is a great convenience for this kind of business. Focus Media uses OSS and MNS triggers to invoke Function Compute, which solves the problem of diverse data sources: image data uploaded by users to OSS or MNS directly triggers the function that performs the image processing task. The elasticity of Function Compute and its pay-as-you-go model remove the concerns around resource utilization and machine operations. For observability, the processing instances use the stateful asynchronous invocation mode, which makes every triggered task traceable and makes it easy for the business to troubleshoot and retry failed tasks.
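As a rough illustration of the trigger-driven flow described above, the following sketch parses the event an OSS trigger delivers to an FC Python handler and passes each newly uploaded object to a processing routine. The event layout follows the documented OSS trigger format; `process_image` and everything inside it are placeholder assumptions.

```python
# Sketch of an OSS-triggered image-processing task on Function Compute.
# The event structure (events -> oss -> bucket/object) follows the OSS trigger format;
# process_image() stands in for the actual deep-learning processing and is an assumption.
import json
import logging

def process_image(bucket, key):
    logging.info("processing oss://%s/%s", bucket, key)
    # download the object, run the model, upload results ...

def handler(event, context):
    evt = json.loads(event)
    for record in evt.get("events", []):
        bucket = record["oss"]["bucket"]["name"]
        key = record["oss"]["object"]["key"]
        process_image(bucket, key)
    return "done"
```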
Video transcoding & live streaming & recorded-to-live
Live transcoding/recording and recorded-to-live services are often both real-time and irregular:
- A live stream must be able to start a processing instance at any time and stop the transcoding instance at any time;
- Peak traffic is concentrated in a few hours of the day and there are almost no requests at night, so resource utilization and cost are the main considerations.
Beyond general elasticity requirements, video transcoding scenarios often also demand flexibility in resource specifications (CPU) in pursuit of higher resource utilization. For example:
- Resource specifications: because transcoding jobs target different output bitrates, it should be possible, for cost reasons, to flexibly spin up resources of different specifications;
- Randomness of run time: to improve transcoding efficiency, videos are often split into segments, so a large number of instances may be needed the moment a task arrives;
- Segments may be processed separately and in parallel to improve transcoding efficiency, which involves sharing data among multiple functions (see the invocation sketch after this list);
- Custom container images are needed to run in-house libraries, and they must still start quickly;
- Because transcoding is an offline workload, task records need to be retained after completion for subsequent audits, troubleshooting, and similar needs.
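The sketch below illustrates how a scheduler might fan video segments out to a transcoding function using asynchronous invocation with the Function Compute Python SDK (fc2). The endpoint, credentials, service and function names, and the payload shape are placeholders; the `x-fc-invocation-type: Async` header requests an asynchronous call, while stateful async invocation itself is configured on the target function.

```python
# Sketch: fan out video segments to a transcoding function via async invocation.
# Requires the official SDK:  pip install aliyun-fc2
# Endpoint, credentials, service/function names and the payload shape are assumptions.
import json
import fc2

client = fc2.Client(
    endpoint="https://<account-id>.<region>.fc.aliyuncs.com",  # placeholder
    accessKeyID="<access-key-id>",
    accessKeySecret="<access-key-secret>",
)

def submit_segments(video_id, segments):
    for idx, segment_url in enumerate(segments):
        payload = json.dumps({"videoId": video_id, "index": idx, "url": segment_url})
        # 'Async' asks FC to queue the call; each segment becomes an independent task.
        client.invoke_function(
            "video-platform",        # service name (placeholder)
            "transcode-segment",     # function name (placeholder)
            payload=payload,
            headers={"x-fc-invocation-type": "Async"},
        )

submit_segments("demo-video", ["oss://bucket/seg-0.ts", "oss://bucket/seg-1.ts"])
```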
Case 1: New Oriental - Cloud Classroom Serverless Video Processing Platform
New Oriental's Cloud Classroom system supports all of New Oriental's online education scenarios, including live video, transcoding, and video on demand. As business volume grew, the pronounced peak-and-valley pattern of the live transcoding and video transcoding workload made the low resource utilization of self-built server rooms the core pain point. To improve overall utilization, these parts of the Cloud Classroom system were moved to Function Compute, which lets the business flexibly choose compute resource specifications according to its workload characteristics. Millisecond-level cold starts and the pay-as-you-go model keep overall resource utilization very high, allowing the whole system to meet peak compute demand at the lowest cost.
In moving these scenarios to serverless, the Cloud Classroom system uses Alibaba Cloud Function Compute's stateful invocation mode. This mode was built specifically for job scenarios and supports querying execution history and gracefully stopping tasks. For storage, temporary video files use a Function Compute + NAS solution: the video platform's function scheduler polls multiple function services for load balancing, and each service mounts a different NAS file system, which both enables file sharing and raises the utilization of the NAS-backed temporary storage inside the functions, further reducing resource costs.
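A minimal sketch of the NAS-backed temporary storage mentioned above: the function writes an intermediate transcoding artifact to a NAS mount point so that other functions mounting the same file system can read it. The mount path and file naming here are assumptions; the real path is whatever the function's NAS configuration specifies.

```python
# Sketch: using a NAS mount inside a Function Compute instance as shared temp storage.
# /mnt/video-tmp is an assumed mount point set in the function's NAS configuration.
import os

NAS_TMP_DIR = "/mnt/video-tmp"

def save_segment(video_id, index, data: bytes) -> str:
    os.makedirs(NAS_TMP_DIR, exist_ok=True)
    path = os.path.join(NAS_TMP_DIR, f"{video_id}-{index}.ts")
    with open(path, "wb") as f:
        f.write(data)  # visible to other functions mounting the same NAS
    return path
```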
Case 2: Millian - Real-time Live Video Compliance Review Platform
The main video processing task in Millian's live dating service is frame extraction: frames are cut from the video while the stream is being pulled and are uploaded to the target storage. Because this kind of live scenario has pronounced peaks and valleys, the platform has real-time and long-running execution requirements in addition to resource utilization requirements. The review platform was ultimately built on Function Compute, whose high elasticity and support for long-running compute effectively serve the business.
Data processing & ETL
Core requirements of this scenario:
- Elastic, high-concurrency support, with on-demand pricing, a variety of resource types, high utilization, and no operations burden;
- Orchestration support for complex processes;
- Observability of tasks.
Case: TuSimple - An Automated Data Processing Platform That Makes Everything Simple and Reliable
TuSimple's development of autonomous driving technology relies on accumulating large amounts of road test data; efficient road testing and rapid processing of that data to guide model iteration are the core demands of this scenario. However, road tests run on an irregular schedule, the data ingestion process is long and involves interactions across multiple systems, and the required compute power is uncertain, all of which pose significant challenges for orchestrating tasks on the data processing platform.
In response, TuSimple explored automating its data processing platform. The platform uses Serverless Workflow to orchestrate the overall process and uses the natively integrated messaging service MNS to connect data flows between off-cloud and on-cloud systems.
Beyond scheduling, TuSimple uses task input/output mapping and status reporting to efficiently manage the life cycle of each task in the workflow and the data passed between tasks, and to keep task status and data updates current during execution, which meets the data processing needs of long-running processes with uncertain durations.
Summary
Taken together, the cases and analysis above show that the elasticity, observability, queue isolation, and complete event ecosystem of Function Compute support this kind of job scenario very well, mainly in the following respects:
Task triggering\
Function Compute supports timer triggers, OSS triggers, and triggers for various message queues, which provides rich capabilities for event-driven (EDA) applications and for data processing scenarios with multiple data sources;
Task orchestration & task scheduling\
Function Compute is natively and seamlessly integrated with Alibaba Cloud's Serverless Workflow service. Serverless Workflow can orchestrate distributed tasks sequentially, in branches, or in parallel, track the state transitions of each task, and execute predefined retry logic when necessary. The combination of Serverless Workflow and Function Compute supports complex, long-running processes well;
At the resource level, serverless shows its core advantages: development free of operations work, high elasticity, and high availability guarantees.\
Compared with self-built infrastructure, a serverless architecture means paying only for the actual execution of tasks, which both saves cost and removes the operations burden. Function Compute supports multiple runtime languages as well as custom container images, which greatly simplifies development and debugging.
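For a sense of what running a custom container image on Function Compute involves, here is a minimal sketch of the HTTP contract a custom-container event function implements: a server listening on port 9000 that handles POST /invoke. The port and path follow the documented custom-container convention; the job logic itself is a placeholder assumption.

```python
# Sketch: minimal HTTP server for a Function Compute custom container image.
# FC delivers event invocations as POST /invoke on port 9000 inside the container;
# the job logic below is a placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer

class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/invoke":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = self.rfile.read(length)      # the raw event payload
        result = b"processed: " + event      # placeholder for real job logic
        self.send_response(200)
        self.end_headers()
        self.wfile.write(result)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), InvokeHandler).serve_forever()
```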
- In terms of observability, Serverless Workflow and Function Compute provide rich metrics and query methods for both multi-task workflows and individual tasks, making it easy to search execution history, inspect the metrics and logs of intermediate tasks, and debug and track down problems.
Going forward, Function Compute serverless jobs will continue to invest deeply in vertical task processing scenarios, including longer instance execution times, richer observability metrics, more powerful task scheduling strategies, and end-to-end integration capabilities, providing the "shortest path" for vertical scenarios and helping your business take off.
Click here to learn more about Function Compute.
For more information, scan the QR code below or search for the WeChat account (AlibabaCloud888) to add the Cloud Native assistant.