
The emergence of cloud computing has brought great convenience to enterprise management, business development, and resource integration, and it is now one of the core infrastructures of digital transformation. Outages, however, are inevitable, and even the world's leading cloud computing platforms are no exception. For example, at 10:45 a.m. EST on December 7, Amazon AWS suffered an outage that affected the online services of websites such as Disney+ and Netflix. The failure caused great concern in the industry.

Cloud vendors cannot avoid downtime 100% because it has many possible causes, such as human error, network interruption or regional network congestion, power outages, and natural disasters. What we can do as a cloud vendor is to keep optimizing our technologies and services to deal with these problems and minimize the probability of downtime.

As the world's leading real-time interactive cloud service provider, Shengwang also uses AWS infrastructure resources for some overseas businesses. During the AWS outage, Shengwang's real-time audio and video services were not affected. The core reason is that the unique architecture of the SD-RTN™ network keeps RTE (real-time interaction) services highly available, so that even when data centers, hardware, networks, or other infrastructure fail, it can still provide users with highly available RTE services.

First we need to understand what high availability is. Generally speaking, a reliable cloud service must have very high availability. The standard for evaluating availability is the SLA: a Service Level Agreement (SLA) is a cloud vendor's guarantee of service availability. Many domestic cloud vendors promise 99.9% availability for the services they sell. The more nines, the longer the service is available throughout the year and the more reliable it is, and vice versa. For example, based on a 365-day year, 99.9% availability means the service may be unavailable for at most 8.76 hours per year. Every improvement in availability is a technical challenge. When problems arise, such as environmental disasters or unreliable public network infrastructure, how quickly to respond, how long recovery takes, and whether a mature contingency plan exists are issues that any cloud vendor must face honestly.
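
The arithmetic behind the number of nines can be checked with a short calculation. The following is a minimal Python sketch, assuming a 365-day year as in the example above; the function name `yearly_downtime_hours` is ours and purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, matching the 365-day basis used above


def yearly_downtime_hours(availability_percent: float) -> float:
    """Return the maximum hours of unavailability per year allowed by an SLA."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Compare the downtime budget for two to five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {yearly_downtime_hours(pct):.2f} hours/year")
```

On this basis, 99.9% availability corresponds to 8.76 hours of allowed downtime per year, and each additional nine cuts that budget by a factor of ten.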

To improve service availability, you need to plan at multiple levels, such as data center layout, service infrastructure, and operation and maintenance automation. So how does Shengwang ensure the high availability of RTE services in practice? We can look at four levels:

01 SD-RTN™ architecture design: real-time fault perception and intelligent scheduling, remote multi-active

  • Business Architecture: Infrastructure can become unavailable for a period of time because of sudden network congestion, hardware failure, force majeure, and other factors. With this in mind, the architects of Shengwang's SD-RTN™ network accounted for infrastructure instability from the very beginning of the design. The keywords that describe SD-RTN™ are global coverage, real-time fault perception and intelligent scheduling, ultra-low latency, elastic capacity, remote multi-active deployment, and ultra-high concurrency. Once infrastructure fails, SD-RTN™'s real-time fault perception and intelligent scheduling, together with its remote multi-active design, play an important role in keeping services highly available.

Real-time fault perception and intelligent scheduling: Globally, the public network fluctuates frequently. The SD-RTN™ network sniffing service senses network quality in real time and, combined with the analysis capability of AI Ops (intelligent operation and maintenance), migrates users within minutes to protect their audio and video experience.

Remote multi-active: The SD-RTN™ network divides global resources into multiple Regions. Within a Region, resource redundancy meets at least N+3, meaning that when the three largest resource clusters are unavailable, the remaining resources can still carry the Region's current load. In addition, Regions complement one another: when a Region fails, its complementary Region can take over its load.

Elastic capacity scaling: Each Region of the SD-RTN™ network has a real-time scale-up and scale-down capability of at least 200% of its capacity, so it can respond to emergencies and, together with intelligent scheduling, use resources fully and sensibly.

  • SDK: Shengwang has also done a lot of optimization work on the audio and video SDK side, including weak-network resilience and audio and video experience optimization. The SDK and the business layer thus work together, from the outside and the inside, to improve service availability.
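
The N+3 requirement described above can be expressed as a simple capacity check. The sketch below is illustrative only: the function name, the capacity units, and the sample numbers are assumptions, not Shengwang's actual implementation or data.

```python
def survives_largest_failures(cluster_capacities, region_load, failures=3):
    """Return True if the Region still carries its load after losing the
    `failures` largest clusters, which is the worst case for that many failures."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    # Sort largest first, then drop the top `failures` clusters.
    remaining = sorted(cluster_capacities, reverse=True)[failures:]
    return sum(remaining) >= region_load


# Hypothetical Region with seven clusters (arbitrary capacity units).
capacities = [100, 80, 80, 60, 60, 40, 40]

# Losing the three largest clusters (100, 80, 80) leaves 200 units.
print(survives_largest_failures(capacities, region_load=180))
print(survives_largest_failures(capacities, region_load=220))
```

Checking against the largest clusters is a deliberate choice: if the Region survives that case, it also survives the loss of any other three clusters.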


02 Infrastructure level: globally distributed data centers, resource coverage of five locations and three centers

  • Basic resource selection: SD-RTN™ has deployed 250+ data centers around the world, covering more than 200 countries and regions. The minimum requirement for major regions is resource coverage of five locations and three centers, and each region combines core nodes with POP nodes. In this way, if one or two data centers in an area fail, all traffic in the affected city can be switched to a healthy data center.
  • Supply Chain Management: Do not rely on the basic resources (data centers, hardware, networks, and so on) of a single supplier. When one supplier has problems, you can quickly switch to other suppliers whose services are running normally.
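
The idea of not depending on a single supplier can be sketched as a filter over candidate data centers. Everything here (the `DataCenter` type, the supplier names, the health flag) is hypothetical and only illustrates the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str
    city: str
    supplier: str
    healthy: bool = True


def usable_data_centers(data_centers, failed_suppliers=frozenset()):
    """Return data centers that are healthy and not run by a failing supplier."""
    return [
        dc for dc in data_centers
        if dc.healthy and dc.supplier not in failed_suppliers
    ]


def depends_on_single_supplier(data_centers):
    """Return True if every data center comes from the same supplier."""
    return len({dc.supplier for dc in data_centers}) <= 1


# Hypothetical inventory for one region.
region = [
    DataCenter("dc-1", "city-a", "supplier-x"),
    DataCenter("dc-2", "city-a", "supplier-y"),
    DataCenter("dc-3", "city-b", "supplier-z", healthy=False),
]

# If supplier-x has an outage, traffic can still go to dc-2.
targets = usable_data_centers(region, failed_suppliers={"supplier-x"})
print([dc.name for dc in targets])
print(depends_on_single_supplier(region))
```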


03 Intelligent operation and maintenance: blocking faults quickly

Nowadays, there is a consensus in the industry that the complexity of operation and maintenance is increasing rapidly while traditional operation and maintenance is already stretched thin. Applied to the daily operation and maintenance of SD-RTN™, intelligent operation and maintenance solves the pain points of the traditional approach: it provides uninterrupted 24/7 coverage, highly consistent and high-quality execution, and unified, efficient operations.

Shengwang's AI Ops (intelligent operation and maintenance) can identify an abnormality in a data center and handle it automatically within 1 minute (the end-to-end time including data aggregation, reporting, judgment, execution, and recovery), quickly blocking the spread of faults and keeping edge services highly available. For example, network congestion at edge nodes is unavoidable. Once congestion occurs, the user's audio and video experience degrades (stuttering and increased delay). Handling it manually takes an average of 20 minutes, and longer if the fault occurs in the middle of the night or is not handled promptly, which has a great impact on the user experience. This is where AI Ops shows its value: it identifies and handles exceptions within 1 minute and runs 24/7 with high consistency, ensuring a high-quality RTC experience for users.
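
One common way to detect congestion automatically is a sliding-window check over recent quality samples. The sketch below assumes one packet-loss sample per second, a 30-sample window, and a 5% threshold; these values and the `CongestionDetector` class are hypothetical and are not taken from Shengwang's AI Ops system.

```python
from collections import deque


class CongestionDetector:
    """Flag a node as congested when mean packet loss over a full window is too high."""

    def __init__(self, window=30, loss_threshold=0.05):
        self.samples = deque(maxlen=window)  # oldest samples drop out automatically
        self.loss_threshold = loss_threshold

    def observe(self, packet_loss):
        """Record one sample; return True if the full window's mean exceeds the threshold."""
        self.samples.append(packet_loss)
        window_full = len(self.samples) == self.samples.maxlen
        mean_loss = sum(self.samples) / len(self.samples)
        return window_full and mean_loss > self.loss_threshold


detector = CongestionDetector()

# 30 seconds of healthy traffic (1% loss), then a burst of 20% loss.
for _ in range(30):
    detector.observe(0.01)
for second in range(1, 11):
    if detector.observe(0.20):
        print(f"congestion flagged {second}s into the burst; trigger migration")
        break
```

With one sample per second, a detector like this reacts within the window length, which is how a sub-minute detection time could be reached in principle. The real system also has to aggregate, report, execute, and recover within that budget.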

04 RTE Industry's First Quality of Experience Standard - XLA

As mentioned earlier, SLA is the criterion many cloud vendors and the telecommunications industry use to judge service availability. In Shengwang's view, however, SLA regulates equipment and network access standards and focuses on whether a service is available. In the RTE industry, meeting the "usable" standard is far from enough. What users want is clear, smooth audio and video interaction without stuttering, so the quality of the real-time interactive experience must meet the "easy to use" standard. To that end, in July 2020 Shengwang designed, defined, and launched XLA (Experience Level Agreement), the first quality of experience standard in the real-time interactive industry and one that is compensable.

Unlike SLA, XLA is concerned not only with the availability and service quality of real-time interaction but also with the quality of the user experience, and it is the first standard to shift the focus of quality assurance from devices to people. XLA includes four experience indicators: the 5s login success rate, the 600ms video freeze rate, the 200ms audio freeze rate, and the 400ms network delay compliance rate. The monthly compliance rate of each of the four indicators, calculated as 1 - (total duration of non-compliant slices / total monthly duration), must exceed 99.5%. The 5s login success rate counts a login as qualified only if it succeeds in less than 5s; this indicator mainly tests the usability and waiting experience of real-time interaction. The 600ms video freeze rate and the 200ms audio freeze rate mainly test how smooth the experience is during real-time interaction. The 400ms network delay indicator applies to real-time audio and video interaction, where the delay must be less than 400ms.
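
The monthly compliance formula can be made concrete with a small calculation. This is a minimal sketch that assumes non-compliant time is tracked in seconds against a single monthly total; how slices are actually defined and measured is not specified here, and the function names and sample durations are hypothetical.

```python
XLA_TARGET = 0.995  # the monthly compliance rate must exceed 99.5%


def monthly_compliance_rate(non_compliant_seconds, total_seconds):
    """Compute 1 - (non-compliant duration / total monthly duration)."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= non_compliant_seconds <= total_seconds:
        raise ValueError("non_compliant_seconds must lie within the month")
    return 1 - non_compliant_seconds / total_seconds


def meets_xla(non_compliant_by_indicator, total_seconds):
    """Map each indicator name to whether its rate exceeds the XLA target."""
    return {
        name: monthly_compliance_rate(bad, total_seconds) > XLA_TARGET
        for name, bad in non_compliant_by_indicator.items()
    }


SECONDS_IN_30_DAYS = 30 * 24 * 3600  # 2,592,000 seconds

# Hypothetical non-compliant durations (seconds) for the four indicators.
sample = {
    "5s login success": 3_000,
    "600ms video freeze": 10_000,
    "200ms audio freeze": 20_000,
    "400ms network delay": 0,
}
print(meets_xla(sample, SECONDS_IN_30_DAYS))
```

Under these assumptions, a 30-day month leaves a budget of 12,960 seconds (3.6 hours) of non-compliant time per indicator before the 99.5% target is missed.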


Through XLA, customers obtain a promise and guarantee of real-time interactive experience quality across multiple dimensions, such as login success rate, end-to-end delay, and audio and video freeze rate, and no longer need to worry about the quality of experience of their end users. They can use the service with confidence and satisfaction.

Defining a quality standard for real-time interactive experience may look like a matter of a few indicators, but it reflects the long-term efforts of the Shengwang team. Before launch, hundreds of technical experts repeatedly polished, improved, and verified the XLA quality standard against full-link data. It went through 10 iterations, adaptation to 50+ network models, optimization for 200+ countries and regions, experience optimization for 6,000+ types of terminals, and refinement on 1 trillion minutes of full-link data. Behind this lies Shengwang's long-term dedication and accumulated experience in the real-time interactive cloud industry.


RTE Developer Community