This article was originally shared by the iQIYI technical team under the title "Design and Practice of Building a Universal WebSocket Push Gateway". It has been edited and revised for this publication.

1. Introduction

As is well known, HTTP is a stateless, TCP-based request/response protocol: requests can only be initiated by the client and answered by the server. In most scenarios this pull model is sufficient. In some scenarios, however, such as message push (most common in IM, for example offline message push) and real-time notification, data must be synchronized to the client in real time, which requires the server to be able to push data actively.

Traditional Web server push technology has a long history and has evolved through stages such as short polling and long polling (see "Beginner Post: Detailed Explanation of the Principles of the Most Complete Web-side Instant Messaging Technology in History"). These techniques solve the problem to some extent, but they suffer from poor timeliness and wasted resources. The WebSocket specification introduced alongside the HTML5 standard largely ended this situation and has become the mainstream solution for server-side message push.

Integrating WebSocket into a system is simple, and discussions and materials on it are abundant. How to implement a general-purpose WebSocket push gateway, however, still lacks a mature solution. Current cloud service vendors focus mainly on mobile push for iOS and Android and offer little support for WebSocket. This article shares iQIYI's practical experience in building a Netty-based real-time push gateway for WebSocket long connections.

Learning and discussion:

Instant messaging/push technology development discussion group 5: 215477170 [recommended]
Introduction to mobile IM development: "One Article Is Enough for Beginners: Developing Mobile IM from Scratch"
Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article was published synchronously at: http://www.52im.net/thread-3539-1-1.html )

2. Series catalogue

This article is the fourth in a series. The full series is as follows:

"Special Topic on Long Connection Gateway Technology (1): Summary of JD Jingmai's Production-level TCP Gateway Technology Practice"
"Special Topic on Long Connection Gateway Technology (2): Zhihu's Practice of a High-performance Long Connection Gateway with Tens of Millions of Concurrent Connections"
"Special Topic on Long Connection Gateway Technology (3): The Technical Evolution of Mobile Taobao's Access Layer Gateway for Hundreds of Millions of Mobile Users"
"Special Topic on Long Connection Gateway Technology (4): Practice of iQIYI WebSocket Real-time Push Gateway Technology" (* This article)

Other related technical articles:

"Absolute Dry Goods: Technical Essentials of Push Service for Massive Access Based on Netty"
"Jingdong Daojia Netty-based WebSocket Application Practice Sharing"

Other articles shared by the iQIYI technical team:

"iQIYI Technology Sharing: A Light and Humorous Explanation of the Past, Present and Future of Video Codec Technology"
"iQIYI Technology Sharing: Summary of Practice in Optimizing the Startup Speed of the iQIYI Android Client"
"iQIYI Mobile Network Optimization Practice Sharing: Network Request Success Rate Optimization"

3. Technical pain points of the old solution

iQIYI Hao (the iQIYI creator account platform) is an important part of our content ecosystem. As a user-facing system, it has high user experience requirements, and that experience directly affects creators' enthusiasm.

iQIYI Hao currently uses WebSocket real-time push technology in several business scenarios, including:

  • 1) User comments: comment messages are pushed to the browser in real time;
  • 2) Real-name authentication: a user must complete real-name authentication before signing a contract. After scanning a QR code, the user enters a third-party authentication page, and the browser is notified of the authentication status asynchronously once authentication completes;
  • 3) Liveness detection: similar to real-name authentication, the browser is notified of the result asynchronously once liveness detection completes.

In actual business development, we found several problems with how WebSocket real-time push technology was being used.

These problems are:

  • 1) First: the WebSocket technology stack is not unified. Some implementations are based on Netty and others on Web containers, which makes development and maintenance difficult;
  • 2) Second: WebSocket implementations are scattered across projects and tightly coupled to the business systems. Any other business that needs WebSocket faces repeated development, wasted cost, and low efficiency;
  • 3) Third: WebSocket is a stateful protocol. A client connects to only one node in the cluster and communicates only with that node during data transmission, so a WebSocket cluster must solve the session sharing problem. A single-node deployment avoids the problem, but it cannot scale horizontally to support higher loads and is a single point of failure;
  • 4) Finally: monitoring and alerting are lacking. The number of WebSocket long connections can be roughly estimated from the number of socket connections in Linux, but the figure is inaccurate and says nothing about business-level indicators such as the number of users. Nor can it be integrated with the existing microservice monitoring system for unified monitoring and alerting.

PS: Due to space limitations, this article does not introduce WebSocket itself in detail. If you are interested, you can read "WebSocket from beginner to proficient, half an hour is enough!".

4. Technical goals of the new solution

As the previous section shows, solving the problems of the old solution requires a unified real-time push gateway for WebSocket long connections.

The new gateway needs to have the following features:

  • 1) Centralized long connection management and push capability: unify the technology stack and consolidate long connections into a shared basic capability, which simplifies feature iteration, upgrades, and maintenance;
  • 2) Decoupling from business: separate business logic from long connection communication so that business systems no longer need to care about communication details, avoiding repeated development and wasted R&D cost;
  • 3) Ease of use: provide an HTTP push channel so that systems written in any development language can integrate easily. A business system pushes data with a simple call, which improves R&D efficiency;
  • 4) Distributed architecture: run as a multi-node cluster that scales horizontally to meet business growth; the failure of one node does not affect overall service availability, ensuring high reliability;
  • 5) Multi-terminal message synchronization: allow a user to be logged in from multiple browsers or tabs at the same time, and ensure messages are delivered to all of them;
  • 6) Multi-dimensional monitoring and alerting: expose custom monitoring indicators and connect them to the existing microservice monitoring system, so that problems trigger timely alerts and service stability is ensured.

5. Technology selection for the new solution

Among the many WebSocket implementations, we finally chose Netty after weighing performance, scalability, and community support. Netty is a high-performance, event-driven, asynchronous, non-blocking network communication framework that is widely used in many well-known open source projects.

PS: If you know very little about Netty, you can read the following two articles in detail:

  • "The most popular Netty entry in history: basic introduction, environment construction, hands-on combat"
  • "Novice Getting Started: The most thorough analysis of Netty's high-performance principles and framework architecture so far"

WebSocket is stateful, so a cluster cannot be load-balanced per request the way plain HTTP can. Once a long connection is established, its session is held by one specific server node, so in a cluster it is hard to know which node a given session belongs to.

There are generally two technical solutions to this problem:

  • 1) Use a registry, similar to the one used for microservices, to maintain a global session mapping;
  • 2) Use event broadcasting and let each node decide for itself whether it holds the session.

The comparison of the two schemes is shown in the following table.

WebSocket cluster solution:

Considering both implementation cost and cluster scale, we chose the lightweight event broadcasting solution.

Broadcasting can be implemented with RocketMQ message broadcasting, Redis Publish/Subscribe, or ZooKeeper notifications. Their advantages and disadvantages are compared in the following table. Considering throughput, real-time performance, persistence, and implementation difficulty, we finally chose RocketMQ.

Comparison of broadcast implementation schemes:

6. Implementation approach of the new solution

6.1 System Architecture

The overall architecture of the gateway is shown in the figure below:

The overall process of the gateway is as follows:

1) The client completes a handshake with any node of the gateway to establish a long connection, and the node adds the connection to the long connection queue it maintains in memory. The client sends heartbeat messages to the server periodically. If no heartbeat arrives within the configured time, the server considers the long connection broken, closes it, and cleans up the session in memory.

2) When a business system needs to push data to a client, it sends the data to the gateway through the HTTP interface the gateway provides.

3) After receiving the push request, the gateway writes the message to RocketMQ.

4) Each gateway node, acting as a consumer, consumes messages in broadcast mode, so every node receives every message.

5) On receiving a message, a node checks whether the push target is in the long connection queue maintained in its own memory. If so, it pushes the data over that long connection; otherwise it ignores the message.
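Steps 3 to 5 can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: `BroadcastBus` stands in for the RocketMQ broadcast topic, a list of delivered payloads stands in for a real WebSocket channel, and all class and method names (`PushMessage`, `GatewayNode`, `BroadcastBus`, `onBroadcast`) are hypothetical rather than taken from the gateway's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical push message: target user ID plus payload. */
final class PushMessage {
    final String uid;
    final String payload;

    PushMessage(String uid, String payload) {
        this.uid = uid;
        this.payload = payload;
    }
}

/** One gateway node; it only knows the connections held in its own memory. */
final class GatewayNode {
    final String name;
    // uid -> payloads written to that user's connection (stand-in for a real channel)
    private final Map<String, List<String>> localConnections = new ConcurrentHashMap<>();

    GatewayNode(String name) {
        this.name = name;
    }

    /** Step 1: a client completes the handshake with this node. */
    void connect(String uid) {
        localConnections.putIfAbsent(uid, new CopyOnWriteArrayList<>());
    }

    /** Step 5: push if the target is connected to this node, otherwise ignore. */
    boolean onBroadcast(PushMessage msg) {
        List<String> outbox = localConnections.get(msg.uid);
        if (outbox == null) {
            return false; // the session lives on another node (or nowhere)
        }
        outbox.add(msg.payload);
        return true;
    }

    List<String> delivered(String uid) {
        return localConnections.getOrDefault(uid, List.of());
    }
}

/** Stand-in for the broadcast topic: every subscribed node sees every message. */
final class BroadcastBus {
    private final List<GatewayNode> subscribers = new ArrayList<>();

    void subscribe(GatewayNode node) {
        subscribers.add(node);
    }

    /** Steps 3-4: publish once, deliver to all nodes; returns how many nodes pushed. */
    int publish(PushMessage msg) {
        int pushed = 0;
        for (GatewayNode node : subscribers) {
            if (node.onBroadcast(msg)) {
                pushed++;
            }
        }
        return pushed;
    }
}
```

The trade-off of this design is visible in `publish`: every node does a lookup for every message, which is cheap for a small cluster and avoids maintaining a global session registry.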

The gateway runs as a multi-node cluster in which each node is responsible for a share of the long connections, which achieves load balancing. When the number of connections grows large, nodes can be added to share the load, achieving horizontal scaling.

When a node goes down, its clients attempt a new handshake to establish a long connection with another node, which preserves the overall availability of the service.
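The heartbeat timeout described in step 1 of the process can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `HeartbeatTracker` and its methods are hypothetical names, connections are identified by plain strings, and time is passed in explicitly so the logic is easy to test. Netty itself ships an `IdleStateHandler` for detecting idle connections; the article does not say whether the gateway uses it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the last heartbeat per connection and evicts connections that went silent. */
final class HeartbeatTracker {
    private final long timeoutMillis;
    // connection ID -> timestamp (ms) of the most recent heartbeat
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a heartbeat (or the initial handshake) arrives. */
    void onHeartbeat(String connectionId, long nowMillis) {
        lastSeen.put(connectionId, nowMillis);
    }

    /**
     * Removes every connection whose last heartbeat is older than the timeout
     * and returns their IDs so the caller can close them and clean up sessions.
     */
    List<String> evictExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    int activeCount() {
        return lastSeen.size();
    }
}
```

In a running server, `evictExpired(System.currentTimeMillis())` would typically be called from a periodic task, for example one scheduled with a `ScheduledExecutorService`.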

6.2 Session Management

After a WebSocket long connection is established, its session is kept in the memory of the node that holds it. The SessionManager component is responsible for managing sessions, and internally uses a hash table to map each UID to its UserSession.

A UserSession represents a session at the user level. A user may establish several long connections at the same time, so the UserSession also uses a hash table internally to map each Channel to its ChannelSession.

To prevent a user from creating an unlimited number of long connections, when the number of ChannelSessions inside a UserSession exceeds a configured limit, the earliest established ChannelSession is closed to reduce server resource usage. The relationship between SessionManager, UserSession, and ChannelSession is shown in the figure below.

SessionManager components:
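The three-level structure described in this section might look roughly like the sketch below. The class names SessionManager, UserSession, and ChannelSession come from the article; everything else is an assumption made for illustration: the method names, the use of a string `channelId` in place of a Netty `Channel`, and the choice of an insertion-ordered `LinkedHashMap` to find the earliest connection.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Per-connection session; channelId stands in for a Netty Channel (assumption). */
final class ChannelSession {
    final String channelId;
    private volatile boolean open = true;

    ChannelSession(String channelId) {
        this.channelId = channelId;
    }

    void close() {
        open = false; // a real implementation would also close the channel
    }

    boolean isOpen() {
        return open;
    }
}

/** All connections of one user, kept in the order they were established. */
final class UserSession {
    private final int maxChannels;
    private final LinkedHashMap<String, ChannelSession> channels = new LinkedHashMap<>();

    UserSession(int maxChannels) {
        this.maxChannels = maxChannels;
    }

    /** Adds a connection; if the limit is exceeded, closes and drops the earliest one. */
    synchronized void add(ChannelSession session) {
        channels.put(session.channelId, session);
        if (channels.size() > maxChannels) {
            Iterator<ChannelSession> it = channels.values().iterator();
            ChannelSession oldest = it.next(); // first entry = earliest insertion
            it.remove();
            oldest.close();
        }
    }

    synchronized void remove(String channelId) {
        channels.remove(channelId);
    }

    synchronized List<String> channelIds() {
        return new ArrayList<>(channels.keySet());
    }

    synchronized boolean isEmpty() {
        return channels.isEmpty();
    }
}

/** Node-local registry mapping UID to UserSession. */
final class SessionManager {
    private final int maxChannelsPerUser;
    private final Map<String, UserSession> sessions = new ConcurrentHashMap<>();

    SessionManager(int maxChannelsPerUser) {
        this.maxChannelsPerUser = maxChannelsPerUser;
    }

    ChannelSession register(String uid, String channelId) {
        ChannelSession session = new ChannelSession(channelId);
        // compute() keeps "create if absent" and "add" atomic per UID
        sessions.compute(uid, (key, existing) -> {
            UserSession target = (existing == null) ? new UserSession(maxChannelsPerUser) : existing;
            target.add(session);
            return target;
        });
        return session;
    }

    void unregister(String uid, String channelId) {
        // returning null from the remapping function removes the empty UserSession
        sessions.computeIfPresent(uid, (key, userSession) -> {
            userSession.remove(channelId);
            return userSession.isEmpty() ? null : userSession;
        });
    }

    List<String> channelsOf(String uid) {
        UserSession userSession = sessions.get(uid);
        return userSession == null ? List.of() : userSession.channelIds();
    }

    int userCount() {
        return sessions.size();
    }
}
```

Pushing to a user then means iterating over `channelsOf(uid)`, which is also what makes multi-tab and multi-browser delivery work.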

6.3 Monitoring and alarm

To show how many long connections are established in the cluster and how many users they belong to, the gateway provides basic monitoring and alerting capabilities.

The gateway integrates Micrometer and exposes the number of connections and the number of users as custom indicators for Prometheus to collect, which connects it to the existing microservice monitoring system.

In Grafana, you can easily view indicators such as the number of connections, the number of users, JVM, CPU, and memory to understand the gateway's current capacity and load. Alert rules can also be configured in Grafana to trigger Qixin (the internal alerting platform) alerts when the data is abnormal.
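To make the two custom indicators concrete, the sketch below computes the connection count and the user count from a node-local session map and renders them as gauges in the Prometheus text exposition format. It deliberately does not use Micrometer, so that it stays dependency-free; in the real gateway Micrometer would do the registration and rendering. The class name `GatewayMetrics` and the metric names `gateway_connections` and `gateway_users` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Dependency-free stand-in for the gateway's custom gauges (illustrative only). */
final class GatewayMetrics {
    // uid -> IDs of that user's open connections on this node
    private final Map<String, List<String>> connectionsByUser = new ConcurrentHashMap<>();

    /** Records the current connections of a user; an empty list removes the user. */
    void setConnections(String uid, List<String> channelIds) {
        if (channelIds.isEmpty()) {
            connectionsByUser.remove(uid);
        } else {
            connectionsByUser.put(uid, List.copyOf(channelIds));
        }
    }

    /** Number of distinct users with at least one connection. */
    int userCount() {
        return connectionsByUser.size();
    }

    /** Total number of long connections across all users. */
    int connectionCount() {
        return connectionsByUser.values().stream().mapToInt(List::size).sum();
    }

    /** Renders both gauges in the Prometheus text exposition format. */
    String scrape() {
        return "# TYPE gateway_connections gauge\n"
                + "gateway_connections " + connectionCount() + "\n"
                + "# TYPE gateway_users gauge\n"
                + "gateway_users " + userCount() + "\n";
    }
}
```

The two numbers differ whenever a user has several tabs or browsers open, which is why a raw socket count from the operating system cannot replace them.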

7. Performance stress test of the new solution

Stress test preparation:

1) Two virtual machines with 4 cores and 16 GB of memory each were used, one as the server and one as the client;
2) The gateway listened on 20 ports, and 20 clients were started at the same time;
3) Each client established 50,000 connections to one server port, for a total of 20 × 50,000 = 1,000,000 simultaneous connections.

The number of connections (at the million level) and memory usage are shown in the following figure:

When a message is sent to all one million long connections at once using a single sending thread, the server takes about 10 s on average to finish sending, as shown in the figure below.

Time taken by server push:

Generally, the number of long connections a single user holds at the same time is in the single digits. Taking 10 long connections as an example, with a concurrency of 600 and a duration of 120 s, the push interface reaches a TPS of roughly 1,600 or more, as shown in the figure below.

Stress test data with 10 long connections, a concurrency of 600, and a duration of 120 s:

The current performance meets the needs of our actual business scenarios and can support future business growth.

8. A practical application case of the new solution

To illustrate the effect of the optimization more concretely, we close with a case in which iQIYI uses the new WebSocket gateway: adding filter effects to a cover image.

When an iQIYI Hao creator publishes a video, they can choose to add filter effects to the cover image, which guides users toward providing better covers.

When the user selects a cover image, an asynchronous background processing task is submitted. Once the task completes, the images processed with the different filter effects are returned to the browser through WebSocket. The business scenario is shown in the following figure.

From the perspective of R&D efficiency, integrating WebSocket directly into a business system takes at least 1-2 days of development time.

With the push capability of the new WebSocket gateway, a simple interface call is all it takes to push data. Development time drops to minutes, and R&D efficiency improves greatly.
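As an illustration of how small such an interface call can be, the sketch below builds the HTTP request a business system might send, using the JDK's built-in `java.net.http` API (Java 11+). The article does not document the gateway's HTTP interface, so the `/push` path, the JSON field names `uid` and `data`, and the host name are all assumptions, and the `uid` is assumed to contain no characters that need JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Builds push requests for a hypothetical gateway HTTP endpoint. */
final class PushRequests {
    private PushRequests() {
    }

    /**
     * Builds a POST request carrying the target user and a JSON payload.
     * The "/push" path and the body layout are illustrative assumptions.
     */
    static HttpRequest build(String gatewayBaseUrl, String uid, String jsonPayload) {
        // uid is assumed to need no JSON escaping; jsonPayload must already be valid JSON
        String body = "{\"uid\":\"" + uid + "\",\"data\":" + jsonPayload + "}";
        return HttpRequest.newBuilder(URI.create(gatewayBaseUrl + "/push"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Because the channel is plain HTTP, a business system written in any other language can make the equivalent call with its own HTTP client.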

From the perspective of operation and maintenance cost, the business system no longer contains communication details unrelated to business logic, so the code is more maintainable, the system architecture is simpler, and operation and maintenance costs drop significantly.

9. Final thoughts

WebSocket is currently the mainstream technology for server-side push. Used appropriately, it can effectively improve system responsiveness and user experience. With the WebSocket long connection gateway, a system can quickly gain data push capability, which effectively reduces operation and maintenance costs and improves development efficiency.

The advantages of the long connection gateway are:

  • 1) It encapsulates the details of WebSocket communication and is decoupled from the business systems, so the gateway and the business systems can be optimized and iterated independently, which avoids repeated development and simplifies development and maintenance;
  • 2) The gateway provides a simple, easy-to-use HTTP push channel that can be called from many development languages, which makes system integration straightforward;
  • 3) The gateway adopts a distributed architecture, which provides horizontal scaling, load balancing, and high availability;
  • 4) The gateway integrates monitoring and alerting and can give timely warning when the system is abnormal, ensuring the health and stability of the service.

At present, the new WebSocket long connection real-time gateway is used in several iQIYI business scenarios, such as notification of image filter results and MCN electronic signing.

Many aspects remain to be explored in the future, such as message retransmission and ACK, WebSocket binary data support, and multi-tenant support.

Appendix: More related technical information

[1] About the development of web-side instant messaging:

"Beginner Post: Detailed Explanation of the Principles of the Most Complete Web-side Instant Messaging Technology in History"
"Inventory of Instant Messaging Technologies on the Web: Short Polling, Comet, WebSocket, SSE"
"SSE Technology Explained: A New HTML5 Server Push Event Technology"
"Comet Technology Explained: Web-side Real-time Communication Technology Based on HTTP Long Connection"
"Quick Start for Novices: A Concise Tutorial on WebSocket"
"WebSocket Detailed Explanation (1): A Preliminary Understanding of WebSocket Technology"
"WebSocket Detailed Explanation (2): Technical Principles, Code Demonstrations and Application Cases"
"WebSocket Detailed Explanation (3): In-depth WebSocket Communication Protocol Details"
"WebSocket Detailed Explanation (4): Questioning the Relationship between HTTP and WebSocket (Part 1)"
"WebSocket Detailed Explanation (5): Questioning the Relationship between HTTP and WebSocket (Part 2)"
"WebSocket Detailed Explanation (6): Questioning the Relationship between WebSocket and Socket"
"Practice and Ideas for Socket.io to Realize Message Push"
"LinkedIn's Web-side Instant Messaging Practice: Realizing Hundreds of Thousands of Long Connections on a Single Machine"
"The Development of Web Instant Messaging Technology and the Technical Practice of WebSocket and Socket.io"
"Web-side Instant Messaging Security: Detailed Explanation of Cross-site WebSocket Hijacking Vulnerabilities (including sample code)"
"Practice of Open Source Framework Pomelo: Building High-performance Distributed IM Chat Server on Web"
"Using WebSocket and SSE Technology to Realize Web-side Message Push"
"Explain the Evolution of Web-side Communication: from Ajax, JSONP to SSE, WebSocket"
"Why does MobileIMSDK-Web's network layer framework use Socket.io instead of Netty?"
"Integrating Theory with Practice: Understanding the Communication Principle, Protocol Format, and Security of WebSocket from Zero"
"How to use WebSocket to realize long connection in WeChat applet (including complete source code)"
"Eight Questions about WebSocket Protocol: Quickly Answer Hot Questions about WebSocket"
"Web-side instant messaging practice dry goods: How to make your WebSocket disconnect and reconnect faster?"
"WebSocket from beginner to proficient, half an hour is enough!"
"WebSocket Hardcore Introduction: 200 lines of code, teach you to write a WebSocket server by hand"

[2] Articles about push technology:

"A complete Android push Demo based on MQTT communication protocol"
"Seeking advice on Android message push: the pros and cons of GCM, XMPP, and MQTT"
"Analysis of Mobile Real-time Message Push Technology"
"Absolute Dry Goods: Technical Essentials of Push Service for Massive Access Based on Netty"
"Technical Practice Sharing of Large-scale and High-concurrency Architecture of Aurora Push System"
"Meizu's Real-time Message Push Architecture with 25 Million Long Connections: Technical Practice Sharing"
"Interview with Meizu Architects: Insights and Experience from a Real-time Message Push System with Massive Long Connections"
"Practice of Pushing Messages in Hybrid Mobile Applications Based on WebSocket (including code examples)"
"Implementation Ideas for a Secure and Scalable Subscription/Push Service Based on Long Connections"
"Practice Sharing: How to build a highly available mobile messaging system?"
"The Practice of Go Language to Build a Ten Million-Level Online Highly Concurrent Message Push System (From 360 Company)"
"Tencent Xinge Technology Sharing: Practical Experience of Tens of Billions of Real-time Message Pushes"
"Millions Online: The Real-time Push Technology Practice of the Meipai Live Barrage System"
"The Evolution of the Message Push Architecture of the JD Jingmai Merchant Open Platform"
"Technical dry goods: from scratch, teach you to design a million-level message push system"
"Special Topic on Long Connection Gateway Technology (4): Practice of iQIYI WebSocket Real-time Push Gateway Technology"

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".


JackJiang