One article is enough to understand modern web-side instant messaging technology: WebSocket, socket.io, SSE

This article is adapted from the "Three Axes of JS Real-time Communication" series on the "Doumi Blog", with revisions and changes.

1. Introduction

I have compiled many articles on web-side instant messaging technology, so regular readers may already be familiar with the topic. Early web-side instant messaging solutions were constrained by the technical limitations of the web client, which made truly "instant" communication quite difficult to achieve.

Traditional web-side instant messaging technology progressed from short polling to long polling, and then to Comet. Under such primitive HTML standards, achieving so-called "instant" communication took a great deal of ingenuity and effort.

Since the HTML5 standard was published, technologies such as WebSocket have emerged and made web-side instant messaging far more convenient to implement. True full-duplex real-time communication, once unimaginable, is now readily achievable.

This article introduces several modern web-side instant messaging technologies: WebSocket, socket.io, and SSE. It covers everything from applicable scenarios to technical principles in an accessible yet in-depth way. It is especially suitable for readers who have some understanding of web-side instant messaging and want to learn WebSocket and other modern web-side "real-time" communication technologies, but do not want to spend time reading dry IETF specifications.

Learning and discussion:

Instant messaging / push technology development discussion group 5: 215477170 [recommended]
Introduction to Mobile IM Development: "One entry is enough for novices: Develop mobile IM from scratch"
Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article was published simultaneously at: http://www.52im.net/thread-3695-1-1.html)

2. The author of this article

"Doumi": Lives in Hangzhou, loves the front end, and loves the Internet. Doumi is the abbreviation for "potato (potato-bean)" and "Micha (rice)".
Author's blog: https://blog.5udou.cn/
Author Github: https://github.com/linxiaowu66/

3. Knowledge preparation

If you are not familiar with the history of web-side instant messaging technology, I suggest reading the following articles first:

"Beginner's Post: Detailed Explanation of the Principles of the Most Complete Web-side Instant Messaging Technology in History"
"Inventory of Instant Messaging Technologies on the Web: Short Polling, Comet, Websocket, SSE"
"Explain the evolution of web-side communication: from Ajax, JSONP to SSE, Websocket"
"Quick Start of IM Communication Technology on the Web: Short Polling, Long Polling, SSE, WebSocket"

If you already have some understanding of the technologies introduced in this article, the following in-depth articles are recommended for further study:

"Comet Technology Explained: Web-side Real-time Communication Technology Based on HTTP Long Connection"
"SSE Technology Explained: A New HTML5 Server Push Event Technology"
"WebSocket Detailed (3): In-depth WebSocket Communication Protocol Details"
"Integrating Theory with Practice: Understanding the Communication Principle, Protocol Format, and Security of WebSocket from Zero"
"WebSocket from entry to proficiency, half an hour is enough!"

4. WebSocket

I do not intend to cover the entire WebSocket protocol in detail here. Following the way I have studied protocols before, I will introduce it mainly through questions and answers, which should make it less dry to read.

4.1 Basic situation
On which layer of OSI does the protocol run?

At the application layer, the WebSocket protocol is an independent TCP-based protocol. The only relationship between it and HTTP is that its handshake is interpreted by the HTTP server as an Upgrade request.

What is the standard port number on which the protocol runs?

By default, the WebSocket protocol uses port 80 for regular WebSocket connections (ws://) and port 443 for WebSocket connections tunneled over Transport Layer Security (TLS) (wss://).

4.2 How does the protocol work?
The workflow of the protocol is shown in the following figure:

Some important header fields of the handshake need to be explained:

1) Upgrade: the header field defined in HTTP/1.1 for switching protocols. It tells the server that, if the server supports it, the client wants to switch the already-established transport-layer connection (here, a TCP connection) to a different application-layer protocol (here, WebSocket);
2) Connection: fixed to Upgrade here. The Connection header can take other values as well, which you can look up on your own;
3) Sec-WebSocket-Key: a value sent to the server (the server uses it to derive another key value, which it returns to the client in the handshake response);
4) Sec-WebSocket-Protocol: lists the subprotocols supported by the client;
5) Sec-WebSocket-Version: identifies the version of the WebSocket protocol used by the client. If the server does not support this version, it must respond with the versions it does support;
6) Origin: used for security, to guard against cross-site attacks. Browsers generally use it to identify the originating domain;
7) Sec-WebSocket-Accept: sent in the server response. It contains the value derived from Sec-WebSocket-Key, proving that the server supports the requested protocol version.
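
To make the header fields above more concrete, here is a minimal sketch of how a server might decide whether an incoming HTTP request is a WebSocket handshake. The helper name isWebSocketUpgrade and the sample headers are hypothetical, and the check is deliberately incomplete (it ignores Origin and subprotocols). Version 13 is the version defined by RFC 6455.

```javascript
// Hypothetical helper: decide whether lower-cased HTTP request headers
// (the form Node.js provides them in) describe a WebSocket handshake.
function isWebSocketUpgrade(headers) {
  const upgrade = (headers['upgrade'] || '').toLowerCase();
  // Connection may carry several comma-separated tokens, e.g. "keep-alive, Upgrade".
  const connection = (headers['connection'] || '')
    .toLowerCase()
    .split(',')
    .map((token) => token.trim());
  const key = headers['sec-websocket-key'] || '';
  return (
    upgrade === 'websocket' &&
    connection.includes('upgrade') &&
    // The key is 16 random bytes, base64-encoded: 24 characters ending in "==".
    /^[A-Za-z0-9+/]{22}==$/.test(key) &&
    headers['sec-websocket-version'] === '13'
  );
}

// Illustrative request headers (assumed values, not taken from a real capture).
const sampleHeaders = {
  host: 'example.com',
  upgrade: 'websocket',
  connection: 'Upgrade',
  'sec-websocket-key': 'dGhlIHNhbXBsZSBub25jZQ==',
  'sec-websocket-version': '13',
};
console.log(isWebSocketUpgrade(sampleHeaders));
```

A real server would also validate Origin and negotiate Sec-WebSocket-Protocol before answering with a 101 response.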

The calculation of Sec-WebSocket-Key and Sec-WebSocket-Accept is as follows:

All RFC 6455-compliant WebSocket servers use the same algorithm to answer the client's challenge: concatenate the value of Sec-WebSocket-Key with the fixed GUID string defined by the standard (258EAFA5-E914-47DA-95CA-C5AB0DC85B11), compute the SHA-1 hash of the result, base64-encode the hash, and send that string back to the client.

The code is implemented as follows:

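const crypto = require('crypto')
// GUID fixed by RFC 6455 (the original snippet reads it from constants.GUID)
const GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'
const key = crypto.createHash('sha1')
      .update(req.headers['sec-websocket-key'] + GUID, 'binary')
      .digest('base64')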
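
As a quick check of this calculation, the following sketch wraps it in a standalone function and runs it on the sample key given in RFC 6455. The function name computeAccept is my own; only Node's built-in crypto module is used.

```javascript
const crypto = require('crypto');

// GUID fixed by RFC 6455.
const GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Derive the Sec-WebSocket-Accept value from a Sec-WebSocket-Key value.
function computeAccept(secWebSocketKey) {
  return crypto
    .createHash('sha1')
    .update(secWebSocketKey + GUID)
    .digest('base64');
}

// Sample key from RFC 6455.
console.log(computeAccept('dGhlIHNhbXBsZSBub25jZQ=='));
// prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```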

As for why such a step is needed, you can refer to the article "Integrating Theory with Practice: Understanding the Communication Principle, Protocol Format, and Security of WebSocket from Zero".

Quoted as follows:

The main role of Sec-WebSocket-Key/Sec-WebSocket-Accept is to provide basic protection and reduce malicious and accidental connections.

The role is roughly summarized as follows:

1) Prevent the server from accepting invalid WebSocket connections (for example, if an HTTP client accidentally tries to connect to a WebSocket service, the server can simply refuse the connection);
2) Ensure that the server understands WebSocket connections. Because the handshake uses HTTP, the WebSocket connection request may be handled and answered by a plain HTTP server. The client can use Sec-WebSocket-Key to confirm that the server recognizes the WebSocket protocol. (This is not a 100% guarantee: there may always be some odd HTTP server that handles Sec-WebSocket-Key without actually implementing the WebSocket protocol);
3) Browsers forbid setting Sec-WebSocket-Key and related headers when an Ajax request is made. This prevents a client from accidentally requesting a protocol upgrade (WebSocket upgrade) when sending an Ajax request;
4) It prevents a reverse proxy (one that does not understand the WebSocket protocol) from returning wrong data. For example, suppose the reverse proxy receives two WebSocket upgrade requests in succession, caches the response to the first, and then returns the cached response directly when the second arrives (a meaningless response);
5) The main purpose of Sec-WebSocket-Key is not to ensure data security, because the formula converting Sec-WebSocket-Key into Sec-WebSocket-Accept is public and very simple. Its main function is to prevent common accidental (unintentional) situations.

To emphasize: the Sec-WebSocket-Key/Sec-WebSocket-Accept conversion provides only a basic safeguard. It offers no practical guarantee that the connection is secure, that the data is secure, or that the client and server are legitimate WebSocket endpoints.

4.3 What is the frame format of the protocol?
The format defined by the frame format is as follows:

The explanation of each field is as follows:

1) FIN: 1 bit. Indicates that this is the final fragment of a message. The first fragment may also be the final fragment;
2) RSV1, RSV2, RSV3: 1 bit each. Unless the two parties have negotiated an extension that defines them, these bits must be 0; otherwise the WebSocket connection must be closed. RSV1 is used in ws to indicate whether the message is compressed;
3) opcode: 4 bits. Indicates the type of frame being transmitted:

%x0 denotes a continuation frame;
%x1 denotes a text frame;
%x2 denotes a binary frame;
%x3-7 are reserved for future non-control frames;
%x8 denotes a connection close;
%x9 denotes a ping (heartbeat check);
%xA denotes a pong (heartbeat check);
%xB-F are reserved for future control frames.
4) Mask: 1 bit. Defines whether the payload data is masked. If set to 1, a masking key must be present in the masking-key field. This bit is 1 for all frames sent from the client to the server;
5) Payload length: the length of the payload data in bytes, encoded in 7 bits, 7+16 bits, or 7+64 bits. If the 7-bit value is in the range 0-125, it is the payload length itself. If the value is 126, the following 2 bytes, interpreted as a 16-bit unsigned integer, give the payload length. If the value is 127, the following 8 bytes, interpreted as a 64-bit unsigned integer, give the payload length. Multi-byte lengths are expressed in network byte order. The payload length is the length of the extension data plus the length of the application data. The extension data may have length 0, in which case the payload length equals the application data length;
6) Masking-key: 0 or 4 bytes. All frames sent from the client to the server are masked with a 32-bit value carried in the frame; the masking key is present only when the mask bit is set to 1;
7) Extension data: x bytes. Unless the client and server have negotiated an extension, the extension data length is always 0. Any extension must specify the length of the extension data, or how that length is calculated, and how the extension is negotiated during the handshake. If extension data is present, it is included in the payload length;
8) Application data: y bytes. Arbitrary application data placed after the extension data. Application data length = payload length - extension data length;
9) Payload data: (x+y) bytes. The payload data is the extension data followed by the application data.
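
The field layout above can be sketched as a small header parser. This is only an illustration under simplifying assumptions: the function name parseFrameHeader is my own, the whole header is assumed to be present in a single Node.js Buffer, and RSV2/RSV3 and extension data are ignored. The two sample frames are the "Hello" text frames given as examples in RFC 6455.

```javascript
// Parse the header of one WebSocket frame from a Node.js Buffer.
function parseFrameHeader(buf) {
  const fin = (buf[0] & 0x80) !== 0;     // bit 0 of byte 0
  const rsv1 = (buf[0] & 0x40) !== 0;    // bit 1 of byte 0
  const opcode = buf[0] & 0x0f;          // low 4 bits of byte 0
  const masked = (buf[1] & 0x80) !== 0;  // bit 0 of byte 1
  let payloadLength = buf[1] & 0x7f;     // low 7 bits of byte 1
  let offset = 2;
  if (payloadLength === 126) {
    // Next 2 bytes: 16-bit unsigned integer, network byte order.
    payloadLength = buf.readUInt16BE(2);
    offset = 4;
  } else if (payloadLength === 127) {
    // Next 8 bytes: 64-bit unsigned integer. Converting to Number is
    // exact only up to 2^53 - 1, which is enough for a sketch.
    payloadLength = Number(buf.readBigUInt64BE(2));
    offset = 10;
  }
  let maskingKey = null;
  if (masked) {
    maskingKey = buf.subarray(offset, offset + 4);
    offset += 4;
  }
  return { fin, rsv1, opcode, masked, payloadLength, maskingKey, headerLength: offset };
}

// Unmasked and masked single-frame text messages containing "Hello".
const unmaskedHello = Buffer.from([0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f]);
const maskedHello = Buffer.from([0x81, 0x85, 0x37, 0xfa, 0x21, 0x3d, 0x7f, 0x9f, 0x4d, 0x51, 0x58]);
console.log(parseFrameHeader(unmaskedHello));
console.log(parseFrameHeader(maskedHello));
```

The payload starts at headerLength, so a caller would read payloadLength bytes from that offset and unmask them if masked is true.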

For more details, please refer to the data framing section of RFC 6455; they are not repeated here.

Of the fields introduced above, Mask deserves further explanation.

Masking-key is a 32-bit random number selected by the client. The mask operation does not affect the length of the data payload.

The following algorithms are used for masking and de-masking operations.

First, suppose:

1) original-octet-i: the i-th byte of the original data;
2) transformed-octet-i: the i-th byte of the transformed data;
3) j: is the result of i mod 4;
4) masking-key-octet-j: is the jth byte of the mask key.

The algorithm is described as: original-octet-i and masking-key-octet-j are XORed to get transformed-octet-i.

That is:

j = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j

Implemented in code:

// XOR each source byte with the mask byte at index (i mod 4) and write it to output
const mask = (source, mask, output, offset, length) => {
  for (let i = 0; i < length; i++) {
    output[offset + i] = source[i] ^ mask[i & 3];
  }
};

Unmasking applies the same XOR operation again, in place:

const unmask = (buffer, mask) => {
  // Required until https://github.com/nodejs/node/issues/9006 is resolved.
  const length = buffer.length;
  for (let i = 0; i < length; i++) {
    buffer[i] ^= mask[i & 3];
  }
};
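
To see that masking and unmasking really are the same operation, here is a small self-contained round trip. The function name applyMask is my own; the masking key and the expected masked bytes are those of the masked "Hello" example in RFC 6455.

```javascript
// XOR every byte with the masking-key byte at index (i mod 4).
// Returns a new Buffer and leaves the input untouched.
function applyMask(data, maskingKey) {
  const out = Buffer.alloc(data.length);
  for (let i = 0; i < data.length; i++) {
    out[i] = data[i] ^ maskingKey[i & 3];
  }
  return out;
}

const key = Buffer.from([0x37, 0xfa, 0x21, 0x3d]);
const masked = applyMask(Buffer.from('Hello'), key);
console.log(masked.toString('hex'));            // prints "7f9f4d5158"
console.log(applyMask(masked, key).toString()); // prints "Hello"
```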

As for why the masking operation is needed, you can again refer to the earlier article "Integrating Theory with Practice: Understanding the Communication Principles, Protocol Formats, and Security of WebSocket from Zero"; I will not reproduce it in full here.

The key points are quoted below:

In the WebSocket protocol, the purpose of data masking is to strengthen the security of the protocol. However, masking does not protect the data itself, because the algorithm is public and the computation is simple. Apart from encrypting the channel itself, there do not seem to be many effective ways to protect communication security.

So why introduce masking at all? It seems to bring little benefit beyond extra computation (a point that puzzles many readers).

The answer is still one word: security. The goal is not to prevent data leakage, but to prevent problems such as the proxy cache poisoning attacks that existed in earlier versions of the protocol.

5. socket.io

5.1 Introduction to this section

Having introduced the WebSocket protocol in the previous section, we now turn to the second weapon of modern web-side instant messaging technology: socket.io.

Some readers will probably ask: what is the difference between WebSocket and socket.io?

Before understanding socket.io, let's talk about the implementation background of traditional web-side instant messaging "long connection" technology.

5.2 Technical implementation background of traditional web long connection
In real web products, not all web clients support persistent connections. Put another way, before the WebSocket protocol appeared, there were three ways to achieve WebSocket-like functionality.

The three ways are:

1) Flash: using Flash is a simple approach. The obvious disadvantage is that Flash is not installed on all clients, such as the iPhone/iPad;
2) Long-Polling: the well-known "long polling". It used to be an effective technique, but it does not optimize message sending. Although I would not call AJAX long polling a hack, it is certainly not an optimal method;
3) Comet: in the past, this was known as the web-side "server push" technology. Compared with traditional web applications, developing Comet-based real-time communication applications is somewhat challenging.

So if you simply use WebSocket, what about the clients that do not support it? Do we just give up on them?

Of course not. Guillermo Rauch wrote the socket.io library, which wraps WebSocket so that long-lived connections work in all scenarios, provided that the matching client code is used with it.

socket.io uses feature detection to decide whether to establish the connection via WebSocket, AJAX long polling, Flash, or another method.

So how does socket.io do this?

Let's study it with the following questions in mind:

1) What are the new features of socket.io?
2) How does socket.io implement feature detection?
3) What are the pitfalls of socket.io?
4) What is the actual application of socket.io and what should I pay attention to?

Readers who are already clear about the questions above do not need to read on.

5.3 Introduction to socket.io
From the previous section, readers already know what WebSocket does. So what does socket.io add on top of it?

socket.io actually has its own protocol that wraps WebSocket, called the engine.io protocol, and on top of this protocol it implements a low-level bidirectional communication engine, Engine.io.

socket.io itself is an application-layer framework built on engine.io. So our study focuses on the engine.io protocol.

The socket.io README lists some of the features it implements (this answers question 1):

1) Reliability: the connection can still be established in the presence of proxies, load balancers, personal firewalls, or anti-virus software;
2) Automatic reconnection: unless configured otherwise, a disconnected client keeps trying to reconnect until the server is available again;
3) Disconnection detection: a heartbeat mechanism is implemented at the Engine.io layer, which lets the client and the server each know when the other side has stopped responding. It is implemented with timers on both the server and the client. During the connection handshake, the server tells the client the heartbeat interval and the timeout;
4) Binary support: any serializable data structure can be sent;
5) Cross-browser support: the library even supports IE8;
6) Multiplexing support: to separate concerns within an application, Socket.io lets you create multiple namespaces, which have separate communication channels but share the same underlying connection;
7) Room support: within each namespace you can define any number of channels, called "rooms". You can join or leave a room, and even broadcast messages to a given room.

Note: Socket.IO is not an implementation of WebSocket. Although Socket.IO does use WebSocket as a transport when possible, it adds a lot of metadata to each message: message type, namespace, and ack id. This is why a standard WebSocket client cannot successfully connect to a Socket.IO server, and likewise a Socket.IO client cannot connect to a standard WebSocket server.

5.4 Introduction to engine.io protocol
The handshake process of the complete engine.io protocol is as follows:

The engine.io protocol version discussed here is 3. Let's go through the protocol briefly based on the figure above.

5.4.1) Engine.io protocol request fields:

As you can see, the request URL differs from that of a plain WebSocket connection. The query fields are:

1) EIO=3: indicates that version 3 of the Engine.io protocol is used;
2) transport=polling/websocket: indicates whether the long-lived connection uses polling or WebSocket;
3) t=xxxxx: the code uses yeast to generate a unique string based on the timestamp;
4) sid=xxxx: the session id obtained once the client and the server have established a connection. After obtaining it, the client must add this field to every request.

In addition to the four fields above, the protocol also describes the following fields:

1) j: if the transport is polling but a JSONP response is required, j should be set to the index of the JSONP response;
2) b64: if the client does not support XHR2, it should send b64=1 to the server, telling the server that all binary data must be base64 encoded before being sent.

In addition, the default path of engine.io is /engine.io, and socket.io sets it to /socket.io when it is initialized, so the path you see is /socket.io:

function Server(srv, opts){
  if (!(this instanceof Server)) return new Server(srv, opts);
  if ('object' == typeof srv && srv instanceof Object && !srv.listen) {
    opts = srv;
    srv = null;
  }

  opts = opts || {};
  this.nsps = {};
  this.parentNsps = new Map();
  // The default engine.io path is overridden with /socket.io here
  this.path(opts.path || '/socket.io');
  // ... (rest of the constructor omitted)
}
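
As an illustration of how the query fields and path described above fit together, here is a small sketch that builds such a request URL. The helper name buildEngineIoUrl, the localhost origin, and the trailing slash on the path are my own assumptions, and the t value is only a stand-in for the yeast-generated string.

```javascript
// Hypothetical helper: build an engine.io (protocol v3) request URL.
function buildEngineIoUrl(origin, { transport = 'polling', sid, path = '/socket.io/' } = {}) {
  const url = new URL(path, origin);
  url.searchParams.set('EIO', '3');
  url.searchParams.set('transport', transport);
  // Stand-in for the yeast value: any string that changes from request to request.
  url.searchParams.set('t', Date.now().toString(36));
  // sid is only known after the handshake, so the first request omits it.
  if (sid) url.searchParams.set('sid', sid);
  return url.toString();
}

console.log(buildEngineIoUrl('http://localhost:3000'));
console.log(buildEngineIoUrl('http://localhost:3000', { sid: 'Peed250dk55pprwgAAAA' }));
```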

5.4.2) Data packet coding requirements:

The data packet encoding of the engine.io protocol has its own set of formats. In the introduction of the protocol, engine.io-protocol defines two encoding types: packet and payload.

An encoded packet is in the following format:
<packettype id>[<data>]

Then the protocol defines the following packet types (identified by numbers):

1) 0 (open): sent by the server when a new transport is opened;
2) 1 (close): requests that this transport be closed, but does not close the connection itself;
3) 2 (ping): sent by the client; the server must respond with a pong packet containing the same data;
4) 3 (pong): sent by the server in response to a ping packet;
5) 4 (message): the actual message. Both the client and the server can listen for the message event to obtain the message content;
6) 5 (upgrade): before engine.io switches transports, it tests whether the server and the client can communicate over the new transport. If the test succeeds, the client sends an upgrade packet asking the server to flush its cache on the old transport and switch to the new transport;
7) 6 (noop): mainly used to force a polling cycle when an incoming WebSocket connection is received.
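
The packet format and type table above can be sketched in a few lines. This is a simplified illustration for string packets only, with function names of my own choosing; it is not the code of the engine.io parser.

```javascript
// Packet type names, indexed by the numeric id from the table above.
const PACKET_TYPES = ['open', 'close', 'ping', 'pong', 'message', 'upgrade', 'noop'];

// Encode a string packet as <packet type id>[<data>].
function encodePacket(type, data = '') {
  const id = PACKET_TYPES.indexOf(type);
  if (id === -1) throw new Error(`unknown packet type: ${type}`);
  return `${id}${data}`;
}

// Split an encoded string packet back into its type name and data.
function decodePacket(encoded) {
  if (encoded.length === 0) throw new Error('empty packet');
  const type = PACKET_TYPES[Number(encoded.charAt(0))];
  if (type === undefined) throw new Error(`unknown packet: ${encoded}`);
  return { type, data: encoded.slice(1) };
}

console.log(encodePacket('message', 'hello')); // prints "4hello"
console.log(decodePacket('2probe'));           // { type: 'ping', data: 'probe' }
```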

The payload also has corresponding format requirements:

1) If only strings are sent and XHR2 is not supported, the encoding format is: <length1>:<packet1>[<length2>:<packet2>[...]];
2) When XHR2 is not supported and binary data is sent as base64 encoded strings, the encoding format is: <length of base64 representation of the data + 1 (for packet type)>:b<packet1 type><packet1 data in b64>[...];
3) When XHR2 is supported, all data is encoded as binary, and the format is: <0 for string data, 1 for binary data><Any number of numbers between 0 and 9><The number 255><packet1 (first type, then data)>[...];
4) If the content to be sent mixes UTF-8 characters and binary data, each character of the string is written as a character code, represented by 1 byte.

Note: The encoding requirements of the payload do not apply to WebSocket communication.

In response to the above coding requirements, let's just give an example.

On the first polling request, the server sent this data:

97:0{"sid":"Peed250dk55pprwgAAAA","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000}2:40

From what we covered above, we know that the first packet the server sends is an open packet.

So the packet starts with the type id:
0

The server then tells the client that it can try to upgrade to websocket, and gives it the corresponding sid.

So after assembling, the packet is:

0{"sid":"Peed250dk55pprwgAAAA","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000}

Then, following the payload encoding format: the packet is a string and its length is 97 bytes,

so it becomes:

97:0{"sid":"Peed250dk55pprwgAAAA","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000}

The second part of the data is a packet of type message (4) whose data is 0, so the packet is 40. Its length is 2 bytes, so it is encoded as 2:40. Putting the two parts together gives the result shown above.
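
The walkthrough above can be reproduced with a small sketch of the string payload format. The function names are my own, and the length is counted in characters here, which equals the byte length for ASCII-only data such as this example.

```javascript
// Encode string packets as <length>:<packet><length>:<packet>...
function encodePayload(packets) {
  return packets.map((p) => `${p.length}:${p}`).join('');
}

// Split a string payload back into its encoded packets.
function decodePayload(payload) {
  const packets = [];
  let pos = 0;
  while (pos < payload.length) {
    const colon = payload.indexOf(':', pos);
    if (colon === -1) throw new Error('malformed payload: missing ":"');
    const length = Number(payload.slice(pos, colon));
    if (!Number.isInteger(length) || length < 0) {
      throw new Error('malformed payload: bad length');
    }
    packets.push(payload.slice(colon + 1, colon + 1 + length));
    pos = colon + 1 + length;
  }
  return packets;
}

// The open packet and the message packet from the example above.
const open = '0{"sid":"Peed250dk55pprwgAAAA","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000}';
const payload = encodePayload([open, '40']);
console.log(payload);

// Decoding gives back both packets; the open packet's data is JSON.
const decoded = decodePayload(payload);
const handshake = JSON.parse(decoded[0].slice(1));
console.log(handshake.pingInterval, handshake.pingTimeout); // prints 25000 60000
```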

Note:

The server tells the client the ping/pong timing: "pingInterval": 25000, "pingTimeout": 60000. That is, the heartbeat interval is 25 seconds by default, and the wait for a pong response is 60 seconds by default.

5.5 Required process for upgrading the transport
The protocol defines the process required to upgrade the transport to websocket.

As shown below:

The WebSocket test starts by sending a probe. If the server responds to the probe, the client must send an upgrade packet.

To ensure that no packets are lost, the upgrade packet may be sent only after all buffers of the current transport have been flushed and the transport is considered paused. When the server receives the upgrade packet, it must assume that this is a new channel and send all of its buffered data to this channel.

The result in Chrome is as follows:
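
The probe exchange described above can be sketched from the client's point of view. In engine.io protocol v3 the probe is a ping packet carrying the data "probe" (2probe), the server's answer is the matching pong (3probe), and the upgrade packet is 5. The function name startUpgradeProbe and the send callback are hypothetical, and pausing and flushing the polling transport is deliberately left out of this sketch.

```javascript
// Hypothetical client-side sketch of the WebSocket upgrade probe.
// `send` is whatever function writes a string over the new WebSocket transport.
function startUpgradeProbe(send) {
  let upgraded = false;
  send('2probe'); // ping packet with data "probe"
  return {
    // Feed every message received on the WebSocket transport into this handler.
    onMessage(message) {
      if (!upgraded && message === '3probe') {
        // A real client must first pause and flush the polling transport here.
        send('5'); // upgrade packet
        upgraded = true;
      }
    },
    isUpgraded() {
      return upgraded;
    },
  };
}

// Simulated exchange: collect what the client sends and fake the server's reply.
const sentPackets = [];
const probe = startUpgradeProbe((packet) => sentPackets.push(packet));
probe.onMessage('3probe');
console.log(sentPackets, probe.isUpgraded());
```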

5.6 Code implementation of engine.io
After getting familiar with the engine.io protocol, let's take a look at how the code implements the main process.

The main implementation process of the client's engine.io is introduced in the text above.

Combining this with the engine.io code gives the following client flow chart:

The code of the server side is very similar to that of the client side, and its implementation flow chart is as follows:

6. SSE

6.1 Introduction to this section

The previous two sections of this article analyzed WebSocket and socket.io. Now let's take a look at SSE.

Many people may wonder: with real-time communication like WebSocket available, why do we still need SSE?

The answer is actually simple: SSE is one-way communication, while WebSocket is two-way communication.

For example, in scenarios such as stock quotations and news feeds that only require the server to send messages to the client, it may be more appropriate to use SSE.

In addition, SSE is transmitted over HTTP, which means we can use it without a special protocol or additional server implementation. WebSocket requires a full-duplex connection and a new WebSocket server to handle it. SSE also has some features that WebSocket lacks by design, such as automatic reconnection, event IDs, and the ability to send arbitrary events. Each has its own strengths, and we need to choose according to the actual application scenario.

6.2 Introduction to SSE
The basic model of SSE is: a client subscribes to a "stream" from the server, and the server can then send messages to the client until either the server or the client closes the "stream". Hence the full name of SSE: "Server-Sent Events".

Compared with previous polling, SSE can bring higher efficiency to B2C.

There is a picture that draws the difference between the two:

6.3 Format of SSE data frame
An SSE stream must be encoded as UTF-8. Each field of a message ends with "\n", a blank line ends the message, and the specification defines the following 4 fields.

The 4 fields are:

1) event: the event type;
2) data: the data being sent;
3) id: the ID of each event;
4) retry: tells the browser how long to wait before opening a new connection after the connection is lost. During automatic reconnection, the last event ID received is sent to the server.
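
To show what these fields look like on the wire, here is a minimal sketch that serializes one message. The function name formatSseMessage and the sample values are my own; non-string data is sent as JSON, which is an assumption about the application rather than a requirement of the format.

```javascript
// Serialize one SSE message: "field: value" lines, then a blank line to end it.
function formatSseMessage({ id, event, retry, data = '' }) {
  const lines = [];
  if (retry !== undefined) lines.push(`retry: ${retry}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  const text = typeof data === 'string' ? data : JSON.stringify(data);
  // Multi-line data is sent as several "data:" lines.
  for (const line of text.split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n';
}

const message = formatSseMessage({
  id: 1,
  event: 'server-time',
  retry: 20000,
  data: { count: 1 },
});
console.log(JSON.stringify(message));
// prints "retry: 20000\nid: 1\nevent: server-time\ndata: {\"count\":1}\n\n"
```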

The following figure shows the raw format of the data packet as captured by Wireshark:

6.4 SSE communication process
The communication process of SSE is relatively simple, and some of the underlying implementations are encapsulated by the browser, including data processing.

The general process is as follows:

The screenshot in the browser is as follows:

The data carried is in JSON format, and the browser will help you integrate it into an Object:

In Wireshark, the communication process is as follows.

Send the request:

Get the response:

Before starting to push the information stream, the server will also send a packet that the client will ignore. The specific reason for this is not clear:

Reconnection after disconnection:

6.5 Simple usage example of SSE
Use on the browser side:
const es = new EventSource('/sse')

Use of the server:

const sseStream = new SseStream(req)
sseStream.pipe(res)
sseStream.write({
  id: sendCount,
  event: 'server-time',
  retry: 20000, // tell the client to wait 20 seconds before retrying after a disconnect
  data: {ts: new Date().toTimeString(), count: sendCount++}
})
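
For comparison, the same idea can be sketched without the SseStream helper, using only what a Node.js HTTP request/response pair offers. The handler name handleSse and the one-second interval are my own assumptions; the response headers are the ones an SSE endpoint normally sends.

```javascript
// Hypothetical request handler: streams one 'server-time' event per second.
function handleSse(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  let sendCount = 0;
  const send = () => {
    const data = JSON.stringify({ ts: new Date().toTimeString(), count: sendCount });
    // One message: field lines followed by a blank line.
    res.write(`id: ${sendCount}\nevent: server-time\nretry: 20000\ndata: ${data}\n\n`);
    sendCount++;
  };
  send();
  const timer = setInterval(send, 1000);
  // Stop sending once the client goes away.
  req.on('close', () => clearInterval(timer));
}

// To try it: require('http').createServer(handleSse).listen(3000),
// then in a page served from the same origin:
//   const es = new EventSource('/');
//   es.addEventListener('server-time', (e) => console.log(JSON.parse(e.data)));
```

Because the event has a custom type, the browser delivers it to listeners registered for 'server-time' rather than to onmessage.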

For more API usage and demo introduction, please refer to: SSE API, demo code.

6.6 Compatibility and disadvantages
Compatibility:

▲ The picture above is from https://caniuse.com/?search=Server-Sent-Events

Disadvantages:

1) Because it is server -> client only, it cannot handle a stream of requests from the client;
2) Because it is explicitly specified for transmitting UTF-8 data, it is inefficient for binary streams. Even with base64 encoding, the extra bandwidth makes it not worth the cost.

7. Reference materials

[1] WebSocket API documentation
[2] SSE API documentation
[3] Beginner’s post: the most comprehensive web-side instant messaging technology in history.
[4] Web-side instant messaging technology inventory: short polling, Comet, Websocket, SSE
[5] Detailed SSE technology: a new HTML5 server push event technology
[6] Comet technical details: Web-side real-time communication technology based on HTTP long connection
[7] Quick start for novices: WebSocket concise tutorial
[8] Detailed WebSocket (3): In-depth WebSocket communication protocol details
[9] Detailed WebSocket (4): Questioning the relationship between HTTP and WebSocket (Part 1)
[10] Detailed WebSocket (5): Questioning the relationship between HTTP and WebSocket (Part 2)
[11] Use WebSocket and SSE technology to achieve Web-side message push
[12] Explain the evolution of web-side communication methods: from Ajax and JSONP to SSE and Websocket
[13] Why does MobileIMSDK-Web's network layer framework use Socket.io instead of Netty?
[14] Combining theory with practice: understanding the communication principle, protocol format, and security of WebSocket from scratch
[15] WebSocket from entry to proficiency, half an hour is enough!
[16] Introduction to WebSocket Hardcore: 200 lines of code, teach you how to use a WebSocket server by hand
[17] Quick start of web-side IM communication technology: short polling, long polling, SSE, WebSocket

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".
The synchronous publishing link is: http://www.52im.net/thread-3695-1-1.html
