Technical Practice｜NetEase Yunxin IM SDK Service High-Availability Technical Solution

Introduction: "Domain hijacking is a method of Internet attack. By attacking or forging the domain name resolution server (DNS), the attacker makes the target website's domain name resolve to a wrong IP address, so that users either cannot access the target website or are deliberately/maliciously directed to a designated IP address (website)." (Quoted from the Baidu Encyclopedia entry "Domain Hijacking".) NetEase Yunxin IM SDK is a ToB product that supports the development of a wide range of third-party businesses. Across the complex network environments it runs in, DNS hijacking and DNS pollution occur from time to time. So how can we avoid such incidents while providing services?

Text｜Hao Kui

Senior C++ Development Engineer, NetEase Yunxin

Technology is a double-edged sword.

Although "domain name hijacking" contains the word "hijacking", it is really a neutral term in this context. For example, for access to illegal websites, the DNS service can resolve the corresponding domain name to an inaccessible IP address, thereby blocking access to the illegal website and issuing a warning. During the operation and maintenance of NetEase Yunxin's instant messaging products, however, there have been incidents in which the service domain name "netease.im" was maliciously hijacked by people or organizations with ulterior motives. As a result, applications that integrate the NetEase Yunxin IM SDK could not log in normally, which affected customers and their users. To understand how such an incident happens, let's analyze the login process of NetEase Yunxin IM SDK:

From the process point of view, if DNS hijacking occurs at the "Update LBS" node, access to the NetEase Yunxin LBS service may time out or return an incorrect response, so the IM SDK fails to obtain a valid Link server address and port. How can such incidents be avoided? This article shares the high-availability technical solution for NetEase Yunxin client-side services and the implementation of its high-availability component.

1. How to prevent DNS hijacking

Usually, after a domain name is hijacked, we can use the following methods:

These are all remedies applied after hijacking has occurred, and they are not flexible enough for either the service provider or the service user. To solve these problems and prevent incidents in advance, we mainly adopt the following two solutions:

Integrating the HttpDNS service reduces the risk of domain name hijacking in all scenarios.

1. LocalDNS domain name hijacking

Domain name hijacking has long plagued developers. It manifests when the DNS resolution result IP1 that domain name A should return is maliciously replaced with IP2, so that access to A fails or lands on an insecure site. Common methods of domain name hijacking include the following:

Hackers break into broadband routers and tamper with the end user's LocalDNS setting so that it points to a forged LocalDNS, then use their control over that LocalDNS to return wrong IP information and hijack the domain name.

Attackers monitor the end user's domain name resolution requests and deliver a forged DNS response before LocalDNS returns the correct result, thereby controlling which address the end user accesses.

Cache pollution: the resolution results cached by LocalDNS are altered for some domain names, so users are directed to the altered address. The schematic diagram is as follows:

2. HttpDNS implementation principle

Step 1: The client directly accesses the HttpDNS interface to obtain the IP list that the business has configured in the domain name configuration management system as "correct" and offering the "optimal access speed".

Step 2: After obtaining the IP, the client sends the service protocol request directly to this IP. Taking an HTTP request as an example, specifying the Host field in the header lets you send a standard HTTP request to the IP returned by HttpDNS. For HTTPS, SNI must also be considered.
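Step 2 for plain HTTP can be sketched in Python as follows. This is a minimal illustration, not the SDK's code: the function name `fetch_via_ip`, the `connection_factory` parameter, and the example host and IP are assumptions made for this sketch. The connection goes to the IP returned by HttpDNS, while the Host header still carries the original domain name so the server can route the request to the right virtual host.

```python
import http.client


def fetch_via_ip(ip, host, path="/", port=80, timeout=5.0,
                 connection_factory=http.client.HTTPConnection):
    """GET `path` from `ip` over plain HTTP, presenting `host` in the Host header.

    `connection_factory` is a hypothetical hook so the sketch can be exercised
    without a network; by default it is the standard library connection class.
    """
    conn = connection_factory(ip, port, timeout=timeout)
    try:
        # Passing Host explicitly keeps http.client from deriving it from the IP.
        conn.request("GET", path, headers={"Host": host})
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

A call such as `fetch_via_ip("203.0.113.10", "lbs.example.com", "/lbs")` would connect to the IP but ask for the named host. Both values are placeholders.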

2. NetEase Yunxin Service High-Availability Strategy

To improve service availability, the NetEase Yunxin SDK integrates the HttpDNS service. The overall structure of the high-availability solution is as follows:

The general process by which the IM SDK uses the HttpDNS service to achieve high service availability:

1. Implementation of end-side HttpDNS SDK

The IM SDK high-availability component adopts a cross-platform development approach, mainly supporting the Native SDKs (Windows, macOS, iOS, Android). The basic structure is as follows:

To ensure high availability, timely responses, and correct results, the high-availability component must provide the following functions:

HttpDNS service interface update and cache maintenance

Domain name query result update and cache maintenance

Implementation of HTTP request

1.1 Tiered HTTP request

A domain name may resolve to multiple IP addresses through the HttpDNS domain name query service. If these addresses include nodes that are offline or slow to access, the overall HTTP access time grows. In the worst case, every address is tried and every attempt times out, as shown in the following figure:

To improve the efficiency of HTTP requests, the NetEase Yunxin high-availability component introduces a multi-address stepped HTTP request mechanism. For example, suppose the HTTP request timeout for a single link is 30 seconds. When the first link is accessed, a timer Timer0 with a timeout of 3 seconds is started. If the request returns within 3 seconds and the business module verifies it as a correct response, the entire multi-address stepped request ends. Otherwise, when Timer0 fires, the HTTP request for the next link is started along with a new timer Timer1, also with a timeout of 3 seconds, and so on, until a correct response is received or all links have been accessed. The process is as follows:
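In code, the stepped mechanism can be sketched roughly like this. It is an illustration in Python with `asyncio`, not the component's C++ implementation. The names `staggered_request`, `fetch`, and `is_valid` are invented for the sketch, and one behaviour is an assumption the article does not state: a link that fails before its timer fires lets the next link start immediately.

```python
import asyncio


async def staggered_request(addresses, fetch, is_valid, step=3.0, timeout=30.0):
    """Try `addresses` in order, starting the next one every `step` seconds.

    `fetch(addr)` is a caller-supplied coroutine function and `is_valid(resp)`
    is the business-level check; both are placeholders for this sketch.
    Returns the first valid response, or None if every address fails.
    """
    remaining = list(addresses)
    pending = set()
    try:
        while remaining or pending:
            if remaining:
                addr = remaining.pop(0)
                # Each link keeps its own overall timeout (30 s in the example).
                pending.add(asyncio.create_task(
                    asyncio.wait_for(fetch(addr), timeout)))
            # While more addresses are left, wait at most one step (Timer0,
            # Timer1, ...); afterwards wait for the in-flight requests only.
            done, pending = await asyncio.wait(
                pending,
                timeout=step if remaining else None,
                return_when=asyncio.FIRST_COMPLETED)
            valid = [t.result() for t in done
                     if t.exception() is None and is_valid(t.result())]
            if valid:
                return valid[0]
            # Assumption: a link that fails early lets the next one start
            # right away instead of waiting for the timer to fire.
        return None
    finally:
        for task in pending:
            task.cancel()  # stop requests that lost the race
```

With three addresses and the values from the example, the worst case for starting the last request drops from 60 seconds (two sequential 30-second timeouts) to 6 seconds (two 3-second steps).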

1.2 Update of HttpDNS service interface and cache maintenance

HttpDNS is itself an HTTP service that may be hijacked. So in addition to the HttpDNS domain name, the high-availability component has multiple built-in fixed IPs to handle the case where the HttpDNS domain name is hijacked and becomes inaccessible. A fixed IP solves the hijacking problem, but it is not necessarily the latest or optimal node. To address this, the high-availability component updates the HttpDNS service address at specified intervals to obtain the latest and optimal node. To reduce the number of requests to HttpDNS, a TTL (usually 1 hour) is introduced for the service address: during its validity period the local cache is used, and the service address is not requested from HttpDNS again.
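A rough Python sketch of that behaviour: start from the built-in fixed IPs, refresh the HttpDNS service address list once the TTL has run out, and keep the previous list if the refresh fails. The class name, the `fetch_latest` callback, and keeping the old list on failure are assumptions for illustration; the article only specifies the built-in IPs and the roughly one-hour TTL.

```python
import time


class ServiceAddressCache:
    """Holds the HttpDNS service addresses, refreshed at most once per TTL."""

    def __init__(self, builtin_ips, fetch_latest, ttl=3600.0, clock=time.monotonic):
        self._addresses = list(builtin_ips)  # fixed IPs shipped with the component
        self._fetch_latest = fetch_latest    # hypothetical: asks HttpDNS for its current nodes
        self._ttl = ttl
        self._clock = clock
        self._expires_at = None              # None means "never refreshed yet"

    def addresses(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            try:
                latest = self._fetch_latest(list(self._addresses))
            except OSError:
                latest = None                # unreachable: keep what we have
            if latest:
                self._addresses = list(latest)
                self._expires_at = now + self._ttl
        return list(self._addresses)
```

Injecting `clock` keeps the TTL logic checkable without waiting an hour. A real client would also persist the list across restarts, which this sketch leaves out.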

The initialization process of HttpDNS service is as follows:

1.3 Update of domain name query results and cache maintenance

To ensure timely responses to domain name queries and reduce the number of requests to HttpDNS, the high-availability component introduces a "query result cache" that stores queried domain name results locally. To keep results correct, a TTL (usually 5 minutes) is also applied, and a result is updated again after this time. At the same time, to keep responses timely, a redundancy time (usually TTL*0.75) is added on top of the TTL. A domain name query through the high-availability component therefore has three possible outcomes:

The cache has not reached the redundancy time: the cached result is returned.

The cache has reached the redundancy time but has not expired: the cached result is returned and an update request is initiated at the same time.

The cache has expired: an update request is initiated. If it succeeds, the cache is updated and the result is returned to the caller. If it fails, the cached data continues to be used.
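The three outcomes above map onto a small cache class. The following Python sketch is illustrative only: `DomainCache`, the `resolve` callback (standing in for an HttpDNS query), and the `spawn` hook for background refreshes are names made up here, while the 5-minute TTL and the 0.75 factor come from the text.

```python
import threading
import time


def _run_in_background(fn):
    threading.Thread(target=fn, daemon=True).start()


class DomainCache:
    """Query result cache with a TTL and an early-refresh threshold."""

    def __init__(self, resolve, ttl=300.0, refresh_ratio=0.75,
                 clock=time.monotonic, spawn=_run_in_background):
        self._resolve = resolve          # hypothetical: domain -> list of IPs via HttpDNS
        self._ttl = ttl
        self._refresh_after = ttl * refresh_ratio
        self._clock = clock
        self._spawn = spawn
        self._entries = {}               # domain -> (ips, stored_at)
        self._lock = threading.Lock()

    def _refresh(self, domain):
        try:
            ips = self._resolve(domain)
        except OSError:
            return False
        if not ips:
            return False
        with self._lock:
            self._entries[domain] = (list(ips), self._clock())
        return True

    def lookup(self, domain):
        with self._lock:
            entry = self._entries.get(domain)
        if entry is not None:
            ips, stored_at = entry
            age = self._clock() - stored_at
            if age < self._refresh_after:
                return ips                                  # case 1: fresh
            if age < self._ttl:
                self._spawn(lambda: self._refresh(domain))  # case 2: refresh early
                return ips
        # Case 3 (or first lookup): refresh now, fall back to old data on failure.
        if self._refresh(domain):
            with self._lock:
                return self._entries[domain][0]
        return entry[0] if entry is not None else []
```

With the default values, an entry is served as-is for the first 225 seconds, served while being refreshed between 225 and 300 seconds, and refreshed before being served after that.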

The process of calling highly available components for domain name query is as follows:

1.4 HTTP access process design

2. Report the suspected hijacking incident

When the high-availability component is called for an HTTP request, if the request fails for non-network reasons and an HttpDNS domain name query is triggered, the domain name being accessed is considered possibly hijacked. The high-availability component then collects information related to the domain name and reports it to the NetEase Yunxin data market. The data market backend determines from the reported information whether domain name hijacking has occurred and, depending on the actual situation, decides whether to reconfigure the HttpDNS resolution result or to work with the security department to ban the corresponding App.
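The reporting condition can be illustrated with a short Python sketch. Everything concrete in it is hypothetical: the article does not describe the report format, so the function name, the `failure` labels, and the report fields are placeholders chosen only to show the decision "non-network failure plus HttpDNS lookup triggered".

```python
import json
import time


def build_hijack_report(domain, failure, httpdns_triggered, local_ips, httpdns_ips):
    """Return a JSON report when a failed request looks like hijacking, else None.

    `failure` is a coarse, made-up classification: "network" for connectivity
    problems, anything else (for example "bad_response") for non-network failures.
    """
    if failure == "network" or not httpdns_triggered:
        return None
    return json.dumps({
        "domain": domain,
        "failure": failure,
        "local_dns_ips": sorted(local_ips),   # what LocalDNS resolved (hypothetical field)
        "httpdns_ips": sorted(httpdns_ips),   # what HttpDNS resolved (hypothetical field)
        "reported_at": int(time.time()),
    }, sort_keys=True)
```

Comparing the LocalDNS result with the HttpDNS result is one plausible signal a backend could use. The article leaves the actual analysis to the data market backend.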

3. SNI processing

To allow multiple domain names to share one IP, HTTP servers introduced the concept of virtual hosts. On an HTTPS server where multiple virtual hosts share an IP, the server cannot know which Host the client is requesting before the handshake is established, so it cannot hand the request to a specific virtual host and therefore cannot read the certificate configured for that virtual host. SNI (Server Name Indication) solves this problem. SNI is an extension of SSL/TLS that requires the client to carry the host name of the domain it wants to access during the handshake. Concretely, the client adds the Server Name extension to its "Client Hello" message, so the server knows which virtual host's certificate to use for the handshake and the TLS connection.

The following is a code snippet using libcurl to send an HTTP request:
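The snippet itself is missing from this copy of the article. As a stand-in, and in Python rather than libcurl, the minimal sketch below shows the idea from the SNI discussion: open the TCP connection to the IP returned by HttpDNS, but pass the original domain name as `server_hostname` so it is sent as SNI and used for certificate verification, and repeat it in the Host header. The function names and the example host are assumptions. With libcurl, one commonly used way to get a similar effect is the `CURLOPT_RESOLVE` option, which pins a host name to a given IP while the URL keeps the domain name; whether the original snippet used it is not known.

```python
import socket
import ssl


def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request carrying the original domain as Host."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode("ascii")


def https_get_via_ip(ip, host, path="/", port=443, timeout=5.0):
    """Fetch `path` over HTTPS from `ip` while presenting `host` via SNI and Host."""
    context = ssl.create_default_context()  # verifies the certificate chain and host name
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        # server_hostname puts `host` into the Client Hello (SNI) and is also
        # the name the server certificate is checked against.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(host, path))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)  # raw response: status line, headers, and body
```

Because certificate verification still runs against the domain name, a hijacked or wrong IP that cannot present a valid certificate for that name fails the handshake instead of silently serving content.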

DNS pollution - Baidu Encyclopedia (baidu.com)

Domain hijacking - Baidu Encyclopedia (baidu.com)

Domain name server cache pollution - Wikipedia, the free encyclopedia (wikipedia.org)

The above covers the high-availability technical solution for NetEase Yunxin client-side services and the implementation of the high-availability component. If you are interested, please leave a message on the official account to discuss it with me.

About the author

Hao Kui, a senior C++ development engineer at NetEase Yunxin, is mainly responsible for the development, maintenance, and refactoring of NetEase Yunxin IM SDK. He has many years of experience in C++ client development and is now committed to cross-platform C++ development.
