Redis Design and Implementation: Simple Dynamic Strings

Simple dynamic string

What is SDS

SDS stands for Simple Dynamic String.

Instead of using C's traditional string representation (a null-terminated array of characters) directly, Redis builds its own abstract type called simple dynamic string (SDS) and uses it as the default string representation.

In Redis, C strings are used only as string literals in places where the string value never needs to be modified, such as printing logs.

When Redis needs more than a string literal, that is, a string value that can be modified, it uses SDS to represent the value. For example, in the Redis database, key-value pairs that contain string values are implemented with SDS under the hood.

Definition of SDS

The SDS structure contains three attributes:

buf : byte array to hold strings
len : record the number of bytes used in the buf array, which is equal to the length of the string saved by SDS
free : record the number of unused bytes in the buf array

Example:

Figure: an SDS whose buf array is fully used

The value of the len attribute is equal to 5, indicating that a five-byte string is stored in the SDS
The value of the free attribute is equal to 0, indicating that no unused space is allocated in the SDS
The buf attribute is a char array: the first five bytes hold the five characters, and the last byte holds the null character '\0'

Note: SDS follows the C convention of ending a string with a null character. The 1 byte that holds the null character is not counted in the len attribute. Allocating this extra byte and appending the null character to the end of the string are both done automatically by the SDS functions.
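To make the layout concrete, here is a minimal sketch of such a structure in C. The attribute names follow the description above and are modeled on the sdshdr struct found in older Redis releases; sds_demo_new is a hypothetical helper written only for illustration, not a Redis function.

```c
#include <stdlib.h>
#include <string.h>

/* Header layout sketched from the three attributes described above. */
struct sdshdr {
    int len;     /* bytes used in buf, equal to the string length */
    int free;    /* bytes allocated in buf but not yet used */
    char buf[];  /* string bytes followed by a terminating '\0' */
};

/* Hypothetical helper: copies init into a new SDS with no spare space. */
struct sdshdr *sds_demo_new(const char *init) {
    size_t n = strlen(init);
    /* The extra byte for '\0' is allocated here but never counted in len. */
    struct sdshdr *sh = malloc(sizeof(*sh) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);  /* n + 1 also copies the '\0' */
    return sh;
}
```

Calling sds_demo_new("Redis") under these assumptions yields len = 5, free = 0, and a 6-byte buf whose last byte is '\0', which matches the example above.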

The example above shows the case where free is 0. When there is unused space, the SDS looks like the figure below, which again stores the string "Redis".

Figure: an SDS whose buf array still has unused space

Differences between SDS and C strings

As noted above, C uses a character array of length N+1 to represent a string of length N, and the last element of the array is always the null character '\0'.

This simple string representation cannot meet the security, efficiency and functional requirements of Redis for strings. Let's compare the differences between SDS and C strings next.

Constant complexity to get string length

First, a C string does not record its own length. To obtain the length, the program must traverse the entire string and count each character until it reaches the terminating '\0', so the complexity is O(N).

SDS is different: the len attribute records the length of the string, so obtaining the length of an SDS is only O(1).

It should be noted that the work of setting and updating the SDS length is done automatically by the SDS API at the time of execution.

Using SDS reduces the complexity of obtaining string length from O(N) to O(1), ensuring that obtaining string length in Redis will not become a performance bottleneck.
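The difference can be sketched as follows. This is an illustration, not the Redis source: it assumes the caller holds a pointer to buf and that the header sits immediately before it in memory, and all sds_demo_* names and c_string_length are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr {
    int len;
    int free;
    char buf[];
};

/* Hypothetical helper: returns a pointer to buf; the header sits just before it. */
char *sds_demo_new(const char *init) {
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(*sh) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh->buf;
}

/* O(N): visit every byte until '\0', which is what a C string forces. */
size_t c_string_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* O(1): step back from buf to the header and read the stored len. */
size_t sds_demo_len(const char *s) {
    const struct sdshdr *sh =
        (const struct sdshdr *)(const void *)(s - offsetof(struct sdshdr, buf));
    return (size_t)sh->len;
}

/* Releases the whole allocation, header included. */
void sds_demo_free(char *s) {
    if (s != NULL) free(s - offsetof(struct sdshdr, buf));
}
```

Both functions return the same value for the same string; only the amount of work differs, and the gap grows with the length of the string.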

Avoid buffer overflow

In addition to the high complexity of obtaining the length of the string, another problem caused by the C string not recording its own length is that it is easy to cause buffer overflow.

For example, the strcat function in string.h can concatenate the contents of the src string to the end of the dest string.

 char *strcat(char *dest, const char *src);

Because C strings do not record their own length, strcat assumes that the caller has already allocated enough memory for dest to hold everything in src. If that assumption does not hold, a buffer overflow occurs.

📢 Note that if two strings s1 and s2 are adjacent in memory and s1 is modified without enough space having been allocated for it, the data may overflow into the memory occupied by s2, so that the contents of s2 are modified unexpectedly.

SDS is different: its space allocation strategy eliminates the possibility of buffer overflow. When the SDS API needs to modify an SDS, it first checks whether the SDS has enough space for the modification. If not, it automatically expands the SDS to the required size and only then performs the modification.

Example:

Figure: the SDS before executing sdscat

Given the SDS s shown above, suppose we execute:

 sdscat(s," Cluster");

Before concatenating, sdscat checks whether s has enough space. Finding that there is not enough room to append " Cluster", it first expands s and then performs the concatenation, as shown in the following figure.

Figure: the SDS after executing sdscat

📢 Note: besides performing the concatenation, SDS also allocated 13 bytes of unused space for s. The next section explains the space allocation strategy behind this.
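The check-then-expand behavior can be sketched roughly like this. It is not the real sdscat: sds_demo_new and sds_demo_cat are hypothetical names, and the growth rule (reserve as many unused bytes as the new length) is an assumption chosen to reproduce the 13 unused bytes in the example.

```c
#include <stdlib.h>
#include <string.h>

struct sdshdr {
    int len;
    int free;
    char buf[];
};

/* Hypothetical helper: copies init into a new SDS with no spare space. */
struct sdshdr *sds_demo_new(const char *init) {
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(*sh) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh;
}

/* Appends t to sh. Returns the (possibly moved) header, or NULL if the
 * expansion fails, in which case the original SDS is left untouched. */
struct sdshdr *sds_demo_cat(struct sdshdr *sh, const char *t) {
    size_t addlen = strlen(t);
    if ((size_t)sh->free < addlen) {           /* 1. check the space first */
        size_t newlen = (size_t)sh->len + addlen;
        size_t newfree = newlen;               /* assumed growth rule */
        struct sdshdr *grown =
            realloc(sh, sizeof(*sh) + newlen + newfree + 1);
        if (grown == NULL) return NULL;
        sh = grown;                            /* 2. expand if needed */
        sh->free = (int)(newlen + newfree) - sh->len;
    }
    memcpy(sh->buf + sh->len, t, addlen + 1);  /* 3. only then modify */
    sh->len += (int)addlen;
    sh->free -= (int)addlen;
    return sh;
}
```

Starting from an SDS holding "Redis" (len 5, free 0), appending " Cluster" with this sketch gives len 13 and free 13, so buf occupies 13 + 13 + 1 = 27 bytes.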

Reduce the number of memory reallocations when modifying strings

Because a C string does not record its own length, a string of N characters is always backed by an array of exactly N+1 bytes. As a result, every time a C string is lengthened or shortened, the program must reallocate the memory of the underlying array.

If you perform an operation that lengthens the string, such as append, you must first enlarge the underlying array through memory reallocation. If you forget this step, a buffer overflow occurs.
If you perform an operation that shortens the string, such as trim, you must afterwards use memory reallocation to release the space that is no longer used. If you forget this step, a memory leak occurs.

To avoid this defect of C strings, SDS uses unused space to decouple the length of the string from the length of the underlying array. In SDS, the length of the buf array is not necessarily the number of characters plus one: the array may also contain unused bytes, and the number of those bytes is recorded in the free attribute.

With this unused space, SDS implements two optimization strategies: space pre-allocation and lazy space release.

1. Space pre-allocation

As the name implies, when SDS is expanding, it will not only allocate space necessary for modification to SDS, but also allocate additional unused space to SDS.

Free space allocation strategy:

If the length len of the SDS after modification is less than 1MB, SDS allocates unused space equal in size to len, that is, free = len. For example, if len becomes 13 bytes, 13 bytes of unused space are allocated and the buf array is 13 + 13 + 1 = 27 bytes long.
If the length len of the SDS after modification is greater than or equal to 1MB, SDS allocates 1MB of unused space. For example, if len becomes 30MB, 1MB of unused space is allocated and the buf array is 30MB + 1MB + 1 byte long.

With this pre-allocation strategy, SDS reduces the number of memory reallocations needed to grow a string N times in a row from exactly N to at most N.
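The two rules reduce to a small piece of arithmetic. The sketch below only computes sizes and does not allocate anything; the macro and function names are hypothetical, and the 1MB threshold is taken from the rules above.

```c
#include <stddef.h>

/* 1MB threshold from the allocation rules above (hypothetical name). */
#define SDS_DEMO_MAX_PREALLOC ((size_t)1024 * 1024)

/* Unused bytes to reserve, given the string length after modification. */
size_t sds_demo_prealloc_free(size_t newlen) {
    return newlen < SDS_DEMO_MAX_PREALLOC ? newlen : SDS_DEMO_MAX_PREALLOC;
}

/* Total size of buf: used bytes + reserved bytes + 1 byte for '\0'. */
size_t sds_demo_buf_size(size_t newlen) {
    return newlen + sds_demo_prealloc_free(newlen) + 1;
}
```

For a new length of 13 bytes this gives 13 reserved bytes and a 27-byte buf; for 30MB it gives 1MB reserved and a buf of 30MB + 1MB + 1 byte, matching the two examples in the rules.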

2. Lazy space release

When a string is shortened, SDS does not immediately use memory reallocation to reclaim the bytes that are no longer needed. Instead, it records them in the free attribute, so a later growth operation may not need to expand the array at all. SDS also provides an API for actually releasing the unused space when that is wanted.
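A rough sketch of both halves of this strategy follows. The function names are hypothetical and the code is not taken from Redis (in sds.c, the function that releases unused space is sdsRemoveFreeSpace).

```c
#include <stdlib.h>
#include <string.h>

struct sdshdr {
    int len;
    int free;
    char buf[];
};

/* Hypothetical helper: copies init into a new SDS with no spare space. */
struct sdshdr *sds_demo_new(const char *init) {
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(*sh) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh;
}

/* Lazy release: keep the first newlen bytes without reallocating.
 * The bytes cut off are simply recorded as unused in free. */
void sds_demo_truncate(struct sdshdr *sh, size_t newlen) {
    if (newlen >= (size_t)sh->len) return;  /* nothing to shorten */
    sh->free += sh->len - (int)newlen;
    sh->len = (int)newlen;
    sh->buf[newlen] = '\0';
}

/* Explicit release: shrink the allocation so that free becomes 0.
 * Returns NULL on failure, leaving the original SDS untouched. */
struct sdshdr *sds_demo_shrink_to_fit(struct sdshdr *sh) {
    struct sdshdr *fit = realloc(sh, sizeof(*sh) + (size_t)sh->len + 1);
    if (fit == NULL) return NULL;
    fit->free = 0;
    return fit;
}
```

Truncating "Redis Cluster" to 5 bytes with this sketch leaves len = 5 and free = 8 with no reallocation; the memory is returned only when sds_demo_shrink_to_fit is called.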

Binary safe

The end of a C string is marked by a null character, so the string cannot contain a null character in the middle; otherwise the first null character would be mistaken for the end of the string. The characters must also conform to some encoding (such as ASCII). Because of these restrictions, a C string can only hold text data, not binary data such as pictures and videos.

The SDS API is binary safe: it does not filter, restrict, or make assumptions about the data, so the bytes come back exactly as they were written. An SDS can contain null characters, because SDS uses the len attribute rather than a null character to determine where the string ends.
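A small sketch shows the effect. The constructor below takes an explicit byte count instead of scanning for '\0'; sds_demo_new_len is a hypothetical name used only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

struct sdshdr {
    int len;
    int free;
    char buf[];
};

/* Hypothetical helper: copies exactly n bytes, whatever they contain. */
struct sdshdr *sds_demo_new_len(const void *data, size_t n) {
    struct sdshdr *sh = malloc(sizeof(*sh) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, data, n);  /* embedded '\0' bytes are copied as data */
    sh->buf[n] = '\0';         /* the terminator is still appended */
    return sh;
}
```

If the 13 bytes "Redis\0Cluster" are stored this way, len is 13 and all 13 bytes can be read back, whereas strlen on the same buffer stops at the embedded null character and reports 5.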

But why is there still a null character at the end of the SDS? This is to reuse some functions in <string.h> and avoid unnecessary code duplication.

Summary

C string	SDS
Getting the string length is O(N)	Getting the string length is O(1)
The API is unsafe and may cause buffer overflows	The API is safe and does not cause buffer overflows
Changing the string length N times requires exactly N memory reallocations	Changing the string length N times requires at most N memory reallocations
Can only hold text data	Can hold both text and binary data
Can use all functions in the <string.h> library	Can use some functions in the <string.h> library
