String pooling, reducing memory usage by 1/3

Hello everyone, I am Yu Kun, the researcher for this issue of MVP Lab. Today I will use a concrete example to show how pooling strings can reduce the memory usage of a .NET application, and along the way cover related topics such as how to measure memory. Are you ready? Let's get started.

Microsoft MVP Lab Researcher

This article uses a simple business scenario to describe how to reduce repeated string instances in memory through string pooling, thereby reducing memory usage.

For the business scenario, we assume the following:

There are one million products, and each product has a ProductId and a Color column stored in the database
All of the data needs to be loaded into memory and used as a cache
Every product has a Color
Color takes a limited set of values; we assume about eighty

Learn dotMemory to measure memory

A memory optimization is only credible if it can be measured, so a simple and effective measurement tool is essential.

In this article we use the combination of Rider and dotMemory and show how to perform a simple memory measurement. Readers can also choose whichever tools suit their own situation.

First, we create a unit test project and write simple code that builds an in-memory dictionary:

public const int ProductCount = 1_000_000;

public static readonly List<string> Colors = new[]
    {
        "amber", // there are actually about 80 strings here; omitted for brevity
    }.OrderBy(x => x).ToList();

public static Dictionary<int, ProductInfo> CreateDict()
{
    var random = new Random(36524);
    var dict = new Dictionary<int, ProductInfo>(ProductCount);
    for (int i = 0; i < ProductCount; i++)
    {
        dict.Add(i, new ProductInfo
        {
            ProductId = i,
            Color = Colors[random.Next(0, Colors.Count)]
        });
    }

    return dict;
}

The code above does the following:

It creates one million product objects, each with a Color picked at random from the list.

It specifies the expected size of the dictionary up front, which is itself an optimization because the dictionary does not have to resize repeatedly as it grows.

See:
https://docs.microsoft.com/dotnet/api/system.collections.generic.dictionary-2.-ctor?view=net-5.0&WT.mc_id=DX-MVP-5003606#System_Collections_Generic_Dictionary_2__ctor_System_Int32_?ocid=AID3041048
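
To make the effect of the capacity argument concrete, here is a minimal sketch that compares the bytes allocated while filling a pre-sized dictionary and a default one. The ProductInfo class below is a hypothetical stand-in (the article does not show its definition), and the figures depend on the runtime, so treat the output as indicative only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the article's ProductInfo type, which is not shown.
public class ProductInfo
{
    public int ProductId { get; set; }
    public string Color { get; set; } = "";
}

public static class CapacityDemo
{
    // Returns the bytes allocated on the current thread while filling a dictionary.
    public static long AllocatedWhileFilling(int count, bool presize)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        var dict = presize
            ? new Dictionary<int, ProductInfo>(count)   // one allocation of the final size
            : new Dictionary<int, ProductInfo>();       // grows and copies as items are added
        for (int i = 0; i < count; i++)
        {
            dict.Add(i, new ProductInfo { ProductId = i, Color = "amber" });
        }
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(dict);
        return after - before;
    }

    public static void Main()
    {
        const int count = 1_000_000;
        Console.WriteLine($"pre-sized: {AllocatedWhileFilling(count, true):N0} bytes allocated");
        Console.WriteLine($"default:   {AllocatedWhileFilling(count, false):N0} bytes allocated");
    }
}
```

Note that this counts total allocations, including the intermediate arrays that are discarded on each resize, not the memory still retained at the end.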

Next, we add the NuGet package required for dotMemory unit test measurement, along with Humanizer, which is used only to print byte sizes in a readable form:

<ItemGroup>
    <PackageReference Include="JetBrains.DotMemoryUnit" Version="3.1.20200127.214830" />

    <PackageReference Include="Humanizer" Version="2.11.10" />
</ItemGroup>

Next, we create a simple test to measure the changes in memory before and after the above dictionary is created:

public class NormalDictTest
{
    [Test]
    [DotMemoryUnit(FailIfRunWithoutSupport = false)]
    public void CreateDictTest()
    {
        var beforeStart = dotMemory.Check();
        var dict = HelperTest.CreateDict();
        GC.Collect();
        dotMemory.Check(memory =>
        {
            var snapshotDifference = memory.GetDifference(beforeStart);
            Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
        });
    }
}

The code above does the following:

Before the dictionary is created, dotMemory.Check() captures a snapshot of the current memory for later comparison
After the dictionary is created, a second checkpoint compares the two snapshots and prints the total size of the newly added objects

Finally, click the run button shown in the figure below to run the test:

[Figure: run dotMemory]

The result then looks like this:

[Figure: result]

From this we can draw a simple conclusion: such a dictionary requires approximately 61 MB of memory.

In theory, this is also the least memory the dictionary can occupy, because every Color refers to one of the roughly 80 string instances in the list above, so there are no duplicate string instances.

This figure will serve as the baseline for the code that follows.
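
For readers who do not have dotMemory at hand, the same before-and-after comparison can be roughly approximated with the runtime's own counters. The sketch below is not the measurement used in this article; it relies on GC.GetTotalMemory, which only reports an estimate, and the helper name MeasureRetained is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class RetainedSizeDemo
{
    // Roughly measures how many bytes the object graph returned by 'build' keeps alive.
    public static long MeasureRetained<T>(Func<T> build) where T : class
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        T result = build();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(result); // keep the graph reachable until after the second reading
        return after - before;
    }

    public static void Main()
    {
        long bytes = MeasureRetained(() =>
        {
            var list = new List<string>(100_000);
            for (int i = 0; i < 100_000; i++)
            {
                list.Add(i.ToString());
            }
            return list;
        });
        Console.WriteLine($"retained: about {bytes:N0} bytes");
    }
}
```

The GC.KeepAlive call matters: without a later use of the result, an optimized build is free to treat the object graph as garbage before the second reading is taken.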

Try to load from database to memory

In a real system the data has to be loaded into memory from persistent storage such as a database. So let's measure roughly how much memory this way of loading needs without any optimization.
Here we use SQLite as the demo database. Any database would do, because what we care about is the size of the final cache.
Let's add a couple of incidental packages:

<ItemGroup>
    <PackageReference Include="Dapper" Version="2.0.90" />
    <PackageReference Include="System.Data.SQLite.Core" Version="1.0.115" />
</ItemGroup>

We write a test that inserts one million rows of test data into the test database:

[Test]
public async Task CreateDb()
{
    var fileName = "data.db";
    if (File.Exists(fileName))
    {
        return;
    }

    var connectionString = GetConnectionString(fileName);
    await using var sqlConnection = new SQLiteConnection(connectionString);
    await sqlConnection.OpenAsync();
    await using var transaction = await sqlConnection.BeginTransactionAsync();
    await sqlConnection.ExecuteAsync(@"
CREATE TABLE Product(
    ProductId int PRIMARY KEY,
    Color TEXT
)", transaction);

    var dict = CreateDict();
    foreach (var (_, p) in dict)
    {
        await sqlConnection.ExecuteAsync(@"
INSERT INTO Product(ProductId,Color)
VALUES(@ProductId,@Color)", p, transaction);
    }

    await transaction.CommitAsync();
}

public static string GetConnectionString(string filename)
{
    var re =
        $"Data Source={filename};Cache Size=5000;Journal Mode=WAL;Pooling=True;Default IsolationLevel=ReadCommitted";
    return re;
}

The code above does the following:

Creates a database file named data.db
Creates a Product table in the database with two columns, ProductId and Color
Inserts all the data from the dictionary created earlier into this table

The test takes about ten seconds to run, after which the test data is ready. From here on we read from this database repeatedly in our test cases.

Now we write code that reads the data from the database, loads it into a dictionary, and measures the change in memory:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await LoadCoreAsync(dict);
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

public static async Task LoadCoreAsync(Dictionary<int, ProductInfo> dict)
{
    var connectionString = HelperTest.GetConnectionString("data.db");
    await using var sqlConnection = new SQLiteConnection(connectionString);
    await sqlConnection.OpenAsync();
    await using var reader = await sqlConnection.ExecuteReaderAsync(
        "SELECT ProductId, Color FROM Product");
    var rowParser = reader.GetRowParser<ProductInfo>();
    while (await reader.ReadAsync())
    {
        var productInfo = rowParser.Invoke(reader);
        dict[productInfo.ProductId] = productInfo;
    }
}

The code above does the following:

The dictionary is now built differently: its data is read from the database
Dapper reads the rows from the DataReader and loads them all into the dictionary

Running dotMemory again to measure the change gives the following result:

95.1 MB

So this approach uses roughly 30 MB more memory than the baseline. That may look small, but it is about 50% more than before. (A salary going from 1,500 to 3,000 is still a 100% raise, and you would certainly feel it.)

Of course, you might suspect that the extra overhead comes from the database operations themselves. As the optimization below will show, that is not the case:

The extra overhead comes from duplicate string instances.
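
A rough back-of-the-envelope estimate points the same way. The sketch below assumes the commonly cited layout of a string on 64-bit CoreCLR (object header, type pointer, length field, UTF-16 characters, and a null terminator, rounded up to 8 bytes). This layout is an implementation detail rather than a documented contract, and the colors other than "amber" are made-up examples.

```csharp
using System;

public static class StringSizeEstimate
{
    // Approximate heap size of a string on 64-bit CoreCLR (an assumption, not a contract):
    // 8-byte object header + 8-byte type pointer + 4-byte length
    // + 2 bytes per char + 2-byte null terminator, rounded up to a multiple of 8.
    public static long Estimate(int length)
    {
        long raw = 8 + 8 + 4 + 2L * length + 2;
        return (raw + 7) / 8 * 8;
    }

    public static void Main()
    {
        const int productCount = 1_000_000;
        foreach (var color in new[] { "red", "amber", "magenta" })
        {
            long each = Estimate(color.Length);
            long total = each * productCount;
            Console.WriteLine($"{color}: {each} bytes each, about {total / 1_000_000.0:F0} MB for one million copies");
        }
    }
}
```

Under these assumptions, short color names cost about 32 to 40 bytes per instance, so one million separately allocated copies land in the same range as the extra memory measured above.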

Eliminate duplicate string instances

Since we suspect that the extra overhead comes from duplicate strings, we can reduce it by making equal strings in the dictionary refer to the same object.

That gives us the following version of the test code:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await DbReadingTest.LoadCoreAsync(dict);
    foreach (var (_, p) in dict)
    {
        var colorIndex = HelperTest.Colors.BinarySearch(p.Color);
        var color = HelperTest.Colors[colorIndex];
        p.Color = color;
    }
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

The code above does the following:

All the data is still loaded from the database into the dictionary; the loading code is exactly the same as before, so it is not shown again
After loading, we traverse the dictionary once more, look up the matching string instance in the Color list from the first version, and assign it to the product's Color
With this look up, read, and replace sequence, every Color in the dictionary ends up referencing an instance from the Color list

Running dotMemory once more, the result is striking:

61.69 MB

The figure is slightly higher than in the first version, but the two are now almost the same.

By replacing equal Color strings with one shared instance, we made every product with the same Color reference the same object. The roughly 30 MB of temporary strings are no longer referenced by anything, so they are reclaimed at the next GC. It is as easy and pleasant as that.
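
The BinarySearch version assumes that every Color is present in the sorted list; if a value were missing, BinarySearch would return a negative number and the indexer would throw. When the set of values is not known in advance, a dictionary can play the same role. The following is a minimal sketch of that idea rather than the code used in this article, and the Canonicalize helper is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class Canonicalizer
{
    // Returns one shared instance per distinct value; the first instance seen wins.
    public static string Canonicalize(Dictionary<string, string> seen, string value)
    {
        if (seen.TryGetValue(value, out var existing))
        {
            return existing;
        }
        seen[value] = value;
        return value;
    }

    public static void Main()
    {
        // Two equal strings built at run time are distinct objects on the heap.
        string a = new string(new[] { 'a', 'm', 'b', 'e', 'r' });
        string b = new string(new[] { 'a', 'm', 'b', 'e', 'r' });
        Console.WriteLine(a == b);                  // True: same characters
        Console.WriteLine(ReferenceEquals(a, b));   // False: two separate instances

        var seen = new Dictionary<string, string>(StringComparer.Ordinal);
        string ca = Canonicalize(seen, a);
        string cb = Canonicalize(seen, b);
        Console.WriteLine(ReferenceEquals(ca, cb)); // True: both now point at a's instance
    }
}
```

Once nothing references b any more, the garbage collector is free to reclaim it, which is exactly the saving measured above. Unlike a real pool, this dictionary never evicts anything.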

Use StringPool directly

In the previous section we found the cause of the overhead and optimized it away. In practice, however, there are a few more things to consider:

The Color list is often not static. It may look one way in the morning and another in the afternoon.
The Color list cannot grow without bound. We need an eviction algorithm to remove the entries that are no longer worth keeping.

So we can simply use StringPool. Code written by someone else is great, and now it is ours.

Let's add one more package:

<ItemGroup>
    <PackageReference Include="Microsoft.Toolkit.HighPerformance" Version="7.0.2" />
</ItemGroup>

With a small change, we get a new version:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await DbReadingTest.LoadCoreAsync(dict);
    var stringPool = StringPool.Shared;
    foreach (var (_, p) in dict)
    {
        p.Color = stringPool.GetOrAdd(p.Color);
    }
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

The code above does the following:

The StringPool.Shared instance stores the string instances
GetOrAdd implements the same look up, read, and replace strategy we wrote by hand earlier

The result, unsurprisingly, is just as good:

61.81 MB

Everything is just as easy and pleasant.
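
As a small standalone illustration of GetOrAdd, the sketch below uses a private StringPool instance instead of StringPool.Shared so that it does not touch shared state. It needs the Microsoft.Toolkit.HighPerformance package referenced above. Because StringPool is a bounded cache, a pooled instance is not guaranteed to stay in the pool forever; the results noted in the comments are what one would expect in this tiny case.

```csharp
using System;
using Microsoft.Toolkit.HighPerformance.Buffers;

public static class StringPoolDemo
{
    public static void Main()
    {
        // Two equal strings built at run time are distinct objects on the heap.
        string a = new string(new[] { 'a', 'm', 'b', 'e', 'r' });
        string b = new string(new[] { 'a', 'm', 'b', 'e', 'r' });

        var pool = new StringPool();
        string pa = pool.GetOrAdd(a); // first occurrence: the pool stores this instance
        string pb = pool.GetOrAdd(b); // equal content: the pooled instance is returned
        Console.WriteLine(ReferenceEquals(pa, pb)); // expected: True

        // The span overload looks up the pooled instance without first allocating a string.
        ReadOnlySpan<char> span = "xx-amber-xx".AsSpan(3, 5);
        Console.WriteLine(ReferenceEquals(pool.GetOrAdd(span), pa)); // expected: True
    }
}
```

The span overload is worth knowing about when parsing text yourself, because it avoids creating the temporary string that the post-processing loop in this article has to discard.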

Summary

dotMemory can measure memory in many more ways; it is worth exploring further.

Duplicates call for pooling. This is a very common optimization, and having it in your toolbox may help when you need it.

The code examples from this article can be found at the following address. Don't forget to star the project:

https://github.com/newbe36524/Newbe.Demo/tree/main/src/BlogDemos/Newbe.StringPools

Microsoft Most Valuable Professional (MVP)

The Microsoft Most Valuable Professional award is a global award that Microsoft grants to third-party technology professionals. For 28 years, technology community leaders around the world have received it for sharing their expertise and experience in online and offline technology communities.

MVPs are a rigorously selected group of experts: highly skilled, passionate, and helpful people. They are committed to helping others through talks, forum Q&A, websites, blogs, videos, open source projects, and conferences, and to helping users in the Microsoft technology community get the most out of Microsoft technologies.
For more details, please visit the official website:
https://mvp.microsoft.com/zh-cn

