
Hello everyone, I'm Yu Kun, the researcher for this issue of MVP Lab. Today I'll use some concrete examples to show how pooling strings can reduce memory usage in .NET, and along the way cover how to observe memory and some related background. Are you ready? Let's get going.

Microsoft MVP Lab Researcher

This article uses a simple business scenario to describe how to reduce repeated string instances in memory through string pooling, thereby reducing memory usage.

For this scenario, we assume the following:

  • There are one million products, each with a ProductId and a Color column stored in the database
  • All of the data needs to be loaded into memory and used as a cache
  • Every product has a Color
  • Color takes its values from a limited set, which we assume contains about eighty entries
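The ProductInfo type used throughout the code below is never shown in this article. A minimal sketch of what it might look like, assuming a plain class with two settable properties (the exact definition is an assumption):

```csharp
// Hypothetical shape of ProductInfo; the article does not show the real definition.
// It is sketched as a class (reference type) because later code assigns p.Color
// while iterating over the dictionary, which only makes sense for a reference type.
public class ProductInfo
{
    public int ProductId { get; set; }

    // Many products share the same color text, which is what makes pooling worthwhile.
    public string Color { get; set; }
}
```

Settable properties also let a mapper such as Dapper fill the object from a row by column name.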

Learn to measure memory with dotMemory

A memory optimization is only credible if it can be measured, so a simple and effective measurement tool is essential.

In this article, we use the combination of Rider + dotMemory and show how to perform a simple memory measurement. Readers can also choose their own favorite tools.

First, we create a unit test project and write a simple routine that builds an in-memory dictionary:

public const int ProductCount = 1_000_000;

public static readonly List<string> Colors = new[]
    {
        "amber", // there are actually about 80 strings here, omitted for brevity
    }.OrderBy(x => x).ToList();

public static Dictionary<int, ProductInfo> CreateDict()
{
    var random = new Random(36524);
    var dict = new Dictionary<int, ProductInfo>(ProductCount);
    for (int i = 0; i < ProductCount; i++)
    {
        dict.Add(i, new ProductInfo
        {
            ProductId = i,
            Color = Colors[random.Next(0, Colors.Count)]
        });
    }

    return dict;
}

In the code above:

  • We create one million product objects, with Color picked at random from the list.
  • We specify the expected size of the dictionary up front, which is itself an optimization.

See:
https://docs.microsoft.com/dotnet/api/system.collections.generic.dictionary-2.-ctor?view=net-5.0&WT.mc_id=DX-MVP-5003606#System_Collections_Generic_Dictionary_2__ctor_System_Int32_?ocid=AID3041048
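To see why the capacity argument counts as an optimization, the following sketch compares the bytes allocated while filling a dictionary with and without a pre-set capacity. The helper name AllocatedWhileFilling is made up for this illustration, and the exact numbers depend on the runtime, so none are quoted here.

```csharp
using System;
using System.Collections.Generic;

const int count = 1_000_000;

// Hypothetical helper: bytes allocated on the current thread while creating
// and filling a dictionary produced by the given factory.
static long AllocatedWhileFilling(Func<Dictionary<int, int>> create, int items)
{
    long before = GC.GetAllocatedBytesForCurrentThread();
    var dict = create();
    for (int i = 0; i < items; i++)
    {
        dict.Add(i, i);
    }
    long after = GC.GetAllocatedBytesForCurrentThread();
    GC.KeepAlive(dict);
    return after - before;
}

// Without a capacity, the internal arrays are reallocated and copied every time they fill up.
long grown = AllocatedWhileFilling(() => new Dictionary<int, int>(), count);

// With a capacity, the internal arrays are allocated once at their final size.
long presized = AllocatedWhileFilling(() => new Dictionary<int, int>(count), count);

Console.WriteLine($"grown: {grown:N0} bytes, presized: {presized:N0} bytes");
```

The discarded intermediate arrays are garbage, so they would not show up in the final cache size, but they do add GC pressure while the cache is being built.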

Then, we reference the NuGet package required for dotMemory unit-test measurement, plus Humanizer, a minor helper used here to format byte sizes:

<ItemGroup>
    <PackageReference Include="JetBrains.DotMemoryUnit" Version="3.1.20200127.214830" />

    <PackageReference Include="Humanizer" Version="2.11.10" />
</ItemGroup>

Next, we create a simple test to measure the changes in memory before and after the above dictionary is created:

public class NormalDictTest
{
    [Test]
    [DotMemoryUnit(FailIfRunWithoutSupport = false)]
    public void CreateDictTest()
    {
        var beforeStart = dotMemory.Check();
        var dict = HelperTest.CreateDict();
        GC.Collect();
        dotMemory.Check(memory =>
        {
            var snapshotDifference = memory.GetDifference(beforeStart);
            Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
        });
    }
}

In the code above:

  • Before the dictionary is created, we call dotMemory.Check() to capture a checkpoint of the current memory for later comparison
  • After the dictionary is created, we print the total size of the objects newly added between the two checkpoints.
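If dotMemory is not at hand, the same before/after idea can be approximated with the runtime's own counters. This is a different and much coarser technique than the dotMemory checkpoints used in this article: it only reports a total, not which objects are new. MeasureRetained is a hypothetical helper name.

```csharp
using System;

// Hypothetical helper: runs the allocation and reports how many bytes are still
// in use afterwards, compared with the state before the allocation.
static (T Result, long Bytes) MeasureRetained<T>(Func<T> allocate)
{
    // forceFullCollection: true asks the GC to collect before reporting the total.
    long before = GC.GetTotalMemory(forceFullCollection: true);
    T result = allocate();
    long after = GC.GetTotalMemory(forceFullCollection: true);

    // result is returned to the caller, so it stays alive across the second measurement.
    return (result, after - before);
}

var (data, retained) = MeasureRetained(() => new long[1_000_000]);
Console.WriteLine($"Retained roughly {retained / 1024.0 / 1024.0:F1} MB");
GC.KeepAlive(data);
```

One million long values occupy about 8 MB, so the printed figure should land close to that.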

Finally, click the button shown in the figure below to run this test:

[Figure: run dotMemory]

Then, the result will be as follows:

[Figure: result]
From this we can draw a simple conclusion: such a dictionary requires approximately 61 MB of memory.

In theory, this is the least memory this dictionary can occupy, because every Color references one of the roughly 80 string instances in the list above. There are no duplicate string instances at all.

This figure will serve as the baseline for the code that follows.

Try loading from the database into memory

In a real business application, the data must be loaded into memory from persistent storage such as a database. So let's measure roughly how much memory this kind of loading costs without any optimization.
Here we use SQLite as the storage database for the demonstration. In fact any database would do, because what we care about is the size of the final cache.
Let's reference a couple of supporting packages:

<ItemGroup>
    <PackageReference Include="Dapper" Version="2.0.90" />
    <PackageReference Include="System.Data.SQLite.Core" Version="1.0.115" />
</ItemGroup>

We write a test that inserts one million rows of test data into the test database:

[Test]
public async Task CreateDb()
{
    var fileName = "data.db";
    if (File.Exists(fileName))
    {
        return;
    }

    var connectionString = GetConnectionString(fileName);
    await using var sqlConnection = new SQLiteConnection(connectionString);
    await sqlConnection.OpenAsync();
    await using var transaction = await sqlConnection.BeginTransactionAsync();
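    // Pass the transaction by name; the second positional argument of ExecuteAsync is the SQL parameter object.
    await sqlConnection.ExecuteAsync(@"
CREATE TABLE Product(
    ProductId int PRIMARY KEY,
    Color TEXT
)", transaction: transaction);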

    var dict = CreateDict();
    foreach (var (_, p) in dict)
    {
        await sqlConnection.ExecuteAsync(@"
INSERT INTO Product(ProductId,Color)
VALUES(@ProductId,@Color)", p, transaction);
    }

    await transaction.CommitAsync();
}

public static string GetConnectionString(string filename)
{
    var re =
        $"Data Source={filename};Cache Size=5000;Journal Mode=WAL;Pooling=True;Default IsolationLevel=ReadCommitted";
    return re;
}

In the code above:

  • We create a database file named data.db
  • We create a Product table in the database, containing the two columns ProductId and Color
  • We insert all the data from the dictionary into this table; the dictionary is the same one created above

The test takes about ten seconds to run, after which the test data is ready. From here on, we will repeatedly read data from this database in our test cases.

Now, we write code that reads the data from the database, loads it into the dictionary, and measures the change in memory:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await LoadCoreAsync(dict);
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

public static async Task LoadCoreAsync(Dictionary<int, ProductInfo> dict)
{
    var connectionString = HelperTest.GetConnectionString("data.db");
    await using var sqlConnection = new SQLiteConnection(connectionString);
    await sqlConnection.OpenAsync();
    await using var reader = await sqlConnection.ExecuteReaderAsync(
        "SELECT ProductId, Color FROM Product");
    var rowParser = reader.GetRowParser<ProductInfo>();
    while (await reader.ReadAsync())
    {
        var productInfo = rowParser.Invoke(reader);
        dict[productInfo.ProductId] = productInfo;
    }
}

In the code above:

  • We changed how the dictionary is built: its data is now read from the database
  • We use Dapper to read from the DataReader and load every row into the dictionary

Again we run the test with dotMemory to measure the change, and get the following:

95.1 MB
So this approach costs a bit over 30 MB more than the baseline. That may look small, but it is more than 50% extra. (Think of a salary going from 1,500 to 3,000: call it a 100% raise and it suddenly feels like a lot.)

Of course, you might suspect that the extra overhead is consumed by the database operations themselves. But to give away the result of the optimization below:

The extra overhead actually comes from duplicate string instances.
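The effect can be reproduced without a database. In the sketch below, new string(char[]) stands in for what a data reader does for every row: it builds a fresh string object even when the text is identical. The size formula is an approximation that assumes the usual 64-bit .NET object layout, and ApproxStringBytes is a hypothetical helper.

```csharp
using System;

// Two strings built at run time, standing in for the same color read from two rows.
string rowA = new string("amber".ToCharArray());
string rowB = new string("amber".ToCharArray());

Console.WriteLine(rowA == rowB);                // True: same content
Console.WriteLine(ReferenceEquals(rowA, rowB)); // False: two separate objects on the heap

// Rough per-string cost, assuming a 64-bit runtime: about 22 bytes of fixed
// overhead plus 2 bytes per character, rounded up to a multiple of 8.
static int ApproxStringBytes(int length) => (22 + 2 * length + 7) / 8 * 8;

long perString = ApproxStringBytes("amber".Length);
long total = perString * 1_000_000;
Console.WriteLine($"{perString} bytes per string, about {total / 1_000_000} MB for one million copies");
```

Under these assumptions, one million copies of a five-letter color cost about 32 MB, the same ballpark as the gap measured above.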

Eliminate duplicate string instances

Since we suspect that the extra overhead comes from duplicate strings, we can try to eliminate the duplicates in the dictionary by replacing equal strings with one shared object.

That gives us the following version of the test code:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await DbReadingTest.LoadCoreAsync(dict);
    foreach (var (_, p) in dict)
    {
        var colorIndex = HelperTest.Colors.BinarySearch(p.Color);
        var color = HelperTest.Colors[colorIndex];
        p.Color = color;
    }
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

In the code above:

  • We still load all the data from the database into the dictionary; the loading code is exactly the same as before, so it is not shown again
  • After loading, we traverse the dictionary once more. For each product, we look up the matching string instance in the Colors list from the first version and assign it to the product's Color
  • With this search, read, and replace, every Color in the dictionary now comes from the Colors list

We run dotMemory again to measure, and the result is striking:

61.69 MB
This figure is slightly higher than in the first version, but the two are now practically identical.

By replacing equal Color strings with one shared instance, we removed the duplicates. The roughly 30 MB of temporary strings are no longer referenced by anything, so the next GC collects them. Everything is so easy and pleasant.
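The search, read, and replace step can be tried in isolation. A minimal sketch with a three-entry list: Canonicalize is a hypothetical helper, and an ordinal comparer is passed explicitly for both sorting and searching, whereas the test code above relies on the default comparer for both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A small stand-in for the Colors list; BinarySearch requires it to be sorted
// with the same comparer that is used for the search.
var colors = new[] { "red", "amber", "green" }
    .OrderBy(x => x, StringComparer.Ordinal)
    .ToList();

// Hypothetical helper: returns the list's own instance for a known color,
// or the input unchanged when the color is not in the list.
static string Canonicalize(List<string> sortedColors, string value)
{
    int index = sortedColors.BinarySearch(value, StringComparer.Ordinal);
    return index >= 0 ? sortedColors[index] : value;
}

// A freshly built string, as it would come out of the database.
string loaded = new string("green".ToCharArray());
string canonical = Canonicalize(colors, loaded);

Console.WriteLine(ReferenceEquals(loaded, canonical));    // False: the loaded copy was replaced
Console.WriteLine(ReferenceEquals(canonical, colors[1])); // True: it is the list's own "green"
```

The test code above indexes the list directly with the BinarySearch result, so it would throw on a color missing from the list. The fallback in this sketch is one possible way to handle that.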

Use StringPool directly

In the previous section we found the cause of the overhead and optimized it away. However, there are some practical issues to consider:

  • The Colors list is often not static. It may be one thing in the morning and another in the afternoon.
  • The Colors list cannot grow without bound. We need an eviction policy that drops the bottom 10% of entries.

So we can consider using StringPool directly. Code written by others is great, and now it is ours.

Let's reference one more supporting package:

<ItemGroup>
    <PackageReference Include="Microsoft.Toolkit.HighPerformance" Version="7.0.2" />
</ItemGroup>

With a slight change, there is a new version:

[Test]
[DotMemoryUnit(FailIfRunWithoutSupport = false)]
public async Task LoadFromDbAsync()
{
    var beforeStart = dotMemory.Check();
    var dict = new Dictionary<int, ProductInfo>(HelperTest.ProductCount);
    await DbReadingTest.LoadCoreAsync(dict);
    var stringPool = StringPool.Shared;
    foreach (var (_, p) in dict)
    {
        p.Color = stringPool.GetOrAdd(p.Color);
    }
    GC.Collect();
    dotMemory.Check(memory =>
    {
        var snapshotDifference = memory.GetDifference(beforeStart);
        Console.WriteLine(snapshotDifference.GetNewObjects().SizeInBytes.Bytes());
    });
}

In the code above:

  • We use the StringPool.Shared instance to store string instances
  • GetOrAdd effectively implements our earlier three-step strategy of search, read, and replace

The result is, unsurprisingly, no surprise at all:

61.81 MB
Everything is so easy and pleasant.
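What GetOrAdd does can be observed on a couple of strings. The sketch below needs the Microsoft.Toolkit.HighPerformance package referenced above. It uses a private StringPool rather than StringPool.Shared, purely so the demonstration does not depend on what else is in the shared pool.

```csharp
using System;
using Microsoft.Toolkit.HighPerformance.Buffers;

// A private pool for the demonstration; the test code above uses StringPool.Shared.
var pool = new StringPool();

// Two separate instances with equal content, as a data reader would produce.
string a = new string("amber".ToCharArray());
string b = new string("amber".ToCharArray());

string pooledA = pool.GetOrAdd(a); // nothing cached yet, so this call adds an entry
string pooledB = pool.GetOrAdd(b); // equal content is already cached, so the cached instance comes back

Console.WriteLine(ReferenceEquals(a, b));             // False
Console.WriteLine(ReferenceEquals(pooledA, pooledB)); // True

// GetOrAdd also accepts a span, so a slice of a larger buffer can be looked up
// without first allocating a string for it.
string pooledC = pool.GetOrAdd("xx-amber-xx".AsSpan(3, 5));
Console.WriteLine(ReferenceEquals(pooledA, pooledC)); // True
```

Keeping the returned instance and dropping the original is the whole trick: the duplicates become unreachable and the GC reclaims them.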


Further reading

What are the similarities and differences between StringPool and string.Intern()?

Both address the same problem: too many duplicate string instances wasting memory.

The main practical difference is lifetime. string.Intern is for life: once a string has been interned, it stays for as long as the process runs. StringPool is very different in this respect.

So if lifetime matters to you, choose carefully.
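For comparison, the same de-duplication with string.Intern looks like this. It is a minimal sketch using only the base class library.

```csharp
using System;

// Two separate instances with equal content.
string a = new string("amber".ToCharArray());
string b = new string("amber".ToCharArray());

// Intern returns the runtime's single canonical instance for this content.
string internedA = string.Intern(a);
string internedB = string.Intern(b);

Console.WriteLine(ReferenceEquals(a, b));                 // False
Console.WriteLine(ReferenceEquals(internedA, internedB)); // True

// IsInterned only looks the string up; it returns null when the content is not in the intern table.
Console.WriteLine(string.IsInterned(b) != null);          // True
```

There is no call that removes an entry from the intern table, which is the lifetime difference described above.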

For string.Intern, see:

https://docs.microsoft.com/dotnet/api/system.string.intern?view=net-5.0&WT.mc_id=DX-MVP-5003606?ocid=AID3041048

How is StringPool implemented?

I don't fully understand it, and I'd rather not talk nonsense. Roughly speaking, it is a priority queue tagged with usage counts. I can't quite follow the source code either.

The territory ahead is left for you to explore:

https://github.com/CommunityToolkit/WindowsCommunityToolkit/blob/main/Microsoft.Toolkit.HighPerformance/Buffers/StringPool.cs

Under what circumstances should I consider using StringPool?

The author suggests pooling strings that meet these conditions:

  1. The string may be referenced by many instances
  2. The string needs to stay resident for a long time, or the object holding it is long-lived
  3. Memory optimization has genuinely become something you need to think about

Of course, there is an even simpler way to judge: dump the memory of the production process, see whether it contains many duplicate strings, and then optimize those. It's 2021; surely there's nobody left who can't take a memory dump, right? (just kidding) If you don't know how to take a memory dump, you can learn from the video shared by Mr. Huang on Microsoft Reactor:

https://www.bilibili.com/video/BV1jZ4y1P7EY

Great! I can use StringPool to store the DisplayName of enum values

Indeed, nothing wrong with that. However, there are better solutions for this case:

https://github.com/Spinnernicholas/EnumFastToStringDotNet

Summary

There are many more ways to measure with dotMemory; feel free to try them.

Duplication and pooling: this is a very common optimization pattern. Master it, and it may help you when you need it.

The code examples in this article can be found at the following address. Don't forget to star the project:

https://github.com/newbe36524/Newbe.Demo/tree/main/src/BlogDemos/Newbe.StringPools
