当 95% 的情况下的值是 0 或 1 时，对非常大的数组进行随机访问的任何优化？

Question

新手上路，请多包涵

对非常大的数组进行随机访问是否有任何可能的优化（我目前使用 uint8_t ，我在问什么更好）

 uint8_t MyArray[10000000];

当数组中任意位置的值为

95% 的情况为 0 或 1 ，
2 在 4% 的情况下，
在其他 1% 的情况下，在 3 到 255 之间？

那么，有什么比 uint8_t 数组更好的呢？应该尽可能快地以随机顺序循环整个阵列，这对 RAM 带宽来说非常沉重，因此当有多个线程同时为不同的阵列执行此操作时，当前整个 RAM 带宽很快就饱和了。

我问是因为拥有如此大的数组（10 MB）感觉非常低效，而实际上除了 5% 之外，几乎所有值都是 0 或 1。所以当数组中所有值的 95% 时实际上只需要 1 位而不是 8 位，这将减少内存使用量几乎一个数量级。感觉必须有一个内存效率更高的解决方案，这将大大减少所需的 RAM 带宽，因此随机访问也明显更快。

原文由 JohnAl 发布，翻译遵循 CC BY-SA 4.0 许可协议

c++arrays performance optimization memory-bandwidth

阅读 309

1 个回答

得票最新

社区维基

1

发布于
2022-11-08

✓ 已被采纳

想到的一个简单的可能性是为常见情况保留每个值 2 位的压缩数组，每个值保留 4 个字节（原始元素索引为 24 位，实际值为 8 位，所以 (idx << 8) | value) ) 其他数组的排序数组。

查找值时，首先在 2bpp 数组中查找 (O(1))；如果您找到 0、1 或 2，这就是您想要的值；如果你找到 3 这意味着你必须在辅助数组中查找它。在这里，您将执行二进制搜索以查找您感兴趣的索引左移 8（O(log(n)，n 较小，因为这应该是 1%），并从 4-字节的东西。

 std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

对于您建议的数组，第一个数组需要 10000000 / 4 = 2500000 字节，第二个数组需要 10000000 * 1% * 4 B = 400000 字节；因此 2900000 字节，即小于原始数组的三分之一，并且最常用的部分都保存在内存中，这应该有利于缓存（它甚至可能适合 L3）。

如果您需要超过 24 位寻址，则必须调整“辅助存储”；扩展它的一种简单方法是使用 256 元素指针数组来切换索引的前 8 位并转发到如上所述的 24 位索引排序数组。

快速基准测试

#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
    /// This stuff allows to use this class wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean     += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean()     const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev()   const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101)      v = 0; // 42.5%
        else if(v < 4080218931) v = 1; // 95.0%
        else if(v < 4252017623) v = 2; // 99.0%
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array:  %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
            lk_t.mean(), lk_t.stddev(),
            arr_t.mean(), arr_t.stddev(),
            arr_t.mean()/lk_t.mean());
    return 0;
}

（代码和数据总是在我的 Bitbucket 中更新）

上面的代码填充了一个 10M 元素数组，其中随机数据分布为他们帖子中指定的 OP，初始化我的数据结构，然后：

使用我的数据结构随机查找 10M 元素
通过原始数组做同样的事情。

（请注意，在顺序查找的情况下，数组总是以巨大的优势获胜，因为它是您可以做的最适合缓存的查找）

最后两个块重复 50 次并计时；最后，计算和打印每种查找类型的平均值和标准差，以及加速比（lookup_mean/array_mean）。

我在 Ubuntu 16.04 上使用 g++ 5.4.0（ -O3 -static ，加上一些警告）编译了上面的代码，并在一些机器上运行它；他们中的大多数都在运行 Ubuntu 16.04，一些是较旧的 Linux，一些是较新的 Linux。在这种情况下，我认为操作系统根本不应该相关。

             CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

结果……好坏参半！

一般来说，在大多数这些机器上都有某种加速，或者至少它们是相当的。
阵列真正胜过“智能结构”查找的两种情况是在具有大量缓存且不是特别繁忙的机器上：以上 Xeon E5-1650（15 MB 缓存）是夜间构建机器，目前相当空闲； Xeon E5-2697（35 MB 缓存）是一台用于高性能计算的机器，在空闲时也是如此。确实有道理，原始数组完全适合其巨大的缓存，因此紧凑的数据结构只会增加复杂性。
在“性能范围”的另一端——但阵列再次稍微快一点，有为我的 NAS 提供动力的不起眼的赛扬；它的缓存太少，以至于阵列和“智能结构”都根本不适合它。其他缓存足够小的机器执行类似的操作。
Xeon X5650 必须谨慎使用——它们是非常繁忙的双路虚拟机服务器上的虚拟机；很可能，虽然名义上它有相当数量的缓存，但在测试期间它被完全不相关的虚拟机抢占了好几次。

原文由 Matteo Italia 发布，翻译遵循 CC BY-SA 4.0 许可协议

撰写回答

你尚未登录，登录后可以

和开发者交流问题的细节
关注并接收问题和回答的更新提醒
参与内容的编辑和改进，让解决方法与时俱进

推荐问题

Stack Overflow 翻译

子站问答

访问

本篇内容翻译自 Stack Overflow，如果你觉得翻译结果值得改进，欢迎直接编辑修改，感谢你为社区贡献。

相似问题

找不到问题？创建新问题

当 95% 的情况下的值是 0 或 1 时，对非常大的数组进行随机访问的任何优化？

快速基准测试

你尚未登录，登录后可以

c++模板类链表链接错误？

c++98环境循环单链表类的私有结构体与引用其的成员函数顺序问题？

头文件保护为什么报warring?

请问下这种数据结构怎么选择?

为什么我的这个std::sort会排序数组之外的一个元素?

我希望能让自己设计的qt界面更美观,更有条理,应该看哪些书/学哪些技术?

员工电脑装了公司内部的证书，那么通过公共wifi可以拦截并解密用户的微信信息吗？

Stack Overflow 翻译