C++ 中文周刊 第134期

欢迎投稿，推荐或自荐文章/软件/资源等

请提交 issue

感谢不语赞助

资讯

标准委员会动态/ide/编译器信息放在这里

编译器信息最新动态推荐关注hellogcc公众号 OSDT Weekly 2023-10-11 第223期

brpc rpcz功能存在xss跨站漏洞，建议尽快升级 1.6.1

如果无法升级，可以打补丁 https://github.com/apache/brpc/pull/2411/files

文章

GCC Preparing To Introduce "-fhardened" Security Hardening Option

Mick235711 投稿

gcc 14.1的新选项-fhardened会自动开启大部分相对轻量级的安全检查，包括3级保护（常见C库函数，比如strcpy等等的内存溢出诊断），标准库断言（这里面包括std::span和其他容器的operator[]边界检查），栈溢出检查等等

How to compare signed and unsigned integers in C++20?

介绍 std::cmp_xx的之前也介绍过

视频

接着上期的内容

cppcon 2022

A-Faster-Serialization-Library-Based-on-Compile-time-Reflection-and-C-20-Yu-Qi-CppCon-2022

介绍qimosmos的strucpack的。挺有意思。资料也很多，这里就不罗列了，感兴趣的可以搜一下，标记个TODO，咱们以后有时间单独讲一下

binary-search-cppcon

优化二分

二分谁都知道吧

int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) {
        int m = (l + r) / 2;
        if (t[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}

不会写的深蹲十个

CPU执行基本流程还记得吧 fetch decode execute write。然后CPU有流水线pipeline优化，整个流程15-20cycle

流水线优化能并发上面的步骤，什么能阻碍流水线优化？

structural hazard 太多指令用相同的CPU地址，这个是fetch decode机器码环节，无解
data hazard 需要等待前面的数据，这个其实是软件问题，也没啥好办法
control hazard CPU不知道下一次该执行什么指令，存在依赖分支miss的场景，这个是可以挽救修正的

我们回头再看二分的代码

while循环还好说，里面的if是非常重的

怎么改成无分支版本？换一种思路，我们不挪动index，我们挪动数组地址

int lower_bound(int x) {
  int* base = t ,len = n;
  while(len> 1) {
    int half = len / 2;
    if (base[half - 1] < x) {
      base += half;
      len = len - half; // = ceil(len / 2)
    } else {
      len = half; // = floor(len / 2)
    }
  }
  return *base;
}

注意到重复代码，这样能把else去掉

int lower_bound(int x) {
  int* base = t ,len = n;
  while(len> 1) {
    int half = len / 2;
    if (base[half - 1] < x) {
      base += half;
    }
    len -= half; //  = ceil(len / 2)
  }
  return *base;
}

显然，这个if也能优化掉

  while(len> 1) {
    int half = len / 2;
    base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
    len -= half; //  = ceil(len / 2)
  }

改成无分支版本，性能直接提升一大截，但是，对于大数组，性能是下降的，怎么办？prefetch

  while(len > 1) {
    int half = len / 2;
    len -= half;
    __builtin_prefetch(&base[len / 2 - 1]);
    // middle of the left half
    __builtin_prefetch(&base[half + len / 2 - 1]); // middle of the right half
    base += (base[half - 1] < x) * half;
  }

接下来要上强度了朋友们

prefetch实际上也解决不了特大数组的问题，因为二分，一开始的块必然很大，你怎么prefetch也白搭

我们需要从另一种角度解决问题，比如二叉树堆线段树的特性

利用树的特点，以及树的局部性友好，对于二分开头有明显的加速效果

二叉树的的特点就决定了，肯定不需要手写分支

那怎么构造堆呢

int a[n];
alignas(64) int t[n + 1]; //the original sorted array and the eytzinger array we build
//^ we need one element more because of one-based indexing
void eytzinger(int k = 1) {
  static int i = 0;
  if (k <= n) {
    eytzinger(2 * k);
    t[k] = a[i++];
    eytzinger(2 * k + 1);
  }
}
int lower_bound(int x) {
  int k = 1;
  while (k <= n) {
    __builtin_prefetch(&t[k * 16]);
    k = 2 * k + (t[k]< x);
  }
  k >>= __builtin_ffs(~k);
  return t[k];
}

性能好起来了，但感觉有优化空间

prefetch感觉有点多
带宽bandwidth换延迟，如果内存带宽没这么富裕怎么办

考虑b树，深度更低，局部性更好，跳转更少, 降低带宽

如何构造

const int B = 16, nblocks = (n + B - 1) / B;
int btree[nblocks][B];
int go(int k, int i) { return k * (B + 1) + i + 1; }
void build(int k = 0) {
  static int t = 0;
  while (k < nblocks) {
    for (int i = 0; i < B; i++) {
      build(go(k, i));
      btree[k][i] = (t < n ? a[t++] : INT_MAX);
    }
    build(go(k, B));
  }
}

如何找节点的二分？

// compute the "local" lower bound in a node
int rank(int x, int *node) {
  for (int i = 0; i < B; i++)
    if (node[i] >= x)
      return i;
  return B;
}

优化if

int rank(int x, int *node) {
  int mask = (1 << B);
  for (int i = 0; i < B; i++)
    mask |= (btree[k][i] >= x) << i;
  return __builtin_ffs(mask) - 1;
}

优化for循环,SIMD

typedef __m256i reg;
// compute a 8-bit mask corresponding to "<" elements
int cmp(reg x_vec, int* y_ptr) {
  reg y_vec = _mm256_load_si256((reg*) y_ptr); // load 8 sorted elements
  reg mask = _mm256_cmpgt_epi32(x_vec, y_vec); // compare against the key
  return _mm256_movemask_ps((__m256)mask); // extract the 8-bit mask
}
int rank(reg x_vec, int *node) {
  int mask = ~(
    cmp(x, node) +
    (cmp(x, node + 8) << 8)
  );
  return __builtin_ffs(mask) - 1; // alternative: popcount
}

最终代码

int lower_bound(int _x) {
  int k = 0, res = INT_MAX;
  reg x = _mm256_set1_epi32(_x);
  while (k < nblocks) {
    int i = rank(x,btree[k]);
    if (i < B)// a local lower bound may not exist in the leaf node
      res = btree[k][i];
    k = go(k, i) ;
  }
  return res;
}

这个if很难受，怎么优化？

考虑b+树，说实话我已经汗流浃背了。就不考虑了

作者还探索了其他树，优化更彻底

代码在这 https://github.com/sslotin/amh-code/blob/main/binsearch

文章在这里 https://en.algorithmica.org/hpc/data-structures/binary-search/

他的博客咱们推荐过很多次。写的很好，就是太深了得研究半天，这里标记个TODO，后面再看

见识到思路其实是很巧妙的，换种角度考虑问题

Fast-High-Quality-Pseudo-Random-Numbers-CPPCon2022-Roth-Michaels

简单来说，PCG32 Xoshiro128比标准库的rand以及mt19937快得多

HPX-A-C-Standard-Library-for-Parallelism-and-Concurrency-CppCon-2022-1

介绍HPX的，基本每年都介绍

介绍c++20线程相关的组件,jthread就不说了

stop resource

void stoppable_func(std::stop_token st){
    while(!st.stop_requested()){
        do_stuff();
    }
}
void stopper(std::stop_source source){
    while(!done()){
        do_something();
    }
    source.request_stop();
}

也可以定制

Data read_file(std::stop_token st, std::filesystem::path filename ){
    auto handle=open_file(filename);
    std::stop_callback cb(st,[&]{ cancel_io(handle);});
    return read_data(handle); // blocking
}

latch

void foo(){
    unsigned const thread_count=...;
    std::latch done(thread_count);
    std::vector<std::optional<my_data>> data(thread_count);
    std::vector<std::jthread> threads;
    for(unsigned i=0;i<thread_count;++i)
        threads.push_back(std::jthread([&,i]{
            data[i]=make_data(i);
            done.count_down();
            do_more_stuff();
        }));
    done.wait();
    process_data(data);
}

barrier，感觉就是latch加上callback了

unsigned const num_threads=...;
void finish_task();
std::barrier<std::function<void()>> b(num_threads,finish_task);
void worker_thread(std::stop_token st,unsigned i){
    while(!st.stop_requested()){
        do_stuff(i);
        b.arrive_and_wait();
    }
}

mutex 一种死锁场景

class account {
std::mutex m;
currency_value balance;
 public:
  friend void transfer(account& from,account& to, currency_value amount) {
    std::scoped_lock lock_from(from.m);
    std::scoped_lock lock_to(to.m);
    from.balance -= amount;
    to.balance += amount;
    }
};

相信各位也看出来什么场景会死锁（同时发生互相转账）

c++20之后 scoped_lock可以同时锁多个锁

friend void transfer(account& from,account& to, currency_value amount)
{
    std::scoped_lock locks(from.m,to.m);
    from.balance -= amount;
    to.balance += amount;
}

间接规避了死锁的问题其实相当于两把锁合成一个来锁。

相当于要么同时锁上，要么等待。避免了两个上锁之间的间隔，也就避免了循环死锁问题。增加点耗时就是了，反正不出错

还有一些别的。没啥新鲜的东西，就不说了

Managing APIs in Enterprise Systems

这个是通过visit来合并不同API的

场景是两个不同的Response，通过一个接口处理

Optimization-Remarks

Rpass llvm-opt-report opt-viewer 三板斧

opt viewer godbolt上也集成了 https://godbolt.org/z/jG5jq7c9a

作者写了个optview2

如何看懂optview告警

| Symptom | Probable cause | Action |
| ------------------ | ------------------ | ------------------ |
| Inlining Failure | | Add header / forceinline /increase threshold |
| Clobbered by store | Aliasing | restrict / force type diff |
| Clobbered by load | Escape | Attributes pure / const /noescape (typically before the remark site) |
| Failed to move load loop invariant | Escape | All the above + copy to local |
| 其他场景 | 看不懂 | 最小代码段扔进godbolt再看 |

The-Surprising-Complexity-of-Formatting-Ranges

介绍 fmt 占位符解析实现的。很长

Type-Erasure-The-Implementation-Details-Klaus-Iglberger-CppCon-2022

介绍type erasure技术（function，any），这个技术大家都知道，还介绍了一些优化，比如SBO

所谓SBO就是给对象加一个数组buffer，当对象足够小，就用buffer placement new，避免系统new

代码大概这样

static constexpr size_t buffersize = 128UL;
static constexpr size_t alignment = 16UL;
alignas(alignment) std::array<std::byte,buffersize> buffer;
template< typename ShapeT >
Shape( ShapeT const& x ) {
    using M = Model<ShapeT>;
    static_assert( sizeof(M) <= buffersize, "Given type is too large" );
    static_assert( alignof(M) <= alignment, "Given type is overaligned" );
    ::new (pimpl()) M( shape );
}

还有就是手工绑定 manual virtual dispatch MVD

去掉虚函数，定义好的接口直接绑定函数指针, 得绑定的很深，不灵活。用虚函数就是为了灵活

class ShapeConstRef{
 public:
  template< typename ShapeT >
  ShapeConstRef( ShapeT const& shape )
  : shape_{ std::addressof(shape) },
    draw_{ []( void const* shape ){
            draw( *static_cast<ShapeT const*>(shape) );} 
        }
    {}
 private:
  friend void draw( ShapeConstRef const& shape ) {
      shape.draw_( shape.shape_ );
  }
  using DrawOperation = void(void const*);
  void const* shape_{ nullptr };
  DrawOperation* draw_{ nullptr };
};

class Circle { /*...*/ };
class Square { /*...*/ };
void draw( ShapeConstRef shape ) {
    /* Drawing the given shape */
}

int main(){
    Circle circle( 2.3 );
    Square square( 1.2 );
    draw( circle );
    draw( square );
    // ...
}

当然从性能上考虑MVD是最快的

Using-Modern-C-to-Eliminate-Virtual-Functions-Jonathan-Gopel-CppCon-2022

给了另一种多态方案，类似variant但是还不太一样，也没有virtual，我直接把两种代码贴出来，挺有意思，但是得预设写死

用virtual的版本

struct FooInterface {
  [[nodiscard]] virtual auto func() const -> int = 0;
  FooInterface() = default;
  FooInterface(const FooInterface&) = default;
  FooInterface(FooInterface&&) = default;
  FooInterface& operator=(const FooInterface&) = default;
  FooInterface& operator=(FooInterface&&) = default;
  virtual ~FooInterface() = default;
};


class Baz {
public:
  auto store(std::unique_ptr<FooInterface> value) -> void {
    data.push_back(std::move(value));
  }
private:
  std::vector<std::unique_ptr<FooInterface>> data{};
};

不用virtual的版本

template <typename T>
concept CFoo = requires(T foo) {
{ foo.func() } -> std::intergral;
};

template <typename T, typename... Ts>
concept same_as_any = (... or std::same_as<T, Ts>);

template <CFoo... TFoos>
class Baz {
public:
template <same_as_any<TFoos...> T>
  auto store(T value) {
    return std::get<std::vector<T>>(data).push_back(value);
  }
private:
  std::tuple<std::vector<TFoos>...> data{};
};

使用效果

// with virtual
Baz baz{};
baz.store(std::make_unique<Foo1>());
baz.store(std::make_unique<Foo2>());
// without virtual
Baz<Foo1, Foo2> baz{};
baz.store(Foo1{});
baz.store(Foo2{})

concept相当于做了绑定的工作，其实和上一个提到的MVD差不多，问题在于concept得写死，不够动态，如果能预设好的话concept非常干净

Your-Compiler-Understands-It-But-Does-Anyone-Else

介绍了一些重构经验

多用excepted，多用lazy iteration，https://github.com/andreasbuhr/cppcoro

测试使用注入 https://github.com/ybainier/Hypodermic 说实话没看懂怎么用

vittorio_romeo_pragmatic_simplicity

介绍了几个代码设计上的问题

比如noexcept怎么用，如果成员函数都加上noexcept合适吗？

auto怎么用，循环里的auto能随便用吗？

emplace_back一定是好的吗？

scope_lock如果参数为空有啥影响？

irange和手写循环哪个更好？

都是案例，这里直接留作讨论题了，欢迎读者思考

作者最后不忘卖书 Embracing Modern C++ Safely

这个国内我猜会引进翻译吧，pdf zlib可以搜到这里就不传播了

What I Learned From Sockets Applying the Unix Readiness Model When Composing Concurrent Operations in C++

这个讲网络库封装成sender，还算有点意思

Taking_a_Byte_Out_of_C++

这个是讲二进制问题的，start_lifitime_as接口。可能我理解的不够透彻，后面再看看

开源项目需要人手

asteria 一个脚本语言，可嵌入，长期找人，希望胖友们帮帮忙，也可以加群384042845和作者对线
Unilang deepin的一个通用编程语言，点子有点意思，也缺人，感兴趣的可以github讨论区或者deepin论坛看一看。这里也挂着长期推荐了
gcc-mcf 懂的都懂
https://gitee.com/okstar-org/ok-edu-desktop 一个IM通信软件，做IM的可以关注，现在正在做全面整合阶段，开始组建商业团队阶段，年底开始融资，你参加了就快发财了，会的快来

工作招聘

金山找写qt的，详情点链接 https://app.mokahr.com/m/recommendation-apply/wps/29467?recom...
华为云数据库老东家一直在招人，nosql方向

华为云数据库团队广纳英才，社会招聘火热进行中（华为自有岗！！！）。

年底冲刺，HC开放，只限两周，有兴趣直接联系我，具体产品的华为云NoSQL数据库方向。

这边用上C++20了，技术上非常open。可以加irelandken帮忙推荐

老东家，技术大牛还是有很多的

互动环节

上次收集了一些面试题，加上群里讨论的面试题，这里放到最后，感兴趣的可以思考一下，浪费一下大家的时间

blacktail投稿空class是几个字节？

我猜问提这个问题的面试官可能看了深度探索c++对象模型，可能不知道compressed_pair和no_unique_address

yangyang投稿 leetcode979 在二叉树中分配硬币

说实话看题目反应半天

C++ 中文周刊第134期

资讯

文章

GCC Preparing To Introduce "-fhardened" Security Hardening Option

How to compare signed and unsigned integers in C++20?

使用 jegdb 来调试内存相关 crash

flat_map性能调研

C++26静态反射提案解析

视频

cppcon 2022

开源项目需要人手

工作招聘

互动环节

很水

引用和评论

C++ 中 VS 项目引入公共配置文件

Visual Studio Code (VS Code) – C/C++ 入门

如何系统地入门学习stm32？

AI处理器组合

想从事嵌入式软件，有推荐的吗？

程序员如何利用周末提升自己

嵌入式行业真的没前途吗？

C++ 中文周刊 第134期

资讯

文章

GCC Preparing To Introduce "-fhardened" Security Hardening Option

How to compare signed and unsigned integers in C++20?

使用 jegdb 来调试内存相关 crash

flat_map性能调研

C++26静态反射提案解析

视频

cppcon 2022

开源项目需要人手

工作招聘

互动环节

很水

引用和评论

C++ 中 VS 项目引入公共配置文件

Visual Studio Code (VS Code) – C/C++ 入门

如何系统地入门学习stm32？

AI处理器组合

想从事嵌入式软件，有推荐的吗？

程序员如何利用周末提升自己

嵌入式行业真的没前途吗？

C++ 中文周刊第134期