- Optimizing cache access patterns with blocking: The first step to higher performance is blocking. Simply partitioning the input matrices (relying on a native BQN idiom rather than specialized assembly kernels) gives a speed-up of about sixfold for matrices larger than the machine's cache size. For example, with `ma‿mb`, the time taken is shown. Only a small code change is needed, from `+˝∘×⎉1‿∞` to `∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞`.
- Abstracting logic into reusable code with MPB: An abstract function `MPB` is defined that computes powers of a square matrix `𝕩` in blocks of size `𝕨`, padding with zeros where needed. An empirical search is then run to find the optimal block size.
- Attempting nested tiling with MPB2: Nested tiling is implemented as `MPB2`, but experiments show no performance improvement.
- Reducing the algorithm's asymptotic complexity: The next step is to attack the problem by reducing the algorithm's asymptotic complexity. A classic radix-2 divide-and-conquer (and cache-oblivious) algorithm, `_strassen_`, is presented; it works for any square matrix. Combined with `MPB` and compared against `Dgemm`, it achieves a significant speed-up; for example, the time for a 4096×4096 matrix is shown. This is considered the limit of a pure, single-threaded BQN implementation.
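To make the blocking idea concrete outside of BQN, here is a minimal Python sketch of the same partitioning that the idiom `∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞` expresses: the result is built block by block, and each block accumulates the sum of partial products of tile pairs. The function name and loop structure are illustrative, not taken from the article.

```python
# Hedged Python sketch of cache blocking for matrix multiplication.
# Each bs x bs tile of the result accumulates partial products of tile
# pairs, so the working set per inner loop fits in cache.

def matmul_blocked(a, b, bs):
    """Multiply square matrices a and b (lists of lists) in bs x bs blocks."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):          # block row of the result
        for kk in range(0, n, bs):      # shared (reduction) block index
            for jj in range(0, n, bs):  # block column of the result
                # Accumulate the partial product of one tile pair.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

In a low-level language this loop order is what delivers the cache win; in BQN the same partitioning is expressed once over whole tiles, and the speed-up comes from the interpreter operating on cache-sized arrays.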
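The zero-padding step that `MPB` performs before blocking can be sketched as follows; the helper name `pad_to_multiple` is hypothetical, but the invariant is the one the article relies on: rounding the side up to a multiple of the block size gives every tile full shape, and zero padding leaves the product over the original region unchanged.

```python
# Hypothetical sketch of the zero padding MPB applies before blocking:
# the matrix side is rounded up to a multiple of the block size so that
# every tile is full; zeros do not affect the product.

def pad_to_multiple(m, bs):
    """Pad square matrix m (list of lists) with zeros to a multiple of bs."""
    n = len(m)
    p = (-n) % bs                          # extra rows/columns needed
    padded = [row + [0] * p for row in m]  # widen each existing row
    padded += [[0] * (n + p) for _ in range(p)]  # append zero rows
    return padded
```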
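The radix-2 recursion behind `_strassen_` can be sketched in Python under the usual simplifying assumption that the side is a power of two (the article's version handles any square matrix via padding). Seven recursive products replace the eight of the naive split, reducing the work from O(n³) to O(n^log₂7) ≈ O(n^2.807). All names below are illustrative.

```python
# Hedged sketch of Strassen's radix-2 divide-and-conquer multiplication
# for square matrices whose side is a power of two. Seven products
# M1..M7 replace the eight block products of the naive split.

def add(x, y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def sub(x, y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(x, y)]

def strassen(a, b, leaf=1):
    n = len(a)
    if n <= leaf:  # base case: naive product
        return [[sum(a[i][k] * b[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2
    # Split each operand into four h x h quadrants.
    q = lambda m, r, c: [row[c:c + h] for row in m[r:r + h]]
    a11, a12, a21, a22 = q(a, 0, 0), q(a, 0, h), q(a, h, 0), q(a, h, h)
    b11, b12, b21, b22 = q(b, 0, 0), q(b, 0, h), q(b, h, 0), q(b, h, h)
    m1 = strassen(add(a11, a22), add(b11, b22), leaf)
    m2 = strassen(add(a21, a22), b11, leaf)
    m3 = strassen(a11, sub(b12, b22), leaf)
    m4 = strassen(a22, sub(b21, b11), leaf)
    m5 = strassen(add(a11, a12), b22, leaf)
    m6 = strassen(sub(a21, a11), add(b11, b12), leaf)
    m7 = strassen(sub(a12, a22), add(b21, b22), leaf)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(sub(add(m1, m3), m2), m6)
    # Reassemble the four quadrants into the full result.
    return ([c11[i] + c12[i] for i in range(h)] +
            [c21[i] + c22[i] for i in range(h)])
```

In practice the recursion is cut off well above 1×1 leaves: below some threshold a blocked multiply (the role `MPB` plays in the article) is faster than further recursion, which is why combining the two pays off.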