- Optimizing cache access patterns with blocking: The first step toward higher performance is blocking. Simply partitioning the input matrices (using native BQN idioms rather than specialized assembly kernels) yields a speed-up of roughly sixfold for matrices larger than the machine's cache. For example, timings are reported for the blocked pairs `mat‿mbt` and `ma‿mb`. This requires only a small code change, from `+˝∘×⎉1‿∞` to `∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞`.
- Abstracting the logic into reusable code with `MPB`: An abstract function `MPB` is defined to compute powers of a square matrix `𝕩` using blocks of size `𝕨`, padding with zeros where needed. An empirical search is then performed to find the optimal block size.
- Attempting nested tiling with `MPB2`: Nested tiling is implemented in `MPB2`, but experiments show no performance improvement.
- Reducing the algorithm's asymptotic complexity: The next step is to attack the problem by reducing the algorithm's asymptotic complexity. A classic radix-2 divide-and-conquer (and cache-oblivious) algorithm, `_strassen_`, is presented; it works for any square matrix. Combined with `MPB` and compared against `Dgemm`, it achieves a significant speed-up; the timing for a 4096×4096 matrix is reported as an example. This is considered the limit of a pure, single-threaded BQN implementation.