Blazing matrix products

  • Optimizing cache access patterns with blocking: The first step toward higher performance is blocking. Simply partitioning the input matrices (using a native BQN idiom rather than specialized assembly kernels) yields roughly a sixfold speed-up for matrices larger than the machine's cache. Benchmarks comparing the blocked pair mat‿mbt against the plain pair ma‿mb illustrate the gain. The required code change is small: the naive product +˝∘×⎉1‿∞ becomes ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞.
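The blocking idea above is independent of BQN. As a minimal sketch, here is the same tiling scheme in NumPy (function name `matmul_blocked` and the power-of-two sizes are illustrative assumptions, not from the original):

```python
import numpy as np

def matmul_blocked(a, b, bs):
    """Blocked (tiled) matrix product: split a and b into bs x bs tiles
    so each pair of tiles fits in cache, then accumulate tile products."""
    n = a.shape[0]
    assert n % bs == 0, "this sketch assumes n is a multiple of bs"
    c = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # one cache-friendly tile update of the result block
                c[i:i+bs, j:j+bs] += a[i:i+bs, k:k+bs] @ b[k:k+bs, j:j+bs]
    return c

rng = np.random.default_rng(0)
a = rng.random((256, 256))
b = rng.random((256, 256))
assert np.allclose(matmul_blocked(a, b, 64), a @ b)
```

The triple loop over tiles mirrors what the BQN idiom expresses point-free: the inner `+˝+˝∘×⎉1‿∞¨` multiplies and sums corresponding tiles, and `∾` reassembles the block matrix.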
  • Abstracting the logic into reusable code with MPB: An abstract function MPB computes powers of a square matrix 𝕩 using blocks of size 𝕨, zero-padding 𝕩 when its size is not a multiple of the block size. An empirical search over block sizes then determines the optimum.
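The pad-then-crop step MPB relies on can be sketched in NumPy as follows (the name `mpb` and squaring in place of general powers are assumptions for illustration):

```python
import numpy as np

def mpb(bs, x):
    """Sketch of MPB's padding logic: zero-pad x up to a multiple of the
    block size bs, square it with a blocked product, then crop back.
    (The BQN original computes arbitrary powers; squaring stands in here.)"""
    n = x.shape[0]
    m = -(-n // bs) * bs              # round n up to a multiple of bs
    p = np.zeros((m, m))
    p[:n, :n] = x                     # embed x in the padded matrix
    c = np.zeros((m, m))
    for i in range(0, m, bs):
        for j in range(0, m, bs):
            for k in range(0, m, bs):
                c[i:i+bs, j:j+bs] += p[i:i+bs, k:k+bs] @ p[k:k+bs, j:j+bs]
    return c[:n, :n]                  # discard the zero padding

x = np.random.default_rng(1).random((100, 100))
assert np.allclose(mpb(32, x), x @ x)
```

Zero-padding is harmless for the product because the extra rows and columns contribute nothing to the retained top-left block, which is why MPB can accept any square size while the blocked kernel only ever sees full tiles.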
  • Attempting nested tiling with MPB2: A second level of tiling is implemented in MPB2, but experiments show no performance improvement over the single-level MPB.
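For reference, a two-level tiling scheme of the kind MPB2 attempts can be sketched like this (names and tile sizes are illustrative; as the article reports, the extra loop nesting tends to add overhead without a speed-up):

```python
import numpy as np

def matmul_two_level(a, b, outer, inner):
    """Nested tiling: outer tiles target a larger cache level, and each
    outer tile product is itself computed with smaller inner tiles."""
    n = a.shape[0]
    assert n % outer == 0 and outer % inner == 0
    c = np.zeros((n, n))
    for i in range(0, n, outer):
        for j in range(0, n, outer):
            for k in range(0, n, outer):
                # compute the outer-tile update with inner tiles
                for ii in range(i, i + outer, inner):
                    for jj in range(j, j + outer, inner):
                        for kk in range(k, k + outer, inner):
                            c[ii:ii+inner, jj:jj+inner] += (
                                a[ii:ii+inner, kk:kk+inner]
                                @ b[kk:kk+inner, jj:jj+inner])
    return c

rng = np.random.default_rng(2)
a = rng.random((128, 128))
b = rng.random((128, 128))
assert np.allclose(matmul_two_level(a, b, 64, 32), a @ b)
```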
  • Reducing the algorithm's asymptotic complexity: The next step attacks the problem by lowering the asymptotic complexity itself. A classic radix-2 divide-and-conquer (and cache-oblivious) algorithm, `_strassen_`, is presented; it works for any square matrix. Combined with MPB and compared against Dgemm, it achieves a significant speed-up, for example on a 4096x4096 matrix (the measured times are reported in the original). This is regarded as the limit of a pure, single-threaded BQN implementation.
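The radix-2 scheme behind `_strassen_` can be sketched in NumPy as follows. This is a minimal illustration assuming power-of-two sizes (the BQN version handles any square matrix, presumably via padding as in MPB); the cutoff value is an assumption:

```python
import numpy as np

def strassen(a, b, cutoff=64):
    """Radix-2 Strassen multiply for 2^k x 2^k matrices: 7 recursive
    half-size products instead of 8, giving O(n^log2(7)) work.
    Below the cutoff, fall back to the ordinary product."""
    n = a.shape[0]
    if n <= cutoff:
        return a @ b
    h = n // 2
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]
    m1 = strassen(a11 + a22, b11 + b22, cutoff)
    m2 = strassen(a21 + a22, b11, cutoff)
    m3 = strassen(a11, b12 - b22, cutoff)
    m4 = strassen(a22, b21 - b11, cutoff)
    m5 = strassen(a11 + a12, b22, cutoff)
    m6 = strassen(a21 - a11, b11 + b12, cutoff)
    m7 = strassen(a12 - a22, b21 + b22, cutoff)
    c = np.empty((n, n))
    c[:h, :h] = m1 + m4 - m5 + m7
    c[:h, h:] = m3 + m5
    c[h:, :h] = m2 + m4
    c[h:, h:] = m1 - m2 + m3 + m6
    return c

rng = np.random.default_rng(3)
a = rng.random((256, 256))
b = rng.random((256, 256))
assert np.allclose(strassen(a, b), a @ b)
```

Because the recursion halves the problem at every level regardless of cache sizes, the access pattern is cache-oblivious: at some depth the subproblems fit in each cache level automatically, which is why it composes well with the blocked base case.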