Mechanical Sympathy: Coding for CPU Performance

  • Introduction: Even elegant algorithms can run slowly on real hardware. The difference between mediocre and exceptional performance often comes down to how well code works with the CPU architecture. Modern processors rely on mechanisms such as instruction pipelining, memory caching, and speculative execution.
  • Drive-through restaurant analogy: The way a CPU executes instructions is compared to how a drive-through restaurant processes orders. The restaurant starts out with inefficiencies and improves throughput by organizing the work as a pipeline.
  • Instruction pipelining: CPU processes instructions in stages. A simple five-stage pipeline (Fetch, Decode, Execute, Memory access, Write back) can process multiple instructions simultaneously, increasing instruction throughput. Modern processors have more complex pipelines.
  • Scaling with parallel pipelines: Installing more order windows and workers increased the restaurant's order-processing capacity. Modern superscalar processors likewise have multiple execution resources and can issue several instructions in parallel.
  • Instruction level parallelism (ILP): Processors look for independent instructions they can execute in parallel to reach peak instruction throughput. Out-of-order execution enables this while ensuring the program's logic remains correct.
  • Mechanical sympathy example: Loop unrolling: In software, loop unrolling can improve superscalar utilization by performing several independent operations per iteration. Compilers can often apply this optimization themselves, but manual unrolling is sometimes needed (see the unrolling sketch after this list).
  • Memory caching: The restaurant installed local shelves and a basement cache to store frequently used ingredients. Processors have a hierarchy of cache memories (L1, L2, L3) to improve performance by caching recently used data.
  • Cache optimization: Data-structure layout and memory-access patterns are crucial. Keeping frequently accessed (hot) fields together and moving rarely used (cold) fields elsewhere improves spatial locality, so each cache line fetched carries more useful data and cache misses drop (see the layout sketch after this list).
  • Cache prefetching: Modern processors include hardware prefetchers that detect regular access patterns and fetch cache lines before they are needed, hiding misses; sequential or simple strided traversal keeps them effective (see the traversal sketch after this list).
  • Speculative execution: The restaurant added a predictive system to avoid stalls in its pipeline. The CPU likewise executes instructions speculatively past branches to keep the pipeline full, but pays a performance penalty when a misprediction forces it to discard that work.
  • Branch predictors: Branch predictors guess the direction (and target) of branches so the pipeline can keep fetching instructions without stalling. Predictable branch patterns perform well; converting hot, unpredictable branches to branch-free code and grouping objects by type so branches become regular can help (see the branchless sketch after this list).
  • Conclusion: Understanding processor behavior is essential for writing performant code. Write clean code first, profile for bottlenecks, and apply mechanical sympathy principles to hot spots. Work with the hardware instead of fighting it. Huge thanks to Sanjeev Dwivedi for reviewing.
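
Below is a minimal sketch of the loop-unrolling idea from the ILP section, written in C++ (the language and function names are my own illustration, not code from the article). Splitting a sum across several independent accumulators breaks the single dependency chain, so a superscalar, out-of-order core has parallel work to issue:

```cpp
#include <cstddef>
#include <vector>

// Baseline: every addition depends on the previous one, so throughput is
// limited by the latency of a single dependency chain.
double sum_simple(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Unrolled by 4 with independent accumulators: the four additions in each
// iteration do not depend on one another, so a superscalar core can keep
// several floating-point units busy at once.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    const std::size_t n = v.size();
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // handle any leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

For floating-point sums this reassociation slightly changes rounding, which is one reason a compiler may not unroll this way on its own and manual unrolling can still pay off.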
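The cache-layout point can be sketched the same way (the `Particles` example and its field names are hypothetical, my own illustration): when a hot loop reads only a few fields, keeping those fields densely packed means every cache line brought in is mostly useful data.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Layout A: hot and cold data interleaved in one struct. Iterating over
// positions drags the cold fields into cache, so each 64-byte line holds
// only a few positions.
struct ParticleAoS {
    float x, y, z;            // hot: read every frame
    std::string debug_name;   // cold: rarely touched
    std::uint64_t spawn_tick; // cold
};

// Layout B: hot fields stored contiguously, cold data kept in parallel
// arrays. A pass over positions now touches densely packed memory.
struct Particles {
    std::vector<float> x, y, z;          // hot arrays, iterated together
    std::vector<std::string> debug_name; // cold, indexed only when needed
    std::vector<std::uint64_t> spawn_tick;
};

float total_x(const Particles& p) {
    float sum = 0.0f;
    for (float v : p.x) sum += v;  // sequential, cache-friendly accesses
    return sum;
}
```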
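For prefetching, a simple sketch of why access order matters (again my own illustration): a row-major traversal of a row-major matrix touches consecutive addresses, a pattern hardware prefetchers handle well, while a column-major traversal of the same data strides through memory and largely defeats them.

```cpp
#include <cstddef>
#include <vector>

// Matrix stored in row-major order: element (r, c) lives at index r * cols + c.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];  // consecutive addresses: prefetcher-friendly
    return s;
}

// Same data and arithmetic, but the inner loop jumps `cols` elements at a
// time, so most accesses land on a new cache line and the prefetcher has
// little regular pattern to exploit.
long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];  // stride of `cols` ints: prefetcher-hostile
    return s;
}
```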
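Finally, a sketch of branch-free code for an unpredictable condition (a generic technique, not necessarily the article's exact example): when the data is random, the branchy form mispredicts often, while the arithmetic form gives the predictor nothing to miss.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: if `values` is random relative to the threshold, the
// branch is unpredictable and mispredictions flush the pipeline frequently.
std::int64_t sum_over_threshold(const std::vector<std::int32_t>& values,
                                std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::int32_t v : values) {
        if (v >= threshold) sum += v;
    }
    return sum;
}

// Branch-free version: the comparison yields 0 or 1 and is used as a
// multiplier, so there is no conditional jump for the predictor to miss.
std::int64_t sum_over_threshold_branchless(const std::vector<std::int32_t>& values,
                                           std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::int32_t v : values) {
        sum += static_cast<std::int64_t>(v) * (v >= threshold);
    }
    return sum;
}
```

Compilers sometimes make this transformation themselves (emitting a conditional move), so, as the conclusion advises, profile first and apply such rewrites only to measured hot spots.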