GitHub - UCSBarchlab/OpenTPU: An open-source re-implementation of Google's Tensor Processing Unit (TPU).

  • UCSB ArchLab OpenTPU Project: OpenTPU is an open-source re-implementation of Google's Tensor Processing Unit by the UC Santa Barbara ArchLab, based on the details published in Google's TPU paper. It is built with PyRTL and requires Python 3, PyRTL >= 0.8.5, and numpy.

    • How to Run:
      • Simple matrix multiply test (set MATSIZE to 8 in config.py): python3 assembler.py simplemult.a; python3 runtpu.py simplemult.out simplemult_hostmem.npy simplemult_weights.npy; python3 sim.py simplemult.out simplemult_hostmem.npy simplemult_weights.npy
      • Boston housing data regression test (set MATSIZE to 16): python3 assembler.py boston.a; python3 runtpu.py boston.out boston_inputs.npy boston_weights.npy; python3 sim.py boston.out boston_inputs.npy boston_weights.npy
    • Hardware Simulation: runtpu.py runs the executable hardware specification; it takes the assembled binary program and the two numpy array files (host memory and weights) as arguments. Make sure MATSIZE in config.py matches the program.
    • Functional Simulation: sim.py implements a functional simulator. It reads the assembly program, a host memory file, and a weights file, and runs in both 32-bit float and 8-bit int modes. Numpy matrices can be generated with numpy.save (see the sketch after this list). checker.py verifies the results.
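The host memory and weight files are plain numpy arrays saved with numpy.save. The sketch below produces a pair of such files for the MATSIZE=8 configuration; the shapes, dtypes, and file names are illustrative assumptions rather than the project's documented format.

```python
import numpy as np

MATSIZE = 8  # must match MATSIZE in config.py

# Host memory: each row is one MATSIZE-wide vector of 8-bit values
# (illustrative layout, not the project's documented format).
hostmem = np.random.randint(0, 256, size=(4, MATSIZE), dtype=np.uint8)

# Weights: one MATSIZE x MATSIZE tile of 8-bit values.
weights = np.random.randint(0, 256, size=(MATSIZE, MATSIZE), dtype=np.uint8)

np.save('my_hostmem.npy', hostmem)
np.save('my_weights.npy', weights)

# These files are then passed to the simulators, e.g.:
#   python3 runtpu.py simplemult.out my_hostmem.npy my_weights.npy
#   python3 sim.py simplemult.out my_hostmem.npy my_weights.npy
```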
  • FAQs:

    • How big/efficient/fast is it? There are no hard synthesis figures for a full 256x256 OpenTPU yet.
    • What can it do? It handles matrix multiplies and activations with ReLU and sigmoid.
    • What features are missing? Convolution, pooling, and programmable normalization.
    • Does the design follow the TPU? It follows the high-level design details from the paper, but the implementations may differ.
    • Does it support the same instructions? No; it currently supports a specific set of instructions, and the final ISA will likely differ.
    • Is it binary compatible? No; there is no public interface or specification to target.
    • Do I need Verilog? PyRTL can output structural Verilog with OutputToVerilog (see the sketch after this list).
    • Suggestions/Contributions: Get in touch with Deeksha or Joseph.
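Since OpenTPU is written in PyRTL, exporting Verilog takes a single call. Below is a minimal sketch on a toy circuit (not taken from the OpenTPU sources), assuming the snake_case name pyrtl.output_to_verilog used by current PyRTL releases for what the README calls OutputToVerilog.

```python
import pyrtl

# Build a trivial 8-bit adder just to have a circuit to export.
a = pyrtl.Input(8, 'a')
b = pyrtl.Input(8, 'b')
s = pyrtl.Output(9, 's')
s <<= a + b

# Emit structural Verilog for the current working block.
with open('adder.v', 'w') as f:
    pyrtl.output_to_verilog(f)
```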
  • Software details:

    • ISA: Includes the RHM (read host memory), WHM (write host memory), RW (read weights), MMC (matrix multiply/convolve), ACT (activate), NOP, and HLT (halt) instructions.
    • Writing a Program: There is no dynamic scheduling; the hardware is deterministic, so programs must insert enough NOPs to cover instruction latencies. DRAM is the exception, with non-deterministic latency.
    • Generating Data: For the simple one-hot 2-layer NN, use gen_one_hot.py and simple_nn.py. For the TensorFlow DNN regression, use tf_nn.py (an illustrative data-generation sketch follows this list).
    • Latencies: The hardware execution latency of each instruction is listed; this is what a program's NOP padding must account for.
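As a rough picture of what data generation involves, the sketch below builds one-hot input vectors and two random 8-bit weight tiles for a toy 2-layer network and saves them with numpy.save. The file names, shapes, and dtypes are illustrative assumptions; the repository's gen_one_hot.py, simple_nn.py, and tf_nn.py are the actual scripts.

```python
import numpy as np

MATSIZE = 8        # must match MATSIZE in config.py
NUM_VECTORS = 16   # number of input vectors to generate (illustrative)

# One-hot input vectors: each row has a single 1 at a random position.
inputs = np.zeros((NUM_VECTORS, MATSIZE), dtype=np.uint8)
inputs[np.arange(NUM_VECTORS), np.random.randint(0, MATSIZE, NUM_VECTORS)] = 1

# Two MATSIZE x MATSIZE weight tiles, one per layer of the toy 2-layer network.
weights = np.stack([
    np.random.randint(0, 256, size=(MATSIZE, MATSIZE), dtype=np.uint8),
    np.random.randint(0, 256, size=(MATSIZE, MATSIZE), dtype=np.uint8),
])

np.save('onehot_inputs.npy', inputs)
np.save('onehot_weights.npy', weights)
```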
  • Microarchitecture:

    • Matrix Multiply (MM) Unit: The core compute unit: a parametrizable array of 8-bit MACs with two weight buffers, fed by input vectors (a behavioral sketch follows this list).
    • Accumulator Buffers: Store result vectors from the MM Array; instructions specify whether a result is added to or overwrites the existing accumulator contents.
    • Weight FIFO: Buffers weight tiles coming from off-chip DRAM so the MM Array does not stall while new weights are fetched.
    • Systolic Setup: Sequential buffers that stagger input vectors diagonally so they enter the MM Array in systolic fashion.
    • Memory Controllers: Currently emulated with no delay. The connection to Host Memory is one vector wide; the connection to Weight DRAM is 64 bytes wide. The configuration can specify the sizes of the Unified Buffer, Accumulator Buffers, and MM Array.
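To make the datapath description concrete, below is a small numpy sketch of the behavior the MM Array, Accumulator Buffers, and activation path implement: 8-bit inputs and weights are multiplied into 32-bit accumulator entries, either overwriting or adding to them, and ReLU can be applied on the way out. This is a behavioral model written for illustration, not the PyRTL implementation, and the shapes and dtypes are assumptions.

```python
import numpy as np

def mm_accumulate(x_q, w_q, acc, overwrite=True):
    """Behavioral sketch (not the PyRTL design): multiply a batch of 8-bit
    input vectors by an 8-bit weight tile, producing 32-bit partial sums that
    either overwrite or add into the accumulator, as the instructions allow."""
    partial = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return partial if overwrite else acc + partial

def relu(acc):
    # ReLU activation applied when results move back toward the Unified Buffer.
    return np.maximum(acc, 0)

# Example: a batch of 4 input vectors through an 8x8 weight tile.
MATSIZE = 8
x = np.random.randint(0, 256, size=(4, MATSIZE), dtype=np.uint8)
w = np.random.randint(-128, 128, size=(MATSIZE, MATSIZE), dtype=np.int8)
acc = np.zeros((4, MATSIZE), dtype=np.int32)
acc = mm_accumulate(x, w, acc, overwrite=True)
out = relu(acc)
```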