UCSB ArchLab OpenTPU Project: OpenTPU is an open-source re-implementation of Google's Tensor Processing Unit (TPU) by the UC Santa Barbara ArchLab, based on the details published in Google's TPU paper. It is built with PyRTL and requires Python 3, PyRTL >= 0.8.5, and numpy.
- How to Run: To run the simple matrix multiply test (with MATSIZE set to 8): `python3 assembler.py simplemult.a`, then `python3 runtpu.py simplemult.out simplemult_hostmem.npy simplemult_weights.npy`, then `python3 sim.py simplemult.out simplemult_hostmem.npy simplemult_weights.npy`. For the Boston housing data regression test (with MATSIZE set to 16): `python3 assembler.py boston.a`, then `python3 runtpu.py boston.out boston_inputs.npy boston_weights.npy`, then `python3 sim.py boston.out boston_inputs.npy boston_weights.npy`.
- Hardware Simulation: Run the executable hardware specification with `runtpu.py`, passing the binary program and the numpy array files. Ensure that MATSIZE in `config.py` is set correctly.
- Functional Simulation: `sim.py` implements the functional simulator. It reads the assembly program, a host memory file, and a weights file, and runs in both 32-bit float and 8-bit int modes. The numpy input matrices can be generated with `numpy.save` (see the sketch below); `checker.py` verifies the results.
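As a concrete illustration of preparing inputs with `numpy.save`, here is a minimal sketch that writes a host-memory file and a weights file for the simple matrix multiply test. The array shapes, dtype, and random contents are assumptions chosen to match MATSIZE = 8; they do not reproduce the project's own data-generation scripts.

```python
# make_simplemult_data.py -- minimal sketch; shapes and dtype are assumptions for MATSIZE = 8.
import numpy as np

MATSIZE = 8  # must match MATSIZE in config.py

# Host memory: MATSIZE vectors of MATSIZE 8-bit activations (assumed layout).
hostmem = np.random.randint(0, 256, size=(MATSIZE, MATSIZE), dtype=np.uint8)

# Weights: one MATSIZE x MATSIZE tile of 8-bit weights (assumed layout).
weights = np.random.randint(0, 256, size=(MATSIZE, MATSIZE), dtype=np.uint8)

np.save("simplemult_hostmem.npy", hostmem)
np.save("simplemult_weights.npy", weights)
```

The resulting `.npy` files can then be passed to `runtpu.py` and `sim.py` as shown above.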
FAQs:
- How big/efficient/fast is it: There are no hard synthesis figures for a full 256x256 OpenTPU yet.
- What can it do: It handles matrix multiplies and activations with ReLU and sigmoid.
- What features are missing: Convolution, pooling, and programmable normalization.
- Does the design follow the TPU: It follows the high-level design details, but the implementations may differ.
- Does it support the same instructions: No; it currently supports its own set of instructions, and the final ISA will likely differ.
- Is it binary compatible: No; there is no public interface or specification to be compatible with.
- Need Verilog: PyRTL can output structural Verilog with `OutputToVerilog` (see the sketch after this list).
- Suggestions/Contribute: Get in touch with Deeksha or Joseph.
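For reference, here is a minimal sketch of emitting structural Verilog from a PyRTL design. It uses `pyrtl.output_to_verilog`, which is how current PyRTL releases expose the exporter referred to above as `OutputToVerilog`; the small adder circuit is an arbitrary stand-in, not part of OpenTPU.

```python
# verilog_export_sketch.py -- minimal sketch; the adder is a placeholder design, not OpenTPU.
import pyrtl

a = pyrtl.Input(8, 'a')
b = pyrtl.Input(8, 'b')
s = pyrtl.Output(9, 's')
s <<= a + b  # simple combinational adder in the working block

with open('adder.v', 'w') as f:
    # Writes structural Verilog for whatever design is in PyRTL's current working block.
    pyrtl.output_to_verilog(f)
```

Applying the same exporter after the OpenTPU hardware has been elaborated would produce Verilog for the full design; treat this as a general PyRTL usage pattern rather than a documented OpenTPU flow.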
Software details:
- ISA: Includes the RHM, WHM, RW, MMC, ACT, NOP, and HLT instructions, each with its own specific function.
- Writing a Program: There is no dynamic scheduling; the hardware is deterministic, so programs must insert many NOPs to cover instruction latencies (see the sketch after this list). DRAM accesses introduce non-deterministic latency.
- Generating Data: For the simple one-hot 2-layer NN, use gen_one_hot.py and simple_nn.py. For the TensorFlow DNN regression, use tf_nn.py.
- Latencies: Lists the hardware execution latency of each instruction.
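To make the NOP-padding point concrete, here is a minimal sketch that writes out a program using the mnemonics above. The operand format, the latency figures, and the padding helper are illustrative assumptions, not the project's actual assembly syntax or documented timings.

```python
# write_program_sketch.py -- illustrative only; operand format and latencies are assumptions.
# Emits an assembly-like program with explicit NOPs after each instruction,
# since the hardware performs no dynamic scheduling.

# Assumed per-instruction latencies (cycles); the real values are in the project's latency table.
LATENCY = {"RHM": 2, "RW": 2, "MMC": 10, "ACT": 4, "WHM": 2}

def pad(instr):
    """Return the instruction followed by enough NOPs to cover its assumed latency."""
    return [instr] + ["NOP"] * LATENCY[instr.split()[0]]

program = (
    pad("RHM 0, 0, 8")      # read 8 vectors from host memory into the unified buffer (assumed operands)
    + pad("RW 0")           # load a weight tile
    + pad("MMC 0, 0, 8")    # matrix multiply into the accumulators
    + pad("ACT 0, 8, 8")    # apply the activation, writing back to the unified buffer
    + pad("WHM 8, 8, 8")    # write results back to host memory
    + ["HLT"]
)

with open("example.a", "w") as f:
    f.write("\n".join(program) + "\n")
```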
Microarchitecture:
- Matrix Multiply (MM) Unit: The core compute array, built from parametrizable 8-bit MACs with two weight buffers each, fed by input vectors (see the sketch after this list).
- Accumulator Buffers: Store result vectors from the MM Array; instructions specify whether to add to or overwrite the stored value.
- Weight FIFO: Buffers weight tiles moving from off-chip DRAM to avoid stalling the MM Array.
- Systolic Setup: Sequential buffers that stagger vectors so they feed diagonally into the MM Array.
- Memory Controllers: Emulated with no delay. The connection to host memory is one vector wide; the connection to weight DRAM is 64 bytes wide. The configuration can specify the Unified Buffer, Accumulator Buffer, and MM Array sizes.
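As a rough functional model of the MM Array and accumulator buffers described above, the following numpy sketch multiplies 8-bit input vectors against an 8-bit weight tile, producing 32-bit results that either overwrite or add to accumulator slots, followed by a ReLU step. Array shapes, signedness, the number of accumulator slots, and the naive requantization are assumptions for illustration; the sketch does not model the cycle-level systolic behavior.

```python
# mm_array_sketch.py -- functional model only; shapes, signedness, and buffer sizes are assumptions.
import numpy as np

MATSIZE = 8
accumulators = np.zeros((32, MATSIZE), dtype=np.int32)  # assumed number of accumulator slots

def matmul_to_acc(inputs_u8, weights_i8, acc_addr, overwrite=True):
    """Multiply 8-bit input vectors by an 8-bit weight tile into 32-bit accumulators.

    inputs_u8:  (n, MATSIZE) uint8 input vectors from the unified buffer
    weights_i8: (MATSIZE, MATSIZE) int8 weight tile
    acc_addr:   starting accumulator slot; overwrite=False accumulates instead
    """
    result = inputs_u8.astype(np.int32) @ weights_i8.astype(np.int32)
    n = result.shape[0]
    if overwrite:
        accumulators[acc_addr:acc_addr + n] = result
    else:
        accumulators[acc_addr:acc_addr + n] += result
    return result

# Example: two input vectors against one weight tile, then ReLU as the activation step.
x = np.random.randint(0, 256, size=(2, MATSIZE), dtype=np.uint8)
w = np.random.randint(-128, 128, size=(MATSIZE, MATSIZE), dtype=np.int8)
matmul_to_acc(x, w, acc_addr=0, overwrite=True)
activated = np.maximum(accumulators[0:2], 0).astype(np.uint8)  # ReLU, then naive truncation back to 8 bits
```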