Building Meta's GenAI Infrastructure

Meta is making a major investment in its AI future by announcing two 24k-GPU clusters. This post shares details of their hardware, network, storage, design, performance, and software. The clusters are used for Llama 3 training and are built on open compute and open source technologies such as Grand Teton, OpenRack, and PyTorch.
Meta's long-term vision is to build artificial general intelligence (AGI). The AI Research SuperCluster (RSC), with 16,000 NVIDIA A100 GPUs, has accelerated AI research; the newer clusters build on its successes and focus on researcher and developer experience.
Network: One cluster uses a RoCE network fabric built with Arista 7800, Wedge400, and Minipack2 switches; the other uses an NVIDIA Quantum2 InfiniBand fabric. Both interconnect 400 Gbps endpoints.
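A quick back-of-envelope calculation makes the scale of these fabrics concrete. Taking the round 24k-GPU figure from the text and the stated 400 Gbps per endpoint, the theoretical aggregate injection bandwidth of one cluster works out as follows (an illustration, not a figure from the post):

```python
# Back-of-envelope aggregate injection bandwidth for one cluster.
# Inputs: ~24k GPU endpoints (round figure), 400 Gbps per endpoint.
GPUS = 24_000        # endpoints per cluster (round figure from the text)
LINK_GBPS = 400      # per-endpoint link speed, as stated

total_tbps = GPUS * LINK_GBPS / 1_000  # Gbps -> Tbps
print(f"{total_tbps:.0f} Tbps aggregate injection bandwidth")  # 9600 Tbps
```

Real collective-communication throughput is lower, since it depends on topology, oversubscription, and routing, which is why the post highlights network-routing work alongside raw link speed.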
Compute: Both clusters are built on Grand Teton, an in-house-designed open GPU hardware platform contributed to OCP that integrates power, control, compute, and fabric interfaces.
Storage: A home-grown storage solution backed by a Linux FUSE API, built on Meta's 'Tectonic' distributed storage system and a partnership with Hammerspace, meets the clusters' data and checkpointing needs.
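The Tectonic and Hammerspace internals are not public, but the FUSE mount exposes an ordinary POSIX namespace, so checkpointing can follow the standard write-then-atomic-rename pattern. The sketch below is a generic illustration of that pattern (the function name and file format are my own, not Meta's API):

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write a checkpoint atomically: write to a temp file in the same
    directory, fsync, then rename over the target. On a POSIX filesystem
    (including a FUSE-mounted one) the rename is atomic, so readers never
    observe a partially written checkpoint."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # force data to stable storage
        os.replace(tmp, path)          # atomic rename over the target
    except BaseException:
        os.unlink(tmp)                 # clean up the temp file on failure
        raise

# Usage: persist training progress to a shared path (illustrative only).
save_checkpoint({"step": 1000, "loss": 2.3}, "/tmp/ckpt.json")
```

Real training checkpoints are multi-gigabyte tensor blobs rather than JSON, but the atomicity concern is the same: thousands of ranks must never read a half-written file.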
Performance: To maximize performance and ease of use simultaneously, Meta made changes to job scheduling and network routing and worked with training frameworks, improving performance at large cluster scale. Tools such as the desync debug tool and ongoing PyTorch development are also part of this effort.
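Meta's desync debug tooling itself is not described in detail here, but the underlying idea is simple: when ranks in a collective operation fall out of sync, the lagging ranks stand out by their last-arrival timestamps. A minimal sketch of that heuristic, with a hypothetical `find_stragglers` helper of my own invention:

```python
def find_stragglers(last_entry: dict, timeout_s: float = 30.0) -> list:
    """Given each rank's timestamp for entering the current collective,
    return ranks lagging more than timeout_s behind the most recent
    arrival -- a simple desynchronization heuristic."""
    latest = max(last_entry.values())
    return sorted(r for r, t in last_entry.items() if latest - t > timeout_s)

# Rank -> seconds at which it last entered a collective (made-up data).
heartbeats = {0: 100.0, 1: 100.5, 2: 61.0, 3: 100.2}
print(find_stragglers(heartbeats))  # [2]  (rank 2 lags ~39s behind)
```

In a real cluster these timestamps would come from instrumentation inside the communication library, and the report would include stack traces for the lagging ranks to pinpoint where they stalled.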
Commitment to open AI innovation: Meta is a founding member of OCP and the largest contributor to PyTorch. It also launched the Open Innovation AI Research Community and the AI Alliance to promote open and responsible AI development.
Future: These cluster designs are part of a larger roadmap. By the end of 2024, Meta aims to grow its infrastructure to include 350,000 NVIDIA H100 GPUs, with total compute power equivalent to nearly 600,000 H100s. Meta continually evaluates and improves its infrastructure to meet future needs.
