- Authors: Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avshalom Manevich, Nir Ratner, Noam Rozen, Erez Schwartz, Mor Zusman, Yoav Shoham
- View PDF: https://arxiv.org/pdf/2403.19887
- HTML (experimental): https://arxiv.org/html/2403.19887v2
- Abstract: Jamba is a new base large language model built on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. It interleaves blocks of Transformer and Mamba layers, gaining the benefits of both model families, and adds MoE in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations; the configuration implemented here fits in a single 80GB GPU and delivers high throughput and a small memory footprint compared to vanilla Transformers, together with state-of-the-art performance on standard language model benchmarks and long-context evaluations, including strong results for context lengths of up to 256K tokens. Various architectural decisions, such as how to combine Transformer and Mamba layers and how to mix experts, are studied and shown to be crucial in large-scale modeling. Several interesting properties of these architectures revealed during training and evaluation are described, and checkpoints from ablation runs will be released to encourage further exploration. The weights of Jamba are made publicly available under a permissive license. (A schematic sketch of the layer interleaving appears after the metadata list below.)
- Comments: Webpage: this https URL
- Subjects: Computation and Language (cs.CL), Machine Learning (cs.LG)
- Cite as: arXiv:2403.19887 [cs.CL] (or arXiv:2403.19887v2 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2403.19887 (arXiv-issued DOI via DataCite)
- Submission history: From Yonatan Belinkov. [v1] Thu, 28 Mar 2024 23:55:06 UTC (941 KB), [v2] Wed, 3 Jul 2024 14:30:33 UTC (1,121 KB)
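
The abstract describes the architecture as interleaved Transformer and Mamba layers, with MoE used in only a subset of layers to grow capacity without growing the active parameter count. Below is a minimal, hypothetical Python sketch of such a per-layer schedule, not the released Jamba implementation: the `LayerSpec`/`build_schedule` names, the `attention_period` and `moe_period` knobs, and the example values are illustrative assumptions rather than figures taken from the paper.

```python
# Minimal, illustrative sketch (an assumption, not the released Jamba code):
# a per-layer schedule that interleaves attention and Mamba mixers and
# applies MoE to only some of the MLPs, as described in the abstract.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LayerSpec:
    index: int
    mixer: str  # "attention" or "mamba"
    mlp: str    # "moe" or "dense"


def build_schedule(num_layers: int, attention_period: int, moe_period: int) -> list[LayerSpec]:
    """Every `attention_period`-th layer uses attention (the rest use Mamba);
    every `moe_period`-th MLP is a mixture-of-experts (the rest are dense).
    Both periods are hypothetical knobs for illustration only."""
    schedule = []
    for i in range(num_layers):
        mixer = "attention" if (i + 1) % attention_period == 0 else "mamba"
        mlp = "moe" if (i + 1) % moe_period == 0 else "dense"
        schedule.append(LayerSpec(index=i, mixer=mixer, mlp=mlp))
    return schedule


if __name__ == "__main__":
    # Example: an 8-layer block with one attention layer and MoE in every
    # second MLP (illustrative values, not the paper's configuration).
    for spec in build_schedule(num_layers=8, attention_period=8, moe_period=2):
        print(spec)
```

Running the example prints which layers use attention versus Mamba and which MLPs are MoE, which is the structural trade-off the abstract highlights: fewer attention layers shrink the KV-cache memory footprint, while sparse MoE layers add capacity without increasing the active parameters per token.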