- Google released the Gemma 3 QAT family: quantized versions of the open-weight Gemma 3 language models in four sizes (1B, 4B, 12B, and 27B), using Quantization-Aware Training (QAT) to maintain high accuracy when weights are quantized from 16 bits down to 4 bits.
- Resource requirements and compatibility: The quantized versions require only about 25% of the VRAM needed by the 16-bit models. The 27B model can run on a desktop NVIDIA RTX 3090 GPU with 24GB of VRAM, the 12B model on a laptop NVIDIA RTX 4060 GPU with 8GB of VRAM, and the smaller models on mobile phones or edge devices. The unquantized Gemma 3 models require substantially more GPU memory, such as an NVIDIA RTX 5090 with 32GB of VRAM (a rough arithmetic check of these figures appears below).
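To see why 4-bit weights need roughly a quarter of the memory of 16-bit weights, here is a minimal back-of-the-envelope sketch in Python. It is illustrative only: it counts weight storage alone, while real deployments also need VRAM for activations, the KV cache, and quantization scales, so actual figures run somewhat higher.

```python
# Rough weight-memory estimate: parameter count times bits per weight.
# Illustrative arithmetic only; runtime VRAM also includes activations,
# KV cache, and quantization metadata.
SIZES = {"1B": 1e9, "4B": 4e9, "12B": 12e9, "27B": 27e9}

for name, params in SIZES.items():
    gb_16bit = params * 16 / 8 / 1e9  # 2 bytes per weight
    gb_4bit = params * 4 / 8 / 1e9    # 0.5 bytes per weight
    print(f"{name}: ~{gb_16bit:.0f} GB at 16-bit vs ~{gb_4bit:.1f} GB at 4-bit")
```

For the 27B model this gives roughly 54GB of weights at 16-bit versus about 13.5GB at 4-bit, which is why the quantized version fits on a 24GB RTX 3090 while the unquantized one does not.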
- Google's commitment: Google is committed to making powerful AI accessible by enabling efficient performance on consumer-grade GPUs, calling state-of-the-art AI performance on accessible hardware a key step in democratizing AI development.
- Initial launches and performance: InfoQ covered Google's initial launch of the Gemma series in 2024, followed by Gemma 2. Gemma 3 brought performance improvements and added vision capabilities (except for the 1B size), achieving performance competitive with models twice its size.
- Using QAT and room for improvement: Google used QAT so that model weights could be quantized without sacrificing performance. Omar Sanseviero suggested there was still room for improvement, such as not quantizing the embeddings and trying 3-bit quantization. A minimal sketch of the core QAT technique appears below.
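As a rough illustration of the technique, the following PyTorch sketch simulates 4-bit quantization of a layer's weights during the forward pass while letting gradients flow through unchanged (a straight-through estimator), so training adapts the weights to the low-bit grid. This is a generic, assumed sketch of QAT, not Google's actual (unpublished) training code; all names here are hypothetical.

```python
import torch
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for signed 4-bit ints
    scale = w.abs().max().clamp(min=1e-8) / qmax   # map the largest weight to qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Forward pass sees the quantized weights; backward pass sees identity.
    return w + (w_q - w).detach()


class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized on every forward pass."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quantize(self.weight), self.bias)
```

Because the rounding error is present throughout training, the optimizer drives the weights toward values that still perform well after real 4-bit quantization, which is what lets the released checkpoints stay close to their 16-bit originals in accuracy.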
- User feedback: Users on Hacker News praised the QAT models' performance. One user was shocked at the information density packed into 13GB of weights. Simon Willison said it may be his new favorite local model; running it with Ollama used 22GB of RAM on his 64GB machine.
- Availability: Gemma 3 QAT model weights are available on HuggingFace and supported in popular LLM frameworks such as Ollama, LM Studio, Gemma.cpp, and llama.cpp; a short usage sketch follows.
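As one possible way to try the models locally, the sketch below uses the official `ollama` Python client (`pip install ollama`) to chat with a Gemma 3 QAT model. The model tag `gemma3:27b-it-qat` is an assumption based on Ollama's listing at launch; substitute whatever tag `ollama list` shows after you pull a QAT build.

```python
# Minimal chat with a locally served Gemma 3 QAT model via Ollama.
# Assumes the Ollama daemon is running and the model has been pulled,
# e.g. `ollama pull gemma3:27b-it-qat` (tag name assumed; check `ollama list`).
import ollama

response = ollama.chat(
    model="gemma3:27b-it-qat",
    messages=[{"role": "user", "content": "Explain quantization-aware training in one sentence."}],
)
print(response["message"]["content"])
```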