Gemma 3 supports vision-language understanding, long context handling, and improved multilinguality

  • Google's Gemma 3: An open-source generative AI model with vision-language understanding, long context handling, and improved multilinguality.

    • New features in the blog post: discussed by the Google DeepMind and AI Studio teams; they include KV-cache memory reduction, a new tokenizer, and better-performing, higher-resolution vision encoders.
    • Technical report summary: a custom SigLIP (Sigmoid Loss for Language-Image Pre-training) vision encoder, a Pan & Scan algorithm for handling different image aspect ratios, images treated as sequences of "soft tokens", and bi-directional attention over image inputs (see the Pan & Scan sketch after this list).
    • Memory efficiency changes: reduced KV-cache memory usage at long context, allowing analysis of longer documents and conversations (see the KV-cache arithmetic after this list).
    • Improved tokenizer: vocabulary size increased to 262k using the same SentencePiece tokenizer as Gemini, more balanced for non-English languages (see the tokenizer snippet after this list).
    • Multilingual capabilities: a revisited data mixture with more multilingual data and a revised pre-training/post-training process.
  • Performance comparison: both the pre-trained and instruction-tuned versions outperform Gemma 2 across benchmarks; Gemma 3 ranks among the top 10 in LM Arena as of Apr 12, 2025, with a higher Elo score.
  • Long context handling: generalizes to a 128k context length after Rotary Position Embedding (RoPE) rescaling during pre-training (see the RoPE sketch after this list).
  • For more information: see the developer guide, model card, meme generator, and the Gemmaverse.
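
The Pan & Scan step mentioned above can be pictured as adding cropped views of a non-square image alongside a resized global view, so the vision encoder never has to squash extreme aspect ratios into one square input. The sketch below is a minimal illustration of that idea, not Gemma 3's exact preprocessing; the crop-count heuristic and the `max_crops` parameter are assumptions.

```python
# Minimal Pan & Scan style preprocessing sketch: besides the full image resized
# to the encoder resolution, wide or tall images are split into non-overlapping
# crops so the encoder sees them closer to their native aspect ratio.
# The crop-count heuristic here is illustrative, not Gemma 3's exact rule.
from PIL import Image

ENCODER_RES = 896  # square input resolution of Gemma 3's SigLIP encoder

def pan_and_scan(img: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    w, h = img.size
    views = [img.resize((ENCODER_RES, ENCODER_RES))]   # global view, always kept
    ratio = max(w, h) / min(w, h)
    if ratio < 1.5:                                     # roughly square: global view suffices
        return views
    n = min(max_crops, int(ratio))                      # illustrative heuristic
    if w >= h:                                          # wide image: split along x
        step = w // n
        boxes = [(i * step, 0, (i + 1) * step, h) for i in range(n)]
    else:                                               # tall image: split along y
        step = h // n
        boxes = [(0, i * step, w, (i + 1) * step) for i in range(n)]
    views += [img.crop(box).resize((ENCODER_RES, ENCODER_RES)) for box in boxes]
    return views
```

Each resulting view is then encoded into a fixed number of "soft tokens" that the language model attends to bi-directionally.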
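The KV-cache reduction comes largely from interleaving local sliding-window attention layers with occasional global layers, so only the global layers cache keys and values for the full context. The back-of-the-envelope arithmetic below shows the effect; the layer count, KV-head count, and head dimension are illustrative placeholders, while the 5:1 local-to-global ratio and 1024-token window follow the technical report's description.

```python
# Back-of-the-envelope KV-cache sizing: with a 5:1 local/global interleaving,
# only a sixth of the layers cache keys/values for the full context, while the
# rest cache only a short sliding window. Model dimensions are placeholders.

def kv_cache_bytes(layers, kv_heads, head_dim, cached_tokens, bytes_per_value=2):
    # Keys and values each store one head_dim vector per token, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value

context = 128_000                 # target context length
window = 1024                     # sliding-window size for local layers
layers, kv_heads, head_dim = 48, 8, 128

all_global = kv_cache_bytes(layers, kv_heads, head_dim, context)

global_layers = layers // 6       # 1 global layer for every 5 local layers
local_layers = layers - global_layers
interleaved = (kv_cache_bytes(global_layers, kv_heads, head_dim, context)
               + kv_cache_bytes(local_layers, kv_heads, head_dim, window))

print(f"all global layers : {all_global / 2**30:.2f} GiB")   # ~23 GiB
print(f"5:1 local/global  : {interleaved / 2**30:.2f} GiB")   # ~4 GiB
```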
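The tokenizer bullet refers to a SentencePiece vocabulary of roughly 262k entries. A quick way to inspect such a tokenizer with the `sentencepiece` library is shown below; `tokenizer.model` is a placeholder path, not an official artifact name.

```python
# Inspecting a SentencePiece tokenizer: vocabulary size and how a non-English
# sentence is split into subword pieces. "tokenizer.model" is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.vocab_size())                                  # ~262k entries for Gemma 3
print(sp.encode("¡Hola! ¿Cómo estás?", out_type=str))   # subword pieces
```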
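RoPE rescaling for long context can be read as computing rotary angles at `position / scale`, so that 128k positions map back into the angle range the model saw during shorter-context pre-training. The sketch below is a generic illustration of that idea; the base frequency, scale factor, and `head_dim` are assumptions for illustration, not confirmed Gemma 3 hyperparameters.

```python
# Generic RoPE-with-rescaling sketch: standard rotary inverse frequencies,
# with positions divided by a scale factor (linear positional interpolation)
# so long sequences reuse the angle range seen at shorter pre-training lengths.
# base, scale, and head_dim are illustrative values.
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int = 128,
                base: float = 1_000_000.0, scale: float = 8.0) -> np.ndarray:
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions / scale, inv_freq)   # (num_positions, head_dim // 2)

angles = rope_angles(np.arange(131_072))           # angles for a 128k-token context
cos, sin = np.cos(angles), np.sin(angles)          # rotate query/key channel pairs
```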