Gemma 3 supports vision-language understanding, long context handling, and improved multilinguality

  • Google's Gemma 3: An open-source generative AI model with vision-language understanding, long context handling, and improved multilinguality.

    • New features in the blog post: discussed by the Google DeepMind and AI Studio teams; they include KV-cache memory reduction, a new tokenizer, and better-performing, higher-resolution vision encoders.
    • Technical report summary: a custom SigLIP (Sigmoid Loss for Language-Image Pre-training) vision encoder, a Pan & Scan algorithm for handling different image aspect ratios, images treated as sequences of "soft tokens", and bi-directional attention over image inputs (see the Pan & Scan sketch after this list).
    • Memory efficiency changes: reduced KV-cache memory usage at long context, allowing analysis of longer documents and conversations (see the KV-cache arithmetic after this list).
    • Improved tokenizer: vocabulary size increased to 262k using the same SentencePiece tokenizer as Gemini, more balanced for non-English languages (see the tokenizer snippet after this list).
    • Multilingual capabilities: a revisited data mixture with more multilingual data and a revised pre-training/post-training process.
  • Performance comparison: both the pre-trained and instruction-tuned versions outperform Gemma 2 across benchmarks; Gemma 3 ranks among the top 10 in LM Arena as of Apr 12, 2025, with a higher Elo score.
  • Long context handling: generalizes to a 128k context length after Rotary Position Embedding (RoPE) rescaling during pre-training (see the RoPE sketch after this list).
  • For more information: see the developer guide, model card, meme generator, and the Gemmaverse.
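
The Pan & Scan step mentioned above can be pictured as adding cropped views of a non-square image alongside a resized global view, so the vision encoder never has to squash extreme aspect ratios into one square input. The sketch below is a minimal illustration of that idea, not Gemma 3's exact preprocessing; the crop-count heuristic and the `max_crops` parameter are assumptions.

```python
# Minimal Pan & Scan style preprocessing sketch: besides the full image resized
# to the encoder resolution, wide or tall images are split into non-overlapping
# crops so the encoder sees them closer to their native aspect ratio.
# The crop-count heuristic here is illustrative, not Gemma 3's exact rule.
from PIL import Image

ENCODER_RES = 896  # square input resolution of Gemma 3's SigLIP encoder

def pan_and_scan(img: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    w, h = img.size
    views = [img.resize((ENCODER_RES, ENCODER_RES))]   # global view, always kept
    ratio = max(w, h) / min(w, h)
    if ratio < 1.5:                                     # roughly square: global view suffices
        return views
    n = min(max_crops, int(ratio))                      # illustrative heuristic
    if w >= h:                                          # wide image: split along x
        step = w // n
        boxes = [(i * step, 0, (i + 1) * step, h) for i in range(n)]
    else:                                               # tall image: split along y
        step = h // n
        boxes = [(0, i * step, w, (i + 1) * step) for i in range(n)]
    views += [img.crop(box).resize((ENCODER_RES, ENCODER_RES)) for box in boxes]
    return views
```

Each resulting view is then encoded into a fixed number of "soft tokens" that the language model attends to bi-directionally.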
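The KV-cache reduction comes largely from interleaving local sliding-window attention layers with occasional global layers, so only the global layers cache keys and values for the full context. The back-of-the-envelope arithmetic below shows the effect; the layer count, KV-head count, and head dimension are illustrative placeholders, while the 5:1 local-to-global ratio and 1024-token window follow the technical report's description.

```python
# Back-of-the-envelope KV-cache sizing: with a 5:1 local/global interleaving,
# only a sixth of the layers cache keys/values for the full context, while the
# rest cache only a short sliding window. Model dimensions are placeholders.

def kv_cache_bytes(layers, kv_heads, head_dim, cached_tokens, bytes_per_value=2):
    # Keys and values each store one head_dim vector per token, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value

context = 128_000                 # target context length
window = 1024                     # sliding-window size for local layers
layers, kv_heads, head_dim = 48, 8, 128

all_global = kv_cache_bytes(layers, kv_heads, head_dim, context)

global_layers = layers // 6       # 1 global layer for every 5 local layers
local_layers = layers - global_layers
interleaved = (kv_cache_bytes(global_layers, kv_heads, head_dim, context)
               + kv_cache_bytes(local_layers, kv_heads, head_dim, window))

print(f"all global layers : {all_global / 2**30:.2f} GiB")   # ~23 GiB
print(f"5:1 local/global  : {interleaved / 2**30:.2f} GiB")   # ~4 GiB
```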
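The tokenizer bullet refers to a SentencePiece vocabulary of roughly 262k entries. A quick way to inspect such a tokenizer with the `sentencepiece` library is shown below; `tokenizer.model` is a placeholder path, not an official artifact name.

```python
# Inspecting a SentencePiece tokenizer: vocabulary size and how a non-English
# sentence is split into subword pieces. "tokenizer.model" is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.vocab_size())                                  # ~262k entries for Gemma 3
print(sp.encode("¡Hola! ¿Cómo estás?", out_type=str))   # subword pieces
```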
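RoPE rescaling for long context can be read as computing rotary angles at `position / scale`, so that 128k positions map back into the angle range the model saw during shorter-context pre-training. The sketch below is a generic illustration of that idea; the base frequency, scale factor, and `head_dim` are assumptions for illustration, not confirmed Gemma 3 hyperparameters.

```python
# Generic RoPE-with-rescaling sketch: standard rotary inverse frequencies,
# with positions divided by a scale factor (linear positional interpolation)
# so long sequences reuse the angle range seen at shorter pre-training lengths.
# base, scale, and head_dim are illustrative values.
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int = 128,
                base: float = 1_000_000.0, scale: float = 8.0) -> np.ndarray:
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions / scale, inv_freq)   # (num_positions, head_dim // 2)

angles = rope_angles(np.arange(131_072))           # angles for a 128k-token context
cos, sin = np.cos(angles), np.sin(angles)          # rotate query/key channel pairs
```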