Gemma 3 is a family of lightweight, state-of-the-art open models developed by Google, built on the same research behind Gemini 2.0. The models support advanced text, image, and short video understanding, with strong reasoning capabilities across tasks and languages.
Available in 1B, 4B, 12B, and 27B sizes, Gemma 3 offers flexibility for a range of hardware, from laptops to cloud clusters. With a 128K-token context window (32K for 1B), it can handle long-form input for more complex tasks.
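As a rough illustration of the context limits just described, the sketch below checks whether a prompt plus a generation budget fits a given model size. The dictionary, the function name `fits_in_context`, and the reading of "K" as 1024 tokens are assumptions made for this example, not part of any Gemma tooling.

```python
# Context limits as stated in the text. Treating "K" as 1024 tokens is an
# assumption; check the model card for the exact figures.
CONTEXT_TOKENS = {
    "1b": 32 * 1024,
    "4b": 128 * 1024,
    "12b": 128 * 1024,
    "27b": 128 * 1024,
}


def fits_in_context(size: str, prompt_tokens: int, max_new_tokens: int = 0) -> bool:
    """Return True if the prompt plus the generation budget fits the window."""
    limit = CONTEXT_TOKENS[size.lower()]
    return prompt_tokens + max_new_tokens <= limit


print(fits_in_context("1B", 40_000))         # False: exceeds the 32K window
print(fits_in_context("4B", 40_000, 2_000))  # True: well inside 128K
```

Token counts depend on the tokenizer, so in practice `prompt_tokens` should come from the model's own tokenizer and not from a word count.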
Key features:
Points to be cautious about:
Gemma-3n-E2B-IT is an instruction-tuned multimodal small model from Google DeepMind, built for on-device and other low-resource deployments. It accepts text, image, audio, and video inputs and generates text outputs. While the raw parameter count is around 5B, it uses selective parameter activation, so it can run with a memory footprint closer to a traditional 2B model in many deployments. The Gemma 3n family is trained on data spanning 140+ languages, which is a big deal if you need multilingual support without jumping to much larger models.
Why should you use Gemma-3n-E2B-IT:
Points to be cautious about:
Google’s Gemma 2 9B pushes the boundary of what qualifies as “small.” At 9B parameters, it’s the heaviest model on this list, but it is competitive with 13B-class models on many benchmarks. Use this when you need the best quality your laptop can handle.
Safety and instruction-following. Gemma 2 was trained with extensive safety filtering and alignment work. It tends to refuse harmful requests more reliably than many comparably sized models and follows complex, multi-step instructions accurately.
Best for: Complex instruction-following · Tasks requiring careful safety handling · General knowledge Q&A · Content moderation
Hardware: Quantized (4-bit) requires 12GB RAM · Unquantized (16-bit) requires 24GB RAM · Recommended: 16GB+ RAM for production use
Download / Run locally: Available on Hugging Face under the google org (Gemma 2 9B IT). You’ll need to accept Google’s license terms (and may need authentication depending on your tooling). For Ollama: `ollama pull gemma2:9b-instruct-*`. Ollama provides both base and instruct tags; pick the one that matches your use case.
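To see where the hardware figures for Gemma 2 9B come from, it helps to estimate the size of the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: the helper name `weight_memory_gb` is made up for this example, and it ignores the KV cache, activations, and runtime overhead, which is why the stated RAM requirements sit well above these numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (4, 16):
    print(f"9B at {bits}-bit: ~{weight_memory_gb(9, bits):.1f} GB of weights")
# 9B at 4-bit: ~4.5 GB of weights
# 9B at 16-bit: ~18.0 GB of weights
```

The gap between roughly 4.5 GB of weights and the 12GB RAM figure leaves room for the operating system, the inference runtime, and the cache that grows with context length.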
| Technical Aspect | Details |
|---|---|
| Effective Parameters | 2.3B (5.1B total with embeddings) |
| Architecture | Dense, hybrid attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Modalities | Text, Image, Audio (≤30 sec), Video (as frames) |
| Multilingual | 35+ native, trained on 140+ languages |
| Tool Calling | Yes: native function calling |
| License | Apache 2.0 |
Gemma-4-E2B is part of Google DeepMind's Gemma 4 family, which features a hybrid attention mechanism that combines local sliding-window attention with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. The "E" in E2B stands for "effective" parameters, enabled by a key architectural innovation called Per-Layer Embeddings (PLE), which adds a dedicated conditioning vector at every decoder layer. This mechanism is what allows E2B to run in under 1.5 GB of memory with quantization and still produce valuable outputs. The model supports native function calling, which enables agentic workflows. It is optimized for on-device deployment on mobile and IoT devices and can handle text, image, audio, and video inputs. Released under Apache 2.0 (a change from earlier Gemma generations' more restrictive custom license), Gemma 4 E2B is an attractive option for developers building multimodal agentic applications that run entirely at the edge.
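The difference between the two attention patterns can be shown with a toy mask. This is a minimal sketch, not the model's implementation: the function name `causal_mask` is made up, the window is counted as including the current token (an assumption), and the way local and global layers are interleaved is not covered in the text above, so it is not modeled here.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full (global) causal attention. An integer restricts
    each query to the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future tokens
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


local_mask = causal_mask(6, window=3)   # toy window; the table lists 512 tokens
global_mask = causal_mask(6)

# Number of positions the last token can see under each pattern.
print(sum(local_mask[5]), sum(global_mask[5]))  # 3 6
```

With a sliding window, the number of visible positions per token stays constant as the sequence grows, which is where the memory savings come from; the global layers keep access to the full context.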
Last modified 21 June 2026