Model instances

Smol-3B (SLM)

Technical Aspect Details
Parameters 3B
Architecture Decoder-only transformer (GQA + NoPE, 3:1 ratio)
Context Length 64K native; up to 128K with YaRN extrapolation
Training Tokens 11.2T
Multilingual Support 6 languages (EN, FR, ES, DE, IT, PT)
Reasoning Mode Dual-mode (thinking / no-think toggle)
Tool Calling Yes: JSON/XML (xml_tools) and Python (python_tools)
License Apache 2.0

SmolLM3 is a 3B parameter language model designed to push the boundaries of small models, supporting dual-mode reasoning, 6 languages, and long context. It is a decoder-only transformer using Grouped Query Attention (GQA) and No Positional Embeddings (NoPE) (with a 3:1 ratio), pretrained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included a mid-training phase on 140 billion reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO), HuggingFace's off-policy approach to preference alignment. The model supports two distinct tool-calling interfaces, JSON/XML blobs via xml_tools and Python-style function calls via python_tools, making it highly flexible for agentic pipelines and RAG systems. As a fully open release, including weights, datasets, and training code, SmolLM3 is ideal for chatbots, RAG systems, and code assistants on constrained hardware such as edge devices or low-VRAM machines.

A fully open instruct and reasoning model. At the 3B scale, it outperforms Llama-3.2-3B and Qwen2.5-3B, while staying competitive with many 4B-class alternatives (including Qwen3 and Gemma 3) across 12 popular LLM benchmarks.

What also sets SmolLM3 apart is the level of transparency. Hugging Face published the full engineering blueprint of it, including architecture decisions, data mixture, and post-training methodology. If you’re building internal variants or want to understand what actually drives quality at 3B, that matters.

Why should you use SmolLM3-3B:

Points to be cautious about:

SmolLM2 1.7B

Hugging Face’s SmolLM2 is one of the smallest models here, designed for rapid experimentation and learning. It’s not production-ready for complex tasks, but it’s perfect for prototyping, testing pipelines, and understanding how small models behave.

Speed and accessibility. SmolLM2 runs in seconds, making it ideal for rapid iteration during development. Use it to test your fine-tuning pipeline before scaling to larger models.

Best for: Rapid prototyping · Learning and experimentation · Simple NLP tasks (sentiment analysis, categorization) · Educational projects

Hardware: Quantized (4-bit) requires 4GB RAM · Full precision (16-bit) requires 6GB RAM · Recommended: Runs on any modern laptop

Download / Run locally: Available on Hugging Face under HuggingFaceTB (SmolLM2 1.7B Instruct). For Ollama: ollama pull smollm2.

Reading

Articles


Tags: ai   model   slm  

Last modified 21 June 2026