4.5: HuggingFace
- Exceptional Benchmark Performance: State-of-the-art vision-language performance with an average score of 77.0 on OpenCompass, surpassing larger models such as GPT-4o-latest and Gemini-2.0 Pro
- Revolutionary Video Processing: Efficient video understanding using a unified 3D-Resampler that compresses video tokens by a factor of 96, enabling high-FPS processing at up to 10 frames per second
- Flexible Reasoning Modes: Controllable hybrid fast and deep thinking modes, so users can switch between quick responses and more deliberate reasoning on complex problems
- Advanced Text Recognition: Strong OCR and document parsing on high-resolution images of up to 1.8 million pixels, with leading scores on OCRBench and OmniDocBench
- Versatile Platform Support: Easy deployment across platforms with llama.cpp and Ollama support, 16 quantized model sizes, SGLang and vLLM integration, fine-tuning options, a WebUI demo, an iOS app, and an online web demo
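
The two numeric limits quoted above (1.8 million pixels per image and 96x video token compression) lend themselves to a quick back-of-the-envelope check before sending inputs to the model. The following is a minimal sketch, not part of the model's API: the function names `fit_to_pixel_budget` and `compressed_token_count` are hypothetical, and it assumes the pixel limit applies to width times height and that the compression ratio applies to a raw token count you already know. How the model itself resizes or slices oversized images is not stated in this listing.

```python
import math

# Figures taken from the feature list above.
MAX_PIXELS = 1_800_000      # "up to 1.8 million pixels"
VIDEO_COMPRESSION = 96      # "compresses video tokens by a factor of 96"


def fit_to_pixel_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down uniformly so the area fits the budget.

    Hypothetical helper: images already within the budget are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    scale = math.sqrt(max_pixels / pixels)
    # Round down so the result never exceeds the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def compressed_token_count(raw_tokens, ratio=VIDEO_COMPRESSION):
    """Estimate the token count after compression, rounding up to a whole token."""
    if raw_tokens < 0:
        raise ValueError("raw_tokens must be non-negative")
    return math.ceil(raw_tokens / ratio)


if __name__ == "__main__":
    print(fit_to_pixel_budget(4000, 3000))
    print(compressed_token_count(9600))
```

Rounding down in the resize helper trades a few pixels of resolution for a guarantee that the result stays within the stated limit.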
Tags:
ai
model
vision
ocr
Last modified 22 March 2026