Allen AI | Website | HuggingFace | Blog

A family of open-source VLMs developed by the Allen Institute for AI. Available in 1B, 7B, and 72B parameter sizes, Molmo models deliver state-of-the-art performance for their class. On the reported benchmarks, they perform on a par with proprietary models such as GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
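
For reference, a minimal sketch of running one of the checkpoints with `transformers`, based on the usage shown on the Hugging Face model cards. The `processor.process` and `generate_from_batch` calls come from Molmo's custom remote code, so the exact API is best checked against the card for the checkpoint you pick:

```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-7B-D-0924"  # the 1B and 72B checkpoints follow the same pattern

# Molmo ships its own modeling code with the checkpoint, hence trust_remote_code=True
processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Download a test image and build the multimodal prompt
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")

# Move inputs to the model's device and add a batch dimension
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate_from_batch is part of Molmo's remote code (not standard transformers)
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Keep only the newly generated tokens and decode them
generated_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated_tokens, skip_special_tokens=True))
```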

The key to Molmo’s performance lies in its unique training data, PixMo. This highly curated dataset consists of 1 million image-text pairs and includes two main types of data:

- dense, detailed image captions used for pre-training, collected via spoken descriptions (see below);
- diverse supervised fine-tuning data, including question answering and the 2D pointing annotations that let the models point at objects in an image.

Interestingly, the Molmo researchers used an innovative approach to data collection: they asked annotators to describe each image out loud within 60 to 90 seconds, covering everything visible, including the spatial positioning of objects and the relationships among them. The results show that annotators produced detailed captions far more efficiently this way than with the traditional method of writing them down. Overall, they collected high-quality audio descriptions for 712k images sampled from 50 high-level topics.
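
If you want to poke at the caption data itself, here is a minimal sketch with the `datasets` library, assuming the caption subset is published as allenai/pixmo-cap on the Hugging Face Hub and exposes image_url/caption fields (check the dataset card for the actual schema):

```python
from datasets import load_dataset

# PixMo is released as several subsets on the Hugging Face Hub; the dense
# caption data described above is assumed here to live under allenai/pixmo-cap.
ds = load_dataset("allenai/pixmo-cap", split="train")
print(ds)  # number of rows and column names

example = ds[0]
print(example["image_url"])      # assumed field: URL of the source image
print(example["caption"][:300])  # assumed field: the detailed caption text
```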

Key features:

Points to be cautious about:

Reading

Articles


Tags: ai, model, vision

Last modified 22 March 2026