vLLM Production Stack - vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
Looks more complicated than Ollama or LocalAI, which makes a certain amount of sense, I suppose.
vLLM serves one model per server instance; to run multiple models, run separate instances of vLLM on different ports. (This isn't true of every engine, though: Ollama, for instance, can swap between multiple models behind a single endpoint.)
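A minimal sketch of the multi-instance pattern, assuming vLLM is pip-installed locally rather than run via Docker (model names, ports, and the memory split are illustrative; two instances sharing one GPU each need to claim less than vLLM's default ~90% of GPU memory):

# Two models, two servers, two ports; halve each instance's GPU
# memory claim so both fit on one card (the default is ~0.9).
vllm serve Qwen/Qwen3-0.6B --port 8000 --gpu-memory-utilization 0.45 &
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8001 --gpu-memory-utilization 0.45 &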
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B
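In that command, the volume mount caches Hugging Face downloads across container restarts, HF_TOKEN is only needed for gated models, and --ipc=host lets the container use the host's shared memory, which PyTorch needs. Once the container is up, it speaks the standard OpenAI API on port 8000; a quick smoke test with curl (the prompt is arbitrary):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Say hello."}]
        }'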
Last modified 07 May 2026