vLLM Production Stack - vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
Looks more complicated than Ollama or LocalAI, which makes a certain amount of sense, I suppose.
vLLM serves one model per server instance; to run multiple models, run separate instances of vLLM on different ports. (This isn't true of every engine, though: Ollama, for instance, can swap between multiple models behind a single endpoint.)
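A minimal sketch of the multi-instance pattern, assuming vLLM is pip-installed locally rather than run via Docker (model names, ports, and the memory split are illustrative; two instances sharing one GPU each need to claim less than vLLM's default ~90% of GPU memory):

# Two models, two servers, two ports; halve each instance's GPU
# memory claim so both fit on one card (the default is ~0.9).
vllm serve Qwen/Qwen3-0.6B --port 8000 --gpu-memory-utilization 0.45 &
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8001 --gpu-memory-utilization 0.45 &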
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B
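In that command, the volume mount caches Hugging Face downloads across container restarts, HF_TOKEN is only needed for gated models, and --ipc=host lets the container use the host's shared memory, which PyTorch needs. Once the container is up, it speaks the standard OpenAI API on port 8000; a quick smoke test with curl (the prompt is arbitrary):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Say hello."}]
        }'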
Last modified 07 May 2026