Website | Source | Search (models)

Installing

Docker

docker run --name=ollama -d --gpus=all --volume ./ollama:/root/.ollama -p 11434:11434 ollama/ollama
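To check the container is serving, the REST API can be queried directly. A minimal stdlib-only sketch; `/api/tags` is Ollama's model-listing endpoint, and the parsing is split out so it works without a running server:

```python
import json
import urllib.request

def parse_tags(raw: bytes) -> list[str]:
    """Extract model names from an Ollama /api/tags JSON response."""
    return [m["name"] for m in json.loads(raw).get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_tags(resp.read())
```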

Might also want:

Using

ollama run <model>: Download (if needed) and run the given model
ollama pull <model>: Download the given model
ollama serve: Start the Ollama server (takes no model argument; models load on demand)
ollama launch <tool>: Run the given tool (or, if no tool is given, list the available ones--Claude Code, Codex, etc.)

Client SDKs

Python | JS

Install: pip install ollama

Usage:

from ollama import chat
from ollama import ChatResponse

response: ChatResponse = chat(model='gemma3', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])
print(response['message']['content'])
# or access fields directly from the response object
print(response.message.content)

Streaming responses:

from ollama import chat

stream = chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)
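Each streamed chunk has the same shape as a non-streaming response, so collecting the full reply is just a join. A small helper, assuming that chunk shape:

```python
def accumulate(chunks) -> str:
    """Join the content of streamed chat chunks into the full reply text."""
    return "".join(c["message"]["content"] for c in chunks)
```

Useful when you want the whole text afterwards instead of (or as well as) printing it as it arrives.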

Using from VSCode

Install the Continue.dev extension.

Using with Claude Code

ollama launch claude - seems to handle configuration well (v 0.15+)

Another article recommends exporting some environment variables:

"To run Claude Code, type the following command in a PowerShell command-line window."

PS C:\Users\thoma> $env:ANTHROPIC_AUTH_TOKEN = "ollama"
PS C:\Users\thoma> $env:ANTHROPIC_API_KEY = ""
PS C:\Users\thoma> $env:ANTHROPIC_BASE_URL = "http://localhost:11434"

PS C:\Users\thoma> claude --model gpt-oss:20b --dangerously-skip-permissions

This article confirms that:

"The setup handles the Anthropic API configuration behind the scenes. Previously, you’d need to manually set:

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

The ANTHROPIC_AUTH_TOKEN is set to "ollama" because the API key is required by the SDK, but Ollama doesn't validate it. The ANTHROPIC_BASE_URL points to your local Ollama instance running on port 11434.

By default, Ollama sets context length to 4,096 tokens.

Coding tools need much more than that to work properly. ***You’ll want at least 64,000 tokens for Claude Code and similar tools.***"
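The same environment setup can be scripted before spawning Claude Code; a sketch (the claude invocation is commented out since it needs the CLI installed):

```python
import os
import subprocess  # used by the commented-out launch below

# Ollama ignores the token value, but the Anthropic SDK requires one to be set.
env = {
    **os.environ,
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
}

# subprocess.run(["claude", "--model", "qwen-32k"], env=env)
```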

Models

Local models are impressive, but they require adequate hardware to run:

The speed depends on your specific setup. Smaller models like qwen2.5-coder:7b should be fast enough for daily coding work.

Check current context length: ollama show qwen2.5-coder:7b

Run with custom context length: You can set the context window directly when starting the model by creating a custom Modelfile. First, create a file called Modelfile:

FROM qwen2.5-coder:7b
PARAMETER num_ctx 32768

Then create a new model with this configuration: ollama create qwen-32k -f Modelfile

Now run Claude Code with the larger context: claude --model qwen-32k

Common context sizes: the larger the context, the more RAM your model will use, so balance your context needs with available system resources.
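The RAM cost of a larger window is dominated by the KV cache, which grows linearly with num_ctx. A back-of-the-envelope estimator; the layer/head defaults are illustrative (roughly in the range of a 7B model like qwen2.5-coder, but check your model's actual config):

```python
def kv_cache_bytes(num_ctx: int,
                   n_layers: int = 28,
                   n_kv_heads: int = 4,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """K and V tensors (hence the factor 2) per layer, per token, in fp16."""
    return 2 * n_layers * num_ctx * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(32768) / 2**30, "GiB")  # 1.75 GiB with these assumed dims
```

Doubling num_ctx doubles this figure, on top of the fixed cost of the weights themselves.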

Modelfiles

Reading

Articles

This Modelfile bases a new model on Llama 2, sets an inference temperature, and provides a default system prompt. Ollama itself is cross-platform (macOS, Linux, and Windows), uses GPU hardware for accelerated inference whenever it is available, and maintains a growing model hub of pre-trained models such as Llama 2 and Mistral, all available for direct download.
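The article's Modelfile isn't reproduced here; using the documented Modelfile directives (FROM, PARAMETER, SYSTEM), a sketch of such a file might look like this, with the temperature value and prompt text purely illustrative:

```
FROM llama2
PARAMETER temperature 0.7
SYSTEM "You are a concise assistant."
```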

Docs


Tags: ai   inference engine  

Last modified 22 March 2026