Complete Ollama Cheatsheet
Table of Contents
- Overview
- Installation
- Core CLI Commands
- Working with Models
- Popular Models
- The Modelfile
- Native REST API
- OpenAI-Compatible API
- Environment Variables
- Tool / Function Calling
- Structured Outputs (JSON)
- Multimodal & Vision Models
- Embeddings
- GPU / Hardware
- Docker & Server Deployment
- Integrations
- Troubleshooting
- Quick Reference
Overview
Ollama is a local LLM runtime that pulls open-weight models (Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, …) and serves them via a CLI and HTTP API on http://localhost:11434. It packages models as single files with a Modelfile describing weights, prompt template, and parameters.
ollama run llama3.2
# (or any model from https://ollama.com/library)
Installation
# macOS: download the .app from ollama.com, or:
brew install ollama
brew services start ollama # run as a background service
# Linux (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Linux (manual systemd; assumes the ollama binary and an ollama.service unit file are already installed)
sudo useradd -r -s /bin/false ollama
sudo systemctl enable --now ollama
# Windows — download the .exe installer from ollama.com
# Docker
docker run -d --name ollama -p 11434:11434 \
-v ollama:/root/.ollama ollama/ollama
# Docker with NVIDIA GPU
docker run -d --gpus=all --name ollama \
-p 11434:11434 -v ollama:/root/.ollama \
ollama/ollama
# Verify
ollama --version
curl http://localhost:11434/api/version
Core CLI Commands
# Run / chat (downloads if missing)
ollama run <model>
ollama run <model> "Single-shot prompt"
ollama run <model> --verbose # show tokens/sec
ollama run <model> --format json # JSON-formatted output
# Model management
ollama pull <model> # download only
ollama pull <model>:<tag> # specific tag (e.g. llama3.2:3b)
ollama push <user>/<model> # push to ollama.com (logged in)
ollama list # local models
ollama ls # alias of list
ollama ps # running/loaded models in VRAM
ollama show <model> # Modelfile, params, template
ollama show <model> --modelfile # raw Modelfile
ollama show <model> --parameters
ollama show <model> --template
ollama show <model> --system
ollama show <model> --license
ollama cp <source> <dest> # duplicate a local model
ollama rm <model> # delete local model
ollama stop <model> # unload from memory
# Custom model creation
ollama create <name> -f Modelfile
ollama create <name> -f Modelfile -q q4_K_M # quantize on create
# Server
ollama serve # foreground; reads OLLAMA_* env
ollama help <command>
Inside the REPL (ollama run)
/?                               help
/set system "<msg>"              set a system prompt in this session
/set parameter temperature 0.2
/set parameter num_ctx 8192
/set format json
/set verbose
/show info
/show modelfile
/load <model>
/save <name>                     save current session to a new model
/clear                           clear chat context
/bye                             exit
Working with Models
# Specifying size / quantization via tag
ollama pull llama3.1:8b
ollama pull llama3.1:70b
ollama pull llama3.2:1b
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull mistral:7b-instruct-v0.3
# Inspect storage
ls ~/.ollama/models # default location
ollama show <model> | grep -i "parameters\|quant"
# Update all local models
for m in $(ollama list | awk 'NR>1 {print $1}'); do ollama pull "$m"; done
Popular Models
| Model | Pull tag examples | Notes |
|---|---|---|
| Llama 3.1 | llama3.1, llama3.1:70b | Meta general-purpose chat |
| Llama 3.2 | llama3.2, llama3.2:1b, :3b | Small/edge models |
| Llama 3.2-vision | llama3.2-vision:11b, :90b | Vision-capable |
| Llama 3.3 | llama3.3:70b | Improved reasoning |
| Qwen 2.5 | qwen2.5:7b, :14b, :32b, :72b | Strong multilingual |
| Qwen 2.5-coder | qwen2.5-coder:7b, :32b | Coding model |
| Mistral / Nemo | mistral, mistral-nemo, mistral-large | |
| Mixtral | mixtral:8x7b, mixtral:8x22b | MoE |
| Gemma 2 | gemma2:2b, :9b, :27b | Google open |
| Phi-3 / Phi-3.5 | phi3, phi3.5 | Small Microsoft models |
| DeepSeek-Coder-V2 | deepseek-coder-v2:16b, :236b | Coding |
| DeepSeek R1 | deepseek-r1:7b, :14b, :32b, :70b, :671b | Reasoning |
| Granite 3 | granite3-dense:8b, granite3-moe:3b | IBM |
| LLaVA | llava:7b, :13b, :34b | Vision |
| MiniCPM-V | minicpm-v:8b | Vision |
| Nomic Embed | nomic-embed-text | Embeddings |
| MXBai Embed | mxbai-embed-large | Embeddings |
Tags follow <model>:<size>-<variant>-<quant> (e.g. qwen2.5:14b-instruct-q4_K_M). Browse https://ollama.com/library for the full list.
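The tag convention above can be split programmatically. The following is a minimal sketch, not part of Ollama: `ModelRef` and `parse_model_ref` are hypothetical names, and the size and quantization patterns are heuristics guessed from the tags listed in this cheatsheet, so unusual tags may be classified differently.

```python
import re
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    model: str
    size: Optional[str]
    variant: Optional[str]
    quant: Optional[str]


# Heuristic patterns (assumptions): sizes look like 3b, 1.5b, 8x7b; quants like q4_K_M, fp16.
_SIZE = re.compile(r"^(\d+x)?\d+(\.\d+)?[bm]$", re.IGNORECASE)
_QUANT = re.compile(r"^(q\d|iq\d|fp16|f16|bf16|f32)", re.IGNORECASE)


def parse_model_ref(ref: str) -> ModelRef:
    """Split '<model>:<size>-<variant>-<quant>' into its parts; missing parts are None."""
    model, _, tag = ref.partition(":")
    parts = tag.split("-") if tag else []
    # The size, when present, is the first segment of the tag.
    size = parts.pop(0) if parts and _SIZE.match(parts[0]) else None
    # The quantization, when present, is the last segment of the tag.
    quant = parts.pop() if parts and _QUANT.match(parts[-1]) else None
    # Whatever remains in the middle is treated as the variant.
    variant = "-".join(parts) or None
    return ModelRef(model, size, variant, quant)
```

For example, `parse_model_ref("mistral:7b-instruct-v0.3")` yields size `7b`, variant `instruct-v0.3`, and no quantization.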
The Modelfile
A Modelfile builds a custom model on top of a base.
# Modelfile
FROM llama3.2:3b

# Sampling
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
PARAMETER num_predict 512
PARAMETER stop "<|eot_id|>"
PARAMETER seed 42

# System prompt baked in
SYSTEM """
You are a concise, friendly devops assistant.
Answer in <= 6 lines.
"""

# Optional: override the chat template
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>{{ end }}
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""

# Optional: LoRA adapter
ADAPTER ./my-lora.safetensors

# Optional: a few-shot example
MESSAGE user "Hi!"
MESSAGE assistant "Hi, what shall we deploy today?"

# License / metadata
LICENSE "MIT"
# Build the custom model
ollama create devops-helper -f Modelfile
# Or quantize the base on the way in
ollama create devops-helper:q4 -f Modelfile -q q4_K_M
# Run it
ollama run devops-helper "Plan a zero-downtime deploy on Kubernetes."
Importing GGUF / Safetensors weights
FROM ./my-model.gguf
FROM ./my-model-Q5_K_M.gguf

# Safetensors directory (Ollama auto-converts)
FROM ./hf-snapshot/
Native REST API
Base URL: http://localhost:11434
Generate (single completion)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": { "temperature": 0.2, "num_ctx": 4096 }
}'
Chat (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{ "role": "system", "content": "Be terse." },
{ "role": "user", "content": "Summarise GitFlow in 1 sentence." }
],
"stream": false
}'
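Multi-turn chat means the client resends the whole message list on every call, including earlier assistant replies. Below is a minimal sketch of that bookkeeping using only the standard library. `ChatSession` and `post_chat` are hypothetical helper names, and the `send` parameter is an assumption added so the class can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this cheatsheet


def post_chat(payload: dict) -> dict:
    """POST a payload to /api/chat and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


class ChatSession:
    """Hypothetical helper that keeps the running message list for one conversation."""

    def __init__(self, model: str, system: str = "",
                 send: Callable[[dict], dict] = post_chat):
        self.model = model
        self.send = send
        self.messages = [{"role": "system", "content": system}] if system else []

    def ask(self, text: str) -> str:
        """Append the user turn, send the full history, and record the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self.send({"model": self.model, "messages": self.messages, "stream": False})
        self.messages.append(reply["message"])  # keep the assistant turn as context
        return reply["message"]["content"]
```

With a server running, `ChatSession("llama3.2", system="Be terse.").ask("Summarise GitFlow in 1 sentence.")` would send the same body as the curl call above.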
Streaming
# Default is stream:true (newline-delimited JSON)
curl -N http://localhost:11434/api/chat -d '{
"model":"llama3.2",
"messages":[{"role":"user","content":"count to 5"}]
}'
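A streamed reply arrives as one JSON object per line, each carrying a fragment of the message, with the last object marked `"done": true`. The sketch below shows one way to reassemble it. `iter_chat_chunks` and `stream_chat` are hypothetical names, and the parser assumes every line is a complete JSON object.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text of each streamed chunk and stop after the chunk marked done."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(model: str, prompt: str,
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Send one user message to /api/chat and yield text fragments as they arrive."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:  # the response iterates line by line
        yield from iter_chat_chunks(resp)
```

A caller could print fragments as they arrive with `for piece in stream_chat("llama3.2", "count to 5"): print(piece, end="", flush=True)`.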
Other endpoints
GET /api/tags # list local models
POST /api/show { "name": "llama3.2" }
POST /api/pull { "name": "llama3.2" }
POST /api/push { "name": "user/model" }
POST /api/copy { "source": "a", "destination": "b" }
DELETE /api/delete { "name": "llama3.2" }
POST /api/embed { "model": "nomic-embed-text", "input": "..." }
POST /api/create { "name": "...", "modelfile": "FROM llama3.2\n..." }
GET /api/ps # loaded models
GET /api/version
Common request fields
{
"model": "llama3.2",
"prompt": "…", // /api/generate
"messages": [ … ], // /api/chat
"system": "…",
"template": "…",
"format": "json", // or a JSON Schema (see Structured Outputs)
"stream": true,
"keep_alive": "5m", // unload after idle
"context": [ … ], // /api/generate: previous returned context for continuation
"raw": false, // bypass templating
"images": [ "<base64>" ], // multimodal
"tools": [ … ], // tool calling
"options": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"num_ctx": 8192,
"num_predict": 512,
"repeat_penalty": 1.1,
"seed": 42,
"stop": ["</s>"],
"num_gpu": 99,
"num_thread": 8
}
}
OpenAI-Compatible API
Drop-in for OpenAI SDKs — point base_url at Ollama.
# Chat completions
curl http://localhost:11434/v1/chat/completions \
-H "Authorization: Bearer ollama" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role":"user","content":"Hello"}]
}'
# Completions (legacy)
POST /v1/completions
# Embeddings
POST /v1/embeddings
# Models list
GET /v1/models
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Say hi"}],
)
print(resp.choices[0].message.content)
Node.js (OpenAI SDK)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:11434/v1", apiKey: "ollama" });
const r = await client.chat.completions.create({
model: "llama3.2",
messages: [{ role: "user", content: "Say hi" }],
});
console.log(r.choices[0].message.content);
Environment Variables
Set via shell, systemd drop-in, or Docker -e.
OLLAMA_HOST=0.0.0.0:11434 # bind address (default 127.0.0.1:11434)
OLLAMA_ORIGINS="*" # allow CORS origins (comma-separated)
OLLAMA_MODELS=/data/ollama/models # model storage path
OLLAMA_KEEP_ALIVE=5m # how long to keep a model loaded; "-1" forever, "0" unload immediately
OLLAMA_NUM_PARALLEL=4 # concurrent requests per model
OLLAMA_MAX_LOADED_MODELS=3 # cap on simultaneously-loaded models
OLLAMA_MAX_QUEUE=512 # queued request limit
OLLAMA_FLASH_ATTENTION=1 # enable Flash Attention (compatible models)
OLLAMA_KV_CACHE_TYPE=f16 # f16 | q8_0 | q4_0
OLLAMA_GPU_OVERHEAD=512MiB # VRAM headroom to leave free
OLLAMA_LLM_LIBRARY=cuda_v12 # force backend (cuda_v11/12, rocm, metal, cpu)
OLLAMA_NOPRUNE=1 # don't auto-prune unused blobs
OLLAMA_DEBUG=1 # verbose logs
OLLAMA_NOHISTORY=1 # disable REPL history
HSA_OVERRIDE_GFX_VERSION=10.3.0 # AMD ROCm spoofing if your card needs it
Linux: persisting env for systemd
sudo systemctl edit ollama.service
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_KEEP_ALIVE=24h"
# Environment="OLLAMA_NUM_PARALLEL=4"
sudo systemctl daemon-reload
sudo systemctl restart ollama
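When the same settings have to be applied on several hosts, the drop-in body can be generated rather than typed. This is a small sketch under that assumption; `render_dropin` is a hypothetical helper, and it only produces the text that `systemctl edit` would otherwise have you enter by hand.

```python
def render_dropin(env: dict[str, str]) -> str:
    """Render a systemd drop-in body with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        # Quote the whole assignment, matching the form shown in the comments above.
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_KEEP_ALIVE": "24h",
    "OLLAMA_NUM_PARALLEL": "4",
}
dropin_text = render_dropin(settings)
```

The resulting text still has to be saved as a drop-in for ollama.service, followed by the `daemon-reload` and `restart` steps shown above.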
Tool / Function Calling
Models supporting tools include Llama 3.1/3.2/3.3, Qwen 2.5, Mistral-Nemo, Mistral-Large, Granite.
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [{"role":"user","content":"Weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"units": { "type": "string", "enum": ["c","f"] }
},
"required": ["city"]
}
}
}],
"stream": false
}'
The response contains a tool_calls array; you execute the function and post back a role: "tool" message with the result, then re-call /api/chat.
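The execute-and-post-back step can be sketched as a small dispatcher. Everything named here is illustrative: `get_weather` is a stand-in that returns fixed data, `TOOLS` and `run_tool_calls` are hypothetical names, and the sketch assumes each entry in `tool_calls` carries a `function` object with a `name` and an `arguments` dictionary.

```python
import json
from typing import Callable


def get_weather(city: str, units: str = "c") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"city": city, "units": units, "temp": 21}


# Registry mapping tool names (as declared in the request) to local functions.
TOOLS: dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and return one role "tool" message per call."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            output = handler(**fn.get("arguments", {}))
        results.append({"role": "tool", "content": json.dumps(output)})
    return results
```

The follow-up request would then send the original messages, the assistant message containing `tool_calls`, and the messages returned by `run_tool_calls`, in that order.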
Structured Outputs (JSON)
# Free-form JSON
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{"role":"user","content":"Give 3 fruits as JSON list"}],
"format": "json",
"stream": false
}'
# Strict JSON Schema
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{"role":"user","content":"Bio of Ada Lovelace"}],
"format": {
"type": "object",
"properties": {
"name": { "type": "string" },
"born": { "type": "integer" },
"famous_for":{"type":"array","items":{"type":"string"}}
},
"required": ["name","born","famous_for"]
},
"stream": false
}'
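The structured result comes back as a JSON string inside `message.content`, so the client still has to decode it. A minimal sketch of that step follows. `parse_structured` is a hypothetical helper, and the key check is a light sanity test rather than full JSON Schema validation.

```python
import json


def parse_structured(reply: dict, required: list[str]) -> dict:
    """Decode message.content as JSON and confirm the required keys are present."""
    data = json.loads(reply["message"]["content"])
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Example reply shaped like a non-streaming /api/chat response (content is made up).
example_reply = {
    "message": {
        "role": "assistant",
        "content": '{"name": "Ada Lovelace", "born": 1815, "famous_for": ["Analytical Engine notes"]}',
    }
}
bio = parse_structured(example_reply, ["name", "born", "famous_for"])
```

Raising on missing keys makes it easy to retry the request when a model ignores part of the schema.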
Multimodal & Vision Models
# CLI: drag image path into the prompt
ollama run llava "What is in ./photo.png?"
# API: pass base64 images
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2-vision",
"messages": [{
"role": "user",
"content": "Describe this picture.",
"images": ["'"$(base64 -w0 photo.png)"'"]
}],
"stream": false
}'
Vision-capable Ollama models: llava, llava-llama3, llava-phi3, bakllava, moondream, minicpm-v, llama3.2-vision.
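The same request body can be built in Python, which avoids depending on the shell's `base64 -w0` (a GNU coreutils flag). This is a sketch only; `vision_payload` is a hypothetical helper name.

```python
import base64


def vision_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an /api/chat body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,
    }
```

In practice the bytes would come from a file, for example `vision_payload("llama3.2-vision", "Describe this picture.", open("photo.png", "rb").read())`.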
Embeddings
# CLI / API
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["hello world", "another sentence"]
}'
Recommended embedding models:
- nomic-embed-text (768-dim, fast)
- mxbai-embed-large (1024-dim, strong English)
- all-minilm (384-dim, tiny)
- bge-m3 (multilingual, 1024-dim)
- snowflake-arctic-embed (multiple sizes)
Python (OpenAI SDK)
# Reuses the `client` created in the OpenAI-Compatible API section
client.embeddings.create(model="nomic-embed-text", input=["doc 1","doc 2"])
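Embeddings are usually compared with cosine similarity. The sketch below ranks documents against a query vector in plain Python. `cosine` and `rank` are hypothetical helpers, and the vectors are assumed to be lists of floats such as those in the `embeddings` field of an /api/embed response.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indexes sorted from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

For larger collections a vector library or database would be the more practical choice; this version only illustrates the calculation.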
GPU / Hardware
# Verify GPU detection
ollama ps # shows "100% GPU" / "50/50 CPU/GPU"
journalctl -u ollama -e # logs (Linux)
# Force a specific GPU (NVIDIA)
CUDA_VISIBLE_DEVICES=0 ollama serve
# Force a specific GPU (AMD ROCm)
HIP_VISIBLE_DEVICES=0 ollama serve
# Limit layers offloaded to GPU
# (per-request via "options": { "num_gpu": 35 })
# 0 = pure CPU; 99 = "as many as fit"
# Quick VRAM math: a Q4 model ≈ params × 0.6 GB
# (7B Q4 ≈ 4.2 GB, 13B Q4 ≈ 7.8 GB, 70B Q4 ≈ 42 GB)
Supported backends: NVIDIA CUDA (12.x/11.x), AMD ROCm, Apple Metal, CPU (AVX2/AVX‑512), Intel oneAPI (experimental).
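The quick VRAM rule of thumb above (Q4 size ≈ parameters in billions × 0.6 GB) is easy to turn into a helper. This is a rough sketch: `q4_vram_gb` and `fits_in_vram` are hypothetical names, and the estimate covers weights only, so context (KV cache) memory has to be budgeted separately through `overhead_gb`.

```python
def q4_vram_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Estimate the size of a Q4 model in GB from its parameter count in billions."""
    return round(params_billion * gb_per_billion, 1)


def fits_in_vram(params_billion: float, vram_gb: float, overhead_gb: float = 0.0) -> bool:
    """Check whether the estimated weights plus a chosen overhead fit in the given VRAM."""
    return q4_vram_gb(params_billion) + overhead_gb <= vram_gb
```

For instance, `q4_vram_gb(7)` gives 4.2 and `q4_vram_gb(70)` gives 42.0, so by this estimate a 70B Q4 model does not fit on a single 24 GB card.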
Docker & Server Deployment
# Simple
docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
# With GPU
docker run -d --gpus all --name ollama -p 11434:11434 \
-v ollama:/root/.ollama ollama/ollama
# Pre-pull a model into the container
docker exec -it ollama ollama pull llama3.2
# docker-compose.yml
services:
ollama:
image: ollama/ollama
ports: ["11434:11434"]
volumes: ["ollama:/root/.ollama"]
environment:
OLLAMA_KEEP_ALIVE: "24h"
OLLAMA_NUM_PARALLEL: "4"
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: ["gpu"]
volumes: { ollama: {} }
Production tips
- Put a reverse proxy (Caddy / nginx) in front of Ollama and terminate TLS there.
- Set OLLAMA_HOST=0.0.0.0:11434 and protect with auth at the proxy layer (Ollama has no built-in auth).
- Restrict OLLAMA_ORIGINS to known frontends.
- Set OLLAMA_KEEP_ALIVE=-1 for hot models, 0 for rarely-used ones.
Integrations
# LangChain
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.2", temperature=0)
# LlamaIndex
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3.2", request_timeout=120.0)
# LiteLLM
import litellm
litellm.completion(model="ollama/llama3.2", messages=[{"role":"user","content":"hi"}])
# Continue, Open WebUI, Page Assist, Msty, Cody, Aider — all support
# Ollama out of the box; point them at http://localhost:11434.
Troubleshooting
# "model not found"
ollama list # spelling / tag?
ollama pull <model>
# "Error: listen tcp 127.0.0.1:11434: bind: address already in use"
lsof -i :11434 # find offender
sudo systemctl restart ollama
OLLAMA_HOST=127.0.0.1:11435 ollama serve
# Stuck pull
rm -f ~/.ollama/models/blobs/sha256-*-partial* # remove leftover partial downloads
ollama pull <model>
# Out of memory / VRAM
# - smaller quant (q4_0 → q3_K_M)
# - smaller size (70b → 8b)
# - lower num_ctx
# - num_gpu N to offload only N layers
ollama show <model> | grep "context length"
# Slow CPU performance
# - install AVX2-capable build
# - set OMP_NUM_THREADS / OLLAMA_NUM_PARALLEL
# - enable OLLAMA_FLASH_ATTENTION=1 (compatible models)
# Verbose debugging
OLLAMA_DEBUG=1 ollama serve
# Wipe everything
rm -rf ~/.ollama
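For the "address already in use" case, a script can check whether something is already listening before it tries to start a second server. A minimal sketch, assuming the default bind address from this cheatsheet; `port_in_use` is a hypothetical helper.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

A True result only shows that some process holds the port; `lsof -i :11434` is still needed to see which one.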
Quick Reference
# Daily
ollama run llama3.2
ollama ps
ollama list
ollama show llama3.2
# Customize
ollama create mybot -f Modelfile
ollama cp llama3.2 llama3.2-tuned
# Maintenance
ollama pull llama3.2 # update
ollama rm llama3.2:1b
ollama stop llama3.2
# API smoke test
curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags
Tip: treat keep_alive as a knob: -1 to pin a model in VRAM, 0 to unload after every request, or a duration like "30m" for batch jobs.
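The same knob can be used to warm up or evict a model explicitly: a request to /api/generate that names a model but carries no prompt loads it, and one with `keep_alive` set to 0 unloads it. The helpers below only build the request bodies; `preload_payload` and `unload_payload` are hypothetical names.

```python
def preload_payload(model: str, keep_alive=-1) -> dict:
    """Body for /api/generate that loads a model without generating any text."""
    # No "prompt" key: the server only loads the model and applies keep_alive.
    return {"model": model, "keep_alive": keep_alive}


def unload_payload(model: str) -> dict:
    """Body for /api/generate that asks the server to unload the model right away."""
    return {"model": model, "keep_alive": 0}
```

Either body can be sent with curl, for example `curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'`.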