Complete Ollama Cheatsheet

Overview
Installation
Core CLI Commands
Working with Models
Popular Models
The Modelfile
Native REST API
OpenAI-Compatible API
Environment Variables
Tool / Function Calling
Structured Outputs (JSON)
Multimodal & Vision Models
Embeddings
GPU / Hardware
Docker & Server Deployment
Integrations
Troubleshooting
Quick Reference

Overview

Ollama is a local LLM runtime that pulls open-weight models (Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, …) and serves them through a CLI and an HTTP API at http://localhost:11434. Each model bundles its weights, prompt template, and default parameters, and a Modelfile describes that bundle.

Bash

ollama run llama3.2
# (or any model from https://ollama.com/library)

Installation

Bash

# macOS — download the .app from ollama.com, or:
brew install ollama
brew services start ollama          # run as a background service

# Linux (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh

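# Linux (manual systemd; assumes an ollama.service unit file already exists in /etc/systemd/system/)
sudo useradd -r -s /bin/false ollama
sudo systemctl daemon-reload
sudo systemctl enable --now ollama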

# Windows — download the .exe installer from ollama.com

# Docker
docker run -d --name ollama -p 11434:11434 \
    -v ollama:/root/.ollama ollama/ollama

# Docker with NVIDIA GPU
docker run -d --gpus=all --name ollama \
    -p 11434:11434 -v ollama:/root/.ollama \
    ollama/ollama

# Verify
ollama --version
curl http://localhost:11434/api/version

Core CLI Commands

Bash

# Run / chat (downloads if missing)
ollama run <model>
ollama run <model> "Single-shot prompt"
ollama run <model> --verbose                  # show tokens/sec
ollama run <model> --format json              # JSON-formatted output

# Model management
ollama pull <model>                           # download only
ollama pull <model>:<tag>                     # specific tag (e.g. llama3.2:3b)
ollama push <user>/<model>                    # push to ollama.com (logged in)
ollama list                                   # local models
ollama ls                                     # alias of list
ollama ps                                     # running/loaded models in VRAM
ollama show <model>                           # Modelfile, params, template
ollama show <model> --modelfile               # raw Modelfile
ollama show <model> --parameters
ollama show <model> --template
ollama show <model> --system
ollama show <model> --license
ollama cp <source> <dest>                     # duplicate a local model
ollama rm <model>                             # delete local model
ollama stop <model>                           # unload from memory

# Custom model creation
ollama create <name> -f Modelfile
ollama create <name> -f Modelfile -q q4_K_M   # quantize on create (F16/F32 source weights only)

# Server
ollama serve                                  # foreground; reads OLLAMA_* env
ollama help <command>

Inside the REPL (`ollama run`)

text

/?            help
/set system "<msg>"   set a system prompt in this session
/set parameter temperature 0.2
/set parameter num_ctx 8192
/set format json
/set verbose
/show info
/show modelfile
/load <model>
/save <name>          save current session to a new model
/clear                clear chat context
/bye                  exit

Working with Models

Bash

# Specifying size / quantization via tag
ollama pull llama3.1:8b
ollama pull llama3.1:70b
ollama pull llama3.2:1b
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull mistral:7b-instruct-v0.3

# Inspect storage
ls ~/.ollama/models                            # default location
ollama show <model> | grep -i "parameters\|quant"

# Update all local models
for m in $(ollama list | awk 'NR>1 {print $1}'); do ollama pull "$m"; done

Popular Models

Model	Pull tag examples	Notes
Llama 3.1	`llama3.1`, `llama3.1:70b`	Meta general-purpose chat
Llama 3.2	`llama3.2`, `llama3.2:1b`, `:3b`	Small/edge models
Llama 3.2-vision	`llama3.2-vision:11b`, `:90b`	Vision-capable
Llama 3.3	`llama3.3:70b`	Improved reasoning
Qwen 2.5	`qwen2.5:7b`, `:14b`, `:32b`, `:72b`	Strong multilingual
Qwen 2.5-coder	`qwen2.5-coder:7b`, `:32b`	Coding model
Mistral / Nemo	`mistral`, `mistral-nemo`, `mistral-large`	Mistral AI general-purpose chat
Mixtral	`mixtral:8x7b`, `mixtral:8x22b`	MoE
Gemma 2	`gemma2:2b`, `:9b`, `:27b`	Google open
Phi-3 / Phi-3.5	`phi3`, `phi3.5`	Small Microsoft models
DeepSeek-Coder-V2	`deepseek-coder-v2:16b`, `:236b`	Coding
DeepSeek R1	`deepseek-r1:7b`, `:14b`, `:32b`, `:70b`, `:671b`	Reasoning
Granite 3	`granite3-dense:8b`, `granite3-moe:3b`	IBM
LLaVA	`llava:7b`, `:13b`, `:34b`	Vision
MiniCPM-V	`minicpm-v:8b`	Vision
Nomic Embed	`nomic-embed-text`	Embeddings
MXBai Embed	`mxbai-embed-large`	Embeddings

Tags follow <model>:<size>-<variant>-<quant> (e.g. qwen2.5:14b-instruct-q4_K_M). Browse https://ollama.com/library for the full list.
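
The tag convention above can be split mechanically. The following is a minimal sketch, not part of Ollama: `parse_tag` and `ModelTag` are hypothetical names, and the rule that a trailing part starting with `q`, `f16`, or `fp16` is the quantization is an assumption. Tags that do not follow the convention (for example `latest`) will land in the `size` field.

```python
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    model: str
    size: Optional[str]
    variant: Optional[str]
    quant: Optional[str]


def parse_tag(ref: str) -> ModelTag:
    """Split '<model>:<size>-<variant>-<quant>' into its parts (hypothetical helper)."""
    model, _, tag = ref.partition(":")
    if not tag:
        # No tag given, e.g. "llama3.2"
        return ModelTag(model, None, None, None)
    parts = tag.split("-")
    size = parts[0]
    quant = None
    # Assumption: a trailing part that starts with q / f16 / fp16 names the quantization.
    if len(parts) > 1 and parts[-1].lower().startswith(("q", "f16", "fp16")):
        quant = parts.pop()
    # Whatever sits between size and quant is treated as the variant.
    variant = "-".join(parts[1:]) or None
    return ModelTag(model, size, variant, quant)


# Example: parse_tag("qwen2.5:14b-instruct-q4_K_M").quant is "q4_K_M"
```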

The Modelfile

A Modelfile builds a custom model on top of a base.

Dockerfile

# Modelfile
FROM llama3.2:3b

# Sampling
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
PARAMETER num_predict 512
PARAMETER stop "<|eot_id|>"
PARAMETER seed 42

# System prompt baked in
SYSTEM """
You are a concise, friendly devops assistant. Answer in <= 6 lines.
"""

# Optional: override the chat template
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>{{ end }}
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""

# Optional: LoRA adapter
ADAPTER ./my-lora.safetensors

# Optional: a few-shot example
MESSAGE user "Hi!"
MESSAGE assistant "Hi — what shall we deploy today?"

# License / metadata
LICENSE "MIT"

Bash

# Build the custom model
ollama create devops-helper -f Modelfile

# Or quantize on the way in (only works when FROM points at F16/F32 weights, not an already-quantized tag)
ollama create devops-helper:q4 -f Modelfile -q q4_K_M

# Run it
ollama run devops-helper "Plan a zero-downtime deploy on Kubernetes."

Importing GGUF / Safetensors weights

Dockerfile

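# Use one FROM line per Modelfile; the lines below are alternatives.
# GGUF file
FROM ./my-model.gguf
# GGUF file that is already quantized
FROM ./my-model-Q5_K_M.gguf
# Safetensors directory (Ollama auto-converts)
FROM ./hf-snapshot/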

Native REST API

Base URL: http://localhost:11434

Generate (single completion)

Bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "temperature": 0.2, "num_ctx": 4096 }
}'

Chat (multi-turn)

Bash

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "system", "content": "Be terse." },
    { "role": "user",   "content": "Summarise GitFlow in 1 sentence." }
  ],
  "stream": false
}'
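
The same multi-turn call can be driven from Python by resending the whole message list on every turn, since /api/chat keeps no state between requests. This is a minimal sketch using only the standard library; `build_chat_payload` and `chat` are hypothetical helper names, and `chat` assumes a server is running at the default address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this cheatsheet


def build_chat_payload(model, history, user_text, system=None):
    """Return the JSON body for /api/chat with the new user turn appended."""
    messages = []
    # Only add the system message on the first turn; later it is already in history.
    if system and not history:
        messages.append({"role": "system", "content": system})
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": False}


def chat(model, history, user_text, system=None):
    """POST one turn and return (reply_text, updated_history). Needs a running server."""
    payload = build_chat_payload(model, history, user_text, system)
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]
    # Keep the assistant reply so the next turn sees the full conversation.
    return reply["content"], payload["messages"] + [reply]


# Usage (requires the server):
#   text, history = chat("llama3.2", [], "Summarise GitFlow in 1 sentence.", system="Be terse.")
#   text, history = chat("llama3.2", history, "Now in five words.")
```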

Streaming

Bash

# Default is stream:true — newline-delimited JSON
curl -N http://localhost:11434/api/chat -d '{
  "model":"llama3.2",
  "messages":[{"role":"user","content":"count to 5"}]
}'
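
On the client side, each line of the stream is one JSON object whose `message.content` holds the next text fragment, and the final object carries `"done": true`. A minimal sketch of a consumer follows; `iter_stream_text` is a hypothetical helper that accepts any iterable of lines, so it works on an HTTP response as well as on canned data.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from newline-delimited JSON chunks of /api/chat."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the last chunk carries stats, not more text


# Usage with the standard library (requires the server):
#   import urllib.request
#   body = json.dumps({"model": "llama3.2",
#                      "messages": [{"role": "user", "content": "count to 5"}]}).encode()
#   req = urllib.request.Request("http://localhost:11434/api/chat", data=body)
#   with urllib.request.urlopen(req) as resp:
#       for piece in iter_stream_text(resp):
#           print(piece, end="", flush=True)
```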

Other endpoints

Bash

GET  /api/tags                    # list local models
POST /api/show       { "model": "llama3.2" }
POST /api/pull       { "model": "llama3.2" }
POST /api/push       { "model": "user/model" }
POST /api/copy       { "source": "a", "destination": "b" }
DELETE /api/delete   { "model": "llama3.2" }
POST /api/embed      { "model": "nomic-embed-text", "input": "..." }
POST /api/create     { "model": "...", "from": "llama3.2", "system": "..." }
# Older releases used "name" instead of "model", and /api/create took a "modelfile" string.
GET  /api/ps                       # loaded models
GET  /api/version
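
As an illustration of consuming one of these endpoints, the sketch below turns a GET /api/tags response into a size-sorted list. It relies on the `models` array with `name` and `size` (bytes) fields; `summarize_models` and `fetch_tags` are hypothetical helper names, and the sizes in the usage comment are made-up sample values.

```python
import json
import urllib.request


def summarize_models(tags_response):
    """Return (name, size_in_GB) pairs sorted largest first."""
    rows = [
        (m["name"], round(m["size"] / 1e9, 2))
        for m in tags_response.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """GET /api/tags from a running server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


# Usage (requires the server):
#   for name, gb in summarize_models(fetch_tags()):
#       print(f"{name:40s} {gb:8.2f} GB")
```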

Common request fields

jsonc

{
  "model": "llama3.2",
  "prompt": "…",            // /api/generate
  "messages": [ … ],        // /api/chat
  "system": "…",
  "template": "…",
  "format": "json",         // or a JSON Schema (see Structured Outputs)
  "stream": true,
  "keep_alive": "5m",       // unload after idle
  "context": [ … ],         // /api/generate: previous returned context for continuation
  "raw": false,             // bypass templating
  "images": [ "<base64>" ], // multimodal
  "tools": [ … ],           // tool calling
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_ctx": 8192,
    "num_predict": 512,
    "repeat_penalty": 1.1,
    "seed": 42,
    "stop": ["</s>"],
    "num_gpu": 99,
    "num_thread": 8
  }
}

OpenAI-Compatible API

Drop-in for OpenAI SDKs — point base_url at Ollama.

Bash

# Chat completions
curl http://localhost:11434/v1/chat/completions \
  -H "Authorization: Bearer ollama" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role":"user","content":"Hello"}]
  }'

# Completions (legacy)
POST /v1/completions
# Embeddings
POST /v1/embeddings
# Models list
GET  /v1/models

Python (OpenAI SDK)

Python

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hi"}],
)
print(resp.choices[0].message.content)

Node.js (OpenAI SDK)

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:11434/v1", apiKey: "ollama" });
const r = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "Say hi" }],
});
console.log(r.choices[0].message.content);

Environment Variables

Set via shell, systemd drop-in, or Docker -e.

Bash

OLLAMA_HOST=0.0.0.0:11434           # bind address (default 127.0.0.1:11434)
OLLAMA_ORIGINS="*"                  # allow CORS origins (comma-separated)
OLLAMA_MODELS=/data/ollama/models   # model storage path
OLLAMA_KEEP_ALIVE=5m                # how long to keep a model loaded; "-1" forever, "0" unload immediately
OLLAMA_NUM_PARALLEL=4               # concurrent requests per model
OLLAMA_MAX_LOADED_MODELS=3          # cap on simultaneously-loaded models
OLLAMA_MAX_QUEUE=512                # queued request limit
OLLAMA_FLASH_ATTENTION=1            # enable Flash Attention (compatible models)
OLLAMA_KV_CACHE_TYPE=f16            # f16 | q8_0 | q4_0
OLLAMA_GPU_OVERHEAD=536870912       # VRAM headroom to leave free per GPU, in bytes (512 MiB)
OLLAMA_LLM_LIBRARY=cuda_v12         # force backend (cuda_v11/12, rocm, metal, cpu)
OLLAMA_NOPRUNE=1                    # don't auto-prune unused blobs
OLLAMA_DEBUG=1                      # verbose logs
OLLAMA_NOHISTORY=1                  # disable REPL history
HSA_OVERRIDE_GFX_VERSION=10.3.0     # AMD ROCm spoofing if your card needs it

Linux: persisting env for `systemd`

Bash

sudo systemctl edit ollama.service
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_KEEP_ALIVE=24h"
# Environment="OLLAMA_NUM_PARALLEL=4"
sudo systemctl daemon-reload
sudo systemctl restart ollama

Tool / Function Calling

Models supporting tools include Llama 3.1/3.2/3.3, Qwen 2.5, Mistral-Nemo, Mistral-Large, Granite.

Bash

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role":"user","content":"Weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "city":  { "type": "string" },
          "units": { "type": "string", "enum": ["c","f"] }
        },
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'

The response contains a tool_calls array; you execute the function and post back a role: "tool" message with the result, then re-call /api/chat.
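
The round trip described above might look like the following sketch. It assumes each entry in `tool_calls` has a `function` object with `name` and an `arguments` dictionary; `get_weather` here is a stand-in that returns fixed data, and `TOOLS` and `run_tool_calls` are hypothetical names.

```python
import json


def get_weather(city, units="c"):
    """Stand-in implementation; a real tool would query a weather service."""
    return {"city": city, "units": units, "temp": 21}


# Registry mapping tool names (as declared in the request) to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool and return the role:"tool" messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            # Arguments arrive as a dict keyed by parameter name.
            output = handler(**fn["arguments"])
        results.append({"role": "tool", "content": json.dumps(output)})
    return results


# Next step: append the assistant message and these tool messages to "messages",
# then call /api/chat again so the model can write its final answer.
```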

Structured Outputs (JSON)

Bash

# Free-form JSON
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role":"user","content":"Give 3 fruits as JSON list"}],
  "format": "json",
  "stream": false
}'

# Strict JSON Schema
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role":"user","content":"Bio of Ada Lovelace"}],
  "format": {
    "type": "object",
    "properties": {
      "name":     { "type": "string" },
      "born":     { "type": "integer" },
      "famous_for":{"type":"array","items":{"type":"string"}}
    },
    "required": ["name","born","famous_for"]
  },
  "stream": false
}'
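
With `"stream": false`, the structured result comes back as a JSON string in `message.content`, so the client still has to decode it. The sketch below decodes it and checks the required keys from the schema above; `parse_bio` and `REQUIRED` are hypothetical names, and the type check is deliberately shallow rather than a full JSON Schema validation.

```python
import json

# Required keys from the schema above, mapped to the Python type each should decode to.
REQUIRED = {"name": str, "born": int, "famous_for": list}


def parse_bio(chat_response):
    """Decode message.content and check the required keys and basic types."""
    data = json.loads(chat_response["message"]["content"])
    for key, expected in REQUIRED.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} should be {expected.__name__}")
    return data


# Usage: bio = parse_bio(response_json); print(bio["name"], bio["born"])
```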

Multimodal & Vision Models

Bash

# CLI: drag image path into the prompt
ollama run llava "What is in ./photo.png?"

# API: pass base64 images (GNU base64 shown; on macOS use `base64 -i photo.png`)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2-vision",
  "messages": [{
    "role": "user",
    "content": "Describe this picture.",
    "images": ["'"$(base64 -w0 photo.png)"'"]
  }],
  "stream": false
}'

Vision-capable Ollama models: llava, llava-llama3, llava-phi3, bakllava, moondream, minicpm-v, llama3.2-vision.
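
Building the image request in Python avoids the shell-quoting and `base64` flag differences of the curl version. A minimal sketch, where `build_vision_payload` is a hypothetical helper that produces the same body as the curl call above:

```python
import base64


def build_vision_payload(model, prompt, image_bytes):
    """Return the /api/chat body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,
    }


# Usage:
#   from pathlib import Path
#   body = build_vision_payload("llama3.2-vision", "Describe this picture.",
#                               Path("photo.png").read_bytes())
#   # then POST json.dumps(body) to http://localhost:11434/api/chat
```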

Embeddings

Bash

# API: several inputs can be embedded in one request
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["hello world", "another sentence"]
}'

Recommended embedding models:

nomic-embed-text (768-dim, fast)
mxbai-embed-large (1024-dim, strong English)
all-minilm (384-dim, tiny)
bge-m3 (multilingual, 1024-dim)
snowflake-arctic-embed (multiple sizes)
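
Embeddings are usually compared with cosine similarity. The sketch below ranks document vectors against a query vector in pure Python; `cosine` and `rank` are hypothetical helpers, and the vectors would come from the `embeddings` array returned by /api/embed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Usage: embed the query and the documents with the same model, then
#   order = rank(query_embedding, document_embeddings)
```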

Python (OpenAI SDK)

Python

# Reuses the `client` configured in the OpenAI-Compatible API section
emb = client.embeddings.create(model="nomic-embed-text", input=["doc 1","doc 2"])

GPU / Hardware

Bash

# Verify GPU detection
ollama ps                 # shows "100% GPU" / "50/50 CPU/GPU"
journalctl -u ollama -e   # logs (Linux)

# Force a specific GPU (NVIDIA)
CUDA_VISIBLE_DEVICES=0 ollama serve

# Force a specific GPU (AMD ROCm)
HIP_VISIBLE_DEVICES=0 ollama serve

# Limit layers offloaded to GPU
# (per-request via "options": { "num_gpu": 35 })
# 0 = pure CPU; 99 = "as many as fit"

# Quick VRAM math: a Q4 model ≈ params (in billions) × 0.6 GB
# (7B Q4 ≈ 4.2 GB, 13B Q4 ≈ 7.8 GB, 70B Q4 ≈ 42 GB)

Supported backends: NVIDIA CUDA (12.x/11.x), AMD ROCm, Apple Metal, CPU (AVX2/AVX‑512), Intel oneAPI (experimental).
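
The quick VRAM rule of thumb above (parameters in billions × 0.6 GB for Q4) is easy to turn into a pre-flight check. This is a rough sketch only: `estimate_q4_gb` and `fits` are hypothetical helpers, the 1 GB headroom default is an assumption, and the estimate ignores the KV cache, which grows with `num_ctx`.

```python
def estimate_q4_gb(params_billion, gb_per_billion=0.6):
    """Approximate weight size of a Q4 model in GB, per the rule of thumb."""
    return params_billion * gb_per_billion


def fits(params_billion, vram_gb, headroom_gb=1.0):
    """True if the estimated weights plus an assumed headroom fit in VRAM."""
    return estimate_q4_gb(params_billion) + headroom_gb <= vram_gb


# Usage: fits(7, vram_gb=8) checks a 7B Q4 model against an 8 GB card.
```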

Docker & Server Deployment

Bash

# Simple
docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama

# With GPU
docker run -d --gpus all --name ollama -p 11434:11434 \
    -v ollama:/root/.ollama ollama/ollama

# Pre-pull a model into the container
docker exec -it ollama ollama pull llama3.2

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama:/root/.ollama"]
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
volumes: { ollama: {} }

Production tips

Put Ollama behind a reverse proxy (Caddy / nginx) and terminate TLS there.
Set OLLAMA_HOST=0.0.0.0:11434 and protect with auth at the proxy layer (Ollama has no built-in auth).
Restrict OLLAMA_ORIGINS to known frontends.
Set OLLAMA_KEEP_ALIVE=-1 for hot models, 0 for rarely-used ones.

Integrations

Python

# LangChain
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.2", temperature=0)

# LlamaIndex
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3.2", request_timeout=120.0)

# LiteLLM
import litellm
litellm.completion(model="ollama/llama3.2", messages=[{"role":"user","content":"hi"}])

# Continue, Open WebUI, Page Assist, Msty, Cody, Aider — all support
# Ollama out of the box; point them at http://localhost:11434.

Troubleshooting

Bash

# "model not found"
ollama list                 # spelling / tag?
ollama pull <model>

# "Error: listen tcp 127.0.0.1:11434: bind: address already in use"
lsof -i :11434              # find offender
sudo systemctl restart ollama
OLLAMA_HOST=127.0.0.1:11435 ollama serve

# Stuck pull
rm -f ~/.ollama/models/blobs/sha256-*-partial*   # partial downloads end in -partial, not .partial
ollama pull <model>

# Out of memory / VRAM
# - smaller quant (q4_0 → q3_K_M)
# - smaller size (70b → 8b)
# - lower num_ctx
# - num_gpu N to offload only N layers
ollama show <model> | grep "context length"

# Slow CPU performance
# - install AVX2-capable build
# - set "num_thread" in options to match physical cores (OLLAMA_NUM_PARALLEL only controls concurrent requests)
# - enable OLLAMA_FLASH_ATTENTION=1 (compatible models)

# Verbose debugging
OLLAMA_DEBUG=1 ollama serve

# Wipe everything (deletes all models, keys, and history under ~/.ollama)
rm -rf ~/.ollama

Quick Reference

Bash

# Daily
ollama run llama3.2
ollama ps
ollama list
ollama show llama3.2

# Customize
ollama create mybot -f Modelfile
ollama cp llama3.2 llama3.2-tuned

# Maintenance
ollama pull llama3.2          # update
ollama rm   llama3.2:1b
ollama stop llama3.2

# API smoke test
curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags

Tip: treat keep_alive as a knob: "-1" to pin a model in VRAM, "0" to unload after every request, or a duration like "30m" for batch jobs.
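
The keep_alive knob can also be used without generating any text: a request to /api/generate that carries only `model` and `keep_alive` loads or unloads the model. A minimal sketch of the two request bodies, with `preload_request` and `unload_request` as hypothetical helper names:

```python
def preload_request(model, keep_alive="30m"):
    """Body for POST /api/generate that loads the model without generating text."""
    return {"model": model, "keep_alive": keep_alive}


def unload_request(model):
    """Body for POST /api/generate that unloads the model right away."""
    return {"model": model, "keep_alive": 0}


# Usage: POST json.dumps(preload_request("llama3.2", -1)) to
# http://localhost:11434/api/generate to pin the model, and
# json.dumps(unload_request("llama3.2")) to release it.
```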
