Step 6: Install Ollama

Local vs. cloud — what happens where?

It is important to understand what runs locally in this workflow and what goes through the cloud:

StepWhereNotes
Zotero MCP (querying library)✅ LocalConnection via localhost
yt-dlp (fetching transcripts)✅ LocalScraping on your Mac
whisper.cpp (transcribing audio)✅ LocalM4 Metal GPU
Semantic search (Zotero MCP)✅ LocalLocal vector database
Reasoning, summarizing, writing synthesesLocalQwen3.5:9b via Ollama (default); Anthropic API only when --hd is used

In the default mode, all generative work — summaries, literature notes, flashcards — is handled locally by Qwen3.5:9b via Ollama. Claude Code orchestrates the workflow but does not send source content to the Anthropic API. No tokens, no data transfer.

The honest trade-off: local models are less capable than Claude Sonnet for complex or nuanced tasks. For simple summaries and flashcards the difference is small; for writing rich literature notes or drawing subtle connections between sources, Claude Sonnet is noticeably better. Add --hd to any request to switch to Claude Sonnet 4.6 via the Anthropic API for that task — Claude Code always announces this and asks for confirmation before making the API call.


6a. Install

brew install ollama

6b. Download models

For an M4 Mac mini with 24 GB memory, Qwen3.5:9b is the recommended default model for all local processing tasks in the workflow:

ollama pull qwen3.5:9b   # ~6.6 GB — default model for all tasks

Qwen3.5:9b has a context window of 256K tokens, recent training with explicit attention to multilingualism (201 languages, including Dutch), and a hybrid architecture that stays fast even with long input. It replaces both llama3.1:8b and mistral for all workflow tasks.

# Optional alternatives (not needed if you use qwen3.5:9b):
ollama pull llama3.1:8b   # ~4.7 GB — proven, but 128K context and less multilingual
ollama pull phi3           # ~2.3 GB — very compact, for systems with less memory

Check which models are available after downloading:

ollama list

6c. Start Ollama

ollama serve

Leave this running in a separate Terminal window, or configure it as a background service so it is always available:

# Start automatically at system startup (recommended):
brew services start ollama

Check whether Ollama is active and which models are available:

ollama list

6d. How Ollama is used in the workflow

Qwen3.5:9b is the default engine for all generative tasks. The workflow uses a dedicated helper script — .claude/ollama-generate.py — to call Ollama via its REST API rather than the CLI.

Why a helper script instead of ollama run?

The Ollama CLI (ollama run qwen3.5:9b < file.txt) produces terminal output with ANSI escape codes and a loading spinner. When this output is captured by Claude Code's bash tool, the result contains hundreds of lines of control characters that need to be stripped before the content is usable. The helper script avoids this entirely by talking directly to the Ollama REST API at http://localhost:11434/api/generate.

The script also prepends /no_think to the prompt, which tells Qwen3.5:9b to skip its internal reasoning step and produce output directly. This saves time and keeps the output clean.

Usage:

~/.local/share/uv/tools/zotero-mcp-server/bin/python3 .claude/ollama-generate.py \
  --input  inbox/source.txt \
  --output literature/note.md \
  --prompt "Write a literature note in the same language as the source text..."

The script prints only status lines (Input: ..., Model: ..., Written: ...) — never the source content or the generated text. This ensures source content does not appear in Claude Code's context.

Test whether Ollama is reachable:

echo "Test" > /tmp/test.txt
~/.local/share/uv/tools/zotero-mcp-server/bin/python3 .claude/ollama-generate.py \
  --input /tmp/test.txt \
  --output /tmp/test-out.txt \
  --prompt "Say only: works."
cat /tmp/test-out.txt

If this returns a short response, Ollama and the helper script are ready for use.

Default rule in CLAUDE.md

The CLAUDE.md in your vault already contains the privacy rule and points to ollama-generate.py as the standard tool — no further configuration is needed per task.