For your second question: No LLM calls. Context Mode uses algorithmic processing — FTS5 indexing with BM25 ranking and Porter stemming. Raw output gets chunked and indexed in a SQLite database inside the sandbox, and only the relevant snippets matching your intent are returned to context. It's purely deterministic text processing, no model inference involved.
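For readers who want to see what this kind of index looks like, here is a minimal sketch using Python's built-in sqlite3 module. It is not Context Mode's actual code: the table name `chunks`, the column names, the fixed line-count chunking, and the function names are all assumptions made for illustration. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5(source, body, tokenize='porter unicode61')"
)

def index_output(source: str, raw: str, chunk_lines: int = 20) -> int:
    """Split raw output into fixed-size line chunks and index each chunk."""
    lines = raw.splitlines()
    chunks = [
        "\n".join(lines[i:i + chunk_lines])
        for i in range(0, len(lines), chunk_lines)
    ]
    conn.executemany(
        "INSERT INTO chunks (source, body) VALUES (?, ?)",
        [(source, chunk) for chunk in chunks],
    )
    return len(chunks)

def search(query: str, limit: int = 3) -> list[str]:
    """Return the best-matching chunks. FTS5's bm25() is lower-is-better,
    so an ascending ORDER BY puts the most relevant chunk first."""
    rows = conn.execute(
        "SELECT body FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because of the Porter stemmer, a query for "connecting" would also match a chunk containing "connection", since both reduce to the same stem. No model is involved at any step.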
Hey! Thank you for your comment! You should be able to use it as an MCP on this basis, but I haven't tested it yet. I'll look into it as soon as possible. Your feedback is valuable.
The BM25+FTS5 approach without LLM calls is the right call - deterministic, no added latency, no extra token spend on compression itself.
The tradeoff I want to understand better: how does it handle cases where the relevant signal is in the "low-ranked" 310 KB, but you just haven't formed the query that would surface it yet? The compression is necessarily lossy - is there a raw mode fallback for when the summarized context produces unexpected downstream results?
Also curious about the token count methodology - are you measuring Claude's tokenizer specifically, or a proxy?
On lossy compression and the "unsurfaced signal" problem:
Nothing is thrown away. The full output is indexed into a session-scoped SQLite FTS5 store (it lasts as long as the session process), so the 310 KB stays in the knowledge base and only the search results enter context. If the first query misses something, you (or the model) can call search(queries: ["different angle", "another term"]) as many times as needed against the same indexed data. The vocabulary of distinctive terms is returned with every intent-search result specifically to help form better follow-up queries.
The fallback chain: if intent-scoped search returns nothing, it splits the intent into individual words and ranks by match count. If that still misses, batch_execute has a three-tier fallback — source-scoped search → boosted search with section titles → global search across all indexed content.
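The word-splitting fallback and the tiered chain could look roughly like the sketch below. This is a hedged illustration, not the plugin's implementation: `rank_by_word_matches` and `first_non_empty` are hypothetical names, and plain substring matching stands in for whatever matching the real tool performs.

```python
from typing import Callable

def rank_by_word_matches(intent: str, chunks: list[str], limit: int = 3) -> list[str]:
    """Split the intent into words and rank chunks by how many words they contain."""
    words = set(intent.lower().split())
    scored = []
    for chunk in chunks:
        text = chunk.lower()
        hits = sum(1 for word in words if word in text)
        if hits:
            scored.append((hits, chunk))
    # sort() is stable, so chunks with equal hit counts keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:limit]]

def first_non_empty(tiers: list[Callable[[], list[str]]]) -> list[str]:
    """Try each search tier in order and return the first non-empty result."""
    for tier in tiers:
        results = tier()
        if results:
            return results
    return []
```

In this sketch, the three tiers described above (source-scoped, boosted with section titles, global) would each be passed to `first_non_empty` as a zero-argument callable.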
There's no explicit "raw mode" toggle, but if you omit the intent parameter, execute returns the full stdout directly (smart-truncated at 60% head / 40% tail if it exceeds the buffer). So the escape hatch is: don't pass intent, get raw output.
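A head/tail truncation of this kind can be sketched as follows. The 60/40 split comes from the comment above; the function name, the `max_bytes` parameter, and the marker text are assumptions, and the real buffer size is not stated in the thread.

```python
def truncate_head_tail(text: str, max_bytes: int, head_ratio: float = 0.6) -> str:
    """Keep the first 60% and last 40% of the byte budget, dropping the middle."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text  # fits in the buffer, return unchanged
    head_len = int(max_bytes * head_ratio)
    tail_len = max_bytes - head_len
    # errors="ignore" drops any multi-byte character cut in half at a boundary
    head = data[:head_len].decode("utf-8", errors="ignore")
    tail = data[len(data) - tail_len:].decode("utf-8", errors="ignore")
    omitted = len(data) - head_len - tail_len
    return f"{head}\n... [{omitted} bytes truncated] ...\n{tail}"
```

Keeping both ends is a common choice for command output, where the start usually holds the invocation and setup and the end usually holds the error or summary.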
On token counting:
It's a bytes/4 estimate using Buffer.byteLength() (UTF-8), not an actual tokenizer. Marked as "estimated (~)" in stats output. It's a rough proxy — Claude's tokenizer would give slightly different numbers — but directionally accurate for measuring relative savings. The percentage reduction (e.g., "98%") is measured in bytes, not tokens, comparing raw output size vs. what actually enters the conversation context.
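The arithmetic is simple enough to reproduce. The sketch below is a Python equivalent of the bytes/4 estimate (the original uses Node's Buffer.byteLength()); rounding up is an assumption, since the thread does not say how fractions are handled, and both function names are hypothetical.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate: UTF-8 byte length divided by 4 (rounding up is assumed)."""
    return math.ceil(len(text.encode("utf-8")) / 4)

def percent_reduction(raw_bytes: int, context_bytes: int) -> float:
    """Share of the raw output, in bytes, that never enters the conversation context."""
    return (1 - context_bytes / raw_bytes) * 100
```

Taking the WebFetch figures quoted elsewhere in this thread, 56 KB of raw HTML reduced to a 3 KB summary gives `percent_reduction(56_000, 3_000)`, roughly a 94.6% reduction. Note that non-ASCII text counts more than one byte per character, so the estimate drifts further from a real tokenizer on such input.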
Really cool. A tangential task that seems to be coming up more and more is masking sensitive data in these calls for security and privacy. Is that something you considered as a feature?
The SQLite database is ephemeral — stored in the OS temp directory (/tmp/context-mode-{pid}.db) and scoped to the session process. Nothing persists after the session ends. For sensitive data masking specifically: right now the raw data never leaves the sandbox (it stays in the subprocess or the temp SQLite store), and only stdout summaries enter the conversation. But a dedicated redaction layer (regex-based PII stripping before indexing) is an interesting idea worth exploring. Would be a clean addition to the execute pipeline.
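As a rough idea of what such a redaction layer might look like, here is a hypothetical sketch. Nothing like this exists in the plugin according to the thread; the pattern set, the labels, and the `redact` function are all invented for illustration, and real PII detection needs far more than two regexes.

```python
import re

# Hypothetical pattern set: deliberately small and far from exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a labeled placeholder before the text is indexed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```

Running this before indexing, rather than at search time, would mean the sensitive values never reach the temp database at all.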
Yes — the database is tied to the MCP server process, so it's created fresh on each claude launch and lost when you exit; resuming a session starts a new process with a new empty database.
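The process-scoped lifecycle can be pictured with a small sketch. The path pattern follows the /tmp/context-mode-{pid}.db form mentioned above; the function name and the explicit cleanup at exit are assumptions, since the thread only says the database is lost when the process ends.

```python
import atexit
import os
import sqlite3
import tempfile

def open_session_db() -> tuple[sqlite3.Connection, str]:
    """Open a SQLite file in the OS temp directory, named after this process's PID."""
    path = os.path.join(tempfile.gettempdir(), f"context-mode-{os.getpid()}.db")
    conn = sqlite3.connect(path)

    def _cleanup() -> None:
        # Assumed behavior: close and delete the file when the process exits.
        conn.close()
        if os.path.exists(path):
            os.remove(path)

    atexit.register(_cleanup)
    return conn, path
```

Since the file name depends on the PID, a resumed session running in a new process would get a different, empty file, which matches the behavior described above.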
Interesting approach. I tried the Hacker News example from the docs, but its tools don't seem to trigger reliably. Any suggestions?
> Fetch the Hacker News front page, extract all posts with titles, scores, and domains. Group by domain. Then run /context-mode stats.
* Claude used regular fetch *
> why didnt you use the context mode fetch?
● Fair point. Two honest reasons:
1. First request: The context-mode tools weren't loaded yet when I called WebFetch. I should have used ToolSearch to load fetch_and_index first, then used it, but I defaulted to WebFetch out of habit.
2. Second request (the interrupted one): The tools were already loaded at that point. I had no excuse. I should have done:
   fetch_and_index(url: "https://news.ycombinator.com", source: "HN")
   → search(queries: ["post title score domain"], source: "HN")
   → execute(group by domain, print summary)
   Instead I reached for WebFetch again. That's the anti-pattern the plugin exists to prevent.
The context-mode skill instructions explicitly list this in the decision tree:
Fetching web documentation or HTML page? → Use fetch_and_index → search
I ignored it. The WebFetch output (the full post table) went straight into context when it didn't need to. I'll use fetch_and_index correctly if you want to retry.
Context Mode doesn't replace your other MCP servers — it sits alongside them. Your Context7, Playwright, GitHub servers all stay installed and work normally. The hook intercepts output-heavy tool calls (like WebFetch, curl) and redirects them through the sandbox. For example, instead of WebFetch dumping 56KB of raw HTML into context, the hook blocks it and tells the model to use fetch_and_index instead — which fetches the same URL but indexes it in a local SQLite DB, returning only a 3KB summary.
Your other MCP servers still run. Context Mode just gives the model a more context-efficient way to process their results when the output would be large.
Interesting approach. I just finished some work for a similar task in a different domain.
One thing that surprised me: tantivy's BM25 search is faster, more expressive, and more scalable than SQLite. If you're just building a local search (or want to optimize for local FTS), I would strongly recommend looking into tantivy.
If you have the resources, it would be very interesting to throw some models (especially smart-but-context-constrained cheaper ones) at some of the benchmark programming problems and see if this approach can show an effective improvement.
On Tantivy: Agree it's the better search engine, but context-mode is session-scoped — DB is a temp file that dies when the process exits. At that scale (50-200 chunks), FTS5 is zero-config, single-file, <1ms startup, and good enough. If we ever add persistent cross-session indexing, Tantivy would be the move.
On benchmarking: This is the experiment I most want to see. The hypothesis: context-mode benefits smaller models disproportionately — a 32K model with clean context could outperform a 200K model drowning in raw tool output. Would love to see SWE-bench results with context-mode on vs. off across model tiers.
handfuloflight | 15 hours ago
And when you say it only returns summaries, does this mean there are LLM calls happening in the sandbox?
[OP] mksglu | 15 hours ago
[OP] mksglu | 15 hours ago
handfuloflight | 14 hours ago
[OP] mksglu | 14 hours ago
sim04ful | 15 hours ago
[OP] mksglu | 15 hours ago
nightmunnas | 15 hours ago
[OP] mksglu | 15 hours ago
Codex CLI:
Or in ~/.codex/config.toml:
opencode:
In opencode.json:
We haven't tested yet. Would love to hear if anyone tries it!
vicchenai | 15 hours ago
[OP] mksglu | 15 hours ago
rcarmo | 14 hours ago
[OP] mksglu | 14 hours ago
robbomacrae | 13 hours ago
[OP] mksglu | 13 hours ago
virgilp | 13 hours ago
Does that mean that if I exit claude code and then later resume the session, the database is already lost? When exactly does the session end?
[OP] mksglu | 13 hours ago
wobblywobbegong | 11 hours ago
[OP] mksglu | 6 hours ago
npm install -g context-mode@latest
If you're on the plugin install, re-run:
Then restart Claude Code. Sorry about that.
gavinray | 10 hours ago
You mention Context7 in the document, so would I have both MCP servers installed and there's a hook that prevents other servers from being called?
[OP] mksglu | 6 hours ago
i3oi3 | 8 hours ago
[OP] mksglu | 6 hours ago