When auto-compaction triggers mid-task, the summarization process irreversibly loses user-provided data that Claude is actively working with. The compacted summary retains a reference to the data (e.g., "user provided DOM markup") but discards the data itself — even though the full transcript is still on disk at ~/.claude/projects/{project-path}/.
Concrete example: I paste ~8,000 characters of DOM markup and ask Claude to refactor it. The session continues, auto-compaction fires, and the summary now says "User provided 8,200 characters of DOM markup for analysis." When I ask a specific question about the markup, Claude either hallucinates details, hedges, or asks me to re-paste what I provided minutes ago. The transcript on disk still has the original markup — Claude just has no way to know that, or where to find it.
This isn't an edge case. Any task involving substantial user-provided input (DOM markup, config files, log output, schema definitions) is at risk the moment the context window fills up. At least 8 issues, most of them still open, describe different symptoms of this same root cause:
#1534 — Memory Loss After Auto-compact (June 2025, closed — problem persists)
#3021 — Forgets to Refresh Memory After Compaction
#10960 — Repository Path Changes Forgotten After Compaction
#13919 — Skills context completely lost after auto-compaction (compiled its own list of 5+ related issues)
#14968 — Context compaction loses critical state
#19888 — Conversation compaction loses entire history
#21105 — Context truncation causes loss of conversation history
#23620 — Agent team lost when lead's context gets compacted
The underlying problem: compaction is a one-way lossy transformation with an available lossless source that isn't wired up. The summary and the transcript coexist on disk, but the summary contains no pointer back to the source material it compressed.
Phase 1: Within-session indexed transcript references
Tag compressed entries in the compacted summary with transcript line-range annotations so Claude can surgically recover only the specific content it needs, on demand.
Current behavior (lossy, no recovery):
[compacted] User provided 8,200 characters of DOM markup for analysis.
Proposed behavior (recoverable):
[compacted] User provided 8,200 characters of DOM markup for analysis. [transcript:lines 847-1023]
When Claude detects it needs the original data — because a question requires specifics the summary doesn't contain — it reads only lines 847–1023 from the transcript. Surgical recovery, minimal token cost, no wasted context on data that isn't needed.
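To make the recovery step concrete, here is a minimal TypeScript sketch. It assumes the transcript is a plain line-oriented file on disk; the annotation regex and the names TranscriptRef, parseTranscriptRef, and recoverSlice are illustrative, not Claude Code internals.

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical parsed form of an annotation like "[transcript:lines 847-1023]".
interface TranscriptRef {
  startLine: number; // 1-indexed, inclusive
  endLine: number;   // inclusive
}

// Pull the line-range annotation out of a compacted summary entry, if present.
function parseTranscriptRef(entry: string): TranscriptRef | null {
  const m = entry.match(/\[transcript:lines (\d+)-(\d+)\]/);
  return m ? { startLine: Number(m[1]), endLine: Number(m[2]) } : null;
}

// Recover only the referenced slice of the on-disk transcript.
// For lines 847-1023 this returns 177 lines, not the whole session.
async function recoverSlice(transcriptPath: string, ref: TranscriptRef): Promise<string> {
  const text = await readFile(transcriptPath, "utf8");
  // Split-and-slice for clarity; a production version could stream to the offset.
  return text.split("\n").slice(ref.startLine - 1, ref.endLine).join("\n");
}
```

The exact encoding matters less than the property it buys: the pointer costs a few tokens to store, and the read cost is proportional to the chunk, not the session.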
Why this works: the lossless source already exists on disk, so the summary only needs a pointer into it; recovery becomes a targeted file read instead of a new storage system.
Implementation scope: A modification to the compaction summarizer — when compressing a block of content, emit the summary line plus a transcript line-range annotation. Then expose a recovery mechanism (likely a file read with offset) that Claude can invoke when it identifies a gap between what the summary references and what it can actually see.
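A sketch of the emission side, under the same assumptions; CompressedBlock and emitCompactedEntry are hypothetical names, and the real summarizer would supply the summary text and line bounds itself:

```typescript
// Hypothetical shape of one transcript span the summarizer has compressed.
interface CompressedBlock {
  summary: string;   // model-generated one-line summary of the span
  startLine: number; // where the original content begins in the transcript
  endLine: number;   // where it ends (inclusive)
}

// Emit the compacted entry with a back-reference to its source lines,
// rather than the summary alone.
function emitCompactedEntry(block: CompressedBlock): string {
  return `[compacted] ${block.summary} [transcript:lines ${block.startLine}-${block.endLine}]`;
}

// emitCompactedEntry({
//   summary: "User provided 8,200 characters of DOM markup for analysis.",
//   startLine: 847,
//   endLine: 1023,
// })
// => "[compacted] User provided 8,200 characters of DOM markup for analysis. [transcript:lines 847-1023]"
```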
Phase 2 (future extension): Cross-session indexed lookup
The same mechanism generalizes across sessions. Transcripts from past sessions persist in ~/.claude/projects/. A cross-session reference would look like:
[compacted] User established auth pattern using JWT with refresh tokens.
[session:2025-02-15/transcript:lines 312-487]
Suggested UX: Claude checks the current session transcript first (fast, local, cheap). If no match, Claude prompts the user: "I don't have that detail in the current session. Want me to search previous session transcripts?" User opts in → Claude searches across session transcripts using indexed references. This respects the user's token budget while providing a clean escalation path.
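A sketch of how cross-session resolution could work, with the caveat that the session key in the annotation, and the assumption that transcript filenames embed it, are both inventions here for illustration; actual transcript naming under ~/.claude/projects/ may differ:

```typescript
import { readFile, readdir } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical parsed form of "[session:2025-02-15/transcript:lines 312-487]".
interface SessionRef {
  sessionKey: string; // assumed to identify one past session's transcript
  startLine: number;
  endLine: number;
}

function parseSessionRef(entry: string): SessionRef | null {
  const m = entry.match(/\[session:([^\/\]]+)\/transcript:lines (\d+)-(\d+)\]/);
  return m ? { sessionKey: m[1], startLine: Number(m[2]), endLine: Number(m[3]) } : null;
}

// Resolve a cross-session reference against past transcripts on disk.
async function resolveAcrossSessions(projectDir: string, ref: SessionRef): Promise<string | null> {
  const files = await readdir(projectDir);
  const file = files.find((f) => f.includes(ref.sessionKey));
  if (!file) return null; // nothing found; fall back to asking the user
  const text = await readFile(join(projectDir, file), "utf8");
  return text.split("\n").slice(ref.startLine - 1, ref.endLine).join("\n");
}
```

The opt-in prompt sits in front of this call: it runs only after the user agrees to search prior sessions.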
Workarounds I've tried:
MEMORY.md behavioral nudge — adding a rule in ~/.claude/CLAUDE.md telling Claude to check the transcript when data feels incomplete. This works occasionally, but it depends on Claude being self-aware enough to recognize what it has lost, which it often isn't. It's a behavioral hack, not an architectural fix.
Manual /compact with preservation instructions — e.g., /compact preserve all user-provided DOM markup. This is manual, not automated, and still produces lossy summaries.
Existing proposals that address this differently:
#17428 — File-backed summaries that write compacted content to new files for later retrieval. The present proposal achieves the same goal with less machinery: the transcript already exists on disk, so rather than duplicating data into new files, annotate the summary with line-range pointers to the existing transcript. Same recoverability, zero additional storage.
#15923 — Pre-compaction hook to archive transcripts before compaction. Useful for logging but doesn't solve the recovery problem — Claude still can't find what it lost.
#21190 — Include last N interactions verbatim in the compact summary. Helps with recent context but doesn't scale to mid-session data and increases summary size.
#6549 — Manual continuation prompts before compaction. Effective but entirely manual and user-dependent.
Community workarounds: The MCP memory ecosystem (embedding stores, knowledge graphs, SQLite-backed recall) has dozens of projects working around this gap externally. These are all compensating for one missing capability: the compacted summary doesn't know where its source material lives. The indexed transcript reference approach solves it natively, at the platform layer, with zero additional infrastructure.
High - Significant impact on productivity
Performance and speed
With indexed transcript references:
At step 4, Claude sees [transcript:lines 847-1023] in the compacted summary, reads just those 177 lines from the transcript on disk, recovers the exact markup, and answers accurately. No re-paste, no guessing, no wasted tokens. The session continues as if compaction never happened.
Technical considerations:
The transcript already persists at ~/.claude/projects/{project-path}/, so no new storage mechanism is needed. This proposal adds metadata to the compaction summary, not infrastructure.
Implementation scope is a modification to the compaction summarizer: emit summary lines with [transcript:lines X-Y] annotations, plus a recovery mechanism (file read with line offset) Claude can invoke on demand.
Token cost is zero until recovery is triggered, and surgical when it is — only the specific compressed chunk gets loaded, not the full transcript.
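A self-contained illustration of that cost profile (names hypothetical, logic overlapping with the Phase 1 sketch above): nothing extra is read unless a gap is detected, and the read is bounded by the annotated range.

```typescript
import { readFile } from "node:fs/promises";

// Zero cost until triggered: return the summary entry untouched unless the
// caller has detected a gap, then load only the annotated line range.
async function maybeRecover(entry: string, transcriptPath: string, gapDetected: boolean): Promise<string> {
  const m = entry.match(/\[transcript:lines (\d+)-(\d+)\]/);
  if (!m || !gapDetected) return entry; // summary suffices: no extra tokens spent
  const [start, end] = [Number(m[1]), Number(m[2])];
  const text = await readFile(transcriptPath, "utf8");
  return text.split("\n").slice(start - 1, end).join("\n"); // just the compressed chunk
}
```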
Related issues (demand evidence):
At least 8 bug reports trace back to this same root cause (compaction summaries with no back-reference to their source):
#1534, #3021, #10960, #13919, #14968, #19888, #21105, #23620
Related feature requests (differentiation): #17428, #15923, #21190, #6549, each discussed above.
Phase 2 extension (not the core ask, but worth noting): The same indexed mechanism generalizes across session transcripts for cross-session recovery. Default to current session lookup; prompt the user before searching prior sessions. Same architecture, wider scope.