If you've used Claude Code or Cursor for a long session, you know the moment. You're deep into a complex task, the context window fills up, and the tool "compacts" your conversation. What comes back is a lossy summary that forgot half of what you were doing. You spend the next ten minutes re-explaining things the model knew five minutes ago. It's the worst part of the experience, and every frontier product does it badly.
I built something better — running on a self-hosted Qwen 3.6 on a Mac Mini in my closet.
The Problem
Every frontier product treats context compaction as an afterthought. Claude Code, Cursor, Windsurf — they all hit the same wall. The context window fills up, they generate a summary, they throw away the original messages, and the model stumbles. Key decisions vanish. User preferences evaporate. The agent that was tracking six files and three open questions suddenly can't remember which branch it's on.
The summary is never enough. It's a lossy compression of a rich conversation, and the model on the other side doesn't have the original context to fill in the gaps. Every compaction is a small lobotomy.
I decided to solve this properly instead of papering over it.
What I Built (And Why It's Different)
The frontier products compact and pray. My system compacts, checkpoints, persists, and adapts. The context management system sits inside my Tier 2 Agent Runner — a lightweight Python harness I wrote to run open-weight models as full agent bots with persistent memory, RAG knowledge injection, tool calling, and Telegram integration.
The difference isn't the summarization (everyone does that). The difference is everything around it.
Token Tracking Across Every Zone
Every turn, the system counts tokens across four zones:
- System prompt — the agent's personality, instructions, and memory (~3K tokens)
- RAG context — domain knowledge auto-injected from a ChromaDB vector store (~2K tokens)
- Conversation history — the actual chat, growing with every exchange
- Response budget — tokens reserved for the model's reply (4,096 tokens)
The tracker uses a fast estimator that self-calibrates against actual token counts returned by the API. Over time, the estimates converge to within 5% of ground truth without needing a tokenizer library. No dependencies, no overhead, just math that gets smarter the longer it runs.
Warning at 80%: Auto-Checkpoint
When total usage crosses 80% of the context window, the system saves a checkpoint to my persistent memory server. This captures what the agent was working on, key decisions made, and conversation state — a snapshot that survives even if the process dies.
The warning fires once per threshold crossing, not every turn. When usage drops back below 80% (after compaction), it resets and can fire again.
Compaction at 90%: Summarize, Persist, Recover
Here's where I diverge from the frontier approach. When Claude Code compacts, it generates a summary and throws away the originals. That's it. One shot. If the summary misses something, it's gone forever.
My system does more:
- Checkpoint first — before any compaction, the full conversation state is already saved to my persistent memory server. Nothing is lost.
- Targeted split — oldest 60% gets summarized, newest 40% stays intact as original messages. The model keeps its recent working memory untouched.
- Structured summarization — the summary prompt explicitly preserves topics, decisions, user preferences, open tasks, and key facts. Not a generic "summarize this" — a directed extraction.
- Disk persistence — the summary writes to a working summary file, which loads into the system prompt on restart. The agent recovers from a full process crash with context intact.
- External memory backstop — even if the summary is lossy (and all summaries are), the agent can search its persistent memory for what it forgot. The compacted knowledge isn't deleted, it's moved to a different retrieval layer.
The result: compaction is a graceful degradation, not a lobotomy. The agent loses some conversational nuance but retains every decision, every preference, and every open task — because those live in memory systems that survive compaction.
Adaptive Knowledge Injection
Under context pressure, the RAG system automatically adapts. Instead of injecting 3 document hits at 600 characters each, it drops to 1 hit at 300 characters. The agent still gets relevant knowledge, just less of it, buying room for the conversation that matters.
Why This Matters
The model running this — Qwen 3.6 35B — is a Mixture of Experts architecture with 35 billion total parameters but only 3 billion active per inference. It runs on a single Mac Mini with an M4 Pro chip. No cloud GPU. No API costs. No data leaving my network.
That last point is the one that matters most. For companies handling sensitive data — defense contractors, healthcare, financial services — "I run my own models on my own hardware" isn't a nice-to-have. It's a requirement. But until now, self-hosted models meant giving up the quality-of-life features that make frontier products usable for real work.
Context management was one of those features. Now it's not.
The Numbers
| Metric | Before | After |
|---|---|---|
| Context window utilized | 3% (8K of 262K) | 100% (full 262K) |
| Max conversation length | ~20 messages | Unlimited (compacts as needed) |
| Knowledge injection | Fixed 3 hits | Adaptive (pressure-aware) |
| Crash recovery | None | Auto-checkpoint + disk summary |
| Token tracking | None | Per-turn with self-calibration |
The Bigger Picture
Context management isn't a standalone feature. It's one layer in an agent stack I designed from the start to make compaction survivable:
- Persistent memory — working summaries, wiki entries, mistake logs that outlive any single conversation
- RAG knowledge base — domain expertise auto-injected every turn, independent of conversation history
- Inter-agent communication — my agents can ask each other for context they've lost
- Checkpoint system — full conversation state saved to a cross-agent memory server before anything gets compacted
When Claude Code compacts your conversation, you lose context and there's nowhere to get it back. When my system compacts, the "lost" context is sitting in three different persistence layers that the agent can query on demand.
The frontier products have better models. I have better memory architecture. And it turns out memory architecture matters more than most people think — because the best model in the world is useless if it can't remember what you told it ten minutes ago.
The Bottom Line
I'm running a 35-billion-parameter model on a Mac Mini with a 256K context window, intelligent compaction, persistent memory, RAG injection, and crash recovery. No cloud. No API costs. No data leaving the building.
The frontier products charge per token and still lose your context. I run on hardware I own and remember everything.
That's not parity. That's an advantage.