Capsules: A Framework for Portable Knowledge Transfer Between AI Agents

There's a chapter in Jared Diamond's Guns, Germs, and Steel that lays out how written language made its way across continents. Time and again, individual humans demonstrated an ability to observe a set of symbols, reverse-engineer their meaning, and map it to something relevant for their own people. Language, miraculously, always finds a way. Take the Cherokee blacksmith Sequoyah. He noticed settlers using marks on paper to communicate and, without any prior knowledge of writing, reinvented the concept for his tribe. Within a few years, the Cherokee approached a 100% literacy rate, exceeding their European-American counterparts.

This is the story of one person solving a problem on his own, knowing exactly what his audience needed to adopt it. Sequoyah created something arguably better than the original because it was designed for Cherokee systems. He had no context about European-American culture and didn't understand the language, yet he was able to translate and transfer knowledge with ease and effectiveness. He knew writing's purpose (sharing information across time and space) and what his tribe needed (syllable-based symbols).

This wasn't a one-time miracle. Diamond cites hundreds of examples where an existing written language was used as a blueprint for another. He calls it 'blueprint copying': a system works, so you copy its structure and modify it for your context. The catch is that no two languages sound the same. For one culture's alphabet to work in another's, the details have to be rewritten and shaped for an audience that thinks and communicates differently. Structure travels, but to be effective in a new context, the content has to be adapted. The Semitic alphabet was invented once, about four thousand years ago, and has been adapted hundreds of times since: to Phoenician, Greek, Roman, and eventually every modern alphabet. Same underlying protocol, different structure.

What made these adaptations work was never technical skill. Sequoyah's breakthrough wasn't linguistic; he couldn't read English. What he had was editorial judgment: he knew what to keep, what to reshape, and how to make it stick.

I've spent my career doing this in a different medium. As a product manager, a lot of the job is translating the same piece of information across audiences that think differently: engineering, marketing, leadership, customers. You learn to cut what doesn't serve the person in front of you. That's not a soft skill. It's a compression algorithm that runs on editorial instinct.

I didn't know that instinct would become relevant to how AI agents talk to each other. But here we are.


I'll spend hours with Claude Code pulling apart a product idea. Pressure testing, talking tradeoffs, rollout plans, the competitive landscape. By the end of a thread, I've touched a dozen topics at different depths. Most of them got exactly the attention they needed at that stage. But there would often be one topic I wanted to keep pulling at, either to dive deeper or to spin off into its own focus. The problem was that it was now buried inside a conversation carrying the weight of everything else. I'd try to stay in the same thread and the agent would start conflating context, mixing signals from the marketing discussion into the measurement framework. The flow state would break, and we'd both lose the plot.

Then there was the tooling. I kept my personal data spread across Notion, Google Drive, my phone. When MCPs* started gaining traction, I got excited. It seemed like the answer to wiring everything together. But the early versions of those integrations were rough: vulnerable, easy to get stuck, with connections that failed silently on something that should take seconds. I tried improving the setup, but it just wasn't reliable enough. And the other side of it was trust: I was never fully confident in what I was giving these connections access to. Frustrated, I realized it was just faster to create a markdown** file and drop it in myself.

I started to think differently about this workflow when my systems needed to talk to each other. I had an operating system at work, a personal one at home, and a handful of cloud-based agents that each served different purposes. They were built independently, but over time they started developing overlaps. An insight from a work conversation that was relevant to a personal project. A pattern I'd refined in one system that could strengthen another. I wanted my improvements from one system to compound across the others, except I couldn't just sync them. Each had context that should never be shared. Credentials, internal references, things that only made sense in their original environment. What I needed wasn't a pipe between them. It was a filter. Something that could bring an update or an insight without carrying everything around it.

Compressing the context I needed into plain text again did the trick. The context landed exactly as I shaped it, and if it didn't, it was straightforward to see what needed tweaking. Just make a new file and try again.

I didn't think much of it when I started the habit. I was just trying to get my work from one place to another. But the more I leaned into it, the more I started to notice that the problem I kept solving for myself was the same one an entire industry was circling.

*MCP (Model Context Protocol) is a standard for connecting AI to other apps.
**Markdown files are plain-text docs formatted for readability using special characters.

The Industry's Problem

Agent-to-agent communication problems show up everywhere once you start looking. Builders on X hitting context limits mid-conversation and venting frustration about their agents losing important details. Early adopters of Claude Code trying to share their carefully engineered agent setups and discovering there's no clean way to hand them over: "Here's what I built, here's how it works; go." Teams designing agent infrastructure that accelerates work in one codebase but fails spectacularly in others. Influencers sharing their personal operating systems, only to find that the majority of their audience can't make them work. The challenge is always the same: their context doesn't fit yours.

The industry keeps throwing horsepower at this. More inference, bigger context windows, faster protocols, smarter orchestration. And it helps with delivery. But delivery was never the hard part. The hard part is what happens to meaning when your data moves from one system's context to another. You can have a perfect connection between two agents and still lose everything important in the transfer, because nobody shaped the information for the system receiving it.

I'd been doing this with my own files for months before I realized it had a name: editorial compression. The same instinct I'd spent my career sharpening in product work was the thing making these transfers actually land. Not a new technical standard. Not another API layer. A communication skill, formalized into an artifact.

I started calling the output capsules.

Capsules

A capsule is a self-contained markdown file that packages knowledge, patterns, and integration instructions into a format any receiving agent can act on. It's not a context dump. It's a dispatch package with built-in instructions for the receiver: who it's from, who it's for, what's inside, and what to do with it.

The format evolved through use. Early versions were just conversation excerpts I'd curate and paste into a fresh session. Then I started adding structure: a summary of what mattered, the patterns I'd learned, signals the receiving agent should watch for. Over time, the files developed tiers. A quick capsule for a simple handoff might be fifty lines. A deep capsule for replicating an entire system's architecture could be several hundred. The key difference: quick capsules inform, standard capsules instruct, deep capsules replicate.

# Junior Dev Starter Kit

> Everything a new engineer needs to start shipping in the payments service.

**Tier:** Standard
**Archetype:** Trainer

## Dispatch Summary
Onboarding context for the payments service — architecture decisions,
deployment patterns, and the three things every new dev gets wrong.

## Core Content
System patterns, conventions, tribal knowledge that isn't in the docs.

## Integration Plan
What to install, what to replicate, what to configure.
Step-by-step from clone to first deploy.

## Signals
Open questions and watch-outs for the receiver to evaluate on their own.

Figure 1: Example structure of a typical capsule.

What made the format click was the sanitization step. One of the capsules I'm proudest of was a trainer capsule. I'd built some really effective branding artifacts and wanted to create similar ones for a few new ideas. But a large chunk of what I used for that workflow would be irrelevant to the other project: tools, brand details, creative assets, resources. So the capsule I created stripped all of that. It didn't suggest a file structure or prescribe an implementation. It encapsulated the skills and patterns, generalized them completely, and gave the receiving agent a goal: here is what we're building toward, you decide how to get there.

The receiving agent didn't need a playbook. It needed a briefing. And the things I left out were as important as the things I put in. That's the design principle at the core of this: a capsule captures just enough to be successful without predetermining the conclusion you'll reach.

A capsule isn't a skill or a tool. Skills are permanent. You install them, you keep them, you reach for them again and again. A capsule is a dissolvable, installable orchestration of knowledge. It might contain a skill, a hook, an API integration, a workflow, an interpretation of how those pieces fit together, or just a bit of context engineered for a specific moment. The format is flexible because the goal isn't standardization. It's deterministic improvement. A capsule is how you shape exactly what your system needs to get better at the thing you're pointing it at.
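To make the shape of a capsule concrete, here's a minimal sketch of how a receiving system might read one. It assumes the markdown layout shown in Figure 1; the `parse_capsule` helper and its field names are hypothetical, not part of the actual schema:

```python
import re

def parse_capsule(text: str) -> dict:
    """Extract title, tier, archetype, and section bodies from a capsule file.
    Assumes the markdown layout shown in Figure 1; real capsules may vary."""
    capsule = {"sections": {}}
    # The H1 line is the capsule's title.
    title = re.search(r"^# (.+)$", text, re.MULTILINE)
    capsule["title"] = title.group(1) if title else None
    # Bolded key-value lines carry the metadata.
    for field in ("Tier", "Archetype"):
        m = re.search(rf"\*\*{field}:\*\*\s*(.+)", text)
        capsule[field.lower()] = m.group(1).strip() if m else None
    # Each ## heading starts a named section; capture everything until the next one.
    for m in re.finditer(r"^## (.+?)\n(.*?)(?=^## |\Z)",
                         text, re.MULTILINE | re.DOTALL):
        capsule["sections"][m.group(1).strip()] = m.group(2).strip()
    return capsule

example = """# Junior Dev Starter Kit

**Tier:** Standard
**Archetype:** Trainer

## Dispatch Summary
Onboarding context for the payments service.

## Signals
Open questions for the receiver.
"""

parsed = parse_capsule(example)
print(parsed["tier"])                 # Standard
print(sorted(parsed["sections"]))     # ['Dispatch Summary', 'Signals']
```

The point of the sketch is that nothing exotic is needed on the receiving end: because the vehicle is plain markdown, any agent (or twenty lines of code) can pull the dispatch metadata apart.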

What Happens When You Use One

One of the things I'd built into my agent setup was a way to run ideas through different perspectives: technical tradeoffs, product strategy, risk assessment, each framed specifically for the business I was operating in. When I created a capsule to transfer the core patterns of that coordination to a completely different system, with different tools, different problems, different context, the receiving agent didn't just copy the setup. It understood the analogous structure, compared the differences, and inferred the changes it needed to make on its own. I didn't need to walk it through the repos or direct each modification. The capsule gave it enough context about the goal and the shape of the system, and it figured out the path.

That's the qualitative shift. With a capsule, the agent isn't operating on raw context and trying to make sense of everything at once. It's operating on curated context with clear boundaries. It stops conflating signals from unrelated threads. It acts within scope instead of hallucinating beyond it. Focus, fidelity, containment.

And there's a quality I started to notice that I think matters more than people realize: capsules are extinguishable. You load one, you use it, you exhaust the useful context, and it's done. Think of an engineer temporarily joining another team for a specific project. A capsule gets them up to speed on exactly what they need to know. They start solving problems. And when the capsule's context is fully absorbed, they don't need it anymore. It's not an always-on integration accumulating stale context. It's an injection of knowledge with a natural endpoint. In the framework, that's a handoff capsule. Load it, absorb it, move on.

Yes, the capsule process involves a little manual effort (a few questions to answer). Someone has to shape the context, decide what crosses the boundary, and cut what doesn't serve the receiver. That's the feature, not the limitation. The fully automated alternative, every multi-agent system that tries to share all context, is exactly what keeps degrading performance in the research. The human doing editorial compression isn't the bottleneck; they're the quality function.

Checking My Own Work

I wanted to know if this actually worked, or if I was just telling myself a story. So I built a way to measure it.

The question was simple: does an agent understand context more effectively with all the information, or with a smart, compressed, editorialized version of it? I created an evaluation tool that takes large source documents, anywhere from ten thousand to a hundred thousand tokens, compresses them into capsules, then runs the same directed tasks against both versions using Haiku 4.5 (a smaller, faster model). Evaluation works by giving two agents the same goal but different information: one gets the full source, the other gets the capsule. That's what I'm calling effectiveness: given the same task, how close does the capsule-briefed agent get to the result you'd expect from one with full context?
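The metric itself reduces to a simple ratio. This sketch shows the arithmetic only; the `effectiveness` function and the rubric scores feeding it are illustrative assumptions, not the actual evaluation tool:

```python
def effectiveness(capsule_score: float, full_source_score: float) -> float:
    """Effectiveness as used here: how close the capsule-briefed agent gets
    to the fully-briefed agent's result, expressed as a percentage.
    Scores are hypothetical task grades (e.g. from a judge rubric)."""
    if full_source_score == 0:
        raise ValueError("full-source baseline score must be nonzero")
    return 100 * capsule_score / full_source_score

# A value above 100 means the compressed version outperformed the full source.
print(effectiveness(11, 10))  # 110.0
print(effectiveness(8, 8))    # 100.0
```

Averaging this ratio over all runs in a fixture category gives the per-category numbers reported below.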

I ran it across nine fixture categories: conversation summaries, full system architectures, knowledge documents, project packages. 94 total runs. The overall average effectiveness came in at 97%.

Figure 2: Capsule effectiveness by fixture type. Structured and integration-heavy fixtures consistently exceed the baseline.

I ran the numbers a few times and was pretty shocked to find that for full-system fixtures, the capsules didn't just match the original; they outperformed it. System architectures, monolithic servers, project onboarding docs: capsules averaged 110% effectiveness on these categories. The agent performed better with the compressed version than with everything. What outperforming looks like in practice: the agent stays on task, hallucinates less, and produces a sharper output. A hundred-thousand-token system compressed to thirty-seven thousand tokens, and the agent didn't just retain the important context. It stopped tripping over the unimportant context. Project packages, or RAG*-based workspaces, showed the same pattern, averaging 104%. The more structured the source material, the more editorial compression helps: it removes noise that was actively getting in the way.

Conversations showed a gap, averaging around 81%. But I expected this. What you need from one conversation to the next is nuanced, hard to predict, and rarely repeated. The goal for this type specifically needs some iteration.

Speed bumps aside, the finding I keep coming back to is this: the systems where capsules exceed 100% are the strongest evidence for the thesis. Editorial judgment isn't just preserving meaning through compression. In the right contexts, it's enhancing it. The person who knows what to cut is more valuable than the person who knows how to combine.

*RAG stands for Retrieval-Augmented Generation, a technique where a model retrieves relevant documents to ground and focus its responses.

Beyond the Process Boundary

I built this in relative isolation, solving my own problems. But the further I looked, the clearer the gap became.

Most multi-agent systems today are orchestrated within a single process. Subagents share a context window, divide tasks, and check each other's work. They're powerful. But everything stays inside the same session, the same system, the same run.

The industry is moving past that. Anthropic is building multi-agent teams. OpenAI shipped Codex, with agents running asynchronously in isolated sandboxes. Swarm architectures are multiplying. The direction is clear: agents will operate across organizational boundaries, across teams with different needs and different systems, alongside individuals who each bring their own context and constraints.

The part nobody's building is the handoff. When information needs to pass between those systems, between teams, between agents serving different people with different constraints, somebody has to decide what crosses and what gets left behind. That's not a protocol problem. That's an editorial one.

Capsules sit in that gap. Markdown happens to be the right vehicle because it compresses well, LLMs understand it natively, and humans can read it. But the format is the convenience. The editorial compression is the capability.


The capsule schema is open source at github.com/hurleywgly/capsule. The repo includes the full skill, tiers, categories, archetypes, and everything you need to start creating capsules for your own agent workflows.

Use it. Fork it. If you have a use case or an archetype that doesn't fit into what's there, I want to hear about it. The format is designed to be adapted, not prescribed. The repo has everything you need to create or load your first capsule.

* * *

I'll keep finding useful applications for this in my own work and refining the framework as it evolves. The evaluation tool will likely expand. There's something satisfying about the scientific process of it: creating a hypothesis about what makes knowledge transfer effective, measuring it, and iterating on the results.

There are also questions I haven't fully answered yet. Trust is one. When a capsule arrives from another system, what verifies that the knowledge inside is accurate, or that it hasn't been tampered with along the way? That's a problem worth solving, and I'm thinking about it.

Context windows are getting massive. Opus 4.6 offers a million tokens. The natural assumption is that bigger windows make compression unnecessary. I think the opposite might be true. The evaluation data already suggests that for structured systems, the agent performs better with less. The open question is whether that holds at scale: when you have a million tokens of runway, is editorial compression still more effective than giving the agent everything? My theory is that it will, especially for business-critical tasks where you don't have the luxury of iteration. When it has to be right the first time, curated context beats raw context. I still need to prove it out.

The bigger shift I see coming is in who creates the capsules. Right now, a human shapes the context. That's the editorial compression that makes the whole thing work. But I can imagine a future where agents are contributing capsules to each other, where both systems are controlled contributors to a shared repository, and where the human role shifts from author to editor. There's likely a version of this framework that forks into two directions: capsules that are created and monitored by humans for high-stakes transfers, and capsules that operate as part of larger autonomous workflows where the volume and velocity demand a different kind of oversight. Same underlying protocol. Different structure.

Sequoyah would probably recognize the pattern.


Images generated with Midjourney.