Context Management PR#228
If you've ever worked with LLMs for more than a few turns, you've probably hit the same wall we did: context windows are finite. Conversations grow, tool outputs explode in size, and suddenly you're out of tokens.
This task was about building a full context management system for LLM4S that keeps only the necessary information: trimming away heavy tool outputs, compressing older turns, and making sure the LLM still sees what it needs to stay coherent.
It turned out to be one of the most architecture-heavy tasks so far. I had to make design decisions about the context management pipeline (what runs first, what runs last) and revisit earlier parts of the codebase (for example, changing how system prompts are stored) so that the new logic would work cleanly.
Understanding the Problem
LLM contexts grow quickly. A few concrete examples:
- A single tool call that returns JSON logs or stack traces can flood the context with thousands of tokens that the LLM doesn't need to see in full.
- After a few turns, the conversation history is long, but most of it is just background. Keeping every single detail inline wastes budget.
- Different models have very different context sizes (8k, 128k, 200k...) and without clear budgeting, you risk wasting capacity.
The solution was to design a step-by-step pipeline: start with the safest options (removing obvious junk), then progressively use more aggressive strategies (summarization, rewriting, cutting older turns) only when necessary.
Designing the Pipeline
I spent quite some time deciding what should run first, second, third... This order matters a lot.
The final pipeline runs inside ContextManager.manageContext. It has four steps:
1) Tool Output Compaction
What it touches: only ToolMessages. User/assistant messages are untouched in this step.
- Large tool payloads are externalized into an ArtifactStore: instead of keeping the full tool output inline, we save heavy content (e.g. logs, JSON...) into an external store (e.g. in-memory map, database, or file system) and replace it in the conversation with a lightweight pointer like [EXTERNALIZED:abc123] that refers back to the stored artifact. This way, the LLM sees a compact reference, while the full data is still retrievable whenever needed.
- For medium outputs (between ~2KB and 8KB), the system doesn't externalize but compacts them inline. For example, JSON is cleaned up (dropping empty fields, shortening long arrays), logs are clipped to just the beginning and end, and error traces are shortened to the most useful lines.
→ Think of this like moving attachments out of your email text. You don't inline the 20MB document; you just keep a link to it.
This runs first because it's the safest step and gives the biggest win without touching semantics: the LLM doesn't need the raw payload, just the fact that it exists.
ArtifactStore: The ArtifactStore is a simple storage layer (currently in-memory) where we can move heavy tool outputs. In production, it could be backed by any kind of database.
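To make the idea concrete, here is a minimal sketch of the externalization path. The type names, threshold, and preview length are illustrative, not the actual LLM4S API:

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical tool message and in-memory artifact store, for illustration only.
final case class ToolMessage(toolCallId: String, content: String)

class InMemoryArtifactStore {
  private val artifacts = mutable.Map.empty[String, String]

  def put(content: String): String = {
    val id = UUID.randomUUID().toString.take(8)
    artifacts(id) = content
    id
  }

  def get(id: String): Option[String] = artifacts.get(id)
}

object ToolOutputCompactionSketch {
  // Assumed cutoff: anything above ~8KB gets externalized.
  val ExternalizeThresholdBytes = 8 * 1024

  def compact(msg: ToolMessage, store: InMemoryArtifactStore): ToolMessage =
    if (msg.content.getBytes("UTF-8").length > ExternalizeThresholdBytes) {
      val id = store.put(msg.content)
      // The conversation keeps a short preview plus a pointer back to the artifact.
      msg.copy(content = s"[EXTERNALIZED:$id] ${msg.content.take(200)}...")
    } else msg
}
```

The inline compaction of medium-sized outputs (JSON cleanup, log clipping) follows the same pattern, except the content is rewritten in place instead of being moved out.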
2) History Compression
What it touches: older semantic blocks only. The last K blocks (the most recent ones) are kept the same.
- The system scans the conversation and groups messages into semantic blocks (e.g. a user question plus the assistant's reply, sometimes with tool calls or system messages).
- All but the most recent K blocks are replaced with [HISTORY_SUMMARY] messages.
- Each summary is built deterministically by applying regex extractors to the block text. It captures:
– Identifiers (IDs, UUIDs, keys...)
– URLs
– Constraints (e.g. "must/should...")
– Status codes and error fragments
– Decision and outcome fragments
– Tool usage fragments
- Within each block, matches are distinct (duplicates collapsed) and capped (e.g. max 3 IDs, 2 URLs, 2 constraints, etc.).
→ This is purely rule-based logic: no LLM is called, so the output is predictable and testable.
This runs second because it keeps the conversation coherent while shrinking the "long tail."
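As a rough sketch of what those deterministic extractors look like (the patterns and caps below are simplified examples, not the exact ones in HistoryCompressor.scala):

```scala
// Illustrative regex-based digest builder: extract, dedupe and cap a few categories,
// then emit a [HISTORY_SUMMARY] line.
object HistoryDigestSketch {
  private val IdPattern     = raw"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b".r
  private val UrlPattern    = raw"https?://\S+".r
  private val StatusPattern = raw"\b[1-5]\d{2}\b".r

  private def capped(matches: Iterator[String], max: Int): Seq[String] =
    matches.toSeq.distinct.take(max)

  def summarize(blockText: String): String = {
    val ids      = capped(IdPattern.findAllIn(blockText), 3)
    val urls     = capped(UrlPattern.findAllIn(blockText), 2)
    val statuses = capped(StatusPattern.findAllIn(blockText), 2)

    val parts = Seq(
      if (ids.nonEmpty) Some(s"ids: ${ids.mkString(", ")}") else None,
      if (urls.nonEmpty) Some(s"urls: ${urls.mkString(", ")}") else None,
      if (statuses.nonEmpty) Some(s"status: ${statuses.mkString(", ")}") else None
    ).flatten

    s"[HISTORY_SUMMARY] ${parts.mkString("; ")}"
  }
}
```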
3) LLM Digest Squeeze (optional)
What it touches: only [HISTORY_SUMMARY] messages.
When it runs: only when the current token count still exceeds the budget (even after tool output compaction and history compression), LLM compression is enabled, and an LLM client is provided.
- Sends the digest content to the LLM with a fixed prompt ("keep IDs/URLs/status codes/errors; preserve decisions/outcomes; maintain tool-usage; compress descriptive text; target ~50% reduction").
- Replaces each digest with a shorter [HISTORY_SUMMARY].
- Never compresses the whole conversation here—only the digest(s).
→ This runs third, only when the deterministic steps weren't enough. It's comparable to what Claude Code does with its built-in compaction: rewriting its own summaries.
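A minimal sketch of this step, assuming a generic completion client (the trait below is a stand-in, not the real LLM4S client interface):

```scala
// Illustrative digest squeeze: send an existing [HISTORY_SUMMARY] to the LLM with a
// fixed instruction and replace it with the shorter version that comes back.
trait CompletionClient {
  def complete(prompt: String): String
}

object DigestSqueezeSketch {
  private val SqueezePrompt =
    """Compress the following summary. Keep IDs, URLs, status codes and errors;
      |preserve decisions and outcomes; maintain tool-usage mentions;
      |compress descriptive text; target ~50% reduction.""".stripMargin

  def squeeze(digest: String, client: CompletionClient): String = {
    val compressed = client.complete(SqueezePrompt + "\n\n" + digest)
    s"[HISTORY_SUMMARY] $compressed"
  }
}
```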
4) Final Cut
- Last resort: If all previous techniques fail or don't save us enough space in the context window, the system cuts the oldest non-pinned turns until the conversation fits under budget.
- The first [HISTORY_SUMMARY] is pinned. Other digests are treated like normal messages and may be dropped if budget requires.
- Runs last as the strict fallback.
→ Basically, dropping the oldest details so the model never fails on input length.
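A simplified sketch of the trim loop (pinning and token counting are much cruder here than in TokenWindow.scala):

```scala
// Illustrative last-resort trim: drop the oldest non-pinned messages until the
// estimated token count fits the budget.
final case class Msg(content: String, pinned: Boolean = false)

object FinalCutSketch {
  // Crude chars/4 heuristic, standing in for real token counting.
  private def estimateTokens(msgs: Vector[Msg]): Int =
    msgs.map(_.content.length / 4).sum

  def trimToBudget(msgs: Vector[Msg], budget: Int): Vector[Msg] = {
    var current = msgs
    while (estimateTokens(current) > budget && current.exists(m => !m.pinned)) {
      val oldestIdx = current.indexWhere(m => !m.pinned) // oldest non-pinned message
      current = current.patch(oldestIdx, Nil, 1)
    }
    current
  }
}
```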
This sequencing was chosen with respect to three principles: start cheap and safe, escalate only when necessary, and never lose recency.
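The escalation itself can be pictured as a fold over the stages, where each stage only runs if the conversation is still over budget. This is a sketch of the idea, not the actual manageContext signature:

```scala
// Illustrative orchestration: run the cheap stages in order, skip any stage once the
// conversation already fits, and fall back to the final cut only at the very end.
object PipelineSketch {
  type Conversation = Vector[String]

  def manage(
      conv: Conversation,
      budget: Int,
      estimate: Conversation => Int,
      compactToolOutputs: Conversation => Conversation,
      compressHistory: Conversation => Conversation,
      squeezeDigests: Conversation => Conversation,
      finalCut: (Conversation, Int) => Conversation
  ): Conversation = {
    val stages = List(compactToolOutputs, compressHistory, squeezeDigests)
    val afterStages = stages.foldLeft(conv) { (c, stage) =>
      if (estimate(c) <= budget) c else stage(c) // escalate only when necessary
    }
    if (estimate(afterStages) <= budget) afterStages else finalCut(afterStages, budget)
  }
}
```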
Code Changes Beyond the Pipeline
I also had to revisit existing design decisions in this task.
System prompts were previously stored as the first message in the agent conversation. Now they are stored separately in AgentState.systemMessage and re-injected only when calling the LLM. This makes sense because the system prompt is configuration, not history: it shouldn't be treated as just another message, and keeping it separate prevents it from being compacted like normal history by the context management pipeline.
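In spirit, the state now looks something like this (the field and method names below are illustrative, not the exact AgentState definition):

```scala
// Illustrative shape: the system prompt lives outside the message history and is
// prepended only when building the request for the LLM.
final case class ChatMessage(role: String, content: String)

final case class AgentStateSketch(
    systemMessage: ChatMessage,        // configuration, never touched by the pipeline
    conversation: Vector[ChatMessage]  // only user/assistant/tool messages live here
) {
  def messagesForLLM: Vector[ChatMessage] = systemMessage +: conversation
}
```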
Model-aware budgets: Context window sizes depend on the model. For example, GPT-4o has a 128k-token window and GPT-3.5-turbo has a 16k-token window. I therefore added new parameters to the model provider configurations so we can retrieve context window sizes. Currently, we maintain a mapping from model name to context window size.
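Conceptually, that mapping is just a lookup with a conservative default for unknown models (the exact entries and fallback below are illustrative):

```scala
// Illustrative model-to-context-window lookup used for budgeting.
object ContextWindowSketch {
  private val windows: Map[String, Int] = Map(
    "gpt-4o"        -> 128000,
    "gpt-3.5-turbo" -> 16385
  )

  def contextWindow(model: String): Int =
    windows.getOrElse(model, 8192) // small, safe default for unknown models
}
```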
I also added token counting functionality and examples demonstrating the different pipeline stages, so we can see how compression behaves in practice.
File Breakdown
For readers who want to explore the code: PR#228
- ContextManager.scala - orchestrates the pipeline
- ToolOutputCompressor.scala - externalizes heavy tool outputs
- HistoryCompressor.scala - builds [HISTORY_SUMMARY] digests
- LLMCompressor.scala - optional LLM squeeze of digests
- TokenWindow.scala - trims oldest turns to fit budget
Tip: the samples/context directory has runnable examples that show how each stage behaves in practice.
Reflections
This was a big one. It combined new architecture and core refactors.
It was interesting to have to make such big decisions for users. There were many different ways of building this pipeline, and no single "right" solution. I spent much more time than on other tasks deciding what kinds of features to add and how to add them... It wasn't straightforward and was mostly based on my "best judgment". This was the first time I felt this responsibility so strongly and I have to say it was nice :)
At the end of it, LLM4S can now handle longer conversations without blowing up the context window. The system knows exactly how much space it has, compresses safely and deterministically, and only falls back to the LLM compression when it absolutely has to. 👩🏻🍳
