Session Context
Every LLM has a finite context window that limits how much text it can process in a single request. Agent Air tracks context usage to prevent exceeding model limits and to trigger compaction when needed.
Context Windows
The context window is the maximum amount of text an LLM can process in a single request. This includes:
- System prompt
- All previous messages in the conversation
- Tool calls and their results
- The current user message
- The model’s response
A token is roughly 4 characters of English text. When the total token count exceeds the context window, the model cannot process the request.
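The 4-characters-per-token rule of thumb can be sketched as a quick pre-flight estimate. `estimate_tokens` and `fits_in_context` below are illustrative helpers, not part of Agent Air; real tokenizers vary by model, so this is only an approximation:

```rust
// Rough token estimation using the ~4-characters-per-token heuristic.
// These helpers are illustrative, not part of Agent Air's API.
fn estimate_tokens(text: &str) -> usize {
    // Round up so short non-empty strings still count as at least one token.
    (text.chars().count() + 3) / 4
}

// Check whether all request parts (system prompt, messages, tool results)
// plausibly fit within a context limit.
fn fits_in_context(parts: &[&str], context_limit: usize) -> bool {
    parts.iter().map(|t| estimate_tokens(t)).sum::<usize>() <= context_limit
}

fn main() {
    let system_prompt = "You are a helpful assistant.";
    let user_message = "Summarize the report.";
    println!("~{} tokens", estimate_tokens(system_prompt));
    println!("fits: {}", fits_in_context(&[system_prompt, user_message], 8_192));
}
```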
Default Context Limits
Agent Air sets default context limits based on the provider:
| Provider | Default Context Limit |
|---|---|
| Anthropic | 200,000 tokens |
| OpenAI | 128,000 tokens |
| Google | 1,000,000 tokens |
| Groq | 131,072 tokens |
| Most others | 128,000 tokens |
Configuring Context Limits
Set the context limit to match your model’s capability:
```rust
let config = LLMSessionConfig::anthropic("key", "claude-sonnet-4-20250514")
    .with_context_limit(200_000);
```
For models with smaller context windows:
```rust
let config = LLMSessionConfig::openai("key", "gpt-4")
    .with_context_limit(8_192);
```
Token Limits
Agent Air tracks two types of limits:
Context Limit
The total number of tokens the model can process, including input and output. This is used for compaction decisions and is not sent to the API.
Max Tokens
The maximum number of tokens in the model’s response. This is sent to the API with each request to limit response length.
```rust
let config = LLMSessionConfig::anthropic("key", "model")
    .with_context_limit(200_000) // Total context
    .with_max_tokens(8192);      // Response limit
```
Example: With a 200,000 context limit and 4,096 max tokens, a request with 100,000 input tokens can receive up to 4,096 output tokens (not the full remaining 100,000).
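The interaction in this example can be sketched as a small helper; `output_budget` is an illustrative function, not a framework API:

```rust
// Effective output budget: the response is capped by max_tokens, and also by
// whatever context remains after the input. Illustrative, not Agent Air's code.
fn output_budget(context_limit: u64, max_tokens: u64, input_tokens: u64) -> u64 {
    let remaining = context_limit.saturating_sub(input_tokens);
    remaining.min(max_tokens)
}

fn main() {
    // The example from the text: 200k context, 4,096 max tokens, 100k input.
    println!("{}", output_budget(200_000, 4_096, 100_000)); // 4096
    // Near the limit, the remaining context becomes the binding cap instead.
    println!("{}", output_budget(200_000, 4_096, 198_000)); // 2000
}
```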
Default Max Tokens
All providers default to 4,096 max tokens for responses.
Tracking Usage
Agent Air tracks token usage at multiple levels to enable monitoring, cost tracking, and session health assessment.
Per-Session Tracking
Each session tracks its own token usage:
```rust
// Query session usage
let usage = controller.get_session_token_usage(session_id).await;
if let Some(meter) = usage {
    println!("Session {} used {} tokens", session_id, meter.total_tokens());
}
```
Per-Model Tracking
Track cumulative usage across all sessions using a specific model:
```rust
let usage = controller.get_model_token_usage("claude-sonnet-4-20250514").await;
if let Some(meter) = usage {
    println!("Model used {} input, {} output tokens",
        meter.input_tokens, meter.output_tokens);
}
```
Total Usage
Track cumulative usage across all models and sessions:
```rust
let total = controller.get_total_token_usage().await;
println!("Total tokens used: {}", total.total_tokens());
```
TokenMeter
The TokenMeter struct provides token counts:
```rust
pub struct TokenMeter {
    pub input_tokens: i64,
    pub output_tokens: i64,
}
```
Use total_tokens() to get the sum of input and output tokens.
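A minimal sketch of the struct and its accessor, assuming total_tokens() is simply the sum of the two fields:

```rust
// A minimal TokenMeter mirroring the struct shown above; total_tokens() is
// sketched as the sum of both counters (assumption, not the framework source).
pub struct TokenMeter {
    pub input_tokens: i64,
    pub output_tokens: i64,
}

impl TokenMeter {
    pub fn total_tokens(&self) -> i64 {
        self.input_tokens + self.output_tokens
    }
}

fn main() {
    let meter = TokenMeter { input_tokens: 1_523, output_tokens: 847 };
    println!("{}", meter.total_tokens()); // 2370
}
```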
Monitoring Utilization
Context Utilization
Context utilization is calculated as the ratio of used to available context:
```
utilization = context_used / context_limit
```
A utilization of 0.75 means 75% of the context window is consumed. This value drives compaction decisions.
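As a sketch (the function name is illustrative, and a zero limit is guarded to avoid division by zero):

```rust
// Utilization as a fraction of the context window.
fn utilization(context_used: u64, context_limit: u64) -> f64 {
    if context_limit == 0 {
        return 0.0; // Avoid division by zero for an unconfigured limit.
    }
    context_used as f64 / context_limit as f64
}

fn main() {
    // 150k of a 200k window consumed -> 0.75, the default compaction threshold.
    println!("{}", utilization(150_000, 200_000)); // 0.75
}
```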
SessionStatus
The SessionStatus struct provides real-time session metrics:
| Field | Description |
|---|---|
| context_used | Current context size in tokens |
| context_limit | The configured context window limit |
| utilization | Fraction of context used (0.0 to 1.0) |
| total_input | Cumulative input tokens across all requests |
| total_output | Cumulative output tokens across all requests |
| request_count | Number of API calls made |
Token Sources
Usage data comes from the LLM provider’s API response. The framework parses token counts from each response and updates tracking automatically.
Anthropic format:
```json
{
  "usage": {
    "input_tokens": 1523,
    "output_tokens": 847
  }
}
```
OpenAI format:
```json
{
  "usage": {
    "prompt_tokens": 1523,
    "completion_tokens": 847,
    "total_tokens": 2370
  }
}
```
Context Pressure
As context usage increases, the session approaches its limits:
Low utilization (< 50%): Plenty of headroom for conversation growth.
Moderate utilization (50-75%): Normal operation. Monitor if conversation is long-running.
High utilization (> 75%): Approaching compaction threshold. Compaction may trigger automatically.
Near limit (> 90%): Risk of context overflow. Consider manual compaction or starting a new session.
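The bands above can be sketched as a classifier; the ContextPressure enum is illustrative, not a framework type:

```rust
// Map a utilization fraction to the pressure bands described in the text.
// The enum and thresholds mirror the prose; the type itself is illustrative.
#[derive(Debug, PartialEq)]
enum ContextPressure {
    Low,       // < 50%
    Moderate,  // 50-75%
    High,      // > 75%
    NearLimit, // > 90%
}

fn pressure(utilization: f64) -> ContextPressure {
    if utilization > 0.90 {
        ContextPressure::NearLimit
    } else if utilization > 0.75 {
        ContextPressure::High
    } else if utilization >= 0.50 {
        ContextPressure::Moderate
    } else {
        ContextPressure::Low
    }
}

fn main() {
    println!("{:?}", pressure(0.42)); // Low
    println!("{:?}", pressure(0.80)); // High
}
```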
Why Context Windows Matter
Conversation continuity: The context window determines how much history the model can see. Longer windows allow more extensive conversations without losing earlier context.
Tool results: Tool calls can return large amounts of data. File contents, search results, and API responses consume context space quickly.
Compaction triggers: When utilization exceeds the threshold (default 75%), the framework compacts the conversation to free up space.
Context Window Overflow
If context usage exceeds the model’s limit, the API request fails with a context length error.
Agent Air’s compaction system prevents this by reducing context usage before it reaches the limit. With default settings (75% threshold), compaction triggers well before the context window fills completely.
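A sketch of the compaction gate implied by the default 75% threshold; `should_compact` is illustrative, not the framework's actual implementation:

```rust
// Trigger compaction once context usage crosses the configured threshold
// (0.75 by default per the text). Illustrative, not Agent Air's code.
fn should_compact(context_used: u64, context_limit: u64, threshold: f64) -> bool {
    context_used as f64 >= context_limit as f64 * threshold
}

fn main() {
    // 160k of 200k (80%) crosses the 75% threshold; 100k (50%) does not.
    println!("{}", should_compact(160_000, 200_000, 0.75)); // true
    println!("{}", should_compact(100_000, 200_000, 0.75)); // false
}
```

Because the check fires at the threshold rather than at the hard limit, compaction runs while there is still headroom to process the summarization request itself.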
