Session Context

Every LLM has a finite context window that limits how much text it can process in a single request. Agent Air tracks context usage to prevent exceeding model limits and to trigger compaction when needed.

Context Windows

The context window is the maximum amount of text an LLM can process in a single request. This includes:

  • System prompt
  • All previous messages in the conversation
  • Tool calls and their results
  • The current user message
  • The model’s response

A token is roughly 4 characters of English text. When the total token count approaches or exceeds the context window, the model cannot process the request.
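The 4-characters-per-token rule of thumb can be turned into a quick estimate. This is only a heuristic sketch (real tokenizers vary by model and language; `estimate_tokens` is an illustrative helper, not part of the Agent Air API):

```rust
/// Rough token estimate using the ~4 characters-per-token heuristic.
/// Real tokenizers vary by model; use this only for ballpark sizing.
fn estimate_tokens(text: &str) -> usize {
    // Round up so short non-empty strings count as at least one token.
    (text.len() + 3) / 4
}

fn main() {
    let prompt = "Summarize the following document in three bullet points.";
    println!("~{} tokens", estimate_tokens(prompt));
}
```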

Default Context Limits

Agent Air sets default context limits based on the provider:

Provider        Default Context Limit
Anthropic       200,000 tokens
OpenAI          128,000 tokens
Google          1,000,000 tokens
Groq            131,072 tokens
Most others     128,000 tokens

Configuring Context Limits

Set the context limit to match your model’s capability:

let config = LLMSessionConfig::anthropic("key", "claude-sonnet-4-20250514")
    .with_context_limit(200_000);

For models with smaller context windows:

let config = LLMSessionConfig::openai("key", "gpt-4")
    .with_context_limit(8_192);

Token Limits

Agent Air tracks two types of limits:

Context Limit

The total number of tokens the model can process, including input and output. This is used for compaction decisions and is not sent to the API.

Max Tokens

The maximum number of tokens in the model’s response. This is sent to the API with each request to limit response length.

let config = LLMSessionConfig::anthropic("key", "model")
    .with_context_limit(200_000)  // Total context
    .with_max_tokens(8192);        // Response limit

Example: With a 200,000 context limit and 4,096 max tokens, a request with 100,000 input tokens can receive up to 4,096 output tokens (not the full remaining 100,000).
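The interaction between the two limits can be sketched as a small function: the response is capped by max tokens, and never exceeds the context space left after the input. (`output_budget` and its parameter names are illustrative, not part of the Agent Air API.)

```rust
/// Effective output budget for a single request: at most `max_tokens`,
/// and never more than the context space remaining after the input.
fn output_budget(context_limit: u64, max_tokens: u64, input_tokens: u64) -> u64 {
    let remaining = context_limit.saturating_sub(input_tokens);
    max_tokens.min(remaining)
}

fn main() {
    // The example above: 200,000 limit, 4,096 max tokens, 100,000 input.
    println!("{}", output_budget(200_000, 4_096, 100_000)); // capped by max_tokens
    // Near the limit, the remaining context becomes the cap instead.
    println!("{}", output_budget(200_000, 4_096, 198_000));
}
```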

Default Max Tokens

All providers default to 4,096 max tokens for responses.

Tracking Usage

Agent Air tracks token usage at multiple levels to enable monitoring, cost tracking, and session health assessment.

Per-Session Tracking

Each session tracks its own token usage:

// Query session usage
let usage = controller.get_session_token_usage(session_id).await;
if let Some(meter) = usage {
    println!("Session {} used {} tokens", session_id, meter.total_tokens());
}

Per-Model Tracking

Track cumulative usage across all sessions using a specific model:

let usage = controller.get_model_token_usage("claude-sonnet-4-20250514").await;
if let Some(meter) = usage {
    println!("Model used {} input, {} output tokens",
        meter.input_tokens, meter.output_tokens);
}

Total Usage

Track cumulative usage across all models and sessions:

let total = controller.get_total_token_usage().await;
println!("Total tokens used: {}", total.total_tokens());

TokenMeter

The TokenMeter struct provides token counts:

pub struct TokenMeter {
    pub input_tokens: i64,
    pub output_tokens: i64,
}

Use total_tokens() to get the sum of input and output tokens.
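Given the struct definition above, `total_tokens()` is presumably a simple sum of the two fields; a minimal sketch of that accessor:

```rust
pub struct TokenMeter {
    pub input_tokens: i64,
    pub output_tokens: i64,
}

impl TokenMeter {
    /// Sum of input and output tokens, as described above.
    pub fn total_tokens(&self) -> i64 {
        self.input_tokens + self.output_tokens
    }
}

fn main() {
    let meter = TokenMeter { input_tokens: 1_523, output_tokens: 847 };
    println!("{}", meter.total_tokens());
}
```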

Monitoring Utilization

Context Utilization

Context utilization is calculated as the ratio of used to available context:

utilization = context_used / context_limit

A utilization of 0.75 means 75% of the context window is consumed. This value drives compaction decisions.
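The ratio above is straightforward to compute; a sketch (illustrative helper, not the framework's own code):

```rust
/// Context utilization as a 0.0-1.0 ratio.
fn utilization(context_used: u64, context_limit: u64) -> f64 {
    context_used as f64 / context_limit as f64
}

fn main() {
    let u = utilization(150_000, 200_000);
    // 150,000 of 200,000 tokens used => 0.75, the default compaction threshold.
    println!("utilization = {u}");
}
```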

SessionStatus

The SessionStatus struct provides real-time session metrics:

Field            Description
context_used     Current context size in tokens
context_limit    The configured context window limit
utilization      Fraction of context used (0.0 to 1.0)
total_input      Cumulative input tokens across all requests
total_output     Cumulative output tokens across all requests
request_count    Number of API calls made

Token Sources

Usage data comes from the LLM provider’s API response. The framework parses token counts from each response and updates tracking automatically.

Anthropic format:

{
  "usage": {
    "input_tokens": 1523,
    "output_tokens": 847
  }
}

OpenAI format:

{
  "usage": {
    "prompt_tokens": 1523,
    "completion_tokens": 847,
    "total_tokens": 2370
  }
}
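Normalizing the two formats into one shape might look like the sketch below. A real implementation would use a JSON library such as serde_json; this version just scans the raw payload for the key, and both helper functions are illustrative, not Agent Air's actual parsing code:

```rust
/// Minimal field extraction from a usage payload. A real implementation
/// would use a proper JSON parser; this sketch scans for `"key":`.
fn extract_count(json: &str, key: &str) -> Option<i64> {
    let needle = format!("\"{}\":", key);
    let start = json.find(&needle)? + needle.len();
    let rest = json[start..].trim_start();
    let end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    rest[..end].parse().ok()
}

/// Normalize either provider format into (input_tokens, output_tokens).
fn parse_usage(json: &str) -> Option<(i64, i64)> {
    let input = extract_count(json, "input_tokens")
        .or_else(|| extract_count(json, "prompt_tokens"))?;
    let output = extract_count(json, "output_tokens")
        .or_else(|| extract_count(json, "completion_tokens"))?;
    Some((input, output))
}

fn main() {
    let anthropic = r#"{"usage": {"input_tokens": 1523, "output_tokens": 847}}"#;
    let openai = r#"{"usage": {"prompt_tokens": 1523, "completion_tokens": 847, "total_tokens": 2370}}"#;
    println!("{:?}", parse_usage(anthropic));
    println!("{:?}", parse_usage(openai));
}
```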

Context Pressure

As context usage increases, the session approaches its limits:

Low utilization (< 50%): Plenty of headroom for conversation growth.

Moderate utilization (50-75%): Normal operation. Monitor if conversation is long-running.

High utilization (> 75%): Approaching compaction threshold. Compaction may trigger automatically.

Near limit (> 90%): Risk of context overflow. Consider manual compaction or starting a new session.
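The bands above can be mapped onto labels for monitoring or logging. The band boundaries follow the text; the function itself is an illustrative sketch, not part of the Agent Air API:

```rust
/// Classify a 0.0-1.0 utilization ratio into the pressure bands above.
fn pressure(utilization: f64) -> &'static str {
    match utilization {
        u if u > 0.90 => "near limit", // risk of context overflow
        u if u > 0.75 => "high",       // past the default compaction threshold
        u if u >= 0.50 => "moderate",  // normal operation, worth monitoring
        _ => "low",                    // plenty of headroom
    }
}

fn main() {
    for u in [0.30, 0.60, 0.80, 0.95] {
        println!("{u:.2} -> {}", pressure(u));
    }
}
```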

Why Context Windows Matter

Conversation continuity: The context window determines how much history the model can see. Longer windows allow more extensive conversations without losing earlier context.

Tool results: Tool calls can return large amounts of data. File contents, search results, and API responses consume context space quickly.

Compaction triggers: When utilization exceeds the threshold (default 75%), the framework compacts the conversation to free up space.

Context Window Overflow

If context usage exceeds the model’s limit, the API request fails with a context length error.

Agent Air’s compaction system prevents this by reducing context usage before it reaches the limit. With default settings (75% threshold), compaction triggers well before the context window fills completely.