Token Tracking

This page documents how tokens are counted, stored, and tracked across sessions. Token tracking enables context management, usage monitoring, and compaction decisions.

Token Tracking Overview

┌─────────────────────────────────────────────────────────────────┐
│                    LLM API Response                              │
│  Contains usage: { input_tokens, output_tokens }                 │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    LLMSession                                    │
│  Store in atomic counters                                        │
│  current_input_tokens, current_output_tokens                     │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    FromLLMPayload                                │
│  Send TokenUpdate event to controller                            │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    TokenUsageTracker                             │
│  Aggregate by session, model, and total                          │
└─────────────────────────────────────────────────────────────────┘

Session Token Fields

LLMSession tracks tokens in atomic fields:

use std::sync::atomic::{AtomicI32, AtomicI64};

pub struct LLMSession {
    // Current context usage
    current_input_tokens: AtomicI64,
    current_output_tokens: AtomicI64,

    // Request counter
    request_count: AtomicI64,

    // Context limit
    context_limit: AtomicI32,
}

Field Purposes

Field                    Purpose
current_input_tokens     Total input tokens in the current context
current_output_tokens    Total output tokens generated
request_count            Number of API calls made
context_limit            Maximum allowed context tokens
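
A constructor might initialize these fields as follows. This is a minimal sketch: the new signature, the zero defaults, and the omission of the session's other fields (conversation, client, channels) are assumptions, not the actual API:

impl LLMSession {
    // Hypothetical constructor, for illustration; the real struct has
    // additional fields not shown here.
    pub fn new(context_limit: i32) -> Self {
        Self {
            current_input_tokens: AtomicI64::new(0),
            current_output_tokens: AtomicI64::new(0),
            request_count: AtomicI64::new(0),
            context_limit: AtomicI32::new(context_limit),
        }
    }
}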

Extracting Tokens from Responses

Streaming Responses

During streaming, tokens arrive in the MessageDelta event:

StreamEvent::MessageDelta { stop_reason, usage } => {
    if let Some(usage) = usage {
        // Update session counters
        self.current_input_tokens.store(
            usage.input_tokens as i64,
            Ordering::SeqCst
        );
        self.current_output_tokens.store(
            usage.output_tokens as i64,
            Ordering::SeqCst
        );

        // Send update to controller
        let payload = FromLLMPayload {
            response_type: LLMResponseType::TokenUpdate,
            session_id: self.id(),
            input_tokens: usage.input_tokens as i64,
            output_tokens: usage.output_tokens as i64,
            cache_read_tokens: usage.cache_read_tokens.unwrap_or(0) as i64,
            cache_write_tokens: usage.cache_write_tokens.unwrap_or(0) as i64,
            // ...
        };
        let _ = self.from_llm.send(payload).await;
    }
}

Non-Streaming Responses

For non-streaming requests, tokens are extracted from the complete response:

let response = self.client.complete(&request).await?;

if let Some(usage) = response.usage {
    self.current_input_tokens.store(usage.input_tokens as i64, Ordering::SeqCst);
    self.current_output_tokens.store(usage.output_tokens as i64, Ordering::SeqCst);
}
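
The per-session request counter is typically incremented alongside these updates; a minimal sketch (the exact call site is an assumption):

// Record that another API call has completed.
self.request_count.fetch_add(1, Ordering::SeqCst);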

Usage Structure

The Usage struct contains token counts from the API:

pub struct Usage {
    pub input_tokens: u32,
    pub output_tokens: u32,
    pub cache_read_tokens: Option<u32>,
    pub cache_write_tokens: Option<u32>,
}

Cache Tokens

Anthropic supports prompt caching:

  • cache_read_tokens: Tokens read from cache (reduced cost)
  • cache_write_tokens: Tokens written to cache
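
To make the cost impact concrete, a billing estimate might discount cache reads relative to normal input tokens. The rates and the estimate_cost helper below are hypothetical, for illustration only:

// Hypothetical per-million-token rates; not actual provider pricing.
const INPUT_RATE: f64 = 3.00;       // USD per 1M input tokens
const OUTPUT_RATE: f64 = 15.00;     // USD per 1M output tokens
const CACHE_READ_RATE: f64 = 0.30;  // cache reads are heavily discounted
const CACHE_WRITE_RATE: f64 = 3.75; // cache writes carry a small premium

fn estimate_cost(usage: &Usage) -> f64 {
    let per_m = 1_000_000.0;
    usage.input_tokens as f64 / per_m * INPUT_RATE
        + usage.output_tokens as f64 / per_m * OUTPUT_RATE
        + usage.cache_read_tokens.unwrap_or(0) as f64 / per_m * CACHE_READ_RATE
        + usage.cache_write_tokens.unwrap_or(0) as f64 / per_m * CACHE_WRITE_RATE
}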

Message Token Storage

Assistant messages store per-message token counts:

pub struct AssistantMessage {
    // ... other fields

    pub input_tokens: i64,
    pub output_tokens: i64,
    pub cache_read_tokens: i64,
    pub cache_write_tokens: i64,
}

This enables:

  • Historical token usage analysis
  • Per-message cost calculation
  • Cache efficiency metrics (see the sketch below)
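
For instance, a cache hit ratio can be computed per message. This is a sketch; it assumes input_tokens includes cache reads, which may differ in the actual implementation:

fn cache_hit_ratio(msg: &AssistantMessage) -> f64 {
    // Fraction of this message's prompt that was served from cache.
    if msg.input_tokens == 0 {
        return 0.0;
    }
    msg.cache_read_tokens as f64 / msg.input_tokens as f64
}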

Token Update Events

Token updates propagate through the event system:

pub enum LLMResponseType {
    // ... other variants
    TokenUpdate,
}

pub struct FromLLMPayload {
    pub response_type: LLMResponseType,
    pub session_id: i64,
    pub input_tokens: i64,
    pub output_tokens: i64,
    pub cache_read_tokens: i64,
    pub cache_write_tokens: i64,
    // ...
}

The controller converts these to UI events:

pub enum ControllerEvent {
    // ... other variants
    TokenUpdate {
        session_id: i64,
        input_tokens: i64,
        output_tokens: i64,
    },
}
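
The conversion itself might look like the following sketch; the handler name, the model_for lookup, and the event_tx channel wiring are assumptions:

async fn handle_llm_payload(&self, payload: FromLLMPayload) {
    if let LLMResponseType::TokenUpdate = payload.response_type {
        // Feed the aggregate tracker (see TokenUsageTracker below).
        let model = self.model_for(payload.session_id); // hypothetical lookup
        self.token_tracker
            .increment(
                payload.session_id,
                &model,
                payload.input_tokens,
                payload.output_tokens,
            )
            .await;

        // Forward a trimmed-down event to the UI.
        let _ = self
            .event_tx
            .send(ControllerEvent::TokenUpdate {
                session_id: payload.session_id,
                input_tokens: payload.input_tokens,
                output_tokens: payload.output_tokens,
            })
            .await;
    }
}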

TokenUsageTracker

The tracker aggregates usage across sessions and models:

pub struct TokenUsageTracker {
    tokens_per_session: RwLock<HashMap<i64, TokenMeter>>,
    tokens_per_model: RwLock<HashMap<String, TokenMeter>>,
    total_usage: RwLock<TokenMeter>,
}

TokenMeter

#[derive(Debug, Clone, Default)]
pub struct TokenMeter {
    pub input_tokens: i64,
    pub output_tokens: i64,
    pub request_count: i64,
}

impl TokenMeter {
    pub fn add(&mut self, input: i64, output: i64) {
        self.input_tokens += input;
        self.output_tokens += output;
        self.request_count += 1;
    }

    pub fn total(&self) -> i64 {
        self.input_tokens + self.output_tokens
    }
}
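
For example, accumulating two requests into one meter:

let mut meter = TokenMeter::default();
meter.add(1_200, 300);  // first request
meter.add(2_000, 450);  // second request

assert_eq!(meter.total(), 3_950);
assert_eq!(meter.request_count, 2);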

Incrementing Usage

impl TokenUsageTracker {
    pub async fn increment(
        &self,
        session_id: i64,
        model: &str,
        input_tokens: i64,
        output_tokens: i64,
    ) {
        // Update session-level usage
        {
            let mut sessions = self.tokens_per_session.write().await;
            sessions
                .entry(session_id)
                .or_default()
                .add(input_tokens, output_tokens);
        }

        // Update model-level usage
        {
            let mut models = self.tokens_per_model.write().await;
            models
                .entry(model.to_string())
                .or_default()
                .add(input_tokens, output_tokens);
        }

        // Update total usage
        {
            let mut total = self.total_usage.write().await;
            total.add(input_tokens, output_tokens);
        }
    }
}

Querying Usage

impl TokenUsageTracker {
    pub async fn get_session_usage(&self, session_id: i64) -> Option<TokenMeter> {
        self.tokens_per_session.read().await.get(&session_id).cloned()
    }

    pub async fn get_model_usage(&self, model: &str) -> Option<TokenMeter> {
        self.tokens_per_model.read().await.get(model).cloned()
    }

    pub async fn get_total_usage(&self) -> TokenMeter {
        self.total_usage.read().await.clone()
    }
}
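
A caller might surface these in a status line (the session id and formatting below are illustrative):

if let Some(meter) = tracker.get_session_usage(42).await {
    println!(
        "session 42: {} in / {} out over {} requests",
        meter.input_tokens, meter.output_tokens, meter.request_count
    );
}

let total = tracker.get_total_usage().await;
println!("all sessions: {} tokens total", total.total());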

Context Limit Enforcement

Token tracking enables context limit checks:

async fn check_context(&self) -> bool {
    let context_used = self.current_input_tokens.load(Ordering::SeqCst);
    let context_limit = self.context_limit.load(Ordering::SeqCst);
    context_used < context_limit as i64
}

Utilization Calculation

fn calculate_utilization(&self) -> f64 {
    let context_used = self.current_input_tokens.load(Ordering::SeqCst);
    let context_limit = self.context_limit.load(Ordering::SeqCst);
    context_used as f64 / context_limit as f64
}

Compaction Decision

fn should_compact(&self, context_used: i64, context_limit: i32) -> bool {
    let utilization = context_used as f64 / context_limit as f64;
    utilization > self.threshold  // Default: 0.75
}
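
For example, with 160,000 tokens in context against a 200,000-token limit, utilization is 160,000 / 200,000 = 0.8, which exceeds the default threshold of 0.75 and triggers compaction.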

Provider Context Limits

Default context limits by provider:

Provider     Model        Context Limit
Anthropic    Claude 3.5   200,000
OpenAI       GPT-4o       128,000

Set in configuration:

LLMSessionConfig::anthropic(&api_key, "claude-sonnet-4-20250514")
    .with_context_limit(200_000)

LLMSessionConfig::openai(&api_key, "gpt-4o")
    .with_context_limit(128_000)

Resetting Token Counts

Token counts reset when the conversation is cleared:

pub async fn clear_conversation(&self) {
    // Clear message history
    *self.conversation.write().await = Arc::new(Vec::new());

    // Reset token counters
    self.current_input_tokens.store(0, Ordering::SeqCst);
    self.current_output_tokens.store(0, Ordering::SeqCst);
}

Session Token Accessors

Read current token state:

impl LLMSession {
    pub fn input_tokens(&self) -> i64 {
        self.current_input_tokens.load(Ordering::SeqCst)
    }

    pub fn output_tokens(&self) -> i64 {
        self.current_output_tokens.load(Ordering::SeqCst)
    }

    pub fn request_count(&self) -> i64 {
        self.request_count.load(Ordering::SeqCst)
    }

    pub fn context_limit(&self) -> i32 {
        self.context_limit.load(Ordering::SeqCst)
    }
}
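
These accessors combine naturally into a headroom check (a sketch):

// Tokens still available before the context limit is reached.
let remaining = session.context_limit() as i64 - session.input_tokens();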

Thread Safety

Token tracking uses atomic operations for thread safety:

// Read current value
let tokens = self.current_input_tokens.load(Ordering::SeqCst);

// Update value
self.current_input_tokens.store(new_value, Ordering::SeqCst);

// Atomic increment
self.request_count.fetch_add(1, Ordering::SeqCst);

Ordering::SeqCst ensures:

  • All threads observe SeqCst operations in a single, globally consistent order
  • No torn reads or writes (atomics guarantee this at any ordering)
  • Full memory barriers prevent reordering around each operation

Token Tracking Flow

Complete flow for a single request:

1. User message added to conversation
2. Request sent to LLM API
3. Response contains usage data
4. Session atomics updated
5. TokenUpdate event sent
6. Controller increments tracker
7. UI receives token update
8. Compaction checked against limit

Next Steps