Token Tracking
This page documents how tokens are counted, stored, and tracked across sessions. Token tracking enables context management, usage monitoring, and compaction decisions.
Token Tracking Overview
┌─────────────────────────────────────────────────────────────────┐
│                        LLM API Response                         │
│         Contains usage: { input_tokens, output_tokens }         │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                           LLMSession                            │
│                    Store in atomic counters                     │
│           current_input_tokens, current_output_tokens           │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                         FromLLMPayload                          │
│              Send TokenUpdate event to controller               │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                        TokenUsageTracker                        │
│             Aggregate by session, model, and total              │
└─────────────────────────────────────────────────────────────────┘
Session Token Fields
LLMSession tracks tokens in atomic fields:
pub struct LLMSession {
// Current context usage
current_input_tokens: AtomicI64,
current_output_tokens: AtomicI64,
// Request counter
request_count: AtomicI64,
// Context limit
context_limit: AtomicI32,
}
Field Purposes
| Field | Purpose |
|---|---|
| current_input_tokens | Total input tokens in current context |
| current_output_tokens | Total output tokens generated |
| request_count | Number of API calls made |
| context_limit | Maximum allowed context tokens |
Extracting Tokens from Responses
Streaming Responses
During streaming, tokens arrive in the MessageDelta event:
StreamEvent::MessageDelta { stop_reason, usage } => {
if let Some(usage) = usage {
// Update session counters
self.current_input_tokens.store(
usage.input_tokens as i64,
Ordering::SeqCst
);
self.current_output_tokens.store(
usage.output_tokens as i64,
Ordering::SeqCst
);
// Send update to controller
let payload = FromLLMPayload {
response_type: LLMResponseType::TokenUpdate,
session_id: self.id(),
input_tokens: usage.input_tokens as i64,
output_tokens: usage.output_tokens as i64,
cache_read_tokens: usage.cache_read_tokens.unwrap_or(0) as i64,
cache_write_tokens: usage.cache_write_tokens.unwrap_or(0) as i64,
// ...
};
let _ = self.from_llm.send(payload).await;
}
}
Non-Streaming Responses
For non-streaming requests, tokens are extracted from the complete response:
let response = self.client.complete(&request).await?;
if let Some(usage) = response.usage {
self.current_input_tokens.store(usage.input_tokens as i64, Ordering::SeqCst);
self.current_output_tokens.store(usage.output_tokens as i64, Ordering::SeqCst);
}
Usage Structure
The Usage struct contains token counts from the API:
pub struct Usage {
pub input_tokens: u32,
pub output_tokens: u32,
pub cache_read_tokens: Option<u32>,
pub cache_write_tokens: Option<u32>,
}
Cache Tokens
Anthropic supports prompt caching:
- cache_read_tokens: Tokens read from cache (reduced cost)
- cache_write_tokens: Tokens written to cache
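As a rough efficiency metric, the share of input-side context served from cache can be derived from a Usage value. A minimal sketch, assuming cache_read_tokens is reported separately from input_tokens (as in Anthropic's usage reporting); the helper below is illustrative, not part of the session API:
// Illustrative helper, not part of the session API: fraction of input-side
// tokens that were served from the prompt cache for a single response.
fn cache_hit_ratio(usage: &Usage) -> f64 {
    // Assumes input_tokens counts only uncached tokens, with cache reads
    // reported separately.
    let cache_read = usage.cache_read_tokens.unwrap_or(0) as f64;
    let total_input = usage.input_tokens as f64 + cache_read;
    if total_input == 0.0 {
        0.0
    } else {
        cache_read / total_input
    }
}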
Message Token Storage
Assistant messages store per-message token counts:
pub struct AssistantMessage {
// ... other fields
pub input_tokens: i64,
pub output_tokens: i64,
pub cache_read_tokens: i64,
pub cache_write_tokens: i64,
}
This enables:
- Historical token usage analysis
- Per-message cost calculation (see the sketch below)
- Cache efficiency metrics
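For example, a per-message cost estimate can be computed directly from these stored fields. The per-million-token rates below are placeholders for illustration, not values defined by this codebase; real prices depend on the model and provider:
// Placeholder rates in USD per million tokens; adjust per model/provider.
const INPUT_RATE: f64 = 3.00;       // uncached input tokens (assumed)
const OUTPUT_RATE: f64 = 15.00;     // output tokens (assumed)
const CACHE_READ_RATE: f64 = 0.30;  // cache-read tokens (assumed)
const CACHE_WRITE_RATE: f64 = 3.75; // cache-write tokens (assumed)

// Illustrative helper: estimated cost of a single assistant message.
fn message_cost_usd(msg: &AssistantMessage) -> f64 {
    (msg.input_tokens as f64 * INPUT_RATE
        + msg.output_tokens as f64 * OUTPUT_RATE
        + msg.cache_read_tokens as f64 * CACHE_READ_RATE
        + msg.cache_write_tokens as f64 * CACHE_WRITE_RATE)
        / 1_000_000.0
}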
Token Update Events
Token updates propagate through the event system:
pub enum LLMResponseType {
// ... other variants
TokenUpdate,
}
pub struct FromLLMPayload {
pub response_type: LLMResponseType,
pub session_id: i64,
pub input_tokens: i64,
pub output_tokens: i64,
pub cache_read_tokens: i64,
pub cache_write_tokens: i64,
// ...
}
The controller converts these to UI events:
pub enum ControllerEvent {
// ... other variants
TokenUpdate {
session_id: i64,
input_tokens: i64,
output_tokens: i64,
},
}
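A minimal sketch of this conversion step, assuming the controller owns a TokenUsageTracker (described below) and a tokio channel for UI events; handle_token_update and the way the model name is obtained are hypothetical glue, not confirmed names from the codebase:
use tokio::sync::mpsc::Sender;

// Hypothetical handler: only FromLLMPayload, ControllerEvent, and
// TokenUsageTracker correspond to types described on this page.
async fn handle_token_update(
    tracker: &TokenUsageTracker,
    ui_events: &Sender<ControllerEvent>,
    model: &str,
    payload: &FromLLMPayload,
) {
    // Aggregate usage per session, per model, and in total.
    tracker
        .increment(
            payload.session_id,
            model,
            payload.input_tokens,
            payload.output_tokens,
        )
        .await;

    // Forward a UI-facing token update.
    let _ = ui_events
        .send(ControllerEvent::TokenUpdate {
            session_id: payload.session_id,
            input_tokens: payload.input_tokens,
            output_tokens: payload.output_tokens,
        })
        .await;
}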
TokenUsageTracker
The tracker aggregates usage across sessions and models:
pub struct TokenUsageTracker {
tokens_per_session: RwLock<HashMap<i64, TokenMeter>>,
tokens_per_model: RwLock<HashMap<String, TokenMeter>>,
total_usage: RwLock<TokenMeter>,
}
TokenMeter
#[derive(Debug, Clone, Default)]
pub struct TokenMeter {
pub input_tokens: i64,
pub output_tokens: i64,
pub request_count: i64,
}
impl TokenMeter {
pub fn add(&mut self, input: i64, output: i64) {
self.input_tokens += input;
self.output_tokens += output;
self.request_count += 1;
}
pub fn total(&self) -> i64 {
self.input_tokens + self.output_tokens
}
}
Incrementing Usage
impl TokenUsageTracker {
pub async fn increment(
&self,
session_id: i64,
model: &str,
input_tokens: i64,
output_tokens: i64,
) {
// Update session-level usage
{
let mut sessions = self.tokens_per_session.write().await;
sessions
.entry(session_id)
.or_default()
.add(input_tokens, output_tokens);
}
// Update model-level usage
{
let mut models = self.tokens_per_model.write().await;
models
.entry(model.to_string())
.or_default()
.add(input_tokens, output_tokens);
}
// Update total usage
{
let mut total = self.total_usage.write().await;
total.add(input_tokens, output_tokens);
}
}
}
Querying Usage
impl TokenUsageTracker {
pub async fn get_session_usage(&self, session_id: i64) -> Option<TokenMeter> {
self.tokens_per_session.read().await.get(&session_id).cloned()
}
pub async fn get_model_usage(&self, model: &str) -> Option<TokenMeter> {
self.tokens_per_model.read().await.get(model).cloned()
}
pub async fn get_total_usage(&self) -> TokenMeter {
self.total_usage.read().await.clone()
}
}
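A brief usage example, assuming the tracker is constructed via Default (a constructor is not shown above) and that the code runs inside an async context:
// Assumes #[derive(Default)] (or an equivalent constructor) on TokenUsageTracker.
let tracker = TokenUsageTracker::default();

// Two requests for session 42 on the same model.
tracker.increment(42, "claude-sonnet-4-20250514", 1_200, 350).await;
tracker.increment(42, "claude-sonnet-4-20250514", 2_400, 500).await;

let session = tracker.get_session_usage(42).await.unwrap_or_default();
assert_eq!(session.input_tokens, 3_600);
assert_eq!(session.output_tokens, 850);
assert_eq!(session.request_count, 2);
assert_eq!(tracker.get_total_usage().await.total(), 4_450);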
Context Limit Enforcement
Token tracking enables context limit checks:
async fn check_context(&self) -> bool {
let context_used = self.current_input_tokens.load(Ordering::SeqCst);
let context_limit = self.context_limit.load(Ordering::SeqCst);
context_used < context_limit as i64
}
Utilization Calculation
fn calculate_utilization(&self) -> f64 {
let context_used = self.current_input_tokens.load(Ordering::SeqCst);
let context_limit = self.context_limit.load(Ordering::SeqCst);
context_used as f64 / context_limit as f64
}
Compaction Decision
fn should_compact(&self, context_used: i64, context_limit: i32) -> bool {
let utilization = context_used as f64 / context_limit as f64;
utilization > self.threshold // Default: 0.75
}
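For example, with the default threshold of 0.75 and a 200,000-token context limit, a context of 160,000 tokens gives a utilization of 160,000 / 200,000 = 0.80, so compaction is triggered; at 140,000 tokens (0.70) it is not.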
Provider Context Limits
Default context limits by provider:
| Provider | Model | Context Limit |
|---|---|---|
| Anthropic | Claude 3.5 | 200,000 |
| OpenAI | GPT-4o | 128,000 |
Set in configuration:
LLMSessionConfig::anthropic(&api_key, "claude-sonnet-4-20250514")
.with_context_limit(200_000)
LLMSessionConfig::openai(&api_key, "gpt-4o")
.with_context_limit(128_000)
Resetting Token Counts
Token counts reset when the conversation is cleared:
pub async fn clear_conversation(&self) {
// Clear message history
*self.conversation.write().await = Arc::new(Vec::new());
// Reset token counters
self.current_input_tokens.store(0, Ordering::SeqCst);
self.current_output_tokens.store(0, Ordering::SeqCst);
}
Session Token Accessors
Read current token state:
impl LLMSession {
pub fn input_tokens(&self) -> i64 {
self.current_input_tokens.load(Ordering::SeqCst)
}
pub fn output_tokens(&self) -> i64 {
self.current_output_tokens.load(Ordering::SeqCst)
}
pub fn request_count(&self) -> i64 {
self.request_count.load(Ordering::SeqCst)
}
pub fn context_limit(&self) -> i32 {
self.context_limit.load(Ordering::SeqCst)
}
}
Thread Safety
Token tracking uses atomic operations for thread safety:
// Read current value
let tokens = self.current_input_tokens.load(Ordering::SeqCst);
// Update value
self.current_input_tokens.store(new_value, Ordering::SeqCst);
// Atomic increment
self.request_count.fetch_add(1, Ordering::SeqCst);
Ordering::SeqCst ensures:
- All threads see updates in a consistent order
- No torn reads or writes
- Memory barriers prevent reordering
Token Tracking Flow
Complete flow for a single request:
1. User message added to conversation
2. Request sent to LLM API
3. Response contains usage data
4. Session atomics updated
5. TokenUpdate event sent
6. Controller increments tracker
7. UI receives token update
8. Compaction checked against limit
Next Steps
- Status Tracking - Session status monitoring
- Compaction Algorithm - Context reduction
- Context Management - Conversation storage
