Token budgeting: sliding window vs retrieval
When you build a multi-turn AI assistant, you run into a hard limit: your model's context window, i.e., the maximum number of tokens it can process at once.
If you push in too much text:
Generation costs go up.
Requests over the limit fail, or the prompt gets silently truncated.
Response quality drops when important context is cut.
So, you need a token budgeting strategy to keep the conversation relevant and efficient.
✅ What is Token Budgeting?
Token budgeting = deciding what information stays in the prompt and what gets dropped or compressed.
You’re balancing:
✅ New user input
✅ Assistant’s latest replies
✅ Past context (history)
✅ Optional: Retrieved info (RAG)
All of this must fit inside the LLM's max context, e.g., 2k, 4k, 8k, or 32k tokens; the sketch below shows one way to split that budget.
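Here is a minimal sketch of one such split, assuming a 4k-token model; all the numbers are illustrative assumptions, not fixed rules:

```python
# Illustrative budget split for a 4k-context model (all numbers are assumptions).
MAX_CONTEXT = 4096

RESERVED_FOR_RESPONSE = 512  # leave headroom for the model's answer
SYSTEM_PROMPT_BUDGET = 256   # short, reusable system prompt
RETRIEVAL_BUDGET = 1024      # optional RAG snippets

# Whatever remains goes to recent chat history.
HISTORY_BUDGET = (
    MAX_CONTEXT - RESERVED_FOR_RESPONSE - SYSTEM_PROMPT_BUDGET - RETRIEVAL_BUDGET
)  # 2304 tokens in this example
```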
✅ Two Popular Strategies
1️⃣ Sliding Window
What it is:
Keep only the most recent turns in your chat history.
Drop older messages as new ones come in.
How it works:
If your context grows too big:
Keep only the last N turns that still fit.
Example: “last 5 exchanges”. (A code sketch follows the pros and cons below.)
Pros:
✔️ Simple, fast to implement.
✔️ Works well for short Q&A or casual chat.
✔️ Good for general-purpose assistants.
Cons:
❌ Loses important long-term facts (like names, preferences, or tasks).
❌ Can feel forgetful in long conversations.
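Here is a minimal sketch of the idea. The `count_tokens` word-count heuristic and the message format are assumptions; in practice you would use a real tokenizer (see the tools section below):

```python
def count_tokens(text: str) -> int:
    # Rough estimate: ~1.3 tokens per word (see the tips below).
    # Swap in a real tokenizer (e.g., tiktoken) in production.
    return int(len(text.split()) * 1.3)

def sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose total token count fits the budget."""
    kept, used = [], 0
    # Walk backwards from the newest message; stop once the budget is full.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

# Usage: keep whatever recent turns fit into ~2300 tokens of history.
history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
recent = sliding_window(history, budget=2304)
```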
2️⃣ Retrieval-Augmented Context
What it is: Combine your sliding window with on-demand retrieval.
👉 Instead of keeping all history, you:
Store old turns or facts in a vector store.
Embed the current question.
Retrieve relevant old messages or notes to re-insert.
This works like smart long-term memory.
How it works:
1️⃣ User asks: “What did I say about my project deadline?”
2️⃣ Embed that query.
3️⃣ Search your stored past turns or knowledge base.
4️⃣ Inject the top relevant snippets back into the prompt. (A code sketch follows the pros and cons.)
Pros:
✔️ Keeps history focused and factual.
✔️ Handles very long conversations or knowledge bases.
✔️ Reduces wasted tokens (only brings back relevant info).
Cons:
❌ More complex: needs a vector DB + embeddings.
❌ Small risk of missing nuance if retrieval fails.
❌ Takes extra compute to run retrieval each turn.
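A minimal sketch of those four steps. The `embed` function here is a toy stand-in; a real setup would call an embedding model and store vectors in a vector DB, not a Python list:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding for this sketch: hash each word into a fixed-size vector.
    Swap in a real embedding model in practice."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

# Step 1: store old turns alongside their embeddings.
memory: list[tuple[str, np.ndarray]] = []

def remember(turn: str) -> None:
    memory.append((turn, embed(turn)))

# Steps 2-4: embed the query, rank stored turns by cosine similarity,
# and return the top snippets to re-insert into the prompt.
def retrieve(query: str, top_k: int = 3) -> list[str]:
    q = embed(query)
    scored = [
        (float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), text)
        for text, v in memory
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

remember("My project deadline is March 14.")
print(retrieve("What did I say about my project deadline?"))
```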
✅ Combining Both
Best practice: Use a sliding window for the recent chat, plus retrieval for older turns or domain knowledge.
Example Prompt:
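One possible layout; the section markers and wording are illustrative, not a fixed format:

```
[System] You are a helpful project assistant. Keep answers concise.

[Retrieved memory (from the vector store)]
- "My project deadline is March 14." (earlier session)

[Recent chat (sliding window)]
User: Can you remind me what's due soon?
Assistant: Sure, which project are you asking about?

[Current turn]
User: The one with the deadline I mentioned before.
```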
✅ Practical Tips for Token Budgeting
✔️ Check your model's max context (e.g., 4k tokens for many open models).
✔️ Calculate average token usage: 1 word ≈ 1.3 tokens is a rough rule, and code snippets use more tokens than prose.
✔️ Keep system prompts short and reusable.
✔️ Set hard limits: truncate old turns or chunk retrieved text.
✔️ For big RAG pipelines, chunk your KB into small passages (100–300 words) to avoid bloat; a chunking sketch follows this list.
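A minimal sketch of that last tip, splitting a document into passages of roughly 200 words (a number picked from the 100–300 range above):

```python
def chunk_passages(text: str, max_words: int = 200) -> list[str]:
    """Split text into word-bounded passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```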
✅ Tools That Help
tiktoken for token counting (OpenAI's tokenizer), or a transformers tokenizer for open models. For example:
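A quick sketch of both; the encoding and model names are common examples, not requirements:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Token budgeting keeps prompts inside the context window."

# tiktoken: count tokens the way recent OpenAI models do.
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(text)))

# transformers: count tokens for an open model (GPT-2 as an example).
tok = AutoTokenizer.from_pretrained("gpt2")
print(len(tok.encode(text)))
```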
🗝️ Key Takeaway
Good context = relevant context. A sliding window keeps the chat fresh; retrieval brings old knowledge back only when needed. Together they make your assistant feel smart and efficient.
➡️ Next: Learn how to test your multi-turn flow — and add guardrails so your assistant stays consistent!