Token budgeting: sliding window vs retrieval
When you build a multi-turn AI assistant, you run into a hard limit: your model's context window, i.e., the maximum number of tokens it can process at once.
If you push in too much text:
Generation costs go up.
Requests over the limit fail, or the prompt gets silently truncated.
Response quality drops when important context is cut.
So, you need a token budgeting strategy to keep the conversation relevant and efficient.
✅ What is Token Budgeting?
Token budgeting = deciding what information stays in the prompt and what gets dropped or compressed.
You’re balancing:
✅ New user input
✅ Assistant’s latest replies
✅ Past context (history)
✅ Optional: Retrieved info (RAG)
All of this must fit inside the LLM's max context, e.g., 2k, 4k, 8k, or 32k tokens; the sketch below shows one way to split that budget.
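Here is a minimal sketch of one such split, assuming a 4k-token model; all the numbers are illustrative assumptions, not fixed rules:

```python
# Illustrative budget split for a 4k-context model (all numbers are assumptions).
MAX_CONTEXT = 4096

RESERVED_FOR_RESPONSE = 512  # leave headroom for the model's answer
SYSTEM_PROMPT_BUDGET = 256   # short, reusable system prompt
RETRIEVAL_BUDGET = 1024      # optional RAG snippets

# Whatever remains goes to recent chat history.
HISTORY_BUDGET = (
    MAX_CONTEXT - RESERVED_FOR_RESPONSE - SYSTEM_PROMPT_BUDGET - RETRIEVAL_BUDGET
)  # 2304 tokens in this example
```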
✅ Two Popular Strategies
1️⃣ Sliding Window
What it is:
Keep only the most recent turns in your chat history.
Drop older messages as new ones come in.
How it works:
If your context grows too big:
Keep only the last N turns that still fit.
Example: “last 5 exchanges”. (A code sketch follows the pros and cons below.)
Pros:
✔️ Simple, fast to implement.
✔️ Works well for short Q&A or casual chat.
✔️ Good for general-purpose assistants.
Cons:
❌ Loses important long-term facts (like names, preferences, or tasks).
❌ Can feel forgetful in long conversations.
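Here is a minimal sketch of the idea. The `count_tokens` word-count heuristic and the message format are assumptions; in practice you would use a real tokenizer (see the tools section below):

```python
def count_tokens(text: str) -> int:
    # Rough estimate: ~1.3 tokens per word (see the tips below).
    # Swap in a real tokenizer (e.g., tiktoken) in production.
    return int(len(text.split()) * 1.3)

def sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose total token count fits the budget."""
    kept, used = [], 0
    # Walk backwards from the newest message; stop once the budget is full.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

# Usage: keep whatever recent turns fit into ~2300 tokens of history.
history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
recent = sliding_window(history, budget=2304)
```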
2️⃣ Retrieval-Augmented Context
What it is: Combine your sliding window with on-demand retrieval.
👉 Instead of keeping all history, you:
Store old turns or facts in a vector store.
Embed the current question.
Retrieve relevant old messages or notes to re-insert.
This works like smart long-term memory.
How it works:
1️⃣ User asks: “What did I say about my project deadline?”
2️⃣ Embed that query.
3️⃣ Search your stored past turns or knowledge base.
4️⃣ Inject the top relevant snippets back into the prompt. (A code sketch follows the pros and cons.)
Pros:
✔️ Keeps history focused and factual.
✔️ Handles very long conversations or knowledge bases.
✔️ Reduces wasted tokens (only brings back relevant info).
Cons:
❌ More complex: needs a vector DB + embeddings.
❌ Small risk of missing nuance if retrieval fails.
❌ Takes extra compute to run retrieval each turn.
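A minimal sketch of those four steps. The `embed` function here is a toy stand-in; a real setup would call an embedding model and store vectors in a vector DB, not a Python list:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding for this sketch: hash each word into a fixed-size vector.
    Swap in a real embedding model in practice."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

# Step 1: store old turns alongside their embeddings.
memory: list[tuple[str, np.ndarray]] = []

def remember(turn: str) -> None:
    memory.append((turn, embed(turn)))

# Steps 2-4: embed the query, rank stored turns by cosine similarity,
# and return the top snippets to re-insert into the prompt.
def retrieve(query: str, top_k: int = 3) -> list[str]:
    q = embed(query)
    scored = [
        (float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), text)
        for text, v in memory
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

remember("My project deadline is March 14.")
print(retrieve("What did I say about my project deadline?"))
```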
✅ Combining Both
Best practice: Use a sliding window for the recent chat, plus retrieval for older turns or domain knowledge.
Example Prompt:
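One possible layout; the section markers and wording are illustrative, not a fixed format:

```
[System] You are a helpful project assistant. Keep answers concise.

[Retrieved memory (from the vector store)]
- "My project deadline is March 14." (earlier session)

[Recent chat (sliding window)]
User: Can you remind me what's due soon?
Assistant: Sure, which project are you asking about?

[Current turn]
User: The one with the deadline I mentioned before.
```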
✅ Practical Tips for Token Budgeting
✔️ Check your model's max context (e.g., 4k tokens for many open models).
✔️ Calculate average token usage: 1 word ≈ 1.3 tokens is a rough rule, and code snippets use more tokens than prose.
✔️ Keep system prompts short and reusable.
✔️ Set hard limits: truncate old turns or chunk retrieved text.
✔️ For big RAG pipelines, chunk your KB into small passages (100–300 words) to avoid bloat; a chunking sketch follows this list.
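A minimal sketch of that last tip, splitting a document into passages of roughly 200 words (a number picked from the 100–300 range above):

```python
def chunk_passages(text: str, max_words: int = 200) -> list[str]:
    """Split text into word-bounded passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```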
✅ Tools That Help
tiktoken for token counting (OpenAI's tokenizer), or a transformers tokenizer for open models. For example:
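A quick sketch of both; the encoding and model names are common examples, not requirements:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Token budgeting keeps prompts inside the context window."

# tiktoken: count tokens the way recent OpenAI models do.
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(text)))

# transformers: count tokens for an open model (GPT-2 as an example).
tok = AutoTokenizer.from_pretrained("gpt2")
print(len(tok.encode(text)))
```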
🗝️ Key Takeaway
Good context = relevant context. A sliding window keeps the chat fresh; retrieval brings old knowledge back only when needed. Together they make your assistant feel smart and efficient.
➡️ Next: Learn how to test your multi-turn flow — and add guardrails so your assistant stays consistent!