Token budgeting: sliding window vs retrieval

When you build a multi-turn AI assistant, you run into a hard limit: your model’s context window, the maximum number of tokens it can process at once.

If you push too much text into the prompt:

  • Generation costs go up.

  • The request may fail, or important parts may get truncated.

  • Responses can lose quality if the prompt gets cut.

So, you need a token budgeting strategy to keep the conversation relevant and efficient.


What is Token Budgeting?

Token budgeting = deciding what information stays in the prompt and what gets dropped or compressed.

You’re balancing:

  • New user input

  • Assistant’s latest replies

  • Past context (history)

  • Optional: Retrieved info (RAG)

All must fit inside the LLM’s max context — e.g., 2k, 4k, 8k, or 32k tokens.



1️⃣ Sliding Window

What it is:

  • Keep only the most recent turns in your chat history.

  • Drop older messages as new ones come in.

How it works:

If your context grows too big:

  • Keep only the last N turns that still fit.

  • Example: “last 5 exchanges” (see the sketch below).
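A minimal sketch of that idea in Python, assuming a simple word-count heuristic for token counting (swap in a real tokenizer such as tiktoken in practice):

```python
def count_tokens(text: str) -> int:
    # Rough heuristic: ~1.3 tokens per word. Replace with a real tokenizer.
    return int(len(text.split()) * 1.3)

def sliding_window(history: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose total token count fits the budget."""
    kept, used = [], 0
    for message in reversed(history):        # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))              # restore chronological order

# Usage: keep whatever recent turns fit into ~1,000 tokens.
history = [
    {"role": "user", "content": "My project deadline is March 3."},
    {"role": "assistant", "content": "Got it, noted: March 3."},
    {"role": "user", "content": "What did I say about my deadline?"},
]
window = sliding_window(history, budget=1000)
```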


Pros:

  ✔️ Simple, fast to implement

  ✔️ Works well for short Q&A or casual chat

  ✔️ Good for general-purpose assistants

Cons:

  ❌ Loses important long-term facts (like names, preferences, or tasks).

  ❌ Can feel forgetful in long conversations.


2️⃣ Retrieval-Augmented Context

What it is: Combine your sliding window with on-demand retrieval.

👉 Instead of keeping all history, you:

  • Store old turns or facts in a vector store.

  • Embed the current question.

  • Retrieve relevant old messages or notes to re-insert.

This works like smart long-term memory.


How it works:

1️⃣ User asks: “What did I say about my project deadline?”

2️⃣ Embed that query.

3️⃣ Search your stored past turns or knowledge base.

4️⃣ Inject the top relevant snippets back into the prompt (a minimal sketch follows).
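Here is a minimal, self-contained sketch of steps 2–4. The embed() below is a toy word-count stand-in so the example runs on its own; in practice you would use a real embedding model and a vector store:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a word-count vector. Swap in a real model
    # (e.g. sentence-transformers or an embeddings API) in production.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, memory: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k stored snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda snippet: cosine(q, embed(snippet)), reverse=True)
    return ranked[:top_k]

# Old turns live in "memory"; pull back only the relevant ones.
memory = [
    "User: my project deadline is March 3.",
    "User: I prefer answers as bullet points.",
    "Assistant: noted, the deadline is March 3.",
]
snippets = retrieve("What did I say about my project deadline?", memory, top_k=2)
context_block = "Relevant earlier messages:\n" + "\n".join(snippets)
```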


Pros:

  ✔️ Keeps history focused and factual.

  ✔️ Handles very long conversations or knowledge bases.

  ✔️ Reduces wasted tokens (only brings back relevant info).

Cons:

  ❌ More complex: needs a vector DB + embeddings.

  ❌ Small risk of missing nuance if retrieval fails.

  ❌ Takes extra compute to run retrieval each turn.


Combining Both

Best practice: Use a sliding window for the recent chat, plus retrieval for older turns and domain knowledge.

Example Prompt:
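One possible layout, with the system prompt first, then retrieved snippets, then the recent window (all contents here are illustrative):

```text
System: You are a helpful project assistant. Keep answers concise.

Relevant earlier messages (retrieved):
- User: my project deadline is March 3.
- User: I prefer answers as bullet points.

Recent conversation (sliding window):
User: Can you remind me what's due soon?
Assistant: Your project is due March 3.
User: What did I say about my project deadline?
```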


Practical Tips for Token Budgeting

✔️ Check your model’s max context (e.g., 4k tokens for many open models).

✔️ Calculate average token usage:

  • 1 word ≈ 1.3 tokens (rough rule)

  • Code snippets use more tokens.

✔️ Keep system prompts short and reusable.

✔️ Set hard limits: truncate old turns or chunk retrieved text.

✔️ For big RAG pipelines, chunk your KB into small passages (100–300 words) to avoid bloat (a minimal chunker is sketched below).
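A minimal word-count chunker for that last tip (splitting on paragraph or heading boundaries first usually gives more natural passages; the file name below is just an example):

```python
def chunk_text(text: str, max_words: int = 250) -> list[str]:
    """Split text into passages of at most max_words words (~100-300 word chunks)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Usage:
# chunks = chunk_text(open("knowledge_base.txt").read(), max_words=250)
```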


Tools That Help

  • tiktoken for token counting (OpenAI tokenizer)

  • transformers tokenizers (Hugging Face) for open models
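A quick sketch of counting tokens with both libraries (the encoding and model names are examples; use whatever matches your model):

```python
import tiktoken
from transformers import AutoTokenizer

text = "What did I say about my project deadline?"

# tiktoken: pick the encoding used by your target OpenAI model.
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(text)))

# transformers: load the tokenizer that matches your open model.
tok = AutoTokenizer.from_pretrained("gpt2")
print(len(tok.encode(text)))
```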


🗝️ Key Takeaway

Good context = relevant context. A sliding window keeps the recent chat fresh; retrieval brings older knowledge back only when it’s needed. Together they make your assistant feel smart and efficient.


➡️ Next: Learn how to test your multi-turn flow — and add guardrails so your assistant stays consistent!
