Demo: Maintaining context through longer sessions

Now that you understand why context matters and how to budget tokens with sliding windows and retrieval, let's see it in action with a simple multi-turn chat demo.

This demo shows you how to:

  • Keep a rolling conversation history

  • Manage what stays in the prompt

  • Optionally summarize or retrieve older info

  • Make your assistant feel coherent over multiple turns


Goal

Build a simple Gradio chat (or Python CLI) that:

1️⃣ Stores each turn in a session history
2️⃣ Adds that history to the prompt for every new question
3️⃣ Automatically trims or summarizes old turns when they exceed your token budget


Step 1️⃣ — Basic Session Store

Use a Python list to track messages:

# Example format: [("User", "text"), ("Assistant", "text")]
session_history = []

Step 2️⃣ — Add a Message

When the user submits a new question:

  • Add the question to the list

  • Build a prompt that includes the system prompt + conversation so far


Step 3️⃣ — Example Prompt Builder

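A minimal sketch of a prompt builder, assuming the `(role, text)` history format from Step 1. The system prompt text, the `build_prompt` name, and the `max_turns` sliding-window size are all illustrative choices, not a fixed API:

```python
SYSTEM_PROMPT = "You are a helpful assistant."

def build_prompt(session_history, user_question, max_turns=10):
    """Combine the system prompt, recent turns, and the new question."""
    recent = session_history[-max_turns:]  # simple sliding window over the history
    lines = [SYSTEM_PROMPT]
    for role, text in recent:
        lines.append(f"{role}: {text}")
    lines.append(f"User: {user_question}")
    lines.append("Assistant:")          # cue the model to answer next
    return "\n".join(lines)
```

Because the window only keeps the last `max_turns` messages, the prompt stays inside your token budget even as the session grows.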

Step 4️⃣ — Run the LLM

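One way to sketch this step, with `call_llm` standing in as a placeholder for your real model client (an OpenAI request, a local llama.cpp binding, etc.); the `chat_turn` helper name is also hypothetical:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real completion request for your model here.
    return "Python is a general-purpose programming language."

def chat_turn(session_history, user_question):
    """Build the prompt, query the model, and record both turns."""
    lines = ["You are a helpful assistant."]
    lines += [f"{role}: {text}" for role, text in session_history]
    lines += [f"User: {user_question}", "Assistant:"]
    answer = call_llm("\n".join(lines))
    # Append both sides of the exchange so the next turn sees them.
    session_history.append(("User", user_question))
    session_history.append(("Assistant", answer))
    return answer
```

Storing the assistant's reply alongside the question is what lets follow-ups like "Who created it?" resolve correctly on the next turn.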

Step 5️⃣ — Add to Gradio Chatbot


Optional: Add Summarization

When the chat gets too long:

  • Use your LLM to summarize old turns.

  • Replace old details with a short summary.

For example, ask the model to condense the earliest turns into a short summary, then store that summary in the session history in place of the turns it replaces.

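A minimal sketch of that idea; `summarize_history`, `keep_last`, and the `call_llm` stub are illustrative names, and in practice `call_llm` would be a real summarization request to your model:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real summarization call to your model.
    return "Earlier, the user asked general questions about Python."

def summarize_history(session_history, keep_last=4):
    """Collapse all but the last `keep_last` messages into one summary turn."""
    if len(session_history) <= keep_last:
        return session_history
    old, recent = session_history[:-keep_last], session_history[-keep_last:]
    transcript = "\n".join(f"{role}: {text}" for role, text in old)
    summary = call_llm("Summarize this conversation in 2-3 sentences:\n" + transcript)
    # One short synthetic turn now stands in for all the older messages.
    return [("Summary", summary)] + recent
```

The trade-off is fidelity for space: details in old turns are lost, but the summary preserves the gist at a fraction of the token cost.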

See It in Action

  • Ask “What is Python?”

  • Then: “Who created it?”

  • Then: “When was it released?”

Each follow-up uses previous answers to stay relevant!


Key Benefits

✔️ Simple rolling window: fast, and works for most casual chats.
✔️ Summaries: squeeze more context into your token budget.
✔️ Session list: easily expandable, so you can store it in a DB, link it to a user ID, or persist it between calls.


🗝️ Key Takeaway

A smart assistant remembers what you said — within the limits of its context. Managing this well makes your AI feel more natural, helpful, and trustworthy for longer conversations.


➡️ Next: Learn how to add tool use and plugins so your assistant can do real tasks, not just answer in text!
