Logging conversations, rating responses

Once your assistant is live, you shouldn’t trust it blindly. You need to measure how well it performs so you can fix mistakes, track improvements, and make sure it stays helpful and safe.


🔍 Why Evaluate?

  • LLMs can produce fluent but incorrect answers.

  • Fine-tuned models and RAG pipelines can drift over time as your knowledge base changes.

  • Users need high-quality responses for real trust.


Types of Evaluation


1️⃣ Automatic Metrics: BLEU & ROUGE

These classic metrics compare generated text to reference answers:

  • BLEU (Bilingual Evaluation Understudy):

    • Checks overlap of n-grams (words or phrases).

    • Popular in translation tasks.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • Measures recall — how much of the reference text is covered.

    • Good for summarization tasks.

These are easy to run at scale, but they:

  • Don’t fully capture quality for open-ended chat.

  • Can miss subtleties like helpfulness or tone.


📌 Example: Evaluate with datasets
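
Below is a minimal sketch of how this could look with Hugging Face’s evaluate library; the library choice, the toy predictions, and the reference answers are assumptions for illustration, not part of any specific setup described here:

```python
# pip install evaluate rouge_score   (assumed dependencies)
import evaluate

# Toy predictions from your assistant and matching reference answers.
predictions = [
    "You can reset your password from the account settings page.",
    "Support is available Monday to Friday, 9am to 5pm.",
]
references = [
    "To reset your password, go to the account settings page.",
    "Our support team is available Monday through Friday, 9am to 5pm.",
]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU accepts one or more references per prediction, so wrap each in a list.
bleu_result = bleu.compute(predictions=predictions,
                           references=[[ref] for ref in references])
rouge_result = rouge.compute(predictions=predictions, references=references)

print("BLEU:", round(bleu_result["bleu"], 3))
print("ROUGE-L:", round(rouge_result["rougeL"], 3))
```

Scores close to 1.0 mean heavy n-gram overlap with the reference; in practice you would run this over your full evaluation set rather than two examples.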


2️⃣ Human-in-the-Loop (HITL) Feedback

For assistants, human feedback is often more meaningful than automatic scores:
✔️ Users rate the response: 👍 / 👎
✔️ Or pick “Good answer”, “Wrong”, “Too short”, etc.
✔️ Or suggest a better rephrasing.


How to Collect HITL:

  • Add simple 👍 / 👎 buttons in your Gradio UI (see the sketch after this list).

  • Store ratings in a file or database.

  • Review bad answers regularly and add new training examples.
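
Here is a minimal Gradio sketch of those 👍 / 👎 buttons, writing each rating to a JSONL file; the file name, record fields, and UI layout are illustrative assumptions, not a fixed design:

```python
import json
from datetime import datetime, timezone

import gradio as gr

FEEDBACK_FILE = "feedback.jsonl"  # hypothetical path; any file or database works


def save_rating(question, answer, rating):
    """Append one rating record to a JSONL file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "rating": rating,  # "up" or "down"
    }
    with open(FEEDBACK_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return f"Saved a {rating} rating. Thanks!"


with gr.Blocks() as demo:
    question = gr.Textbox(label="Question")
    answer = gr.Textbox(label="Assistant answer")
    status = gr.Markdown()
    with gr.Row():
        thumbs_up = gr.Button("👍")
        thumbs_down = gr.Button("👎")

    # Each button stores the current question/answer pair plus its rating.
    thumbs_up.click(lambda q, a: save_rating(q, a, "up"), [question, answer], status)
    thumbs_down.click(lambda q, a: save_rating(q, a, "down"), [question, answer], status)

demo.launch()
```

Each click appends one JSON line, so the log is easy to load later for review or for turning bad answers into new training examples.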


3️⃣ Custom Checks

Some extra signals you might track (a small sketch follows this list):

  • Response length — too short or too long?

  • Harmful or forbidden topics — does it say something it shouldn’t?

  • Coverage — does the answer actually include retrieved context?

  • Factual correctness — manual spot-checks or QA pairs.
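
A rough sketch of what such checks could look like in code; the length thresholds, forbidden terms, and the coverage heuristic below are placeholder assumptions, not recommended values:

```python
def run_custom_checks(answer: str, retrieved_context: str,
                      forbidden_terms=("credit card number", "medical diagnosis")) -> dict:
    """Cheap heuristic checks. Thresholds and forbidden terms here are illustrative only."""
    words = answer.split()
    return {
        "too_short": len(words) < 5,
        "too_long": len(words) > 300,
        "forbidden_topic": any(term in answer.lower() for term in forbidden_terms),
        # Rough coverage signal: does the answer reuse any longer word from the retrieved context?
        "uses_context": any(
            word in answer.lower()
            for word in retrieved_context.lower().split()
            if len(word) > 6
        ),
    }


print(run_custom_checks(
    answer="You can reset your password on the account settings page.",
    retrieved_context="Password resets are handled from the account settings page.",
))
```

Factual correctness is harder to automate; keep that as manual spot-checks or a small set of question–answer pairs you re-run after every change.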


What’s Best for You?

| Metric | Best For | Pros | Cons |
|--------|----------|------|------|
| BLEU | Translation-like outputs | Fast, objective | Doesn’t measure truth or tone |
| ROUGE | Summaries, paraphrases | Good for factual overlap | Same limits as BLEU |
| HITL | Open-ended chat | Human judgment, real-world signal | Needs people and time |

👉 A combination works best:
✅ Automatic scores for bulk checks
✅ Human feedback for quality and trust


How to Improve Based on Scores

  • Low BLEU/ROUGE? ➜ Fine-tune with better instructions.
  • Low thumbs-up ratio from users? ➜ Review the bad samples and expand your dataset.
  • Hallucinations? ➜ Improve your RAG context or prompt style.
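
If you stored ratings in a JSONL file as in the earlier sketch, a few lines of Python can surface the thumbs-up ratio and the latest bad answers for review; the file name and record fields match that hypothetical sketch and would need to match however you actually log feedback:

```python
import json
from collections import Counter

# Summarize the 👍/👎 log written by the Gradio sketch above.
with open("feedback.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

counts = Counter(r["rating"] for r in records)
total = sum(counts.values()) or 1  # avoid division by zero on an empty log
print(f"👍 ratio: {counts['up'] / total:.0%} over {total} ratings")

# Pull the most recent thumbs-down answers for manual review.
for r in [r for r in records if r["rating"] == "down"][-10:]:
    print("-", r["question"], "->", r["answer"][:80])
```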


🗝️ Key Takeaway

You can’t improve what you don’t measure. Mix automatic scores with real human ratings to make your assistant more accurate, helpful, and trusted over time.


➡️ Next: Learn how to log conversations, save ratings, and build a feedback loop to keep your assistant getting smarter!
