Once your assistant is live, you shouldn't trust it blindly. You need to measure how well it performs so you can fix mistakes, track improvements, and keep it helpful and safe.
🔍 Why Evaluate?
LLMs can produce fluent but incorrect answers.
Fine-tuned models and RAG pipelines can drift over time as your knowledge base changes.
Users only keep trusting the assistant if responses stay high quality.
✅ Types of Evaluation
1️⃣ Automatic Metrics: BLEU & ROUGE
These classic metrics compare generated text to reference answers:
BLEU (Bilingual Evaluation Understudy):
Measures n-gram (word and phrase) overlap between the generated text and the reference, with a focus on precision.
Popular in translation tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
Measures recall — how much of the reference text is covered.
Good for summarization tasks.
These are easy to run at scale, but they:
Don’t fully capture quality for open-ended chat.
Can miss subtleties like helpfulness or tone.
📌 Example: Evaluate with datasets
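A minimal sketch using the Hugging Face `evaluate` library; the example predictions and references below are made-up placeholders for your own test set:

```python
# pip install evaluate rouge_score nltk
import evaluate

# Hypothetical data: model outputs paired with the "gold" answers you expect.
predictions = [
    "You can reset your password from the account settings page.",
    "Our support team is available Monday to Friday, 9am to 5pm.",
]
references = [
    "Passwords can be reset on the account settings page.",
    "Support hours are Monday through Friday, 9am to 5pm.",
]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU allows multiple references per prediction, so wrap each one in a list.
bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_score = rouge.compute(predictions=predictions, references=references)

print("BLEU:", bleu_score["bleu"])        # n-gram precision overlap
print("ROUGE-L:", rouge_score["rougeL"])  # longest-common-subsequence F-measure
```

Run this over a held-out set of question/answer pairs whenever you change the model or the knowledge base, and track the scores over time.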
2️⃣ Human-in-the-Loop (HITL) Feedback
For assistants, human feedback is often more meaningful than scores:
✔️ Users rate the response: 👍 / 👎
✔️ Or pick “Good answer”, “Wrong”, “Too short”, etc.
✔️ Or suggest better rephrasing.
How to Collect HITL:
Add a simple 👍 / 👎 button in your Gradio UI (see the sketch after this list).
Store ratings in a file or database.
Review bad answers regularly and add new training examples.
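A minimal sketch of a Gradio UI with 👍 / 👎 buttons that appends each rating to a JSONL file; `ask_assistant`, the file name, and the record layout are illustrative assumptions, not a fixed convention:

```python
# pip install gradio
import json
from datetime import datetime, timezone

import gradio as gr

FEEDBACK_FILE = "feedback.jsonl"  # hypothetical path; swap in a database if you prefer


def ask_assistant(question: str) -> str:
    # Placeholder: call your fine-tuned model or RAG pipeline here.
    return f"(model answer to: {question})"


def save_feedback(question: str, answer: str, rating: str) -> str:
    # Append one JSON record per rating so bad answers are easy to review later.
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "rating": rating,
    }
    with open(FEEDBACK_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return f"Saved rating: {rating}"


with gr.Blocks() as demo:
    question = gr.Textbox(label="Your question")
    answer = gr.Textbox(label="Assistant answer")
    status = gr.Markdown()

    ask_btn = gr.Button("Ask")
    up_btn = gr.Button("👍")
    down_btn = gr.Button("👎")

    ask_btn.click(ask_assistant, inputs=question, outputs=answer)
    up_btn.click(lambda q, a: save_feedback(q, a, "up"), inputs=[question, answer], outputs=status)
    down_btn.click(lambda q, a: save_feedback(q, a, "down"), inputs=[question, answer], outputs=status)

demo.launch()
```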
3️⃣ Custom Checks
Some extra signals you might track:
Response length — too short or too long?
Harmful or forbidden topics — does it say something it shouldn’t?
Coverage — does the answer actually include retrieved context?
Factual correctness — manual spot-checks or QA pairs.
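Most of these checks can be automated. A minimal sketch of a rule-based checker; the thresholds, forbidden-term list, and function name are illustrative assumptions:

```python
def run_custom_checks(answer: str, context: str,
                      forbidden=("credit card number", "social security")) -> dict:
    """Return simple quality flags for one answer and its retrieved context."""
    words = answer.split()
    context_words = {w.strip(".,!?") for w in context.lower().split()}

    # Share of answer words that also appear in the retrieved context:
    # a rough proxy for whether the answer is grounded in that context.
    overlap = sum(w.lower().strip(".,!?") in context_words for w in words) / max(len(words), 1)

    return {
        "too_short": len(words) < 5,
        "too_long": len(words) > 300,
        "forbidden_topic": any(term in answer.lower() for term in forbidden),
        "context_overlap": round(overlap, 2),
    }


checks = run_custom_checks(
    answer="You can reset your password in the account settings page.",
    context="The account settings page lets users reset their password.",
)
print(checks)  # e.g. {'too_short': False, 'too_long': False, 'forbidden_topic': False, ...}
```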
✅ What’s Best for You?
| Method | Best for | Strengths | Limitations |
| --- | --- | --- | --- |
| BLEU | Translation-like outputs | Fast, objective | Doesn't measure truth or tone |
| ROUGE | Summaries, paraphrases | Good for factual overlap | Same limits as BLEU |
| HITL | Open-ended chat | Human judgment, real-world feedback | Needs people and time |
👉 Combo is best:
✅ Automatic scores for bulk checks
✅ Human feedback for quality and trust
✅ How to Improve Based on Scores
Low BLEU/ROUGE? ➜ Fine-tune with better instructions.
Poor user thumbs-up ratio? ➜ Review bad samples, expand dataset.
Hallucinations? ➜ Improve your RAG context or prompt style.
🗝️ Key Takeaway
You can’t improve what you don’t measure. Mix automatic scores with real human ratings to make your assistant more accurate, helpful, and trusted over time.
➡️ Next: Learn how to log conversations, save ratings, and build a feedback loop to keep your assistant getting smarter!