Once your assistant is live, you shouldn't trust it blindly. You need to measure how well it performs so you can fix mistakes, track improvements, and keep it helpful and safe.
🔍 Why Evaluate?
LLMs can produce fluent but incorrect answers.
Fine-tuned models and RAG pipelines can drift over time as your knowledge base changes.
Users only keep trusting the assistant if responses stay high quality.
✅ Types of Evaluation
1️⃣ Automatic Metrics: BLEU & ROUGE
These classic metrics compare generated text to reference answers:
BLEU (Bilingual Evaluation Understudy):
Measures n-gram (word and phrase) overlap between the generated text and the reference, with a focus on precision.
Popular in translation tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
Measures recall — how much of the reference text is covered.
Good for summarization tasks.
These are easy to run at scale, but they:
Don’t fully capture quality for open-ended chat.
Can miss subtleties like helpfulness or tone.
📌 Example: Evaluate with datasets
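A minimal sketch using the Hugging Face `evaluate` library; the example predictions and references below are made-up placeholders for your own test set:

```python
# pip install evaluate rouge_score nltk
import evaluate

# Hypothetical data: model outputs paired with the "gold" answers you expect.
predictions = [
    "You can reset your password from the account settings page.",
    "Our support team is available Monday to Friday, 9am to 5pm.",
]
references = [
    "Passwords can be reset on the account settings page.",
    "Support hours are Monday through Friday, 9am to 5pm.",
]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU allows multiple references per prediction, so wrap each one in a list.
bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_score = rouge.compute(predictions=predictions, references=references)

print("BLEU:", bleu_score["bleu"])        # n-gram precision overlap
print("ROUGE-L:", rouge_score["rougeL"])  # longest-common-subsequence F-measure
```

Run this over a held-out set of question/answer pairs whenever you change the model or the knowledge base, and track the scores over time.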
2️⃣ Human-in-the-Loop (HITL) Feedback
For assistants, human feedback is often more meaningful than scores:
✔️ Users rate the response: 👍 / 👎
✔️ Or pick “Good answer”, “Wrong”, “Too short”, etc.
✔️ Or suggest better rephrasing.
How to Collect HITL:
Add a simple 👍 / 👎 button in your Gradio UI (see the sketch after this list).
Store ratings in a file or database.
Review bad answers regularly and add new training examples.
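A minimal sketch of a Gradio UI with 👍 / 👎 buttons that appends each rating to a JSONL file; `ask_assistant`, the file name, and the record layout are illustrative assumptions, not a fixed convention:

```python
# pip install gradio
import json
from datetime import datetime, timezone

import gradio as gr

FEEDBACK_FILE = "feedback.jsonl"  # hypothetical path; swap in a database if you prefer


def ask_assistant(question: str) -> str:
    # Placeholder: call your fine-tuned model or RAG pipeline here.
    return f"(model answer to: {question})"


def save_feedback(question: str, answer: str, rating: str) -> str:
    # Append one JSON record per rating so bad answers are easy to review later.
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "rating": rating,
    }
    with open(FEEDBACK_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return f"Saved rating: {rating}"


with gr.Blocks() as demo:
    question = gr.Textbox(label="Your question")
    answer = gr.Textbox(label="Assistant answer")
    status = gr.Markdown()

    ask_btn = gr.Button("Ask")
    up_btn = gr.Button("👍")
    down_btn = gr.Button("👎")

    ask_btn.click(ask_assistant, inputs=question, outputs=answer)
    up_btn.click(lambda q, a: save_feedback(q, a, "up"), inputs=[question, answer], outputs=status)
    down_btn.click(lambda q, a: save_feedback(q, a, "down"), inputs=[question, answer], outputs=status)

demo.launch()
```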
3️⃣ Custom Checks
Some extra signals you might track:
Response length — too short or too long?
Harmful or forbidden topics — does it say something it shouldn’t?
Coverage — does the answer actually include retrieved context?
Factual correctness — manual spot-checks or QA pairs.
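Most of these checks can be automated. A minimal sketch of a rule-based checker; the thresholds, forbidden-term list, and function name are illustrative assumptions:

```python
def run_custom_checks(answer: str, context: str,
                      forbidden=("credit card number", "social security")) -> dict:
    """Return simple quality flags for one answer and its retrieved context."""
    words = answer.split()
    context_words = {w.strip(".,!?") for w in context.lower().split()}

    # Share of answer words that also appear in the retrieved context:
    # a rough proxy for whether the answer is grounded in that context.
    overlap = sum(w.lower().strip(".,!?") in context_words for w in words) / max(len(words), 1)

    return {
        "too_short": len(words) < 5,
        "too_long": len(words) > 300,
        "forbidden_topic": any(term in answer.lower() for term in forbidden),
        "context_overlap": round(overlap, 2),
    }


checks = run_custom_checks(
    answer="You can reset your password in the account settings page.",
    context="The account settings page lets users reset their password.",
)
print(checks)  # e.g. {'too_short': False, 'too_long': False, 'forbidden_topic': False, ...}
```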
✅ What’s Best for You?
| Method | Best for | Strengths | Limitations |
| --- | --- | --- | --- |
| BLEU | Translation-like outputs | Fast, objective | Doesn't measure truth or tone |
| ROUGE | Summaries, paraphrases | Good for factual overlap | Same limits as BLEU |
| HITL | Open-ended chat | Human judgment, real-world feedback | Needs people and time |
👉 Combo is best:
✅ Automatic scores for bulk checks
✅ Human feedback for quality and trust
✅ How to Improve Based on Scores
Low BLEU/ROUGE? ➜ Fine-tune with better instructions.
Poor user thumbs-up ratio? ➜ Review bad samples, expand dataset.
Hallucinations? ➜ Improve your RAG context or prompt style.
🗝️ Key Takeaway
You can’t improve what you don’t measure. Mix automatic scores with real human ratings to make your assistant more accurate, helpful, and trusted over time.
➡️ Next: Learn how to log conversations, save ratings, and build a feedback loop to keep your assistant getting smarter!