Add safety filters, profanity cleanup, and guardrails

Once your assistant can answer questions, use tools, and maintain context, you need to ensure it stays safe and appropriate for real users.

LLMs — even instruction-tuned ones — can:

  • Hallucinate harmful or misleading info

  • Generate profanity or offensive language

  • Answer questions they shouldn’t (like personal medical or legal advice)

👉 That’s why adding guardrails is critical for a production-ready assistant.


What Are Guardrails?

Guardrails = extra checks and rules that:

  • Filter out unwanted user inputs

  • Block or rewrite unsafe model outputs

  • Refuse or safely redirect risky queries


Key Guardrail Strategies


1️⃣ Input Filters

✔️ Scan incoming user prompts for:

  • Profanity

  • Forbidden topics (e.g., hate speech, violence, hacking instructions)

Example: Basic profanity filter:
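A minimal sketch using a hand-maintained word list (the `BLOCKED_WORDS` entries below are placeholders; in practice you'd load a real list or use a library such as `better_profanity`):

```python
# Basic input filter: flag a prompt if it contains any blocked word.
BLOCKED_WORDS = {"badword1", "badword2"}  # placeholder terms, not a real list

def contains_profanity(text: str) -> bool:
    """Return True if any blocked word appears in the text."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return any(token in BLOCKED_WORDS for token in tokens)
```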

If profanity is found:
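One option is to refuse immediately, before the prompt ever reaches the model (a sketch reusing `contains_profanity` from above; the canned reply is just an example):

```python
user_input = "This message contains badword1."

if contains_profanity(user_input):
    # Refuse early instead of passing the prompt to the model.
    reply = "Sorry, I can't help with that. Let's keep things respectful."
else:
    reply = None  # safe to continue to the normal generation step
```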


2️⃣ Output Filters

✔️ Check the generated response before sending it back:

  • Run a quick scan for flagged words.

  • Or re-run output through a cleanup step.

Example: Remove or replace profanity:
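A hedged sketch of the replace approach, reusing the `BLOCKED_WORDS` set from the input-filter example and masking each hit with asterisks:

```python
import re

def censor(text: str) -> str:
    """Replace every blocked word in the model's reply with asterisks."""
    for word in BLOCKED_WORDS:
        pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
        text = pattern.sub("*" * len(word), text)
    return text

raw_reply = "Sure! Here is a badword1 in the output."
safe_reply = censor(raw_reply)  # "Sure! Here is a ******** in the output."
```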


3️⃣ Refusal Templates

✔️ Teach your assistant when to refuse politely:
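A sketch of keyword-based refusal routing (the topic list and refusal wording are illustrative, not a recommended taxonomy):

```python
# Route sensitive topics to a polite, canned refusal.
SENSITIVE_TOPICS = ("medical advice", "legal advice", "diagnose", "lawsuit")

REFUSAL_MESSAGE = (
    "I'm sorry, but I can't help with that. "
    "For questions like this, please consult a qualified professional."
)

def maybe_refuse(user_input: str):
    """Return a refusal message for sensitive topics, or None if the query is fine."""
    lowered = user_input.lower()
    if any(topic in lowered for topic in SENSITIVE_TOPICS):
        return REFUSAL_MESSAGE
    return None
```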

Or add an explicit policy to your system prompt:
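For example, a system prompt along these lines (the wording is illustrative; tailor it to your own policy):

```python
SYSTEM_PROMPT = """You are a helpful, polite assistant.

Safety policy:
- Do not give personal medical, legal, or financial advice; suggest consulting a professional instead.
- Do not use profanity or offensive language.
- If a request is unsafe or violates this policy, refuse briefly and politely, and offer a safer alternative when possible.
"""
```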


4️⃣ Guardrail Libraries

For bigger projects, try:

  • OpenAI Moderation API: Checks for unsafe content (see the sketch after this list).

  • **transformers.pipelines** with a classification model (like `facebook/roberta-hate-speech-dynabench-r4-target`).

  • Guardrails AI: Open-source framework to define guardrail flows (pip install guardrails-ai).
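As a concrete starting point, here is a sketch of the OpenAI Moderation API option; it assumes the `openai` Python SDK (v1+) is installed and `OPENAI_API_KEY` is set in your environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text as unsafe."""
    response = client.moderations.create(input=text)
    return response.results[0].flagged

if is_flagged("some user message"):
    print("Blocked: the message violates the content policy.")
```

The same check works on model outputs too: run the generated reply through `is_flagged` before returning it.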


Example: Combine All

Simple secured flow:
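Here is a minimal sketch chaining the pieces from the earlier examples (`contains_profanity`, `maybe_refuse`, and `censor`); `call_llm` is a placeholder for however you actually query your model:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your real model call (API client, local model, etc.)."""
    raise NotImplementedError("Plug in your LLM call here.")

def secured_reply(user_input: str) -> str:
    # 1. Input filter: block profane or abusive prompts up front.
    if contains_profanity(user_input):
        return "Please keep the conversation respectful."

    # 2. Refusal template: redirect sensitive topics before calling the model.
    refusal = maybe_refuse(user_input)
    if refusal:
        return refusal

    # 3. Generate, then clean the output before it reaches the user.
    raw_reply = call_llm(user_input)
    return censor(raw_reply)
```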


When to Use Guardrails

✔️ Anytime you’re serving real users.

✔️ Required for public or customer-facing apps.

✔️ Important for schools, kids, and workplace tools.

✔️ Useful to meet compliance needs (GDPR, COPPA, etc.).


Best Practices

| Do’s | Don’ts |
| --- | --- |
| ✅ Keep a clear policy: what’s allowed & not. | ❌ Don’t rely on the LLM alone — always add checks. |
| ✅ Combine multiple filters: input + output. | ❌ Don’t trust user input blindly for tools (e.g., `eval()`). |
| ✅ Log flagged queries for review. | ❌ Don’t expose raw generated text if it may contain unsafe content. |


🗝️ Key Takeaway

Guardrails = protect your users + your reputation. They help keep your assistant safe, polite, and compliant — even when the base LLM wants to say something it shouldn’t.


➡️ Next: Learn how to monitor logs, track flagged messages, and retrain your assistant to improve safety over time!
