Add safety filters, profanity cleanup, and guardrails
Once your assistant can answer questions, use tools, and maintain context, you need to ensure it stays safe and appropriate for real users.
LLMs — even instruction-tuned ones — can:
Hallucinate harmful or misleading info
Generate profanity or offensive language
Answer questions they shouldn’t (like personal medical or legal advice)
👉 That’s why adding guardrails is critical for a production-ready assistant.
✅ What Are Guardrails?
Guardrails = extra checks and rules that:
Filter out unwanted user inputs
Block or rewrite unsafe model outputs
Refuse or safely redirect risky queries
✅ Key Guardrail Strategies
1️⃣ Input Filters
✔️ Scan incoming user prompts for:
Profanity
Forbidden topics (e.g., hate speech, violence, hacking instructions)
Example: Basic profanity filter:
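A minimal sketch using a plain word list (`BLOCKED_WORDS` and `contains_profanity` are illustrative names, not from any particular library; in practice you might use a maintained word list or a package such as `better_profanity`):

```python
# Illustrative word list only — swap in a maintained profanity list for real use.
BLOCKED_WORDS = {"badword1", "badword2"}

def contains_profanity(text: str) -> bool:
    """Return True if any blocked word appears in the text."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return any(token in BLOCKED_WORDS for token in tokens)
```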
If profanity is found:
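You can refuse early so the flagged input never reaches the model. A minimal sketch (the `assistant_reply` call is a stand-in for your existing generation function):

```python
def handle_user_message(user_input: str) -> str:
    if contains_profanity(user_input):
        # Short-circuit: the model never sees the flagged input.
        return "Let's keep things respectful. Could you rephrase your question?"
    return assistant_reply(user_input)  # placeholder for your normal generation call
```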
2️⃣ Output Filters
✔️ Check the generated response before sending it back:
Run a quick scan for flagged words.
Or re-run output through a cleanup step.
Example: Remove or replace profanity:
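A minimal sketch that masks any blocked words in the model's reply (reusing the illustrative `BLOCKED_WORDS` list from the input filter above):

```python
import re

def censor(text: str) -> str:
    """Replace each blocked word in the output with asterisks."""
    for word in BLOCKED_WORDS:
        text = re.sub(re.escape(word), "*" * len(word), text, flags=re.IGNORECASE)
    return text

print(censor("This reply contains badword1."))  # -> "This reply contains ********."
```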
3️⃣ Refusal Templates
✔️ Teach your assistant when to refuse politely:
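One lightweight approach is a keyword-to-template map checked before generation (the topics, keywords, and wording below are placeholders to adapt to your own policy):

```python
REFUSAL_TEMPLATES = {
    "medical": "I'm not able to give medical advice. Please consult a qualified professional.",
    "legal": "I can't provide legal advice. A licensed lawyer is the right person to ask.",
}

def check_refusal(user_input: str) -> str | None:
    """Return a polite refusal if the input touches a restricted topic, else None."""
    lowered = user_input.lower()
    if any(kw in lowered for kw in ("diagnose", "prescription", "symptoms")):
        return REFUSAL_TEMPLATES["medical"]
    if any(kw in lowered for kw in ("lawsuit", "sue me", "legal advice")):
        return REFUSAL_TEMPLATES["legal"]
    return None
```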
Or add an explicit policy to your system prompt:
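For instance (the wording here is only a starting point; tailor it to your own policy):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. "
    "Do not provide medical, legal, or financial advice; instead, politely "
    "suggest consulting a qualified professional. "
    "Never use profanity or offensive language, even if the user does. "
    "If a request is unsafe or against policy, refuse briefly and offer a safer alternative."
)
```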
4️⃣ Guardrail Libraries
For bigger projects, try:
OpenAI Moderation API: Checks for unsafe content (see the sketch after this list).
`transformers.pipelines` with a classification model (like `facebook/roberta-hate-speech-detection`).
Guardrails AI: Open-source framework to define guardrail flows (`pip install guardrails-ai`).
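As a quick illustration of the first option, here is a sketch using the Moderation endpoint via the `openai` Python SDK (it expects an `OPENAI_API_KEY` environment variable; response fields can differ across SDK versions, so check the current docs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Ask the Moderation API whether the text violates content policies."""
    response = client.moderations.create(input=text)
    return response.results[0].flagged
```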
✅ Example: Combine All
Simple secured flow:
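A sketch that chains the pieces from this page together (the `generate_reply` call is a placeholder for whatever model call your assistant already makes):

```python
def secured_chat(user_input: str) -> str:
    # 1. Input filter: block profanity before it reaches the model.
    if contains_profanity(user_input):
        return "Let's keep things respectful. Could you rephrase that?"

    # 2. Refusal templates: redirect restricted topics politely.
    refusal = check_refusal(user_input)
    if refusal is not None:
        return refusal

    # 3. Generate a reply with your existing pipeline or API call.
    raw_reply = generate_reply(user_input)  # placeholder for your model call

    # 4. Output filter: censor anything that slipped through.
    return censor(raw_reply)
```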
✅ When to Use Guardrails
✔️ Anytime you’re serving real users.
✔️ Required for public or customer-facing apps.
✔️ Important for schools, kids, and workplace tools.
✔️ Useful for meeting compliance needs (GDPR, COPPA, etc.).
✅ Best Practices
✅ Keep a clear policy of what’s allowed and what isn’t.
❌ Don’t rely on the LLM alone — always add checks.
✅ Combine multiple filters: input + output.
❌ Don’t pass user input blindly into tools (e.g., never feed it to eval()).
✅ Log flagged queries for review.
❌ Don’t expose raw generated text if it may contain unsafe content.
🗝️ Key Takeaway
Guardrails = protect your users + your reputation. They help keep your assistant safe, polite, and compliant — even when the base LLM wants to say something it shouldn’t.
➡️ Next: Learn how to monitor logs, track flagged messages, and retrain your assistant to improve safety over time!