Add safety filters, profanity cleanup, and guardrails

Once your assistant can answer questions, use tools, and maintain context, you need to ensure it stays safe and appropriate for real users.

LLMs — even instruction-tuned ones — can:

  • Hallucinate harmful or misleading info

  • Generate profanity or offensive language

  • Answer questions they shouldn’t (like personal medical or legal advice)

👉 That’s why adding guardrails is critical for a production-ready assistant.


What Are Guardrails?

Guardrails = extra checks and rules that:

  • Filter out unwanted user inputs

  • Block or rewrite unsafe model outputs

  • Refuse or safely redirect risky queries


Key Guardrail Strategies


1️⃣ Input Filters

✔️ Scan incoming user prompts for:

  • Profanity

  • Forbidden topics (e.g., hate speech, violence, hacking instructions)

Example: Basic profanity filter:
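A minimal sketch using a hand-maintained word list (the `BLOCKED_WORDS` entries below are placeholders; in practice you'd load a real list or use a library such as `better_profanity`):

```python
# Basic input filter: flag a prompt if it contains any blocked word.
BLOCKED_WORDS = {"badword1", "badword2"}  # placeholder terms, not a real list

def contains_profanity(text: str) -> bool:
    """Return True if any blocked word appears in the text."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return any(token in BLOCKED_WORDS for token in tokens)
```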

If profanity is found:
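One option is to refuse immediately, before the prompt ever reaches the model (a sketch reusing `contains_profanity` from above; the canned reply is just an example):

```python
user_input = "This message contains badword1."

if contains_profanity(user_input):
    # Refuse early instead of passing the prompt to the model.
    reply = "Sorry, I can't help with that. Let's keep things respectful."
else:
    reply = None  # safe to continue to the normal generation step
```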


2️⃣ Output Filters

✔️ Check the generated response before sending it back:

  • Run a quick scan for flagged words.

  • Or re-run output through a cleanup step.

Example: Remove or replace profanity:
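A hedged sketch of the replace approach, reusing the `BLOCKED_WORDS` set from the input-filter example and masking each hit with asterisks:

```python
import re

def censor(text: str) -> str:
    """Replace every blocked word in the model's reply with asterisks."""
    for word in BLOCKED_WORDS:
        pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
        text = pattern.sub("*" * len(word), text)
    return text

raw_reply = "Sure! Here is a badword1 in the output."
safe_reply = censor(raw_reply)  # "Sure! Here is a ******** in the output."
```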


3️⃣ Refusal Templates

✔️ Teach your assistant when to refuse politely:
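A sketch of keyword-based refusal routing (the topic list and refusal wording are illustrative, not a recommended taxonomy):

```python
# Route sensitive topics to a polite, canned refusal.
SENSITIVE_TOPICS = ("medical advice", "legal advice", "diagnose", "lawsuit")

REFUSAL_MESSAGE = (
    "I'm sorry, but I can't help with that. "
    "For questions like this, please consult a qualified professional."
)

def maybe_refuse(user_input: str):
    """Return a refusal message for sensitive topics, or None if the query is fine."""
    lowered = user_input.lower()
    if any(topic in lowered for topic in SENSITIVE_TOPICS):
        return REFUSAL_MESSAGE
    return None
```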

Or add an explicit policy to your system prompt:
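For example, a system prompt along these lines (the wording is illustrative; tailor it to your own policy):

```python
SYSTEM_PROMPT = """You are a helpful, polite assistant.

Safety policy:
- Do not give personal medical, legal, or financial advice; suggest consulting a professional instead.
- Do not use profanity or offensive language.
- If a request is unsafe or violates this policy, refuse briefly and politely, and offer a safer alternative when possible.
"""
```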


4️⃣ Guardrail Libraries

For bigger projects, try:

  • OpenAI Moderation API: Checks for unsafe content (see the sketch after this list).

  • **transformers.pipelines** with a classification model (like `facebook/roberta-hate-speech-dynabench-r4-target`).

  • Guardrails AI: Open-source framework to define guardrail flows (pip install guardrails-ai).
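As a concrete starting point, here is a sketch of the OpenAI Moderation API option; it assumes the `openai` Python SDK (v1+) is installed and `OPENAI_API_KEY` is set in your environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text as unsafe."""
    response = client.moderations.create(input=text)
    return response.results[0].flagged

if is_flagged("some user message"):
    print("Blocked: the message violates the content policy.")
```

The same check works on model outputs too: run the generated reply through `is_flagged` before returning it.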


Example: Combine All

Simple secured flow:
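Here is a minimal sketch chaining the pieces from the earlier examples (`contains_profanity`, `maybe_refuse`, and `censor`); `call_llm` is a placeholder for however you actually query your model:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your real model call (API client, local model, etc.)."""
    raise NotImplementedError("Plug in your LLM call here.")

def secured_reply(user_input: str) -> str:
    # 1. Input filter: block profane or abusive prompts up front.
    if contains_profanity(user_input):
        return "Please keep the conversation respectful."

    # 2. Refusal template: redirect sensitive topics before calling the model.
    refusal = maybe_refuse(user_input)
    if refusal:
        return refusal

    # 3. Generate, then clean the output before it reaches the user.
    raw_reply = call_llm(user_input)
    return censor(raw_reply)
```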


When to Use Guardrails

✔️ Anytime you’re serving real users.

✔️ Required for public or customer-facing apps.

✔️ Important for schools, kids, and workplace tools.

✔️ Useful to meet compliance needs (GDPR, COPPA, etc.).


Best Practices

| Do’s | Don’ts |
| --- | --- |
| ✅ Keep a clear policy: what’s allowed & not. | ❌ Don’t rely on the LLM alone — always add checks. |
| ✅ Combine multiple filters: input + output. | ❌ Don’t trust user input blindly for tools (e.g., `eval()`). |
| ✅ Log flagged queries for review. | ❌ Don’t expose raw generated text if it may contain unsafe content. |


🗝️ Key Takeaway

Guardrails = protect your users + your reputation. They help keep your assistant safe, polite, and compliant — even when the base LLM wants to say something it shouldn’t.


➡️ Next: Learn how to monitor logs, track flagged messages, and retrain your assistant to improve safety over time!
