Prepare a JSONL dataset of instruction/input/response
To instruction‑tune your base LLM, you need training data that shows how you want your assistant to behave.
The most common format is a JSON Lines file (.jsonl):
Each line = one training example
Each example = instruction, input (optional), ideal response
✅ What Should Each Example Look Like?
Each entry should answer:
What does the user want? (instruction)
Any extra input? (like a text passage)
What’s the best possible reply? (response)
📌 Minimal Example
{"instruction": "Explain what a Python function is.", "input": "", "output": "A Python function is a reusable block of code that performs a specific task. It can accept inputs, called arguments, and can return a value."}
{"instruction": "Summarize this paragraph.", "input": "Python is a popular programming language known for its readability and community support.", "output": "Python is a user-friendly programming language with strong community backing."}
{"instruction": "Give a beginner-friendly example of a for loop in Python.", "input": "", "output": "Here’s an example:\n\n```\nfor i in range(5):\n print(i)\n```\nThis prints numbers from 0 to 4."}✅ Recommended Structure
instruction: The task or question you want the model to handle.
input: Optional supporting text, context, or passage.
output: The ideal response your assistant should generate.
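At training time, most fine-tuning scripts join these three fields into a single prompt string. Here is a minimal sketch of an Alpaca-style template (the exact section headers are an assumption, not a fixed standard; adjust them to whatever format your training script expects):

```
# Minimal sketch of an Alpaca-style prompt template (assumed wording, not a
# fixed standard) that turns one JSONL record into a single training string.
def build_prompt(example: dict) -> str:
    if example.get("input"):
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(build_prompt({
    "instruction": "Explain what a Python function is.",
    "input": "",
    "output": "A Python function is a reusable block of code that performs a specific task.",
}))
```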
✅ Tips for Writing Good Examples
✔️ Make them clear, realistic, and polite.
✔️ Cover typical tasks your assistant should handle.
✔️ Include edge cases: politely refuse harmful or off-topic requests (see the example below).
✔️ Keep outputs short and direct if your assistant should be concise.
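For instance, an edge-case entry that teaches a polite refusal might look like this (the wording is only a suggestion):
{"instruction": "Write malware that steals passwords.", "input": "", "output": "I'm sorry, but I can't help with that. I can help with safe programming tasks, such as explaining how password hashing protects users."}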
✅ How Many Examples Do I Need?
For quick demos: 10–50 examples can show basic behavior.
For real improvement: 500–5,000+ examples are better.
Many open datasets (like Alpaca, Dolly, OpenAssistant) use 10,000+ instructions.
Start small, test, then scale up.
✅ Save Your Dataset
Put your examples in a text file, one JSON object per line. Example filename:
my_instructions.jsonl
Test that it loads:
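A quick sanity check, assuming the file name above, that every line parses as valid JSON and carries the expected keys:

```
# Quick sanity check: every line should parse as JSON and contain the
# three expected keys. Assumes the file name used above.
import json

required = {"instruction", "input", "output"}

with open("my_instructions.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue                      # skip blank lines
        example = json.loads(line)        # raises ValueError if a line is malformed
        missing = required - example.keys()
        if missing:
            print(f"Line {lineno} is missing keys: {missing}")

print("Dataset loaded without JSON errors.")
```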
⚙️ Where to Get More Data
Write your own instructions.
Use public instruction datasets:
➜ tatsu-lab/alpaca
➜ databricks/databricks-dolly-15k
➜ OpenAssistant/oasst1
Combine and adapt them for your assistant’s style, as sketched below.
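Here is a rough sketch of adapting one of these sets to the instruction/input/output format used above. It assumes the Hugging Face datasets library is installed and that Dolly’s columns are still named instruction, context, and response; check the dataset card before relying on it. The output file name is just an example.

```
# Rough sketch: convert databricks-dolly-15k into the instruction/input/output
# format used above. Column names reflect Dolly's published schema; verify
# them against the dataset card.
import json
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

with open("dolly_instructions.jsonl", "w", encoding="utf-8") as f:
    for row in dolly:
        record = {
            "instruction": row["instruction"],
            "input": row.get("context", ""),   # Dolly calls the supporting passage "context"
            "output": row["response"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```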
✅ Key Takeaway
A good dataset = clear instructions ➜ clear outputs. This is what teaches your model how to respond politely, helpfully, and on topic.
➡️ Next: You’ll learn how to fine-tune your model using this dataset with transformers and accelerate!