Interview Questions

1. Core Definition

RAG (Retrieval-Augmented Generation): A system architecture where an LLM retrieves relevant documents from a private or curated knowledge base and then generates an answer grounded in those documents.

Google Search (Simple Search Engine): A search tool that retrieves ranked web pages based on user queries using indexing, ranking algorithms, and relevance scoring. It returns links, not synthesized answers (except limited snippets).
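To make the retrieve-then-generate idea concrete, here is a minimal sketch; the library (sentence-transformers), embedding model, and documents are illustrative assumptions, not part of the comparison:

```python
# Minimal RAG sketch (illustrative only): embed documents, retrieve the closest ones
# for a query, and build a grounded prompt for the LLM. Library and model are assumptions.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Employees may carry over up to 5 unused vacation days into the next year.",
    "The VPN must be used when accessing internal systems from outside the office.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=k)[0]
    return [documents[h["corpus_id"]] for h in hits]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# build_prompt(...) is then sent to the LLM, which generates the answer grounded in the context.
```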


2. Purpose

| Aspect | RAG | Google Search |
| --- | --- | --- |
| Main Goal | Produce accurate, context-aware, synthesized answers | Return the most relevant publicly available web pages |
| Target Use Case | Closed-domain, enterprise, or private-data Q&A | Broad, open-domain information discovery |


3. Data Source

| Aspect | RAG | Google Search |
| --- | --- | --- |
| Data | Organization’s internal documents, PDFs, databases, curated data | Public internet content indexed by Google |
| Freshness | You control updates; real-time ingestion possible | Google updates based on its crawl schedule |
| Accuracy | Fully depends on your data quality | Varies; based on public web |


4. Output Format

| Aspect | RAG | Google Search |
| --- | --- | --- |
| Output | Direct synthesized answer with citations (if implemented) | List of URLs + small text snippets |
| Structure | Natural language response, actionable | You must click and manually extract information |
| Personalization | Highly possible | Limited to your search history and ad model |


5. Interaction Style

| Aspect | RAG | Google Search |
| --- | --- | --- |
| Conversational | Yes, supports multi-turn dialogue | No (single query → result page) |
| Context memory | Yes, if implemented | No |
| Adaptation | Can fine-tune to domain vocabulary | General-purpose only |


6. Privacy and Security

| Aspect | RAG | Google Search |
| --- | --- | --- |
| Data Privacy | Fully private; data stays inside your environment | All queries go to Google servers (public cloud) |
| Enterprise Suitability | Very high (controls, audit logs, no external data) | Low for confidential data |


7. Use Cases

RAG Use Cases

  • Enterprise knowledge-base assistants

  • Internal policy & compliance Q&A

  • Codebase understanding

  • Healthcare, fintech, legal workflows

  • Customer support bots

  • Document intelligence (PDF/chat with files)

Google Search Use Cases

  • Discovering general information

  • Finding news, blogs, public facts

  • Exploring a topic before deep research

  • Getting URLs for public resources


8. Strengths & Weaknesses

RAG Strengths

  • Uses your data (accurate & domain-specific)

  • Produces ready-to-use answers

  • Private and secure

  • Customizable retrieval and ranking

  • Supports multi-turn reasoning

RAG Weaknesses

  • Requires infrastructure (vector DB, embeddings, LLM)

  • Needs proper data preprocessing and maintenance

  • Garbage-in → garbage-out (depends entirely on your dataset)


Google Search Strengths

  • Global, broad coverage

  • Highly optimized ranking algorithms

  • Very fast and constantly improving

  • Great for exploration and public info

Google Search Weaknesses

  • No synthesis; user must read multiple pages

  • Ads influence visibility

  • Not suitable for private or internal knowledge

  • No memory or contextual reasoning


9. Summary Table

| Category | RAG | Google Search |
| --- | --- | --- |
| Knowledge Scope | Private/curated | Open web |
| Output | Synthesized answer | Links |
| Privacy | High | Low for confidential queries |
| Infrastructure | You manage it | Zero |
| Best For | Enterprises, domain-specific systems | Public info retrieval |
| Limitations | Setup cost, quality of data | No context, no reasoning |


Final Summary

RAG is for enterprises that need accurate, contextual, private, synthesized answers from their own data. Google Search is for retrieving public information and exploring topics.


2. I have an Ollama model that I need to fine-tune. Tell me the steps to fine-tune it. Also, how would I measure the improvement before and after fine-tuning?

Below is a practical, end-to-end plan you can follow to fine-tune an Ollama-compatible model, plus a rigorous measurement strategy to quantify improvement (before vs after). I’ll give the concrete steps, example commands/configs, recommended evaluation metrics by task type, and an experiment plan so results are trustworthy.


Summary (one-line)

Fine-tune the base model externally (LoRA/QLoRA/adapter workflow), export the adapter or merged weights in a format Ollama accepts, import via a Modelfile (ADAPTER or FROM) and ollama create, then evaluate using held-out test sets (automatic metrics + human A/B tests + statistical significance). (Ollama Documentation)


Part A — Fine-tuning steps (practical)

1) Define your objective & collect/format data

  1. Decide target task(s): instruction following, QA, summarization, classification, dialogue, code, etc.

  2. Create datasets:

    • Training: 80% (or more if data is limited).

    • Validation: 10% (for hyperparameter tuning / early stopping).

    • Test (holdout): 10% — must not be used during training.

  3. Recommended formats:

    • Instruction / response pairs or chat JSONL (one JSON object per line); each item should include input/instruction and output/response (see the sketch after this list).

    • For QA: include context, question, and gold answer fields.

  4. Clean and de-duplicate. Annotate failure cases you want the model to fix.
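For illustration, a couple of JSONL records in the formats described above; the field names are placeholders, so match whatever your training script expects:

```jsonl
{"instruction": "Summarize the refund policy.", "input": "", "output": "Refunds are issued within 14 days of purchase for unused licenses."}
{"context": "Refunds are issued within 14 days of purchase.", "question": "How long do customers have to request a refund?", "answer": "14 days from purchase."}
```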

(Designing a good dataset and holdout test set is critical; many practitioners fine-tune externally and then import into Ollama for serving.) (GitHub)


2) Choose fine-tuning method and infrastructure

Options depending on resources and goals:

  • LoRA / QLoRA (recommended for limited GPU RAM): trains only low-rank adapter weights; much cheaper, fast to iterate.

  • Full fine-tune / adapters: if you need full weight updates (costly).

  • Tools: Hugging Face Transformers + PEFT, bitsandbytes (8-bit/4-bit quantization), TRL for supervised fine-tuning, Accelerate for distributed training, and libraries like Unsloth that automate QLoRA workflows (see Unsloth / community guides). (Unsloth Docs)

Hardware:

  • For QLoRA on Llama-style 7B–13B models, a single 16–48 GB GPU is usually enough; full fine-tunes or larger models need 80 GB-class GPUs, multi-GPU setups, or cloud instances.


3) Train: example (high-level)

Below is a high-level recipe. Adapt hyperparameters per dataset.

  1. Install libs (for example: pip install transformers peft trl bitsandbytes accelerate datasets).

  2. Example (pseudo) command / script outline for QLoRA/PEFT (see the sketch below):
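A hedged sketch of a QLoRA run with Transformers + PEFT + TRL; the model name, file paths, and hyperparameters are placeholders, and exact trainer arguments vary across TRL versions:

```python
# QLoRA fine-tuning sketch (Transformers + PEFT + TRL). All names, paths, and
# hyperparameters are placeholders; adjust for your dataset and library versions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter: only these low-rank matrices are trained
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Assumes each JSONL record has a single "text" field with the fully formatted
# prompt + response; otherwise pass a formatting function to the trainer.
dataset = load_dataset("json", data_files={"train": "my_data_train.jsonl",
                                           "validation": "my_data_val.jsonl"})

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="./adapter_out",
        dataset_text_field="text",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        eval_strategy="epoch",   # older Transformers versions call this evaluation_strategy
        save_strategy="epoch",
    ),
)
trainer.train()
trainer.save_model("./adapter_out")  # writes the LoRA adapter weights (safetensors)
```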

Key knobs to monitor: training loss, validation loss, and generation quality. Use early stopping on validation metrics.


4) Export adapter / merged weights for Ollama

Two common routes:

A. Export as adapter (safetensors adapter) and import into Ollama

  • Save LoRA/adapter weights as *.safetensors or a directory with safetensors.

  • Create an Ollama Modelfile that references ADAPTER /path/to/adapter and FROM <base model> (per Ollama docs). Then ollama create my-finetuned-model will create a model that applies the adapter at inference. This avoids producing a full merged model and is lightweight. (Ollama Documentation)

Modelfile example:
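A minimal Modelfile sketch for this route, assuming the adapter was saved to ./adapter_out and the base model (here llama3.1, a placeholder) is already available in Ollama and matches the model the adapter was trained on:

```
# Modelfile (sketch) -- adjust the base model name and adapter path for your setup.
# The FROM model must be the same base the adapter was trained against.
FROM llama3.1
ADAPTER ./adapter_out
```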

B. Merge adapters into base model and import full weights

  • Merge adapter into the base weights (tools provide merge functions), produce safetensors for full weights, and reference with FROM /path/to/safetensors/directory in Modelfile. Then ollama create to register the model. (Ollama Documentation)
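A sketch of the merge step using PEFT's merge_and_unload; the model name and paths are placeholders:

```python
# Sketch: merge the LoRA adapter into the base weights and save full safetensors
# for the FROM /path/to/safetensors/directory route.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./adapter_out").merge_and_unload()

merged.save_pretrained("./merged_model", safe_serialization=True)   # full merged weights
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("./merged_model")
```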

After import, test locally:
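For example (the model name matches the Modelfile sketch above; the prompt is a placeholder):

```
ollama create my-finetuned-model -f Modelfile
ollama run my-finetuned-model "Ask something from your held-out test set"
```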

Note: many practitioners fine-tune with external toolchains, then use Ollama for serving and local inference. (union.ai)


Part B — Practical checklist before you start

  • Baseline snapshot: run your pre-fine-tune base model on the exact same test queries and save outputs. You will compare these to the fine-tuned model outputs.

  • Keep training logs, random seeds, and hyperparameters for reproducibility.

  • Ensure test set reflects real production prompts and covers edge cases.


Part C — How to measure improvement (before vs after)

Measurement has three complementary pillars: automatic metrics, behavioral/human evaluation, and operational metrics.

1) Automatic metrics (task dependent)

  • Perplexity (PPL) — general language modelling fit (lower is better). Good for language modeling tasks but not sufficient for instruction quality.

  • BLEU / ROUGE / METEOR — for summarization or exact n-gram overlap tasks.

  • Exact Match (EM) and F1 — for extractive QA (span matching).

  • Accuracy / Macro F1 — for classification tasks.

  • Embedding-based similarity (BERTScore, MoverScore) — for semantic similarity assessment beyond n-gram overlap.

  • Hallucination rate — measure the proportion of generated claims unsupported by the source (requires labels/verification).

Use whichever metrics align with your business objective, and compute them on the same held-out test set for both the pre- and post-fine-tune models.

Recommendation: Choose 2–3 automated metrics relevant to the task (e.g., for instruction following: BLEU/ROUGE + BERTScore + hallucination count). Aim to compute statistical significance (bootstrap CI or paired t-test where applicable).

2) Human / qualitative evaluation

  • A/B testing (blind): present random users/annotators with paired outputs (A = base, B = fine-tuned) for the same prompts in random order. Ask raters to score on:

    • Correctness (0–3)

    • Usefulness / Relevance (0–3)

    • Hallucination (yes/no)

    • Fluency (0–3)

  • Preference rate: percent of times fine-tuned model preferred.

  • Task-success rate: for goal-oriented tasks (e.g., correct SQL generated), measure success on end task.

  • Rubrics: define clear labeling guidelines to reduce annotator variance.

  • Sample size: 200–1000 prompt samples is common for robust A/B tests; for early checks, 100–200 may suffice. Use multiple annotators per item (2–3) and compute inter-annotator agreement.
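A small sketch of the preference-rate and agreement computation, assuming two raters judged every item and using scikit-learn's Cohen's kappa (labels below are toy data):

```python
# Sketch: preference rate and inter-annotator agreement for a blind A/B study.
# Assumes two raters judged every item; "A" = base model, "B" = fine-tuned model.
from sklearn.metrics import cohen_kappa_score

rater1 = ["B", "B", "A", "B", "B"]  # toy labels -- replace with real annotations
rater2 = ["B", "A", "A", "B", "B"]

all_votes = rater1 + rater2
preference_rate = all_votes.count("B") / len(all_votes)  # share of votes preferring fine-tuned
agreement = cohen_kappa_score(rater1, rater2)            # Cohen's kappa between the two raters

print(f"Fine-tuned preferred in {preference_rate:.0%} of judgments (kappa = {agreement:.2f})")
```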

3) Operational metrics

  • Latency: measure average response time before and after (especially if you merged adapters); see the sketch after this list.

  • Throughput and memory usage: adapters are much smaller to store than merged full weights, but inference speed and memory can differ either way, so measure both configurations.

  • Cost: GPU utilization/cost per inference if deploying to cloud.
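A minimal sketch for the latency comparison using Ollama's local REST API; the default port is assumed, and model names and prompts are placeholders:

```python
# Sketch: compare average response latency of the base vs fine-tuned model via
# Ollama's local REST API (default port). Model names and prompts are placeholders.
import time
import requests

PROMPTS = [
    "Summarize our refund policy.",
    "What is the SLA for priority support tickets?",
]  # use the same held-out prompts for both models

def avg_latency(model: str, prompts: list[str]) -> float:
    times = []
    for prompt in prompts:
        start = time.perf_counter()
        requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

print("base:      ", avg_latency("llama3.1", PROMPTS))
print("fine-tuned:", avg_latency("my-finetuned-model", PROMPTS))
```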


Part D — Step-by-step experiment plan (before vs after)

  1. Baseline collection

    • Pick N prompts (N ≥ 200 recommended). These should include:

      • Representative distribution of production prompts.

      • Edge cases where base model fails.

    • Record base model outputs and automatic metrics.

  2. Fine-tune

    • Train with your selected method (LoRA/QLoRA). Save adapter weights and logs.

  3. Post-tune evaluation

    • Run the same N prompts against fine-tuned model.

    • Compute automatic metrics (paired comparisons).

    • Run blind A/B study with human raters (each sample judged by 2–3 raters).

    • Compute preference rate and significance (e.g., paired bootstrap for metric differences; for binary preferences, use a two-proportion z-test).

  4. Error analysis

    • Inspect cases where performance dropped or hallucinations increased.

    • Iterate dataset and fine-tune again if needed.

  5. Production readiness

    • Test latency, memory, and scale (concurrent users).

    • Set monitoring: live logs, sample drift detection, hallucination alerts.


Part E — Example metrics & evaluation scripts (sketch)

Compute EM and F1 for QA (Python sketch):
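A minimal SQuAD-style implementation; normalization is simplified, so adapt it to your gold-answer format:

```python
# Minimal EM / token-level F1 for extractive QA, in the style of the SQuAD metric.
# Normalization is simplified (no article stripping).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(predictions: list[str], golds: list[str]) -> dict:
    """Average EM/F1 over paired outputs; run once for the base and once for the fine-tuned model."""
    em = sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
    f1 = sum(f1_score(p, g) for p, g in zip(predictions, golds)) / len(golds)
    return {"exact_match": em, "f1": f1}
```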

Statistical significance (paired bootstrap):

  • For metric M, compute differences per example (M_after − M_before). Bootstrap sample differences to compute 95% CI; if CI does not include 0, difference is significant.
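A sketch of that procedure, where before and after are per-example metric arrays aligned by prompt:

```python
# Paired bootstrap CI for the mean metric difference (after - before), as described above.
import numpy as np

def paired_bootstrap_ci(before, after, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)  # per-example differences
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        sample = rng.choice(diffs, size=len(diffs), replace=True)
        boot_means[i] = sample.mean()
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return diffs.mean(), (low, high)

# If the 95% CI excludes 0, the improvement is statistically significant at that level.
```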


Part F — Common gotchas & best practices

  • Overfitting: small datasets → overfit. Use early stopping and validate on unseen data.

  • Distribution shift: test set must reflect production prompts, not just training examples.

  • Validation pipeline parity: ensure pre and post inference use the same tokenization, generation settings (temperature, top_p), and prompt templates — otherwise comparisons are invalid.

  • Hallucination drift: fine-tuning can increase confidently wrong answers; always measure hallucination rate.

  • Reproducibility: fix RNG seeds and document environment.


Quick checklist / commands you’ll use frequently

  • Prepare dataset: my_data.jsonl (train/val/test split)

  • Run fine-tune (example): python finetune.py --model base --data my_data.jsonl --peft lora ...

  • Save adapter: ./adapter_out (safetensors)

  • Modelfile for Ollama: as in the Part A sketch (FROM base model + ADAPTER ./adapter_out)

  • Create in Ollama: ollama create my-finetuned-model -f Modelfile

  • Run locally: ollama run my-finetuned-model

(See Ollama docs for Modelfile ADAPTER usage and import commands.) (Ollama Documentation)


References

  • Ollama Modelfile & import docs (adapter import & FROM examples). (Ollama Documentation)

  • Tutorials on fine-tuning Llama-3 / QLoRA and exporting for local use. (datacamp.com)

  • Community writeups showing practical flows: medium / toward-ai posts and union.ai examples for serving with Ollama. (Towards AI)

