IVQA 151-200

151. What preprocessing steps are needed before training a GenAI model?

  • Tokenization (model-specific)

  • Lowercasing, punctuation handling

  • Text normalization (unicode, contractions)

  • Deduplication of samples

  • Filtering (e.g., remove empty, toxic, or low-quality content)

  • Formatting (JSONL, conversational format)


152. How do you clean noisy textual data for GenAI training?

  • Remove non-text elements (HTML, code artifacts)

  • Normalize whitespace and characters

  • Use heuristics or classifiers to remove off-topic/toxic content

  • Use language detection to remove non-target language content


153. What is tokenization drift and how do you prevent it?

Tokenization drift occurs when the tokenizer used during training differs from the one used at inference. Prevent it by:

  • Using the same tokenizer version across pipeline stages

  • Locking tokenizer vocab

  • Including tokenizer metadata in model checkpoints


154. How do you manage out-of-vocabulary (OOV) tokens?

  • Use subword tokenization (e.g., BPE, WordPiece)

  • Add domain-specific tokens during tokenizer training

  • Avoid character-level OOV fallback unless explicitly needed


155. How would you prepare a custom dataset for fine-tuning GPT?

  • Format as JSONL with {"prompt": "...", "completion": "..."} or chat messages

  • Ensure consistent structure (e.g., user/assistant roles)

  • Clean and deduplicate entries

  • Optionally balance dataset classes/topics
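
A minimal sketch of turning raw instruction/response pairs into chat-style JSONL, assuming the records already live in a Python list (field names follow OpenAI's chat fine-tuning format; the sample records are illustrative):

```python
import json

# Hypothetical raw records; in practice these come from your own data source.
records = [
    {"instruction": "Summarize: The Q3 report shows...", "response": "Revenue grew 12%..."},
    {"instruction": "Translate to French: Good morning", "response": "Bonjour"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        example = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": rec["instruction"].strip()},
                {"role": "assistant", "content": rec["response"].strip()},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```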


156. What is the role of chunking in RAG pipelines?

  • Breaks long documents into manageable, semantically coherent pieces

  • Improves retrieval accuracy

  • Enables context-window fitting

  • Helps minimize hallucination by feeding only relevant chunks
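
A minimal, dependency-free sketch of fixed-size chunking with overlap (chunk size and overlap values are illustrative; production pipelines often split on sentence or section boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a long document becomes overlapping ~500-word pieces.
# chunks = chunk_text(open("report.txt").read())
```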


157. How do you handle multi-language data ingestion for a GenAI use case?

  • Use language detection and tag inputs

  • Apply language-specific cleaning/tokenization

  • Maintain language balance or weight important languages

  • Fine-tune with multilingual embeddings or models like mT5, XLM-R


158. How do you anonymize personally identifiable data before training?

  • Use regex/NLP rules for detecting names, emails, IDs, etc.

  • Replace with placeholders ([NAME], [EMAIL])

  • Apply NER-based de-identification (spaCy, Presidio)

  • Validate outputs for leakage before final training
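
A minimal regex-based sketch for masking emails and phone numbers with placeholders (the patterns are illustrative; libraries such as Presidio or spaCy NER cover names and IDs more reliably):

```python
import re

# Illustrative patterns; real pipelines need broader coverage and validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(anonymize("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```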


159. What are tradeoffs between training on documents vs. dialogue data?

| Aspect | Document Data | Dialogue Data |
| --- | --- | --- |
| Strength | Rich, factual knowledge | Conversational fluency |
| Limitation | Less interactive, static format | Prone to bias or casual tone |
| Use case | Summarization, retrieval | Chatbots, assistants |


160. How do you balance dataset diversity without sacrificing relevance?

  • Use sampling weights for underrepresented domains

  • Apply clustering + filtering to avoid redundancy

  • Combine general + domain-specific corpora

  • Maintain quality using human-in-the-loop filtering


161. How do LLMs handle long context windows, and what are the limits?

  • Use long-context architectures (e.g., Transformer-XL, Longformer) or models with extended windows (e.g., GPT-4 Turbo with 128k tokens)

  • Context limit = maximum token window (e.g., 8k, 32k, 128k)

  • Tradeoff: Larger context → more compute → higher cost


162. What is memory replay in agent frameworks?

Memory replay is the reuse of past dialogues, steps, or retrieved content in agent workflows to:

  • Improve reasoning

  • Avoid repetition

  • Preserve continuity across tasks


163. How does ReAct differ from simple tool-calling agents?

ReAct agents combine reasoning and acting:

  • Think step-by-step (“Thought → Action → Observation”)

  • Update reasoning after each action

  • More robust than one-shot tool invocations


164. What is “episodic memory” in LLMs?

Episodic memory stores structured interaction history (e.g., chat sessions) that the model can recall across sessions. It enables:

  • Persistent context

  • Cross-session continuity

  • Task tracking


165. How do you store and retrieve long-term memory using vector DBs?

  • Embed chunks using an embedding model (e.g., OpenAI, SBERT)

  • Store in vector DBs (e.g., Qdrant, Weaviate)

  • Retrieve top-k relevant memories using cosine similarity during each turn
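
A minimal sketch using sentence-transformers and an in-memory Qdrant instance (the model name, collection name, and sample "memories" are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                   # swap for a real Qdrant server in production

client.recreate_collection(
    collection_name="memories",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store past interactions as "memories".
memories = ["User prefers concise answers", "User is migrating from MySQL to Postgres"]
client.upsert(
    collection_name="memories",
    points=[
        PointStruct(id=i, vector=encoder.encode(m).tolist(), payload={"text": m})
        for i, m in enumerate(memories)
    ],
)

# Retrieve the top-k relevant memories for the current turn.
hits = client.search(
    collection_name="memories",
    query_vector=encoder.encode("Which database is the user moving to?").tolist(),
    limit=2,
)
print([h.payload["text"] for h in hits])
```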


166. How do you deal with context loss in multi-turn conversations?

  • Implement sliding windows or summarization

  • Use conversation history compression

  • Store and re-inject memory via external databases or RAG

  • Keep prompts focused to reduce drift


167. What’s the difference between external and internal memory for agents?

| Type | Description |
| --- | --- |
| Internal | Held within the model context (tokens) |
| External | Retrieved/stored in databases or vector stores |

External memory enables persistent, large-scale memory beyond the token limit.


168. How does Claude 2/3 manage longer context better than GPT-4?

Anthropic has not published full architectural details, but Claude 2/3 is built for very long contexts, enabling:

  • Up to 200k+ token input

  • Higher recall fidelity over long documents

  • Better tracking across lengthy sessions


169. What strategies help chunk documents for better summarization?

  • Semantic chunking using sentence boundaries

  • Overlapping chunks to retain context

  • Use headings or section tags

  • Embed and filter for relevance before summarization


170. How do you evaluate memory relevance in GenAI workflows?

  • Use retrieval score thresholds

  • Manual relevance judgment (human eval)

  • Measure downstream impact (e.g., improved Q&A, fewer hallucinations)

  • Token-efficiency vs. relevance tradeoff


171. Compare Mistral, LLaMA 2, and Falcon models.

| Model | Strengths | Context Limit | License |
| --- | --- | --- | --- |
| Mistral | Fast, open, small model (7B) | 8K–32K tokens (version-dependent) | Apache 2.0 |
| LLaMA 2 | Meta's accurate model family (7B–70B) | 4K tokens | Custom Meta license (commercial use allowed with restrictions) |
| Falcon | Strong pretraining; solid multilingual (incl. Arabic) performance | ~2K tokens | Apache 2.0 |


172. How do you host an open-source LLM using Ollama or Text Generation Web UI?

  • Ollama: CLI-based; runs models like Mistral locally with simple commands

  • Text Generation Web UI: GUI wrapper over Hugging Face models with quantization, streaming, chat support
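
After pulling a model (e.g., `ollama pull mistral`), Ollama exposes a local REST API on port 11434 by default; a minimal sketch of calling it from Python (assumes the Ollama server is running and the model is already pulled):

```python
import requests

# Non-streaming generation request against the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain RAG in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```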


173. What are the benefits of vLLM for serving LLMs in production?

  • Fast inference via PagedAttention

  • Efficient multi-user batching

  • Drop-in support for OpenAI-compatible APIs

  • Better throughput and latency than naive inference
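
A minimal offline-inference sketch with vLLM (the model name is illustrative and requires a GPU); vLLM also ships an OpenAI-compatible HTTP server for production serving:

```python
from vllm import LLM, SamplingParams

# A batch of prompts served in one call; PagedAttention manages the KV cache internally.
prompts = ["Summarize the benefits of RAG.", "What is quantization?"]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # assumed model name
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```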


174. How does Hugging Face Inference Endpoints work for GenAI?

  • Deploy Hugging Face models via managed infrastructure

  • Auto-scales and provides secure REST APIs

  • Ideal for fast deployments without infra setup


175. What is quantization-aware training (QAT)?

QAT trains a model while simulating lower-precision (e.g., int8) arithmetic. It:

  • Improves inference efficiency

  • Reduces size with minimal accuracy loss

  • Typically preserves accuracy better than post-training quantization (PTQ)


176. How do you deploy LLaMA 2 using Hugging Face Transformers?

  • Load model using AutoModelForCausalLM

  • Use transformers, accelerate, or vLLM

  • Deploy via FastAPI, Triton, or HF Endpoints

  • Follow Meta’s license and acceptable-use policy (especially for the 70B variant); see the sketch below
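
A minimal loading-and-generation sketch with transformers (assumes you have accepted Meta's license and have access to the gated Hugging Face repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated repo; access must be granted first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit on a single GPU
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Explain vector databases briefly.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```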


177. What is the role of Triton or ONNX in GenAI inference?

  • Triton Inference Server: Scales multi-model, multi-framework workloads

  • ONNX Runtime: Cross-platform, optimized runtime for running models exported to ONNX format

Both improve deployment performance and portability.


178. How do you benchmark different GenAI models locally?

  • Use Hugging Face’s evaluate or custom scripts

  • Compare:

    • Latency (ms/token)

    • Throughput

    • Memory use

    • Accuracy (BLEU, ROUGE, MMLU)

  • Use same prompt sets and tokenizers
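
A minimal latency/throughput harness sketch (the model list and prompts are illustrative; the same prompt set and generation settings should be reused for every model being compared):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = ["Summarize: ...", "Translate to German: Hello"]   # keep identical across models

def benchmark(model_id: str, max_new_tokens: int = 64) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    total_tokens, start = 0, time.perf_counter()
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        total_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed   # generated tokens per second

for model_id in ["mistralai/Mistral-7B-Instruct-v0.2", "tiiuae/falcon-7b-instruct"]:
    print(model_id, f"{benchmark(model_id):.1f} tok/s")
```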


179. How does OpenRouter help route across multiple LLMs?

  • Acts as a proxy to multiple models (OpenAI, Claude, Cohere)

  • Smart routing based on:

    • Availability

    • Cost

    • Performance

Great for fallback and comparative evaluation.


180. What are the licensing concerns when using open-source LLMs in commercial apps?

  • Check commercial-use clauses (e.g., LLaMA 2 uses a custom Meta license rather than a standard open-source license)

  • Respect weights redistribution restrictions

  • Provide attribution where required

  • Prefer Apache 2.0, MIT, BSD for commercial use


181. How do you use OpenAI’s function calling to interact with APIs?

  • Define function schemas (name, parameters, descriptions)

  • Pass as functions to the API

  • Model returns JSON with function + arguments

  • You run the function and return result to model
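
A minimal sketch with the openai Python SDK (v1+) using the tools interface; the get_weather schema is purely illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical function
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",                                  # any tool-capable model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)           # e.g. {"city": "Paris"}
# Run your real function with `args`, then return the result to the model as a "tool" message.
```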


182. Build a Python script to call GPT-4 for summarizing a PDF.
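
A minimal sketch assuming pypdf for text extraction and the openai v1 SDK (the file path is a placeholder); very long PDFs would need chunked, map-reduce style summarization instead of the naive truncation shown here:

```python
from openai import OpenAI
from pypdf import PdfReader

def summarize_pdf(path: str, model: str = "gpt-4") -> str:
    # Extract raw text page by page.
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize documents in 5 bullet points."},
            {"role": "user", "content": text[:12000]},  # naive truncation to stay in context
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize_pdf("report.pdf"))
```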


183. Write a prompt template to extract structured data from unstructured reviews.

"Extract the following fields: Product Name, Rating (out of 5), Complaint Summary, Suggested Improvement. Review: {review_text}"


184. How would you use GenAI to create SQL queries from English prompts?

  • Few-shot prompt: Include examples of English-to-SQL

  • Tools like Text2SQL, OpenAI with function calling

  • Add table schema in prompt to improve accuracy
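
A minimal prompt-construction sketch with one few-shot example and the table schema inlined (the schema and example query are illustrative):

```python
SCHEMA = "Table orders(id INT, customer_id INT, total DECIMAL, created_at DATE)"

FEW_SHOT = """Q: Total revenue last month?
SQL: SELECT SUM(total) FROM orders
     WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
       AND created_at < DATE_TRUNC('month', CURRENT_DATE);"""

def build_prompt(question: str) -> str:
    return (
        f"You translate English questions into SQL.\n"
        f"Schema: {SCHEMA}\n\n"
        f"{FEW_SHOT}\n\n"
        f"Q: {question}\nSQL:"
    )

print(build_prompt("How many orders did customer 42 place this year?"))
```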


185. How do you validate user inputs before passing them to an LLM?

  • Sanitize (remove harmful content)

  • Use regex or schema validation

  • Enforce length and structure limits

  • Strip system prompt injection attempts
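
A minimal validation sketch using Pydantic (v2) for schema and length limits plus a simple pattern check for obvious injection phrases (the patterns are illustrative, not a complete defence):

```python
import re
from pydantic import BaseModel, field_validator

# Illustrative injection markers; real filters are broader and regularly updated.
INJECTION_PATTERNS = re.compile(
    r"(ignore (all )?previous instructions|you are now|system prompt)", re.IGNORECASE
)

class UserQuery(BaseModel):
    text: str

    @field_validator("text")
    @classmethod
    def check_text(cls, v: str) -> str:
        v = v.strip()
        if not 1 <= len(v) <= 2000:
            raise ValueError("Input must be between 1 and 2000 characters")
        if INJECTION_PATTERNS.search(v):
            raise ValueError("Input looks like a prompt-injection attempt")
        return v

query = UserQuery(text="Summarize my last three invoices.")   # raises ValidationError if invalid
```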


186. Build a FastAPI endpoint that takes user input and calls a GenAI model.
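
A minimal sketch combining FastAPI, Pydantic, and the openai v1 SDK (the model name is a placeholder; the input validation from the previous question would slot in here):

```python
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Query(BaseModel):
    text: str

@app.post("/generate")
def generate(query: Query) -> dict:
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",                      # placeholder model name
            messages=[{"role": "user", "content": query.text}],
        )
    except Exception as exc:                          # surface upstream failures as 502
        raise HTTPException(status_code=502, detail=str(exc))
    return {"answer": response.choices[0].message.content}

# Run with: uvicorn main:app --reload
```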


187. What’s the best way to batch prompts for OpenAI API to reduce cost?

  • Group several small items into a single request rather than making one call per item

  • Combine related queries

  • Use OpenAI's Batch API for large asynchronous jobs at reduced cost

  • Apply caching with hash-based keys


188. How can you use GenAI to classify and route customer tickets?

  • Prompt: "Classify this ticket into Billing, Technical, Feedback, Other"

  • Output = label → map to internal routing logic

  • Optionally use OpenAI function calling or fine-tuned classifier


189. Implement a RAG flow using LangChain and Qdrant.

  • Embed documents with OpenAI

  • Store in Qdrant

  • On query, retrieve top-k chunks

  • Construct context → send to OpenAI for the answer

  • Wire it together with LangChain's RetrievalQA chain (see the sketch below)
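
A minimal sketch using the langchain-openai and langchain-community packages (exact import paths vary by LangChain version; the documents and collection name are illustrative):

```python
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document

docs = [
    Document(page_content="Qdrant is an open-source vector database."),
    Document(page_content="RAG retrieves relevant chunks before generation."),
]

# Embed documents and store them in an in-memory Qdrant collection.
vectorstore = Qdrant.from_documents(
    docs,
    OpenAIEmbeddings(),
    location=":memory:",              # use a real Qdrant URL in production
    collection_name="rag_demo",
)

# Retrieve top-k chunks per query and let the LLM answer over them.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
print(qa.invoke({"query": "What does RAG do before generation?"})["result"])
```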


190. How do you cache frequent queries in a GenAI-powered web app?

  • Store hash of input prompt

  • Cache response in Redis or Postgres

  • Set TTL or LRU eviction

  • Avoid redundant token costs on repeated inputs
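
A minimal Redis sketch with hash-based keys and a TTL (the prompt normalization is simplistic and call_llm is a hypothetical placeholder for your model call):

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 24 * 3600

def cached_generate(prompt: str) -> str:
    key = "llm:" + hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return cached                      # cache hit: no token cost
    answer = call_llm(prompt)              # hypothetical LLM call
    r.setex(key, TTL_SECONDS, answer)      # expire stale entries automatically
    return answer
```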


191. What architecture would you recommend for a GenAI-powered document search system?

  • User → API Gateway → Retriever (Qdrant) + Reader (LLM)

  • Preprocess → Chunk → Embed → Store

  • Serve via FastAPI/Flask + monitoring stack (Prometheus + Grafana)


192. How do you secure an LLM API used in internal enterprise tools?

  • Use OAuth or API keys

  • Apply RBAC policies

  • Encrypt logs and inputs

  • Log access for audits


193. What are the tradeoffs between using managed LLMs and self-hosting?

| Managed | Self-hosted |
| --- | --- |
| Fast, scalable | Full control |
| Less privacy | Infra overhead |
| Low maintenance | Customization possible |


194. How do you enforce audit logs and traceability in a GenAI pipeline?

  • Log input/output, timestamps, user ID

  • Store function/tool calls

  • Maintain model version history

  • Comply with SOC2/GDPR audit trails


195. How would you scale an LLM-based email summarizer for 1M users?

  • Use queue-based system (e.g., Celery, Kafka)

  • Batch inputs

  • Use vLLM or quantized models

  • Parallelize with GPU workers

  • Cache past summaries


196. What’s the role of message queues (e.g., Kafka, RabbitMQ) in GenAI backends?

  • Handle async tasks (summarization, classification)

  • Enable horizontal scaling

  • Buffer bursty requests

  • Decouple frontend and LLM processing


197. How do you integrate GenAI with CI/CD workflows?

  • Lint/test prompt templates

  • Version control for prompts + models

  • Run eval tests on staging before deploying

  • Track performance drift


198. What’s a good microservices structure for a GenAI-powered SaaS platform?

  • API Gateway

  • Auth Service

  • GenAI Service (RAG, summarizer, etc.)

  • Vector Store Service

  • Monitoring Service

Each service can be deployed, scaled, and tested independently.


199. How do you perform load testing on GenAI endpoints?

  • Use tools like Locust, k6, or Artillery

  • Simulate concurrent users

  • Track response time, token latency, token usage

  • Monitor GPU/memory usage under load
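
A minimal Locust sketch hitting a hypothetical /summarize endpoint; run with `locust -f locustfile.py --host http://localhost:8000`:

```python
from locust import HttpUser, between, task

class GenAIUser(HttpUser):
    wait_time = between(1, 3)   # seconds of think time between requests

    @task
    def summarize(self):
        # Hypothetical endpoint; the payload mirrors what real users would send.
        self.client.post("/summarize", json={"text": "Quarterly report text..."})
```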


200. How do you maintain versioning for prompts, models, and embeddings in production?

  • Use Git-based prompt repos

  • Tag model versions and configs

  • Hash embeddings or use vector DB namespaces

  • Include version metadata in logs and user requests

