IVQA 151-200
151. What preprocessing steps are needed before training a GenAI model?
Tokenization (model-specific)
Lowercasing, punctuation handling
Text normalization (unicode, contractions)
Deduplication of samples
Filtering (e.g., remove empty, toxic, or low-quality content)
Formatting (JSONL, conversational format)
152. How do you clean noisy textual data for GenAI training?
Remove non-text elements (HTML, code artifacts)
Normalize whitespaces and characters
Use heuristics or classifiers to remove off-topic/toxic content
Use language detection to remove non-target language content
153. What is tokenization drift and how do you prevent it?
Tokenization drift occurs when the tokenizer used during training differs from the one used at inference. Prevent it by:
Using the same tokenizer version across pipeline stages
Locking tokenizer vocab
Including tokenizer metadata in model checkpoints
154. How do you manage out-of-vocabulary (OOV) tokens?
Use subword tokenization (e.g., BPE, WordPiece)
Add domain-specific tokens during tokenizer training
Avoid character-level OOV fallback unless explicitly needed
155. How would you prepare a custom dataset for fine-tuning GPT?
Format as JSONL with {"prompt": "...", "completion": "..."} pairs or as chat messages
Ensure consistent structure (e.g., user/assistant roles)
Clean and deduplicate entries
Optionally balance dataset classes/topics
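For example, a single chat-style record appended to a JSONL file (the field layout follows the common OpenAI fine-tuning format; adjust keys if your framework expects prompt/completion pairs):

```python
import json

# One chat-style training record; the content strings are placeholders.
record = {
    "messages": [
        {"role": "user", "content": "Summarize our refund policy."},
        {"role": "assistant", "content": "Refunds are issued within 14 days of purchase."},
    ]
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```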
156. What is the role of chunking in RAG pipelines?
Breaks long documents into manageable, semantically coherent pieces
Improves retrieval accuracy
Enables context-window fitting
Helps minimize hallucination by feeding only relevant chunks
157. How do you handle multi-language data ingestion for a GenAI use case?
Use language detection and tag inputs
Apply language-specific cleaning/tokenization
Maintain language balance or weight important languages
Fine-tune with multilingual embeddings or models like mT5, XLM-R
158. How do you anonymize personally identifiable data before training?
Use regex/NLP rules for detecting names, emails, IDs, etc.
Replace with placeholders ([NAME], [EMAIL])
Apply NER-based de-identification (spaCy, Presidio)
Validate outputs for leakage before final training
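A minimal sketch of regex- plus NER-based de-identification with spaCy (assumes the en_core_web_sm model is installed; tools like Presidio add broader entity coverage):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(text: str) -> str:
    # Replace person names detected by spaCy with a placeholder.
    doc = nlp(text)
    for ent in reversed(doc.ents):  # iterate in reverse so character offsets stay valid
        if ent.label_ == "PERSON":
            text = text[:ent.start_char] + "[NAME]" + text[ent.end_char:]
    # Simple regex pass for email addresses.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    return text

print(anonymize("Contact Jane Doe at jane.doe@example.com"))
```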
159. What are tradeoffs between training on documents vs. dialogue data?
| | Documents | Dialogue data |
|---|---|---|
| Strength | Rich, factual knowledge | Conversational fluency |
| Limitation | Less interactive, static format | Prone to bias or casual tone |
| Use case | Summarization, retrieval | Chatbots, assistants |
160. How do you balance dataset diversity without sacrificing relevance?
Use sampling weights for underrepresented domains
Apply clustering + filtering to avoid redundancy
Combine general + domain-specific corpora
Maintain quality using human-in-the-loop filtering
161. How do LLMs handle long context windows, and what are the limits?
Use transformer variants like Transformer-XL, Longformer, or GPT-4-128k
Context limit = maximum token window (e.g., 8k, 32k, 128k)
Tradeoff: Larger context → more compute → higher cost
162. What is memory replay in agent frameworks?
Memory replay is the reuse of past dialogues, steps, or retrieved content in agent workflows to:
Improve reasoning
Avoid repetition
Preserve continuity across tasks
163. How does ReAct differ from simple tool-calling agents?
ReAct agents combine reasoning and acting:
Think step-by-step (“Thought → Action → Observation”)
Update reasoning after each action
More robust than one-shot tool invocations
164. What is “episodic memory” in LLMs?
Episodic memory stores structured interaction history (e.g., chat sessions) that the model can recall across sessions. It enables:
Persistent context
Cross-session continuity
Task tracking
165. How do you store and retrieve long-term memory using vector DBs?
Embed chunks using an embedding model (e.g., OpenAI, SBERT)
Store in vector DBs (e.g., Qdrant, Weaviate)
Retrieve top-k relevant memories using cosine similarity during each turn
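A condensed sketch using sentence-transformers and qdrant-client (the embedding model, collection name, and in-memory storage are illustrative; production systems point at a running Qdrant instance):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                   # use a server URL in production

client.recreate_collection(
    collection_name="memory",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store past interaction snippets as "memories".
memories = ["User prefers concise answers.", "User is building a FastAPI service."]
client.upsert(
    collection_name="memory",
    points=[PointStruct(id=i, vector=encoder.encode(m).tolist(), payload={"text": m})
            for i, m in enumerate(memories)],
)

# On each new turn, retrieve the top-k most relevant memories by cosine similarity.
hits = client.search(
    collection_name="memory",
    query_vector=encoder.encode("What framework am I using?").tolist(),
    limit=2,
)
print([hit.payload["text"] for hit in hits])
```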
166. How do you deal with context loss in multi-turn conversations?
Implement sliding windows or summarization
Use conversation history compression
Store and re-inject memory via external databases or RAG
Keep prompts focused to reduce drift
167. What’s the difference between external and internal memory for agents?
Internal: held within the model context (tokens)
External: retrieved/stored in databases or vector stores
External memory enables persistent, large-scale memory beyond the token limit.
168. How does Claude 2/3 manage longer context better than GPT-4?
Claude 2/3 pairs a very large context window with efficient attention and context-management techniques (the exact internals are not public), enabling:
Up to 200k+ token input
Higher recall fidelity over long documents
Better tracking across lengthy sessions
169. What strategies help chunk documents for better summarization?
Semantic chunking using sentence boundaries
Overlapping chunks to retain context
Use headings or section tags
Embed and filter for relevance before summarization
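A minimal word-based sketch of overlapping chunking; production pipelines usually split on sentence or section boundaries instead of raw word counts:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that overlap to preserve context."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

print(len(chunk_text("lorem " * 1200)))  # e.g. 3 overlapping chunks
```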
170. How do you evaluate memory relevance in GenAI workflows?
Use retrieval score thresholds
Manual relevance judgment (human eval)
Measure downstream impact (e.g., improved Q&A, fewer hallucinations)
Token-efficiency vs. relevance tradeoff
171. Compare Mistral, LLaMA 2, and Falcon models.
| Model | Highlights | Context window | License |
|---|---|---|---|
| Mistral | Fast, open, small model (7B) | up to ~32K tokens | Apache 2.0 |
| LLaMA 2 | Meta's accurate model family (7B–70B) | 4K tokens (up to 16K in extended variants) | Custom Meta community license (not fully open) |
| Falcon | Strong pretraining (TII, UAE); Arabic/NLP focus | ~2K–4K tokens | Apache 2.0 |
172. How do you host an open-source LLM using Ollama or Text Generation Web UI?
Ollama: CLI-based; runs models like Mistral locally with simple commands
Text Generation Web UI: GUI wrapper over Hugging Face models with quantization, streaming, chat support
173. What are the benefits of vLLM for serving LLMs in production?
Fast inference via PagedAttention
Efficient multi-user batching
Drop-in support for OpenAI-compatible APIs
Better throughput and latency than naive inference
174. How does Hugging Face Inference Endpoints work for GenAI?
Deploy Hugging Face models via managed infrastructure
Auto-scales and provides secure REST APIs
Ideal for fast deployments without infra setup
175. What is quantization-aware training (QAT)?
QAT trains a model while simulating lower-precision (e.g., int8) arithmetic. It:
Improves inference efficiency
Reduces size with minimal accuracy loss
Outperforms post-training quantization (PTQ)
176. How do you deploy LLaMA 2 using Hugging Face Transformers?
Load the model with AutoModelForCausalLM
Use transformers, accelerate, or vLLM
Deploy via FastAPI, Triton, or HF Endpoints
Follow Meta's usage policy (especially for the 70B variant); a minimal loading sketch is shown below
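A minimal loading sketch with transformers (requires accepting Meta's license on the Hugging Face Hub; the 7B chat checkpoint is used purely for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo: needs an approved HF token

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # let accelerate place layers on available devices
)

inputs = tokenizer("Explain retrieval-augmented generation briefly.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```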
177. What is the role of Triton or ONNX in GenAI inference?
Triton Inference Server: Scales multi-model, multi-framework workloads
ONNX Runtime: Cross-platform, optimized runtime for running exported models
Both improve deployment performance and portability.
178. How do you benchmark different GenAI models locally?
Use Hugging Face's evaluate library or custom scripts
Compare:
Latency (ms/token)
Throughput
Memory use
Accuracy (BLEU, ROUGE, MMLU)
Use same prompt sets and tokenizers
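A rough latency sketch for a local Hugging Face model (the model name and prompts are placeholders, and device_map="auto" assumes accelerate is installed):

```python
import time
from transformers import pipeline

# Placeholder model; swap in each model you want to compare.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

prompts = ["Summarize the benefits of vector databases.", "Explain LoRA in one paragraph."]

for prompt in prompts:
    start = time.perf_counter()
    out = generator(prompt, max_new_tokens=100)[0]["generated_text"]
    elapsed = time.perf_counter() - start
    n_tokens = len(generator.tokenizer.encode(out))
    print(f"{elapsed:.2f}s total, {1000 * elapsed / n_tokens:.1f} ms/token")
```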
179. How does OpenRouter help route across multiple LLMs?
Acts as a proxy to multiple models (OpenAI, Claude, Cohere)
Smart routing based on:
Availability
Cost
Performance
Great for fallback and comparative evaluation.
180. What are the licensing concerns when using open-source LLMs in commercial apps?
Check commercial use clauses (e.g., LLaMA 2 is not fully open)
Respect weights redistribution restrictions
Provide attribution where required
Prefer Apache 2.0, MIT, BSD for commercial use
181. How do you use OpenAI’s function calling to interact with APIs?
Define function schemas (name, parameters, descriptions)
Pass them as functions/tools to the API
The model returns JSON with the selected function and its arguments
You run the function and return the result to the model
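A minimal sketch using the tools interface of the OpenAI Python SDK (v1+); the get_weather schema is a hypothetical example, and older SDK versions expose the same idea through the functions parameter:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical weather-lookup function schema for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Paris"}
# Execute get_weather(**args) yourself, then send the result back in a
# follow-up message with role "tool" so the model can compose its final answer.
```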
182. Build a Python script to call GPT-4 for summarizing a PDF.
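One possible implementation, assuming pypdf for text extraction; long documents would need chunked or map-reduce summarization rather than the simple truncation shown here:

```python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

def summarize_pdf(path: str) -> str:
    # Extract raw text from every page (works only for text-based PDFs).
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You summarize documents concisely."},
            {"role": "user", "content": f"Summarize the following document:\n\n{text[:20000]}"},
        ],
    )
    return response.choices[0].message.content

print(summarize_pdf("report.pdf"))
```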
183. Write a prompt template to extract structured data from unstructured reviews.
"Extract the following fields: Product Name, Rating (out of 5), Complaint Summary, Suggested Improvement. Review: {review_text}"
184. How would you use GenAI to create SQL queries from English prompts?
Few-shot prompt: Include examples of English-to-SQL
Tools like Text2SQL, OpenAI with function calling
Add table schema in prompt to improve accuracy
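An illustrative few-shot prompt with the schema inlined (the tables, columns, and example question are made up):

```python
SQL_PROMPT = """You translate English questions into SQL for this schema:
Table orders(id, customer_id, total, created_at)
Table customers(id, name, country)

Q: Total revenue in 2023?
SQL: SELECT SUM(total) FROM orders WHERE created_at BETWEEN '2023-01-01' AND '2023-12-31';

Q: {question}
SQL:"""

print(SQL_PROMPT.format(question="How many customers are in Germany?"))
```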
185. How do you validate user inputs before passing them to an LLM?
Sanitize (remove harmful content)
Use regex or schema validation
Enforce length and structure limits
Strip system prompt injection attempts
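A small sketch using Pydantic v2 validators (the injection regex is only a naive illustration; real guards layer multiple checks and output-side filtering):

```python
import re
from pydantic import BaseModel, field_validator

class UserQuery(BaseModel):
    text: str

    @field_validator("text")
    @classmethod
    def check_text(cls, v: str) -> str:
        v = v.strip()
        if not (1 <= len(v) <= 2000):
            raise ValueError("query must be between 1 and 2000 characters")
        # Naive guard against an obvious prompt-injection phrase.
        if re.search(r"ignore (all |any )?previous instructions", v, re.IGNORECASE):
            raise ValueError("query rejected")
        return v

query = UserQuery(text="Summarize this ticket for me.")  # raises ValueError if invalid
```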
186. Build a FastAPI endpoint that takes user input and calls a GenAI model.
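One possible minimal implementation; the OpenAI client and model name are placeholders for whichever backend you use:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    # Forward the validated user input to the model and return its reply.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.text}],
    )
    return {"output": response.choices[0].message.content}

# Run with: uvicorn main:app --reload
```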
187. What’s the best way to batch prompts for OpenAI API to reduce cost?
Combine related queries into a single ChatCompletion.create(messages=[...]) request where possible
Use OpenAI's Batch API for large asynchronous jobs
Apply caching with hash-based keys
188. How can you use GenAI to classify and route customer tickets?
Prompt: "Classify this ticket into Billing, Technical, Feedback, Other"
Output = label → map to internal routing logic
Optionally use OpenAI function calling or fine-tuned classifier
189. Implement a RAG flow using LangChain and Qdrant.
Embed documents with OpenAI
Store in Qdrant
On query, retrieve top-k chunks
Construct context → send to OpenAI for the answer
Use the RetrievalQA chain in LangChain (a sketch follows below)
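A condensed sketch of this flow; exact import paths vary across LangChain versions, and the in-memory Qdrant location, sample documents, and model names are illustrative:

```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA

# Embed and store documents in a Qdrant collection (in-memory for the demo).
docs = ["Our refund window is 14 days.", "Support is available 24/7 via chat."]
vectorstore = Qdrant.from_texts(
    docs,
    OpenAIEmbeddings(),
    location=":memory:",       # point at a Qdrant URL in production
    collection_name="kb",
)

# Retrieve top-k chunks per query and let the LLM answer from that context.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

print(qa.invoke({"query": "How long is the refund window?"})["result"])
```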
190. How do you cache frequent queries in a GenAI-powered web app?
Store hash of input prompt
Cache response in Redis or Postgres
Set TTL or LRU eviction
Avoid redundant token costs on repeated inputs
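A minimal sketch of hash-keyed caching with Redis (the model name and TTL are illustrative):

```python
import hashlib
import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def cached_completion(prompt: str, ttl: int = 3600) -> str:
    # Key on a hash of the prompt so identical queries reuse the stored answer.
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return cached.decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    r.setex(key, ttl, answer)  # expire after the TTL to avoid stale answers
    return answer
```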
191. What architecture would you recommend for a GenAI-powered document search system?
User → API Gateway → Retriever (Qdrant) + Reader (LLM)
Preprocess → Chunk → Embed → Store
Serve via FastAPI/Flask + monitoring stack (Prometheus + Grafana)
192. How do you secure an LLM API used in internal enterprise tools?
Use OAuth or API keys
Apply RBAC policies
Encrypt logs and inputs
Log access for audits
193. What are the tradeoffs between using managed LLMs and self-hosting?
| Managed LLMs | Self-hosting |
|---|---|
| Fast, scalable | Full control |
| Less privacy | Infra overhead |
| Low maintenance | Customization possible |
194. How do you enforce audit logs and traceability in a GenAI pipeline?
Log input/output, timestamps, user ID
Store function/tool calls
Maintain model version history
Comply with SOC2/GDPR audit trails
195. How would you scale an LLM-based email summarizer for 1M users?
Use queue-based system (e.g., Celery, Kafka)
Batch inputs
Use vLLM or quantized models
Parallelize with GPU workers
Cache past summaries
196. What’s the role of message queues (e.g., Kafka, RabbitMQ) in GenAI backends?
Handle async tasks (summarization, classification)
Enable horizontal scaling
Buffer bursty requests
Decouple frontend and LLM processing
197. How do you integrate GenAI with CI/CD workflows?
Lint/test prompt templates
Version control for prompts + models
Run eval tests on staging before deploying
Track performance drift
198. What’s a good microservices structure for a GenAI-powered SaaS platform?
API Gateway
Auth Service
GenAI Service (RAG, summarizer, etc.)
Vector Store Service
Monitoring Service
Each can be deployed, scaled, and tested independently.
199. How do you perform load testing on GenAI endpoints?
Use tools like Locust, k6, or Artillery
Simulate concurrent users
Track response time, token latency, token usage
Monitor GPU/memory usage under load
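A minimal Locust sketch; the /generate endpoint is the hypothetical FastAPI route from question 186:

```python
from locust import HttpUser, task, between

class GenAIUser(HttpUser):
    wait_time = between(1, 3)  # seconds between simulated user actions

    @task
    def generate(self):
        # Hit the text-generation endpoint with a fixed sample prompt.
        self.client.post("/generate", json={"text": "Summarize this week's changelog."})

# Run with: locust -f locustfile.py --host http://localhost:8000
```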
200. How do you maintain versioning for prompts, models, and embeddings in production?
Use Git-based prompt repos
Tag model versions and configs
Hash embeddings or use vector DB namespaces
Include version metadata in logs and user requests