IVQA 201-250
201. How do you horizontally scale a GenAI service?
Deploy multiple stateless inference workers behind a load balancer
Use container orchestration (e.g., Kubernetes)
Separate frontend/API, vector DB, and LLM layers
Scale RAG independently of model inference
202. What are GPU memory optimization strategies for LLM inference?
Use quantized models (e.g., 4-bit, 8-bit); see the loading sketch after this list
Enable KV cache reuse
Apply tensor parallelism and offloading
Serve using frameworks like vLLM or DeepSpeed-Inference
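A minimal 4-bit loading sketch with Hugging Face transformers and bitsandbytes; the model id is illustrative, not prescribed by this answer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",           # illustrative model id
    quantization_config=quant_config,
    device_map="auto",                     # let Accelerate place layers
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```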
203. How would you handle rate limits for large-scale OpenAI API usage?
Implement retries with exponential backoff on rate-limit errors (sketched below)
Rotate API keys or use rate-aware load balancing
Queue and batch user requests
Monitor usage and forecast spikes
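A minimal jittered-exponential-backoff sketch; `call` stands in for any API request, and in practice you would catch the SDK's specific rate-limit error rather than bare `Exception`:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on failures with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice: the SDK's RateLimitError
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```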
204. What’s the difference between multi-GPU and multi-node LLM serving?
Multi-GPU (same machine): Faster, lower latency, shared memory
Multi-node: Enables scaling across machines, but with communication overhead (e.g., using NCCL, DeepSpeed)
205. How can you use model sharding in production environments?
Split model weights across GPUs or nodes (tensor/model parallelism)
Use libraries like Megatron-LM, DeepSpeed, or Hugging Face Accelerate
Requires sync between shards during inference
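A minimal sharding sketch with Hugging Face Accelerate's `device_map`, which splits layers across available GPUs (spilling to CPU if capped); the model id and memory caps are illustrative:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # illustrative model id
    device_map="auto",            # Accelerate assigns layers to devices
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "60GiB"},  # optional caps
)
# Accelerate inserts the cross-device transfers that keep shards in sync
# during the forward pass.
```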
206. What is speculative decoding and how does it improve throughput?
Run a small model to draft predictions
Use a large model to validate/correct them
Reduces latency: the large model validates several draft tokens in one forward pass instead of generating each token itself (toy loop below)
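A toy greedy variant of the draft-and-verify loop, assuming two Hugging Face causal LMs that share a tokenizer; production systems use rejection sampling to preserve the target model's sampling distribution:

```python
import torch

@torch.no_grad()
def speculative_step(draft, target, input_ids, k=4):
    # 1) Draft k tokens greedily with the cheap model.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Score every drafted position with the target model in ONE forward pass.
    target_logits = target(draft_ids).logits
    preds = target_logits[:, input_ids.shape[1] - 1 : -1, :].argmax(-1)  # (1, k)
    proposed = draft_ids[:, input_ids.shape[1]:]                          # (1, k)

    # 3) Accept the longest prefix where draft and target agree, then take
    #    one guaranteed token from the target itself.
    matches = (preds[0] == proposed[0]).long()
    n_accept = int(matches.cumprod(0).sum())
    if n_accept == k:
        next_tok = target_logits[:, -1, :].argmax(-1, keepdim=True)
    else:
        next_tok = preds[:, n_accept : n_accept + 1]
    return torch.cat([input_ids, proposed[:, :n_accept], next_tok], dim=-1)
```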
207. What are the main bottlenecks in high-load GenAI applications?
Token throughput limits
GPU memory and concurrency constraints
Retrieval latency in RAG
Inefficient batching or prompt length explosion
208. How do you manage logs and observability in a GenAI backend?
Centralized logging (e.g., with Loki, ELK)
Monitor token usage, latency, error rates
Trace prompt → model → response paths
Use tools like Prometheus, Grafana, Sentry
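A minimal request-latency logging sketch, assuming a FastAPI backend; the logged fields are illustrative and would feed Loki/ELK:

```python
import logging
import time

from fastapi import FastAPI, Request

app = FastAPI()
log = logging.getLogger("genai")

@app.middleware("http")
async def observe(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    # One structured line per request; ship these to Loki/ELK and chart
    # latency and error rates in Grafana.
    log.info("path=%s status=%s latency_ms=%.1f",
             request.url.path, response.status_code,
             (time.perf_counter() - start) * 1000)
    return response
```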
209. How do you build a cost-monitoring dashboard for GenAI endpoints?
Track tokens used per request
Multiply by pricing model (e.g., $/1k tokens)
Visualize by user/session/feature
Export to dashboards (e.g., Metabase, Grafana)
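A minimal per-request cost sketch; the prices are placeholders, not current vendor rates:

```python
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}  # $/1k tokens, illustrative

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
         + (completion_tokens / 1000) * PRICE_PER_1K["completion"]

# Emit one row per request for the dashboard to aggregate by user/feature:
row = {"user": "u123", "feature": "chat", "cost_usd": request_cost(850, 210)}
```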
210. What is KV cache reuse and how does it optimize performance?
The KV cache stores the attention keys and values computed for earlier tokens
Speeds up generation, especially for long contexts and multi-turn prompts
Avoids recomputing context for every new token
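A minimal reuse sketch in Hugging Face transformers: the first call builds the cache, later calls feed only the new token plus `past_key_values` (GPT-2 here purely for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, use_cache=True)          # full pass; cache is built
    past = out.past_key_values

    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    # Next step: pass ONLY the new token plus the cache, so attention over
    # the old context is not recomputed.
    out = model(next_id, past_key_values=past, use_cache=True)
```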
211. What are the main types of agents in LangChain or AutoGen?
Reactive agents: Act step-by-step without full planning
Planner-Executor: First plan, then execute actions
Tool-using agents: Call external functions/APIs
Collaborative agents (AutoGen): Multiple roles working together
212. How do agents handle tool selection dynamically?
Use prompt-based reasoning to decide which tool fits
Maintain tool metadata (name, description, params)
Use function calling or LangChain’s Tool interface
Feedback loop: tool → output → next decision
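A minimal prompt-based selection sketch: tool metadata is rendered into the prompt and the model names its pick. `llm` is a placeholder for any chat-completion call; the tools are illustrative:

```python
TOOLS = {
    "search":     {"desc": "Web search. Args: query (str)"},
    "calculator": {"desc": "Evaluate arithmetic. Args: expression (str)"},
}

def pick_tool(question: str, llm) -> str:
    menu = "\n".join(f"- {name}: {meta['desc']}" for name, meta in TOOLS.items())
    prompt = (f"Tools available:\n{menu}\n\n"
              f"Question: {question}\n"
              "Reply with the single best tool name.")
    choice = llm(prompt).strip()
    return choice if choice in TOOLS else "search"  # fall back on mismatch
```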
213. What’s the difference between planner-executor and reactive agents?
Reactive: no upfront plan; high flexibility; may repeat steps
Planner-Executor: plans first, then executes; medium flexibility; structured execution
214. How would you design a memory-aware GenAI assistant using tools?
Store episodic memory in a vector store (e.g., Qdrant)
Retrieve relevant memory on each turn
Use tools for summarization, calendar access, or search
Maintain memory trace to reduce context loss
215. How can agents collaborate or hand off tasks to each other?
Use AutoGen’s multi-agent framework
Define protocols (e.g., User Proxy → Coder → Critic)
Pass message history or context
Define exit criteria or delegation logic
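A minimal two-agent handoff sketch with pyautogen; the `llm_config` is a placeholder and the exact API varies across autogen versions:

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}  # placeholder

coder = autogen.AssistantAgent(name="Coder", llm_config=llm_config)
user = autogen.UserProxyAgent(name="UserProxy", human_input_mode="NEVER",
                              code_execution_config=False)

# UserProxy delegates to Coder; the message history travels with the chat.
user.initiate_chat(coder, message="Write a function that reverses a string.")
```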
216. How do you sandbox tool-executing agents for safety?
Run in containerized environments with limited permissions
Whitelist tool types and arguments
Monitor execution time and outputs
Apply runtime guards or function wrappers
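A minimal runtime-guard sketch: whitelist the tool, bound argument size, and enforce a wall-clock timeout. Container-level isolation (read-only filesystem, no network) would sit underneath this in production; the tool names are illustrative:

```python
import concurrent.futures

ALLOWED_TOOLS = {"summarize", "db_query"}  # illustrative whitelist

def guarded_call(name, fn, args, timeout_s=10, max_arg_len=4096):
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not whitelisted")
    if any(len(str(v)) > max_arg_len for v in args.values()):
        raise ValueError("argument too large")
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # result() raises TimeoutError; the worker thread itself keeps
        # running, which is why a container boundary is the real kill switch.
        return pool.submit(fn, **args).result(timeout=timeout_s)
```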
217. How do you create an autonomous agent for report generation?
Use planner to define tasks: data fetch → summarize → generate slides
Schedule runs (daily/weekly)
Log outputs and allow review
Add tools like PDFGen, DBQuery, summarizer
218. What role does the scratchpad play in reasoning agents?
Stores intermediate thoughts, decisions, tool outputs
Helps the agent “think out loud”
Enables better transparency and multi-step reasoning
Used in ReAct and LangChain chains
219. How do agents balance exploration vs. exploitation in decision-making?
Score and rank tool options by confidence or past success
Use randomness or UCB-style exploration bonuses (see the UCB1 sketch below)
Adjust based on feedback or failure patterns
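A minimal UCB1 sketch: exploit the tool with the best average reward, but add an exploration bonus that shrinks as a tool accumulates trials:

```python
import math

def ucb1_pick(stats: dict) -> str:
    """stats maps tool name -> (times_used, total_reward)."""
    total = sum(n for n, _ in stats.values())
    def score(item):
        n, reward = item[1]
        if n == 0:
            return float("inf")  # always try unused tools first
        return reward / n + math.sqrt(2 * math.log(total) / n)
    return max(stats.items(), key=score)[0]

# e.g. ucb1_pick({"search": (10, 7.0), "calculator": (2, 1.9)})
# -> "calculator": higher average reward AND under-explored
```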
220. How can GenAI agents perform complex multi-step workflows?
Chain sub-agents via plans or FSMs (Finite State Machines)
Inject memory/state between steps
Use LangGraph or AutoGen for dynamic state transitions
Recover from failure by reasoning + retry
221. How do you personalize a GenAI chatbot for individual users?
Store user profile, preferences, and prior chats
Retrieve personalized context per query
Use prompt templates that adapt dynamically
Apply embeddings to match interests or tone
222. What’s the role of user embeddings in personalizing responses?
Represent user interests, behavior, tone as vectors
Used for retrieval, ranking, or conditioning LLM outputs
Enable collaborative filtering for suggestions
223. How do you maintain personalization across sessions securely?
Store tokens or IDs in secure databases
Use hashed identifiers
Encrypt memory and limit access
Allow users to reset memory or opt out
224. What are privacy challenges in personalized GenAI apps?
Storing sensitive data (e.g., preferences, history)
Compliance (e.g., GDPR, CCPA)
Risks of model leaks if fine-tuned on PII
Need for explainability and opt-out options
225. How do you use retrieval to provide contextually aware responses?
Index user-specific notes, docs, history
Embed and store in a vector DB
On query, retrieve top-k chunks
Inject into prompt to provide relevant answers
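A minimal top-k sketch over plain numpy arrays (a vector DB does the same at scale); the embedding model behind the vectors is assumed, not shown:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list, k: int = 3):
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    best = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in best]

# chunks = top_k(embed(question), doc_matrix, docs)
# prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"
```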
226. How do you fine-tune an LLM on user feedback?
Collect structured feedback (thumbs up/down + reasoning)
Aggregate and format into training pairs
Use LoRA or SFT pipelines (LoRA setup sketched below)
Iterate in evaluation cycles
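A minimal LoRA setup sketch with Hugging Face peft; the base model, `target_modules`, and hyperparameters are illustrative and depend on the architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base

lora = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only adapter weights train
# Feed it (prompt, preferred_response) pairs built from the feedback data,
# e.g. via transformers.Trainer or trl's SFTTrainer.
```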
227. What is reinforcement learning with user signals (RLUS)?
Use implicit signals (clicks, retention, edits) as rewards
Apply them in a bandit or RLHF framework
Continuously fine-tune the model based on real-world interaction
228. How can GenAI recommend content adaptively in real time?
Embed both users and content
Use nearest neighbor matching (e.g., cosine similarity)
Rerank with context relevance
Adjust recommendations based on real-time interactions
229. What’s the tradeoff between personalization and generalization?
Personalization: high engagement, but risks overfitting to individual users and scales less easily
Generalization: better for new users, robust performance, easier to deploy
230. How would you build a personalized study tutor using GenAI?
Track student progress (topics, mistakes, goals)
Generate adaptive quizzes and hints
Retrieve previous Q&A for reference
Use conversational fine-tuning + retrieval for feedback
231. What are the principles behind Constitutional AI?
Define ethical principles as rules (e.g., “Don’t cause harm”)
Use LLM to self-critique outputs during training
Replace or augment RLHF with model-guided alignment
Introduced by Anthropic
232. How do you ensure alignment of LLMs with company values?
Embed values in system prompts and outputs
Fine-tune with curated, value-aligned datasets
Apply moderation and feedback loops
Audit outputs for tone, accuracy, fairness
233. How can LLMs be aligned post-deployment?
Use real-time filtering (e.g., Moderation API)
Apply reward models for reranking
Collect user feedback for fine-tuning
Isolate high-risk domains with stricter controls
234. What is the difference between supervised fine-tuning and RLHF?
SFT: learn from labeled examples
RLHF: reward preferred behavior via a feedback loop
235. How do you detect harmful or biased outputs in real-time?
Use moderation models (toxicity, hate speech, bias)
Pattern matching (regex)
Statistical monitoring (e.g., outlier detection)
Route flagged outputs to human review
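A minimal routing sketch: cheap regex checks first, then a classifier score, with flagged outputs diverted to review. `score_toxicity` is a placeholder for any moderation model:

```python
import re

BLOCK_PATTERNS = [re.compile(p, re.I) for p in (r"\bssn\b",)]  # examples only

def route_output(text: str, score_toxicity, threshold: float = 0.8) -> str:
    if any(p.search(text) for p in BLOCK_PATTERNS):
        return "human_review"              # hard pattern hit
    if score_toxicity(text) >= threshold:  # model-based check
        return "human_review"
    return "deliver"
```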
236. What are adversarial prompts, and how do you defend against them?
Prompts designed to bypass guardrails (e.g., "ignore instructions")
Defenses: prompt hardening, sandboxed tool execution, input validation, behavior simulation during training
237. How can GenAI support explainability in its outputs?
Use step-by-step reasoning prompts (e.g., Chain-of-Thought)
Highlight source documents in RAG
Add metadata (retrieved docs, scoring)
Maintain scratchpad logs
238. What’s the role of safety layers and filters like OpenAI’s Moderation API?
Post-process outputs to detect content violations
Block or reroute unsafe completions
Enforce TOS compliance
Allow audit and oversight
239. How do you manage trade-offs between safety and creativity in generation?
Adjust temperature and decoding strategies
Use persona-specific safety rules
Accept some variance in low-risk domains
Separate exploratory vs. production modes
240. How do companies like Anthropic approach LLM alignment?
Use Constitutional AI for self-supervision
Incorporate transparency and model critique
Test for harmlessness, helpfulness, honesty
Emphasize red teaming and internal safety evaluations
241. How do you compress models for edge deployment?
Use quantization, distillation, pruning
Apply LoRA for efficient adaptation
Offload layers to disk if memory-bound
242. Compare TinyLLaMA, DistilGPT, and other small LLMs.
TinyLLaMA (~1B params): edge deployment, fast responses
DistilGPT-2 (~82M params): chatbots, basic tasks
Phi-2 (~2.7B params): reasoning, balanced capability
243. How do you handle low-bandwidth GenAI applications?
Compress payloads (gzip, protobuf)
Cache frequent responses
Minimize prompt length
Serve smaller distilled models
244. What are on-device privacy benefits for GenAI assistants?
No data leaves the device
Reduces regulatory risk (GDPR, HIPAA)
No dependency on network or cloud providers
Improves response latency
245. How do you optimize inference for ARM architectures?
Use ONNX with ARM-specific backends
Quantize models with TVM, TensorRT, or GGML
Avoid memory-intensive attention variants
246. What’s the best model quantization method for small devices?
4-bit GPTQ or AWQ for smallest footprint
int8 for speed-accuracy tradeoff
ZeroQuant for efficient post-training quantization of large transformers
247. How do you deploy a GenAI chatbot on a Raspberry Pi?
Use quantized TinyLLaMA or DistilGPT
Run via llama.cpp with GGUF/GGML quantized weights (sketch below)
Use FastAPI + SQLite for storage
Limit context to reduce load
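A minimal serving sketch with llama-cpp-python, which runs GGUF models efficiently on ARM CPUs; the model path and parameters are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./tinyllama-q4.gguf",  # 4-bit GGUF file (placeholder path)
    n_ctx=512,                         # small context window to limit RAM
    n_threads=4,                       # match the Pi's core count
)

out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```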
248. What are trade-offs when using 4-bit vs. 8-bit quantized models?
4-bit: smaller but slower; greater memory savings
8-bit: faster and more accurate; less compression
249. What are low-resource strategies for building domain-specific GenAI apps?
Fine-tune small models with LoRA
Limit scope of tasks
Use retrieval over generation when possible
Precompute responses for FAQs
250. How do you use LoRA for quick personalization in resource-constrained environments?
Attach LoRA adapters to a base model
Train with few-shot data
Avoid full model tuning
Works well on edge with quantized base models
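A minimal attach-at-inference sketch with peft: load the shared base once, then swap per-user adapters; the paths and base model are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")           # shared base
model = PeftModel.from_pretrained(base, "./user123-adapter")  # per-user LoRA
# The adapter adds only a few MB per user on top of the (quantized) base.
```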