IVQA 151-200

151. What preprocessing steps are needed before training a GenAI model?

  • Tokenization (model-specific)

  • Lowercasing, punctuation handling

  • Text normalization (unicode, contractions)

  • Deduplication of samples

  • Filtering (e.g., remove empty, toxic, or low-quality content)

  • Formatting (JSONL, conversational format)


152. How do you clean noisy textual data for GenAI training?

  • Remove non-text elements (HTML, code artifacts)

  • Normalize whitespace and characters

  • Use heuristics or classifiers to remove off-topic/toxic content

  • Use language detection to remove non-target language content


153. What is tokenization drift and how do you prevent it?

Tokenization drift occurs when the tokenizer used during training differs from the one used at inference. Prevent it by:

  • Using the same tokenizer version across pipeline stages

  • Locking tokenizer vocab

  • Including tokenizer metadata in model checkpoints


154. How do you manage out-of-vocabulary (OOV) tokens?

  • Use subword tokenization (e.g., BPE, WordPiece)

  • Add domain-specific tokens during tokenizer training

  • Avoid character-level OOV fallback unless explicitly needed


155. How would you prepare a custom dataset for fine-tuning GPT?

  • Format as JSONL with {"prompt": "...", "completion": "..."} or chat messages

  • Ensure consistent structure (e.g., user/assistant roles)

  • Clean and deduplicate entries

  • Optionally balance dataset classes/topics
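
A minimal sketch of turning raw instruction/response pairs into chat-style JSONL, assuming the records already live in a Python list (field names follow OpenAI's chat fine-tuning format; the sample records are illustrative):

```python
import json

# Hypothetical raw records; in practice these come from your own data source.
records = [
    {"instruction": "Summarize: The Q3 report shows...", "response": "Revenue grew 12%..."},
    {"instruction": "Translate to French: Good morning", "response": "Bonjour"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        example = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": rec["instruction"].strip()},
                {"role": "assistant", "content": rec["response"].strip()},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```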


156. What is the role of chunking in RAG pipelines?

  • Breaks long documents into manageable, semantically coherent pieces

  • Improves retrieval accuracy

  • Enables context-window fitting

  • Helps minimize hallucination by feeding only relevant chunks
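
A minimal, dependency-free sketch of fixed-size chunking with overlap (chunk size and overlap values are illustrative; production pipelines often split on sentence or section boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a long document becomes overlapping ~500-word pieces.
# chunks = chunk_text(open("report.txt").read())
```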


157. How do you handle multi-language data ingestion for a GenAI use case?

  • Use language detection and tag inputs

  • Apply language-specific cleaning/tokenization

  • Maintain language balance or weight important languages

  • Fine-tune with multilingual embeddings or models like mT5, XLM-R


158. How do you anonymize personally identifiable data before training?

  • Use regex/NLP rules for detecting names, emails, IDs, etc.

  • Replace with placeholders ([NAME], [EMAIL])

  • Apply NER-based de-identification (spaCy, Presidio)

  • Validate outputs for leakage before final training
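
A minimal regex-based sketch for masking emails and phone numbers with placeholders (the patterns are illustrative; libraries such as Presidio or spaCy NER cover names and IDs more reliably):

```python
import re

# Illustrative patterns; real pipelines need broader coverage and validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(anonymize("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```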


159. What are tradeoffs between training on documents vs. dialogue data?

| Aspect | Document Data | Dialogue Data |
| --- | --- | --- |
| Strength | Rich, factual knowledge | Conversational fluency |
| Limitation | Less interactive, static format | Prone to bias or casual tone |
| Use case | Summarization, retrieval | Chatbots, assistants |


160. How do you balance dataset diversity without sacrificing relevance?

  • Use sampling weights for underrepresented domains

  • Apply clustering + filtering to avoid redundancy

  • Combine general + domain-specific corpora

  • Maintain quality using human-in-the-loop filtering


161. How do LLMs handle long context windows, and what are the limits?

  • Use long-context architectures (e.g., Transformer-XL, Longformer) or models with extended windows (e.g., GPT-4 Turbo with 128k tokens)

  • Context limit = maximum token window (e.g., 8k, 32k, 128k)

  • Tradeoff: Larger context → more compute → higher cost


162. What is memory replay in agent frameworks?

Memory replay is the reuse of past dialogues, steps, or retrieved content in agent workflows to:

  • Improve reasoning

  • Avoid repetition

  • Preserve continuity across tasks


163. How does ReAct differ from simple tool-calling agents?

ReAct agents combine reasoning and acting:

  • Think step-by-step (“Thought → Action → Observation”)

  • Update reasoning after each action

  • More robust than one-shot tool invocations


164. What is “episodic memory” in LLMs?

Episodic memory stores structured interaction history (e.g., chat sessions) that the model can recall across sessions. It enables:

  • Persistent context

  • Cross-session continuity

  • Task tracking


165. How do you store and retrieve long-term memory using vector DBs?

  • Embed chunks using an embedding model (e.g., OpenAI, SBERT)

  • Store in vector DBs (e.g., Qdrant, Weaviate)

  • Retrieve top-k relevant memories using cosine similarity during each turn
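
A minimal sketch using sentence-transformers and an in-memory Qdrant instance (the model name, collection name, and sample "memories" are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                   # swap for a real Qdrant server in production

client.recreate_collection(
    collection_name="memories",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store past interactions as "memories".
memories = ["User prefers concise answers", "User is migrating from MySQL to Postgres"]
client.upsert(
    collection_name="memories",
    points=[
        PointStruct(id=i, vector=encoder.encode(m).tolist(), payload={"text": m})
        for i, m in enumerate(memories)
    ],
)

# Retrieve the top-k relevant memories for the current turn.
hits = client.search(
    collection_name="memories",
    query_vector=encoder.encode("Which database is the user moving to?").tolist(),
    limit=2,
)
print([h.payload["text"] for h in hits])
```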


166. How do you deal with context loss in multi-turn conversations?

  • Implement sliding windows or summarization

  • Use conversation history compression

  • Store and re-inject memory via external databases or RAG

  • Keep prompts focused to reduce drift


167. What’s the difference between external and internal memory for agents?

| Type | Description |
| --- | --- |
| Internal | Held within the model context (tokens) |
| External | Retrieved/stored in databases or vector stores |

External memory enables persistent, large-scale memory beyond the token limit.


168. How does Claude 2/3 manage longer context better than GPT-4?

Anthropic has not published full architectural details, but Claude 2/3 is built for very long contexts, enabling:

  • Up to 200k+ token input

  • Higher recall fidelity over long documents

  • Better tracking across lengthy sessions


169. What strategies help chunk documents for better summarization?

  • Semantic chunking using sentence boundaries

  • Overlapping chunks to retain context

  • Use headings or section tags

  • Embed and filter for relevance before summarization


170. How do you evaluate memory relevance in GenAI workflows?

  • Use retrieval score thresholds

  • Manual relevance judgment (human eval)

  • Measure downstream impact (e.g., improved Q&A, fewer hallucinations)

  • Token-efficiency vs. relevance tradeoff


171. Compare Mistral, LLaMA 2, and Falcon models.

| Model | Strengths | Context Limit | License |
| --- | --- | --- | --- |
| Mistral | Fast, open, small model (7B) | 8K–32K tokens (version-dependent) | Apache 2.0 |
| LLaMA 2 | Meta's accurate model family (7B–70B) | 4K tokens | Custom Meta license (commercial use allowed with restrictions) |
| Falcon | Strong pretraining; solid multilingual (incl. Arabic) performance | ~2K tokens | Apache 2.0 |


172. How do you host an open-source LLM using Ollama or Text Generation Web UI?

  • Ollama: CLI-based; runs models like Mistral locally with simple commands

  • Text Generation Web UI: GUI wrapper over Hugging Face models with quantization, streaming, chat support
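
After pulling a model (e.g., `ollama pull mistral`), Ollama exposes a local REST API on port 11434 by default; a minimal sketch of calling it from Python (assumes the Ollama server is running and the model is already pulled):

```python
import requests

# Non-streaming generation request against the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain RAG in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```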


173. What are the benefits of vLLM for serving LLMs in production?

  • Fast inference via PagedAttention

  • Efficient multi-user batching

  • Drop-in support for OpenAI-compatible APIs

  • Better throughput and latency than naive inference
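
A minimal offline-inference sketch with vLLM (the model name is illustrative and requires a GPU); vLLM also ships an OpenAI-compatible HTTP server for production serving:

```python
from vllm import LLM, SamplingParams

# A batch of prompts served in one call; PagedAttention manages the KV cache internally.
prompts = ["Summarize the benefits of RAG.", "What is quantization?"]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # assumed model name
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```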


174. How does Hugging Face Inference Endpoints work for GenAI?

  • Deploy Hugging Face models via managed infrastructure

  • Auto-scales and provides secure REST APIs

  • Ideal for fast deployments without infra setup


175. What is quantization-aware training (QAT)?

QAT trains a model while simulating lower-precision (e.g., int8) arithmetic. It:

  • Improves inference efficiency

  • Reduces size with minimal accuracy loss

  • Typically preserves accuracy better than post-training quantization (PTQ)


176. How do you deploy LLaMA 2 using Hugging Face Transformers?

  • Load model using AutoModelForCausalLM

  • Use transformers, accelerate, or vLLM

  • Deploy via FastAPI, Triton, or HF Endpoints

  • Follow Meta’s license and acceptable-use policy (especially for the 70B variant); see the sketch below
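
A minimal loading-and-generation sketch with transformers (assumes you have accepted Meta's license and have access to the gated Hugging Face repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated repo; access must be granted first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit on a single GPU
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Explain vector databases briefly.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```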


177. What is the role of Triton or ONNX in GenAI inference?

  • Triton Inference Server: Scales multi-model, multi-framework workloads

  • ONNX Runtime: Cross-platform, optimized runtime for running models exported to ONNX format

Both improve deployment performance and portability.


178. How do you benchmark different GenAI models locally?

  • Use Hugging Face’s evaluate or custom scripts

  • Compare:

    • Latency (ms/token)

    • Throughput

    • Memory use

    • Accuracy (BLEU, ROUGE, MMLU)

  • Use same prompt sets and tokenizers
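
A minimal latency/throughput harness sketch (the model list and prompts are illustrative; the same prompt set and generation settings should be reused for every model being compared):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = ["Summarize: ...", "Translate to German: Hello"]   # keep identical across models

def benchmark(model_id: str, max_new_tokens: int = 64) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    total_tokens, start = 0, time.perf_counter()
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        total_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed   # generated tokens per second

for model_id in ["mistralai/Mistral-7B-Instruct-v0.2", "tiiuae/falcon-7b-instruct"]:
    print(model_id, f"{benchmark(model_id):.1f} tok/s")
```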


179. How does OpenRouter help route across multiple LLMs?

  • Acts as a proxy to multiple models (OpenAI, Claude, Cohere)

  • Smart routing based on:

    • Availability

    • Cost

    • Performance

Great for fallback and comparative evaluation.


180. What are the licensing concerns when using open-source LLMs in commercial apps?

  • Check commercial-use clauses (e.g., LLaMA 2 uses a custom Meta license rather than a standard open-source license)

  • Respect weights redistribution restrictions

  • Provide attribution where required

  • Prefer Apache 2.0, MIT, BSD for commercial use


181. How do you use OpenAI’s function calling to interact with APIs?

  • Define function schemas (name, parameters, descriptions)

  • Pass as functions to the API

  • Model returns JSON with function + arguments

  • You run the function and return result to model
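
A minimal sketch with the openai Python SDK (v1+) using the tools interface; the get_weather schema is purely illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical function
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",                                  # any tool-capable model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)           # e.g. {"city": "Paris"}
# Run your real function with `args`, then return the result to the model as a "tool" message.
```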


182. Build a Python script to call GPT-4 for summarizing a PDF.
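
A minimal sketch assuming pypdf for text extraction and the openai v1 SDK (the file path is a placeholder); very long PDFs would need chunked, map-reduce style summarization instead of the naive truncation shown here:

```python
from openai import OpenAI
from pypdf import PdfReader

def summarize_pdf(path: str, model: str = "gpt-4") -> str:
    # Extract raw text page by page.
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize documents in 5 bullet points."},
            {"role": "user", "content": text[:12000]},  # naive truncation to stay in context
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize_pdf("report.pdf"))
```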


183. Write a prompt template to extract structured data from unstructured reviews.

"Extract the following fields: Product Name, Rating (out of 5), Complaint Summary, Suggested Improvement. Review: {review_text}"


184. How would you use GenAI to create SQL queries from English prompts?

  • Few-shot prompt: Include examples of English-to-SQL

  • Tools like Text2SQL, OpenAI with function calling

  • Add table schema in prompt to improve accuracy
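
A minimal prompt-construction sketch with one few-shot example and the table schema inlined (the schema and example query are illustrative):

```python
SCHEMA = "Table orders(id INT, customer_id INT, total DECIMAL, created_at DATE)"

FEW_SHOT = """Q: Total revenue last month?
SQL: SELECT SUM(total) FROM orders
     WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
       AND created_at < DATE_TRUNC('month', CURRENT_DATE);"""

def build_prompt(question: str) -> str:
    return (
        f"You translate English questions into SQL.\n"
        f"Schema: {SCHEMA}\n\n"
        f"{FEW_SHOT}\n\n"
        f"Q: {question}\nSQL:"
    )

print(build_prompt("How many orders did customer 42 place this year?"))
```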


185. How do you validate user inputs before passing them to an LLM?

  • Sanitize (remove harmful content)

  • Use regex or schema validation

  • Enforce length and structure limits

  • Strip system prompt injection attempts
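
A minimal validation sketch using Pydantic (v2) for schema and length limits plus a simple pattern check for obvious injection phrases (the patterns are illustrative, not a complete defence):

```python
import re
from pydantic import BaseModel, field_validator

# Illustrative injection markers; real filters are broader and regularly updated.
INJECTION_PATTERNS = re.compile(
    r"(ignore (all )?previous instructions|you are now|system prompt)", re.IGNORECASE
)

class UserQuery(BaseModel):
    text: str

    @field_validator("text")
    @classmethod
    def check_text(cls, v: str) -> str:
        v = v.strip()
        if not 1 <= len(v) <= 2000:
            raise ValueError("Input must be between 1 and 2000 characters")
        if INJECTION_PATTERNS.search(v):
            raise ValueError("Input looks like a prompt-injection attempt")
        return v

query = UserQuery(text="Summarize my last three invoices.")   # raises ValidationError if invalid
```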


186. Build a FastAPI endpoint that takes user input and calls a GenAI model.
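
A minimal sketch combining FastAPI, Pydantic, and the openai v1 SDK (the model name is a placeholder; the input validation from the previous question would slot in here):

```python
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Query(BaseModel):
    text: str

@app.post("/generate")
def generate(query: Query) -> dict:
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",                      # placeholder model name
            messages=[{"role": "user", "content": query.text}],
        )
    except Exception as exc:                          # surface upstream failures as 502
        raise HTTPException(status_code=502, detail=str(exc))
    return {"answer": response.choices[0].message.content}

# Run with: uvicorn main:app --reload
```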


187. What’s the best way to batch prompts for OpenAI API to reduce cost?

  • Group several small items into a single request rather than making one call per item

  • Combine related queries

  • Use OpenAI's Batch API for large asynchronous jobs at reduced cost

  • Apply caching with hash-based keys


188. How can you use GenAI to classify and route customer tickets?

  • Prompt: "Classify this ticket into Billing, Technical, Feedback, Other"

  • Output = label → map to internal routing logic

  • Optionally use OpenAI function calling or fine-tuned classifier


189. Implement a RAG flow using LangChain and Qdrant.

  • Embed documents with OpenAI

  • Store in Qdrant

  • On query, retrieve top-k chunks

  • Construct context → send to OpenAI for the answer

  • Wire it together with LangChain's RetrievalQA chain (see the sketch below)
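
A minimal sketch using the langchain-openai and langchain-community packages (exact import paths vary by LangChain version; the documents and collection name are illustrative):

```python
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document

docs = [
    Document(page_content="Qdrant is an open-source vector database."),
    Document(page_content="RAG retrieves relevant chunks before generation."),
]

# Embed documents and store them in an in-memory Qdrant collection.
vectorstore = Qdrant.from_documents(
    docs,
    OpenAIEmbeddings(),
    location=":memory:",              # use a real Qdrant URL in production
    collection_name="rag_demo",
)

# Retrieve top-k chunks per query and let the LLM answer over them.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
print(qa.invoke({"query": "What does RAG do before generation?"})["result"])
```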


190. How do you cache frequent queries in a GenAI-powered web app?

  • Store hash of input prompt

  • Cache response in Redis or Postgres

  • Set TTL or LRU eviction

  • Avoid redundant token costs on repeated inputs
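
A minimal Redis sketch with hash-based keys and a TTL (the prompt normalization is simplistic and call_llm is a hypothetical placeholder for your model call):

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 24 * 3600

def cached_generate(prompt: str) -> str:
    key = "llm:" + hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return cached                      # cache hit: no token cost
    answer = call_llm(prompt)              # hypothetical LLM call
    r.setex(key, TTL_SECONDS, answer)      # expire stale entries automatically
    return answer
```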


191. What architecture would you recommend for a GenAI-powered document search system?

  • User → API Gateway → Retriever (Qdrant) + Reader (LLM)

  • Preprocess → Chunk → Embed → Store

  • Serve via FastAPI/Flask + monitoring stack (Prometheus + Grafana)


192. How do you secure an LLM API used in internal enterprise tools?

  • Use OAuth or API keys

  • Apply RBAC policies

  • Encrypt logs and inputs

  • Log access for audits


193. What are the tradeoffs between using managed LLMs and self-hosting?

| Managed | Self-hosted |
| --- | --- |
| Fast, scalable | Full control |
| Less privacy | Infra overhead |
| Low maintenance | Customization possible |


194. How do you enforce audit logs and traceability in a GenAI pipeline?

  • Log input/output, timestamps, user ID

  • Store function/tool calls

  • Maintain model version history

  • Comply with SOC2/GDPR audit trails


195. How would you scale an LLM-based email summarizer for 1M users?

  • Use queue-based system (e.g., Celery, Kafka)

  • Batch inputs

  • Use vLLM or quantized models

  • Parallelize with GPU workers

  • Cache past summaries


196. What’s the role of message queues (e.g., Kafka, RabbitMQ) in GenAI backends?

  • Handle async tasks (summarization, classification)

  • Enable horizontal scaling

  • Buffer bursty requests

  • Decouple frontend and LLM processing


197. How do you integrate GenAI with CI/CD workflows?

  • Lint/test prompt templates

  • Version control for prompts + models

  • Run eval tests on staging before deploying

  • Track performance drift


198. What’s a good microservices structure for a GenAI-powered SaaS platform?

  • API Gateway

  • Auth Service

  • GenAI Service (RAG, summarizer, etc.)

  • Vector Store Service

  • Monitoring Service

Each service can be deployed, scaled, and tested independently.


199. How do you perform load testing on GenAI endpoints?

  • Use tools like Locust, k6, or Artillery

  • Simulate concurrent users

  • Track response time, token latency, token usage

  • Monitor GPU/memory usage under load
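
A minimal Locust sketch hitting a hypothetical /summarize endpoint; run with `locust -f locustfile.py --host http://localhost:8000`:

```python
from locust import HttpUser, between, task

class GenAIUser(HttpUser):
    wait_time = between(1, 3)   # seconds of think time between requests

    @task
    def summarize(self):
        # Hypothetical endpoint; the payload mirrors what real users would send.
        self.client.post("/summarize", json={"text": "Quarterly report text..."})
```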


200. How do you maintain versioning for prompts, models, and embeddings in production?

  • Use Git-based prompt repos

  • Tag model versions and configs

  • Hash embeddings or use vector DB namespaces

  • Include version metadata in logs and user requests

