IVQ 151-200
151–155: Backpropagation, Gradients & Attention
Q151. How does the chain rule apply to backpropagation in transformer-based LLMs? A151: The chain rule propagates gradients backward layer by layer through the self-attention, feed-forward, and normalization sublayers, so each parameter is updated in proportion to its contribution to the loss.
Q152. How does matrix multiplication influence attention score computation in LLMs? A152: Queries are multiplied by the transposed keys (and scaled by √d_k) to compute similarity scores; a single matrix multiplication yields all pairwise query-key scores efficiently.
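A minimal NumPy sketch of this score computation; the shapes and random values are illustrative, not taken from any particular model.

```python
import numpy as np

def attention_scores(Q, K):
    """Scaled dot-product scores: one matmul yields all pairwise query-key similarities."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_queries, n_keys)
    # Softmax over keys turns scores into attention weights per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 3 tokens, head dimension 4.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention_scores(Q, K).round(3))                 # each row sums to 1
```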
Q153. What role does the softmax derivative play in updating weights during LLM training? A153: The softmax Jacobian couples the gradients of all competing tokens: raising one token's probability lowers the others', which shapes how much each output influences parameter updates.
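For a probability vector p = softmax(z), the Jacobian is diag(p) − p pᵀ. A small sketch with made-up logits shows how the gradient of one logit depends on all the others.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(p):
    """d p_i / d z_j = p_i * (delta_ij - p_j): raising one logit lowers the rest."""
    return np.diag(p) - np.outer(p, p)

p = softmax(np.array([2.0, 1.0, 0.1]))
J = softmax_jacobian(p)
# Backprop multiplies the upstream gradient by this Jacobian,
# coupling every token's update to the competing tokens' probabilities.
print(J.round(3))
```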
Q154. How does gradient clipping help stabilize training in large-scale language models? A154: It caps exploding gradients, preventing large updates that destabilize training, especially in deep or long-context transformers.
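A sketch of clipping by global norm in NumPy; frameworks such as PyTorch ship an equivalent utility, and the threshold below is an arbitrary example value.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their combined L2 norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, clipped)                               # direction preserved, magnitude capped
```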
Q155. How do partial derivatives help fine-tune each layer in multi-head attention? A155: They measure each parameter’s local effect on the loss, guiding precise weight updates in query/key/value projections and output mixing.
156–160: Transformer Core Building Blocks
Q156. How is positional encoding applied in transformer architectures? A156: It’s added to token embeddings before feeding into self-attention — preserving token order info through the model.
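A sketch of the sinusoidal scheme from the original Transformer paper; learned and rotary variants also exist, and the sequence length and model width below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.default_rng(0).normal(size=(8, 16))   # 8 tokens, d_model = 16
x = embeddings + sinusoidal_positional_encoding(8, 16)        # added before self-attention
```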
Q157. How does layer normalization affect the output in transformer blocks? A157: It stabilizes activations by normalizing mean and variance, aiding convergence and training consistency.
Q158. How are query, key, and value matrices derived from input embeddings? A158: The embedding is linearly projected with separate learned weight matrices to produce Q, K, and V for each head.
Q159. How is multi-head attention different from single-head attention in transformers? A159: Multi-head attention splits embedding space to attend to diverse patterns in parallel — single-head lacks this diversity.
Q160. How are residual connections used in transformer layers to preserve gradient flow? A160: They skip-connect inputs to outputs, preventing vanishing gradients and allowing deeper stacking.
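A compact NumPy sketch tying Q157–Q160 together: Q/K/V projections from the input embedding, a multi-head split, layer normalization, and a residual connection. It is deliberately simplified (random weights, no feed-forward sublayer, no learned gain/bias in the norm).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)        # learnable gain/bias omitted for brevity

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Separate learned projections, reshaped into (heads, seq, d_head).
    def split(W):
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head pairwise scores
    out = softmax(scores) @ V                              # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq, d_model)     # concatenate heads
    return out @ Wo                                        # output mixing

rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
# Pre-norm sublayer with a residual connection: gradients can flow through the skip path.
y = x + multi_head_attention(layer_norm(x), Wq, Wk, Wv, Wo, n_heads)
```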
161–165: Multimodal LLMs
Q161. How does GPT-4o handle real-time cross-modal inputs during inference? A161: It fuses text and vision streams with unified token spaces and routing to handle modalities jointly in a single pass.
Q162. How does Claude 3 integrate vision and language for complex reasoning tasks? A162: It uses shared embeddings and cross-attention layers that align visual regions with textual tokens for grounded reasoning.
Q163. How does Google DeepMind improve token alignment in multimodal training? A163: By aligning visual patches and words in joint embedding spaces, enhancing cross-modal consistency.
Q164. How do large models fuse audio, text, and images during joint representation learning? A164: They embed each modality into a common latent space, then cross-attend to align signals for unified outputs.
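A generic sketch of the fusion pattern described in Q161–Q164: each modality is projected into a shared width and concatenated into one token stream that attention can mix. This is an illustrative pattern only, not the actual architecture of GPT-4o, Claude 3, or Gemini.

```python
import numpy as np

rng = np.random.default_rng(0)
d_shared = 32

# Illustrative modality encodings (e.g., text tokens, image patches, audio frames).
text  = rng.normal(size=(10, 48))   # 10 text tokens, width 48
image = rng.normal(size=(16, 64))   # 16 image patches, width 64
audio = rng.normal(size=(20, 24))   # 20 audio frames, width 24

def project(x, d_out):
    """Learned linear projection into the shared latent space (random here)."""
    W = rng.normal(scale=0.1, size=(x.shape[1], d_out))
    return x @ W

# Concatenate along the sequence axis: attention over this joint sequence lets every
# text token attend to image patches and audio frames, and vice versa.
joint = np.concatenate([project(m, d_shared) for m in (text, image, audio)], axis=0)
print(joint.shape)   # (46, 32): one unified token stream
```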
Q165. How does OpenAI ensure modality balance during multimodal LLM pretraining? A165: It carefully curates datasets and training schedules so no single modality dominates the shared embedding space.
166–170: Foundation Model Variants
Q166. What are the main categories of generative AI models? A166: Text (GPT), image generation (Stable Diffusion), audio/speech generation, multimodal (Gemini, GPT-4o), and domain-specific models.
Q167. How do encoder-only, decoder-only, and encoder-decoder models differ? A167: Encoder-only models (BERT) build bidirectional representations for understanding; decoder-only models (GPT) generate text autoregressively; encoder-decoder models (T5) map an input sequence to an output sequence.
Q168. What types of pretraining objectives are used in foundation models? A168: Masked token prediction, next-token prediction, span corruption, contrastive loss, multimodal alignment.
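A toy sketch of how training targets are built for the two most common objectives, next-token and masked-token prediction; the token IDs are made up, and span corruption, contrastive, and multimodal objectives are not shown.

```python
import numpy as np

tokens = np.array([11, 42, 7, 99, 3])            # made-up token IDs for one sequence

# Next-token prediction (decoder-only, e.g. GPT): input shifted against the target.
causal_inputs, causal_targets = tokens[:-1], tokens[1:]

# Masked-token prediction (encoder-only, e.g. BERT): hide some positions, predict them.
MASK_ID = 0
mask = np.array([False, True, False, False, True])
masked_inputs = np.where(mask, MASK_ID, tokens)
masked_targets = tokens[mask]                    # loss computed only at masked positions

print(causal_inputs, causal_targets)             # [11 42  7 99] [42  7 99  3]
print(masked_inputs, masked_targets)             # [11  0  7 99  0] [42  3]
```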
Q169. How do vision-language models differ from pure language foundation models? A169: They learn to align text with visual representations, needing cross-modal training data and alignment losses.
Q170. What distinguishes general-purpose foundation models from domain-specific ones? A170: General-purpose handle broad tasks; domain-specific (e.g., legal, biomedical) specialize with tailored pretraining/fine-tuning.
171–175: PEFT & Adaptation
Q171. How does LoRA improve parameter efficiency during fine-tuning? A171: It freezes the base weights and trains small low-rank update matrices injected alongside them, so only a tiny fraction of parameters changes; lightweight and modular.
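A minimal sketch of the low-rank update: the frozen weight W is augmented with A @ B, and only A and B (a small fraction of the parameters) would receive gradients. The dimensions, rank, and scaling factor below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

W = rng.normal(scale=0.02, size=(d_in, d_out))      # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_in, rank))       # trainable low-rank factor
B = np.zeros((rank, d_out))                         # zero init: no change at step 0
alpha = 16                                          # scaling hyperparameter

def lora_forward(x):
    """Output = x W + (alpha / rank) * x A B; only A and B would be trained."""
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.normal(size=(4, d_in))
print(lora_forward(x).shape)                        # (4, 512)
# Trainable params: d_in*rank + rank*d_out = 8192 vs. 262144 in W (about 3%).
```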
Q172. How does QLoRA maintain performance while reducing memory usage? A172: It quantizes base weights and combines LoRA adapters, fine-tuning in low precision while preserving accuracy.
Q173. How does adapter tuning preserve pre-trained knowledge in LLMs? A173: It adds small modules without changing base weights — isolating task-specific tweaks.
Q174. How does prefix tuning enable task adaptation with minimal updates? A174: It prepends trainable continuous vectors to the attention keys and values at each layer (rather than editing any core weights), conditioning the frozen model’s generation on the task.
Q175. How do PEFT techniques balance generalization and specialization in LLMs? A175: They preserve broad knowledge while layering small task-specific adaptations cheaply.
176–180: Retrieval-Augmented Generation
Q176. How does dense retrieval differ from sparse retrieval in RAG pipelines? A176: Dense uses vector embeddings; sparse uses keyword matches. Dense finds semantic similarity beyond exact terms.
Q177. What role do embeddings play in document retrieval for RAG? A177: They represent queries/passages in a shared space, enabling similarity search for relevant chunks.
Q178. How is retrieved context integrated into the prompt for generation? A178: Retrieved snippets are concatenated with the user query, expanding the LLM’s context window.
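A minimal sketch of dense retrieval plus prompt assembly; embed() is a stand-in for a real sentence-embedding model and the documents are made up.

```python
import numpy as np

documents = [
    "The Eiffel Tower is 330 metres tall.",
    "Python was created by Guido van Rossum.",
    "The Great Wall of China is not visible from orbit with the naked eye.",
]

def embed(text, dim=64):
    """Stand-in for an embedding model: deterministic pseudo-random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    # With a real embedding model, semantically related passages score highest.
    scores = doc_vectors @ embed(query)              # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)
```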
Q179. What are the advantages of RAG over closed-book LLMs? A179: Freshness, factual grounding, and larger knowledge base without retraining the core model.
Q180. How does RAG ensure relevance and factual accuracy during response generation? A180: Retrieval narrows the search space; the generator then grounds output in retrieved text.
181–185: Mixture of Experts
Q181. How does expert routing work in a Mixture of Experts architecture? A181: A router picks which “experts” (sub-networks) handle each input — activating only a few per token.
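A sketch of top-k gating: a small router scores every expert for each token, and only the best-scoring experts run. The experts here are plain linear layers for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 4, 8, 2

router_W = rng.normal(scale=0.1, size=(d_model, n_experts))
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    logits = token @ router_W                       # router score for every expert
    chosen = np.argsort(-logits)[:top_k]            # activate only the top-k experts
    gates = softmax(logits[chosen])                 # renormalized gate weights
    # Weighted sum of the chosen experts' outputs; the other experts do no work.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                       # (8,), using 2 of 4 experts
```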
Q182. What are the trade-offs between sparse and dense MoE models? A182: Sparse MoE activates only a few experts per token, which is compute-efficient but harder to train (load balancing, routing instability); a dense model, or an MoE that activates all experts, is simpler to train but far costlier per token.
Q183. How does MoE reduce computational cost while maintaining performance? A183: By activating only relevant experts, compute scales sub-linearly with total capacity.
Q184. How does token-to-expert mapping affect efficiency in MoE-based LLMs? A184: Better routing reduces overlap/conflict between experts, boosting throughput.
Q185. What challenges arise when training large-scale MoE models? A185: Expert imbalance, routing instability, and communication overhead across devices.
186–190: Chain-of-Thought
Q186. How does CoT prompting differ from standard prompting in LLMs? A186: CoT asks the model to reason step-by-step instead of jumping straight to answers.
Q187. What types of tasks benefit most from Chain-of-Thought reasoning? A187: Multi-step arithmetic, logic puzzles, and complex reasoning questions.
Q188. How does CoT prompting improve multi-step arithmetic and logic problems? A188: It encourages explicit intermediate reasoning, reducing skipped steps and errors.
Q189. How is CoT prompting combined with self-consistency for better accuracy? A189: Multiple CoT paths are generated; final output is chosen by voting, filtering noisy reasoning.
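A sketch of the voting step; sample_cot_answer() is a hypothetical stand-in for sampling one chain-of-thought completion and extracting its final answer.

```python
import random
from collections import Counter

def sample_cot_answer(question):
    """Hypothetical stand-in: sample one reasoning chain, return its final answer."""
    return random.choice(["42", "42", "42", "41"])   # noisy but mostly correct

def self_consistency(question, n_samples=9):
    answers = [sample_cot_answer(question) for _ in range(n_samples)]
    # Majority vote over final answers filters out chains that drifted.
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples

answer, agreement = self_consistency("What is 6 * 7?")
print(answer, agreement)
```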
Q190. What are the limitations of CoT prompting in complex reasoning tasks? A190: Long chains can drift off-topic; reasoning steps may still hallucinate without verification.
191–195: Discriminative vs Generative
Q191. What are the key differences between classification and generation tasks in AI models? A191: Classification labels inputs; generation produces sequences — requiring next-token sampling.
Q192. How do discriminative models learn decision boundaries, and how does that contrast with generative models? A192: Discriminative models focus only on class separation; generative learn the data distribution itself.
Q193. How does the training objective differ for discriminative vs. generative models? A193: Discriminative minimizes classification error; generative maximizes data likelihood or next-token prediction.
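A toy contrast of the two objectives with made-up probabilities: the discriminative loss scores only p(label | input), while the generative loss scores the sequence itself token by token.

```python
import numpy as np

# Discriminative: cross-entropy on a class label given the input.
class_probs = np.array([0.1, 0.7, 0.2])          # model's p(y | x) over 3 classes
true_class = 1
discriminative_loss = -np.log(class_probs[true_class])

# Generative: negative log-likelihood of the observed sequence, one token at a time.
next_token_probs = np.array([0.5, 0.3, 0.9])     # p(token_t | tokens_<t) at each step
generative_loss = -np.sum(np.log(next_token_probs))

print(round(discriminative_loss, 3), round(generative_loss, 3))
```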
Q194. When should you choose a generative model over a discriminative one in real-world applications? A194: For text generation, completion, summarization — when you need output, not just labels.
Q195. How do models like BERT (discriminative) and GPT (generative) differ in architecture and use cases? A195: BERT uses bidirectional masking for understanding tasks; GPT uses autoregression for coherent generation.
196–200: Knowledge Graphs & Hallucination
Q196. How do knowledge graphs enhance factual grounding in LLM responses? A196: They inject structured, verified facts the model can cross-check during generation.
Q197. What techniques are used to connect structured knowledge with unstructured LLM outputs? A197: Entity linking, embedding alignment, and retrieval pipelines bridge graphs and text generation.
Q198. How does entity linking between text and a knowledge graph benefit reasoning tasks? A198: It ties text spans to known entities, improving consistency and inferencing.
Q199. How can knowledge graphs reduce hallucinations in generative models? A199: They constrain generation to known facts, anchoring free-form text to reliable data.
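A toy sketch of the grounding pattern behind Q197–Q199: link a generated claim to a triple and check it against the graph. The graph, the claim format, and extract_triple() are all hypothetical.

```python
# Toy knowledge graph as (subject, relation, object) triples.
knowledge_graph = {
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
}

def extract_triple(claim):
    """Hypothetical entity/relation linker: maps a claim string to a candidate triple."""
    subject, relation, obj = claim.split(" | ")
    return (subject, relation, obj)

def grounded(claim):
    # Keep claims whose triple exists in the graph; flag the rest for revision.
    return extract_triple(claim) in knowledge_graph

print(grounded("Marie Curie | born_in | Warsaw"))    # True: supported by the graph
print(grounded("Marie Curie | born_in | Paris"))     # False: potential hallucination
```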
Q200. What are the challenges of integrating dynamic or evolving knowledge graphs with LLMs? A200: Keeping embeddings fresh, syncing updates in real-time, and scaling graph lookups efficiently.