IVQ 1–50
1–10: Attention & Transformer Basics
Q1. How do attention mechanisms work in transformers, and why are they central to LLM performance? A1: Attention mechanisms compute a weighted sum of input representations, letting each token focus on relevant parts of the sequence. This enables transformers to capture long-range dependencies efficiently — crucial for understanding complex language patterns.
Q2. What is the role of positional encoding in transformer-based models like LLMs? A2: Since transformers lack recurrence, positional encodings inject information about token order, allowing the model to understand sequence structure.
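A minimal NumPy sketch of the fixed sinusoidal encoding from the original Transformer paper; `seq_len` and `d_model` are illustrative, and `d_model` is assumed even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed position codes (d_model even)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe

# Added to token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(8, 16).shape)   # (8, 16)
```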
Q3. How do LLMs represent and generate human language using embeddings? A3: Words/tokens are mapped to dense vectors (embeddings) capturing semantic and syntactic properties. These embeddings are transformed through layers to generate coherent outputs.
Q4. What is a context window in LLMs, and how does it affect model performance and memory? A4: The context window is the maximum number of tokens a model can “see” at once. Larger windows capture more context but increase memory and compute demands quadratically.
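A back-of-the-envelope illustration of that quadratic growth, counting only naively materialized attention-score matrices in fp16; the head and layer counts are made-up but typical.

```python
# Bytes for attention score matrices if stored naively in fp16 (2 bytes/entry).
# Kernels like FlashAttention avoid ever materializing these in full.
def attention_score_bytes(seq_len: int, n_heads: int = 32, n_layers: int = 32) -> int:
    return seq_len * seq_len * 2 * n_heads * n_layers

for n in (2_048, 8_192, 32_768):
    print(f"{n:>6} tokens -> ~{attention_score_bytes(n) / 2**30:,.0f} GiB of scores")
# 4x the window length means ~16x this term.
```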
Q5. What is the difference between pre-training and fine-tuning in the context of large language models? A5: Pre-training teaches general language patterns on huge corpora; fine-tuning adapts the model to specific tasks or domains with smaller datasets.
Q6. What are query, key, and value vectors in the attention mechanism, and how do they interact? A6: For each token, a query vector is matched with keys to compute attention scores. The scores weight the values, producing context-aware representations.
Q7. How does multi-head attention improve the model's ability to understand context compared to single-head attention? A7: Multiple heads let the model attend to different aspects of the input in parallel — e.g., syntax and semantics — enriching contextual understanding.
Q8. Why is self-attention important in processing sequences, and how does it differ from traditional RNN approaches? A8: Self-attention processes all tokens simultaneously, enabling efficient parallelization and direct long-range connections, unlike sequential RNNs.
Q9. How are attention scores computed, and what role does the softmax function play in this process? A9: Attention scores are scaled dot products of queries and keys (divided by √d_k). Softmax normalizes each row of scores into a probability distribution summing to one, and those probabilities weight the values.
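A minimal NumPy sketch of scaled dot-product attention over a short sequence; the shapes and random inputs are purely illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Returns context-aware token representations."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of each query to every key
    weights = softmax(scores)          # each row sums to 1
    return weights @ V                 # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```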
Q10. What are the computational challenges of attention in long sequences, and how do architectures like Longformer or FlashAttention address them? A10: Vanilla attention scales quadratically with sequence length. Longformer uses sparse attention; FlashAttention optimizes memory access for faster, lower-cost computation.
11–20: Context Windows & Fine-Tuning
Q11. How does increasing the context window size impact an LLM’s performance and computational requirements? A11: Larger windows improve coherence and knowledge retention, but memory/compute costs rise steeply since attention scales quadratically with length.
Q12. What are the limitations of a fixed context window, and how do models like Claude or Gemini handle extended context? A12: Fixed windows cannot retain anything beyond the limit. Claude, Gemini, and others offer much longer native windows and are commonly paired with retrieval or summarization to cover even more.
Q13. In what ways does context fragmentation affect coherence in multi-turn conversations with LLMs? A13: If prior turns fall outside the window, the model loses track of the dialogue’s flow, leading to contradictions or repetition.
Q14. How do retrieval-augmented generation (RAG) methods help overcome context window limitations? A14: RAG pipelines fetch relevant external data during generation, augmenting the model’s context without needing massive token windows.
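A toy end-to-end RAG sketch. Lexical word overlap stands in for the dense-vector similarity a real pipeline would use (an embedding model plus a vector store), and the corpus and prompt template are invented for illustration.

```python
import re

corpus = [
    "Refund policy: purchases can be returned within 30 days.",
    "API rate limits cap usage at 100 requests per minute.",
    "Support hours run Monday through Friday, 9am to 5pm.",
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy scorer: count shared words; real RAG ranks by embedding similarity.
    ranked = sorted(corpus, key=lambda doc: len(tokenize(doc) & tokenize(query)), reverse=True)
    return ranked[:k]

question = "What is the refund policy?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
# `prompt` then goes to the LLM, which grounds its answer in the retrieved text.
print(prompt)
```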
Q15. What strategies are used to compress or summarize prior context to fit within an LLM’s token limit? A15: Chunking, summarization, context distillation, and selective recall are common strategies to maintain relevance within limits.
Q16. How does parameter-efficient fine-tuning (PEFT) differ from full fine-tuning in terms of resource usage and model performance? A16: PEFT adapts small subsets of parameters (e.g., adapters, LoRA) with less compute and storage. Full fine-tuning updates all weights, needing more resources.
Q17. What role do low-rank matrices play in LoRA, and why are they effective for LLM adaptation? A17: LoRA injects low-rank matrices into weight updates, efficiently steering model behavior with minimal new parameters.
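A minimal sketch of the LoRA idea: the frozen weight W is augmented with a trainable low-rank product (alpha/r)·B·A. Sizes are illustrative.

```python
import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16        # rank r is much smaller than d

W = np.random.randn(d_out, d_in)                 # frozen pretrained weight
A = 0.01 * np.random.randn(r, d_in)              # trainable, small random init
B = np.zeros((d_out, r))                         # trainable, zero init so the update starts at 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_in,). Frozen path plus the learned low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

trainable = r * (d_in + d_out)                   # 65,536 parameters
full = d_in * d_out                              # ~16.8M parameters for full fine-tuning
print(trainable, full)
```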
Q18. How does QLoRA enable fine-tuning large models on consumer-grade hardware? A18: QLoRA quantizes the frozen base model (typically to 4-bit) for memory efficiency and trains LoRA adapters on top for targeted updates, allowing fine-tuning on modest GPUs.
Q19. What are the trade-offs between using quantization-aware fine-tuning methods like QLoRA versus traditional approaches? A19: QLoRA saves compute/memory but may slightly reduce final accuracy vs. full-precision fine-tuning. Trade-off: efficiency vs. maximum performance.
Q20. In which scenarios would you choose LoRA, QLoRA, or full fine-tuning for adapting a foundation model? A20: Use LoRA for quick adaptation with limited compute, QLoRA for large models on standard hardware, full fine-tuning when accuracy is critical and resources are ample.
21–30: Decoding & Generation
Q21. What are the trade-offs between beam search and nucleus sampling in text generation? A21: Beam search approximately maximizes sequence likelihood and is deterministic, but its output can be repetitive and generic. Nucleus (top-p) sampling adds controlled randomness for diversity and creativity.
Q22. How does top-k sampling introduce diversity in LLM outputs, and when is it preferred over deterministic decoding? A22: Top-k keeps only the k most likely next tokens, then samples randomly — useful for creative or non-repetitive tasks.
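A minimal top-k sampler over a raw logits vector, assuming NumPy; the vocabulary size and logits are fake.

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 50) -> int:
    """Keep the k highest-scoring tokens, renormalize, and sample one token id."""
    top_ids = np.argsort(logits)[-k:]            # indices of the k most likely tokens
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(np.random.choice(top_ids, p=probs))

vocab_logits = np.random.randn(32_000)           # stand-in logits over a 32k vocabulary
print(top_k_sample(vocab_logits, k=50))
```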
Q23. What are the limitations of greedy decoding in capturing coherent long-form text? A23: Greedy decoding always picks the top token, which can lead to repetitive or bland text lacking variation.
Q24. How do temperature settings influence randomness and creativity in sampling-based decoding methods? A24: Higher temperatures flatten probability distributions, increasing randomness; lower temperatures make output more deterministic.
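A tiny demonstration of that effect: dividing made-up logits by the temperature before softmax sharpens or flattens the resulting distribution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([3.0, 1.5, 1.0, 0.2])          # invented scores for four candidate tokens

for T in (0.2, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax(logits / T), 3)}")
# T=0.2 puts nearly all probability on the top token (near-greedy);
# T=2.0 flattens the distribution, so lower-ranked tokens are chosen more often.
```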
Q25. In what scenarios would you prioritize decoding efficiency over generation diversity in LLM applications? A25: When factual accuracy, determinism, or latency are critical — e.g., QA bots, legal document drafting.
Q26. How does adjusting the temperature parameter affect the randomness and determinism of LLM-generated text? A26: Low temperature → deterministic. High temperature → more randomness and creative word choices.
Q27. In what situations would a low-temperature setting be more appropriate than a high-temperature one? A27: For factual tasks, summarization, or formal writing where coherence and accuracy trump creativity.
Q28. How does temperature interact with top-k and top-p sampling in shaping output diversity? A28: Temperature controls randomness; top-k/top-p define the candidate pool. Combined, they balance coherence and novelty.
Q29. What are the risks of using a high temperature in safety-critical LLM applications? A29: High temperature may generate unexpected or unsafe outputs — potentially introducing hallucinations or harmful text.
Q30. How can tuning temperature help balance creativity and coherence in generative tasks like storytelling or summarization? A30: Moderate temperatures inject enough variability for fresh ideas while keeping the narrative logical.
31–40: Pretraining & Seq2Seq
Q31. How does causal language modeling differ from masked language modeling in LLM training? A31: Causal models predict the next token sequentially (e.g., GPT). Masked models fill in masked tokens bidirectionally (e.g., BERT).
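A minimal sketch of how the two objectives build training examples from the same sentence; tokenization is simplified to whitespace splitting and the mask position is chosen by hand.

```python
tokens = "the cat sat on the mat".split()

# Causal LM (GPT-style): predict each token from everything to its left.
causal_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (['the'], 'cat'), (['the', 'cat'], 'sat'), ...

# Masked LM (BERT-style): hide ~15% of tokens (here just position 2)
# and predict them using context from both directions.
masked_input = tokens.copy()
masked_input[2] = "[MASK]"        # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
mlm_example = (masked_input, {2: "sat"})

print(causal_examples[1])
print(mlm_example)
```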
Q32. Why is next sentence prediction used in some pretraining objectives, and what are its limitations? A32: It helps models learn sentence-level relationships. Limitation: it is often too easy and poorly aligned with downstream tasks, so later models such as RoBERTa dropped it in favor of other objectives.
Q33. How does span masking improve over token-level masking in models like SpanBERT? A33: Masking spans teaches models to understand contiguous chunks, improving performance on phrase-level tasks.
Q34. What are the benefits and trade-offs of using denoising autoencoding for pretraining language models? A34: It teaches robust reconstruction, improving context understanding. Trade-off: not directly optimized for generation.
Q35. How does the choice of pretraining objective impact downstream performance on tasks like QA or summarization? A35: Pretraining objectives aligned with downstream tasks yield better transfer learning — e.g., span prediction helps QA.
Q36. How do encoder-decoder architectures work in sequence-to-sequence models? A36: The encoder processes input into a context representation; the decoder generates output conditioned on this context.
Q37. What makes sequence-to-sequence models suitable for tasks like machine translation and summarization? A37: They handle input-output pairs of variable lengths and learn direct mappings from source to target text.
Q38. How does attention enhance the performance of traditional sequence-to-sequence models? A38: Attention allows the decoder to focus on relevant encoder states at each step, improving alignment and fluency.
Q39. In what ways do transformer-based seq2seq models outperform RNN-based ones? A39: Better parallelization, superior long-range context capture, and fewer vanishing-gradient problems.
Q40. How are sequence-to-sequence models used in speech recognition and text-to-speech systems? A40: They map audio features to text (ASR) or text to audio frames (TTS) using encoder-decoder frameworks.
41–50: Architectures & Embeddings
Q41. What are the key differences between encoder-only, decoder-only, and encoder-decoder transformer architectures? A41: Encoder-only: context understanding (BERT). Decoder-only: generation (GPT). Encoder-decoder: input-output mapping (T5, BART).
Q42. How does causal attention enable autoregressive language modeling? A42: It masks future tokens, ensuring each prediction only uses past context.
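A NumPy sketch of that masking step: future (upper-triangular) positions are set to negative infinity before softmax, so each row attends only to itself and earlier tokens. The sequence length and scores are illustrative.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)      # raw query-key scores

# True above the diagonal = "future" positions to hide from each query.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(causal_mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                     # lower-triangular attention pattern
```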
Q43. Why are masked models like BERT better suited for understanding tasks than generation tasks? A43: They learn bidirectional context, ideal for classification, QA, and filling blanks, but not for left-to-right generation.
Q44. How do bidirectional context models differ from left-to-right models in their ability to capture dependencies? A44: Bidirectional models see full context simultaneously; left-to-right models can’t peek ahead.
Q45. In what scenarios would you choose an autoregressive model over a masked language model, and vice versa? A45: Autoregressive: generation tasks (chatbots). Masked: classification, NER, QA.
Q46. How do word embeddings differ from contextual embeddings in modern language models? A46: Word embeddings are static. Contextual embeddings change meaning based on the surrounding text.
Q47. What role do positional embeddings play in transformer architectures? A47: They encode sequence order so the model knows token positions relative to each other.
Q48. How are learned embeddings updated during pretraining and fine-tuning? A48: Gradients adjust embeddings during backpropagation to better fit training data or target tasks.
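A toy NumPy illustration of that update: only embedding rows for tokens that appear in the batch receive a gradient step. The gradient values and sizes are made up.

```python
import numpy as np

vocab_size, d_model, lr = 10, 4, 0.1
E = np.random.randn(vocab_size, d_model)        # learned embedding table

token_ids = np.array([2, 7, 2])                 # tokens seen in this toy batch
grads = np.ones((3, d_model))                   # pretend gradients from backprop

before = E.copy()
np.add.at(E, token_ids, -lr * grads)            # SGD step; repeated ids accumulate

print(np.where(np.any(E != before, axis=1))[0]) # only rows 2 and 7 changed
```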
Q49. What is the significance of embedding dimensionality in LLM performance and memory usage? A49: Higher dimensions capture richer information but increase model size and computation.
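Quick arithmetic on the memory side alone, with an illustrative vocabulary size and fp16 storage.

```python
vocab_size = 50_000
bytes_per_param = 2                              # fp16

for d_model in (768, 2048, 4096):
    mib = vocab_size * d_model * bytes_per_param / 2**20
    print(f"d_model={d_model}: input embedding table ≈ {mib:,.0f} MiB")
# Roughly 73 MiB at 768 dims vs. 391 MiB at 4096 dims, before any transformer layers.
```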
Q50. How can pre-trained embeddings be reused or adapted in downstream NLP tasks? A50: They initialize models with learned knowledge, boosting performance and reducing training needs for new tasks.