IVQ 51-100
51–55: Next Sentence Prediction
Q51. Why was next sentence prediction removed from RoBERTa, and what impact did it have on performance? A51: RoBERTa's authors found that NSP did not improve downstream performance and sometimes hurt it. Dropping NSP let the model train on full-length contiguous text instead of sentence pairs, which, together with more data and longer training, improved results on downstream benchmarks.
Q52. How does sentence order prediction differ from next sentence prediction as a pretraining objective? A52: Sentence order prediction (SOP, used in ALBERT) takes two consecutive segments from the same document and asks whether they are in the original order, which forces the model to learn coherence. NSP asks whether the second sentence actually follows the first, but its negative pairs come from different documents, so it can often be solved by topic matching alone.
Q53. What are the limitations of using next sentence prediction in training language understanding models? A53: NSP is often too easy (negatives come from other documents, so topic cues give the answer away), its binary signal maps poorly onto most downstream tasks, and it does little to teach deeper discourse structure.
Q54. How do tasks like sentence entailment benefit from models trained with next sentence prediction? A54: NSP helps models learn pairwise sentence relationships, which is useful for entailment and coherence tasks — but better objectives can replace it.
Q55. In what ways does next sentence prediction contribute to discourse-level understanding in LLMs? A55: NSP encourages the model to capture topic flow and coherence across sentences — but with limited depth compared to better discourse-level tasks.
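As a rough illustration of the contrast in Q52 and Q55, the sketch below builds toy NSP and SOP training pairs from documents that are already split into segments. The helper names and the 50/50 positive/negative split are assumptions for illustration, not the exact BERT or ALBERT data pipeline.

```python
import random

def make_nsp_pair(doc, all_docs):
    """NSP: positive = two consecutive segments; negative = second segment from a different document."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1                        # label 1: IsNext
    other = random.choice([d for d in all_docs if d is not doc])
    return doc[i], random.choice(other), 0                  # label 0: NotNext (often just a topic mismatch)

def make_sop_pair(doc):
    """SOP: both segments always come from the same document; the label says whether their order was swapped."""
    i = random.randrange(len(doc) - 1)
    a, b = doc[i], doc[i + 1]
    return (a, b, 1) if random.random() < 0.5 else (b, a, 0)
```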
56–60: Decoding Trade-offs & Sampling
Q56. What are the trade-offs between deterministic decoding and sampling-based methods in LLMs? A56: Deterministic decoding is repeatable and safe but can be repetitive. Sampling adds diversity but risks incoherence or factual errors.
Q57. How does nucleus (top-p) sampling dynamically adjust the candidate token pool during generation? A57: Top-p includes only the smallest set of tokens whose cumulative probability exceeds p — dynamically adjusting the pool based on distribution.
Q58. When would you prefer top-k sampling over top-p sampling in creative text generation? A58: Top-k caps the candidate pool at a fixed size, which keeps the breadth of choice predictable even when the probability distribution is flat, e.g., in constrained creative writing where you want steady variation without an unbounded tail of unlikely tokens.
Q59. How do sampling strategies affect the diversity, coherence, and repetitiveness of LLM outputs? A59: More aggressive sampling (higher p/k) boosts diversity but can reduce coherence; narrower sampling does the opposite.
Q60. Can top-k and top-p sampling be combined, and what is the benefit of doing so? A60: Yes. Applying both filters keeps the candidate pool capped in size (k) and trimmed to the most probable mass (p), balancing control and diversity.
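A minimal sketch of how the strategies in Q56 to Q60 can be applied to a raw logits vector at one decoding step, assuming NumPy; the function name and default values are illustrative, not any particular library's API.

```python
import numpy as np

def sample_top_k_top_p(logits, k=50, p=0.9, temperature=1.0, rng=None):
    """Apply a top-k cap, then a top-p (nucleus) cut, then sample from the renormalized pool."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids sorted by descending probability
    sorted_probs = probs[order]

    keep = np.zeros(len(order), dtype=bool)
    keep[:k] = True                            # top-k: fixed-size cap
    cumulative = np.cumsum(sorted_probs)
    keep &= (cumulative - sorted_probs) < p    # top-p: smallest prefix whose mass reaches p
    keep[0] = True                             # never empty the pool

    pool = order[keep]
    pool_probs = probs[pool] / probs[pool].sum()
    return int(rng.choice(pool, p=pool_probs))
```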
61–65: Prompting & Instruction Tuning
Q61. How does prompt structure influence the quality and relevance of LLM responses? A61: Clear, precise prompts with good context steer the model’s generation — ambiguous prompts lead to unpredictable results.
Q62. What is zero-shot prompting, and how does it differ from few-shot and chain-of-thought prompting? A62: Zero-shot provides only instructions. Few-shot adds examples. Chain-of-thought prompts the model to reason step-by-step for complex tasks.
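Hypothetical prompts illustrating the three styles in Q62; the review texts, labels, and arithmetic task are made up for the example.

```python
# Zero-shot: instruction only, no examples.
zero_shot = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'The battery died after two days.'"
)

# Few-shot: the same instruction plus worked examples before the real input.
few_shot = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great screen, fast shipping.' -> positive\n"
    "Review: 'Stopped working in a week.' -> negative\n"
    "Review: 'The battery died after two days.' ->"
)

# Chain-of-thought: ask the model to reason step by step before giving the answer.
chain_of_thought = (
    "A store sells pens at 3 for $2. How much do 12 pens cost?\n"
    "Think step by step, then give the final answer."
)
```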
Q63. How can prompt engineering be used to steer LLMs toward safer or more factual outputs? A63: Adding disclaimers, constraints, or fact-check steps reduces hallucinations. System messages or instructions can enforce rules.
Q64. What are the challenges in designing prompts for multilingual or multi-domain tasks? A64: Ensuring clarity across languages, handling cultural nuances, and covering diverse knowledge domains without ambiguity.
Q65. How does instruction tuning reduce the need for manual prompt engineering? A65: It trains the model to follow task instructions directly, making it more robust to varied prompt phrasing.
66–70: Continual Learning & Forgetting
Q66. What role does continual learning play in mitigating catastrophic forgetting in LLMs? A66: Continual learning updates models with new data while retaining old knowledge — crucial for evolving tasks.
Q67. How do regularization techniques like Elastic Weight Consolidation (EWC) help preserve pretrained knowledge? A67: EWC penalizes changes to important weights, preserving useful parameters while allowing adaptation.
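A minimal PyTorch-style sketch of the EWC penalty from Q67, assuming a precomputed diagonal Fisher estimate (`fisher`) and a snapshot of the pretrained weights (`old_params`); both names are placeholders for illustration.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Quadratic penalty anchoring important weights (large Fisher values) near their pretrained values."""
    penalty = torch.tensor(0.0, device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]).pow(2)).sum()
    return 0.5 * lam * penalty

# During adaptation: loss = task_loss + ewc_penalty(model, fisher, old_params)
```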
Q68. What is rehearsal-based fine-tuning, and how does it prevent forgetting in LLMs? A68: It reuses samples from old tasks during new training — “rehearsing” old knowledge to keep it fresh.
Q69. How does adapter tuning reduce the risk of overwriting core model weights? A69: Adapters add small trainable modules without modifying base weights — enabling safe domain shifts.
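A bottleneck adapter in the spirit of Q69 might look like the sketch below; the module name, bottleneck size, and activation are illustrative choices rather than a specific library's implementation.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: the frozen base layer's output passes through a small trainable residual block."""
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # project to a small bottleneck
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # project back to the model dimension
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the base representation intact; only the small detour is trained.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```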
Q70. In what scenarios is catastrophic forgetting most likely to occur during LLM adaptation? A70: When fine-tuning on small, narrow data sets or sequential tasks without regularization — old tasks degrade.
71–75: Distillation & Compression
Q71. How does knowledge distillation transfer performance from a large teacher model to a smaller student model? A71: The teacher’s outputs guide the student to mimic its behavior — capturing patterns with fewer parameters.
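A common way to implement the transfer in Q71 is a temperature-softened KL loss between teacher and student logits; the sketch below illustrates that idea and is not the exact loss used by any specific distilled model.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions.
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
```

In practice this term is usually mixed with the ordinary hard-label loss on the student's own predictions.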
Q72. What are the trade-offs between model compression via distillation and quantization? A72: Distillation shrinks the parameter count while preserving much of the teacher's accuracy, but it requires a full training run. Quantization lowers numeric precision (e.g., 8-bit or 4-bit weights), is cheap to apply, but can degrade quality at aggressive bit widths; the two are often combined.
Q73. How does distillation preserve task-specific behavior in smaller LLM variants? A73: The student learns from the teacher’s soft probabilities — inheriting nuanced decision boundaries.
Q74. In what scenarios is model distillation preferred over training a small model from scratch? A74: When you need smaller deployable models with the teacher’s advanced capabilities but limited compute for retraining.
Q75. How can distillation be used to align model outputs with human preferences or safety constraints? A75: A safety-aligned teacher can teach a student to generate safe, user-friendly text by example.
76–80: Tokenization & OOV Handling
Q76. What is subword tokenization, and how does it help LLMs handle rare or unseen words? A76: Words are split into subunits (e.g., “un+believ+able”), allowing flexible handling of new words.
Q77. How do Byte Pair Encoding (BPE) and SentencePiece differ in managing vocabulary in LLMs? A77: BPE builds its vocabulary by iteratively merging the most frequent symbol pairs in a pre-tokenized corpus. SentencePiece operates on the raw character stream (whitespace included), requires no language-specific pre-tokenization, and can train either BPE or unigram-LM vocabularies, which makes it language-agnostic.
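A toy sketch of the BPE merge loop from Q77, using the classic low/lower/newest/widest corpus; real tokenizers add symbol-aware merging and efficiency tricks that this naive string-replace version skips.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of space-separated symbol sequences."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the pair with the merged symbol (naive toy version)."""
    return {word.replace(" ".join(pair), "".join(pair)): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker and their frequencies.
words = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(pair, words)
    print(pair, "->", "".join(pair))
```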
Q78. Why are character-level and byte-level models more robust to OOV tokens? A78: They work directly on characters/bytes, avoiding unknown tokens altogether.
Q79. How do LLMs preserve semantic meaning when breaking unfamiliar words into smaller units? A79: Contextual embeddings recombine subword pieces meaningfully based on surrounding words.
Q80. What are the trade-offs between fixed vocabulary sizes and open-vocabulary tokenization strategies? A80: Fixed vocabularies are efficient but limit OOV handling; open vocabularies increase flexibility but may add sequence length.
81–85: Transformer vs. RNN
Q81. What limitations of RNNs and LSTMs are addressed by the transformer architecture? A81: Transformers parallelize sequence processing and handle long-range dependencies better.
Q82. How does self-attention in transformers enable better handling of long-range dependencies? A82: Any token can directly attend to any other, regardless of position.
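A NumPy sketch of the scaled dot-product attention behind Q82; single head, no masking, purely illustrative.

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every other in one matrix multiply,
    so the path length between any two tokens is constant regardless of their distance."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over key positions
    return weights @ V                                  # weighted mix of value vectors
```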
Q83. Why do transformers allow for more parallelization during training compared to recurrent models? A83: They process tokens simultaneously instead of sequentially.
Q84. How do positional encodings in transformers compensate for the lack of sequential recurrence? A84: They inject order info, so the model understands token positions.
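The sinusoidal scheme from the original transformer paper is one concrete example of the order information mentioned in Q84; the sketch below assumes NumPy.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos encodings, added to token embeddings so the otherwise
    order-blind attention layers can distinguish positions."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))
```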
Q85. In what ways have transformer-based models outperformed traditional Seq2Seq models in NLP benchmarks? A85: Higher accuracy, better long-text fluency, and faster training times.
86–90: Overfitting & Generalization
Q86. How does regularization help prevent overfitting during LLM training and fine-tuning? A86: Techniques such as weight decay penalize overly large weights and constrain effective model capacity, discouraging the model from memorizing individual training examples.
Q87. What role does dropout play in reducing overfitting in transformer-based models? A87: It randomly disables neurons during training to prevent reliance on specific features.
Q88. How can early stopping be used to detect and prevent overfitting in large-scale training runs? A88: Training halts once validation performance stops improving for a set number of evaluations (the patience), so the model is not trained past the point where it begins to overfit.
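A schematic early-stopping loop for Q88, where `train_step` and `evaluate` are placeholder callables standing in for whatever training and validation routines the run actually uses.

```python
def train_with_early_stopping(train_step, evaluate, max_epochs=100, patience=3):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = evaluate()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
            # a checkpoint save would go here so the best weights can be restored later
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_loss
```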
Q89. Why is having a diverse and large training dataset critical for minimizing overfitting in LLMs? A89: Diverse data exposes the model to more patterns, reducing memorization of specifics.
Q90. How do techniques like data augmentation or noise injection improve generalization in LLMs? A90: They expand training variety and force the model to learn robust features.
91–95: Generative vs. Discriminative
Q91. How do discriminative models differ from generative models in terms of learning decision boundaries? A91: Discriminative models learn P(Y|X) — focusing on labels. Generative models learn P(X,Y) — modeling full data distributions.
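The two quantities in Q91 are linked by the product rule, which is why a generative model can always recover a classifier while the reverse does not hold:

```latex
P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{P(X \mid Y)\, P(Y)}{\sum_{y'} P(X \mid y')\, P(y')}
```

A discriminative model fits the left-hand side directly without ever modeling P(X).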
Q92. Why are generative models better suited for tasks like text generation and completion? A92: They can sample new sequences by modeling how text unfolds probabilistically.
Q93. In what scenarios would a discriminative model be preferable over a generative one in NLP? A93: For classification, NER, or sentiment tasks where decision boundaries matter more than generation.
Q94. How do models like BERT (discriminative) and GPT (generative) differ in architecture and objectives? A94: BERT masks and classifies bidirectionally; GPT predicts next tokens autoregressively.
Q95. Can generative and discriminative approaches be combined, and what are the benefits of doing so? A95: Yes — combining boosts tasks like semi-supervised learning and robust representation learning.
96–100: GPT-4 vs. GPT-3
Q96. What architectural improvements make GPT-4 more capable than GPT-3 in reasoning tasks? A96: Larger scale, better training data, instruction tuning, and optimized attention help GPT-4 reason better.
Q97. How does GPT-4’s multimodal capability extend its range of applications compared to GPT-3? A97: GPT-4 can handle images and text jointly, enabling tasks like visual question answering.
Q98. In what ways has GPT-4 improved safety, alignment, and factual accuracy over GPT-3? A98: Better alignment training, improved reinforcement learning from human feedback (RLHF), and expanded policy safeguards.
Q99. How does the context window size differ between GPT-3 and GPT-4, and what does that enable? A99: GPT-3 was limited to roughly 2k tokens of context, while GPT-4 launched with 8k and 32k variants, enabling coherent handling of long documents and extended conversations.
Q100. What are the performance differences between GPT-3 and GPT-4 in multilingual and domain-specific tasks? A100: GPT-4 handles low-resource languages and domain-specific tasks more robustly thanks to broader training and better fine-tuning.