IVQ 51-100
Q51. Why did RoBERTa drop BERT's next sentence prediction objective, and what impact did removing it have on performance?
Q52. How does sentence order prediction differ from next sentence prediction as a pretraining objective?
Q53. What are the limitations of using next sentence prediction in training language understanding models?
Q54. How do tasks like sentence entailment benefit from models trained with next sentence prediction?
Q55. In what ways does next sentence prediction contribute to discourse-level understanding in LLMs?
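Questions 51–55 revolve around how NSP and SOP training pairs are actually built. Below is a minimal, self-contained sketch of that pair construction (the corpus, function names, and sentences are all invented for illustration); it shows why SOP negatives are harder than NSP's random-document negatives.

```python
# Illustrative sketch (not from any library): how NSP and SOP training
# pairs are typically constructed from a corpus of ordered sentences.
import random

corpus = [
    ["The cat sat on the mat.", "It purred contentedly.", "Then it fell asleep."],
    ["Markets rallied on Tuesday.", "Tech stocks led the gains.", "Analysts were surprised."],
]

def make_nsp_pair(corpus):
    """NSP (BERT): label 1 if B actually follows A; otherwise B is a random
    sentence from a *different* document (label 0), an easy negative that
    can be solved by topic matching alone."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    a = doc[i]
    if random.random() < 0.5:
        return a, doc[i + 1], 1                # true next sentence
    other = random.choice([d for d in corpus if d is not doc])
    return a, random.choice(other), 0          # random-document negative

def make_sop_pair(corpus):
    """SOP (ALBERT): both sentences come from the same document; the negative
    is the same pair with its order swapped, forcing the model to learn
    inter-sentence coherence rather than topic co-occurrence."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    a, b = doc[i], doc[i + 1]
    return (a, b, 1) if random.random() < 0.5 else (b, a, 0)

print(make_nsp_pair(corpus))
print(make_sop_pair(corpus))
```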
Q56. What are the trade-offs between deterministic decoding and sampling-based methods in LLMs?
Q57. How does nucleus (top-p) sampling dynamically adjust the candidate token pool during generation?
Q58. When would you prefer top-k sampling over top-p sampling in creative text generation?
Q59. How do sampling strategies affect the diversity, coherence, and repetitiveness of LLM outputs?
Q60. Can top-k and top-p sampling be combined, and what is the benefit of doing so?
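For Q57–Q60, a toy implementation makes the mechanics concrete. The sketch below (function name and distribution are invented) applies a top-k cutoff followed by a top-p (nucleus) cutoff to a next-token distribution, which is exactly the combination Q60 asks about: k caps the pool size, while p shrinks it further whenever the distribution is peaked.

```python
# Minimal sketch of combined top-k and top-p (nucleus) filtering over a toy
# next-token distribution; real decoders apply this to logits at every step.
import numpy as np

def sample_top_k_top_p(probs, k=50, p=0.9, rng=np.random.default_rng()):
    """Apply top-k first, then keep the smallest prefix whose cumulative
    probability reaches p, renormalize, and sample from the survivors."""
    order = np.argsort(probs)[::-1]                       # tokens by descending prob
    order = order[:k]                                     # top-k cutoff
    cum = np.cumsum(probs[order])
    keep = order[: max(1, np.searchsorted(cum, p) + 1)]   # nucleus cutoff
    kept = probs[keep] / probs[keep].sum()                # renormalize survivors
    return rng.choice(keep, p=kept)

vocab_probs = np.array([0.45, 0.25, 0.15, 0.08, 0.04, 0.02, 0.01])
print(sample_top_k_top_p(vocab_probs, k=5, p=0.9))        # samples from 4 tokens
```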
Q61. How does prompt structure influence the quality and relevance of LLM responses?
Q62. What is zero-shot prompting, and how does it differ from few-shot and chain-of-thought prompting?
Q63. How can prompt engineering be used to steer LLMs toward safer or more factual outputs?
Q64. What are the challenges in designing prompts for multilingual or multi-domain tasks?
Q65. How does instruction tuning reduce the need for manual prompt engineering?
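As background for Q61–Q62, the hypothetical templates below (all task wording and examples are invented) show how the same classification task can be posed zero-shot, few-shot, or with a chain-of-thought cue; the structural difference between them is just what precedes the final query.

```python
# Invented prompt templates contrasting zero-shot, few-shot, and
# chain-of-thought framing of the same underlying task.
def zero_shot(review):
    return (f"Classify the sentiment of this review as positive or negative.\n"
            f"Review: {review}\nSentiment:")

def few_shot(review, examples):
    # In-context demonstrations precede the query in the same format.
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
    return f"{shots}\nReview: {review}\nSentiment:"

def chain_of_thought(question):
    # A reasoning cue elicits intermediate steps before the final answer.
    return f"Q: {question}\nA: Let's think step by step."

demos = [("Great battery life!", "positive"), ("Broke after two days.", "negative")]
print(few_shot("Surprisingly sturdy for the price.", demos))
```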
Q66. What role does continual learning play in mitigating catastrophic forgetting in LLMs?
Q67. How do regularization techniques like Elastic Weight Consolidation (EWC) help preserve pretrained knowledge?
Q68. What is rehearsal-based fine-tuning, and how does it prevent forgetting in LLMs?
Q69. How does adapter tuning reduce the risk of overwriting core model weights?
Q70. In what scenarios is catastrophic forgetting most likely to occur during LLM adaptation?
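Q67 is easiest to answer with the EWC penalty written out. The sketch below implements the standard quadratic form, total = L_task(θ) + (λ/2) Σᵢ Fᵢ (θᵢ − θᵢ*)², where θ* are the pretrained weights and F is the diagonal Fisher information; all numeric values are made up for illustration.

```python
# Sketch of the Elastic Weight Consolidation (EWC) penalty: each parameter is
# anchored to its pretrained value, weighted by the (diagonal) Fisher
# information, which estimates how important it was for the original task.
import numpy as np

def ewc_loss(task_loss, theta, theta_star, fisher, lam=0.4):
    """total = L_task(theta) + (lam/2) * sum_i F_i * (theta_i - theta*_i)^2"""
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
    return task_loss + penalty

theta_star = np.array([0.8, -1.2, 0.3])   # weights after pretraining
fisher     = np.array([2.0, 0.1, 0.9])    # high F => important, resists change
theta      = np.array([0.5, -0.2, 0.4])   # current weights during fine-tuning
print(ewc_loss(task_loss=1.25, theta=theta, theta_star=theta_star, fisher=fisher))
```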
Q71. How does knowledge distillation transfer performance from a large teacher model to a smaller student model?
Q72. What are the trade-offs between model compression via distillation and quantization?
Q73. How does distillation preserve task-specific behavior in smaller LLM variants?
Q74. In what scenarios is model distillation preferred over training a small model from scratch?
Q75. How can distillation be used to align model outputs with human preferences or safety constraints?
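For Q71 and Q73, here is a sketch of the standard distillation loss in the style of Hinton et al. (2015): a temperature-softened KL term pulling the student toward the teacher's full output distribution, blended with ordinary cross-entropy on the hard labels. The logits and hyperparameters are illustrative, not taken from any real model.

```python
# Sketch of the classic knowledge-distillation loss: soft teacher targets
# (temperature T) plus hard-label cross-entropy, mixed by alpha.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes stable.
    kd = T * T * np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    ce = -np.log(softmax(student_logits)[label])   # hard-label cross-entropy
    return alpha * kd + (1 - alpha) * ce

teacher = np.array([4.0, 1.5, 0.2])   # confident teacher
student = np.array([2.0, 1.8, 0.5])   # student still uncertain
print(distillation_loss(student, teacher, label=0))
```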
Q76. What is subword tokenization, and how does it help LLMs handle rare or unseen words?
Q77. How do Byte Pair Encoding (BPE) and SentencePiece differ in managing vocabulary in LLMs?
Q78. Why are character-level and byte-level models more robust to out-of-vocabulary (OOV) tokens?
Q79. How do LLMs preserve semantic meaning when breaking unfamiliar words into smaller units?
Q80. What are the trade-offs between fixed vocabulary sizes and open-vocabulary tokenization strategies?
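Q77's comparison is easier with the BPE merge loop in front of you. The following is a minimal sketch of BPE training after Sennrich et al. (2016), run on a made-up word-frequency table; real tokenizers such as SentencePiece add normalization, byte fallback, and efficient data structures on top of this core idea.

```python
# Minimal Byte Pair Encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent symbol pair, so common words become single tokens while rare words
# decompose into reusable subword pieces.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():          # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair wins
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():          # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=6))
```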
Q81. What limitations of RNNs and LSTMs are addressed by the transformer architecture?
Q82. How does self-attention in transformers enable better handling of long-range dependencies?
Q83. Why do transformers allow for more parallelization during training compared to recurrent models?
Q84. How do positional encodings in transformers compensate for the lack of sequential recurrence?
Q85. In what ways have transformer-based models outperformed traditional Seq2Seq models in NLP benchmarks?
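Q82–Q83 come down to one matrix product. The sketch below implements single-head scaled dot-product self-attention with NumPy (weights and dimensions are arbitrary): every pair of positions is scored simultaneously, so there is no sequential recurrence to unroll, and any long-range dependency is a single attention "hop" away.

```python
# Sketch of single-head scaled dot-product self-attention. The full score
# matrix is computed in one shot, which is the source of both the
# parallelism and the long-range-dependency handling the questions ask about.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                # (5, 8)
```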
Q86. How does regularization help prevent overfitting during LLM training and fine-tuning?
Q87. What role does dropout play in reducing overfitting in transformer-based models?
Q88. How can early stopping be used to detect and prevent overfitting in large-scale training runs?
Q89. Why is a large, diverse training dataset critical for minimizing overfitting in LLMs?
Q90. How do techniques like data augmentation or noise injection improve generalization in LLMs?
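For Q88, a patience-based loop is the usual mechanism. The sketch below (function name and loss values are invented) returns the step of the best validation loss and stops once no improvement has been seen for `patience` consecutive evaluations, which is the point where the train/validation curves diverge.

```python
# Sketch of patience-based early stopping: halt when validation loss has not
# improved for `patience` evaluations, keeping the best checkpoint seen so far.
def early_stopping(val_losses, patience=3):
    best, best_step = float("inf"), 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, best_step = loss, step     # new best: save checkpoint here
        elif step - best_step >= patience:
            return best_step                 # stop: restore the best checkpoint
    return best_step

# Validation loss starts rising at step 4, so training halts and the best
# checkpoint (step 3) is kept.
print(early_stopping([1.9, 1.4, 1.1, 0.95, 0.97, 1.02, 1.10]))
```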
Q91. How do discriminative models differ from generative models in terms of learning decision boundaries?
Q92. Why are generative models better suited for tasks like text generation and completion?
Q93. In what scenarios would a discriminative model be preferable over a generative one in NLP?
Q94. How do models like BERT (discriminative) and GPT (generative) differ in architecture and objectives?
Q95. Can generative and discriminative approaches be combined, and what are the benefits of doing so?
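Q91 and Q94 contrast the two training objectives, which a toy loss computation makes concrete. In the sketch below (all logits and labels are fabricated), the discriminative head models p(y | x) with one distribution over task labels, while the generative head models p(next token | context) with a distribution over the vocabulary at every position, which is what enables generation.

```python
# Sketch contrasting discriminative (BERT-style classifier head) and
# generative (GPT-style language-model head) objectives on toy logits.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Discriminative: one distribution over task labels for the whole input.
class_logits = np.array([2.1, -0.3, 0.4])          # 3 sentiment classes
disc_loss = -np.log(softmax(class_logits)[0])       # cross-entropy, gold label 0

# Generative: a vocabulary distribution at every position in the sequence.
lm_logits = np.array([[1.2, 0.1, -0.5, 0.3],        # predicts token after t=0
                      [0.2, 2.0, -0.1, 0.0]])       # predicts token after t=1
targets = np.array([0, 1])                          # gold next tokens
gen_loss = -np.log(softmax(lm_logits)[np.arange(2), targets]).mean()

print(f"discriminative CE: {disc_loss:.3f}, generative NLL: {gen_loss:.3f}")
```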
Q96. What architectural improvements make GPT-4 more capable than GPT-3 in reasoning tasks?
Q97. How does GPT-4’s multimodal capability extend its range of applications compared to GPT-3?
Q98. In what ways has GPT-4 improved safety, alignment, and factual accuracy over GPT-3?
Q99. How does the context window size differ between GPT-3 and GPT-4, and what does that enable?
Q100. What are the performance differences between GPT-3 and GPT-4 in multilingual and domain-specific tasks?