IVQ 101-150

Q101. How do positional encodings allow transformers to capture the order of tokens in a sequence?

Q102. What is the difference between sinusoidal and learned positional encodings?

Q103. Why are positional encodings essential in transformer models that lack recurrence?

Q104. How do relative positional encodings improve upon absolute positional encodings in attention mechanisms?

Q105. In what ways do positional encodings affect performance in long-context or streaming transformer models?

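As a reference while answering Q101–Q105, here is a minimal NumPy sketch of the sinusoidal (absolute) positional encoding from "Attention Is All You Need"; `max_len` and `d_model` are illustrative parameters, and an even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).
    Assumes d_model is even."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # one frequency per dimension pair
    angles = positions * angle_rates                        # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)  # (128, 64); added to token embeddings before the first attention layer
```
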
Q106. How do transformers process sequences without relying on recurrence or convolution?

Q107. What challenges arise in modeling token position, and how do positional encodings address them?

Q108. How do different types of positional encodings (absolute vs. relative) affect model behavior?

Q109. Why are positional encodings necessary in self-attention mechanisms?

Q110. How does positional information influence the model’s ability to understand syntax and grammar?

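A quick way to ground Q106–Q110: without positional encodings, self-attention is permutation-equivariant, so token order carries no signal. A minimal NumPy sketch with random toy weights and illustrative shapes demonstrating this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))          # 5 tokens, no positional information added
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                      # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

perm = rng.permutation(5)
out, out_perm = self_attention(X), self_attention(X[perm])
# Permuting the input only permutes the output rows: the model cannot tell orderings apart.
print(np.allclose(out[perm], out_perm))  # True
```
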
Q111. How does multi-head attention help capture different types of relationships in a sequence?

Q112. What is the difference between single-head and multi-head attention in transformer models?

Q113. How are attention outputs from multiple heads combined to form the final representation?

Q114. Why is dimensionality splitting important in multi-head attention?

Q115. How does multi-head attention contribute to the scalability and expressiveness of LLMs?

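For Q111–Q115, a minimal NumPy sketch of how multi-head attention splits `d_model` into `num_heads` subspaces, attends in each, and concatenates the results; the shapes and random weights are illustrative, not tied to any particular library.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads                      # dimensionality split per head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (num_heads, seq_len, d_head) so each head attends in its own subspace.
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # per-head softmax over keys
    heads = weights @ Vh                               # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model) and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 16, 6, 4
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]   # Wq, Wk, Wv, Wo
out = multi_head_attention(rng.normal(size=(seq_len, d_model)), *W, num_heads)
print(out.shape)  # (6, 16)
```
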
Q116. Why is softmax used to convert attention scores into probability distributions?

Q117. How does the softmax function ensure that attention weights sum to one?

Q118. What role does scaling (e.g., by √dₖ) play before applying softmax in scaled dot-product attention?

Q119. How does the temperature of the softmax affect attention sharpness in transformer models?

Q120. What are the numerical stability concerns when applying softmax in attention, and how are they addressed?

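Q116–Q120 all revolve around the same few lines of arithmetic. A minimal NumPy sketch of scaled, temperature-controlled, numerically stable softmax over attention scores; the `temperature` parameter and the max-subtraction trick are the points being illustrated.

```python
import numpy as np

def attention_weights(Q, K, temperature=1.0):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # scaling keeps score variance ~1, avoiding a saturated softmax
    scores = scores / temperature           # lower temperature -> sharper (more peaked) weights
    # Subtracting the row-wise max leaves softmax unchanged but prevents exp() overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1: a distribution over keys

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))
for t in (0.5, 1.0, 2.0):
    w = attention_weights(Q, K, temperature=t)
    print(t, w.sum(axis=-1), w.max())   # sums are all 1; the largest weight grows as temperature drops
```
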
Q121. Why is the dot product used to compute similarity between queries and keys in attention?

Q122. How does the scaled dot-product attention mechanism prevent large gradient values?

Q123. What is the geometric interpretation of using dot products in self-attention?

Q124. How does the dot product differ from other similarity measures like cosine similarity in attention contexts?

Q125. Why is normalization applied after computing dot product scores in attention mechanisms?

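For Q121–Q125, a small NumPy comparison of dot-product and cosine similarity between a query and a set of keys; the toy vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))

dot = K @ q                                                      # unnormalized: sensitive to vector magnitude
cosine = dot / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))   # magnitude-invariant, in [-1, 1]

# Geometrically, the dot product is |q||k|cos(theta): alignment scaled by length,
# which is why attention rescales scores (by sqrt(d_k)) and normalizes them with softmax afterwards.
print(np.round(dot, 3))
print(np.round(cosine, 3))
```
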
Q126. How does cross-entropy loss measure the difference between predicted and actual probability distributions?

Q127. Why is cross-entropy preferred over mean squared error in classification tasks like language modeling?

Q128. How does minimizing cross-entropy help improve the likelihood of generating correct tokens?

Q129. What does a lower cross-entropy loss indicate about a language model’s performance?

Q130. How is cross-entropy computed in sequence models across multiple time steps?

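For Q126–Q130, a minimal NumPy sketch of token-level cross-entropy averaged over time steps; the logits and target ids are toy values, and real models compute this over the full vocabulary.

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """logits: (T, V) unnormalized scores; targets: (T,) correct token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))   # log-softmax
    # Negative log-likelihood of the correct token at each step, averaged over the sequence.
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

rng = np.random.default_rng(0)
T, V = 6, 10                       # 6 time steps, vocabulary of 10 tokens
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
loss = sequence_cross_entropy(logits, targets)
print(loss, np.exp(loss))          # exp(mean cross-entropy) is the per-token perplexity
```
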
Q131. How does backpropagation update the embedding matrix during training?

Q132. What role does the embedding lookup table play in gradient flow for language models?

Q133. How are gradients aggregated when the same token appears multiple times in a sequence?

Q134. Why is the embedding layer treated as a trainable parameter in LLMs?

Q135. How do optimizers like Adam affect the learning of embedding vectors during fine-tuning?

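Q131–Q135 concern how gradients reach the embedding table. A minimal NumPy sketch of the backward pass through an embedding lookup, showing how gradients for a repeated token id are summed into the same row; `grad_output` stands in for whatever gradient arrives from the layers above.

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))   # trainable lookup table

token_ids = np.array([3, 7, 3, 1])                   # token 3 appears twice
hidden = embedding[token_ids]                        # forward pass: row gather, shape (4, d_model)

grad_output = rng.normal(size=hidden.shape)          # gradient flowing back into the lookup
grad_embedding = np.zeros_like(embedding)
# The backward of a gather is a scatter-add: repeated ids accumulate their gradients.
np.add.at(grad_embedding, token_ids, grad_output)

print(np.allclose(grad_embedding[3], grad_output[0] + grad_output[2]))  # True
embedding -= 0.01 * grad_embedding                   # plain SGD step; Adam would also rescale per row
```
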
Q136. How does the Jacobian relate to gradient flow through non-linear layers in transformers?

Q137. Why is the Jacobian important when computing gradients in multi-head attention mechanisms?

Q138. How does automatic differentiation use the Jacobian in training large language models?

Q139. What are the computational challenges of handling large Jacobian matrices in deep networks?

Q140. In what scenarios does analyzing the Jacobian help diagnose vanishing or exploding gradients in transformers?

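For Q136–Q140, a small, concrete Jacobian: the softmax used inside attention has Jacobian `diag(p) - p pᵀ`, and checking it numerically shows what automatic differentiation propagates through that layer. The finite-difference check below is purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)   # J[i, j] = p_i * (delta_ij - p_j)

x = np.array([0.5, -1.0, 2.0, 0.0])
J = softmax_jacobian(x)

# Finite-difference check of the analytic Jacobian (what autodiff computes column by column).
eps = 1e-6
J_num = np.stack(
    [(softmax(x + eps * np.eye(len(x))[i]) - softmax(x - eps * np.eye(len(x))[i])) / (2 * eps)
     for i in range(len(x))],
    axis=1,
)
print(np.allclose(J, J_num, atol=1e-6))  # True
```
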
Q141. Why are eigenvectors used in Principal Component Analysis (PCA) for feature extraction?

Q142. How do eigenvalues help determine the amount of variance captured by each principal component?

Q143. What does it mean when an eigenvalue is close to zero in the context of data compression?

Q144. How is the covariance matrix involved in computing eigenvectors for dimensionality reduction?

Q145. In what ways do eigendecomposition techniques improve efficiency in high-dimensional data processing?

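For Q141–Q145, a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix; the data is a random toy matrix, and in practice SVD is often preferred for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data, shape (n, d)

Xc = X - X.mean(axis=0)                       # center each feature
cov = Xc.T @ Xc / (len(X) - 1)                # covariance matrix, (d, d)
eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix -> real eigenpairs, ascending order
order = np.argsort(eigvals)[::-1]             # sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()           # each eigenvalue's share of total variance
k = 2
Z = Xc @ eigvecs[:, :k]                       # project onto the top-k principal components
print(np.round(explained, 3), Z.shape)        # near-zero eigenvalues mark directions that are safe to drop
```
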
Q146. How does KL divergence measure the difference between two probability distributions in model training?

Q147. Why is KL divergence used in regularization for models like BERT or in variational inference?

Q148. How does minimizing KL divergence help align a student model with a teacher model in distillation?

Q149. What are the limitations of KL divergence when distributions have non-overlapping support?

Q150. How is KL divergence used in reinforcement learning from human feedback (RLHF) for aligning LLMs?

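Finally, for Q146–Q150, a minimal NumPy sketch of KL divergence between two categorical distributions, including the temperature-softened form used in distillation; the temperature `T`, the toy logits, and the `eps` clipping are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0) when supports barely overlap
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

teacher_logits = np.array([2.0, 0.5, -1.0, 0.0])
student_logits = np.array([1.5, 0.8, -0.5, 0.2])

T = 2.0                                         # distillation temperature softens both distributions
p_teacher = softmax(teacher_logits, T)
q_student = softmax(student_logits, T)
print(kl_divergence(p_teacher, q_student))      # the quantity a distillation loss drives toward zero
```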

