IVQ 101–150
101–105: Positional Encodings
Q101. How do positional encodings allow transformers to capture the order of tokens in a sequence? A101: They add position-specific vectors to token embeddings, injecting order info so the model distinguishes “dog bites man” from “man bites dog.”
Q102. What is the difference between sinusoidal and learned positional encodings? A102: Sinusoidal encodings use fixed, periodic functions; learned encodings use trainable vectors, adapting position signals during training.
Q103. Why are positional encodings essential in transformer models that lack recurrence? A103: Without recurrence or convolution, transformers process tokens in parallel — positional encodings add the sequential structure they’d otherwise miss.
Q104. How do relative positional encodings improve upon absolute positional encodings in attention mechanisms? A104: They model token relationships based on distance, not just absolute index — improving generalization to unseen lengths and repeated structures.
Q105. In what ways do positional encodings affect performance in long-context or streaming transformer models? A105: Poor positional handling limits context generalization; relative or rotary encodings help scale to longer sequences and streaming inputs.
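A minimal NumPy sketch for Q101–Q102 (function and variable names are illustrative, not from any specific library): each position gets a fixed vector of sines and cosines at geometrically spaced frequencies, added to the token embeddings; a learned alternative would simply be a trainable (max_len, d_model) matrix indexed by position.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed (non-learned) positional encodings: one d_model-dim vector per position."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2), i.e. 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Token embeddings (random stand-ins here) plus position information:
token_emb = np.random.randn(10, 64)            # 10 tokens, d_model = 64
x = token_emb + sinusoidal_positional_encoding(10, 64)
```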
106–110: Sequence Modeling & Syntax
Q106. How do transformers process sequences without relying on recurrence or convolution? A106: Self-attention connects all tokens directly in parallel, mixing information globally at each layer.
Q107. What challenges arise in modeling token position, and how do positional encodings address them? A107: The challenge is differentiating shuffled input; encodings give unique position signals so patterns like word order can be learned.
Q108. How do different types of positional encodings (absolute vs. relative) affect model behavior? A108: Absolute encodings fix position, good for static text. Relative encodings adapt better to varying lengths and repeated phrases.
Q109. Why are positional encodings necessary in self-attention mechanisms? A109: Self-attention alone is permutation-equivariant: shuffling the input tokens just shuffles the outputs, so without positional information the model cannot tell different word orders apart (see the sketch after this group).
Q110. How does positional information influence the model’s ability to understand syntax and grammar? A110: It lets the model infer phrase structures and dependencies, e.g., who’s the subject, object, or verb.
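To make Q109 concrete, here is a small self-contained NumPy check (toy random weights, no real model) that plain self-attention is permutation-equivariant: shuffling the input rows just shuffles the output rows, so order is invisible without positional encodings.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                       # 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(x, Wq, Wk, Wv)
out_perm = self_attention(x[perm], Wq, Wk, Wv)

# The output of the shuffled sequence is just the shuffled output: order is invisible.
assert np.allclose(out[perm], out_perm)
```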
111–115: Multi-Head Attention
Q111. How does multi-head attention help capture different types of relationships in a sequence? A111: Each head learns to focus on different aspects — syntax, semantics, word dependencies — in parallel.
Q112. What is the difference between single-head and multi-head attention in transformer models? A112: Single-head attention computes one attention pattern over the full model dimension; multi-head splits the dimensions into several subspaces, each with its own projections and attention pattern, giving diverse views and greater expressiveness at the same cost.
Q113. How are attention outputs from multiple heads combined to form the final representation? A113: The heads’ outputs are concatenated and linearly transformed to merge their learned relations.
Q114. Why is dimensionality splitting important in multi-head attention? A114: It keeps total computation constant — each head works in a smaller subspace, avoiding parameter explosion.
Q115. How does multi-head attention contribute to the scalability and expressiveness of LLMs? A115: It enables richer context modeling while remaining parallelizable and modular for large models.
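A compact NumPy sketch of Q111–Q114 (shapes and names are illustrative): the model dimension is split across heads, each head runs scaled dot-product attention in its own subspace, and the concatenated head outputs are merged by a final linear projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                         # dimensionality split (Q114)

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ v                           # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads (Q113)
    return concat @ Wo                                    # final linear merge

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(6, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads=4)  # shape (6, 16)
```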
116–120: Softmax & Stability
Q116. Why is softmax used to convert attention scores into probability distributions? A116: It normalizes raw scores into weights summing to one — easy to interpret as “how much to attend.”
Q117. How does the softmax function ensure that attention weights sum to one? A117: By exponentiating scores and dividing by their sum — guaranteeing a valid probability simplex.
Q118. What role does scaling (e.g., by √dₖ) play before applying softmax in scaled dot-product attention? A118: It prevents large dot products from saturating softmax — stabilizing gradients and training.
Q119. How does the temperature of the softmax affect attention sharpness in transformer models? A119: Temperature divides the logits before softmax: low temperature sharpens the distribution, concentrating attention on the top scores, while high temperature flattens it. The 1/√dₖ scaling acts as a fixed, built-in temperature.
Q120. What are the numerical stability concerns when applying softmax in attention, and how are they addressed? A120: Large values can cause overflow. Subtracting the max logit before exponentiation stabilizes computation.
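A short sketch tying Q116–Q120 together (the temperature argument is illustrative): subtracting the max logit before exponentiation prevents overflow, and dividing by a temperature controls sharpness, with the 1/√dₖ scaling acting like a fixed temperature.

```python
import numpy as np

def stable_softmax(logits, temperature=1.0):
    """Numerically stable softmax: shift by the max so exp() never overflows."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # max-subtraction trick (Q120)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1, 1000.0])   # a huge logit would overflow a naive exp()
print(stable_softmax(scores))                 # still a valid distribution summing to 1
print(stable_softmax(scores[:3], temperature=0.5))  # lower T -> sharper weights (Q119)
print(stable_softmax(scores[:3], temperature=5.0))  # higher T -> flatter weights
```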
121–125: Dot Product & Normalization
Q121. Why is the dot product used to compute similarity between queries and keys in attention? A121: It’s efficient, differentiable, and measures alignment in high-dimensional space.
Q122. How does the scaled dot-product attention mechanism prevent large gradient values? A122: Dividing by √dₖ keeps the variance of the scores near one regardless of dimension, so the logits stay in a range where softmax is not saturated and gradients remain well-scaled during backprop.
Q123. What is the geometric interpretation of using dot products in self-attention? A123: Each score equals ‖q‖‖k‖cos θ, the projection of the query onto the key: aligned directions score high and orthogonal ones near zero, so attention measures directional alignment weighted by magnitude.
Q124. How does the dot product differ from other similarity measures like cosine similarity in attention contexts? A124: Cosine normalizes vectors’ magnitudes; dot product includes length, so scale matters too.
Q125. Why is normalization applied after computing dot product scores in attention mechanisms? A125: Softmax normalizes scores so they can serve as interpretable, differentiable weights for mixing values.
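A small NumPy check for Q121–Q124: the dot product equals ‖q‖‖k‖cos θ, so it differs from cosine similarity only by the vector magnitudes, and dividing by √dₖ keeps the spread of the scores roughly constant as dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

q, k = rng.normal(size=3), rng.normal(size=3)
dot = q @ k
cosine = dot / (np.linalg.norm(q) * np.linalg.norm(k))
print(dot, cosine)   # dot = cosine * |q| * |k|: magnitude matters for the dot product

# Raw dot products of random d-dim vectors have standard deviation ~ sqrt(d);
# dividing by sqrt(d) keeps the spread roughly constant across dimensions (Q122).
for d in (16, 256, 4096):
    q_batch = rng.normal(size=(1000, d))
    k_batch = rng.normal(size=(1000, d))
    raw = np.einsum("nd,nd->n", q_batch, k_batch)
    print(d, raw.std(), (raw / np.sqrt(d)).std())   # raw std grows; scaled stays near 1
```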
126–130: Cross-Entropy
Q126. How does cross-entropy loss measure the difference between predicted and actual probability distributions? A126: It quantifies the negative log-likelihood of the true class under the model’s predicted distribution.
Q127. Why is cross-entropy preferred over mean squared error in classification tasks like language modeling? A127: Cross-entropy matches the probabilistic framing of next-token prediction (minimizing it is maximum-likelihood training) and penalizes confident wrong predictions sharply; MSE treats the output as a continuous target and yields weak, poorly scaled gradients with softmax outputs.
Q128. How does minimizing cross-entropy help improve the likelihood of generating correct tokens? A128: It pushes the model to assign higher probability to the correct next token at each step.
Q129. What does a lower cross-entropy loss indicate about a language model’s performance? A129: The model’s predictions align well with ground truth — it’s more certain and accurate.
Q130. How is cross-entropy computed in sequence models across multiple time steps? A130: The per-token losses are summed or averaged over all positions; each prediction is conditioned on the preceding ground-truth tokens (teacher forcing), and the per-token losses simply add.
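A minimal sketch of Q126–Q130 with toy numbers: the sequence loss is the average negative log-probability the model assigned to each ground-truth token.

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """Average per-token cross-entropy. logits: (seq_len, vocab), targets: (seq_len,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]             # -log p(correct token)
    return nll.mean()                                              # averaged over time steps

rng = np.random.default_rng(0)
vocab, seq_len = 10, 4
logits = rng.normal(size=(seq_len, vocab))   # toy model outputs
targets = np.array([3, 1, 7, 2])             # ground-truth token IDs
print(sequence_cross_entropy(logits, targets))
# A model that puts probability ~1 on every correct token drives this loss toward 0 (Q129).
```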
131–135: Embeddings & Gradients
Q131. How does backpropagation update the embedding matrix during training? A131: Gradients flow from the loss through embeddings, adjusting token vectors to reduce prediction error.
Q132. What role does the embedding lookup table play in gradient flow for language models? A132: It maps token IDs to vectors — during backprop, only the vectors for used tokens receive updates.
Q133. How are gradients aggregated when the same token appears multiple times in a sequence? A133: Their gradients are summed — ensuring repeated tokens update their shared vector correctly.
Q134. Why is the embedding layer treated as a trainable parameter in LLMs? A134: It adapts word representations to the target domain and task.
Q135. How do optimizers like Adam affect the learning of embedding vectors during fine-tuning? A135: Adam adapts learning rates per dimension, helping embeddings converge faster and more smoothly.
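A NumPy sketch of Q131–Q133 (the incoming gradient here is a random stand-in for whatever arrives from the loss): the embedding table is indexed by token ID on the forward pass, only the looked-up rows receive gradient on the backward pass, and repeated tokens accumulate their contributions by summation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
E = rng.normal(size=(vocab_size, d_model))        # trainable embedding table

token_ids = np.array([4, 7, 4, 2])                # token 4 appears twice
x = E[token_ids]                                  # forward: lookup, shape (4, d_model)

# Pretend this gradient arrived from the loss w.r.t. the looked-up vectors:
grad_x = rng.normal(size=x.shape)

grad_E = np.zeros_like(E)
np.add.at(grad_E, token_ids, grad_x)              # repeated IDs sum their gradients (Q133)

# Only rows 2, 4 and 7 are non-zero; row 4 got the sum of two contributions.
assert np.allclose(grad_E[4], grad_x[0] + grad_x[2])

E -= 0.01 * grad_E    # plain SGD step; Adam would additionally rescale per dimension (Q135)
```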
136–140: Jacobians & Gradients
Q136. How does the Jacobian relate to gradient flow through non-linear layers in transformers? A136: The Jacobian is the matrix of partial derivatives of a layer's outputs with respect to its inputs; during backprop, the upstream gradient is multiplied by each layer's Jacobian, so its conditioning determines how well the loss signal flows.
Q137. Why is the Jacobian important when computing gradients in multi-head attention mechanisms? A137: Backprop through attention multiplies the loss gradient by the Jacobians of the attention outputs with respect to the query, key, and value projections, so those Jacobians determine how each weight matrix is updated.
Q138. How does automatic differentiation use the Jacobian in training large language models? A138: Reverse-mode autodiff chains local Jacobians layer by layer via vector-Jacobian products, computing exact gradients without ever building the full matrices.
Q139. What are the computational challenges of handling large Jacobian matrices in deep networks? A139: A layer with n inputs and m outputs has an m×n Jacobian, far too large to store explicitly for wide layers; backprop sidesteps this by propagating vector-Jacobian products through the chain rule instead.
Q140. In what scenarios does analyzing the Jacobian help diagnose vanishing or exploding gradients in transformers? A140: It reveals whether layer transformations squash or amplify signals, helping tune initialization and normalization.
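A concrete check for Q136–Q139 using the softmax Jacobian: reverse-mode autodiff never materializes the full Jacobian; it propagates vector-Jacobian products, which for softmax have a cheap closed form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)

# Full Jacobian of softmax: J[i, j] = s_i * (delta_ij - s_j), an n x n matrix.
J = np.diag(s) - np.outer(s, s)

# Upstream gradient dL/ds from some loss; backprop needs dL/dz = v @ J.
v = np.array([0.1, -0.3, 0.2])
vjp_full = v @ J                          # uses the explicit Jacobian
vjp_cheap = s * (v - v @ s)               # same result without ever building J

assert np.allclose(vjp_full, vjp_cheap)   # backprop takes the cheap route
```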
141–145: PCA & Eigenvectors
Q141. Why are eigenvectors used in Principal Component Analysis (PCA) for feature extraction? A141: They define directions of maximum variance — the principal components.
Q142. How do eigenvalues help determine the amount of variance captured by each principal component? A142: Larger eigenvalues mean that component explains more data variance.
Q143. What does it mean when an eigenvalue is close to zero in the context of data compression? A143: The dimension adds little new info — it can be discarded for compact representations.
Q144. How is the covariance matrix involved in computing eigenvectors for dimensionality reduction? A144: Its eigenvectors/eigenvalues reveal correlated directions — guiding projection to fewer dimensions.
Q145. In what ways do eigen decomposition techniques improve efficiency in high-dimensional data processing? A145: They reduce feature count while preserving patterns — speeding up learning and storage.
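A minimal PCA sketch for Q141–Q145 (synthetic data for illustration): eigendecompose the covariance matrix, keep the eigenvectors with the largest eigenvalues, and project onto them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D data

Xc = X - X.mean(axis=0)                   # center the data
cov = np.cov(Xc, rowvar=False)            # 5x5 covariance matrix (Q144)
eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix -> real eigen pairs

order = np.argsort(eigvals)[::-1]         # sort by variance explained (Q142)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print(explained)                           # near-zero eigenvalues add little info (Q143)

k = 2
Z = Xc @ eigvecs[:, :k]                    # project onto the top-k principal components
print(Z.shape)                             # (200, 2): compressed representation (Q145)
```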
146–150: KL Divergence
Q146. How does KL divergence measure the difference between two probability distributions in model training? A146: It quantifies how much info is lost when approximating one distribution with another.
Q147. Why is KL divergence used in regularization for models like BERT or in variational inference? A147: It keeps learned distributions close to priors, enforcing useful constraints.
Q148. How does minimizing KL divergence help align a student model with a teacher model in distillation? A148: It makes the student’s output match the teacher’s soft probabilities — transferring subtle behaviors.
Q149. What are the limitations of KL divergence when distributions have non-overlapping support? A149: KL(P‖Q) becomes infinite wherever Q assigns zero probability to outcomes P supports (e.g., a student predicting zero where the teacher is non-zero), making the loss undefined or numerically unstable; smoothing or an epsilon floor is needed in practice.
Q150. How is KL divergence used in reinforcement learning from human feedback (RLHF) for aligning LLMs? A150: It penalizes divergence from a reference policy, balancing exploration with human-preferred responses.
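A NumPy sketch for Q146–Q150 (the temperature and epsilon values are illustrative): forward KL between a teacher and a student distribution, with a small epsilon guarding against the non-overlapping-support blow-up from Q149.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum p * log(p / q); eps guards against q = 0 where p > 0."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

teacher_logits = np.array([4.0, 1.5, 0.2, -1.0])
student_logits = np.array([3.0, 2.0, 0.0, -0.5])

# Distillation-style objective (Q148): match the student to the teacher's softened probabilities.
T = 2.0
p_teacher = softmax(teacher_logits, temperature=T)
q_student = softmax(student_logits, temperature=T)
print(kl_divergence(p_teacher, q_student))   # 0 only when the two distributions match

# Without smoothing, a hard zero in the student where the teacher is non-zero
# would make KL infinite (Q149); eps or label smoothing keeps it finite in practice.
print(kl_divergence(np.array([0.5, 0.5]), np.array([1.0, 0.0])))  # large but finite due to eps
```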