Day 14 Generative AI Quiz: Test Your LLM Knowledge
Welcome to Day 14 of our Generative AI series! This technical quiz tests your understanding of core concepts in transformer architectures, language modeling, bias mitigation, and more. Whether you're preparing for certifications or reinforcing your knowledge, these questions cover essential Gen AI principles with detailed explanations of best practices. Designed for ML engineers and AI practitioners, each question includes real-world context and practical insights.
Table of Contents
- Quiz Overview
- Question 1: Transformer Architecture
- Question 2: Temperature in Sampling
- Question 3: Bias Mitigation
- Question 4: Transfer Learning
- Question 5: Attention Mechanisms
- Question 6: RLHF Fundamentals
- Question 7: Tokenization
- Question 8: Top-p Sampling
- Conclusion & Answers
- References
1. Quiz Overview
- Level: Intermediate (Assumes foundational Gen AI knowledge)
- Format: 8 scenario-based multiple-choice questions
- Focus Areas:
- Transformer architectures
- Sampling techniques
- Ethical considerations
- Optimization methods
- Best Practice Indicators: Each explanation highlights industry standards
2. Question 1: Transformer Architecture
Scenario: You're optimizing a transformer model for real-time translation. Memory constraints require you to modify the attention mechanism. Which approach provides the best trade-off between performance and memory efficiency?
A) Full self-attention
B) Sliding window attention
C) Random sparse attention
D) Causal attention
Answer & Explanation
Correct Answer: B
Explanation:
Sliding window attention (e.g., in Longformer) limits each token's attention to a fixed window of adjacent tokens. This reduces memory from O(n²) to O(n*k) (k=window size), making it ideal for long sequences.
Best Practice:
- Use sliding windows for document-level tasks
- Combine with global attention for key tokens (e.g., [CLS] token)
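- Example: LongformerModel in Hugging Face accepts sequences of up to 4,096 tokens, with a local attention window of 512 tokens by default in the released checkpoints
Common Mistake: Full attention (A) needs O(n²) memory, which becomes the bottleneck as sequences grow well past a few hundred tokens.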
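To make the O(n²) versus O(n*k) comparison concrete, the sketch below counts how many query-key score entries each scheme computes. It is a simplified illustration rather than Longformer's implementation: the function names are hypothetical, and it assumes a symmetric window of `window // 2` tokens on each side of every position.

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n * n score entries."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Each token attends to itself plus up to window // 2 neighbours per side."""
    half = window // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)        # clip the window at the sequence start
        hi = min(n - 1, i + half)    # clip the window at the sequence end
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    n, window = 4096, 512            # illustrative sizes, not a recommendation
    print("full attention entries:", full_attention_pairs(n))
    print("sliding window entries:", sliding_window_pairs(n, window))
```

The sliding-window count grows linearly with `n` for a fixed window, which is where the memory saving comes from.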
3. Question 2: Temperature in Sampling
Scenario: Your chatbot generates repetitive responses. Adjusting which hyperparameter will increase output diversity without sacrificing coherence?
A) Top-k
B) Temperature
C) Top-p
D) Repetition penalty
Answer & Explanation
Correct Answer: B
Explanation:
Temperature (τ) controls randomness in softmax: softmax(logits / τ). Higher τ (>1.0) flattens probabilities for diverse outputs, while lower τ (<1.0) sharpens distributions.
Best Practice:
- Start with τ=0.7 for balanced creativity
- For technical content: τ=0.2–0.5
- For creative writing: τ=0.8–1.2
- Code example:
model.generate(input_ids, temperature=0.8, do_sample=True)
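The effect of τ on the distribution can be checked by hand. The following is a minimal sketch of softmax(logits / τ) in plain Python; the function name and the example logits are made up for illustration and are not taken from any library.

```python
import math


def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau) as a list of probabilities."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]                 # hypothetical next-token logits
    for tau in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, tau)
        print(tau, [round(p, 3) for p in probs])
```

As τ rises, the top token's probability shrinks and the tail tokens gain mass, which is the flattening described in the explanation.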
4. Question 3: Bias Mitigation
Scenario: Your fine-tuned model associates "doctor" with "he" and "nurse" with "she." Which technique directly counters this bias?
A) Data augmentation
B) Adversarial debiasing
C) Prompt engineering
D) Layer normalization
Answer & Explanation
Correct Answer: B
Explanation:
Adversarial debiasing trains the model to make predictions while minimizing an adversary's ability to detect protected attributes (e.g., gender).
Best Practices:
- Pretrain with balanced datasets (e.g., Wikipedia + counterfactual augmentation)
- Use loss terms like: loss = task_loss - λ * adversary_loss
- Test with benchmarks: BOLD, HONEST
Common Pitfall: Prompt engineering (C) masks symptoms but doesn't fix model weights.
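The loss term in the best practices can be read as a two-player objective. The notation below is an assumption added for illustration: θ stands for the predictor's parameters and φ for the adversary's, neither of which is named in the quiz.

```latex
\begin{aligned}
\text{adversary:}\quad & \min_{\phi}\; \mathcal{L}_{\text{adv}}(\theta, \phi) \\
\text{predictor:}\quad & \min_{\theta}\; \mathcal{L}_{\text{task}}(\theta) - \lambda\, \mathcal{L}_{\text{adv}}(\theta, \phi)
\end{aligned}
```

The adversary tries to recover the protected attribute from the predictor's outputs or representations, while the predictor is rewarded when the adversary's loss is high. λ sets how strongly debiasing is weighted against task accuracy.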
5. Question 4: Transfer Learning
Scenario: You have 1,000 labeled legal documents for contract analysis. Which transfer learning strategy delivers optimal accuracy?
A) Train from scratch
B) Feature extraction (frozen base)
C) Full fine-tuning
D) LoRA (Low-Rank Adaptation)
Answer & Explanation
Correct Answer: D
Explanation:
LoRA freezes pretrained weights and injects trainable low-rank matrices into attention layers. The LoRA paper reports quality on par with full fine-tuning on several benchmarks while training orders of magnitude fewer parameters (up to 10,000x fewer for GPT-3 175B).
Best Practice:
- Use r=8 (rank) for most tasks
- Target only attention matrices: peft_config = LoraConfig(r=8, target_modules=["q_proj", "v_proj"])
- Example: peft.LoraModel in the peft library
Why Not (C)? Full fine-tuning risks catastrophic forgetting with small datasets.
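The parameter saving follows from simple arithmetic: for a weight matrix of shape d_out × d_in, LoRA trains two factors of shapes d_out × r and r × d_in instead of the full matrix. The sketch below counts both; the function names and the 4096 × 4096 projection size are illustrative assumptions, not values from the quiz.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 4096   # hypothetical attention projection size
    r = 8                 # rank suggested in the best practices
    full = full_finetune_params(d_out, d_in)
    lora = lora_params(d_out, d_in, r)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

For this hypothetical projection the adapted matrix trains 65,536 parameters instead of 16,777,216, a 256x reduction per matrix. The overall ratio for a model depends on how many matrices are adapted and how many stay frozen.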
6. Question 5: Attention Mechanisms
Scenario: Processing 10,000-token genomic sequences requires calculating pairwise attention. Which optimization is essential?
A) FlashAttention
B) Multi-query attention
C) Sparse attention
D) ALiBi positional biases
Answer & Explanation
Correct Answer: A
Explanation:
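FlashAttention computes exact attention but uses tiling and recomputation to avoid materializing the full n×n score matrix in GPU memory. This cuts memory reads/writes and makes attention memory linear in sequence length. The original paper reports end-to-end training speedups of up to about 3x over standard attention, with the gain depending on sequence length and hardware.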
Best Practice:
- Enable in Hugging Face Transformers via AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
- Combine with ALiBi (D) for extrapolation beyond trained context
- Hardware requirement: CUDA kernels
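Real-World Use: FlashAttention-2 is widely used to train and serve long-context open models. OpenAI has not published GPT-4's attention implementation, so its use there is unconfirmed.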
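The tiling idea rests on an "online" softmax that processes keys block by block while keeping only a running maximum, a running normalizer, and a running output. The sketch below shows that for a single query with scalar values, in plain Python. It is an illustration of the numerical trick only: the function names are hypothetical, and the fused GPU kernel and the recomputation used in the backward pass are not shown.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / z


def tiled_attention_row(scores, values, block=2):
    """Same result, computed one block of keys at a time with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0    # softmax normalizer, relative to running_max
    running_out = 0.0    # unnormalized weighted sum of values
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


if __name__ == "__main__":
    scores = [0.3, 2.1, -1.0, 0.7, 1.5]   # hypothetical q·k scores for one query
    values = [1.0, 2.0, 3.0, 4.0, 5.0]    # scalar values, for simplicity
    print(naive_attention_row(scores, values))
    print(tiled_attention_row(scores, values, block=2))
```

Both functions should agree up to floating-point rounding, which is why FlashAttention counts as exact attention rather than an approximation.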
7. Question 6: RLHF Fundamentals
Scenario: During RLHF training, your reward model assigns high scores to verbose but incorrect answers. Why might this happen?
A) Over-optimization of KL divergence
B) Misaligned reward hacking
C) High KL penalty coefficient
D) Poor preference data
Answer & Explanation
Correct Answer: D
Explanation:
Reward models inherit biases from human preference data. If annotators favor lengthy responses, the model learns to prioritize verbosity over correctness.
Best Practices:
- Curate preference pairs with unambiguous criteria
- Use ranking systems (e.g., Elo) instead of absolute scores
- Annotator training example:
"Rate accuracy over style; penalize factual errors"
Solution: Augment data with correctness-focused samples.
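A quick way to check whether preference data leans toward verbosity is to measure how often the chosen response is simply the longer one. The sketch below is a hypothetical diagnostic, not part of any RLHF library: the function name, the word-count heuristic, and the sample pairs are all assumptions for illustration.

```python
def longer_chosen_rate(pairs):
    """Fraction of (chosen, rejected) pairs where the chosen text has more words."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    longer = sum(
        1
        for chosen, rejected in pairs
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(pairs)


if __name__ == "__main__":
    # Hypothetical preference pairs, for illustration only.
    pairs = [
        ("Paris is the capital of France, a country in Western Europe.", "Paris."),
        ("The answer is 4.", "The answer is probably somewhere around 5 or so."),
        ("Water boils at 100 degrees Celsius at sea level.", "At 90 degrees."),
    ]
    print(f"chosen is longer in {longer_chosen_rate(pairs):.0%} of pairs")
```

A rate far above 50% does not prove length bias on its own, since longer answers can be better, but it is a signal to audit a sample of pairs for correctness.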
8. Question 7: Tokenization
Scenario: Your multilingual model performs poorly on agglutinative languages like Finnish. Which tokenizer change would help?
A) Switch to byte-level BPE
B) Increase vocabulary size
C) Use Unigram LM tokenization
D) Add language-specific prefixes
Answer & Explanation
Correct Answer: C
Explanation:
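Unigram LM tokenization (available in SentencePiece) starts from a large candidate vocabulary, prunes it by likelihood, and segments each word into its most probable sequence of subwords. It can also sample alternative segmentations during training (subword regularization), which tends to help with rare and morphologically rich words.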
Best Practice:
- Vocabulary size: 100K–200K for multilingual models
- Example for Finnish: tokenizer = SentencePieceProcessor(model_file="spm_finnish.model")
- Avoid byte-level BPE (A) for agglutinative languages: it tends to fragment long words into many pieces.
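At inference time a Unigram tokenizer picks the segmentation with the highest total log-probability, which can be found with a Viterbi search. The sketch below shows that search on a toy example. The vocabulary and its log-probabilities are invented for illustration (a trained SentencePiece model would supply real ones), and the function assumes every single character of the word is in the vocabulary.

```python
def viterbi_segment(word, logp):
    """Return the most probable split of `word` into pieces found in `logp`."""
    n = len(word)
    best = [float("-inf")] * (n + 1)   # best[i]: best score for word[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    pieces = []
    pos = n
    while pos > 0:                     # walk the back-pointers from the end
        pieces.append(word[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


if __name__ == "__main__":
    # Hypothetical log-probabilities for a toy Finnish vocabulary.
    toy_logp = {
        "talo": -2.0, "i": -3.0, "ssa": -2.5, "is": -6.0, "sa": -5.0,
        "t": -6.0, "a": -5.0, "l": -6.0, "o": -6.0, "s": -5.0,
    }
    print(viterbi_segment("taloissa", toy_logp))   # ['talo', 'i', 'ssa']
```

With these toy scores, "taloissa" ("in the houses") splits into stem, plural marker, and case ending, the kind of morpheme-like segmentation that helps on agglutinative languages.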
9. Question 8: Top-p Sampling
Scenario: Generating a story requires creative but coherent text. Which configuration ensures variable-length sampling while avoiding nonsensical outputs?
A) Top-p = 0.5, Top-k = 10
B) Top-p = 0.9, Top-k = 50
C) Top-p = 0.95, Top-k = 0
D) Top-p = 0.5, Top-k = 0
Answer & Explanation
Correct Answer: C
Explanation:
Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability ≥ p. Using p=0.95 includes plausible alternatives while excluding low-probability nonsense.
Best Practice:
- Top-p > 0.9 for creative tasks
- Disable top-k (k=0) to rely solely on probability mass
- Avoid low top-p values (A,D): overly restrictive
- Configuration:
model.generate(inputs, do_sample=True, top_p=0.95, top_k=0, temperature=0.8)
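The "smallest set whose cumulative probability ≥ p" rule is easy to see in a few lines of Python. This is a minimal sketch of the selection step only, not the Hugging Face implementation: the function name and the example distributions are hypothetical, and renormalizing and sampling from the kept tokens is left out.

```python
def nucleus_indices(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability is >= p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # model is confident: nucleus is tiny
    flat = [0.25, 0.25, 0.25, 0.25]     # model is unsure: nucleus is wide
    print(len(nucleus_indices(peaked, 0.9)), len(nucleus_indices(flat, 0.9)))
```

The nucleus shrinks when the model is confident and widens when it is not, which is the variable-size behaviour that a fixed top-k cannot provide.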
10. Conclusion & Answers
Summary of Correct Answers:
- Question 1: B – Sliding window attention
- Question 2: B – Temperature
- Question 3: B – Adversarial debiasing
- Question 4: D – LoRA
- Question 5: A – FlashAttention
- Question 6: D – Poor preference data
- Question 7: C – Unigram tokenization
- Question 8: C – Top-p=0.95, Top-k=0
Scoring:
- 7-8 correct: Expert level!
- 5-6 correct: Strong conceptual foundation
- <5: Review transformer papers and Hugging Face docs
Continue experimenting with these concepts in frameworks like PyTorch and libraries like 🤗 Transformers.
11. References
- Vaswani et al. (2017). Attention Is All You Need
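- Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Wolf, T. et al. Hugging Face Transformers Library
- Rae, J. et al. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- Bender, E. M. & Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science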