Pre-enroll today!
CS 1998, PRJ 608 · Class number 18589
1 Credit · 7 Weeks First · S/U Grading · Open Enrollment
CS 1998, PRJ 608 · Class number 18589
CS 1998: Intro to AI Safety & Alignment is a student-led course that explores why advanced AI systems can fail in unexpected and dangerous ways. We begin by building a solid understanding of how modern language models are trained, from pretraining on web-scale data through supervised fine-tuning and reinforcement learning from human feedback. From there, we turn to the core question: how do we ensure these systems do what we actually want? Students will learn key technical ideas in mechanistic interpretability (reverse-engineering model internals to understand what they've learned), reward learning (how optimization pressure can produce unintended behaviors like sycophancy and reward hacking), red teaming and adversarial evaluation (systematically probing models for failure modes), and scalable oversight (supervising systems that may exceed human-level performance on the tasks we're evaluating them on).
This course is technically focused, so some prior knowledge of linear algebra, machine learning (particularly the transformer architecture), and proficiency in Python are assumed. That said, we spend significant time building intuition behind these ideas, so students from non-technical backgrounds who are motivated to engage with the material are also welcome.
Undergraduates who are curious about how modern AI systems can fail, concerned about the long-term risks of advanced AI, or looking to understand the technical foundations behind ongoing safety research.
This course is technically focused, so some prior knowledge of linear algebra, machine learning (particularly the transformer architecture), and proficiency in Python are assumed. That said, we spend significant time building intuition behind these ideas, so students from non-technical backgrounds who are motivated to engage with the material are also welcome.
Undergraduates who are curious about how modern AI systems can fail, concerned about the long-term risks of advanced AI, or looking to understand the technical foundations behind ongoing safety research.
Week 1
Motivation & The Training Pipeline
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
LLM visualization and forward pass walkthrough
Visualize a Logit Lens view of hidden states during generation.
Take-Home Notebook
Implement raw sampling (temperature and top-k) and inspect output distributions.
Week 2
Mech Interp I: Causal Tracing
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
Patch clean activations into a corrupted Eiffel Tower prompt to locate which layer restores "Paris".
Take-Home Notebook
TBD
Week 3
Mech Interp II: Understanding the Black Box
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
Steering the direction of non-corrigibility (Jinzhou's project).
Take-Home Notebook
Building a lie detector probe.ipynb
Replicate lie/hallucination detection from internal activations.
Week 4
RL, RLHF, and Goal Misgeneralization
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
Live DPO run with TRL on safe/toxic preference pairs.
Take-Home Notebook
TBD
Week 5
Evals & Red Teaming
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
Automated jailbreaking demo with GCG.
Take-Home Notebook
Jailbreaking CTF: extract a hidden key from a black-box API.
Week 6
Control & Scalable Oversight
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
TBD
Take-Home Notebook
TBD
Week 7
Policy, Trajectory & Careers
Topics
Slide
Slides (TBD)
Recommended Reading
Demo
TBD
Take-Home Notebook
Research proposal: hypothesis, method, and expected result from Weeks 2-6 topics.