Content
What is this course about?
CS 1998: Intro to AI Safety & Alignment is a student-led course that explores why advanced AI systems can fail in unexpected and dangerous ways. We begin by building a solid understanding of how modern language models are trained, from pretraining on web-scale data through supervised fine-tuning and reinforcement learning from human feedback. From there, we turn to the core question: how do we ensure these systems do what we actually want? Students will learn key technical ideas in:
- Mechanistic interpretability: reverse-engineering model internals to understand what they've learned.
- Reward learning: how optimization pressure can produce unintended behaviors like sycophancy and reward hacking.
- Red teaming and adversarial evaluation: systematically probing models for failure modes.
- Scalable oversight: supervising systems that may exceed human-level performance on the tasks we're evaluating them on.
Prerequisites
This course is technically focused, so we assume some familiarity with linear algebra and machine learning (particularly the transformer architecture), as well as proficiency in Python. That said, we spend significant time building intuition for these ideas, so students from non-technical backgrounds who are motivated to engage with the material are also welcome.
Audience
Undergraduates who are curious about how modern AI systems can fail, concerned about the long-term risks of advanced AI, or looking to understand the technical foundations behind ongoing safety research.