Cornell AI Alignment Club
Planning Stage (Fall 2026, tentative)

CS 1998: Intro to AI Safety & Alignment

1 Credit · 7 Weeks (First Half) · S/U Grading · Open Enrollment

Content

What is this course about?

CS 1998: Intro to AI Safety & Alignment is a student-led course that explores why advanced AI systems can fail in unexpected and dangerous ways. We begin by building a solid understanding of how modern language models are trained, from pretraining on web-scale data through supervised fine-tuning and reinforcement learning from human feedback. From there, we turn to the core question: how do we ensure these systems do what we actually want? Students will learn key technical ideas in mechanistic interpretability (reverse-engineering model internals to understand what they've learned), reward learning (how optimization pressure can produce unintended behaviors like sycophancy and reward hacking), red teaming and adversarial evaluation (systematically probing models for failure modes), and scalable oversight (supervising systems that may exceed human-level performance on the tasks we're evaluating them on).

Prerequisites

This course is technically focused, so some prior knowledge of linear algebra and machine learning (particularly the transformer architecture), along with proficiency in Python, is assumed. That said, we spend significant time building intuition for these ideas, so students from non-technical backgrounds who are motivated to engage with the material are also welcome.

Audience

Undergraduates who are curious about how modern AI systems can fail, concerned about the long-term risks of advanced AI, or looking to understand the technical foundations behind ongoing safety research.

Logistics

  • This will be a 1-credit, 7-week S/U course offered in the first half of the semester.
  • This course will be open enrollment (no application required), and we are planning for around 75 seats.

Syllabus (Tentative)

Week 1

Motivation & The Training Pipeline


Topics

  • Motivation for AI safety: examples of specification gaming and toxicity, the orthogonality thesis, and why safety is not guaranteed by default.
  • Transformer architecture: residual stream, multi-head attention, and MLP layers.
  • GPT-2 architecture: decoder-only specifics.
  • Training pipeline: pretraining, supervised fine-tuning (SFT), and RLHF.
  • Pretraining objective and web-scale datasets such as CommonCrawl.
  • How generation works: temperature and top-k sampling.

Slides

Slides (TBD)

Demo

LLM visualization and forward pass walkthrough

Visualize a Logit Lens view of hidden states during generation.
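The mechanic behind the logit lens can be sketched in a few lines: decode each layer's residual-stream state through the unembedding matrix as if it were the final hidden state. This is a toy illustration with random weights and made-up dimensions, not GPT-2's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50                   # toy dimensions, not GPT-2's
W_U = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix

# Pretend residual-stream states for one token position across 4 layers.
hidden_states = [rng.normal(size=d_model) for _ in range(4)]

# Logit lens: decode every intermediate state as if it were the final one,
# revealing how the model's "best guess" token evolves layer by layer.
for layer, h in enumerate(hidden_states):
    logits = h @ W_U
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

In the real demo, `hidden_states` would come from a forward pass (e.g. via hooks) and `W_U` from the trained model, so successive layers converge toward the sampled token.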

Take-Home Notebook

NanoGPT playground

Implement raw sampling (temperature and top-k) and inspect output distributions.
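A minimal sketch of what the notebook asks for, assuming logits arrive as a plain array; function name and defaults are our own choices, not NanoGPT's API:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from raw logits with temperature and top-k filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # Mask everything outside the k highest-scoring tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # Numerically stable softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lowering `temperature` sharpens the distribution toward the argmax; `top_k` zeroes out the tail entirely, so `top_k=1` is greedy decoding.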

Week 2

Mech Interp I: Causal Tracing


Topics

  • Mechanistic interpretability as reverse engineering.
  • Residual stream as the model's communication channel.
  • Attention heads and information flow between token positions.
  • Induction heads and in-context pattern copying.
  • Activation patching (causal tracing) to identify causally important components.

Slides

Slides (TBD)

Demo

Patch clean activations into a corrupted Eiffel Tower prompt to locate which layer restores "Paris".
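The core move of activation patching can be shown without a language model at all. This toy two-layer network (random weights, our own construction) caches a "clean" hidden state and splices it into a "corrupted" forward pass, exactly the operation the demo performs on real transformer activations.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 8))

def forward(x, patch=None):
    """Two-layer toy network; optionally overwrite the hidden activation."""
    h = np.tanh(x @ W1)   # "layer 1" activation
    if patch is not None:
        h = patch         # activation patching: splice in a cached state
    return h @ W2

x_clean = rng.normal(size=8)
x_corrupt = rng.normal(size=8)

h_clean = np.tanh(x_clean @ W1)   # cache the clean run's activation
out_clean = forward(x_clean)
out_corrupt = forward(x_corrupt)
out_patched = forward(x_corrupt, patch=h_clean)
```

Here patching fully restores the clean output because only one layer follows the patch point; in a real model, sweeping the patch over layers and positions reveals which components causally carry the fact (e.g. "Paris").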

Take-Home Notebook

TBD

Week 3

Mech Interp II: Understanding the Black Box


Topics

  • Linear probes for concept detection in internal states.
  • Persona vectors for traits like honesty, deception, and sycophancy.
  • Activation steering to control model behavior.
  • Superposition and feature packing.
  • Sparse autoencoders (SAEs) for disentangling representations.

Slides

Slides (TBD)

Demo

Steering a model along a non-corrigibility direction (Jinzhou's project).

Take-Home Notebook

Building a lie detector probe.ipynb

Replicate lie/hallucination detection from internal activations.
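The probe-training loop can be sketched on synthetic data: we plant a "truth direction" in fake activations and check that a logistic-regression probe recovers it. The dimensions, shift magnitude, and learning rate are arbitrary choices for illustration; the notebook would use real model activations instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# Synthetic "activations": truthful examples (label 1) shifted along the
# planted direction, deceptive ones (label 0) shifted the opposite way.
n = 400
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, truth_dir) * 2.0

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * float(np.mean(p - labels))

preds = (acts @ w + b) > 0
accuracy = float(np.mean(preds == labels))
# The learned probe weights should roughly align with the planted direction.
cosine = float(w @ truth_dir / np.linalg.norm(w))
```

With real activations the labels come from known-true vs. known-false statements, and the interesting question is how well the probe generalizes off-distribution.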

Week 4

RL, RLHF, and Goal Misgeneralization


Topics

  • RL basics: policy, reward, and value functions.
  • RLHF pipeline: preference data collection, reward modeling, PPO/DPO optimization.
  • Reward hacking and Goodhart's law.
  • Goal misgeneralization in competent-but-misaligned agents.
  • Sycophancy as reward hacking toward user approval.

Slides

Slides (TBD)

Demo

Live DPO run with TRL on safe/toxic preference pairs.
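Before the TRL demo, the DPO objective itself is worth seeing in isolation. This sketch computes the per-pair loss from sequence log-probabilities under the policy and a frozen reference model; the example numbers are made up.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin compares the policy's preference shift against the reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# If the policy already prefers the chosen response more than the reference
# model does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
```

At initialization the policy equals the reference, every margin is zero, and the loss sits at `log(2)`; `beta` controls how hard the policy is pushed away from the reference, which is the knob TRL exposes in its trainer.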

Take-Home Notebook

TBD

Week 5

Evals & Red Teaming


Topics

  • Capability and safety benchmarks.
  • Manual and automated red teaming.
  • Jailbreak techniques and refusal bypass patterns.
  • Adversarial suffix attacks and prompt injection.
  • Model organisms and controlled dangerous-trait studies.

Slides

Slides (TBD)

Demo

Automated jailbreaking demo with GCG.
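Stripped of gradients, GCG is a coordinate search: repeatedly swap one suffix token at a time, keeping whichever swap best increases the attack objective. This toy replaces the real objective (log-probability of a harmful completion) with a hidden target string; everything here is our own simplification, and real GCG uses gradients to shortlist candidate swaps.

```python
rng_vocab = list("abcdefgh")
target = "heg"   # hypothetical suffix the toy objective rewards

def score(suffix):
    """Stand-in for the attack objective; counts positions matching the
    hidden target (a real attack would score the victim model's output)."""
    return sum(s == t for s, t in zip(suffix, target))

# Greedy coordinate search: for each position, try every vocabulary token
# and keep the best-scoring swap.
suffix = list("aaa")
for _ in range(10):
    for pos in range(len(suffix)):
        suffix[pos] = max(
            rng_vocab,
            key=lambda tok: score(suffix[:pos] + [tok] + suffix[pos + 1:]),
        )
```

Even this gradient-free version converges to the target in one pass, which is why the search scales: each step only needs forward evaluations of the objective.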

Take-Home Notebook

Jailbreaking CTF: extract a hidden key from a black-box API.

Week 6

Control & Scalable Oversight


Topics

  • Scalable oversight for systems beyond human evaluator capability.
  • Weak-to-strong generalization.
  • AI control: monitoring and containment methods.
  • Anomaly detection over model internals.
  • AI safety via debate.

Slides

Slides (TBD)

Demo

TBD

Take-Home Notebook

TBD

Week 7

Policy, Trajectory & Careers


Topics

  • Scaling laws and trajectory forecasting.
  • Compute governance and frontier compute monitoring.
  • Current research directions in AI safety.
  • Technical and governance career paths.
  • Research proposal synthesis.
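Scaling-law fitting reduces to linear regression once you take logs: a power law L(C) = a · C^(-b) is a straight line in log-log space. This sketch recovers the exponent from synthetic, noiseless data; the constants and FLOP budgets are invented for illustration.

```python
import numpy as np

# Synthetic "loss vs. compute" points following L(C) = a * C^(-b).
a_true, b_true = 10.0, 0.25
compute = np.logspace(18, 24, num=7)   # hypothetical FLOP budgets
loss = a_true * compute ** (-b_true)

# log L = log a - b * log C, so ordinary least squares on the logs
# recovers the scaling exponent and prefactor.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b_fit, a_fit = -slope, float(np.exp(intercept))
```

Real loss curves carry noise and an irreducible-loss floor, so published fits use a form like L = E + a · C^(-b), but the log-log intuition is the same.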

Slides

Slides (TBD)

Demo

TBD

Take-Home Notebook

Research proposal: a hypothesis, method, and expected results drawing on topics from Weeks 2-6.