Cornell AI Alignment Club
Planning Stage (Fall 2026, tentative)

CS 1998: Intro to AI Safety & Alignment

1 Credit · 7 Weeks (First Half) · S/U Grading · Open Enrollment

Content

What is this course about?

CS 1998: Intro to AI Safety & Alignment is a student-led course that explores why advanced AI systems can fail in unexpected and dangerous ways. We begin by building a solid understanding of how modern language models are trained, from pretraining on web-scale data through supervised fine-tuning and reinforcement learning from human feedback. From there, we turn to the core question: how do we ensure these systems do what we actually want? Students will learn key technical ideas in mechanistic interpretability (reverse-engineering model internals to understand what they've learned), reward learning (how optimization pressure can produce unintended behaviors like sycophancy and reward hacking), red teaming and adversarial evaluation (systematically probing models for failure modes), and scalable oversight (supervising systems that may exceed human-level performance on the tasks we're evaluating them on).

Prerequisites

This course is technically focused, so some prior knowledge of linear algebra and machine learning (particularly the transformer architecture), along with proficiency in Python, is assumed. That said, we spend significant time building intuition for these ideas, so students from non-technical backgrounds who are motivated to engage with the material are also welcome.

Audience

Undergraduates who are curious about how modern AI systems can fail, concerned about the long-term risks of advanced AI, or looking to understand the technical foundations behind ongoing safety research.

Logistics

  • This will be a 1-credit, 7-week S/U course offered in the first half of the semester.
  • This course will be open enrollment (no application required), and we are planning for around 75 seats.

Syllabus (Tentative)

Week 1

Motivation & The Training Pipeline


Topics

  • Motivation for AI safety: examples of specification gaming and toxicity, the orthogonality thesis, and why safety is not guaranteed by default.
  • Transformer architecture: residual stream, multi-head attention, and MLP layers.
  • GPT-2 architecture: decoder-only specifics.
  • Training pipeline: pretraining, supervised fine-tuning (SFT), and RLHF.
  • Pretraining objective and web-scale datasets such as CommonCrawl.
  • How generation works: temperature and top-k sampling.

Slides

Slides (TBD)

Demo

LLM visualization and forward pass walkthrough

Visualize a Logit Lens view of hidden states during generation.
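The mechanic behind the logit lens can be sketched in a few lines: decode each layer's residual-stream state through the unembedding matrix as if it were the final hidden state. This is a toy illustration with random weights and made-up dimensions, not GPT-2's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50                   # toy dimensions, not GPT-2's
W_U = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix

# Pretend residual-stream states for one token position across 4 layers.
hidden_states = [rng.normal(size=d_model) for _ in range(4)]

# Logit lens: decode every intermediate state as if it were the final one,
# revealing how the model's "best guess" token evolves layer by layer.
for layer, h in enumerate(hidden_states):
    logits = h @ W_U
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

In the real demo, `hidden_states` would come from a forward pass (e.g. via hooks) and `W_U` from the trained model, so successive layers converge toward the sampled token.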

Take-Home Notebook

NanoGPT playground

Implement raw sampling (temperature and top-k) and inspect output distributions.
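A minimal sketch of what the notebook asks for, assuming logits arrive as a plain array; function name and defaults are our own choices, not NanoGPT's API:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from raw logits with temperature and top-k filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # Mask everything outside the k highest-scoring tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # Numerically stable softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lowering `temperature` sharpens the distribution toward the argmax; `top_k` zeroes out the tail entirely, so `top_k=1` is greedy decoding.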

Week 2

Mech Interp I: Causal Tracing


Topics

  • Mechanistic interpretability as reverse engineering.
  • Residual stream as the model's communication channel.
  • Attention heads and information flow between token positions.
  • Induction heads and in-context pattern copying.
  • Activation patching (causal tracing) to identify causally important components.

Slides

Slides (TBD)

Demo

Patch clean activations into a corrupted Eiffel Tower prompt to locate which layer restores "Paris".
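The core move of activation patching can be shown without a language model at all. This toy two-layer network (random weights, our own construction) caches a "clean" hidden state and splices it into a "corrupted" forward pass, exactly the operation the demo performs on real transformer activations.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 8))

def forward(x, patch=None):
    """Two-layer toy network; optionally overwrite the hidden activation."""
    h = np.tanh(x @ W1)   # "layer 1" activation
    if patch is not None:
        h = patch         # activation patching: splice in a cached state
    return h @ W2

x_clean = rng.normal(size=8)
x_corrupt = rng.normal(size=8)

h_clean = np.tanh(x_clean @ W1)   # cache the clean run's activation
out_clean = forward(x_clean)
out_corrupt = forward(x_corrupt)
out_patched = forward(x_corrupt, patch=h_clean)
```

Here patching fully restores the clean output because only one layer follows the patch point; in a real model, sweeping the patch over layers and positions reveals which components causally carry the fact (e.g. "Paris").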

Take-Home Notebook

TBD

Week 3

Mech Interp II: Understanding the Black Box


Topics

  • Linear probes for concept detection in internal states.
  • Persona vectors for traits like honesty, deception, and sycophancy.
  • Activation steering to control model behavior.
  • Superposition and feature packing.
  • Sparse autoencoders (SAEs) for disentangling representations.

Slides

Slides (TBD)

Demo

Steering a model along a non-corrigibility direction (Jinzhou's project).

Take-Home Notebook

Building a lie detector probe.ipynb

Replicate lie/hallucination detection from internal activations.
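The probe-training loop can be sketched on synthetic data: we plant a "truth direction" in fake activations and check that a logistic-regression probe recovers it. The dimensions, shift magnitude, and learning rate are arbitrary choices for illustration; the notebook would use real model activations instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# Synthetic "activations": truthful examples (label 1) shifted along the
# planted direction, deceptive ones (label 0) shifted the opposite way.
n = 400
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, truth_dir) * 2.0

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * float(np.mean(p - labels))

preds = (acts @ w + b) > 0
accuracy = float(np.mean(preds == labels))
# The learned probe weights should roughly align with the planted direction.
cosine = float(w @ truth_dir / np.linalg.norm(w))
```

With real activations the labels come from known-true vs. known-false statements, and the interesting question is how well the probe generalizes off-distribution.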

Week 4

RL, RLHF, and Goal Misgeneralization


Topics

  • RL basics: policy, reward, and value functions.
  • RLHF pipeline: preference data collection, reward modeling, PPO/DPO optimization.
  • Reward hacking and Goodhart's law.
  • Goal misgeneralization in competent-but-misaligned agents.
  • Sycophancy as reward hacking toward user approval.

Slides

Slides (TBD)

Demo

Live DPO run with TRL on safe/toxic preference pairs.
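Before the TRL demo, the DPO objective itself is worth seeing in isolation. This sketch computes the per-pair loss from sequence log-probabilities under the policy and a frozen reference model; the example numbers are made up.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin compares the policy's preference shift against the reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# If the policy already prefers the chosen response more than the reference
# model does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
```

At initialization the policy equals the reference, every margin is zero, and the loss sits at `log(2)`; `beta` controls how hard the policy is pushed away from the reference, which is the knob TRL exposes in its trainer.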

Take-Home Notebook

TBD

Week 5

Evals & Red Teaming


Topics

  • Capability and safety benchmarks.
  • Manual and automated red teaming.
  • Jailbreak techniques and refusal bypass patterns.
  • Adversarial suffix attacks and prompt injection.
  • Model organisms and controlled dangerous-trait studies.

Slides

Slides (TBD)

Demo

Automated jailbreaking demo with GCG.
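Stripped of gradients, GCG is a coordinate search: repeatedly swap one suffix token at a time, keeping whichever swap best increases the attack objective. This toy replaces the real objective (log-probability of a harmful completion) with a hidden target string; everything here is our own simplification, and real GCG uses gradients to shortlist candidate swaps.

```python
rng_vocab = list("abcdefgh")
target = "heg"   # hypothetical suffix the toy objective rewards

def score(suffix):
    """Stand-in for the attack objective; counts positions matching the
    hidden target (a real attack would score the victim model's output)."""
    return sum(s == t for s, t in zip(suffix, target))

# Greedy coordinate search: for each position, try every vocabulary token
# and keep the best-scoring swap.
suffix = list("aaa")
for _ in range(10):
    for pos in range(len(suffix)):
        suffix[pos] = max(
            rng_vocab,
            key=lambda tok: score(suffix[:pos] + [tok] + suffix[pos + 1:]),
        )
```

Even this gradient-free version converges to the target in one pass, which is why the search scales: each step only needs forward evaluations of the objective.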

Take-Home Notebook

Jailbreaking CTF: extract a hidden key from a black-box API.

Week 6

Control & Scalable Oversight


Topics

  • Scalable oversight for systems beyond human evaluator capability.
  • Weak-to-strong generalization.
  • AI control: monitoring and containment methods.
  • Anomaly detection over model internals.
  • AI safety via debate.

Slides

Slides (TBD)

Demo

TBD

Take-Home Notebook

TBD

Week 7

Policy, Trajectory & Careers


Topics

  • Scaling laws and trajectory forecasting.
  • Compute governance and frontier compute monitoring.
  • Current research directions in AI safety.
  • Technical and governance career paths.
  • Research proposal synthesis.
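Scaling-law fitting reduces to linear regression once you take logs: a power law L(C) = a · C^(-b) is a straight line in log-log space. This sketch recovers the exponent from synthetic, noiseless data; the constants and FLOP budgets are invented for illustration.

```python
import numpy as np

# Synthetic "loss vs. compute" points following L(C) = a * C^(-b).
a_true, b_true = 10.0, 0.25
compute = np.logspace(18, 24, num=7)   # hypothetical FLOP budgets
loss = a_true * compute ** (-b_true)

# log L = log a - b * log C, so ordinary least squares on the logs
# recovers the scaling exponent and prefactor.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b_fit, a_fit = -slope, float(np.exp(intercept))
```

Real loss curves carry noise and an irreducible-loss floor, so published fits use a form like L = E + a · C^(-b), but the log-log intuition is the same.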

Slides

Slides (TBD)

Demo

TBD

Take-Home Notebook

Research proposal: a hypothesis, method, and expected results drawing on topics from Weeks 2-6.