Week 1 - Further Understanding the Risks from AI
In the Alignment 101 course, you learned about different views on how AI might transform our society over the coming decades. You also learned about various threat models and risks associated with advanced AI systems.
This week builds upon that knowledge. In the first reading, Christiano presents a precise formulation of what it means for an advanced AI model to be aligned. In the second reading, IBM gives an accessible overview of the alignment process and of techniques for encoding human values into AI systems. In the third reading, Gebru and Torres examine the ideological movements behind the current race to build AGI.
In the fourth and fifth readings, you will go in-depth into arguments about how far away AGI is, how quickly an AGI system might reach superhuman intelligence, how difficult it will be to align an AGI system, and how optimistic or pessimistic all of this should make us about our prospects of aligning advanced AIs. Among the further readings, Hendrycks surveys threat models beyond those covered in Alignment 101, and Hubinger et al. go deep into one specific threat model that might turn out to be particularly important: that of a deceptive, inner-misaligned AGI.
Core readings:
Christiano provides a precise formulation, along with an analogy, of what he means when he describes an AI system as aligned.
What is AI alignment? (10 mins)
This IBM article describes the process of encoding human values and goals into artificial intelligence models so that systems remain safe, reliable, and helpful.
The TESCREAL bundle: Eugenics and the promise of utopia through artificial general intelligence (Gebru and Torres, 2024) (60 mins) (only sections 1, 4.1 Transhumanism, Singularitarianism, Effective Altruism and Longtermism, 4.2 Table, 5, 6.2, 6.3, 6.4, 7, 8)
Timnit Gebru and Émile P. Torres examine the ideological foundations behind the current race to develop AGI. They argue that the pursuit of AGI is driven by a specific set of interconnected ideologies they label the "TESCREAL bundle": Transhumanism, Extropianism, Singularitarianism, (Modern) Cosmism, Rationalism, Effective Altruism, and Longtermism. The core thesis of the paper is that these ideologies are direct descendants of the 20th-century Anglo-American eugenics movement.
AGI ruin: a list of lethalities (Yudkowsky, 2022) (35 mins) (only sections A and B)
Eliezer Yudkowsky, one of the early researchers in the AI alignment field and a founder and researcher at the Machine Intelligence Research Institute, provides a comprehensive list of reasons that make him pessimistic about humanity’s chances of building an aligned AGI system.
Where I agree and disagree with Eliezer (Christiano, 2022) (25 mins) (only sections Agreements and Disagreements)
Another prominent researcher in the field, Paul Christiano, responds to the above post by listing his agreements and disagreements with Yudkowsky, which together make him more optimistic about the probability that the first AGI system will be aligned.
Further readings:
Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al., 2019) (45 mins) (only sections 1, 3 and 4)
This is the paper that originally introduced the inner alignment problem. It explains why inner alignment should be viewed as a separate problem from that of finding the correct reward function, and why inner misalignment might lead to systems that are deceptive.
Some of my disagreements with List of Lethalities (Turner, 2023) (20 mins)
Worst-case guarantees (Christiano, 2019) (15 mins)
The alignment problem from a deep learning perspective (Ngo, Mindermann and Chan, 2022) (only sections 4 and 5) (20 mins)
Biological Anchors: A Trick That Might Or Might Not Work (Alexander, 2022) (30 mins)
This reading provides an in-depth critical review of one popular method for forecasting progress in AI that you learned about in the first part of the course: biological anchors.
The Bitter Lesson (Sutton, 2019) (10 mins)
This article argues that in the field of AI, general methods that scale with increased compute outperform approaches built on expert human knowledge.
Worlds Where Iterative Design Fails (Wentworth, 2022) (15 mins)
Alignment By Default (Wentworth, 2020) (20 mins)
The case for ensuring that powerful AIs are controlled (Greenblatt and Shlegeris, 2024) (45 mins)
Counterarguments to the basic AI x-risk case (Grace, 2022) (45 mins)
Low-stakes alignment (Christiano, 2021) (10 mins)
Reward is not the optimization target (Turner, 2022) (15 mins)
An Overview of Catastrophic AI Risks (Hendrycks, 2023) (20 mins)
Hendrycks provides a broad overview of possible ways AI systems could pose catastrophic risks. The discussed risks go beyond the threat models that were covered during the Alignment 101 course: for example, the author discusses risks from malicious use, arms race risks, and accident risks.
Podcast spotlight:
For an in-depth explanation of why advanced AI systems pose an existential risk and what it would look like to develop safer systems, you can listen to episode 12 of AXRP (the AI X-risk Research Podcast) with Paul Christiano, a researcher at the Alignment Research Center and one of the inventors of the RLHF fine-tuning protocol. For a long discussion with Yudkowsky on the reasons behind his pessimism about the feasibility of AGI alignment, check out his appearance on the Dwarkesh podcast. For further context on the Risks from Learned Optimization paper, you can listen to the AXRP episode with Evan Hubinger. Forecasting AI progress is discussed in the 80,000 Hours podcast episode featuring Danny Hernandez from Anthropic.
Exercises:
Which of the methods for AI alignment discussed in the IBM article (RLHF, synthetic data, red teaming, AI governance, corporate AI ethics boards) did you find most promising, and why?
Review the risks listed in section A of AGI ruin: a list of lethalities (Yudkowsky, 2022). Are there any you agree or disagree with, and why?
Recall the TESCREAL paper by Gebru and Torres. Which parts of the main argument do you agree or disagree with, and why? Consider, for example: the claimed tie between longtermism and eugenics, and the “building well-scoped and well-defined systems” section.
A common argument against AI safety is that the field is too preoccupied with long-term risks, to the detriment of current problems. Are there any present-day risks that the AI alignment methods you have read about could help mitigate?
