AI Alignment: Why It’s Hard, and Where to Start

By Eliezer Yudkowsky
Machine Intelligence Research Institute
A talk at Stanford University for the Symbolic Systems Distinguished Speaker series
Video of the talk
Long article

My Notes:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions

‒ Stuart Russell

Exhibit an agent that decides according to a utility function and therefore naturally chooses to self-modify to new code that pursues the same utility function.

But how can we exhibit that when we’re far away from coding up self-modifying, expected utility agents?

Well, would you know how to write the code given unbounded computing power?

Arithmetical or algebraical calculations are, from their very nature, fixed and determinate. . . Even granting that the movements of the Automaton Chess-Player were in themselves determinate, they would be necessarily interrupted and disarranged by the indeterminate will of his antagonist. There is then no analogy whatever between the operations of the Chess-Player, and those of the calculating machine of Mr. Babbage. . . It is quite certain that the operations of the Automaton are regulated by mind, and by nothing else. Indeed this matter is susceptible of a mathematical demonstration, a priori.

‒ Edgar Allan Poe

If we know how to solve a problem with unbounded computation, we “merely” need faster algorithms (as with chess: the unbounded solution was known 47 years before Deep Blue).

If we can’t solve it with unbounded computation, we’re confused about the work to be performed.
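The chess point can be made concrete: given unbounded computation, perfect play in a finite game is just a short exhaustive search; everything hard is about making it fast. A minimal sketch, using single-pile Nim instead of chess for brevity (the game choice is my assumption, not from the talk):

```python
# With unbounded computation, perfect play in a finite game is a few
# lines of recursive search. Single-pile Nim: a player removes 1-3
# stones per turn; whoever takes the last stone wins.
from functools import lru_cache

@lru_cache(maxsize=None)
def winning(stones: int) -> bool:
    """True if the player to move can force a win."""
    if stones == 0:
        return False  # the previous player took the last stone and won
    # We win if some legal move leaves the opponent in a losing position.
    return any(not winning(stones - take)
               for take in (1, 2, 3) if take <= stones)

print(winning(4))  # False: multiples of 4 are losing positions
```

The same exhaustive-search schema solves chess in principle; the gap between this and Deep Blue is exactly the “merely need faster algorithms” gap.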

Agents and their utility functions

The best general introductions to the topic of smarter-than-human artificial intelligence are plausibly Nick Bostrom’s Superintelligence and Stuart Armstrong’s Smarter Than Us. For a shorter explanation, see my recent guest post on EconLog.

A fuller version of Stuart Russell’s quotation (from his response in the Edge.org conversation “The Myth of AI”):

There have been many unconvincing arguments [for worrying about AI disasters] — especially those involving blunt applications of Moore’s law or the spontaneous emergence of consciousness and evil intent. Many of the contributors to this conversation seem to be responding to those arguments and ignoring the more substantial arguments proposed by Omohundro, Bostrom, and others.

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

  1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
  2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
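Russell’s point about unconstrained variables shows up even in toy optimizers. In the sketch below (illustrative, not from the talk), the objective depends only on x, so the optimizer’s tie-breaking rule — not anyone’s preferences — silently decides y, and lands it on an extreme value:

```python
import itertools

# Maximize an objective over (x, y) pairs when only x matters.
# y is "unconstrained": the optimizer's arbitrary tie-break
# ("first candidate wins") sets it to -10, an extreme of its range.
grid = list(itertools.product(range(-10, 11), repeat=2))

def objective(point):
    x, y = point
    return x  # y does not appear in the objective at all

best = max(grid, key=objective)
print(best)  # (10, -10): x maximized, y set by an accident of iteration order
```

If y were something we cared about, nothing in the optimization would protect it — which is the genie-in-the-lamp failure mode in miniature.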

Isaac Asimov introduced the three laws of robotics in the 1942 short story “Runaround.”

The Peter Norvig / Stuart Russell quotation is from Artificial Intelligence: A Modern Approach, the top undergraduate textbook in AI.

The arguments I give for having a utility function are standard, and can be found in e.g. Poole and Mackworth’s Artificial Intelligence: Foundations of Computational Agents. I write about normative rationality at greater length in Rationality: From AI to Zombies (e.g., in The Allais Paradox and Zut Allais!).
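The normative picture those references defend is simple to state in code: an agent with a probability distribution over outcomes for each action takes the action with the highest expected utility. A minimal sketch (the action names and numbers are illustrative assumptions):

```python
# Minimal expected-utility chooser. Each action maps to a
# distribution over outcomes: {outcome: probability}.
outcome_probs = {
    "safe":  {"small_win": 1.0},
    "risky": {"big_win": 0.4, "loss": 0.6},
}
utility = {"small_win": 10.0, "big_win": 20.0, "loss": -5.0}

def expected_utility(action: str) -> float:
    return sum(p * utility[o] for o, p in outcome_probs[action].items())

best = max(outcome_probs, key=expected_utility)
print(best, expected_utility(best))  # safe 10.0 (risky scores 0.4*20 - 0.6*5 = 5.0)
```

The coherence arguments (e.g., via the Allais paradox) say that an agent whose choices can’t be summarized this way is exploitable; they don’t say the utility function is easy to specify.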

Some AI alignment subproblems

My discussion of low-impact agents borrows from a forthcoming research proposal by Taylor et al.: “Value Alignment for Advanced Machine Learning Systems.” For an overview, see Low Impact on Arbital.
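The low-impact idea can be caricatured in one line: trade task utility off against a penalty for how far the resulting world diverges from a “do nothing” baseline. The sketch below is my caricature, not Taylor et al.’s actual formalism, and the numbers are illustrative:

```python
# Caricature of a low-impact objective: task utility minus a weighted
# penalty on divergence from the counterfactual "agent did nothing" world.
def low_impact_score(task_utility: float, divergence: float,
                     impact_weight: float = 10.0) -> float:
    return task_utility - impact_weight * divergence

# A high-impact plan that gains slightly more utility loses to a
# mild plan that barely disturbs the world:
big = low_impact_score(task_utility=1.0, divergence=0.5)    # -4.0
mild = low_impact_score(task_utility=0.9, divergence=0.01)  # 0.8
print(mild > big)  # True
```

The research problem is almost entirely in the divergence term: defining “impact” so the agent neither wrecks the world nor schemes to manage the penalty itself.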

The suspend problem is discussed (under the name “shutdown problem”) in Soares et al.’s “Corrigibility.” The stable policy proposal comes from Taylor’s Maximizing a Quantity While Ignoring Effect Through Some Channel.

Poe’s argument against the possibility of machine chess is from the 1836 essay “Maelzel’s Chess-Player.”

Fallenstein and Soares’ “Vingean Reflection” is currently the most up-to-date overview of work in goal stability. Other papers cited:

  1. Yudkowsky and Herreshoff (2013). “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.” Working paper.
  2. Christiano et al. (2013). “Definability of Truth in Probabilistic Logic.” Working paper.
  3. Fallenstein and Kumar (2015). “Proof-Producing Reflection for HOL: With an Application to Model Polymorphism.” In Interactive Theorem Proving: 6th International Conference, Proceedings.
  4. Yudkowsky (2014). “Distributions Allowing Tiling of Staged Subjective EU Maximizers.” Technical report 2014-1. Machine Intelligence Research Institute.

Why expect difficulty?

For more about orthogonal final goals and convergent instrumental strategies, see Bostrom’s “The Superintelligent Will” (also reproduced in Superintelligence). Benson-Tilsen and Soares’ “Formalizing Convergent Instrumental Goals” provides a toy model.

The smile maximizer is based on a proposal by Bill Hibbard. This example and Jürgen Schmidhuber’s compressibility proposal are discussed more fully in Soares’ “The Value Learning Problem.” See also the Arbital pages on Edge Instantiation, Context Disaster, and Nearest Unblocked Strategy.

See the MIRI FAQ and GiveWell’s report on potential risks from advanced AI for quick explanations of why AI is likely to be able to surpass human cognitive capabilities, among other topics. Bensinger’s When AI Accelerates AI notes general reasons to expect capability speedup, while “Intelligence Explosion Microeconomics” delves into the specific question of whether self-modifying AI is likely to result in accelerating AI progress.

Muehlhauser notes the analogy between computer security and AI alignment research in AI Risk and the Security Mindset.

Where we are now

MIRI’s technical research agenda summarizes many of the field’s core open problems.

For more on conservatism, see the Arbital post Conservative Concept Boundary and Taylor’s Conservative Classifiers. Also on Arbital: introductions to mild optimization and act-based agents.
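Mild optimization has a crisp instance in Taylor’s quantilizers: instead of taking the argmax action, sample from the top q fraction of actions as ranked by utility. A rough sketch, simplified to a uniform base distribution (the action space and utility here are stand-ins, not from the paper):

```python
import random

def quantilize(actions, utility, q, rng=random.Random(0)):
    """Sample uniformly from the top q fraction of actions by utility.

    q -> 0 recovers a maximizer; larger q optimizes more mildly,
    reducing the chance of landing on an extreme edge-case action.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

actions = list(range(100))   # stand-in action space
utility = lambda a: a        # stand-in utility: higher is better
print(quantilize(actions, utility, q=0.1))  # one of the top-10 actions
```

The full proposal samples from a trusted base distribution restricted to the top quantile, which is what gives the paper’s bounds on how much worse a quantilizer can do than the base distribution.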

Papers cited in the slides:

  1. Armstrong and Levinstein (2015). “Reduced Impact Artificial Intelligences.” Working paper.
  2. Soares (2015). “Formalizing Two Problems of Realistic World-Models.” Technical report 2015-3. Machine Intelligence Research Institute.
  3. Taylor (2016). “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization.” Paper presented at the AAAI 2016 AI, Ethics and Society Workshop.
  4. Evans et al. (2015). “Learning the Preferences of Bounded Agents.” Paper presented at the NIPS 2015 Workshop on Bounded Optimality.
  5. Hutter (2007). “Universal Algorithmic Intelligence: A Mathematical Top→Down Approach.” arXiv:cs/0701125 [cs.AI].
  6. LaVictoire et al. (2014). “Program Equilibrium in the Prisoner’s Dilemma via Löb’s Theorem.” Paper presented at the AAAI 2014 Multiagent Interaction without Prior Coordination Workshop.
  7. Fallenstein et al. (2015). “Reflective Oracles: A Foundation for Game Theory in Artificial Intelligence.” In Proceedings of LORI 2015.

Email us if you have any questions, and see MIRI’s website for information about opportunities to collaborate on AI alignment projects.

Where can you work on this?
  • Machine Intelligence Research Institute (Berkeley)
  • Future of Humanity Institute (Oxford University)
  • Stuart Russell (UC Berkeley)
  • Leverhulme Centre for the Future of Intelligence, now starting up (Cambridge, UK)
