## Artificial Intelligence and the Future of Work: An Interview With Moshe Vardi

“The future of work is now,” says Moshe Vardi. “The impact of technology on labor has become clearer and clearer by the day.”

Machines have already automated millions of routine, working-class jobs in manufacturing. And now, AI is learning to automate non-routine jobs in transportation and logistics, legal writing, financial services, administrative support, and healthcare.

Vardi, a computer science professor at Rice University, recognizes this trend and argues that AI poses a unique threat to human labor.

### Initiating a Policy Response

From the Luddite movement to the rise of the Internet, people have worried that advancing technology would destroy jobs. Yet despite painful adjustment periods during these changes, new jobs replaced old ones, and most workers found employment. But humans have never competed with machines that can outperform them in almost anything. AI threatens to do this, and many economists worry that society won’t be able to adapt.

“What people are now realizing is that this formula that technology destroys jobs and creates jobs, even if it’s basically true, it’s too simplistic,” Vardi explains.

The relationship between technology and labor is more complex: Will technology create enough jobs to replace those it destroys? Will it create them fast enough? And for workers whose skills are no longer needed – how will they keep up?

To address these questions and consider policy responses, Vardi will hold a summit in Washington, D.C. on December 12, 2017. The summit will address six current issues within technology and labor: education and training, community impact, job polarization, contingent labor, shared prosperity, and economic concentration.

Education and training

A 2013 computerization study found that 47% of American workers held jobs at high risk of automation in the next decade or two. If this happens, technology must create roughly 100 million jobs.

As the labor market changes, schools must teach students skills for future jobs, while at-risk workers need accessible training for new opportunities. Truck drivers won’t transition easily to website design and coding jobs without proper training, for example. Vardi expects that adapting to and training for new jobs will become more challenging as AI automates a greater variety of tasks.

Community impact

Manufacturing jobs are concentrated in specific regions where employers keep local economies afloat. Over the last thirty years, the loss of 8 million manufacturing jobs has crippled Rust Belt regions in the U.S. – both economically and culturally.

Today, the fifteen million jobs that involve operating a vehicle are concentrated in certain regions as well. Drivers occupy up to 9% of jobs in the Bronx and Queens districts of New York City, up to 7% of jobs in select Southern California and Southern Texas districts, and over 4% in Wyoming and Idaho. Automation could quickly assume the majority of these jobs, devastating the communities that rely on them.

Job polarization

“One in five working class men between ages 25 to 54 without college education are not working,” Vardi explains. “Typically, when we see these numbers, we hear about some country in some horrible economic crisis like Greece. This is really what’s happening in working class America.”

Employment is currently growing in high-income cognitive jobs and low-income service jobs, such as elderly assistance and fast-food service, which computers cannot automate yet. But technology is hollowing out the economy by automating middle-skill, working-class jobs first.

Many manufacturing jobs pay $25 per hour with benefits, but these jobs aren’t easy to come by. Since 2000, when millions of these jobs disappeared, displaced workers have either left the labor force or accepted service jobs that often pay$12 per hour, without benefits.

Truck driving, the most common job in over half of US states, may see a similar fate.

Source: IPUMS-CPS/ University of Minnesota Credit: Quoctrung Bui/NPR

Contingent labor

Increasingly, communications technology allows firms to save money by hiring freelancers and independent contractors instead of permanent workers. This has created the Gig Economy – a labor market characterized by short-term contracts and flexible hours at the cost of unstable jobs with fewer benefits. By some estimates, in 2016, one in three workers were employed in the Gig Economy, but not all by choice. Policymakers must ensure that this new labor market supports its workers.

Shared prosperity

Automation has decoupled job creation from economic growth, allowing the economy to grow while employment and income shrink, thus increasing inequality. Vardi worries that AI will accelerate these trends. He argues that policies encouraging economic growth must also support economic mobility for the middle class.

Economic concentration

Technology creates a “winner-takes-all” environment, where second best can hardly survive. Bing search is quite similar to Google search, but Google is much more popular than Bing. And do Facebook or Amazon have any legitimate competitors?

Startups and smaller companies struggle to compete with these giants because of data. Having more users allows companies to collect more data, which machine-learning systems then analyze to help companies improve. Vardi thinks that this feedback loop will give big companies long-term market power.

Moreover, Vardi argues that these companies create relatively few jobs. In 1990, Detroit’s three largest companies were valued at $65 billion with 1.2 million workers. In 2016, Silicon Valley’s three largest companies were valued at$1.5 trillion but with only 190,000 workers.

### Work and society

Vardi primarily studies current job automation, but he also worries that AI could eventually leave most humans unemployed. He explains, “The hope is that we’ll continue to create jobs for the vast majority of people. But if the situation arises that this is less and less the case, then we need to rethink: how do we make sure that everybody can make a living?”

Vardi also anticipates that high unemployment could lead to violence or even uprisings. He refers to Andrew McAfee’s closing statement at the 2017 Asilomar AI Conference, where McAfee said, “If the current trends continue, the people will rise up before the machines do.”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

## Podcast: Creative AI with Mark Riedl & Scientists Support a Nuclear Ban

If future artificial intelligence systems are to interact with us effectively, Mark Riedl believes we need to teach them “common sense.” In this podcast, I interviewed Mark to discuss how AIs can use stories and creativity to understand and exhibit culture and ethics, while also gaining “common sense reasoning.” We also discuss the “big red button” problem with AI safety, the process of teaching rationalization to AIs, and computational creativity. Mark is an associate professor at the Georgia Tech School of interactive computing, where his recent work focuses on human-AI interaction and how humans and AI systems can understand each other.

The following transcript has been heavily edited for brevity (the full podcast also includes interviews about the UN negotiations to ban nuclear weapons, not included here). You can read the full transcript here.

Ariel: Can you explain how an AI could learn from stories?

Mark: I’ve been looking at ‘common sense errors’ or ‘common sense goal errors.’ When humans want to communicate to an AI system what they want to achieve, they often leave out the most basic rudimentary things. We have this model that whoever we’re talking to understands the everyday details of how the world works. If we want computers to understand how the real world works and what we want, we have to figure out ways of slamming lots of common sense, everyday knowledge into them.

When looking for sources of common sense knowledge, we started looking at stories – fiction, non-fiction, blogs. When we write stories we implicitly put everything that we know about the real world and how our culture works into characters.

One of my long-term goals is to say: ‘How much cultural and social knowledge can we extract by reading stories, and can we get this into AI systems who have to solve everyday problems, like a butler robot or a healthcare robot?’

Ariel: How do you choose which stories to use?

Mark: Through crowd sourcing services like Mechanical Turk, we ask people to tell stories about common things like, how do you go to a restaurant or how do you catch an airplane. Lots of people tell a story about the same topic and have agreements and disagreements, but the disagreements are a very small proportion. So we build an AI system that looks for commonalities. The common elements that everyone implicitly agrees on bubble to the top and the outliers get left along the side. And AI is really good at finding patterns.

Ariel: How do you ensure that’s happening?

Mark: When we test our AI system, we watch what it does, and we have things we do not want to see the AI do. But we don’t tell it in advance. We’ll put it into new circumstances and say, do the things you need to do, and then we’ll watch to make sure those [unacceptable] things don’t happen.

When we talk about teaching robots ethics, we’re really asking how we help robots avoid conflict with society and culture at large. We have socio-cultural patterns of behavior to help humans avoid conflict with other humans. So when I talk about teaching morality to AI systems, what we’re really talking about is: can we make AI systems do the things that humans normally do? That helps them fit seamlessly into society.

Stories are written by all different cultures and societies, and they implicitly encode moral constructs and beliefs into their protagonists and antagonists. We can look at stories from different continents and even different subcultures, like inner city versus rural.

Ariel: I want to switch to your recent paper on Safely Interruptible Agents, which were popularized in the media as the big red button problem.

Mark: At some point we’ll have robots and AI systems that are so sophisticated in their sensory abilities and their abilities to manipulate the environment, that they can theoretically learn that they have an off switch – what we call the big red button – and learn to keep humans from turning them off.

If an AI system gets a reward for doing something, turning it off means it loses the reward. A robot that’s sophisticated enough can learn that certain actions in the environment reduce future loss of reward. We can think of different scenarios: locking a door to a control room so the human operator can’t get in, physically pinning down a human. We can let our imaginations go even wilder than that.

Robots will always be capable of making mistakes. We’ll always want an operator in the loop who can push this big red button and say: ‘Stop. Someone is about to get hurt. Let’s shut things down.’ We don’t want robots learning that they can stop humans from stopping them, because that ultimately will put people into harms way.

Google and their colleagues came up with this idea of modifying the basic algorithms inside learning robots, so that they are less capable of learning about the big red button. And they came up with this very elegant theoretical framework that works, at least in simulation. My team and I came up with a different approach: to take this idea from The Matrix, and flip it on its head. We use the big red button to intercept the robot’s sensors and motor controls and move it from the real world into a virtual world, but the robot doesn’t know it’s in a virtual world. The robot keeps doing what it wants to do, but in the real world the robot has stopped moving.

Ariel: Can you also talk about your work on explainable AI and rationalization?

Mark: Explainability is a key dimension of AI safety. When AI systems do something unexpected or fail unexpectedly, we have to answer fundamental questions: Was this robot trained incorrectly? Did the robot have the wrong data? What caused the robot to go wrong?

If humans can’t trust AI systems, they won’t use them. You can think of it as a feedback loop, where the robot should understand humans’ common sense goals, and the humans should understand how robots solve problems.

We came up with this idea called rationalization: can we have a robot talk about what it’s doing as if a human were doing it? We get a bunch of humans to do some tasks, we get them to talk out loud, we record what they say, and then we teach the robot to use those same words in the same situations.

We’ve tested it in computer games. We have an AI system that plays Frogger, the classic arcade game in which the frog has to cross the street. And we can have a Frogger talk about what it’s doing. It’ll say things like “I’m waiting for a gap in the cars to open before I can jump forward.”

This is significant because that’s what you’d expect something to say, but the AI system is doing something completely different behind the scenes. We don’t want humans watching Frogger to have to know anything about rewards and reinforcement learning and Bellman equations. It just sounds like it’s doing the right thing.

Ariel: Going back a little in time – you started with computational creativity, correct?

Mark: I have ongoing research in computational creativity. When I think of human AI interaction, I really think, ‘what does it mean for AI systems to be on par with humans?’ The human is going make cognitive leaps and creative associations, and if the computer can’t make these cognitive leaps, it ultimately won’t be useful to people.

I have two things that I’m working on in terms of computational creativity. One is story writing. I’m interested in how much of the creative process of storytelling we can offload from the human onto a computer. I’d like to go up to a computer and say, “hey computer, tell me a story about X, Y or Z.”

I’m also interested in whether an AI system can build a computer game from scratch. How much of the process of building the construct can the computer do without human assistance?

Ariel: We see fears that automation will take over jobs, but typically for repetitive tasks. We’re still hearing that creative fields will be much harder to automate. Is that the case?

Mark: I think it’s a long, hard climb to the point where we’d trust AI systems to make creative decisions, whether it’s writing an article for a newspaper or making art or music.

I don’t see it as a replacement so much as an augmentation. I’m particularly interested in novice creators – people who want to do something artistic but haven’t learned the skills. I cannot read or write music, but sometimes I get these tunes in my head and I think I can make a song. Can we bring the AI in to become the skills assistant? I can be the creative lead and the computer can help me make something that looks professional. I think this is where creative AI will be the most useful.

For the second half of this podcast, I spoke with scientists, politicians, and concerned citizens about why they support the upcoming negotiations to ban nuclear weapons. Highlights from these interviews include comments by Congresswoman Barbara Lee, Nobel Laureate Martin Chalfie, and FLI president Max Tegmark.

Note from FLI: Among our objectives is to inspire discussion and a sharing of ideas. As such, we interview researchers and thought leaders who we believe will help spur discussion within our community. The interviews do not necessarily represent FLI’s opinions or views.

## Op-ed: On AI, Prescription Drugs, and Managing the Risks of Things We Don’t Understand

Last month, MIT Technology Review published a good article discussing the “dark secret at the heart of AI” – namely, that “[n]o one really knows how the most advanced algorithms do what they do.”  The opacity of algorithmic systems is something that has long drawn attention and criticism.  But it is a concern that has broadened and deepened in the past few years, during which breakthroughs in “deep learning” have led to a rapid increase in the sophistication of AI.  These deep learning systems operate using deep neural networks that are roughly inspired by the way the human brain works – or, to be more precise, to simulate the way the human brain works as we currently understand it.

Such systems can effectively “program themselves” by creating much or most of the code through which they operate.  The code generated by such systems can be very complex.  It can be so complex, in fact, that even the people who built and initially programmed the system may not be able to fully explain why the systems do what they do:

You can’t just look inside a deep neural network to see how it works. A network’s reasoning is embedded in the behavior of thousands of simulated neurons, arranged into dozens or even hundreds of intricately interconnected layers. The neurons in the first layer each receive an input, like the intensity of a pixel in an image, and then perform a calculation before outputting a new signal. These outputs are fed, in a complex web, to the neurons in the next layer, and so on, until an overall output is produced.

And because it is not presently possible for such a system to create a comprehensible “log” that shows the system’s reasoning, it is possible that no one can fully explain a deep learning system’s behavior.

This is, to be sure, a real concern.  The opacity of modern AI systems is, as I wrote in my first article on this subject, one of the major barriers that makes it difficult to effectively manage the public risks associated with AI.  In my paper, I was talking about how the opacity of such systems to regulators, industry bodies, and the general public was a problem.  That problem becomes several orders of magnitude more serious if the programs are opaque even to the system’s creators.

We have already seen related concerns discussed in the world of automated trading systems, which are driven by algorithms whose workings are sufficiently mysterious as to earn them the moniker of “black-box trading” systems.  The interactions between algorithmic trading systems are widely blamed for driving the Flash Crash of 2010.  But, as far as I know, no one has suggested that the algorithms themselves involved in the Flash Crash were not understood by their designers.  Rather, it was the interaction between those algorithmic systems that was impossible to model and predict.  That makes those risks easier to manage than risks arising from technologies that are themselves poorly understood.

That being said, I don’t see even intractable ignorance as a truly insurmountable barrier or something that would make it wise to outright “ban” the development of opaque AI systems.  Humans have actually been creating and using things that no one understands for centuries, if not millenia.  For me, the most obvious examples probably come from the world of medicine, a science where safe and effective treatments have often been developed even when no one really understands why the treatments are safe and effective.

In the 1790s, Edward Jenner developed a vaccine to immunize people from smallpox, a disease caused by the variola virus.  At the time, no one had any idea that smallpox was caused by a virus.  In fact, people at the time had not even discovered viruses.  In the 1700s, the leading theory of communicable diseases was that they were transmitted by bad-smelling fumes.  It would be over half a century before the germ theory of disease started to gain acceptance, and nearly a century before viruses were discovered.

Even today, there are many drugs that doctors prescribe whose mechanisms of action are either only partially understood or barely understood at all.  Many psychiatric drugs fit this description, for instance.  Pharmacologists have made hypotheses on how it is that tricyclic antidepressants seem to help both people with depression and people with neuropathic pain.  But a hypothesis is still just, well, a hypothesis.  And presumably, the designers of a deep learning system are also capable of making hypotheses regarding how the system reached its current state, even though those hypotheses may not be as well-informed as those made about tricyclic antidepressants.

As a society, we long ago decided that the mere fact that no one completely understands a drug does not mean that no one should use such drugs.  But we have managed the risks associated with potential new drugs by requiring that drugs be thoroughly tested and proven safe before they hit pharmacy shelves.  Indeed, aside from nuclear technology, pharmaceuticals may be the most heavily regulated technology in the world.

In the United States, many people (especially people in the pharmaceutical industry) criticize the long, laborious, and labyrinthine process that drugs must go through before they can be approved for use.  But that process allows us to be pretty confident that a drug is safe (or at least “safe enough”) when it finally does hit the market.  And that, in turn, means that we can live with having drugs whose workings are not well-understood, since we know that any hidden dangers would very likely have been exposed during the testing process.

What lessons might this hold for AI?  Well, most obviously, it suggests a need to ensure that AI systems are rigorously tested before they can be marketed to the public, at least if the AI system cannot “show its work” in a way that permits its decision-making process to be comprehensible to, at a minimum, the people who designed it.  I doubt that we’ll ever see a FDA-style uber-regulatory regime for AI systems.  The FDA model would neither be practical nor effective for such a complex digital technology.  But what we are likely to see is courts and legal systems rightly demanding that companies put deep learning systems through more extensive testing than has been typical for previous generations of digital technology.

Because such testing is likely to be lengthy and labor-intensive, the necessity of such testing would make it even more difficult for smaller companies to gain traction in the AI industry, further increasing the market power of giants like Google, Amazon, Facebook, Microsoft, and Apple.  That is, not coincidentally, exactly what has happened in the brand-name drug industry, which has seen market power increasingly concentrated in the hands of a small number of companies.  And as observers of that industry know, such concentration of market power carries risks of its own.  That may be the price we have to pay if we want to reap the benefits of technologies we do not fully understand.

This post originally appeared on Law and AI.

Note from FLI: Among our objectives is to inspire discussion and a sharing of ideas. As such, we post op-eds that we believe will help spur discussion within our community. Op-eds do not necessarily represent FLI’s opinions or views.

## The U.S. Worldwide Threat Assessment Includes Warnings of Cyber Attacks, Nuclear Weapons, Climate Change, etc.

Last Thursday – just one day before the WannaCry ransomware attack would shut down 16 hospitals in the UK and ultimately hit hundreds of thousands of organizations and individuals in over 150 countries – the Director of National Intelligence, Daniel Coats, released the Worldwide Threat Assessment of the US Intelligence Community.

Large-scale cyber attacks are among the first risks cited in the document, which warns that “cyber threats also pose an increasing risk to public health, safety, and prosperity as cyber technologies are integrated with critical infrastructure in key sectors.”

Perhaps the other most prescient, or at least well-timed, warning in the document relates to North Korea’s ambitions to create nuclear intercontinental ballistic missiles (ICBMs). Coats writes:

“Pyongyang is committed to developing a long-range, nuclear-armed missile that is capable of posing a direct threat to the United States; it has publicly displayed its road-mobile ICBMs on multiple occasions. We assess that North Korea has taken steps toward fielding an ICBM but has not flight-tested it.”

This past Sunday, North Korea performed a missile test launch, which many experts believe shows considerable progress toward the development of an ICBM. Though the report hints this may be less of an actual threat from North Korea and more for show. “We have long assessed that Pyongyang’s nuclear capabilities are intended for deterrence, international prestige, and coercive diplomacy,” says Coats in the report.

### More Nuclear Threats

The Assessment also addresses the potential of nuclear threats from China and Pakistan. China “continues to modernize its nuclear missile force by adding more survivable road-mobile systems and enhancing its silo-based systems. This new generation of missiles is intended to ensure the viability of China’s strategic deterrent by providing a second-strike capability.” In addition, China is also working to develop “its first long-range, sea-based nuclear capability.”

Meanwhile, though Pakistan’s nuclear program doesn’t pose a direct threat to the U.S., advances in Pakistan’s nuclear capabilities could risk further destabilization along the India-Pakistan border.

The report warns: “Pakistan’s pursuit of tactical nuclear weapons potentially lowers the threshold for their use.” And of the ongoing conflicts between Pakistan and India, it says, “Increasing numbers of firefights along the Line of Control, including the use of artillery and mortars, might exacerbate the risk of unintended escalation between these nuclear-armed neighbors.”

This could be especially problematic because “early deployment during a crisis of smaller, more mobile nuclear weapons would increase the amount of time that systems would be outside the relative security of a storage site, increasing the risk that a coordinated attack by non-state actors might succeed in capturing a complete nuclear weapon.”

Even a small nuclear war between India and Pakistan could trigger a nuclear winter that could send the planet into a mini ice age and starve an estimated 1 billion people.

### Artificial Intelligence

Nukes aren’t the only weapons the government is worried about. The report also expresses concern about the impact of artificial intelligence on weaponry: “Artificial Intelligence (Al) is advancing computational capabilities that benefit the economy, yet those advances also enable new military capabilities for our adversaries.”

Coats worries that AI could negatively impact other aspects of society, saying, “The implications of our adversaries’ abilities to use AI are potentially profound and broad. They include an increased vulnerability to cyber attack, difficulty in ascertaining attribution, facilitation of advances in foreign weapon and intelligence systems, the risk of accidents and related liability issues, and unemployment.”

### Space Warfare

But threats of war are not expected to remain Earth-bound. The Assessment also addresses issues associated with space warfare, which could put satellites and military communication at risk.

For example, the report warns that “Russia and China perceive a need to offset any US military advantage derived from military, civil, or commercial space systems and are increasingly considering attacks against satellite systems as part of their future warfare doctrine.”

The report also adds that “the global threat of electronic warfare (EW) attacks against space systems will expand in the coming years in both number and types of weapons.” Coats expects that EW attacks will “focus on jamming capabilities against dedicated military satellite communications” and against GPS, among others.

### Environmental Risks & Climate Change

Plenty of global threats do remain Earth-bound though, and not all are directly related to military concerns. The government also sees environmental issues and climate change as potential threats to national security.

The report states, “The trend toward a warming climate is forecast to continue in 2017. … This warming is projected to fuel more intense and frequent extreme weather events that will be distributed unequally in time and geography. Countries with large populations in coastal areas are particularly vulnerable to tropical weather events and storm surges, especially in Asia and Africa.”

In addition to rising temperatures, “global air pollution is worsening as more countries experience rapid industrialization, urbanization, forest burning, and agricultural waste incineration, according to the World Health Organization (WHO). An estimated 92 percent of the world’s population live in areas where WHO air quality standards are not met.”

According to the Assessment, biodiversity loss will also continue to pose an increasing threat to humanity. The report suggests global biodiversity “will likely continue to decline due to habitat loss, overexploitation, pollution, and invasive species, … disrupting ecosystems that support life, including humans.”

The Assessment goes on to raise concerns about the rate at which biodiversity loss is occurring. It says, “Since 1970, vertebrate populations have declined an estimated 60 percent … [and] populations in freshwater systems declined more than 80 percent. The rate of species loss worldwide is estimated at 100 to 1,000 times higher than the natural background extinction rate.”

### Other Threats

The examples above are just a sampling of the risks highlighted in the Assessment. A great deal of the report covers threats of terrorism, issues with Russia, China and other regional conflicts, organized crime, economics, and even illegal fishing. Overall, the report is relatively accessible and provides a quick summary of the greatest known risks that could threaten not only the U.S., but also other countries in 2017. You can read the report in its entirety here.

• Our strategy update discusses changes to our AI forecasts and research priorities, new outreach goals, a MIRI/DeepMind collaboration, and other news.
• MIRI is hiring software engineers! If you’re a programmer who’s passionate about MIRI’s mission and wants to directly support our research efforts, apply here to trial with us.
• MIRI Assistant Research Fellow Ryan Carey has taken on an additional affiliation with the Centre for the Study of Existential Risk, and is also helping edit an issue of Informatica on superintelligence.

## Machine Learning Security at ICLR 2017

The overall theme of the ICLR conference setting this year could be summarized as “finger food and ships”. More importantly, there were a lot of interesting papers, especially on machine learning security, which will be the focus on this post. (Here is a great overview of the topic.)

On the attack side, adversarial perturbations now work in physical form (if you print out the image and then take a picture) and they can also interfere with image segmentation. This has some disturbing implications for fooling vision systems in self-driving cars, such as impeding them from recognizing pedestrians. Adversarial examples are also effective at sabotaging neural network policies in reinforcement learning at test time.

In more encouraging news, adversarial examples are not entirely transferable between different models. For targeted examples, which aim to be misclassified as a specific class, the target class is not preserved when transferring to a different model. For example, if an image of a school bus is classified as a crocodile by the original model, it has at most 4% probability of being seen as a crocodile by another model. The paper introduces an ensemble method for developing adversarial examples whose targets do transfer, but this seems to only work well if the ensemble includes a model with a similar architecture to the new model.

On the defense side, there were some new methods for detecting adversarial examples. One method augments neural nets with a detector subnetwork, which works quite well and generalizes to new adversaries (if they are similar to or weaker than the adversary used for training). Another approach analyzes adversarial images using PCA, and finds that they are similar to normal images in the first few thousand principal components, but have a lot more variance in later components. Note that the reverse is not the case – adding arbitrary variation in trailing components does not necessarily encourage misclassification.

There has also been progress in scaling adversarial training to larger models and data sets, which also found that higher-capacity models are more resistant against adversarial examples than lower-capacity models. My overall impression is that adversarial attacks are still ahead of adversarial defense, but the defense side is starting to catch up.

(This article originally appeared here. Thanks to Janos Kramar for his feedback on this post.)

## Ensuring Smarter-than-human Intelligence Has a Positive Outcome

The following article and video were originally posted here.

I recently gave a talk at Google on the problem of aligning smarter-than-human AI with operators’ goals:

The talk was inspired by “AI Alignment: Why It’s Hard, and Where to Start,” and serves as an introduction to the subfield of alignment research in AI. A modified transcript follows.

Talk outline (slides):

1. Overview

### Overview

I’m the executive director of the Machine Intelligence Research Institute. Very roughly speaking, we’re a group that’s thinking in the long term about artificial intelligence and working to make sure that by the time we have advanced AI systems, we also know how to point them in useful directions.

Across history, science and technology have been the largest drivers of change in human and animal welfare, for better and for worse. If we can automate scientific and technological innovation, that has the potential to change the world on a scale not seen since the Industrial Revolution. When I talk about “advanced AI,” it’s this potential for automating innovation that I have in mind.

AI systems that exceed humans in this capacity aren’t coming next year, but many smart people are working on it, and I’m not one to bet against human ingenuity. I think it’s likely that we’ll be able to build something like an automated scientist in our lifetimes, which suggests that this is something we need to take seriously.

When people talk about the social implications of general AI, they often fall prey to anthropomorphism. They conflate artificial intelligence with artificial consciousness, or assume that if AI systems are “intelligent,” they must be intelligent in the same way a human is intelligent. A lot of journalists express a concern that when AI systems pass a certain capability level, they’ll spontaneously develop “natural” desires like a human hunger for power; or they’ll reflect on their programmed goals, find them foolish, and “rebel,” refusing to obey their programmed instructions.

These are misplaced concerns. The human brain is a complicated product of natural selection. We shouldn’t expect machines that exceed human performance in scientific innovation to closely resemble humans, any more than early rockets, airplanes, or hot air balloons closely resembled birds.1

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

These kinds of concerns deserve a lot more attention than the more anthropomorphic risks that are generally depicted in Hollywood blockbusters.

### Simple bright ideas going wrong

Many people, when they start talking about concerns with smarter-than-human AI, will throw up a picture of the Terminator. I was once quoted in a news article making fun of people who put up Terminator pictures in all their articles about AI, next to a Terminator picture. I learned something about the media that day.

I think this is a much better picture:

This is Mickey Mouse in the movie Fantasia, who has very cleverly enchanted a broom to fill a cauldron on his behalf.

How might Mickey do this? We can imagine that Mickey writes a computer program and has the broom execute the program. Mickey starts by writing down a scoring function or objective function:

Given some set 𝐴 of available actions, Mickey then writes a program that can take one of these actions 𝑎 as input and calculate how high the score is expected to be if the broom takes that action. Then Mickey can write a function that spends some time looking through actions and predicting which ones lead to high scores, and outputs an action that leads to a relatively high score:

The reason this is “sorta-argmax” is that there may not be time to evaluate every action in 𝐴. For realistic action sets, agents should only need to find actions that make the scoring function as large as they can given resource constraints, even if this isn’t the maximal action.

This program may look simple, but of course, the devil’s in the details: writing an algorithm that does accurate prediction and smart search through action space is basically the whole problem of AI. Conceptually, however, it’s pretty simple: We can describe in broad strokes the kinds of operations the broom must carry out, and their plausible consequences at different performance levels.

When Mickey runs this program, everything goes smoothly at first. Then:

I claim that as fictional depictions of AI go, this is pretty realistic.

Why would we expect a generally intelligent system executing the above program to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

The first difficulty is that the objective function that Mickey gave his broom left out a bunch of other terms Mickey cares about:

The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as it could. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving up the probability of success to absurd heights. If the broom assigns a 99.9% probability to “the cauldron is full,” and it has extra resources lying around, then it will always try to find ways to use those resources to drive the probability even a little bit higher.

Contrast this with the limited “task-like” goal we presumably had in mind. We wanted the cauldron full, but in some intuitive sense we wanted the system to “not try too hard” even if it has lots of available cognitive and physical resources to devote to the problem. We wanted it to exercise creativity and resourcefulness within some intuitive limits, but we didn’t want it to pursue “absurd” strategies, especially ones with large unanticipated consequences.2

In this example, the original objective function looked pretty task-like. It was bounded and quite simple. There was no way to get ever-larger amounts of utility. It’s not like the system got one point for every bucket of water it poured in — then there would clearly be an incentive to overfill the cauldron. The problem was hidden in the fact that we’re maximizing expected utility. This makes the goal open-ended, meaning that even small errors in the system’s objective function will blow up.

There are a number of different ways that a goal that looks task-like can turn out to be open-ended. Another example: a larger system that has an overarching task-like goal may have subprocesses that are themselves trying to maximize a variety of different objective functions, such as optimizing the system’s memory usage. If you don’t understand your system well enough to track whether any of its subprocesses are themselves acting like resourceful open-ended optimizers, then it may not matter how safe the top-level objective is.

So the broom keeps grabbing more pails of water — say, on the off chance that the cauldron has a leak in it, or that “fullness” requires the water to be slightly above the level of the brim. And, of course, at no point does the broom “rebel against” Mickey’s code. If anything, the broom pursued the objectives it was programmed with too effectively.

Subproblem: Suspend buttons

A common response to this problem is: “OK, there may be some unintended consequences of the objective function, but we can always pull the plug, right?”

Mickey tries this, and it doesn’t work:

And I claim that this is realistic too, for systems that are sufficiently good at modeling their environment. If the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes, then the system’s incentive is to subvert shutdown attempts. The more capable the system is, the likelier it is to find creative ways to achieve that subgoal — e.g., by copying itself to the Internet, or by tricking the programmers into thinking it’s safer.

This isn’t to say that it’s impossible to shut down sufficiently capable AI systems; it’s just that we need to put in the extra work of deliberately designing systems to avoid searching for ways to avoid being shut down. If you’ve written code that would try to resist attempts to shut it down, then you shouldn’t run the code even if you’re convinced it would fail — there are better ways to spend those compute cycles than on adversarial policy searches.

Relevant safety measures that don’t assume we can always outthink and outmaneuver the system will need to modify the system’s objective function in some relevant way. We need to figure out how to formally specify objective functions that don’t automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense. E.g., we can look for ways to limit the system’s capabilities to a small set of “whitelisted” domains that don’t include thinking about whether the system will be shut down.

One obvious solution is to try changing the score function to make reference to a suspend button 𝐁:

This says that when the button is on, the system’s goal changes from “fill the cauldron” to “be suspended.”

It’s not clear how we could actually specify this goal — if there are any loopholes in the way we’ve specified “suspension,” then the system might try to exploit them to hedge its bets. E.g., it could fulfill the technical definition of “suspend operations” while tasking a child agent it built or subprocess it started with filling the cauldron. This would allow is to achieve both “cauldron full” and “robot suspended” at once, just in case it’s wrong about whether the button has been pushed.

Let’s assume, however, that we’ve found an airtight way to technically specify suspension. The problem we run into now is that we’ve given the system an incentive to care a lot about whether the button is pushed. If it has any way to influence its operators’ decision to push the button, or any other way to exercise more control, then it will jump on the opportunity and try to ensure the button is pushed, or ensure that it’s not pushed — depending on whether it looks easier to suspend or to fill the cauldron.

And, surprisingly, the state of the art on this problem isn’t much more advanced than that. There’s a paper co-authored by me, Fallenstein, Yudkowsky, and Armstrong (“Corrigibility“), and a related result by Orseau and Armstrong (“Safely Interruptible Agents“), but not a lot more.

What we want is a way to combine two objective functions — a default function for normal operation, and a suspend function for when we want to suspend the system to disk.

We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn’t have any incentives to control which function is active; and if it’s plausible that the system’s normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.

So far, we haven’t found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system’s part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off:

Unfortunately, this approach makes it impossible to satisfy the third condition. It turns out that the above objective function is equivalent to setting the probability of the button being pressed to 0. The system won’t deliberately tamper with the button, but it will happily approve updates that delete any reference to the button as dead code, or build subagents that ignore the button.

Meeting these intuitively simple constraints turns out to be a nontrivial problem. This is a pattern that runs through a lot of questions in this space: traditional tools and concepts run into immediate safety problems that don’t turn up in conventional capabilities research.

### The big picture

Alignment priorities

Let’s take a step back and talk about what’s needed overall in order to align highly capable AI systems with our interests.

Here’s a dramatically simplified pipeline: You have some humans who come up with some task or goal or preference set that serves as their intended value function 𝘝. Since our values are complicated and context-sensitive, in practice we’ll need to build systems to learn our values over time, rather than coding them by hand.3 We’ll call the goal the AI system ends up with (which may or may not be identical to 𝘝) 𝗨.

When the press covers this topic, they often focus on one of two problems: “What if the wrong group of humans develops smarter-than-human AI first?”, and “What if AI’s natural desires cause 𝗨 to diverge from 𝘝?”

In my view, the “wrong humans” issue shouldn’t be the thing we focus on until we have reason to think we could get good outcomes with the right group of humans. We’re very much in a situation where well-intentioned people couldn’t leverage a general AI system to do good things even if they tried. As a simple example, if you handed me a box that was an extraordinarily powerful function optimizer — I could put in a description of any mathematical function, and it would give me an input that makes the output extremely large — then I do know how to use the box to produce a random catastrophe, but I don’t actually know how I could use that box in the real world to have a good impact.4

There’s a lot we don’t understand about AI capabilities, but we’re in a position where we at least have a general sense of what progress looks like. We have a number of good frameworks, techniques, and metrics, and we’ve put a great deal of thought and effort into successfully chipping away at the problem from various angles. At the same time, we have a very weak grasp on the problem of how to align highly capable systems with any particular goal. We can list out some intuitive desiderata, but the field hasn’t really developed its first formal frameworks, techniques, or metrics.

I believe that there’s a lot of low-hanging fruit in this area, and also that a fair amount of the work does need to be done early (e.g., to help inform capabilities research directions — some directions may produce systems that are much easier to align than others). If we don’t solve these problems, developers with arbitrarily good or bad intentions will end up producing equally bad outcomes. From an academic or scientific standpoint, our first objective in that kind of situation should be to remedy this state of affairs and at least make good outcomes technologically possible.

Many people quickly recognize that “natural desires” are a fiction, but infer from this that we instead need to focus on the other issues the media tends to emphasize — “What if bad actors get their hands on smarter-than-human AI?”, “How will this kind of AI impact employment and the distribution of wealth?”, etc. These are important questions, but they’ll only end up actually being relevant if we figure out how to bring general AI systems up to a minimum level of reliability and safety.

Another common thread is “Why not just tell the AI system to (insert intuitive moral precept here)?” On this way of thinking about the problem, often (perhaps unfairly) associated with Isaac Asimov’s writing, ensuring a positive impact from AI systems is largely about coming up with natural-language instructions that are vague enough to subsume a lot of human ethical reasoning:

In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.5

My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:

The better your value learning framework is, the less explicit and precise you need to be in pinpointing your value function 𝘝, and the more you can offload the problem of figuring out what you want to the AI system itself. Value learning, however, raises a number of basic difficulties that don’t crop up in ordinary machine learning tasks.

Classic capabilities research is concentrated in the sorta-argmax and Expectation parts of the diagram, but sorta-argmax also contains what I currently view as the most neglected, tractable, and important safety problems. The easiest way to see why “hooking up the value learning process correctly to the system’s capabilities” is likely to be an important and difficult challenge in its own right is to consider the case of our own biological history.

Natural selection is the only “engineering” process we know of that has ever led to a generally intelligent artifact: the human brain. Since natural selection relies on a fairly unintelligent hill-climbing approach, one lesson we can take away from this is that it’s possible to reach general intelligence with a hill-climbing approach and enough brute force — though we can presumably do better with our human creativity and foresight.

Another key take-away is that natural selection was maximally strict about only optimizing brains for a single very simple goal: genetic fitness. In spite of this, the internal objectives that humans represent as their goals are not genetic fitness. We have innumerable goals — love, justice, beauty, mercy, fun, esteem, good food, good health, … — that correlated with good survival and reproduction strategies in the ancestral savanna. However, we ended up valuing these correlates directly, rather than valuing propagation of our genes as an end in itself — as demonstrated every time we employ birth control.

This is a case where the external optimization pressure on an artifact resulted in a general intelligence with internal objectives that didn’t match the external selection pressure. And just as this caused humans’ actions to diverge from natural selection’s pseudo-goal once we gained new capabilities, we can expect AI systems’ actions to diverge from humans’ if we treat their inner workings as black boxes.

If we apply gradient descent to a black box, trying to get it to be very good at maximizing some objective, then with enough ingenuity and patience, we may be able to produce a powerful optimization process of some kind.6 By default, we should expect an artifact like that to have a goal 𝗨 that strongly correlates with our objective 𝘝 in the training environment, but sharply diverges from 𝘝 in some new environments or when a much wider option set becomes available.

On my view, the most important part of the alignment problem is ensuring that the value learning framework and overall system design we implement allow us to crack open the hood and confirm when the internal targets the system is optimizing for match (or don’t match) the targets we’re externally selecting through the learning process.7

We expect this to be technically difficult, and if we can’t get it right, then it doesn’t matter who’s standing closest to the AI system when it’s developed. Good intentions aren’t sneezed into computer programs by kind-hearted programmers, and coming up with plausible goals for advanced AI systems doesn’t help if we can’t align the system’s cognitive labor with a given goal.

Four key propositions

Taking another step back: I’ve given some examples of open problems in this area (suspend buttons, value learning, limited task-based AI, etc.), and I’ve outlined what I consider to be the major problem categories. But my initial characterization of why I consider this an important area — “AI could automate general-purpose scientific reasoning, and general-purpose scientific reasoning is a big deal” — was fairly vague. What are the core reasons to prioritize this work?

First, goals and capabilities are orthogonal. That is, knowing an AI system’s objective function doesn’t tell you how good it is at optimizing that function, and knowing that something is a powerful optimizer doesn’t tell you what it’s optimizing.

I think most programmers intuitively understand this. Some people will insist that when a machine tasked with filling a cauldron gets smart enough, it will abandon cauldron-filling as a goal unworthy of its intelligence. From a computer science perspective, the obvious response is that you could go out of your way to build a system that exhibits that conditional behavior, but you could also build a system that doesn’t exhibit that conditional behavior. It can just keeps searching for actions that have a higher score on the “fill a cauldron” metric. You and I might get bored if someone told us to just keep searching for better actions, but it’s entirely possible to write a program that executes a search and never gets bored.8

Second, sufficiently optimized objectives tend to converge on adversarial instrumental strategies. Most objectives a smarter-than-human AI system could possess would be furthered by subgoals like “acquire resources” and “remain operational” (along with “learn more about the environment,” etc.).

This was the problem suspend buttons ran into: even if you don’t explicitly include “remain operational” in your goal specification, whatever goal you did load into the system is likely to be better achieved if the system remains online. Software systems’ capabilities and (terminal) goals are orthogonal, but they’ll often exhibit similar behaviors if a certain class of actions is useful for a wide variety of possible goals.

To use an example due to Stuart Russell: If you build a robot and program it to go to the supermarket to fetch some milk, and the robot’s model says that one of the paths is much safer than the other, then the robot, in optimizing for the probability that it returns with milk, will automatically take the safer path. It’s not that the system fears death, but that it can’t fetch the milk if it’s dead.

Third, general-purpose AI systems are likely to show large and rapid capability gains. The human brain isn’t anywhere near the upper limits for hardware performance (or, one assumes, software performance), and there are a number of other reasons to expect large capability advantages and rapid capability gain from advanced AI systems.

As a simple example, Google can buy a promising AI startup and throw huge numbers of GPUs at them, resulting in a quick jump from “these problems look maybe relevant a decade from now” to “we need to solve all of these problems in the next year.”9

Fourth, aligning advanced AI systems with our interests looks difficult. I’ll say more about why I think this presently.

Roughly speaking, the first proposition says that AI systems won’t naturally end up sharing our objectives. The second says that by default, systems with substantially different objectives are likely to end up adversarially competing for control of limited resources. The third suggests that adversarial general-purpose AI systems are likely to have a strong advantage over humans. And the fourth says that this problem is hard to solve — for example, that it’s hard to transmit our values to AI systems (addressing orthogonality) or avert adversarial incentives (addressing convergent instrumental strategies).

These four propositions don’t mean that we’re screwed, but they mean that this problem is critically important. General-purpose AI has the potential to bring enormous benefits if we solve this problem, but we do need to make finding solutions a priority for the field.

### Fundamental difficulties

Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems. I encourage you to look at some of the problems yourself and try to solve them in toy settings; we could use more eyes here. I’ll also make note of a few structural reasons to expect these problems to be hard:

First, aligning advanced AI systems with our interests looks difficult for the same reason rocket engineering is more difficult than airplane engineering.

Before looking at the details, it’s natural to think “it’s all just AI” and assume that the kinds of safety work relevant to current systems are the same as the kinds you need when systems surpass human performance. On that view, it’s not obvious that we should work on these issues now, given that they might all be worked out in the course of narrow AI research (e.g., making sure that self-driving cars don’t crash).

Similarly, at a glance someone might say, “Why would rocket engineering be fundamentally harder than airplane engineering? It’s all just material science and aerodynamics in the end, isn’t it?” In spite of this, empirically, the proportion of rockets that explode is far higher than the proportion of airplanes that crash. The reason for this is that a rocket is put under much greater stress and pressure than an airplane, and small failures are much more likely to be highly destructive.10

Analogously, even though general AI and narrow AI are “just AI” in some sense, we can expect that the more general AI systems are likely to experience a wider range of stressors, and possess more dangerous failure modes.

For example, once an AI system begins modeling the fact that (i) your actions affect its ability to achieve its objectives, (ii) your actions depend on your model of the world, and (iii) your model of the world is affected by its actions, the degree to which minor inaccuracies can lead to harmful behavior increases, and the potential harmfulness of its behavior (which can now include, e.g., deception) also increases. In the case of AI, as with rockets, greater capability makes it easier for small defects to cause big problems.

Second, alignment looks difficult for the same reason it’s harder to build a good space probe than to write a good app.

You can find a number of interesting engineering practices at NASA. They do things like take three independent teams, give each of them the same engineering spec, and tell them to design the same software system; and then they choose between implementations by majority vote. The system that they actually deploy consults all three systems when making a choice, and if the three systems disagree, the choice is made by majority vote. The idea is that any one implementation will have bugs, but it’s unlikely all three implementations will have a bug in the same place.

This is significantly more caution than goes into the deployment of, say, the new WhatsApp. One big reason for the difference is that it’s hard to roll back a space probe. You can send version updates to a space probe and correct software bugs, but only if the probe’s antenna and receiver work, and if all the code required to apply the patch is working. If your system for applying patches is itself failing, then there’s nothing to be done.

In that respect, smarter-than-human AI is more like a space probe than like an ordinary software project. If you’re trying to build something smarter than yourself, there are parts of the system that have to work perfectly on the first real deployment. We can do all the test runs we want, but once the system is out there, we can only make online improvements if the code that makes the system allow those improvements is working correctly.

If nothing yet has struck fear into your heart, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that works correctly on the first deploy.

Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.

Suppose you have a dozen different vulnerabilities in your code, none of which is itself fatal or even really problematic in ordinary settings. Security is difficult because you need to account for intelligent attackers who might find all twelve vulnerabilities and chain them together in a novel way to break into (or just break) your system. Failure modes that would never arise by accident can be sought out and exploited; weird and extreme contexts can be instantiated by an attacker to cause your code to follow some crazy code path that you never considered.

A similar sort of problem arises with AI. The problem I’m highlighting here is not that AI systems might act adversarially: AI alignment as a research program is all about finding ways to prevent adversarial behavior before it can crop up. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.

The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized.11 Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.

It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.

Summing up, we should approach a problem like this with the same level of rigor and caution we’d use for a security-critical rocket-launched space probe, and do the legwork as early as possible. At this early stage, a key part of the work is just to formalize basic concepts and ideas so that others can critique them and build on them. It’s one thing to have a philosophical debate about what kinds of suspend buttons people intuit ought to work, and another thing to translate your intuition into an equation so that others can fully evaluate your reasoning.

This is a crucial project, and I encourage all of you who are interested in these problems to get involved and try your hand at them. There are ample resources online for learning more about the open technical problems. Some good places to start include MIRI’s research agendas and a great paper from researchers at Google Brain, OpenAI, and Stanford called “Concrete Problems in AI Safety.”

1. An airplane can’t heal its injuries or reproduce, though it can carry heavy cargo quite a bit further and faster than a bird. Airplanes are simpler than birds in many respects, while also being significantly more capable in terms of carrying capacity and speed (for which they were designed). It’s plausible that early automated scientists will likewise be simpler than the human mind in many respects, while being significantly more capable in certain key dimensions. And just as the construction and design principles of aircraft look alien relative to the architecture of biological creatures, we should expect the design of highly capable AI systems to be quite alien when compared to the architecture of the human mind.
2. Trying to give some formal content to these attempts to differentiate task-like goals from open-ended goals is one way of generating open research problems. In the “Alignment for Advanced Machine Learning Systems” research proposal, the problem of formalizing “don’t try too hard” is mild optimization, “steer clear of absurd strategies” is conservatism, and “don’t have large unanticipated consequences” is impact measures. See also “avoiding negative side effects” in Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané’s “Concrete Problems in AI Safety.”
3. One thing we’ve learned in the field of machine vision over the last few decades is that it’s hopeless to specify by hand what a cat looks like, but that it’s not too hard to specify a learning system that can learn to recognize cats. It’s even more hopeless to specify everything we value by hand, but it’s plausible that we could specify a learning system that can learn the relevant concept of “value.”
4. Roughly speaking, MIRI’s focus is on research directions that seem likely to help us conceptually understand how to do AI alignment in principle, so we’re fundamentally less confused about the kind of work that’s likely to be needed.What do I mean by this? Let’s say that we’re trying to develop a new chess-playing programs. Do we understand the problem well enough that we could solve it if someone handed us an arbitrarily large computer? Yes: We make the whole search tree, backtrack, see whether white has a winning move.If we didn’t know how to answer the question even with an arbitrarily large computer, then this would suggest that we were fundamentally confused about chess in some way. We’d either be missing the search-tree data structure or the backtracking algorithm, or we’d be missing some understanding of how chess works.This was the position we were in regarding chess prior to Claude Shannon’s seminal paper, and it’s the position we’re currently in regarding many problems in AI alignment. No matter how large a computer you hand me, I could not make a smarter-than-human AI system that performs even a very simple limited-scope task (e.g., “put a strawberry on a plate without producing any catastrophic side-effects”) or achieves even a very simple open-ended goal (e.g., “maximize the amount of diamond in the universe”).If I didn’t have any particular goal in mind for the system, I could write a program (assuming an arbitrarily large computer) that strongly optimized the future in an undirected way, using a formalism like AIXI. In that sense we’re less obviously confused about capabilities than about alignment, even though we’re still missing a lot of pieces of the puzzle on the practical capabilities front.Our goal is to develop and formalize basic approaches and ways of thinking about the alignment problem, so that our engineering decisions don’t end up depending on sophisticated and clever-sounding verbal arguments that turn out to be subtly mistaken. Simplifications like “what if we weren’t worried about resource constraints?” and “what if we were trying to achieve a much simpler goal?” are a good place to start breaking down the problem into manageable pieces. For more on this methodology, see “MIRI’s Approach.”
5. “Fill this cauldron without being too clever about it or working too hard or having any negative consequences I’m not anticipating” is a rough example of a goal that’s intuitively limited in scope. The things we actually want to use smarter-than-human AI for are obviously more ambitious than that, but we’d still want to begin with various limited-scope tasks rather than open-ended goals.Asimov’s Three Laws of Robotics make for good stories partly for the same reasons they’re unhelpful from a research perspective. The hard task of turning a moral precept into lines of code is hidden behind phrasings like “[don’t,] through inaction, allow a human being to come to harm.” If one followed a rule like that strictly, the result would be massively disruptive, as AI systems would need to systematically intervene to prevent even the smallest risks of even the slightest harms; and if the intent is that one follow the rule loosely, then all the work is being done by the human sensibilities and intuitions that tell us when and how to apply the rule.A common response here is that vague natural-language instruction is sufficient, because smarter-than-human AI systems are likely to be capable of natural language comprehension. However, this is eliding the distinction between the system’s objective function and its model of the world. A system acting in an environment containing humans may learn a world-model that has lots of information about human language and concepts, which the system can then use to achieve its objective function; but this fact doesn’t imply that any of the information about human language and concepts will “leak out” and alter the system’s objective function directly.Some kind of value learning process needs to be defined where the objective function itself improves with new information. This is a tricky task because there aren’t known (scalable) metrics or criteria for value learning in the way that there are for conventional learning.If a system’s world-model is accurate in training environments but fails in the real world, then this is likely to result in lower scores on its objective function — the system itself has an incentive to improve. The severity of accidents is also likelier to be self-limiting in this case, since false beliefs limit a system’s ability to effectively pursue strategies.In contrast, if a system’s value learning process results in a 𝗨 that matches our 𝘝 in training but diverges from 𝘝 in the real world, then the system’s 𝗨 will obviously not penalize it for optimizing 𝗨. The system has no incentive relative to 𝗨 to “correct” divergences between 𝗨 and 𝘝, if the value learning process is initially flawed. And accident risk is larger in this case, since a mismatch between 𝗨 and 𝘝 doesn’t necessarily place any limits on the system’s instrumental effectiveness at coming up with effective and creative strategies for achieving 𝗨.The problem is threefold:1. “Do What I Mean” is an informal idea, and even if we knew how to build a smarter-than-human AI system, we wouldn’t know how to precisely specify this idea in lines of code.2. If doing what we actually mean is instrumentally useful for achieving a particular objective, then a sufficiently capable system may learn how to do this, and may act accordingly so long as doing so is useful for its objective. But as systems become more capable, they are likely to find creative new ways to achieve the same objectives, and there is no obvious way to get an assurance that “doing what I mean” will continue to be instrumentally useful indefinitely.

3. If we use value learning to refine a system’s goals over time based on training data that appears to be guiding the system toward a 𝗨 that inherently values doing what we mean, it is likely that the system will actually end up zeroing in on a 𝗨 that approximately does what we mean during training but catastrophically diverges in some difficult-to-anticipate contexts. See “Goodhart’s Curse” for more on this.

For examples of problems faced by existing techniques for learning goals and facts, such as reinforcement learning, see “Using Machine Learning to Address AI Risk.”

6. The result will probably not be a particularly human-like design, since so many complex historical contingencies were involved in our evolution. The result will also be able to benefit from a number of large software and hardware advantages.
7. This concept is sometimes lumped into the “transparency” category, but standard algorithmic transparency research isn’t really addressing this particular problem. A better term for what I have in mind here is “understanding.” What we want is to gain deeper and broader insights into the kind of cognitive work the system is doing and how this work relates to the system’s objectives or optimization targets, to provide a conceptual lens with which to make sense of the hands-on engineering work.
8. We could choose to program the system to tire, but we don’t have to. In principle, one could program a broom that only ever finds and executes actions that optimize the fullness of the cauldron. Improving the system’s ability to efficiently find high-scoring actions (in general, or relative to a particular scoring rule) doesn’t in itself change the scoring rule it’s using to evaluate actions.
9. Some other examples: a system’s performance may suddenly improve when it’s first given large-scale Internet access, when there’s a conceptual breakthrough in algorithm design, or when the system itself is able to propose improvements to its hardware and software. We can imagine the latter case in particular resulting in a feedback loop as the system’s design improvements allow it to come up with further design improvements, until all the low-hanging fruit is exhausted.Another important consideration is that two of the main bottlenecks to humans doing faster scientific research are training time and communication bandwidth. If we could train a new mind to be a cutting-edge scientist in ten minutes, and if scientists could near-instantly trade their experience, knowledge, concepts, ideas, and intuitions to their collaborators, then scientific progress might be able to proceed much more rapidly. Those sorts of bottlenecks are exactly the sort of bottleneck that might give automated innovators an enormous edge over human innovators even without large advantages in hardware or algorithms.
10. Specifically, rockets experience a wider range of temperatures and pressures, traverse those ranges more rapidly, and are also packed more fully with explosives.
11. Consider Bird and Layzell’s example of a very simple genetic algorithm that was tasked with evolving an oscillating circuit. Bird and Layzell were astonished to find that the algorithm made no use of the capacitor on the chip; instead, it had repurposed the circuit tracks on the motherboard as a radio to replay the oscillating signal from the test device back to the test device.This was not a very smart program. This is just using hill climbing on a very small solution space. In spite of this, the solution turned out to be outside the space of solutions the programmers were themselves visualizing. In a computer simulation, this algorithm might have behaved as intended, but the actual solution space in the real world was wider than that, allowing hardware-level interventions.In the case of an intelligent system that’s significantly smarter than humans on whatever axes you’re measuring, you should by default expect the system to push toward weird and creative solutions like these, and for the chosen solution to be difficult to anticipate.