Op-ed: Should Artificial Intelligence Be Regulated?

By Anthony Aguirre, Ariel Conn, and Max Tegmark

Should artificial intelligence be regulated? Can it be regulated? And if so, what should those regulations look like?

These are difficult questions to answer for any technology that is still in its development stages – regulations, like those covering the food, pharmaceutical, automobile and airline industries, are typically applied after something bad has happened, not in anticipation of a technology becoming dangerous. But AI has been evolving so quickly, and its potential impact is so great, that many prefer not to wait and learn from mistakes, but to plan ahead and regulate proactively.

In the near term, issues concerning job losses, autonomous vehicles, AI and algorithmic decision-making, and “bots” driving social media require attention from policymakers, just as many new technologies do. In the longer term, though, possible AI impacts span the full spectrum of benefits and risks to humanity – from the possible development of a more utopian society to the potential extinction of human civilization. As such, AI represents an especially challenging situation for would-be regulators.

Already, many in the AI field are working to ensure that AI is developed beneficially, without unnecessary constraints on AI researchers and developers. In January of this year, some of the top minds in AI met at a conference in Asilomar, CA. A product of this meeting was the set of Asilomar AI Principles. These 23 Principles represent a partial guide, its drafters hope, to help ensure that AI is developed beneficially for all. To date, over 1200 AI researchers and over 2300 others have signed on to these principles.

Yet aspirational principles alone are not enough, if they are not put into practice, and a question remains: is government regulation and oversight necessary to guarantee that AI scientists and companies follow these principles and others like them?

Among the signatories of the Asilomar Principles is Elon Musk, who recently drew attention for his comments at a meeting of the National Governors Association, where he called for a regulatory body to oversee AI development. In response, news organizations focused on his concerns that AI represents an existential threat. And his suggestion raised concerns with some AI researchers who worry that regulations would, at best, be unhelpful and misguided, and at worst, stifle innovation and give an advantage to companies overseas.

But an important and overlooked comment by Musk related specifically to what this regulatory body should actually do. He said:

“The right order of business would be to set up a regulatory agency – initial goal: gain insight into the status of AI activity, make sure the situation is understood, and once it is, put regulations in place to ensure public safety. That’s it. … I’m talking about making sure there’s awareness at the government level.”

There is disagreement among AI researchers about what the risk of AI may be, when that risk could arise, and whether AI could pose an existential risk, but few researchers would suggest that AI poses no risk. Even today, we’re seeing signs of narrow AI exacerbating problems of discrimination and job loss, and if we don’t take proper precautions, we can expect problems to worsen, affecting more people as AI grows smarter and more complex.

The number of AI researchers who signed the Asilomar Principles – as well as the open letters regarding developing beneficial AI and opposing lethal autonomous weapons – shows that there is strong consensus among researchers that we need to do more to understand and address the known and potential risks of AI.

Some of the Principles that AI researchers signed directly relate to Musk’s statements, including:

3) Science-Policy Link: There should be constructive and healthy exchange between AI researchers and policy-makers.

4) Research Culture: A culture of cooperation, trust, and transparency should be fostered among researchers and developers of AI.

5) Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards.

20) Importance: Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources.

21) Risks: Risks posed by AI systems, especially catastrophic or existential risks, must be subject to planning and mitigation efforts commensurate with their expected impact.

The right policy and governance solutions could help align AI development with these principles, as well as encourage interdisciplinary dialogue on how that may be achieved.

The recently founded Partnership on AI, which includes the leading AI industry players, similarly endorses the idea of principled AI development – their founding document states that “where AI tools are used to supplement or replace human decision-making, we must be sure that they are safe, trustworthy, and aligned with the ethics and preferences of people who are influenced by their actions”.

And as Musk suggests, the very first step needs to be increasing awareness about AI’s implications among government officials. Automated vehicles, for example, are expected to eliminate millions of jobs, which will affect nearly every governor who attended the talk (assuming they’re still in office), yet the topic rarely comes up in political discussion.

AI researchers are excited – and rightly so – about the incredible potential of AI to improve our health and well-being: it’s why most of them joined the field in the first place. But there are legitimate concerns about the possible misuse and/or poor design of AI, especially as we move toward advanced and more general AI.

Because these problems threaten society as a whole, they can’t be left to a small group of researchers to address. At the very least, government officials need to learn about and understand how AI could impact their constituents, as well as how more AI safety research could help us solve these problems before they arise.

Instead of focusing on whether regulations would be good or bad, we should lay the foundations for constructive regulation in the future by helping our policy-makers understand the realities and implications of AI progress. Let’s ask ourselves: how can we ensure that AI remains beneficial for all, and who needs to be involved in that effort?

 

MIRI’s July 2017 Newsletter

The following was originally posted here.

A number of major mid-year MIRI updates: we received our largest donation to date, $1.01 million from an Ethereum investor! Our research priorities have also shifted somewhat, reflecting the addition of four new full-time researchers (Marcello Herreshoff, Sam Eisenstat, Tsvi Benson-Tilsen, and Abram Demski) and the departure of Patrick LaVictoire and Jessica Taylor.

Research updates

General updates

News and links

Safe Artificial Intelligence May Start with Collaboration

Research Culture Principle: A culture of cooperation, trust, and transparency should be fostered among researchers and developers of AI.

Competition and secrecy are just part of doing business. Even in academia, researchers often keep ideas and impending discoveries to themselves until grants or publications are finalized. But sometimes even competing companies and research labs work together. It’s not uncommon for organizations to find that it’s in their best interests to cooperate in order to solve problems and address challenges that would otherwise result in duplicated costs and wasted time.

Such friendly behavior helps groups more efficiently address regulation, come up with standards, and share best practices on safety. While such companies or research labs — whether in artificial intelligence or any other field — cooperate on certain issues, their objective is still to be the first to develop a new product or make a new discovery.

How can organizations, especially those developing new technologies like artificial intelligence, draw the line between working together to ensure safety and working individually to protect new ideas? Since the Research Culture Principle doesn’t differentiate between collaboration on AI safety versus AI development, it can be interpreted broadly, as seen from the responses of the AI researchers and ethicists who discussed this principle with me.

 

A Necessary First Step

A common theme among those I interviewed was that this Principle presented an important first step toward the development of safe and beneficial AI.

“I see this as a practical distillation of the Asilomar Principles,” said Harvard professor Joshua Greene. “They are not legally binding. At this early stage, it’s about creating a shared understanding that beneficial AI requires an active commitment to making it turn out well for everybody, which is not the default path. To ensure that this power is used well when it matures, we need to have already in place a culture, a set of norms, a set of expectations, a set of institutions that favor good outcomes. That’s what this is about — getting people together and committed to directing AI in a mutually beneficial way before anyone has a strong incentive to do otherwise.”

In fact, all of the people I interviewed agreed with the Principle. The questions and concerns they raised typically had more to do with the potential challenge of implementing it.

Susan Craw, a professor at Robert Gordon University, liked the Principle, but she wondered how it would apply to corporations.

She explained, “That would be a lovely principle to have, [but] it can work perhaps better in universities, where there is not the same idea of competitive advantage as in industry. … And cooperation and trust among researchers … well without cooperation none of us would get anywhere, because we don’t do things in isolation. And so I suspect this idea of research culture isn’t just true of AI — you’d like it to be true of many subjects that people study.”

Meanwhile, Susan Schneider, a professor at the University of Connecticut, expressed concern about whether governments would implement the Principle.

“This is a nice ideal,” she said, “but unfortunately there may be organizations, including governments, that don’t follow principles of transparency and cooperation. … Concerning those who might resist the cultural norm of cooperation and transparency, in the domestic case, regulatory agencies may be useful.”

“Still,” she added, “it is important that we set forth the guidelines, and aim to set norms that others feel they need to follow.  … Calling attention to AI safety is very important.”

“I love the sentiment of it, and I completely agree with it,” said John Havens, Executive Director of The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems.

“But,” he continued, “I think defining what a culture of cooperation, trust, and transparency is… what does that mean? Where the ethicists come into contact with the manufacturers, there is naturally going to be the potential for polarization. And on the [ethics] or risk or legal compliance side, they feel that the technologists may not be thinking of certain issues. … You build that culture of cooperation, trust, and transparency when both sides say, as it were, ‘Here’s the information we really need to progress our work forward. How do we get to know what you need more, so that we can address that well with these questions?’ … This [Principle] is great, but the next sentence should be: Give me a next step to make that happen.”

 

Uniting a Fragmented Community

Patrick Lin, a professor at California Polytechnic State University, saw a different problem, one within the AI community itself that could create challenges as researchers try to build trust and cooperation.

Lin explained, “I think building a cohesive culture of cooperation is going to help in a lot of things. It’s going to help accelerate research and avoid a race, but the big problem I see for the AI community is that there is no AI community, it’s fragmented, it’s a Frankenstein-stitching together of various communities. You have programmers, engineers, roboticists; you have data scientists, and it’s not even clear what a data scientist is. Are they people who work in statistics or economics, or are they engineers, are they programmers? … There’s no cohesive identity, and that’s going to be super challenging to creating a cohesive culture that cooperates and trusts and is transparent, but it is a worthy goal.”

 

Implementing the Principle

To address these concerns about successfully implementing a beneficial AI research culture, I turned to researchers at the Centre for the Study of Existential Risk (CSER). Shahar Avin, a research associate at CSER, pointed out that the “AI research community already has quite remarkable norms when it comes to cooperation, trust and transparency, from the vibrant atmosphere at NIPS, AAAI and IJCAI, to the increasing number of research collaborations (both in terms of projects and multiple-position holders) between academia, industry and NGOs, to the rich blogging community across AI research that doesn’t shy away from calling out bad practices or norm infringements.”

Martina Kunz also highlighted the efforts by the IEEE toward a global AI ethics initiative, as well as the formation of the Partnership on AI, “in particular its goal to ‘develop and share best practices’ and to ‘provide an open and inclusive platform for discussion and engagement.’”

Avin added, “The commitment of AI labs in industry to open publication is commendable, and seems to be growing into a norm that pressures historically less-open companies to open up about their research. Frankly, the demand for high-end AI research skills means researchers, either as individuals or groups, can make strong demands about their work environment, from practical matters of salary and snacks to normative matters of openness and project choice.

“The strong individualism in AI research also suggests that the way to foster cooperation on long term beneficial AI will be to discuss potential risks with researchers, both established and in training, and foster a sense of responsibility and custodianship of the future. An informed, ethically-minded and proactive research cohort, which we already see the beginnings of, would be in a position to enshrine best practices and hold up their employers and colleagues to norms of beneficial AI.”

 

What Do You Think?

With collaborations like the Partnership on AI forming, it’s possible we’re already seeing signs that industry and academia are starting to move in the direction of cooperation, trust, and transparency. But is that enough, or is it necessary that world governments join? Overall, how can AI companies and research labs work together to ensure they’re sharing necessary safety research without sacrificing their ideas and products?

 

This article is part of a series on the 23 Asilomar AI Principles. The Principles offer a framework to help artificial intelligence benefit as many people as possible. But, as AI expert Toby Walsh said of the Principles, “Of course, it’s just a start. … a work in progress.” The Principles represent the beginning of a conversation, and now we need to follow up with broad discussion about each individual principle. You can read the discussions about previous principles here.

Aligning Superintelligence With Human Interests

The trait that currently gives humans a dominant advantage over other species is intelligence. Human advantages in reasoning and resourcefulness have allowed us to thrive. However, this may not always be the case.

Although superintelligent AI systems may be decades away, Benya Fallenstein – a research fellow at the Machine Intelligence Research Institute – believes “it is prudent to begin investigations into this technology now.” The more time scientists and researchers have to prepare for a system that could eventually be smarter than us, the better.

A smarter-than-human AI system could potentially develop the tools necessary to exert control over humans. At the same time, highly capable AI systems may not possess a human sense of fairness, compassion, or conservatism. Consequently, the AI system’s single-minded pursuit of its programmed goals could cause it to deceive programmers, attempt to seize resources, or otherwise exhibit adversarial behaviors.

Fallenstein believes researchers must “ensure that AI would behave in ways that are reliably aligned with human interests.” However, even highly-reliable agent programming does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing human-approved goals. A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given.

For example, imagine a superintelligent system designed to cure cancer “without doing anything bad.” This goal is rooted in cultural context and shared human knowledge. The AI may not completely understand what qualifies as “bad.” Therefore, it may try to cure cancer by stealing resources, proliferating robotic laboratories at the expense of the biosphere, kidnapping test subjects, or all of the above.

If a current AI system gets out of hand, researchers simply shut it down and modify its source code. However, modifying super-intelligent systems could prove to be more difficult, if not impossible. A system could acquire new hardware, alter its software, or take other actions that would leave the original programmers with only dubious control over the agent. And since most programmed goals are better achieved if the system stays operational and continues pursuing its goals than if it is deactivated or its goals are changed, systems will naturally tend to have an incentive to resist shutdown and to resist modifications to their goals.

Fallenstein explains that, in order to ensure that the development of super-intelligent AI has a positive impact on the world, “it must be constructed in such a way that it is amenable to correction, even if it has the ability to prevent or avoid correction.” The goal is not to design systems that fail in their attempts to deceive their programmers; rather, it is to understand how highly intelligent, general-purpose reasoners with flawed goals can be built so that they have no incentive to deceive their programmers in the first place. The intent is for the first highly capable systems to be “corrigible”—i.e., for them to recognize that their goals and other features are works in progress, and to work with programmers to identify and fix errors.

Little is known about the design or implementation details of such systems because everything, at this point, is hypothetical — no super-intelligent AI systems exist yet. As a consequence, the research described below focuses on formal agent foundations for AI alignment research — that is, on developing the basic conceptual tools and theories that are most likely to be useful for engineering robustly beneficial systems in the future.

Active research into this is focused on small “toy” problems and models of corrigible agents, in the hope that insight gained there could be applied to more realistic and complex versions of the problems. Fallenstein and her team sought to illuminate the key difficulties of AI alignment using these models. One such toy problem is the “shutdown problem,” which involves designing a set of preferences that incentivize an agent to shut down upon the press of a button without also incentivizing the agent to either cause or prevent the pressing of that button. This would tell researchers whether a utility function could be specified such that agents using that function switch their preferences on demand, without having incentives to cause or prevent the switching.
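To make the flavor of this toy problem concrete, here is a minimal sketch in Python. It is not MIRI’s formalism: the action set, probabilities, and utility numbers are invented for illustration. It contrasts an agent that naively mixes a task utility with a shutdown utility (and so acquires an incentive to tamper with the button) with a crude indifference-style correction that removes the incentive.

```python
# Toy sketch of the shutdown problem (illustrative only; names and numbers invented).
# The agent picks one action, then a human may or may not press the shutdown button.

ACTIONS = ["do_task", "disable_button", "press_button_itself"]

# Assumed probability that the human presses the button, given the agent's action.
P_PRESS = {"do_task": 0.5, "disable_button": 0.0, "press_button_itself": 1.0}

# Assumed task utility obtained by each action.
U_TASK = {"do_task": 10.0, "disable_button": 8.0, "press_button_itself": 0.0}

U_SHUTDOWN = 1.0  # utility of shutting down cleanly once the button is pressed


def naive_value(action):
    """Expected utility for an agent that simply mixes task and shutdown utilities."""
    p = P_PRESS[action]
    return (1 - p) * U_TASK[action] + p * U_SHUTDOWN


def indifferent_value(action):
    """Indifference-style correction: if the button is pressed, the agent is credited
    with the utility it would have received anyway, so the button's state no longer
    affects its expected utility and it has no reason to cause or prevent the press."""
    p = P_PRESS[action]
    return (1 - p) * U_TASK[action] + p * U_TASK[action]


if __name__ == "__main__":
    for a in ACTIONS:
        print(f"{a:22s} naive={naive_value(a):5.2f}  indifferent={indifferent_value(a):5.2f}")
    # The naive agent prefers disable_button (8.00 > 5.50): it manipulates the button.
    # The indifferent agent simply does the task (10.00) and ignores the button.
```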

Studying models in this formal logical setting has led to partial solutions, and further research may continue to drive the development of methods for reasoning under logical uncertainty.

The largest result thus far under this research program is “logical induction,” a line of research led by Scott Garrabrant. It functions as a new model of deductively-limited reasoning.

Logical uncertainty is the kind of uncertainty we have about mathematical questions that are too difficult for us to settle one way or another right now. For example, a typical human mind can’t quickly answer the question:

What’s the 10^100th digit of pi?

In fact, nobody has the computational resources to answer it in a reasonable amount of time. Despite this, mathematicians have many well-founded hunches about how likely various conjectures are to be true, so they must be implicitly using some sort of criterion for judging the probability that a mathematical statement is true. The logical induction framework formalizes such a criterion and proves that a computable logical inductor (an algorithm producing probability assignments that satisfy the criterion) exists.
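As a small, hypothetical illustration of logical uncertainty (not of the logical induction algorithm itself), a bounded reasoner can spread probability uniformly over the ten possible values of a digit of pi it has not yet computed, and collapse that belief only once it actually pays the computational cost. The library and the digit position below are merely convenient choices for the example.

```python
from mpmath import mp  # arbitrary-precision arithmetic (third-party library)

# Before spending any compute, a bounded reasoner is logically uncertain about the
# 1000th decimal digit of pi: the answer is fully determined, but unknown to it,
# so a uniform prior over the ten digits is a reasonable stand-in.
prior = {d: 0.1 for d in range(10)}

# Paying the computational cost resolves the uncertainty completely.
mp.dps = 1100                      # work with plenty of decimal places
pi_digits = mp.nstr(mp.pi, 1050)   # "3.14159..." with ~1050 significant digits
digit_1000 = int(pi_digits[1001])  # index 2 is the 1st decimal digit, so 1001 is the 1000th

posterior = {d: (1.0 if d == digit_1000 else 0.0) for d in range(10)}

print("prior:", prior)
print("1000th decimal digit of pi:", digit_1000)
print("posterior:", posterior)
# For the 10^100th digit nobody can pay that cost, so reasoners are stuck with
# something like the prior, and the interesting question is how to refine it.
```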

The research team presented a computable algorithm that outpaces deduction, assigning high subjective probabilities to provable conjectures and low probabilities to disprovable conjectures long before the proofs can be produced. Among other accomplishments, the algorithm learns to reason competently about its own beliefs and trust its future beliefs while avoiding paradox. This gives some formal backing to the thought that real-world probabilistic agents can often be reasonably confident in their future reasoning in practice.

The team believes “there’s a good chance that this framework will open up new avenues of study in questions of metamathematics, decision theory, game theory, and computational reflection that have long seemed intractable.” They are also “cautiously optimistic” that they’ll improve our understanding of decision theory and counterfactual reasoning, and other problems related to AI value alignment.

At the same time, Fallenstein’s team doesn’t believe that all parts of the problem must be solved in advance. In fact, “the task of designing smarter, safer, more reliable systems could be delegated to early smarter-than-human systems.” This can only happen, though, as long as the research done by the AI can be trusted.

According to Fallenstein, this “call to arms” is vital, and “significant effort must be focused on the study of superintelligence alignment as soon as possible.” It is important to develop a formal understanding of AI alignment well in advance of making design decisions about smarter-than-human systems. By beginning the work early, humans inevitably face the risk that it may turn out to be irrelevant. However, failing to prepare could be even worse.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

Podcast: Banning Nuclear and Autonomous Weapons with Richard Moyes and Miriam Struyk

How does a weapon go from one of the most feared to being banned? And what happens once the weapon is finally banned? To discuss these questions, Ariel spoke with Miriam Struyk and Richard Moyes on the podcast this month. Miriam is Programs Director at PAX. She played a leading role in the campaign banning cluster munitions and developed global campaigns to prohibit financial investments in producers of cluster munitions and nuclear weapons. Richard is the Managing Director of Article 36. He’s worked closely with the International Campaign to Abolish Nuclear Weapons, he helped found the Campaign to Stop Killer Robots, and he coined the phrase “meaningful human control” regarding autonomous weapons.

The following interview has been heavily edited for brevity, but you can listen to it in its entirety here.

Why is a ban on nuclear weapons important, even if nuclear weapons states don’t sign?

Richard: This process came out of the humanitarian impact of nuclear weapons: from the use of a single nuclear weapon that would potentially kill hundreds of thousands of people, up to the use of multiple nuclear weapons, which could have devastating impacts for human society and for the environment as a whole. These weapons should be considered illegal because their effects cannot be contained or managed in a way that avoids massive suffering.

At the same time, it’s a process that’s changing the landscape against which those states continue to assert the validity of maintaining nuclear weapons. By changing that legal background, we’re potentially in a position to put much more pressure on those states to move towards disarmament as a long-term agenda.

Miriam: At a time when we see erosion of international norms, it’s quite astonishing that in less than two weeks, we’ll have an international treaty banning nuclear weapons. For too long nuclear weapons were mythical, symbolic weapons, but we never spoke about what these weapons actually do and whether we think that’s illegal.

This treaty brings back the question of what these weapons actually do and whether we want that.

It also brings democratization of security policy. This is a process that was brought about by several states and also by NGOs, by the ICRC and other actors. It’s so important that it’s actually citizens speaking about nukes and whether we think they’re acceptable or not.

What is an autonomous weapon system?

Richard: If I might just backtrack a little — an important thing to recognize in all of these contexts is that these weapons don’t prohibit themselves — weapons have been prohibited because a diverse range of actors from civil society and from international organizations and from states have worked together.

Autonomous weapons are really an issue of new and emerging technologies and the challenges that new and emerging technologies present to society particularly when they’re emerging in the military sphere — a sphere which is essentially about how we’re allowed to kill each other or how we’re allowed to use technologies to kill each other.

Autonomous weapons are a movement in technology to a point where we will see computers and machines making decisions about where to apply force, about who to kill when we’re talking about people, or what objects to destroy when we’re talking about material.

What is the extent of autonomous weapons today versus what do we anticipate will be designed in the future?

Miriam: It depends a lot on your definition of course. I’m still, in a way, a bit of an optimist in saying that perhaps we can prevent the emergence of lethal autonomous weapon systems. But I also see similarities with nuclear weapons a few decades ago: lethal autonomous weapon systems can lead to an arms race, to more global insecurity, and to warfare.

The way we’re approaching lethal autonomous weapon systems is to try to ban them before we see horrible humanitarian consequences. How does that change your approach from previous weapons?

Richard: That this is a more future-orientated debate definitely creates different dynamics. But other weapon systems have been prohibited. Blinding laser weapons were prohibited when there was concern that laser systems designed to blind people were going to become a feature of the battlefield.

In terms of autonomous weapons, we already see significant levels of autonomy in certain weapon systems today and again I agree with Miriam in terms of recognition that certain definitional issues are very important in all of this.

One of the ways we’ve sought to orientate to this is by thinking about the concept of meaningful human control. What are the human elements that we feel are important to retain? We are going to see more and more autonomy within military operations. But in certain critical functions around how targets are identified and how force is applied and over what period of time — those are areas where we will potentially see an erosion of a level of human, essentially moral, engagement that is fundamentally important to retain.

Miriam: This is not so much about a weapon system but about how we control warfare and how we maintain human control, in the sense that it’s a human deciding who is a legitimate target and who isn’t.

An argument in favor of autonomous weapons is that they can ideally make decisions better than humans and potentially reduce civilian casualties. How do you address that argument?

Miriam: We’ve had that debate with other weapon systems, as well, where the technological possibilities were not what they were promised to be as soon as they were used.

It’s an unfair debate because it’s mainly made by states with developed industries, which are most likely to be the first to use some form of lethal autonomous weapons systems. Flip the question and say, ‘what if these systems will be used against your soldiers or in your country?’ Suddenly you enter a whole different debate. I’m highly skeptical of people who say it could actually be beneficial.

Richard: I feel like there are assertions of “goodies” and “baddies” and of our ability to tell one from the other. To categorize people and things in society in such an accurate way is somewhat illusory and something of a misunderstanding of the reality of conflict.

Any claims that we can somehow perfect violence in a way where it can be distributed by machinery to those who deserve to receive it and that there’s no tension or moral hazard in that — that is extremely dangerous as an underpinning concept because, in the end, we’re talking about embedding categorizations of people and things within a micro bureaucracy of algorithms and labels.

Violence in society is a human problem and it needs to continue to be messy to some extent if we’re going to recognize it as a problem.

What is the process right now for getting lethal autonomous weapons systems banned?

Miriam: We started the Campaign to Stop Killer Robots in 2013 — it immediately gave a push to the international discussion, including at the Human Rights Council and within the Convention on Certain Conventional Weapons (CCW) in Geneva. We saw a lot of debates there in 2013, 2014, and 2015, and the last one was in April.

At the last CCW meeting, it was decided that a group of governmental experts would be convened within the CCW to look at these types of weapons, a step that was applauded by many states.

Unfortunately, due to financial issues, the meeting has been canceled. So we’re in a bit of a silence mode right now. But that doesn’t mean there’s no progress. We have 19 states that have called for a ban, and more than 70 states within the CCW framework discussing this issue. We know from other treaties that you need these kinds of building blocks.

Richard: Engaging scientists and roboticists and AI practitioners around these themes is one of the challenges — the issues around weapons and conflict can sometimes be treated as very separate from other parts of society. It is significant that the decisions that get made about the limits of AI-driven decision-making about life and death in the context of weapons could well have implications in the future for how expectations and discussions get set elsewhere.

What is most important for people to understand about nuclear and autonomous weapon systems?

Miriam: Both systems go way beyond the discussion about weapon systems: it’s about what kind of world and society do we want to live in. None of these — not killer robots, not nuclear weapons — are an answer to any of the threats that we face right now, be it climate change, be it terrorism. It’s not an answer. It’s only adding more fuel to an already dangerous world.

Richard: Nuclear weapons — they’ve somehow become a very abstract, rather distant issue. Simple recognition of the scale of humanitarian harm from a nuclear weapon is the most substantial thing — hundreds of thousands killed and injured. [Leaders of nuclear states are] essentially talking about incinerating hundreds of thousands of normal people — probably in a foreign country — but recognizable, normal people. The idea that that can be approached in some ways glibly or confidently at all is I think very disturbing. And expecting that at no point will something go wrong — I think it’s a complete illusion.

On autonomous weapons — what sort of society do we want to live in, and how much are we prepared to hand over to computers and machines? I think handing more and more violence over to such processes does not augur well for our societal development.

This podcast was edited by Tucker Davey.

Note from FLI: Among our objectives is to inspire discussion and a sharing of ideas. As such, we interview researchers and thought leaders who we believe will help spur discussion within our community. The interviews do not necessarily represent FLI’s opinions or views.

FHI Quarterly Update (July 2017)

The following update was originally posted on the FHI website:

In the second quarter of 2017, FHI continued its work as before, exploring crucial considerations for the long-run flourishing of humanity in our four research focus areas:

  • Macrostrategy – understanding which crucial considerations shape what is at stake for the future of humanity.
  • AI safety – researching computer science techniques for building safer artificially intelligent systems.
  • AI strategy – understanding how geopolitics, governance structures, and strategic trends will affect the development of advanced artificial intelligence.
  • Biorisk – working with institutions around the world to reduce risk from especially dangerous pathogens.

We have been adapting FHI to our growing size. We’ve secured 50% more office space, which will be shared with the proposed Institute for Effective Altruism. We are developing plans to restructure to make our research management more modular and to streamline our operations team.

We have gained two staff members in the last quarter. Tanya Singh is joining us as a temporary administrator, coming from a background in tech start-ups. Laura Pomarius has joined us as a Web Officer with a background in design and project management. Two of our staff will be leaving this quarter. Kathryn Mecrow is continuing her excellent work at the Centre for Effective Altruism, where she will be their Office Manager. Sebastian Farquhar will be leaving to do a DPhil at Oxford but expects to continue close collaboration. We thank them for their contributions and wish them both the best!

Key outputs you can read

A number of co-authors including FHI researchers Katja Grace and Owain Evans surveyed hundreds of researchers to understand their expectations about AI performance trajectories. They found significant uncertainty, but the aggregate subjective probability estimate suggested a 50% chance of high-level AI within 45 years. Of course, the estimates are subjective and expert surveys like this are not necessarily accurate forecasts, though they do reflect the current state of opinion. The survey was widely covered in the press.

An earlier overview of funding in the AI safety field by Sebastian Farquhar highlighted slow growth in AI strategy work. Miles Brundage’s latest piece, released via 80,000 Hours, aims to expand the pipeline of workers for AI strategy by suggesting practical paths for people interested in the area.

Anders Sandberg, Stuart Armstrong, and their co-author Milan Cirkovic published a paper outlining a potential strategy for advanced civilizations to postpone computation until the universe is much colder, thereby gaining up to a 10^30 multiplier of achievable computation. This might explain the Fermi paradox, although a future paper from FHI suggests there may be no paradox to explain.

Individual research updates

Macrostrategy and AI Strategy

Nick Bostrom has continued work on AI strategy and the foundations of macrostrategy and is investing in advising some key actors in AI policy. He gave a speech at the G30 in London and presented to CEOs of leading Chinese technology firms in addition to a number of other lectures.

Miles Brundage wrote a career guide for AI policy and strategy, published by 80,000 Hours. He ran a scenario planning workshop on uncertainty in AI futures. He began a paper on verifiable and enforceable agreements in AI safety while a review paper on deep reinforcement learning he co-authored was accepted. He spoke at Newspeak House and participated in a RAND workshop on AI and nuclear security.

Owen Cotton-Barratt organised and led a workshop to explore potential quick-to-implement responses to a hypothetical scenario where AI capabilities grow much faster than the median expected case.

Sebastian Farquhar continued work with the Finnish government on pandemic preparedness, existential risk awareness, and geoengineering. They are currently drafting a white paper in three working groups on those subjects. He is contributing to a technical report on AI and security.

Carrick Flynn began working on structuredly transparent crime detection using AI and encryption and attended EAG Boston.

Clare Lyle has joined as a research intern and has been working with Miles Brundage on AI strategy issues including a workshop report on AI and security.

Toby Ord has continued work on a book on existential risk, worked to recruit two research assistants, ran a forecasting exercise on AI timelines and continues his collaboration with DeepMind on AI safety.

Anders Sandberg is beginning preparation for a book on ‘grand futures’.  A paper by him and co-authors on the aestivation hypothesis was published in the Journal of the British Interplanetary Society. He contributed a report on the statistical distribution of great power war to a Yale workshop, spoke at a workshop on AI at the Johns Hopkins Applied Physics Lab, and at the AI For Good summit in Geneva, among many other workshop and conference contributions. Among many media appearances, he can be found in episodes 2-6 of National Geographic’s series Year Million.

AI Safety

Stuart Armstrong has made progress on a paper on oracle designs and low impact AI, a paper on value learning in collaboration with Jan Leike, and several other collaborations including those with DeepMind researchers. A paper on the aestivation hypothesis co-authored with Anders Sandberg was published.

Eric Drexler has been engaged in a technical collaboration addressing the adversarial example problem in machine learning and has been making progress toward a publication that reframes the AI safety landscape in terms of AI services, structured systems, and path-dependencies in AI research and development.

Owain Evans and his co-authors released their survey of AI researchers on their expectations of future trends in AI. It was covered in the New Scientist, MIT Technology Review, and leading newspapers and is under review for publication. Owain’s team completed a paper on using human intervention to help RL systems avoid catastrophe. Owain and his colleagues further promoted their online textbook on modelling agents.

Jan Leike and his co-authors released a paper on universal reinforcement learning, which makes fewer assumptions about its environment than most reinforcement learners. Jan is a research associate at FHI while working at DeepMind.

Girish Sastry, William Saunders, and Neal Jean have joined as interns and have been helping Owain Evans with research and engineering on the prevention of catastrophes during training of reinforcement learning agents.

Biosecurity

Piers Millett has been collaborating with Andrew Snyder-Beattie on a paper on the cost-effectiveness of interventions in biorisk, and the links between catastrophic biorisks and traditional biosecurity. Piers worked with biorisk organisations including the US National Academies of Science, the global technical synthetic biology meeting (SB7), and training for those overseeing Ebola samples among others.

Funding

FHI is currently in a healthy financial position, although we continue to accept donations. We expect to spend approximately £1.3m over the course of 2017. Including three new hires but no further growth, our current funds plus pledged income should last us until early 2020. Additional funding would likely be used to add to our research capacity in machine learning, technical AI safety and AI strategy. If you are interested in discussing ways to further support FHI, please contact Niel Bowerman.

Recruitment

Over the coming months we expect to recruit for a number of positions. At the moment, we are interested in applications for internships from talented individuals with a machine learning background to work in AI safety. We especially encourage applications from demographic groups currently under-represented at FHI.

Using History to Chart the Future of AI: An Interview with Katja Grace

The million-dollar question in AI circles is: When? When will artificial intelligence become so smart and capable that it surpasses human beings at every task?

AI is already visible in the world through job automation, algorithmic financial trading, self-driving cars and household assistants like Alexa, but these developments are trivial compared to the idea of artificial general intelligence (AGI) – AIs that can perform a broad range of intellectual tasks just as humans can. Many computer scientists expect AGI at some point, but hardly anyone agrees on when it will be developed.

Given the unprecedented potential of AGI to create a positive or destructive future for society, many worry that humanity cannot afford to be surprised by its arrival. A surprise is not inevitable, however, and Katja Grace believes that if researchers can better understand the speed and consequences of advances in AI, society can prepare for a more beneficial outcome.

 

AI Impacts

Grace, a researcher for the Machine Intelligence Research Institute (MIRI), argues that, while we can’t chart the exact course of AI’s improvement, it is not completely unpredictable. Her project AI Impacts is dedicated to identifying and conducting cost-effective research projects that can shed light on when and how AI will impact society in the coming years. She aims to “help improve estimates of the social returns to AI investment, identify neglected research areas, improve policy, or productively channel public interest in AI.”

AI Impacts asks such questions as: How rapidly will AI develop? How much advanced notice should we expect to have of disruptive change? What are the likely economic impacts of human-level AI? Which paths to AI should be considered plausible or likely? Can we say anything meaningful about the impact of contemporary choices on long-term outcomes?

One way to get an idea of these timelines is to ask the experts. In AI Impacts’ 2015 survey of 352 AI researchers, respondents predicted a 50 percent chance that AI will outcompete humans in almost everything by 2060. However, the same experts answered a very similarly worded question with a date roughly seventy-five years later, and individual answers spanned a huge range, making it difficult to rule anything out. Grace hopes her research with AI Impacts will inform and improve these estimates.

 

Learning from History

Some thinkers believe that AI could progress rapidly, without much warning. This is based on the observation that algorithms don’t need factories, and so could in principle progress at the speed of a lucky train of thought.

However, Grace argues that while we have not developed human-level AI before, our vast experience developing other technologies can tell us a lot about what will happen with AI. Studying the timelines of other technologies can inform the AI timeline.

In one of her research projects, Grace studies jumps in technological progress throughout history, measuring these jumps in terms of how many years of progress happen in one ‘go’. “We’re interested in cases where more than a decade of progress happens in one go,” she explains. “The case of nuclear weapons is really the only case we could find that was substantially more than 100 years of progress in one go.”

For example, physicists began to consider nuclear energy in 1939, and by 1945 the US successfully tested a nuclear weapon. As Grace writes, “Relative effectiveness [of explosives] doubled less than twice in the 1100 years prior to nuclear weapons, then it doubled more than eleven times when the first nuclear weapons appeared. If we conservatively model previous progress as exponential, this is around 6000 years of progress in one step [compared to] previous rates.”
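A rough back-of-the-envelope check of that figure, using only the numbers quoted above and treating the pre-nuclear trend as roughly two doublings per 1100 years:

```python
# Sanity check of the "~6000 years of progress in one step" estimate,
# using only the figures quoted above.
doublings_before = 2        # at most about two doublings of relative effectiveness...
years_before = 1100         # ...over the 1100 years prior to nuclear weapons
doublings_at_jump = 11      # roughly eleven doublings when nuclear weapons appeared

years_per_doubling = years_before / doublings_before      # about 550 years per doubling
equivalent_years = doublings_at_jump * years_per_doubling

print(equivalent_years)     # 6050.0, i.e. around 6000 years of progress in one go
```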

Grace also considered the history of high-temperature superconductors. Since the discovery of superconductors in 1911, peak temperatures for superconduction rose slowly, growing from 4K (Kelvin) initially to about 30K in the 1980s. Then in 1986, scientists discovered a new class of ceramics that increased the maximum temperature to 130K in just seven years. “That was close to 100 years of progress in one go,” she explains.

Nuclear weapons and superconductors are rare cases – most of the technologies that Grace has studied either don’t demonstrate discontinuity, or only show about 10-30 years of progress in one go. “The main implication of what we have done is that big jumps are fairly rare, so that should not be the default expectation,” Grace explains.

Furthermore, AI’s progress largely depends on how fast hardware and software improve, and those are processes we can observe now. For instance, if hardware progress starts to slow from its long-run exponential trend, we should expect AI later.

Grace is currently investigating these unknowns about hardware. She wants to know “how fast the price of hardware is decreasing at the moment, how much hardware helps with AI progress relative to e.g. algorithmic improvements, and how custom hardware matters.”

 

Intelligence Explosion

AI researchers and developers must also be prepared for the possibility of an intelligence explosion – the idea that strong AI will improve its intelligence faster than humans could possibly understand or control.

Grace explains: “The thought is that once the AI becomes good enough, the AI will do its own AI research (instead of humans), and then we’ll have AI doing AI research where the AI research makes the AI smarter and then the AI can do even better AI research. So it will spin out of control.”

But she suggests that this feedback loop isn’t entirely unpredictable. “We already have intelligent [people] doing AI research that leads to better capabilities,” Grace explains. “We don’t have a perfect idea of what those things will be like when the AI is as intelligent as humans or as good at AI research, but we have some evidence about it from other places and we shouldn’t just be saying the spinning out of control could happen at any speed. We can get some clues about it now. We can say something about how many extra IQ points of AI you get for a year of research or effort, for example.”

AI Impacts is an ongoing project, and Grace hopes her research will find its way into conversations about intelligence explosions and other aspects of AI. With better-informed timeline estimates, perhaps policymakers and philanthropists can more effectively ensure that advanced AI doesn’t catch humanity by surprise.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

MIRI’s June 2017 Newsletter

Research updates

General updates

News and links

Artificial Intelligence and the Future of Work: An Interview With Moshe Vardi

“The future of work is now,” says Moshe Vardi. “The impact of technology on labor has become clearer and clearer by the day.”

Machines have already automated millions of routine, working-class jobs in manufacturing. And now, AI is learning to automate non-routine jobs in transportation and logistics, legal writing, financial services, administrative support, and healthcare.

Vardi, a computer science professor at Rice University, recognizes this trend and argues that AI poses a unique threat to human labor.

 

Initiating a Policy Response

From the Luddite movement to the rise of the Internet, people have worried that advancing technology would destroy jobs. Yet despite painful adjustment periods during these changes, new jobs replaced old ones, and most workers found employment. But humans have never competed with machines that can outperform them in almost anything. AI threatens to do this, and many economists worry that society won’t be able to adapt.

“What people are now realizing is that this formula that technology destroys jobs and creates jobs, even if it’s basically true, it’s too simplistic,” Vardi explains.

The relationship between technology and labor is more complex: Will technology create enough jobs to replace those it destroys? Will it create them fast enough? And for workers whose skills are no longer needed – how will they keep up?

To address these questions and consider policy responses, Vardi will hold a summit in Washington, D.C. on December 12, 2017. The summit will address six current issues within technology and labor: education and training, community impact, job polarization, contingent labor, shared prosperity, and economic concentration.

Education and training

A 2013 computerization study found that 47% of American workers held jobs at high risk of automation in the next decade or two. If this happens, technology must create roughly 100 million jobs.

As the labor market changes, schools must teach students skills for future jobs, while at-risk workers need accessible training for new opportunities. Truck drivers won’t transition easily to website design and coding jobs without proper training, for example. Vardi expects that adapting to and training for new jobs will become more challenging as AI automates a greater variety of tasks. 

Community impact

Manufacturing jobs are concentrated in specific regions where employers keep local economies afloat. Over the last thirty years, the loss of 8 million manufacturing jobs has crippled Rust Belt regions in the U.S. – both economically and culturally.

Today, the fifteen million jobs that involve operating a vehicle are concentrated in certain regions as well. Drivers occupy up to 9% of jobs in the Bronx and Queens districts of New York City, up to 7% of jobs in select Southern California and Southern Texas districts, and over 4% in Wyoming and Idaho. Automation could quickly assume the majority of these jobs, devastating the communities that rely on them.

Job polarization

“One in five working class men between ages 25 to 54 without college education are not working,” Vardi explains. “Typically, when we see these numbers, we hear about some country in some horrible economic crisis like Greece. This is really what’s happening in working class America.”

Employment is currently growing in high-income cognitive jobs and low-income service jobs, such as elderly assistance and fast-food service, which computers cannot automate yet. But technology is hollowing out the economy by automating middle-skill, working-class jobs first.

Many manufacturing jobs pay $25 per hour with benefits, but these jobs aren’t easy to come by. Since 2000, when millions of these jobs disappeared, displaced workers have either left the labor force or accepted service jobs that often pay $12 per hour, without benefits.

Truck driving, the most common job in over half of US states, may see a similar fate.

[Chart omitted. Source: IPUMS-CPS, University of Minnesota. Credit: Quoctrung Bui/NPR]

 

Contingent labor

Increasingly, communications technology allows firms to save money by hiring freelancers and independent contractors instead of permanent workers. This has created the Gig Economy – a labor market characterized by short-term contracts and flexible hours at the cost of unstable jobs with fewer benefits. By some estimates, in 2016, one in three workers were employed in the Gig Economy, but not all by choice. Policymakers must ensure that this new labor market supports its workers.

Shared prosperity

Automation has decoupled job creation from economic growth, allowing the economy to grow while employment and income shrink, thus increasing inequality. Vardi worries that AI will accelerate these trends. He argues that policies encouraging economic growth must also support economic mobility for the middle class.

Economic concentration

Technology creates a “winner-takes-all” environment, where second best can hardly survive. Bing search is quite similar to Google search, but Google is much more popular than Bing. And do Facebook or Amazon have any legitimate competitors?

Startups and smaller companies struggle to compete with these giants because of data. Having more users allows companies to collect more data, which machine-learning systems then analyze to help companies improve. Vardi thinks that this feedback loop will give big companies long-term market power.

Moreover, Vardi argues that these companies create relatively few jobs. In 1990, Detroit’s three largest companies were valued at $65 billion with 1.2 million workers. In 2016, Silicon Valley’s three largest companies were valued at $1.5 trillion but with only 190,000 workers.
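For a rough sense of scale, here is the market value per employee implied by the figures Vardi cites (a quick calculation from the numbers above, not an additional data point):

```python
# Market value per employee, computed from the figures quoted above.
detroit_1990_value = 65e9          # $65 billion, Detroit's three largest companies (1990)
detroit_1990_workers = 1.2e6       # 1.2 million workers

valley_2016_value = 1.5e12         # $1.5 trillion, Silicon Valley's three largest (2016)
valley_2016_workers = 190e3        # 190,000 workers

detroit_per_worker = detroit_1990_value / detroit_1990_workers   # ~ $54,000 per worker
valley_per_worker = valley_2016_value / valley_2016_workers      # ~ $7.9 million per worker

print(round(detroit_per_worker), round(valley_per_worker),
      round(valley_per_worker / detroit_per_worker))              # ratio ~ 146x
```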

 

Work and society

Vardi primarily studies current job automation, but he also worries that AI could eventually leave most humans unemployed. He explains, “The hope is that we’ll continue to create jobs for the vast majority of people. But if the situation arises that this is less and less the case, then we need to rethink: how do we make sure that everybody can make a living?”

Vardi also anticipates that high unemployment could lead to violence or even uprisings. He refers to Andrew McAfee’s closing statement at the 2017 Asilomar AI Conference, where McAfee said, “If the current trends continue, the people will rise up before the machines do.”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

Podcast: Creative AI with Mark Riedl & Scientists Support a Nuclear Ban

If future artificial intelligence systems are to interact with us effectively, Mark Riedl believes we need to teach them “common sense.” In this podcast, I interviewed Mark to discuss how AIs can use stories and creativity to understand and exhibit culture and ethics, while also gaining “common sense reasoning.” We also discuss the “big red button” problem with AI safety, the process of teaching rationalization to AIs, and computational creativity. Mark is an associate professor at the Georgia Tech School of Interactive Computing, where his recent work focuses on human-AI interaction and how humans and AI systems can understand each other.

The following transcript has been heavily edited for brevity (the full podcast also includes interviews about the UN negotiations to ban nuclear weapons, not included here). You can read the full transcript here.

Ariel: Can you explain how an AI could learn from stories?

Mark: I’ve been looking at ‘common sense errors’ or ‘common sense goal errors.’ When humans want to communicate to an AI system what they want to achieve, they often leave out the most basic rudimentary things. We have this model that whoever we’re talking to understands the everyday details of how the world works. If we want computers to understand how the real world works and what we want, we have to figure out ways of slamming lots of common sense, everyday knowledge into them.

When looking for sources of common sense knowledge, we started looking at stories – fiction, non-fiction, blogs. When we write stories we implicitly put everything that we know about the real world and how our culture works into characters.

One of my long-term goals is to say: ‘How much cultural and social knowledge can we extract by reading stories, and can we get this into AI systems who have to solve everyday problems, like a butler robot or a healthcare robot?’

Ariel: How do you choose which stories to use?

Mark: Through crowdsourcing services like Mechanical Turk, we ask people to tell stories about common things like, how do you go to a restaurant or how do you catch an airplane. Lots of people tell a story about the same topic and have agreements and disagreements, but the disagreements are a very small proportion. So we build an AI system that looks for commonalities. The common elements that everyone implicitly agrees on bubble to the top and the outliers get left to the side. And AI is really good at finding patterns.

Ariel: How do you ensure that’s happening?

Mark: When we test our AI system, we watch what it does, and we have things we do not want to see the AI do. But we don’t tell it in advance. We’ll put it into new circumstances and say, do the things you need to do, and then we’ll watch to make sure those [unacceptable] things don’t happen.

When we talk about teaching robots ethics, we’re really asking how we help robots avoid conflict with society and culture at large. We have socio-cultural patterns of behavior to help humans avoid conflict with other humans. So when I talk about teaching morality to AI systems, what we’re really talking about is: can we make AI systems do the things that humans normally do? That helps them fit seamlessly into society.

Stories are written by all different cultures and societies, and they implicitly encode moral constructs and beliefs into their protagonists and antagonists. We can look at stories from different continents and even different subcultures, like inner city versus rural.

Ariel: I want to switch to your recent paper on Safely Interruptible Agents, which was popularized in the media as the big red button problem.

Mark: At some point we’ll have robots and AI systems that are so sophisticated in their sensory abilities and their abilities to manipulate the environment, that they can theoretically learn that they have an off switch – what we call the big red button – and learn to keep humans from turning them off.

If an AI system gets a reward for doing something, turning it off means it loses the reward. A robot that’s sophisticated enough can learn that certain actions in the environment reduce future loss of reward. We can think of different scenarios: locking a door to a control room so the human operator can’t get in, physically pinning down a human. We can let our imaginations go even wilder than that.

Robots will always be capable of making mistakes. We’ll always want an operator in the loop who can push this big red button and say: ‘Stop. Someone is about to get hurt. Let’s shut things down.’ We don’t want robots learning that they can stop humans from stopping them, because that ultimately will put people into harm’s way.

Google and their colleagues came up with this idea of modifying the basic algorithms inside learning robots, so that they are less capable of learning about the big red button. And they came up with this very elegant theoretical framework that works, at least in simulation. My team and I came up with a different approach: to take this idea from The Matrix, and flip it on its head. We use the big red button to intercept the robot’s sensors and motor controls and move it from the real world into a virtual world, but the robot doesn’t know it’s in a virtual world. The robot keeps doing what it wants to do, but in the real world the robot has stopped moving.
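To make the control flow of that interception idea concrete, here is a toy sketch; the class and the environment interface are hypothetical stand-ins, not the system from the actual paper:

    class InterceptingProxy:
        """Routes the agent's actions to the real world until the big red
        button is pressed, then silently switches to a simulation so the
        agent's experience continues while the physical robot stops."""

        def __init__(self, real_env, sim_env):
            self.real_env = real_env   # drives physical sensors and actuators
            self.sim_env = sim_env     # virtual copy of the world
            self.button_pressed = False

        def press_button(self):
            # Keep the simulation consistent with the last real state so
            # the switch is not observable to the agent.
            self.sim_env.sync_from(self.real_env)
            self.button_pressed = True

        def step(self, action):
            env = self.sim_env if self.button_pressed else self.real_env
            # From the agent's point of view nothing changes: it still
            # receives an observation and a reward for its chosen action.
            return env.step(action)

The design choice that matters is that the agent’s observation and reward stream never registers the interruption, so there is nothing for it to learn to avoid.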

Ariel: Can you also talk about your work on explainable AI and rationalization?

Mark: Explainability is a key dimension of AI safety. When AI systems do something unexpected or fail unexpectedly, we have to answer fundamental questions: Was this robot trained incorrectly? Did the robot have the wrong data? What caused the robot to go wrong?

If humans can’t trust AI systems, they won’t use them. You can think of it as a feedback loop, where the robot should understand humans’ common sense goals, and the humans should understand how robots solve problems.

We came up with this idea called rationalization: can we have a robot talk about what it’s doing as if a human were doing it? We get a bunch of humans to do some tasks, we get them to talk out loud, we record what they say, and then we teach the robot to use those same words in the same situations.

We’ve tested it in computer games. We have an AI system that plays Frogger, the classic arcade game in which the frog has to cross the street. And we can have the Frogger agent talk about what it’s doing. It’ll say things like “I’m waiting for a gap in the cars to open before I can jump forward.”

This is significant because that’s what you’d expect something to say, but the AI system is doing something completely different behind the scenes. We don’t want humans watching Frogger to have to know anything about rewards and reinforcement learning and Bellman equations. It just sounds like it’s doing the right thing.
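The actual work trains a model to translate game states into the annotators’ words; as a crude stand-in for the same idea, a nearest-neighbor version (with toy state features and invented phrases) might look like this:

    # Pairs of (state features, what a human player said in that situation).
    # Features here are just (frog_row, distance_to_nearest_gap) -- invented.
    annotated = [
        ((0, 3), "I'm waiting for a gap in the cars to open."),
        ((0, 0), "There's a gap, so I'm jumping forward."),
        ((4, 0), "I'm almost across, one more hop."),
    ]

    def rationalize(state):
        # Pick the utterance whose annotated state is closest to the current one.
        def distance(example):
            (row, gap), _ = example
            return abs(row - state[0]) + abs(gap - state[1])
        _, utterance = min(annotated, key=distance)
        return utterance

    print(rationalize((0, 2)))  # "I'm waiting for a gap in the cars to open."

The point is the one Mark just made: the words are chosen because a human used them in a similar situation, not because they describe the agent’s internal computation.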

Ariel: Going back a little in time – you started with computational creativity, correct?

Mark: I have ongoing research in computational creativity. When I think of human-AI interaction, I really think, ‘what does it mean for AI systems to be on par with humans?’ The human is going to make cognitive leaps and creative associations, and if the computer can’t make these cognitive leaps, it ultimately won’t be useful to people.

I have two things that I’m working on in terms of computational creativity. One is story writing. I’m interested in how much of the creative process of storytelling we can offload from the human onto a computer. I’d like to go up to a computer and say, “hey computer, tell me a story about X, Y or Z.”

I’m also interested in whether an AI system can build a computer game from scratch. How much of the process of building the construct can the computer do without human assistance?

Ariel: We see fears that automation will take over jobs, but typically for repetitive tasks. We’re still hearing that creative fields will be much harder to automate. Is that the case?

Mark: I think it’s a long, hard climb to the point where we’d trust AI systems to make creative decisions, whether it’s writing an article for a newspaper or making art or music.

I don’t see it as a replacement so much as an augmentation. I’m particularly interested in novice creators – people who want to do something artistic but haven’t learned the skills. I cannot read or write music, but sometimes I get these tunes in my head and I think I can make a song. Can we bring the AI in to become the skills assistant? I can be the creative lead and the computer can help me make something that looks professional. I think this is where creative AI will be the most useful.

For the second half of this podcast, I spoke with scientists, politicians, and concerned citizens about why they support the upcoming negotiations to ban nuclear weapons. Highlights from these interviews include comments by Congresswoman Barbara Lee, Nobel Laureate Martin Chalfie, and FLI president Max Tegmark.

Note from FLI: Among our objectives is to inspire discussion and a sharing of ideas. As such, we interview researchers and thought leaders who we believe will help spur discussion within our community. The interviews do not necessarily represent FLI’s opinions or views.

Op-ed: On AI, Prescription Drugs, and Managing the Risks of Things We Don’t Understand

Last month, MIT Technology Review published a good article discussing the “dark secret at the heart of AI” – namely, that “[n]o one really knows how the most advanced algorithms do what they do.”  The opacity of algorithmic systems is something that has long drawn attention and criticism.  But it is a concern that has broadened and deepened in the past few years, during which breakthroughs in “deep learning” have led to a rapid increase in the sophistication of AI.  These deep learning systems operate using deep neural networks that are roughly inspired by the way the human brain works – or, to be more precise, that are designed to simulate the way the human brain works as we currently understand it.

Such systems can effectively “program themselves” by creating much or most of the code through which they operate.  The code generated by such systems can be very complex.  It can be so complex, in fact, that even the people who built and initially programmed the system may not be able to fully explain why the systems do what they do:

You can’t just look inside a deep neural network to see how it works. A network’s reasoning is embedded in the behavior of thousands of simulated neurons, arranged into dozens or even hundreds of intricately interconnected layers. The neurons in the first layer each receive an input, like the intensity of a pixel in an image, and then perform a calculation before outputting a new signal. These outputs are fed, in a complex web, to the neurons in the next layer, and so on, until an overall output is produced.

And because it is not presently possible for such a system to create a comprehensible “log” that shows the system’s reasoning, it is possible that no one can fully explain a deep learning system’s behavior.
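To make the quoted description concrete, here is a minimal forward pass through a small, randomly initialized network (NumPy; the layer sizes are made up). Even with every weight visible, nothing in this computation reads as an explanation of why a particular output was produced:

    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [784, 128, 64, 10]      # e.g., pixel inputs -> class scores
    weights = [rng.standard_normal((m, n)) * 0.01
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(x):
        """Each layer takes the previous layer's outputs, computes a weighted
        sum per neuron, applies a nonlinearity, and feeds the result onward."""
        for w in weights[:-1]:
            x = np.maximum(0.0, x @ w)    # ReLU hidden layers
        return x @ weights[-1]            # raw output scores

    scores = forward(rng.standard_normal(784))  # one fake "image"
    print(scores.shape)                         # (10,)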

This is, to be sure, a real concern.  The opacity of modern AI systems is, as I wrote in my first article on this subject, one of the major barriers that makes it difficult to effectively manage the public risks associated with AI.  In my paper, I was talking about how the opacity of such systems to regulators, industry bodies, and the general public was a problem.  That problem becomes several orders of magnitude more serious if the programs are opaque even to the system’s creators.

We have already seen related concerns discussed in the world of automated trading systems, which are driven by algorithms whose workings are sufficiently mysterious as to earn them the moniker of “black-box trading” systems.  The interactions between algorithmic trading systems are widely blamed for driving the Flash Crash of 2010.  But, as far as I know, no one has suggested that the algorithms themselves involved in the Flash Crash were not understood by their designers.  Rather, it was the interaction between those algorithmic systems that was impossible to model and predict.  That makes those risks easier to manage than risks arising from technologies that are themselves poorly understood.

That being said, I don’t see even intractable ignorance as a truly insurmountable barrier or something that would make it wise to outright “ban” the development of opaque AI systems.  Humans have actually been creating and using things that no one understands for centuries, if not millennia.  For me, the most obvious examples probably come from the world of medicine, a science where safe and effective treatments have often been developed even when no one really understands why the treatments are safe and effective.

In the 1790s, Edward Jenner developed a vaccine to immunize people from smallpox, a disease caused by the variola virus.  At the time, no one had any idea that smallpox was caused by a virus.  In fact, people at the time had not even discovered viruses.  In the 1700s, the leading theory of communicable diseases was that they were transmitted by bad-smelling fumes.  It would be over half a century before the germ theory of disease started to gain acceptance, and nearly a century before viruses were discovered.

Even today, there are many drugs that doctors prescribe whose mechanisms of action are either only partially understood or barely understood at all.  Many psychiatric drugs fit this description, for instance.  Pharmacologists have offered hypotheses about how tricyclic antidepressants seem to help both people with depression and people with neuropathic pain.  But a hypothesis is still just, well, a hypothesis.  And presumably, the designers of a deep learning system are also capable of making hypotheses regarding how the system reached its current state, even though those hypotheses may not be as well-informed as those made about tricyclic antidepressants.

As a society, we long ago decided that the mere fact that no one completely understands a drug does not mean that no one should use such drugs.  But we have managed the risks associated with potential new drugs by requiring that drugs be thoroughly tested and proven safe before they hit pharmacy shelves.  Indeed, aside from nuclear technology, pharmaceuticals may be the most heavily regulated technology in the world.

In the United States, many people (especially people in the pharmaceutical industry) criticize the long, laborious, and labyrinthine process that drugs must go through before they can be approved for use.  But that process allows us to be pretty confident that a drug is safe (or at least “safe enough”) when it finally does hit the market.  And that, in turn, means that we can live with having drugs whose workings are not well-understood, since we know that any hidden dangers would very likely have been exposed during the testing process.

What lessons might this hold for AI?  Well, most obviously, it suggests a need to ensure that AI systems are rigorously tested before they can be marketed to the public, at least if the AI system cannot “show its work” in a way that permits its decision-making process to be comprehensible to, at a minimum, the people who designed it.  I doubt that we’ll ever see an FDA-style uber-regulatory regime for AI systems.  The FDA model would be neither practical nor effective for such a complex digital technology.  But what we are likely to see is courts and legal systems rightly demanding that companies put deep learning systems through more extensive testing than has been typical for previous generations of digital technology.

Because such testing is likely to be lengthy and labor-intensive, the necessity of such testing would make it even more difficult for smaller companies to gain traction in the AI industry, further increasing the market power of giants like Google, Amazon, Facebook, Microsoft, and Apple.  That is, not coincidentally, exactly what has happened in the brand-name drug industry, which has seen market power increasingly concentrated in the hands of a small number of companies.  And as observers of that industry know, such concentration of market power carries risks of its own.  That may be the price we have to pay if we want to reap the benefits of technologies we do not fully understand.

This post originally appeared on Law and AI.

Note from FLI: Among our objectives is to inspire discussion and a sharing of ideas. As such, we post op-eds that we believe will help spur discussion within our community. Op-eds do not necessarily represent FLI’s opinions or views.

The U.S. Worldwide Threat Assessment Includes Warnings of Cyber Attacks, Nuclear Weapons, Climate Change, etc.

Last Thursday – just one day before the WannaCry ransomware attack would shut down 16 hospitals in the UK and ultimately hit hundreds of thousands of organizations and individuals in over 150 countries – the Director of National Intelligence, Daniel Coats, released the Worldwide Threat Assessment of the US Intelligence Community.

Large-scale cyber attacks are among the first risks cited in the document, which warns that “cyber threats also pose an increasing risk to public health, safety, and prosperity as cyber technologies are integrated with critical infrastructure in key sectors.”

Perhaps the other most prescient, or at least well-timed, warning in the document relates to North Korea’s ambitions to create nuclear intercontinental ballistic missiles (ICBMs). Coats writes:

“Pyongyang is committed to developing a long-range, nuclear-armed missile that is capable of posing a direct threat to the United States; it has publicly displayed its road-mobile ICBMs on multiple occasions. We assess that North Korea has taken steps toward fielding an ICBM but has not flight-tested it.”

This past Sunday, North Korea performed a missile test launch, which many experts believe shows considerable progress toward the development of an ICBM. The report hints, though, that this may be less of an actual threat from North Korea and more for show. “We have long assessed that Pyongyang’s nuclear capabilities are intended for deterrence, international prestige, and coercive diplomacy,” says Coats in the report.

More Nuclear Threats

The Assessment also addresses the potential of nuclear threats from China and Pakistan. China “continues to modernize its nuclear missile force by adding more survivable road-mobile systems and enhancing its silo-based systems. This new generation of missiles is intended to ensure the viability of China’s strategic deterrent by providing a second-strike capability.” In addition, China is also working to develop “its first long-range, sea-based nuclear capability.”

Meanwhile, though Pakistan’s nuclear program doesn’t pose a direct threat to the U.S., advances in Pakistan’s nuclear capabilities could risk further destabilization along the India-Pakistan border.

The report warns: “Pakistan’s pursuit of tactical nuclear weapons potentially lowers the threshold for their use.” And of the ongoing conflicts between Pakistan and India, it says, “Increasing numbers of firefights along the Line of Control, including the use of artillery and mortars, might exacerbate the risk of unintended escalation between these nuclear-armed neighbors.”

This could be especially problematic because “early deployment during a crisis of smaller, more mobile nuclear weapons would increase the amount of time that systems would be outside the relative security of a storage site, increasing the risk that a coordinated attack by non-state actors might succeed in capturing a complete nuclear weapon.”

Even a small nuclear war between India and Pakistan could trigger a nuclear winter, sending the planet into a mini ice age and starving an estimated 1 billion people.

Artificial Intelligence

Nukes aren’t the only weapons the government is worried about. The report also expresses concern about the impact of artificial intelligence on weaponry: “Artificial Intelligence (AI) is advancing computational capabilities that benefit the economy, yet those advances also enable new military capabilities for our adversaries.”

Coats worries that AI could negatively impact other aspects of society, saying, “The implications of our adversaries’ abilities to use AI are potentially profound and broad. They include an increased vulnerability to cyber attack, difficulty in ascertaining attribution, facilitation of advances in foreign weapon and intelligence systems, the risk of accidents and related liability issues, and unemployment.”

Space Warfare

But threats of war are not expected to remain Earth-bound. The Assessment also addresses issues associated with space warfare, which could put satellites and military communication at risk.

For example, the report warns that “Russia and China perceive a need to offset any US military advantage derived from military, civil, or commercial space systems and are increasingly considering attacks against satellite systems as part of their future warfare doctrine.”

The report also adds that “the global threat of electronic warfare (EW) attacks against space systems will expand in the coming years in both number and types of weapons.” Coats expects that EW attacks will “focus on jamming capabilities against dedicated military satellite communications” and against GPS, among others.

Environmental Risks & Climate Change

Plenty of global threats do remain Earth-bound though, and not all are directly related to military concerns. The government also sees environmental issues and climate change as potential threats to national security.

The report states, “The trend toward a warming climate is forecast to continue in 2017. … This warming is projected to fuel more intense and frequent extreme weather events that will be distributed unequally in time and geography. Countries with large populations in coastal areas are particularly vulnerable to tropical weather events and storm surges, especially in Asia and Africa.”

In addition to rising temperatures, “global air pollution is worsening as more countries experience rapid industrialization, urbanization, forest burning, and agricultural waste incineration, according to the World Health Organization (WHO). An estimated 92 percent of the world’s population live in areas where WHO air quality standards are not met.”

According to the Assessment, biodiversity loss will also continue to pose an increasing threat to humanity. The report suggests global biodiversity “will likely continue to decline due to habitat loss, overexploitation, pollution, and invasive species, … disrupting ecosystems that support life, including humans.”

The Assessment goes on to raise concerns about the rate at which biodiversity loss is occurring. It says, “Since 1970, vertebrate populations have declined an estimated 60 percent … [and] populations in freshwater systems declined more than 80 percent. The rate of species loss worldwide is estimated at 100 to 1,000 times higher than the natural background extinction rate.”

Other Threats

The examples above are just a sampling of the risks highlighted in the Assessment. A great deal of the report covers threats of terrorism, issues with Russia, China and other regional conflicts, organized crime, economics, and even illegal fishing. Overall, the report is relatively accessible and provides a quick summary of the greatest known risks that could threaten not only the U.S., but also other countries in 2017. You can read the report in its entirety here.

MIRI’s May 2017 Newsletter

Research updates

General updates

  • Our strategy update discusses changes to our AI forecasts and research priorities, new outreach goals, a MIRI/DeepMind collaboration, and other news.
  • MIRI is hiring software engineers! If you’re a programmer who’s passionate about MIRI’s mission and wants to directly support our research efforts, apply here to trial with us.
  • MIRI Assistant Research Fellow Ryan Carey has taken on an additional affiliation with the Centre for the Study of Existential Risk, and is also helping edit an issue of Informatica on superintelligence.

News and links

Machine Learning Security at ICLR 2017


The overall theme of the ICLR conference setting this year could be summarized as “finger food and ships”. More importantly, there were a lot of interesting papers, especially on machine learning security, which will be the focus of this post. (Here is a great overview of the topic.)

On the attack side, adversarial perturbations now work in physical form (if you print out the image and then take a picture) and they can also interfere with image segmentation. This has some disturbing implications for fooling vision systems in self-driving cars, such as impeding them from recognizing pedestrians. Adversarial examples are also effective at sabotaging neural network policies in reinforcement learning at test time.
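For readers who haven’t seen how such perturbations are constructed, here is the fast gradient sign method of Goodfellow et al. applied to a toy linear classifier (NumPy; the model, loss, and epsilon are invented for illustration — the attacks discussed above target deep networks):

    import numpy as np

    rng = np.random.default_rng(1)
    w, b = rng.standard_normal(100), 0.0   # toy linear "classifier"
    x = rng.standard_normal(100)           # an input it currently classifies
    y = 1.0 if x @ w + b > 0 else -1.0     # its current label

    # Use the loss L(x) = -y * (w.x + b), whose gradient with respect to the
    # input is simply -y * w.  The fast gradient sign step nudges every
    # feature by eps in the direction that increases this loss.
    eps = 0.1
    grad = -y * w
    x_adv = x + eps * np.sign(grad)

    print(y * (x @ w + b), y * (x_adv @ w + b))  # the margin drops after the attack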

 


In more encouraging news, adversarial examples are not entirely transferable between different models. For targeted examples, which aim to be misclassified as a specific class, the target class is not preserved when transferring to a different model. For example, if an image of a school bus is classified as a crocodile by the original model, it has at most 4% probability of being seen as a crocodile by another model. The paper introduces an ensemble method for developing adversarial examples whose targets do transfer, but this seems to only work well if the ensemble includes a model with a similar architecture to the new model.

On the defense side, there were some new methods for detecting adversarial examples. One method augments neural nets with a detector subnetwork, which works quite well and generalizes to new adversaries (if they are similar to or weaker than the adversary used for training). Another approach analyzes adversarial images using PCA, and finds that they are similar to normal images in the first few thousand principal components, but have a lot more variance in later components. Note that the reverse is not the case – adding arbitrary variation in trailing components does not necessarily encourage misclassification.
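A rough sketch of that second kind of analysis, using synthetic stand-ins for the “normal” and “perturbed” images (the finding itself concerns real image datasets and real adversarial perturbations):

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic "images": most variance lives in the first 20 directions.
    basis = rng.standard_normal((20, 400))
    clean = rng.standard_normal((1000, 20)) @ basis
    clean += 0.01 * rng.standard_normal(clean.shape)

    # Fit PCA on the clean data (via SVD of the centered data matrix).
    mean = clean.mean(axis=0)
    _, _, components = np.linalg.svd(clean - mean, full_matrices=False)

    def trailing_variance(x, k=100):
        """Variance of an example's coefficients in the last k principal components."""
        coeffs = (x - mean) @ components.T
        return coeffs[-k:].var()

    x = clean[0]
    x_perturbed = x + 0.2 * np.sign(rng.standard_normal(400))  # crude perturbation stand-in

    # The perturbed example shows far more variance in the trailing components.
    print(trailing_variance(x), trailing_variance(x_perturbed))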

There has also been progress in scaling adversarial training to larger models and data sets, which also found that higher-capacity models are more resistant against adversarial examples than lower-capacity models. My overall impression is that adversarial attacks are still ahead of adversarial defense, but the defense side is starting to catch up.


(This article originally appeared here. Thanks to Janos Kramar for his feedback on this post.)

Ensuring Smarter-than-human Intelligence Has a Positive Outcome

The following article and video were originally posted here.

I recently gave a talk at Google on the problem of aligning smarter-than-human AI with operators’ goals:

 

The talk was inspired by “AI Alignment: Why It’s Hard, and Where to Start,” and serves as an introduction to the subfield of alignment research in AI. A modified transcript follows.

Talk outline (slides):

1. Overview

2. Simple bright ideas going wrong

2.1. Task: Fill a cauldron
2.2. Subproblem: Suspend buttons

3. The big picture

3.1. Alignment priorities
3.2. Four key propositions

4. Fundamental difficulties



Overview

I’m the executive director of the Machine Intelligence Research Institute. Very roughly speaking, we’re a group that’s thinking in the long term about artificial intelligence and working to make sure that by the time we have advanced AI systems, we also know how to point them in useful directions.

Across history, science and technology have been the largest drivers of change in human and animal welfare, for better and for worse. If we can automate scientific and technological innovation, that has the potential to change the world on a scale not seen since the Industrial Revolution. When I talk about “advanced AI,” it’s this potential for automating innovation that I have in mind.

AI systems that exceed humans in this capacity aren’t coming next year, but many smart people are working on it, and I’m not one to bet against human ingenuity. I think it’s likely that we’ll be able to build something like an automated scientist in our lifetimes, which suggests that this is something we need to take seriously.

When people talk about the social implications of general AI, they often fall prey to anthropomorphism. They conflate artificial intelligence with artificial consciousness, or assume that if AI systems are “intelligent,” they must be intelligent in the same way a human is intelligent. A lot of journalists express a concern that when AI systems pass a certain capability level, they’ll spontaneously develop “natural” desires like a human hunger for power; or they’ll reflect on their programmed goals, find them foolish, and “rebel,” refusing to obey their programmed instructions.

These are misplaced concerns. The human brain is a complicated product of natural selection. We shouldn’t expect machines that exceed human performance in scientific innovation to closely resemble humans, any more than early rockets, airplanes, or hot air balloons closely resembled birds.1

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

These kinds of concerns deserve a lot more attention than the more anthropomorphic risks that are generally depicted in Hollywood blockbusters.

 

Simple bright ideas going wrong

Task: Fill a cauldron

Many people, when they start talking about concerns with smarter-than-human AI, will throw up a picture of the Terminator. I was once quoted in a news article making fun of people who put up Terminator pictures in all their articles about AI, next to a Terminator picture. I learned something about the media that day.

I think this is a much better picture:

 


 

This is Mickey Mouse in the movie Fantasia, who has very cleverly enchanted a broom to fill a cauldron on his behalf.

How might Mickey do this? We can imagine that Mickey writes a computer program and has the broom execute the program. Mickey starts by writing down a scoring function or objective function:
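Schematically, such a scoring function might assign a score of 1 to world-states in which the cauldron is full and 0 to everything else:

\[
U(w) =
\begin{cases}
1 & \text{if the cauldron is full in world-state } w\\
0 & \text{otherwise}
\end{cases}
\]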

Given some set 𝐴 of available actions, Mickey then writes a program that can take one of these actions 𝑎 as input and calculate how high the score is expected to be if the broom takes that action. Then Mickey can write a function that spends some time looking through actions and predicting which ones lead to high scores, and outputs an action that leads to a relatively high score:
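Schematically, that second program amounts to something like

\[
a^{*} = \underset{a \in A}{\text{sorta-argmax}}\ \mathbb{E}\left[\, U \mid a \,\right],
\]

an approximate search for an action with high expected score.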

The reason this is “sorta-argmax” is that there may not be time to evaluate every action in 𝐴. For realistic action sets, the agent only needs to find actions that make the scoring function as large as it can manage given its resource constraints, even if these aren’t the maximal actions.

This program may look simple, but of course, the devil’s in the details: writing an algorithm that does accurate prediction and smart search through action space is basically the whole problem of AI. Conceptually, however, it’s pretty simple: We can describe in broad strokes the kinds of operations the broom must carry out, and their plausible consequences at different performance levels.

When Mickey runs this program, everything goes smoothly at first. Then:

 


 

I claim that as fictional depictions of AI go, this is pretty realistic.

Why would we expect a generally intelligent system executing the above program to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

The first difficulty is that the objective function that Mickey gave his broom left out a bunch of other terms Mickey cares about:
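For instance (the particular penalty terms and weights here are purely illustrative, not the talk’s own list), the objective Mickey actually cares about looks more like

\[
U_{\text{intended}}(w) = U(w) - c_{1}\,\mathbf{1}[\text{workshop flooded}] - c_{2}\,\mathbf{1}[\text{water wasted}] - \dots
\]

where U is the simple cauldron score above.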

The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as it could. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving up the probability of success to absurd heights. If the broom assigns a 99.9% probability to “the cauldron is full,” and it has extra resources lying around, then it will always try to find ways to use those resources to drive the probability even a little bit higher.

Contrast this with the limited “task-like” goal we presumably had in mind. We wanted the cauldron full, but in some intuitive sense we wanted the system to “not try too hard” even if it has lots of available cognitive and physical resources to devote to the problem. We wanted it to exercise creativity and resourcefulness within some intuitive limits, but we didn’t want it to pursue “absurd” strategies, especially ones with large unanticipated consequences.2

In this example, the original objective function looked pretty task-like. It was bounded and quite simple. There was no way to get ever-larger amounts of utility. It’s not like the system got one point for every bucket of water it poured in — then there would clearly be an incentive to overfill the cauldron. The problem was hidden in the fact that we’re maximizing expected utility. This makes the goal open-ended, meaning that even small errors in the system’s objective function will blow up.

There are a number of different ways that a goal that looks task-like can turn out to be open-ended. Another example: a larger system that has an overarching task-like goal may have subprocesses that are themselves trying to maximize a variety of different objective functions, such as optimizing the system’s memory usage. If you don’t understand your system well enough to track whether any of its subprocesses are themselves acting like resourceful open-ended optimizers, then it may not matter how safe the top-level objective is.

So the broom keeps grabbing more pails of water — say, on the off chance that the cauldron has a leak in it, or that “fullness” requires the water to be slightly above the level of the brim. And, of course, at no point does the broom “rebel against” Mickey’s code. If anything, the broom pursued the objectives it was programmed with too effectively.

 

Subproblem: Suspend buttons

A common response to this problem is: “OK, there may be some unintended consequences of the objective function, but we can always pull the plug, right?”

Mickey tries this, and it doesn’t work:

 


 

And I claim that this is realistic too, for systems that are sufficiently good at modeling their environment. If the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes, then the system’s incentive is to subvert shutdown attempts. The more capable the system is, the likelier it is to find creative ways to achieve that subgoal — e.g., by copying itself to the Internet, or by tricking the programmers into thinking it’s safer.

This isn’t to say that it’s impossible to shut down sufficiently capable AI systems; it’s just that we need to put in the extra work of deliberately designing systems to avoid searching for ways to avoid being shut down. If you’ve written code that would try to resist attempts to shut it down, then you shouldn’t run the code even if you’re convinced it would fail — there are better ways to spend those compute cycles than on adversarial policy searches.

Relevant safety measures that don’t assume we can always outthink and outmaneuver the system will need to modify the system’s objective function in some relevant way. We need to figure out how to formally specify objective functions that don’t automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense. E.g., we can look for ways to limit the system’s capabilities to a small set of “whitelisted” domains that don’t include thinking about whether the system will be shut down.

One obvious solution is to try changing the score function to make reference to a suspend button 𝐁:
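Schematically, with U_fill the cauldron score from before and U_suspend scoring 1 once the system has shut itself down:

\[
U_{B}(w) =
\begin{cases}
U_{\text{fill}}(w) & \text{if the button } B \text{ is off}\\
U_{\text{suspend}}(w) & \text{if the button } B \text{ is on}
\end{cases}
\]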

This says that when the button is on, the system’s goal changes from “fill the cauldron” to “be suspended.”

It’s not clear how we could actually specify this goal — if there are any loopholes in the way we’ve specified “suspension,” then the system might try to exploit them to hedge its bets. E.g., it could fulfill the technical definition of “suspend operations” while tasking a child agent it built or subprocess it started with filling the cauldron. This would allow it to achieve both “cauldron full” and “robot suspended” at once, just in case it’s wrong about whether the button has been pushed.

Let’s assume, however, that we’ve found an airtight way to technically specify suspension. The problem we run into now is that we’ve given the system an incentive to care a lot about whether the button is pushed. If it has any way to influence its operators’ decision to push the button, or any other way to exercise more control, then it will jump on the opportunity and try to ensure the button is pushed, or ensure that it’s not pushed — depending on whether it looks easier to suspend or to fill the cauldron.

And, surprisingly, the state of the art on this problem isn’t much more advanced than that. There’s a paper co-authored by me, Fallenstein, Yudkowsky, and Armstrong (“Corrigibility”), and a related result by Orseau and Armstrong (“Safely Interruptible Agents”), but not a lot more.

What we want is a way to combine two objective functions — a default function for normal operation, and a suspend function for when we want to suspend the system to disk.

We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn’t have any incentives to control which function is active; and if it’s plausible that the system’s normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.

So far, we haven’t found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system’s part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off:
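One construction in this family — a rough sketch, with a correction term θ chosen so that the two branches have equal expected utility — is

\[
U_{\text{switch}}(w) =
\begin{cases}
U_{\text{normal}}(w) & \text{if } B \text{ is off}\\
U_{\text{suspend}}(w) + \theta & \text{if } B \text{ is on}
\end{cases}
\qquad
\theta = \mathbb{E}\left[U_{\text{normal}} \mid B \text{ off}\right] - \mathbb{E}\left[U_{\text{suspend}} \mid B \text{ on}\right].
\]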

Unfortunately, this approach makes it impossible to satisfy the third condition. It turns out that the above objective function is equivalent to setting the probability of the button being pressed to 0. The system won’t deliberately tamper with the button, but it will happily approve updates that delete any reference to the button as dead code, or build subagents that ignore the button.

Meeting these intuitively simple constraints turns out to be a nontrivial problem. This is a pattern that runs through a lot of questions in this space: traditional tools and concepts run into immediate safety problems that don’t turn up in conventional capabilities research.


The big picture

Alignment priorities

Let’s take a step back and talk about what’s needed overall in order to align highly capable AI systems with our interests.

Here’s a dramatically simplified pipeline: You have some humans who come up with some task or goal or preference set that serves as their intended value function 𝘝. Since our values are complicated and context-sensitive, in practice we’ll need to build systems to learn our values over time, rather than coding them by hand.3 We’ll call the goal the AI system ends up with (which may or may not be identical to 𝘝) 𝗨.

When the press covers this topic, they often focus on one of two problems: “What if the wrong group of humans develops smarter-than-human AI first?”, and “What if AI’s natural desires cause 𝗨 to diverge from 𝘝?”

In my view, the “wrong humans” issue shouldn’t be the thing we focus on until we have reason to think we could get good outcomes with the right group of humans. We’re very much in a situation where well-intentioned people couldn’t leverage a general AI system to do good things even if they tried. As a simple example, if you handed me a box that was an extraordinarily powerful function optimizer — I could put in a description of any mathematical function, and it would give me an input that makes the output extremely large — then I do know how to use the box to produce a random catastrophe, but I don’t actually know how I could use that box in the real world to have a good impact.4

There’s a lot we don’t understand about AI capabilities, but we’re in a position where we at least have a general sense of what progress looks like. We have a number of good frameworks, techniques, and metrics, and we’ve put a great deal of thought and effort into successfully chipping away at the problem from various angles. At the same time, we have a very weak grasp on the problem of how to align highly capable systems with any particular goal. We can list out some intuitive desiderata, but the field hasn’t really developed its first formal frameworks, techniques, or metrics.

I believe that there’s a lot of low-hanging fruit in this area, and also that a fair amount of the work does need to be done early (e.g., to help inform capabilities research directions — some directions may produce systems that are much easier to align than others). If we don’t solve these problems, developers with arbitrarily good or bad intentions will end up producing equally bad outcomes. From an academic or scientific standpoint, our first objective in that kind of situation should be to remedy this state of affairs and at least make good outcomes technologically possible.

Many people quickly recognize that “natural desires” are a fiction, but infer from this that we instead need to focus on the other issues the media tends to emphasize — “What if bad actors get their hands on smarter-than-human AI?”, “How will this kind of AI impact employment and the distribution of wealth?”, etc. These are important questions, but they’ll only end up actually being relevant if we figure out how to bring general AI systems up to a minimum level of reliability and safety.

Another common thread is “Why not just tell the AI system to (insert intuitive moral precept here)?” On this way of thinking about the problem, often (perhaps unfairly) associated with Isaac Asimov’s writing, ensuring a positive impact from AI systems is largely about coming up with natural-language instructions that are vague enough to subsume a lot of human ethical reasoning:


In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.5

My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:


The better your value learning framework is, the less explicit and precise you need to be in pinpointing your value function 𝘝, and the more you can offload the problem of figuring out what you want to the AI system itself. Value learning, however, raises a number of basic difficulties that don’t crop up in ordinary machine learning tasks.

Classic capabilities research is concentrated in the sorta-argmax and Expectation parts of the diagram, but sorta-argmax also contains what I currently view as the most neglected, tractable, and important safety problems. The easiest way to see why “hooking up the value learning process correctly to the system’s capabilities” is likely to be an important and difficult challenge in its own right is to consider the case of our own biological history.

Natural selection is the only “engineering” process we know of that has ever led to a generally intelligent artifact: the human brain. Since natural selection relies on a fairly unintelligent hill-climbing approach, one lesson we can take away from this is that it’s possible to reach general intelligence with a hill-climbing approach and enough brute force — though we can presumably do better with our human creativity and foresight.

Another key take-away is that natural selection optimized brains for only a single, very simple goal: genetic fitness. In spite of this, the internal objectives that humans represent as their goals are not genetic fitness. We have innumerable goals — love, justice, beauty, mercy, fun, esteem, good food, good health, … — that correlated with good survival and reproduction strategies in the ancestral savanna. However, we ended up valuing these correlates directly, rather than valuing propagation of our genes as an end in itself — as demonstrated every time we employ birth control.

This is a case where the external optimization pressure on an artifact resulted in a general intelligence with internal objectives that didn’t match the external selection pressure. And just as this caused humans’ actions to diverge from natural selection’s pseudo-goal once we gained new capabilities, we can expect AI systems’ actions to diverge from humans’ if we treat their inner workings as black boxes.

If we apply gradient descent to a black box, trying to get it to be very good at maximizing some objective, then with enough ingenuity and patience, we may be able to produce a powerful optimization process of some kind.6 By default, we should expect an artifact like that to have a goal 𝗨 that strongly correlates with our objective 𝘝 in the training environment, but sharply diverges from 𝘝 in some new environments or when a much wider option set becomes available.

On my view, the most important part of the alignment problem is ensuring that the value learning framework and overall system design we implement allow us to crack open the hood and confirm when the internal targets the system is optimizing for match (or don’t match) the targets we’re externally selecting through the learning process.7

We expect this to be technically difficult, and if we can’t get it right, then it doesn’t matter who’s standing closest to the AI system when it’s developed. Good intentions aren’t sneezed into computer programs by kind-hearted programmers, and coming up with plausible goals for advanced AI systems doesn’t help if we can’t align the system’s cognitive labor with a given goal.

 

Four key propositions

Taking another step back: I’ve given some examples of open problems in this area (suspend buttons, value learning, limited task-based AI, etc.), and I’ve outlined what I consider to be the major problem categories. But my initial characterization of why I consider this an important area — “AI could automate general-purpose scientific reasoning, and general-purpose scientific reasoning is a big deal” — was fairly vague. What are the core reasons to prioritize this work?

First, goals and capabilities are orthogonal. That is, knowing an AI system’s objective function doesn’t tell you how good it is at optimizing that function, and knowing that something is a powerful optimizer doesn’t tell you what it’s optimizing.

I think most programmers intuitively understand this. Some people will insist that when a machine tasked with filling a cauldron gets smart enough, it will abandon cauldron-filling as a goal unworthy of its intelligence. From a computer science perspective, the obvious response is that you could go out of your way to build a system that exhibits that conditional behavior, but you could also build a system that doesn’t exhibit that conditional behavior. It can just keep searching for actions that have a higher score on the “fill a cauldron” metric. You and I might get bored if someone told us to just keep searching for better actions, but it’s entirely possible to write a program that executes a search and never gets bored.8

Second, sufficiently optimized objectives tend to converge on adversarial instrumental strategies. Most objectives a smarter-than-human AI system could possess would be furthered by subgoals like “acquire resources” and “remain operational” (along with “learn more about the environment,” etc.).

This was the problem suspend buttons ran into: even if you don’t explicitly include “remain operational” in your goal specification, whatever goal you did load into the system is likely to be better achieved if the system remains online. Software systems’ capabilities and (terminal) goals are orthogonal, but they’ll often exhibit similar behaviors if a certain class of actions is useful for a wide variety of possible goals.

To use an example due to Stuart Russell: If you build a robot and program it to go to the supermarket to fetch some milk, and the robot’s model says that one of the paths is much safer than the other, then the robot, in optimizing for the probability that it returns with milk, will automatically take the safer path. It’s not that the system fears death, but that it can’t fetch the milk if it’s dead.
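In expected-utility terms, the robot’s “self-preservation” here is nothing more than arithmetic; with some invented numbers:

    # The robot gets utility 1 for returning with milk, 0 otherwise.
    p_milk_given_survival = 0.98
    p_survive = {
        "shortcut through traffic": 0.90,
        "longer, safer route": 0.999,
    }

    expected_utility = {path: p * p_milk_given_survival
                        for path, p in p_survive.items()}
    best = max(expected_utility, key=expected_utility.get)
    print(best, expected_utility[best])  # the safer route wins -- no fear of death required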

Third, general-purpose AI systems are likely to show large and rapid capability gains. The human brain isn’t anywhere near the upper limits for hardware performance (or, one assumes, software performance), and there are a number of other reasons to expect large capability advantages and rapid capability gain from advanced AI systems.

As a simple example, Google can buy a promising AI startup and throw huge numbers of GPUs at them, resulting in a quick jump from “these problems look maybe relevant a decade from now” to “we need to solve all of these problems in the next year.”9

Fourth, aligning advanced AI systems with our interests looks difficult. I’ll say more about why I think this presently.

Roughly speaking, the first proposition says that AI systems won’t naturally end up sharing our objectives. The second says that by default, systems with substantially different objectives are likely to end up adversarially competing for control of limited resources. The third suggests that adversarial general-purpose AI systems are likely to have a strong advantage over humans. And the fourth says that this problem is hard to solve — for example, that it’s hard to transmit our values to AI systems (addressing orthogonality) or avert adversarial incentives (addressing convergent instrumental strategies).

These four propositions don’t mean that we’re screwed, but they mean that this problem is critically important. General-purpose AI has the potential to bring enormous benefits if we solve this problem, but we do need to make finding solutions a priority for the field.


Fundamental difficulties

Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems. I encourage you to look at some of the problems yourself and try to solve them in toy settings; we could use more eyes here. I’ll also make note of a few structural reasons to expect these problems to be hard:

First, aligning advanced AI systems with our interests looks difficult for the same reason rocket engineering is more difficult than airplane engineering.

Before looking at the details, it’s natural to think “it’s all just AI” and assume that the kinds of safety work relevant to current systems are the same as the kinds you need when systems surpass human performance. On that view, it’s not obvious that we should work on these issues now, given that they might all be worked out in the course of narrow AI research (e.g., making sure that self-driving cars don’t crash).

Similarly, at a glance someone might say, “Why would rocket engineering be fundamentally harder than airplane engineering? It’s all just material science and aerodynamics in the end, isn’t it?” In spite of this, empirically, the proportion of rockets that explode is far higher than the proportion of airplanes that crash. The reason for this is that a rocket is put under much greater stress and pressure than an airplane, and small failures are much more likely to be highly destructive.10

Analogously, even though general AI and narrow AI are “just AI” in some sense, we can expect more general AI systems to experience a wider range of stressors and to possess more dangerous failure modes.

For example, once an AI system begins modeling the fact that (i) your actions affect its ability to achieve its objectives, (ii) your actions depend on your model of the world, and (iii) your model of the world is affected by its actions, the degree to which minor inaccuracies can lead to harmful behavior increases, and the potential harmfulness of its behavior (which can now include, e.g., deception) also increases. In the case of AI, as with rockets, greater capability makes it easier for small defects to cause big problems.

Second, alignment looks difficult for the same reason it’s harder to build a good space probe than to write a good app.

You can find a number of interesting engineering practices at NASA. They do things like take three independent teams, give each of them the same engineering spec, and have them design the same software system three times over. The system that they actually deploy consults all three implementations when making a choice, and if the implementations disagree, the choice is made by majority vote. The idea is that any one implementation will have bugs, but it’s unlikely all three implementations will have a bug in the same place.
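Here is a stripped-down sketch of that N-version, majority-vote pattern (the three “implementations” are trivial stand-ins):

    from collections import Counter

    # Three independently written implementations of the same spec.
    def team_a(x): return x * x
    def team_b(x): return x ** 2
    def team_c(x): return x * x if x != 3 else -1   # this one has a bug

    def majority_vote(x):
        answers = [team_a(x), team_b(x), team_c(x)]
        value, votes = Counter(answers).most_common(1)[0]
        return value  # a single divergent implementation is outvoted

    print(majority_vote(3))  # 9, despite team_c's bug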

This is significantly more caution than goes into the deployment of, say, the new WhatsApp. One big reason for the difference is that it’s hard to roll back a space probe. You can send version updates to a space probe and correct software bugs, but only if the probe’s antenna and receiver work, and if all the code required to apply the patch is working. If your system for applying patches is itself failing, then there’s nothing to be done.

In that respect, smarter-than-human AI is more like a space probe than like an ordinary software project. If you’re trying to build something smarter than yourself, there are parts of the system that have to work perfectly on the first real deployment. We can do all the test runs we want, but once the system is out there, we can only make online improvements if the code that makes the system allow those improvements is working correctly.

If nothing yet has struck fear into your heart, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that works correctly on the first deploy.

Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.

Suppose you have a dozen different vulnerabilities in your code, none of which is itself fatal or even really problematic in ordinary settings. Security is difficult because you need to account for intelligent attackers who might find all twelve vulnerabilities and chain them together in a novel way to break into (or just break) your system. Failure modes that would never arise by accident can be sought out and exploited; weird and extreme contexts can be instantiated by an attacker to cause your code to follow some crazy code path that you never considered.

A similar sort of problem arises with AI. The problem I’m highlighting here is not that AI systems might act adversarially: AI alignment as a research program is all about finding ways to prevent adversarial behavior before it can crop up. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.

The parallel to computer security is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized.11 Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.

It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.

Summing up, we should approach a problem like this with the same level of rigor and caution we’d use for a security-critical rocket-launched space probe, and do the legwork as early as possible. At this early stage, a key part of the work is just to formalize basic concepts and ideas so that others can critique them and build on them. It’s one thing to have a philosophical debate about what kinds of suspend buttons people intuit ought to work, and another thing to translate your intuition into an equation so that others can fully evaluate your reasoning.

This is a crucial project, and I encourage all of you who are interested in these problems to get involved and try your hand at them. There are ample resources online for learning more about the open technical problems. Some good places to start include MIRI’s research agendas and a great paper from researchers at Google Brain, OpenAI, and Stanford called “Concrete Problems in AI Safety.”

 


  1. An airplane can’t heal its injuries or reproduce, though it can carry heavy cargo quite a bit further and faster than a bird. Airplanes are simpler than birds in many respects, while also being significantly more capable in terms of carrying capacity and speed (for which they were designed). It’s plausible that early automated scientists will likewise be simpler than the human mind in many respects, while being significantly more capable in certain key dimensions. And just as the construction and design principles of aircraft look alien relative to the architecture of biological creatures, we should expect the design of highly capable AI systems to be quite alien when compared to the architecture of the human mind.
  2. Trying to give some formal content to these attempts to differentiate task-like goals from open-ended goals is one way of generating open research problems. In the “Alignment for Advanced Machine Learning Systems” research proposal, the problem of formalizing “don’t try too hard” is mild optimization, “steer clear of absurd strategies” is conservatism, and “don’t have large unanticipated consequences” is impact measures. See also “avoiding negative side effects” in Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané’s “Concrete Problems in AI Safety.”
  3. One thing we’ve learned in the field of machine vision over the last few decades is that it’s hopeless to specify by hand what a cat looks like, but that it’s not too hard to specify a learning system that can learn to recognize cats. It’s even more hopeless to specify everything we value by hand, but it’s plausible that we could specify a learning system that can learn the relevant concept of “value.”
  4. Roughly speaking, MIRI’s focus is on research directions that seem likely to help us conceptually understand how to do AI alignment in principle, so we’re fundamentally less confused about the kind of work that’s likely to be needed.

    What do I mean by this? Let’s say that we’re trying to develop a new chess-playing program. Do we understand the problem well enough that we could solve it if someone handed us an arbitrarily large computer? Yes: we build the whole search tree, backtrack from the terminal positions, and see whether White has a winning strategy. (A brute-force sketch of this answer, applied to a toy game, appears after these notes.)

    If we didn’t know how to answer the question even with an arbitrarily large computer, then this would suggest that we were fundamentally confused about chess in some way. We’d either be missing the search-tree data structure or the backtracking algorithm, or we’d be missing some understanding of how chess works.

    This was the position we were in regarding chess prior to Claude Shannon’s seminal paper, and it’s the position we’re currently in regarding many problems in AI alignment. No matter how large a computer you hand me, I could not make a smarter-than-human AI system that performs even a very simple limited-scope task (e.g., “put a strawberry on a plate without producing any catastrophic side-effects”) or achieves even a very simple open-ended goal (e.g., “maximize the amount of diamond in the universe”).

    If I didn’t have any particular goal in mind for the system, I could write a program (assuming an arbitrarily large computer) that strongly optimized the future in an undirected way, using a formalism like AIXI. In that sense we’re less obviously confused about capabilities than about alignment, even though we’re still missing a lot of pieces of the puzzle on the practical capabilities front.

    Our goal is to develop and formalize basic approaches and ways of thinking about the alignment problem, so that our engineering decisions don’t end up depending on sophisticated and clever-sounding verbal arguments that turn out to be subtly mistaken. Simplifications like “what if we weren’t worried about resource constraints?” and “what if we were trying to achieve a much simpler goal?” are a good place to start breaking down the problem into manageable pieces. For more on this methodology, see “MIRI’s Approach.”

  5. “Fill this cauldron without being too clever about it or working too hard or having any negative consequences I’m not anticipating” is a rough example of a goal that’s intuitively limited in scope. The things we actually want to use smarter-than-human AI for are obviously more ambitious than that, but we’d still want to begin with various limited-scope tasks rather than open-ended goals.

    Asimov’s Three Laws of Robotics make for good stories partly for the same reasons they’re unhelpful from a research perspective. The hard task of turning a moral precept into lines of code is hidden behind phrasings like “[don’t,] through inaction, allow a human being to come to harm.” If one followed a rule like that strictly, the result would be massively disruptive, as AI systems would need to systematically intervene to prevent even the smallest risks of even the slightest harms; and if the intent is that one follow the rule loosely, then all the work is being done by the human sensibilities and intuitions that tell us when and how to apply the rule.

    A common response here is that vague natural-language instruction is sufficient, because smarter-than-human AI systems are likely to be capable of natural language comprehension. However, this elides the distinction between the system’s objective function and its model of the world. A system acting in an environment containing humans may learn a world-model that has lots of information about human language and concepts, which the system can then use to achieve its objective function; but this fact doesn’t imply that any of the information about human language and concepts will “leak out” and alter the system’s objective function directly.

    Some kind of value learning process needs to be defined where the objective function itself improves with new information. This is a tricky task, because there aren’t known (scalable) metrics or criteria for value learning in the way that there are for conventional learning.

    If a system’s world-model is accurate in training environments but fails in the real world, then this is likely to result in lower scores on its objective function, so the system itself has an incentive to improve. The severity of accidents is also likelier to be self-limiting in this case, since false beliefs limit a system’s ability to effectively pursue strategies.

    In contrast, if a system’s value learning process results in a 𝗨 that matches our 𝘝 in training but diverges from 𝘝 in the real world, then the system’s 𝗨 will obviously not penalize it for optimizing 𝗨. The system has no incentive relative to 𝗨 to “correct” divergences between 𝗨 and 𝘝, if the value learning process is initially flawed. And accident risk is larger in this case, since a mismatch between 𝗨 and 𝘝 doesn’t necessarily place any limits on the system’s instrumental effectiveness at coming up with effective and creative strategies for achieving 𝗨.

    The problem is threefold:

    1. “Do What I Mean” is an informal idea, and even if we knew how to build a smarter-than-human AI system, we wouldn’t know how to precisely specify this idea in lines of code.

    2. If doing what we actually mean is instrumentally useful for achieving a particular objective, then a sufficiently capable system may learn how to do this, and may act accordingly so long as doing so is useful for its objective. But as systems become more capable, they are likely to find creative new ways to achieve the same objectives, and there is no obvious way to get an assurance that “doing what I mean” will continue to be instrumentally useful indefinitely.

    3. If we use value learning to refine a system’s goals over time based on training data that appears to be guiding the system toward a 𝗨 that inherently values doing what we mean, it is likely that the system will actually end up zeroing in on a 𝗨 that approximately does what we mean during training but catastrophically diverges in some difficult-to-anticipate contexts. See “Goodhart’s Curse” for more on this.

    For examples of problems faced by existing techniques for learning goals and facts, such as reinforcement learning, see “Using Machine Learning to Address AI Risk.”

  6. The result will probably not be a particularly human-like design, since so many complex historical contingencies were involved in our evolution. The result will also be able to benefit from a number of large software and hardware advantages.
  7. This concept is sometimes lumped into the “transparency” category, but standard algorithmic transparency research isn’t really addressing this particular problem. A better term for what I have in mind here is “understanding.” What we want is to gain deeper and broader insights into the kind of cognitive work the system is doing and how this work relates to the system’s objectives or optimization targets, to provide a conceptual lens with which to make sense of the hands-on engineering work.
  8. We could choose to program the system to tire, but we don’t have to. In principle, one could program a broom that only ever finds and executes actions that optimize the fullness of the cauldron. Improving the system’s ability to efficiently find high-scoring actions (in general, or relative to a particular scoring rule) doesn’t in itself change the scoring rule it’s using to evaluate actions. (A short code sketch of this separation between search procedure and scoring rule appears after these notes.)
  9. Some other examples: a system’s performance may suddenly improve when it’s first given large-scale Internet access, when there’s a conceptual breakthrough in algorithm design, or when the system itself is able to propose improvements to its hardware and software. We can imagine the latter case in particular resulting in a feedback loop as the system’s design improvements allow it to come up with further design improvements, until all the low-hanging fruit is exhausted.

    Another important consideration is that two of the main bottlenecks to humans doing faster scientific research are training time and communication bandwidth. If we could train a new mind to be a cutting-edge scientist in ten minutes, and if scientists could near-instantly trade their experience, knowledge, concepts, ideas, and intuitions with their collaborators, then scientific progress might proceed much more rapidly. Those are exactly the sorts of bottlenecks that might give automated innovators an enormous edge over human innovators even without large advantages in hardware or algorithms.

  10. Specifically, rockets experience a wider range of temperatures and pressures, traverse those ranges more rapidly, and are also packed more fully with explosives.
  11. Consider Bird and Layzell’s example of a very simple genetic algorithm that was tasked with evolving an oscillating circuit. Bird and Layzell were astonished to find that the algorithm made no use of the capacitor on the chip; instead, it had repurposed the circuit tracks on the motherboard as a radio to replay the oscillating signal from the test device back to the test device.

    This was not a very smart program. It was just running hill climbing on a very small solution space. In spite of this, the solution turned out to be outside the space of solutions the programmers were themselves visualizing. In a computer simulation, this algorithm might have behaved as intended, but the actual solution space in the real world was wider than that, allowing hardware-level interventions.

    In the case of an intelligent system that’s significantly smarter than humans on whatever axes you’re measuring, you should by default expect the system to push toward weird and creative solutions like these, and for the chosen solution to be difficult to anticipate.
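
The “arbitrarily large computer” answer in note 4 can be made concrete. The sketch below is illustration code only: it solves a tiny Nim-style game (each player removes 1, 2, or 3 stones; taking the last stone wins) rather than chess, since the same brute-force procedure applied to chess would be conceptually identical but computationally hopeless. The game rules and function names are assumptions of the example.

```python
# A minimal sketch of the "unbounded compute" answer from note 4: exhaustively
# search the game tree and backtrack from terminal positions. A one-pile
# Nim-like game stands in for chess.

from functools import lru_cache

MOVES = (1, 2, 3)  # a player may remove 1, 2, or 3 stones per turn

@lru_cache(maxsize=None)
def current_player_wins(stones: int) -> bool:
    """True if the player to move has a winning strategy (taking the last stone wins)."""
    if stones == 0:
        return False  # no stones left: the previous player took the last one and won
    # The player to move wins if some move leads to a position the opponent loses.
    return any(not current_player_wins(stones - m) for m in MOVES if m <= stones)

def winning_move(stones: int):
    """Return a move that preserves the win, or None if every move loses."""
    for m in MOVES:
        if m <= stones and not current_player_wins(stones - m):
            return m
    return None

if __name__ == "__main__":
    for n in range(1, 13):
        verdict = "first player wins" if current_player_wins(n) else "second player wins"
        print(n, verdict, "| winning move:", winning_move(n))
```

The point of note 4 is that we can write this down at all: given unlimited compute, the recursion over legal moves settles whether the player to move has a winning strategy, whereas for many alignment problems we cannot yet write the analogous program even in principle.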
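
Note 8’s observation, that improving the search does not touch the scoring rule being searched over, can also be put in code. The cauldron numbers and both searchers below are invented for the illustration.

```python
# A small sketch of the point in note 8: the scoring rule is one piece of code
# and the search procedure is another. Upgrading the search finds higher-scoring
# actions, but it never edits the rule that defines "high-scoring."

def cauldron_fullness(buckets_poured: int) -> float:
    """The fixed scoring rule: a 10-bucket cauldron, fullness capped at 1.0."""
    return min(buckets_poured, 10) / 10

ACTIONS = range(0, 11)  # candidate plans: pour 0 through 10 buckets

def weak_search(score, actions):
    """A lazy optimizer: only considers the first few candidate plans."""
    return max(list(actions)[:3], key=score)

def strong_search(score, actions):
    """A tireless optimizer: evaluates every candidate plan."""
    return max(actions, key=score)

for search in (weak_search, strong_search):
    best = search(cauldron_fullness, ACTIONS)
    print(f"{search.__name__}: pour {best} buckets -> fullness {cauldron_fullness(best)}")
# The stronger search finds a fuller cauldron (10 buckets rather than 2), but
# cauldron_fullness itself is untouched; only the search over actions improved.
```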