AI Alignment Podcast: China’s AI Superpower Dream with Jeffrey Ding

“In July 2017, The State Council of China released the New Generation Artificial Intelligence Development Plan. This policy outlines China’s strategy to build a domestic AI industry worth nearly US$150 billion in the next few years and to become the leading AI power by 2030. This officially marked the development of the AI sector as a national priority and it was included in President Xi Jinping’s grand vision for China.” (FLI’s AI Policy – China page) In the context of these developments and an increase in conversations regarding AI and China, Lucas spoke with Jeffrey Ding from the Center for the Governance of AI (GovAI). Jeffrey is the China lead for GovAI where he researches China’s AI development and strategy, as well as China’s approach to strategic technologies more generally. 

Topics discussed in this episode include:

  • China’s historical relationships with technology development
  • China’s AI goals and some recently released principles
  • Jeffrey Ding’s work, Deciphering China’s AI Dream
  • The central drivers of AI and the resulting Chinese AI strategy
  • Chinese AI capabilities
  • AGI and superintelligence awareness and thinking in China
  • Dispelling AI myths, promoting appropriate memes
  • What healthy competition between the US and China might look like

You can take a short (3 minute) survey to share your feedback about the podcast here.

 

Key points from Jeffrey: 

  • “Even if you don’t think Chinese AI capabilities are as strong as have been hyped up in the media and elsewhere, important actors will treat China as either a bogeyman figure or as a Sputnik type of wake-up call motivator… other key actors will leverage that as a narrative, as a Sputnik moment of sorts to justify whatever policies they want to do. So we want to understand what’s happening and how the conversation around what’s happening in China’s AI development is unfolding.”
  • “There certainly are differences, but we don’t want to exaggerate them. I think oftentimes analysis of China happens in a vacuum where it’s like, ‘Oh, this only happens in this mysterious far-off land we call China, and it doesn’t happen anywhere else.’ Shoshana Zuboff has this great book on surveillance capitalism that shows how the violation of privacy is pretty extensive on the US side, not only from big companies but also from the national security apparatus. So I think a similar phenomenon is taking place with the social credit system. Jeremy Daum at Yale Law’s China Center has put it really nicely where he says that, ‘We often project our worst fears about technology in AI onto what’s happening in China, and we look through a glass darkly and we unleash all of our anxieties on what’s happening onto China without reflecting on what’s happening here in the US, what’s happening here in the UK.'”
  • “I think we have to be careful about which historical analogies and memes we choose. So ‘arms race’ is a very specific callback to the Cold War context, where there’s almost these discrete types of missiles that we are racing the Soviet Union on and discrete applications that we can count up; or even going way back to what some scholars call the first industrial arms race in the military sphere, over steam-powered boats between Britain and France in the late 19th century. In all of those instances you can count up: France has four ironclads, the UK has four ironclads; they’re racing to see who can build more. I don’t think there’s anything like that. There’s not this discrete thing that we’re racing to see who can have more of. If anything, it’s about a competition to see who can absorb AI advances from abroad better, who can diffuse them throughout the economy, who can adopt them in a more sustainable way without sacrificing core values. So that’s sort of one meme that I really want to dispel. Related to that, an assumption that often influences a lot of our discourse on this is the techno-nationalist assumption, which is this idea that technology is contained within national boundaries and that the nation state is the most important actor –– which is correct and a good one to have in a lot of instances. But there are also good reasons to adopt techno-globalist assumptions as well, especially in the area of how fast technologies diffuse nowadays and also how much, underneath this national-level competition, firms from different countries are working together and making standards alliances with each other. So there’s this undercurrent of techno-globalism, where there are people flows, idea flows, company flows happening while the coverage and the sexy topic is always going to be about national-level competition, zero-sum competition, relative gains rhetoric. So you’re trying to find a balance between those two streams.”
  • “I think currently a lot of people in the US are locked into this mindset that the only two players that exist in the world are the US and China. And if you look at our conversation, right, oftentimes I’ve displayed that bias as well. We should probably have talked a lot more about China-EU or China-Japan cooperation in this space and networks in this space, because there’s a lot happening there too. So a lot of US policy makers see this as a two-player game between the US and China. And then in that sense, if there’s some cancer research project about discovering proteins using AI that may benefit China by 10 points and benefit the US only by eight points, but it’s going to save a lot of people from cancer –– if you only care about making everything about maintaining a lead over China, then you might not take that deal. But if you think about it from the broader landscape of it’s not just a zero-sum competition between the US and China, then your evaluation of those different point structures and what you think is rational will change.”

 

Important timestamps: 

0:00 Intro 

2:14 Motivations for the conversation

5:44 Historical background on China and AI 

8:13 AI principles in China and the US 

16:20 Jeffrey Ding’s work, Deciphering China’s AI Dream 

21:55 Does China’s government play a central hand in setting regulations? 

23:25 Can Chinese implementation of regulations and standards move faster than in the US? Is China buying shares in companies to have decision making power? 

27:05 The components and drivers of AI in China and how they affect Chinese AI strategy 

35:30 Chinese government guidance funds for AI development 

37:30 Analyzing China’s AI capabilities 

44:20 Implications for the future of AI and AI strategy given the current state of the world 

49:30 How important are AGI and superintelligence concerns in China?

52:30 Are there explicit technical AI research programs in China for AGI? 

53:40 Dispelling AI myths and promoting appropriate memes

56:10 Relative and absolute gains in international politics 

59:11 On Peter Thiel’s recent comments on superintelligence, AI, and China 

1:04:10 Major updates and changes since Jeffrey wrote Deciphering China’s AI Dream 

1:05:50 What does healthy competition between China and the US look like? 

1:11:05 Where to follow Jeffrey and read more of his work

 

Works referenced 

Deciphering China’s AI Dream

FLI AI Policy – China page

ChinAI Newsletter

Jeff’s Twitter

Previous podcast with Jeffrey

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. More works from GovAI can be found here.

 

Lucas Perry: Hello everyone and welcome back to the AI Alignment Podcast at The Future of Life Institute. I’m Lucas Perry and today we’ll be speaking with Jeffrey Ding from The Future of Humanity Institute on China and their efforts to be the leading AI superpower by 2030. In this podcast, we provide a largely descriptive account of China’s historical technological efforts, their current intentions and methods for pushing Chinese AI success, and some of the foundational AI principles being called for within China; we cover the drivers of AI progress, the components of success, and China’s strategies born of these variables; we also assess China’s current and likely future AI capabilities, and the consequences of all this tied together. The FLI AI Policy – China page and Jeffrey Ding’s publication Deciphering China’s AI Dream are large drivers of this conversation, and I recommend you check them out.

If you find this podcast interesting or useful, consider sharing it with friends on social media platforms, forums, or anywhere you think it might be found valuable. As always, you can provide feedback for me by following the SurveyMonkey link found in the description of wherever you might find this podcast. 

Jeffrey Ding specializes in AI strategy and China’s approach to strategic technologies more generally. He is the China lead for the Center for the Governance of AI. There, Jeff researches China’s development of AI, and his work has been cited in the Washington Post, South China Morning Post, MIT Technology Review, Bloomberg News, Quartz, and other outlets. He is a fluent Mandarin speaker and has worked at the US Department of State and the Hong Kong Legislative Council. He is also reading for a PhD in international relations as a Rhodes Scholar at the University of Oxford. And so without further ado, let’s jump into our conversation with Jeffrey Ding.

Let’s go ahead and start off by providing a bit of the motivations for this conversation today. So why is it that China is important for AI alignment? Why should we be having this conversation? Why are people worried about the US-China AI Dynamic?

Jeffrey Ding: Two main reasons, and I think they follow an “even if” structure. The first reason is China is probably second only to the US in terms of a comprehensive national AI capabilities measurement. That’s a very hard and abstract thing to measure. But if you’re looking at which countries have the firms on the leading edge of the technology, the universities, the research labs, and then the scale to lead in industrial terms and also in potential investment in projects related to artificial general intelligence, I would put China second only to the US, at least in terms of my intuition and sort of my analysis that I’ve done on the subject.

The second reason is even if you don’t think Chinese AI capabilities are as strong as have been hyped up in the media and elsewhere, important actors will treat China as either a bogeyman figure or as a Sputnik type of wake-up call motivator. And you can see this in the rhetoric coming from the US especially today, and even in areas that aren’t necessarily connected. So Axios had a leaked memo from the US National Security Council that was talking about centralizing US telecommunication services to prepare for 5G. And in the memo, one of the justifications for this was because China is leading in AI advances. The memo doesn’t really tie the two together. There are connections –– 5G may empower different AI technologies –– but that’s a clear example of how even if Chinese capabilities in AI, especially in projects related to AGI, are not as substantial as has been reported, or we think, other key actors will leverage that as a narrative, as a Sputnik moment of sorts to justify whatever policies they want to do. So we want to understand what’s happening and how the conversation around what’s happening in China’s AI development is unfolding.

Lucas Perry: So the first aspect being that they’re basically the second most powerful AI developer. And we can get into later their relative strength to the US; I think that in your estimation, they have about half as much AI capability relative to the United States. And here, the second one is you’re saying –– and there’s this common meme in AI Alignment about how avoiding races is important because in races, actors have incentives to cut corners in order to gain decisive strategic advantage by being the first to deploy advanced forms of artificial intelligence –– so there’s this important need, you’re saying, for actually understanding the relationship and state of Chinese AI Development to dispel inflammatory race narratives?

Jeffrey Ding: Yeah, I would say China’s probably at the center of most race narratives when we talk about AI arms races and the conversation in at least US policy-making circles –– which is what I follow most, US national security circles –– has not talked necessarily about AI as a decisive strategic advantage in terms of artificial general intelligence, but definitely in terms of decisive strategic advantage and who has more productive power, military power. So yeah, I would agree with that.

Lucas Perry: All right, so let’s provide a little bit more historical background here, I think, to sort of contextualize why there’s this rising conversation about the role of China in the AI space. So I’m taking this here from the FLI AI Policy China page: “In July of 2017, the State Council of China released the New Generation Artificial Intelligence Development Plan. And this was an AI research strategy policy to build a domestic AI industry worth nearly $150 billion in the next few years” –– again, this was in 2017 –– “and to become a leading AI power by 2030. This officially marked the development of the AI sector as a national priority, and it was included in President Xi Jinping’s grand vision for China.” And just adding a little bit more color here: “given this, the government expects its companies and research facilities to be at the same level as leading countries like the United States by 2020.” So within a year from now –– maybe a bit ambitious, given your estimation that they have about half as much capability as the US.

But continuing this picture I’m painting: “five years later, it calls for breakthroughs in select disciplines within AI” –– so that would be by 2025. “That will become a key impetus for economic transformation. And then in the final stage, by 2030, China is intending to become the world’s premier artificial intelligence innovation center, which will in turn foster a new national leadership and establish the key fundamentals for an economic great power,” in their words. So there’s this very clear, intentional stance that China has been developing in the past few years.

Jeffrey Ding: Yeah, definitely. And I think it was Jess Newman who put together the AI policy in China page –– did a great job. It’s a good summary of this New Generation AI Development Plan issued in July 2017 and I would say the plan was more reflective of momentum that was already happening at the local level with companies like Baidu, Tencent, Alibaba, making the shift to focus on AI as a core part of their business strategy. Shenzhen, other cities, had already set up their own local funds and plans, and this was an instance of the Chinese national government, in the words of I think Paul Triolo and some other folks at New America, “riding the wave,” and kind of joining this wave of AI development.

Lucas Perry: And so adding a bit more color here again: there’s also been developments in principles that are being espoused in this context. I’d say probably the first major principles on AI were developed at the Asilomar Conference, at least those pertaining to AGI. In June 2019, the New Generation AI Governance Expert Committee released principles for next-generation artificial intelligence governance, which included tenets like harmony and friendliness, fairness and justice, inclusiveness and sharing, open cooperation, shared responsibility, and agile governance. 

And then also in May of 2019, the Beijing AI Principles were released. That was by a multi-stakeholder coalition, including the Beijing Academy of Artificial Intelligence, a bunch of top universities in China, as well as industrial firms such as Baidu, Alibaba, and Tencent. And these 15 principles, among other things, called for “the construction of a human community with a shared future and the realization of beneficial AI for humankind and nature.” So it seems like principles and intentions are also being developed in China that echo and reflect many of the principles and intentions that have been developing in the States.

Jeffrey Ding: Yeah, I think there’s definitely a lot of similarities, and I think it’s not just with this recent flurry of AI ethics documents that you’ve done a good job of summarizing. It dates back even to the plan that we were just talking about. If you read the July 2017 New Generation AI Plan carefully, there’s a lot of sections devoted to AI ethics, including some sections that are worried about human-robot alienation.

So, depending on how you read that, you could read that as already anticipating some of the issues that could occur if human goals and AI goals do not align. Even back in March, I believe, of 2018, a lot of government bodies came together with companies to put out a white paper on AI standardization, which I translated for New America. And in that, they talk about AI safety and security issues, how it’s important to ensure that the design goals of AI are consistent with the interests, ethics, and morals of most humans. So a lot of these topics, I don’t even know if they’re western topics. These are just basic concepts: We want systems to be controllable and reliable. And yes, those have deeper meanings in the sense of AGI, but that doesn’t mean that some of these initial core values can’t be really easily applied to some of these deeper meanings that we talk about when we talk about AGI ethics.

Lucas Perry: So with all of the animosity and posturing and whatever that happens between the United States and China, these sort of principles and intentions which are being developed, at least in terms of AI –– both of them sort of have international intentions for the common good of humanity; At least that’s what is being stated in these documents. How do you think about the reality of the day-to-day combativeness and competition between the US and China in relation to these principles which strive towards the deployment of AI for the common good of humanity more broadly, rather than just within the context of one country?

Jeffrey Ding: It’s a really good question. I think the first point to clarify is these statements don’t have teeth behind them unless they’re enforced, unless there are resources dedicated to funding research on these issues, to Track 1.5 and Track 2 diplomacy, to technical meetings between researchers. These are just statements that people can put out, and they don’t have teeth unless they’re actually enforced. Oftentimes, we know that’s the case: firms like Google and Microsoft, Amazon, will put out principles about facial recognition or what their ethical stances are, but behind the scenes they’ll chase profit motives and maximize shareholder value. And I would say the same would take place for Tencent, Baidu, Alibaba. So I want to clarify that, first of all. The competitive dynamics are real: it’s partly not just an AI story, it’s a broader story of China’s rise. I come from an international relations background –– I’m a PhD student at Oxford studying that –– and there’s a big debate in the literature about what happens when a rising power challenges an established power. Oftentimes frictions result, and it’s about how to manage these frictions without leading to accidents, miscalculation, arms races. And that’s the tough part of it.

Lucas Perry: So it seems –– at least for a baseline, thinking that we’re still pretty early in the process of AI alignment or this long-term vision we have –– it seems like at least there is theoretically some shared foundational principles reflective across both the cultures. Again, these Beijing AI Principles also include focus on benefiting all of humanity and the environment; serving human values such as privacy, dignity, freedom, autonomy and rights; continuous focus on AI safety and security; inclusivity, openness; supporting international cooperation; and avoiding a malicious AI race. So the question now simply seems: implementation of these shared principles, ensuring that they manifest.

Jeffrey Ding: Yeah. I don’t mean to be dismissive of these efforts to create principles that were at least expressing the rhetoric of planning for all of humanity. I think there’s definitely a lot of areas of US-China cooperation in the past that have also echoed some of these principles: bilateral cooperation on climate change research; there’s a good nuclear safety cooperation module; different centers that we’ve worked on. But at the same time, I also think that even with that list of terms you just mentioned, there are some differences in terms of how both sides understand different terms.

So with privacy in the Chinese context, it’s not necessarily that Chinese people or political actors don’t care about privacy. It’s that privacy might mean more of privacy as an instrumental right, to ensure your financial data doesn’t get leaked, you don’t lose all your money; to ensure that your consumer data is protected from companies; but not necessarily in other contexts where privacy is seen as an intrinsic right, as a civil right of sorts, where it’s also about an individual’s protection from government surveillance. That type of protection is not caught up in conversations about privacy in China as much.

Lucas Perry: Right, so there are going to be implicitly different understandings about some of these principles that we’ll have to navigate. And again, you brought up privacy as something –– and this has been something people have been paying more attention to, as there has been kind of this hype and maybe a little bit of hysteria over the China social crediting system, and plenty of misunderstanding around that.

Jeffrey Ding: Yeah, and this ties into a lot of what I’ve been thinking about lately, which is there certainly are differences, but we don’t want to exaggerate them. I think oftentimes analysis of China happens in a vacuum where it’s like, “Oh, this only happens in this mysterious far off land we call China and it doesn’t happen anywhere else.” Shoshana Zuboff has this great book on surveillance capitalism that shows how the violation of privacy is pretty extensive on the US side, not only from big companies but also from the national security apparatus.

So I think a similar phenomenon is taking place with the social credit system. Jeremy Daum at Yale Law’s China Center has put it really nicely where he says that, “We often project our worst fears about technology in AI onto what’s happening in China, and we look through a glass darkly and we unleash all of our anxieties on what’s happening onto China without reflecting on what’s happening here in the US, what’s happening here in the UK.”

Lucas Perry: Right. I would guess that generally in human psychology it seems easier to see the evil in the other rather than in the self.

Jeffrey Ding: Yeah, that’s a little bit out of range for me, but I’m sure there’s studies on that.

Lucas Perry: Yeah. All right, so let’s get in here now to your work, Deciphering China’s AI Dream. This is a work that you published in 2018, and you divided it into four different sections: first you discuss context, then components, then capabilities, and then consequences, all in relation to AI in China. Would you like to just sort of unpack that structure?

Jeffrey Ding: Yeah, this was very much just a descriptive paper. I was just starting out researching this area and I just had a bunch of basic questions. So question number one for context: what is the background behind China’s AI Strategy? How does it compare to other countries’ plans? How does it compare to its own past science and technology plans? The second question was, what are they doing in terms of pushing forward drivers of AI Development? So that’s the component section. The third question is, how well are they doing? It’s about assessing China’s AI capabilities. And then the fourth is, so what’s it all mean? Why does it matter? And that’s where I talk about the consequences and the potential implications of China’s AI ambitions for issues related to AI Safety, some of the AGI issues we’ve been talking about, national security, economic development, and social governance.

Lucas Perry: So let’s go ahead and move sequentially through these. We’ve already here discussed a bit of context about what’s going on in China in terms of at least the intentional stance and the development of some principles. Are there any other key facets or areas here that you’d like to add about China’s AI strategy in terms of its past science and technology? Just to paint a picture for our listeners.

Jeffrey Ding: Yeah, definitely. I think two past critical technologies that you could look at are the plans to build up China’s space industry, the aerospace sector; and then also biotechnology. So in each of these other areas there was also a national-level strategic plan; an agency or an office was set up to manage this national plan; substantial funding was dedicated. With the New Generation AI Plan, there was also a sort of implementation office set up across a bunch of the different departments tasked with implementing the plan.

AI was also elevated to the level of a national strategic technology. And so what’s different between these two phases? Because it’s debatable how successful the space plan and the biotech plans have been. What’s different with AI is you already had big tech giants who are pursuing AI capabilities and have the resources to shift a lot of their investments toward the AI space, independent of government funding mechanisms: companies like Baidu, Tencent, Alibaba, even startups that have really risen like SenseTime. And you see that reflected in the type of model.

It’s no longer the traditional national champion model, where the government almost builds a company from the ground up, maybe with the help of international financiers and investors. Now it’s a national team model, where they ask for the support of these leading tech giants, but it’s not like these tech giants are reliant on the government for subsidies or funding to survive. They are already flourishing firms that have an international presence.

The other bit of context I would just add is that if you look at the New Generation Plan, there’s a lot of terms that are related to manufacturing. And I mentioned in Deciphering China’s AI Dream how there’s a lot of connections and callbacks to manufacturing plans. And I think this is key because it’s one aspect of China’s push for AI: they want to escape the middle-income trap and get to those higher levels of value-add in the manufacturing chain. So I want to stress that as a key point of context.

Lucas Perry: So the framing here is the Chinese government is trying to enable companies which already exist and already are successful. And this stands in contrast to the US and the UK where it seems like the government isn’t even part of a teamwork effort.

Jeffrey Ding: Yeah. So maybe a good comparison would be how technical standards develop, which is an emphasis of not only this Deciphering China’s AI Dream paper but a lot of my later work. So I’m talking about technical standards like how do you measure the accuracy of facial recognition systems and who gets to set those measures, or product safety standards for different AI applications. And in many other countries, including the US, the process for that is much more decentralized. It’s largely done through industry alliances. There is NIST, which is a body under the Department of Commerce in the US that helps coordinate that to some extent, but not nearly as much as what happens in China with the Standardization Administration of China (SAC), I believe. There, it’s much more of a centralized effort to create technical standards. And there are pros and cons to both.

With the more decentralized approach, you minimize the risks of technological lock-in from setting standards too early, and you let firms have a little bit more freedom, and competition as well. Whereas having a more centralized, top-down effort might lead to earlier harmonization on standards and let you leverage economies of scale when you just have more interoperable protocols. That could help with data sharing, and help with creating a stable test bed for different firms to compete on and measure the stuff I was talking about earlier, like algorithmic accuracy. So there are pros and cons of the two different approaches. But yeah, I think that does flesh out how the relationship between firms and the government differs a little bit, at least in the context of standards setting.

Lucas Perry: So on top of standards setting, would you say China’s government plays more of a central hand in the regulation as well?

Jeffrey Ding: That’s a good question. It probably differs in terms of what area of regulation. So I think in some cases there’s a willingness to let companies experiment and then put down regulations afterward. So this is the classic example with mobile payments: There was definitely a gray space as to how these platforms like Alipay, WeChat Pay were essentially pushing into a gray area of law in terms of who could handle this much money that’s traditionally in the hands of the banks. Instead of clamping down on it right away, the Chinese government kind of let that play itself out, and then once these mobile pay platforms got big enough that they’re holding so much capital and have so much influence on the monetary stock, they then started drafting regulations for them to be almost treated as banks. So that’s an example of where it’s more of a hands-off approach.

In AI, folks have said that the US and China are probably closer in terms of their approach to regulation, which is much more hands-off than the EU. And I think that’s just a product partly of the structural differences in the AI ecosystem. The EU has very few big internet giants and AI algorithm firms, so they have more of an incentive to regulate other countries’ big tech giants and AI firms.

Lucas Perry: So two questions are coming up. One is: is there sufficiently more unity and coordination in the Chinese government such that when standards, regulations, or decisions surrounding AI need to be implemented, they’re able to move much quicker than the United States government? And the second: I believe you also mentioned that the Chinese government is trying to find ways of using government money to buy up shares in these companies and gain decision-making power.

Jeffrey Ding: Yeah, I’ll start with the latter. The reference is to the establishment of special management shares: so these would be almost symbolic, less than 1% shares in a company so that they could maybe get a seat on the board –– or another vehicle is through the establishment of party committees within companies, so there’s always a tie to party leadership. I don’t have that much more insight into how these work. I think probably it’s fair to say that the day-to-day and long-term planning decisions of a lot of these companies are mostly just driven by what their leadership wants, not necessarily what the party leaders want, because it’s just very hard to micromanage these billion dollar giants.

And that was part of a lot of what was happening with the reform of the state-owned enterprise sector, where, I think it was SASAC –– there are a lot of acronyms –– but this was the body in control of state-owned enterprises, and they significantly cut down the number of enterprises that they directly oversee and sort of focused on the big ones, like the big banks or the big oil companies.

To your first point, on how smooth policy enforcement is: this is not something I’ve studied that carefully. I think to some extent there’s more variability in terms of what the government does. I read somewhere that if you look at the government relations departments of Chinese big tech companies versus US big tech companies, there’s just a lot more on the Chinese side –– although that might be changing with recent developments in the US. One case I’m thinking of right now is the Chinese government worrying about addictive games and then issuing a ban against some games, including Tencent’s PUBG, which wrecked Tencent’s game revenues and really hurt its stock value.

So that’s something that would be very hard for the US government to do –– just to say, “Hey, this game is banned.” At the same time, there’s a lot of messiness with this, which is why I’m pontificating and equivocating and not really giving you a stable answer: local governments don’t implement things that well, and there’s a lot of local-center tension. And especially with technical stuff –– this is the case in the US as well –– there’s just not as much technical talent in the government. So with a lot of these technical privacy issues, it’s very hard to develop good regulations if you don’t actually understand the tech. What they’ve been trying to do is audit the privacy policies of different social media and tech companies, starting with 10 of the biggest. So I think it’s very much a developing process in both China and the US.

Lucas Perry: So you’re saying that the Chinese government, like the US government, lacks scientific or technical expertise? I had some idea in my head that many Chinese mayors and other political figures actually have engineering or science degrees.

Jeffrey Ding: That’s definitely true. But by technical expertise I mean something like what the US government did with the US Digital Service, where they’re getting people who have worked in leading-edge tech firms to then work for the government. That type of thing would be useful in China.

Lucas Perry: So let’s move on to the second part, discussing components. Here you relate the key features of China’s AI strategy to the drivers of AI development, which you say are: hardware, in the form of chips for training and executing AI algorithms; data, as an input for AI algorithms; research and algorithm development –– the actual AI researchers working on the architectures and systems through which the data will be put; and the commercial AI ecosystems, which I suppose support and feed the first three. What can you say about the state of these components in China and how they affect China’s AI strategy?

Jeffrey Ding: The main thing I want to emphasize here is that a lot of this is the Chinese government trying to fill in gaps –– a lot of it is about enabling the people and firms that are already doing the work. One of the gaps is that private firms tend to under-invest in basic research, or in broader education, because they can’t capture all those gains. So the government tries not only to support AI as a national-level discipline but also to construct AI institutes and help fund talent programs to bring back leading researchers from overseas. So that’s one part of it.

The second part of it, which I did not talk about that much in this section of the report but have recently researched more and more, is that where the government is more actively driving things is when it is the final end client. This is definitely the case in the surveillance space: provincial-level public security bureaus are working with companies across hardware, data, research and development, and the whole security systems integration process to develop more advanced high-tech surveillance systems.

Lucas Perry: Expanding here, there’s also this way of understanding Chinese AI strategy as it relates to previous technologies, and how it’s similar or different. The ways in which it’s similar involve a strong degree of state support and intervention, transfer of both technology and talent, and investment in long-term whole-of-society measures; I’m quoting you here.

Jeffrey Ding: Yeah.

Lucas Perry: Furthermore, you state that China is adopting a catch-up approach in the hardware necessary to train and execute AI algorithms. This points towards an asymmetry: most of the chip manufacturers are not in China, so Chinese firms have to buy chips from companies like Nvidia. You then go on to mention how access to large quantities of data is an important driver of AI systems, and that China’s data protectionism favors Chinese AI companies in accessing data from China’s large domestic market, but it also detracts from cross-border pooling of data.

Jeffrey Ding: Yeah, and just to expand on that point, there’s been good research out of folks at DigiChina, which is a New America Institute, that looks at the cybersecurity law –– and we’re still figuring out how that’s going to be implemented completely, but the original draft would have prevented companies from taking data that was collected inside of China and taking it outside of China.

And actually these folks at DigiChina point out how some of the major backlash to this law came not just from US multinational corporations but also from Chinese multinationals. That aspect of data protectionism illustrates a key trade-off. In one sense, countries and national security players are valuing personal data almost as a national security asset, because of the risk of blackmail or something like it. This is the whole Grindr case in the US, where I think Grindr was encouraged, or strongly encouraged, by the US government to find a non-Chinese owner. So on one hand you want to protect personal information, but on the other hand, free data flows are critical to spurring gains in innovation as well, for some of these larger companies.

Lucas Perry: Is there an interest here to be able to sell their data to other companies abroad? Is that why they’re against this data protectionism in China?

Jeffrey Ding: I don’t know that much about this particular case, but I think Alibaba and Tencent have labs all around the world. So they might want to collate their data together, so they were worried that the cybersecurity law would affect that.

Lucas Perry: And just highlighting here for the listeners that access to large amounts of high-quality data is extremely important for efficaciously training machine learning systems; data is a new, very valuable resource. You go on to say, and I’m quoting you again, “China is also actively recruiting and cultivating talented researchers to develop AI algorithms. The State Council’s AI plan outlines a two-pronged gathering and training approach.” This seems very important, but it also seems, from your report, that China is largely losing AI talent to America. What can you say about this?

Jeffrey Ding: Often the biggest bottleneck cited to AI development is lack of technical talent. That gap will eventually be filled just based on pure operations in the market, but in the meantime there has been a focus on AI talent, whether that’s through some of these national talent programs, or it also happens through things like local governments offering tax breaks for companies who may have headquarters around the world.

For example, Jingchi, which is an autonomous driving startup, had their main base –– or one of their main bases –– in California. But then Shenzhen or Guangzhou (I’m not sure which local government it was) gave them basically free office space to move one of their bases back to China, and that brings a lot of talented people back. And you’re right, a lot of the best and brightest do go to US companies as well, and one of the key channels for recruiting Chinese students is big firms setting up offshore research and development labs, like Microsoft Research Asia in Beijing.

And then the third thing I’ll point out –– something I’ve noticed recently when doing translations from science and tech media platforms that look at the talent space in particular –– is that there’s sometimes a tension between the gathering and the training planks. There have been complaints from domestic Chinese researchers. Say you have two super-talented PhD students: one decides to stay in China, the other goes abroad for their post-doc. Oftentimes the talent plans –– the recruiting, gathering plank of this talent policy –– will favor the person who went abroad for the post-doc over the person who stayed in China, even though they might be just as good. So that actually creates an incentive for more people to go abroad. And there’s been good research showing that a lot of the best and brightest end up staying abroad; the stay rates, especially in the US for Chinese PhD students in computer science fields, are shockingly high.

Lucas Perry: What can you say about Chinese PhD student anxieties with regards to leaving the United States to go visit family in China and come back? I’ve heard that there may be anxieties about not being let back in given that their research has focused on AI and that there’s been increasing US suspicions of spying or whatever.

Jeffrey Ding: I don’t know how much of it is a recent development, but when applying for different stages of the path to permanent residency –– whether it’s applying for the H-1B visa or being in the green card pipeline –– I’ve heard, just secondhand, that they avoid traveling abroad or going back to visit family, to show commitment that they’re residing here in the US. So I don’t know how much of that is recent. My dad actually started out as a PhD student in math at the University of Iowa before switching to computer science, and I remember we had a death in the family and he couldn’t go back because it was so early on in his stay. So I’m sure it’s a conflicted situation for a lot of Chinese international students in the US.

Lucas Perry: So moving along here and ending this components section –– and this goes back to what we were discussing earlier about government guidance funds –– the Chinese government is also starting to take a more active role in funding AI ventures, helping to grow the fourth driver of AI development: the commercial AI ecosystems, which support and are the context for hardware, data, and research on algorithm development. The Chinese government is disbursing funds through what are called government guidance funds, or GGFs, set up by local governments and state-owned companies, and it has invested more than a billion US dollars in domestic startups. This seems to be in clear contrast with how America functions here, with much of the investment shifting towards healthcare and AI as the priority areas in the last two years.

Jeffrey Ding: Right, yeah. So the GGFs are an interesting funding vehicle. The China Money Network, which has, I think, the best English-language coverage of these vehicles, says they may be history’s greatest experiment in using state capital to reshape a nation’s economy. These are essentially public-private partnerships, PPPs, which exist across the world, including in the US. The idea is basically that the state seeds and anchors these investment vehicles, and then they partner with private capital to invest in startups and companies that the government thinks either support a particular policy initiative or are good for overall development.

A lot of this is hard to decipher in terms of what the impact has been so far, because publicly available information is relatively scarce. I mention in my report that these funds haven’t had a successful exit yet, which may just mean they need more time. There have also been complaints that the big VCs –– whether Chinese VCs or international VCs with a Chinese arm –– much prefer to go it alone rather than be tied to all the strings and potential regulations that come with working with the government. So I think it’s definitely a case of time will tell, and this is a very fertile research area that I know some people are looking into. So be on the lookout for more conclusive findings about these GGFs, especially how they relate to emerging technologies.

Lucas Perry: All right. So we’re getting to your capabilities section, which assesses the current state of China’s AI capabilities across the four drivers of AI development. Here you construct an AI Potential Index: an index of the potential of a country, based on these four variables, to create successful AI products. Based on your research, you give China an AI Potential Index score of 17, about half the US’s score of 33. And you state that what is essential to draw from this finding is the relative scale, or at least the proportionality, between China and the US. So the conclusion we can draw is that China trails the US in every driver except access to data, and that on these dimensions overall China is about half as capable as the US.

Jeffrey Ding: Yes, so the AIPI, the AI Potential Index, was definitely just meant as a first cut at developing a measure with which we can make comparative claims. I think at the time, and even now, we just throw around questions like, “Who is ahead in AI?” I was reading a recent Defense One article that said China is the world leader in GANs –– Generative Adversarial Networks. That’s not even a coherent claim. Are you the leader at developing the talent who will make advancements in GANs? Are you the leader at applying and deploying GANs in the military field? Are you the leader in producing the most publications related to GANs?

I think that’s what was frustrating me about the conversation and the net assessment of different countries’ AI capabilities, so that’s why I tried to develop a more systematic framework, which looked at the different drivers and asked: what is the potential of a country’s AI capabilities, based on its marks across these drivers?

Since then, probably the main thing I’ve done to update this was in my written testimony before the US-China Economic and Security Review Commission, where I switch up a little how I evaluate the current AI capabilities of China and the US. Basically, there’s this very fuzzy concept of national AI capabilities that we throw around, and I slice it up into three cross-sections. The first is to look at the scientific and technological inputs and outputs different countries are putting into AI. So that’s: how many publications are coming out of Europe versus China versus the US? What are the outputs, in the sense of publications, and the inputs, in the sense of R&D investments? So let’s take a look at that.

The second slice is to not just say “AI” –– I think every time you say AI it’s always better to specify subtypes. In this second slice I look at different layers of the AI value chain: the foundational layer, the technological layer, and the application layer. So, for example, the foundational layer may be: who is leading in developing the AI open-source software that serves as the technological backbone for a lot of these AI applications and technologies?

And then the third slice I take is different subdomains of AI –– computer vision, predictive intelligence, natural language processing, et cetera. And basically my conclusion –– I throw a bunch of statistics out there in this written testimony, some drawing from the AI Potential Index I put out last year –– is that China is not poised to overtake the US in the technology domain of AI; rather, the US maintains structural advantages in the quality of science and technology inputs and outputs, in the foundational layers of the AI value chain, and in key subdomains of AI.

So yeah, this stuff changes really fast too. I think a lot of people are trying to put together more systematic ways of measuring these things: Jack Clark at OpenAI; projects like the AI Index out of Stanford University; Matt Sheehan recently put out a really good piece for MacroPolo developing a five-dimensional framework for understanding data. In this AIPI first cut, my data indicator is just a very raw “who has more mobile phone users,” but that obviously doesn’t matter for who’s going to lead in autonomous vehicles. So a finer-grained understanding of how to measure the different drivers will definitely help this field going forward.

Lucas Perry: What can you say about symmetries or asymmetries across subfields of AI research, like GANs or computer vision or any number of others? Can we expect strong specialties to develop in one country rather than another, or lasting asymmetries in this space, or does research publication subvert this to some extent?

Jeffrey Ding: I think natural language processing is probably the best example, because everyone says NLP, but then you just have that abstract word and never dive into, “Oh wait, China might have a comparative advantage in Chinese-language data processing, speech recognition, knowledge mapping” –– which makes sense: there’s just more of an incentive for Chinese companies to put out huge open-source repositories to train automatic speech recognition.

So there might be some advantage in Chinese-language data processing, although Microsoft Research Asia has very strong NLP capabilities as well. Facial recognition is maybe another area of comparative advantage: I think in my testimony I cite that China published 900 patents in this subdomain in 2017, while in that same year fewer than 150 patents related to facial recognition were filed in the US. That could be partly because there’s so much more of a fervor for surveillance applications. But in other domains, such as larger-scale business applications, the US probably possesses a decisive advantage. Autonomous vehicles are the best example of that: in my opinion, Google’s Waymo and GM’s Cruise are lapping the field.

And then finally, in my written testimony I also try to look at military applications, and I find one metric that puts the US as having filed more than seven times as many military patents with the terms “autonomous” or “unmanned” in the patent abstract in the years 2003 to 2015. So that’s one of the research streams I’m really interested in: how can we have more fine-grained metrics that actually put China’s AI development into context, so we can have a more measured understanding of it?

Lucas Perry: All right, so we’ve gone into length now providing a descriptive account of China and the United States and key descriptive insights of your research. Moving into consequences now, I’ll just state some of these insights which you bring to light in your paper and then maybe you can expand on them a bit.

Jeffrey Ding: Sure.

Lucas Perry: You discuss the potential implications of China’s AI dream for issues of AI safety and ethics, national security, economic development, and social governance. The thinking here is becoming more diversified and substantive, though you claim it’s also too early to form firm conclusions about the long-term trajectory of China’s AI development; This is probably also true of any other country, really. You go on to conclude that a group of Chinese actors is increasingly engaged with issues of AI safety and ethics. 

A new book has been authored by Tencent’s Research Institute, and it includes a chapter in which the authors discuss the Asilomar Principles in detail and call for strong regulations and control measures for AI. There’s also the conclusion that military applications of AI could provide a decisive strategic advantage in international security; the degree to which China’s approach to military AI represents a revolution in military affairs is an important question to study, to see how strategic advantages between the United States and China continue to change. You continue by elucidating how economic benefit is the primary and immediate driving force behind China’s development of AI –– and again, I think you highlighted the manufacturing perspective on this.

And finally, China’s adoption of AI technologies could also have implications for its mode of social governance. On the State Council’s AI plan, you state, “AI will play an irreplaceable role in maintaining social stability, an aim reflected in local-level integrations of AI across a broad range of public services, including judicial services, medical care, and public security.” So given these insights you’ve come to, and the consequences of this descriptive picture we’ve painted about China and AI, is there anything else you’d like to add here?

Jeffrey Ding: Yeah, as you were laying out those four categories of consequences, I was thinking that this is what makes this area so exciting to study, because each of those four consequences maps onto a research field. AI ethics and safety maps onto benevolent AI efforts –– stuff that FLI is doing –– and the broader technology studies, critical technology studies, and technology ethics fields. In the social governance space, with AI as a tool of social control, you ask: what are the social aftershocks of AI’s economic implications? There you have the entire field of democracy studies, or studies of technology and authoritarianism. For the economic benefits, you have the entire field of innovation studies: how do we understand the productivity benefits of general-purpose technologies? And with AI as a revolution in military affairs, you have the whole field of security studies, trying to understand the implications of new emerging technologies for national security.

So it’s easy to start delineating these into separate containers. I think what’s hard, especially for those of us who are really concerned about that first field –– AI ethics and safety, and the risks of AGI arms races –– is that a lot of other people are really, really concerned about those other three fields. How do we tie in concepts from those fields? How do we learn from them and shape the language we’re using to be in conversation with them –– and also see where those fields may actually conflict with some of our goals? And then how do we navigate those conflicts? How do we prioritize different things over others? It’s an exciting but daunting prospect ahead.

Lucas Perry: If you’re listening to this and are interested in becoming an AI researcher in terms of the China landscape, we need you. There’s a lot of great and open research questions here to work on.

Jeffrey Ding: For sure. For sure.

Lucas Perry: So I’ve extracted some insights from previous podcasts you did –– I’ll leave a link to those on the page for this podcast –– and I just want to rapid-fire these as points I thought were interesting that we may or may not have covered here. You point out a language asymmetry: the best Chinese AI researchers read both English and Chinese, whereas Western researchers generally cannot do this. Your newsletter, ChinAI, attempts to correct for this, as you translate important Chinese tech-related material into English; I suggest everyone follow that if you’re interested in continuing to track China and AI. There is more international cooperation on research at international conferences –– this is a general trend you point out: some top Chinese AI conferences are English-only. Furthermore, I believe you claim that the top 10% of AI research is still happening in America and the UK.

Another point you’ve brought up is that China is behind on military AI uses. I’m also interested to hear you expand a little more on China, AI safety, and superintelligence, because on this podcast we often take the lens of long-term AI issues, AGI, and superintelligence. I think you mentioned that the Nick Bostrom of China is Professor Zhao Tingyang –– correct me if I get that wrong. And I’m curious whether you can expand on how large or serious this China superintelligence FLI/FHI vibe is, what its implications are, and whether there are any orgs in China explicitly focused on this. I’m sorry if this is a silly question, but are there nonprofits in China in the same way there are in the US? How does that function? Is China on the brink of having an FHI or FLI or MIRI or anything like this?

Jeffrey Ding: So, a lot to untangle there, and all really good questions. First, just to clarify: yes, there are definitely nonprofits and non-governmental organizations. In recent years there has been some pressure on international non-governmental and nonprofit organizations, but nonprofits definitely exist. One of the open-source NLP initiatives I mentioned earlier, the Chinese-language corpus, was put together by a nonprofit online organization called the AIShell Foundation; they put together AIShell-1 and AIShell-2, which are the largest open-source speech corpora available for Mandarin speech recognition.

I haven’t really followed up on Zhao Tingyang. He’s a philosopher at the Chinese Academy of Social Sciences. The “Nick Bostrom of China” label was more of a newsletter headline to get people to read, but he does devote a lot of time and thinking to the long-term risks of AI. Another professor, at Nanjing University, by the name of Zhi-Hua Zhou, has published articles about the need to not even touch some of what he calls strong AI. These were published in a pretty influential outlet of the China Computer Federation, which brings together a lot of the big-name computer scientists. So there are definitely conversations about this happening. Whether there is an FHI or FLI equivalent –– probably not, at least not yet.

Peking University may be developing something in this space. The Berggruen Institute is also, I think, looking at some related issues. There’s probably a lot happening in Hong Kong as well; maybe we just haven’t looked hard enough. I think the biggest difference is that there’s definitely not something on the level of a DeepMind or an OpenAI, because even among the firms with the best general AI capabilities, there’s nothing like DeepMind and OpenAI –– almost unique entities where profits and stocks don’t matter.

So yeah, definitely some differences. But honestly, I updated significantly once I started reading more, and nobody had really looked at this Zhi-Hua Zhou essay before we went looking and found it. So maybe there are a lot of these organizations and institutions out there, and we just need to look harder.

Lucas Perry: So on this point of there not being OpenAI or DeepMind equivalents: are there any research organizations or departments explicitly focused on the mission of safely creating artificial general intelligence or superintelligence –– scalable machine learning systems that could go from now until infinity? Or is this just scattered researchers?

Jeffrey Ding: I think it depends on how you define an AGI project; what you just said is probably a good, tight definition. I know Seth Baum has done some research tracking AGI projects, and he says there are six in China. I would say the only ones that come close are Tencent, which says developing artificial general intelligence is one of its mission streams, and Horizon Robotics, which is actually a chip company but also states AGI as one of its objectives. It also depends on how much you think work on neuroscience-related pathways to AGI counts. There are probably some Chinese Academy of Sciences labs working on whole brain emulation, or more brain-inspired approaches to AGI, but definitely nothing anywhere near the level of DeepMind or OpenAI.

Lucas Perry: All right. So there are some myths in table one of your paper which you demystify. Three of these are: China’s approach to AI is defined by its top-down and monolithic nature; China is winning the AI arms race; and there is little to no discussion of issues of AI ethics and safety in China. And maybe lastly, if I might add one: that there even is an AI arms race between the US and China to begin with.

Jeffrey Ding: Yeah, I think that’s a good addition. We have to be careful about which historical analogies and memes we choose. “Arms race” is a very specific callback to the Cold War context, where there were almost discrete types of missiles we were racing the Soviet Union on, discrete applications we could count up –– or, going way back, what some scholars call the first industrial arms race in the military sphere, over steam-powered warships between Britain and France in the late 19th century. In all of those instances you can count things up: France has four ironclads, the UK has four ironclads, and they’re racing to see who can build more. I don’t think there’s anything like that here. There’s no discrete thing we’re racing to have more of. If anything, it’s a competition to see who can absorb AI advances from abroad better, who can diffuse them throughout the economy, who can adopt them in a more sustainable way without sacrificing core values.

So that’s one meme I really want to dispel. Related to that, an assumption that often influences a lot of our discourse is the techno-nationalist assumption: the idea that technology is contained within national boundaries and that the nation-state is the most important actor –– which is correct, and a good assumption to have, in a lot of instances. But there are also good reasons to adopt techno-globalist assumptions as well, especially given how fast technologies diffuse nowadays and how much, underneath this national-level competition, firms from different countries are working together and making standards alliances with each other. So there’s this undercurrent of techno-globalism, where there are people flows, idea flows, and company flows happening, while the coverage and the sexy topic is always going to be national-level competition, zero-sum competition, relative-gains rhetoric. So we’re trying to find a balance between those two streams.

Lucas Perry: What can you say about this reflection on zero-sum games versus healthy competition, and the properties of AI and AI research? I’m seeking clarification on this secondary framing, where we take a more international perspective on the deployment and implementation of AI research and systems, rather than, as you said, the techno-nationalist one.

Jeffrey Ding: Actually, this idea comes from my supervisor: relative gains make sense if there are only two players involved, just from a pure self-interest-maximizing standpoint. But once you introduce three or more players, relative gains don’t make as much sense as optimizing for absolute gains. So maybe one way to explain this is to take the perspective of a European country –– let’s say Germany –– working on an AI project with China, or with some other country the US is pressuring you not to work with, say Saudi Arabia. The project is going to benefit China 10 arbitrary points and Germany eight arbitrary points, versus nothing if you didn’t cooperate at all.

So in that sense, Germany, the rational actor, would take that deal. You’re not just caring about being better than China; from a German perspective, you care about maintaining leadership in the European Union, providing health benefits to your citizens, continuing to power your economy. So in that sense you would take the deal even though China benefits a little bit more, relatively speaking.

I think currently a lot of people in the US are locked into this mindset that the only two players that exist in the world are the US and China. And if you look at our conversation, right, oftentimes I’ve displayed that bias as well. We should probably have talked a lot more about China-EU or China-Japan cooperation and networks in this space, because there’s a lot happening there too. So a lot of US policy makers see this as a two-player game between the US and China. And then in that sense, if there’s some cancer research project about discovering proteins using AI that may benefit China by 10 points and benefit the US only by eight points, but it’s going to save a lot of people from cancer –– if you make everything about maintaining a lead over China, then you might not take that deal. But if you think about it from the broader landscape, where it’s not just a zero sum competition between the US and China, then your evaluation of those different point structures and what you think is rational will change.

Lucas Perry: So as there’s more actors, is the idea here that you care more about absolute gains in the sense that these utility points or whatever can be translated into decisive strategic advantages like military advantages?

Jeffrey Ding: Yeah, I think that’s part of it. What I was thinking along that example is basically: if you as Germany don’t choose to cooperate with Saudi Arabia or work on this joint research project with China, then the UK or some other country is just going to swoop in. And that possibility doesn’t exist in the world where you’re just thinking about two players. There’s a lot of different ways to fit these sorts of formal models, but that’s probably the most simplistic way of explaining it.

Lucas Perry: Okay, cool. So you’ve spoken a bit here on important myths that we need to dispel or memes that we need to combat. And recently Peter Thiel has been on a bunch of conservative platforms, and he also wrote an op-ed, basically fanning the flames of AGI as a military weapon, AI as a path to superintelligence and, “Google campuses have lots of Chinese people on them who may be spies,” and that Google is actively helping China with AI military technology. In terms of bad memes and myths to combat, what are your thoughts here?

Jeffrey Ding: There’s just a lot of things that Thiel gets wrong. I’m mostly kind of just confused, because he is one of the original founders of OpenAI, he’s funded other institutions really concerned about AGI safety, really concerned about race dynamics –– and then in this piece, he first says AI is a military technology, then he goes back to saying AI is dual use in the middle, and then he says this ambiguity is “strangely missing from the narrative that pits a monolithic AI against all of humanity.” He of all people should know what these conversations about the risks of AGI are actually about, so why attack this straw man in the form of a terminator AI meme? Especially when you’re funding a lot of the organizations that are worried about the risks of AGI for all of humanity.

The other main thing that’s really problematic is, if you’re concerned about the US military advantage, that advantage more than ever is rooted in our innovation advantage. It’s not about spinoff from military innovation to civilian innovation, which was the case in the days of US tech competition against Japan. It’s more the case of spin-on, where innovations are happening in the commercial sector that are undergirding the US military advantage.

And this idea of painting Google as anti-American for setting up labs in China is so counterproductive. There are independent Google developer conferences all across China just because so many Chinese programmers want to use Google tools like TensorFlow. It goes back to the fundamental AI open source software I was talking about earlier that lets Google expand its talent pool: people want to work on Google products; they’re more used to the framework of Google tools to build all these products. Google’s not doing this out of charity to help the Chinese military. They’re doing this because the US has a flawed high-skilled immigration system, so they need to go to other countries to get talent.

Also, the other thing about the piece is he cites no empirical research on any of these fronts, when there’s this whole globalization-of-innovation literature that backs up empirically a lot of what I’m saying. And then I’ve done my own empirical research on Microsoft Research Asia, which, as we’ve mentioned, is Microsoft’s second biggest lab overall and is based in Beijing. I’ve tracked their PhD Fellowship Program: it basically gives students in Chinese PhD programs a full scholarship, and they do an internship at Microsoft Research Asia for one of the summers. And then we track their career trajectories, and a lot of them end up coming to the US or working for Microsoft Research Asia in Beijing. And the ones that come to the US don’t just go to Microsoft: they go to Snapchat or Facebook or other companies. And it’s not just about the people: as I mentioned earlier, we have this innovation centrism about who produces the technology first, but oftentimes it’s about who diffuses and adopts the technology first. And we’re not always going to be the first on the scene, so we have to be able to adopt and diffuse technologies that are invented first in other areas. And these overseas labs are some of our best portals into understanding what’s happening in these other areas. If we lose them, it’s another form of asymmetry, because Chinese AI companies are going abroad and expanding.

Honestly, I’m just really confused about what the point of this piece was, and to be honest, it’s kind of sad, because this is not what Thiel researches every day. So he’s obviously picking up bits and pieces from the narrative frames that are dominating our conversation. And it’s actually probably a structural stain on how we’ve allowed the discourse to have so many of these bad, problematic memes, and we need more people calling them out actively, having the heart-to-heart conversations behind the scenes to get people to change their minds or have productive, constructive conversations about these.

And the last thing I’ll point out here is there’s this zombie Cold War mentality that still lingers today, and I think the historian Walter McDougall was really great in calling this out, where he talks about how we paint this other, this enemy, and we use it to justify sacrifices in human values to drive society to its fullest technological potential. And that often comes with sacrificing human values like privacy, equality, freedom of speech. And I don’t want us to compete with China over who can build better tools to censor, repress, and surveil dissidents and minority groups, right? Let’s see who can build the better, I don’t know, industrial internet of things, or build better privacy-preserving algorithms that are going to sustain a more trustworthy AI ecosystem.

Lucas Perry: Awesome. So just moving along here as we’re making it to the end of our conversation: what are updates you’ve had or major changes since you wrote Deciphering China’s AI Dream, now that it’s been a year?

Jeffrey Ding: Yeah, I mentioned some of the updates in the capability section. The consequences, I mean I think those are still the four main big issues, all of them tied to four different literature bases. The biggest change would probably be in the component section. I think when I started out, I was pretty new in this field, I was reading a lot of literature from the China watching community and also a lot from Chinese comparative politics or articles about China, and so I focused a lot on government policies. And while I think the party and the government are definitely major players, I think I probably overemphasized the importance of government policies versus what is happening at the local level.

So if I were to go back and rewrite it, I would’ve looked a lot more at what is happening at the local level, and given more examples of AI firms. iFlytek, I think, is a very interesting under-covered firm, and they are setting up research institutes with a university in Chung Cheng very similar to the industry-academia style collaborations in the US, basically ensuring that they’re able to train the next generation of talent. They have relatively close ties to the state as well, I think with controlling shares, or a large percentage of shares, owned by state-owned vehicles. So I probably would have gone back and looked at some of these more under-covered firms and localities and looked at what they were doing, rather than just looking at the rhetoric coming from the central government.

Lucas Perry: Okay. What does it mean for there to be healthy competition between the United States and China? What is an ideal AI research and political situation? What are the ideal properties of the relations the US and China can have on the path to superintelligence?

Jeffrey Ding: Yeah.

Lucas Perry: Solve AI Governance for me, Jeff!

Jeffrey Ding: If I could answer that question, I think I could probably retire or something. I don’t know.

Lucas Perry: Well, we’d still have to figure out how to implement the ideal governance solutions.

Jeffrey Ding: Yeah. I think one starting point is, on the way to more advanced AI systems, we have to stop looking at AI as if it’s this completely special area with no analogs, because even though there are unique aspects of AI –– like the fact that they’re autonomous intelligent systems, the possibility of the product surpassing human-level intelligence, or the process surpassing human-level intelligence –– we can learn a lot from past general purpose technologies like steam, electricity, the diesel engine. And we can learn a lot about competition in past strategic industries like chips and steel.

So I think probably one thing that we can distill from some of this literature is that there are some aspects of AI development that are going to be more likely to lead to race dynamics than others. So one cut that you could take is industries where it’s likely that there are only going to be two or three, or four or five, major players –– so it might be the case that the capital costs, the start-up costs, the infrastructure costs of autonomous vehicles mean that there are going to be only one or two players across the world. And that is like, hey, if you’re a national government that’s thinking strategically, you might really want to have a player in that space, so that might incentivize more competition. Whereas in other fields, maybe there’s just going to be a lot more competition or less need for relative-gains, zero sum thinking. So neural machine translation, that could be a case of something that just almost becomes a commodity.

So then there are things we can think about in those fields where there’s only going to be four or five players or three or four players. Can we maybe balance it out so that at least one is from the two major powers or is the better approach to, I don’t know, enact global competition, global antitrust policy to kind of ensure that there’s always going to be a bunch of different players from a bunch of different countries? So those are some of the things that come to mind that I’m thinking about, but yeah, this is definitely something where I claim zero credibility relative to others who are thinking about it.

Lucas Perry: Right. Well, it’s unclear anyone has very good answers here. I think my perspective, to add at least one frame on it, is that given the dual-use nature of many of these technologies –– like computer vision, embedded robotic systems, autonomy, and image classification –– all of these different AI specialty subsystems can be put together in arbitrary ways. So in terms of autonomous weapons, FLI’s position is that it’s important to establish international standards around the appropriate and beneficial uses of these technologies.

Image classification, as people already know, can be used for discrimination or beneficial things. And the technologies can be aggregated to make anything from literal terminator swarm robots to lifesaving medical treatments. So the relation between the United States and China can be made more productive if clear standards based on the expression of the principles we enumerated earlier could be created. And given that, then we might be taking some paths towards a beneficial beautiful future of advanced AI systems.

Jeffrey Ding: Yeah, no, I like that a lot. And it connects to some of the technical standards documents I’ve been translating: I definitely think in the short term, technical standards are a good way forward, sort of solving the starter-pack type of problems before AGI. Even some Chinese white papers on AI standardization have put out the idea of ranking the intelligence level of different autonomous systems –– like an autonomous car might be ranked higher than a smart speaker or something. Even that is a nice way to keep track of the progress and continuities in terms of intelligence trajectories in the space. So yeah, I definitely second that idea: standardization efforts and autonomous weapons regulation efforts serving as the building blocks for larger AGI safety issues.

Lucas Perry: I would definitely like to echo this starter pack point of view. There’s a lot of open questions about the architectures or ways in which we’re going to get to AGI, about how the political landscape and research landscape is going to change in time. But I think that we already have enough capabilities and questions that we should really be considering where we can be practicing and implementing the regulations and standards and principles and intentions today in 2019 that are going to lead to robustly good futures for AGI and superintelligence.

Jeffrey Ding: Yeah. Cool.

Lucas Perry: So Jeff, if people want to follow you, what is the best way to do that?

Jeffrey Ding: You can hit me up on Twitter, I’m @JJDing99. Or I put out a weekly newsletter featuring translations on AI-related issues from Chinese media and Chinese scholars, and that’s the ChinAI Newsletter, C-H-I-N-A-I. If you just search that, it should pop up.

Lucas Perry: Links to those will be provided in the description of wherever you might find this podcast. Jeff, thank you so much for coming on and thank you for all of your work and research and efforts in this space, for helping to create a robust and beneficial future with AI.

Jeffrey Ding: All right, Lucas. Thanks. Thanks for the opportunity. This was fun.

Lucas Perry: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: On the Governance of AI with Jade Leung

In this podcast, Lucas spoke with Jade Leung from the Center for the Governance of AI (GovAI). GovAI strives to help humanity capture the benefits and mitigate the risks of artificial intelligence. The center focuses on the political challenges arising from transformative AI, and they seek to guide the development of such technology for the common good by researching issues in AI governance and advising decision makers. Jade is Head of Research and Partnerships at GovAI, where her research focuses on modeling the politics of strategic general purpose technologies, with the intention of understanding which dynamics seed cooperation and conflict.

Topics discussed in this episode include:

  • The landscape of AI governance
  • GovAI’s research agenda and priorities
  • Aligning government and companies with ideal governance and the common good
  • Norms and efforts in the AI alignment community in this space
  • Technical AI alignment vs. AI Governance vs. malicious use cases
  • Lethal autonomous weapons
  • Where we are in terms of our efforts and what further work is needed in this space

You can take a short (3 minute) survey to share your feedback about the podcast here.

Important timestamps: 

0:00 Introduction and updates

2:07 What is AI governance?

11:35 Specific work that Jade and the GovAI team are working on

17:21 Windfall clause

21:20 Policy advocacy and AI alignment community norms and efforts

27:22 Moving away from short-term vs long-term framing to a stakes framing

30:44 How do we come to ideal governance?

40:22 How can we contribute to ideal governance through influencing companies and government?

48:12 US and China on AI

51:18 What more can we be doing to positively impact AI governance?

56:46 What is more worrisome, malicious use cases of AI or technical AI alignment?

01:01:19 What is more important/difficult, AI governance or technical AI alignment?

01:03:49 Lethal autonomous weapons

01:09:49 Thinking through tech companies in this space and what we should do

 

Two key points from Jade: 

“I think one way in which we need to rebalance a little bit, as kind of an example of this is, I’m aware that a lot of the work, at least that I see in this space, is sort of focused on very aligned organizations and non-government organizations. So we’re looking at private labs that are working on developing AGI. And they’re more nimble. They have more familiar people in them, we think more similarly to those kinds of people. And so I think there’s an attraction. There’s really good rational reasons to engage with the folks because they’re the ones who are developing this technology and they’re plausibly the ones who are going to develop something advanced.

“But there’s also, I think, somewhat biased reasons why we engage, is because they’re not as messy, or they’re more familiar, or we see them as more value aligned. And I think this early in the field, putting all our eggs in a couple of very, very limited baskets, is plausibly not that great a strategy. That being said, I’m actually not entirely sure what I’m advocating for. I’m not sure that I want people to go and engage with all of the UN conversations on this because there’s a lot of noise and very little signal. So I think it’s a tricky one to navigate, for sure. But I’ve just been reflecting on it lately, that I think we sort of need to be a bit conscious about not group thinking ourselves into thinking we’re sort of covering all the bases that we need to cover.”

 

“I think one thing I’d like for people to be thinking about… this short term v. long term bifurcation. And I think a fair number of people are. And the framing that I’ve tried on a little bit is more thinking about it in terms of stakes. So how high are the stakes for a particular application area, or a particular sort of manifestation of a risk or a concern.

“And I think thinking about it in the stakes sense, as opposed to the timeline sense, helps me at least try to identify things that we currently call or label near term concerns, and try to filter the ones that are worth engaging in versus the ones that maybe we just don’t need to engage in at all. An example here is that basically I am trying to identify near term/existing concerns that I think could scale in stakes as AI becomes more advanced. And if those exist, then there’s really good reason to engage in them for several reasons, right?…Plausibly, another one would be privacy as well, because I think privacy is currently a very salient concern. But also, privacy is an example of one of the fundamental values that we are at risk of eroding if we continue to deploy technologies for other reasons: efficiency gains, or for increasing control and centralizing of power. And privacy is this small microcosm of a maybe larger concern about how we could possibly be chipping away at these very fundamental things which we would want to preserve in the longer run, but we’re at risk of not preserving because we continue to operate in this dynamic of innovation and performance for whatever cost. Those are examples of conversations where I find it plausible that there are existing conversations that we should be more engaged in just because those are actually going to matter for the things that we call long term concerns, or the things that I would call sort of high stakes concerns.”

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. Key works mentioned in this podcast can be found here.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry. And today, we will be speaking with Jade Leung from the Center for the Governance of AI, housed at the Future of Humanity Institute. Their work strives to help humanity capture the benefits and mitigate the risks of artificial intelligence. They focus on the political challenges arising from transformative AI, and seek to guide the development of such technology for the common good by researching issues in AI governance and advising decision makers. Jade is Head of Research and Partnerships at GovAI, and her research focuses on modeling the politics of strategic general purpose technologies, with the intention of understanding which dynamics seed cooperation and conflict.

In this episode, we discuss GovAI’s research agenda and priorities, the landscape of AI governance, how we might arrive at ideal governance, the dynamics and roles of both companies and states within this space, how we might be able to better align private companies with what we take to be ideal governance. We get into the relative importance of technical AI alignment and governance efforts on our path to AGI, we touch on lethal autonomous weapons, and also discuss where we are in terms of our efforts in this broad space, and what work we might like to see more of.

As a general bit of announcement, I found all the feedback coming in through the SurveyMonkey poll to be greatly helpful. I’ve read through all of your comments and thoughts, and am working on incorporating feedback where I can. So for the meanwhile, I’m going to leave the survey up, and you’ll be able to find a link to it in the description of wherever you might find this podcast. Your feedback really helps and is appreciated. And, as always, if you find this podcast interesting or useful, consider sharing it with others who might find it valuable as well. And so, without further ado, let’s jump into our conversation with Jade Leung.

So let’s go ahead and start by providing a little bit of framing on what AI governance is, the conceptual landscape that surrounds it. What is AI governance, and how do you view and think about this space?

Jade: I think the way that I tend to think about AI governance is with respect to how it relates to the technical field of AI safety. In both fields, the broad goal is how humanity can best navigate our transition towards a world with advanced AI systems in it. The technical AI safety agenda and the kind of research that’s being done there is primarily focused on how do we build these systems safely and well. And the way that I think about AI governance with respect to that is broadly everything else that’s not that. So that includes things like the social, political, economic context that surrounds the way in which this technology is developed and built and used and employed.

And specifically, I think with AI governance, we focus on a couple of different elements of it. One big element is the governance piece. So what are the kinds of norms and institutions we want around a world with advanced AI serving the common good of humanity? And then we also focus a lot on the kind of strategic political impacts and effects and consequences of the route on the way to a world like that. So what are the kinds of risks, social, political, economic? And what are the kinds of impacts and effects that us developing it in sort of sub-optimal ways could have on the various things that we care about?

Lucas: Right. And so just to throw out some other cornerstones here, because I think there’s many different ways of breaking up this field and thinking about it, and this sort of touches on some of the things that you mentioned. There’s the political angle, the economic angle. There’s the military. There’s the governance and the ethical dimensions.

Here on the AI Alignment Podcast, we’ve at least been breaking down the taxonomy into technical AI alignment research, which is getting machine systems to be aligned with human values and desires and goals, and then the sort of AI governance, the strategy, the law stuff, and then the ethical dimension. Do you have any preferred view or way of breaking this all down? Or is it all just as good to you?

Jade: Yeah. I mean, there are a number of different ways of breaking it down. And I think people also mean different things when they say strategy and governance and whatnot. I’m not particularly excited about getting into definitional debates. But maybe one way of thinking about what this word governance means is, at least I often think of governance as the norms, and the processes, and the institutions that are going to, and already do, shape the development and deployment of AI. I think a couple of things are worth underlining in that. The word governance isn’t just specifically government and regulations; I think that’s a specific kind of narrowing of the term, which is worth pointing out because that’s a common misconception when people use the word governance.

So when I say governance, I mean government and regulation, for sure. But I also mean: what are other actors doing that aren’t governments? So labs, researchers, developers, NGOs, journalists, et cetera, and also other mechanisms that aren’t regulation. So it could be things like reputation, financial flows, talent flows, public perception, what’s within and outside the Overton window, et cetera. So there’s a number of different levers I think you can pull if you’re thinking about governance.

It’s probably worth also pointing out, I think, when people say governance, a lot of the time people are talking about the normative side of things, so what should it look like, and how could it be if it were good? A lot of governance research, at least in this space now, is very much descriptive. So it’s kind of like what’s actually happening, and trying to understand the landscape of risk, the landscape of existing norms that we have to work with, what’s a tractable way forward with existing actors? How do you model existing actors in the first place? So a fair amount of the research is very descriptive, and I would qualify that as AI governance research, for sure.

Another way of breaking it down is according to the research agenda that we put out. That breaks it down into, firstly, understanding the technological trajectory: understanding where this technology is likely to go, what the technical inputs and constraints are, and particularly the ones that have implications for governance outcomes. This looks like things like modeling AI progress and mapping capabilities, and involves a fair amount of technical work.

And then you’ve got the politics cluster, which is probably where a fair amount of the work is at the moment. This is looking at political dynamics between powerful actors. So, for example, my work is focusing on big firms and government and how they relate to each other, but also includes how AI transforms and impacts political systems, both domestically and internationally. This includes the cluster around international security and the race dynamics that fall into that. And then also international trade, which is a thing that we don’t talk about a huge amount, but politics also includes this big dimension of economics in it.

And then the last cluster is this governance cluster, which is probably the most normative end of what we would want to be working on in this space. This is looking at things like what are the ideal institutions, infrastructure, norms, mechanisms that we can put in place now/in the future that we should be aiming towards that can steer us in robustly good directions. And this also includes understanding what shapes the way that these governance systems are developed. So, for example, what roles does the public have to play in this? What role do researchers have to play in this? And what can we learn from the way that we’ve governed previous technologies in similar domains, or with similar challenges, and how have we done on the governance front on those bits as well. So that’s another way of breaking it down, but I’ve heard more than a couple of ways of breaking this space down.

Lucas: Yeah, yeah. And all of them are sort of valid in their own ways, and so we don’t have to spend too much time on this here. Now, a lot of these things that you’ve mentioned are quite macroscopic effects in the society and the world, like norms and values and developing a concept of ideal governance and understanding actors and incentives and corporations and institutions and governments. Largely, I find myself having trouble developing strong intuitions about how to think about how to impact these things because it’s so big it’s almost like the question of, “Okay, let’s figure out how to model all of human civilization.” At least all of the things that matter a lot for the development and deployment of technology.

And then let’s also think about ideal governance, like what is also the best of all possible worlds, based off of our current values, that we would like to use our model of human civilization to bring us closer towards? So being in this field, and exploring all of these research threads, how do you view making progress here?

Jade: I can hear the confusion in your voice, and I very much resonate with it. We’re sort of consistently confused, I think, in this space. And it is both a very big set of questions and a big space to wrap one’s head around. I want to emphasize that this space is very new, and people working in this space are very few, at least with respect to AI safety, for example, which is itself still a very small field that feels as though it’s growing, which is a good thing. We are at least a couple of years behind, both in terms of size, but also in terms of sophistication of thought and sophistication of understanding what are more concrete/sort of decision relevant ways in which we can progress this research. So we’re working hard, but it’s a fair ways off.

One way in which I think about it is to think about it in terms of what actors are making decisions now/in the near to medium future that are the decisions that you want to influence. And then you sort of work backwards from that. At least for me, when I think about how we do our research at the Center for the Governance of AI, for example, when I think about what is valuable for us to research and what’s valuable to invest in, I want to be able to tell a story of how I expect this research to influence a decision, or a set of decisions, or a decision maker’s priorities or strategies or whatever.

Ways of breaking that down a little bit further would be to say, you know, who are the actors that we actually care about? One relatively crude bifurcation is focusing on those who are in charge of developing and deploying these technologies, firms, labs, researchers, et cetera, and then those who are in charge of sort of shaping the environment in which this technology is deployed, and used, and is incentivized to progress. So that’s folks who shape the legislative environment, folks who shape the market environment, folks who shape the research culture environment, and expectations and whatnot.

And with those two sets of decision makers, you can then boil it down into what are the particular decisions they are in charge of making that you can decide you want to influence, or try to influence, by providing them with research insights or doing research that will, in some downstream way, affect the way they think about how these decisions should be made. And a very, very concrete example would be to pick, say, a particular firm. And they have a set of priorities, or a set of things that they care about achieving within the lifespan of that firm. And they have a set of strategies and tactics that they intend to use to execute on that set of priorities. So you can either focus on trying to shift their priorities towards better directions if you think they’re off, or you can try to point out ways in which their strategies could be done slightly better, e.g. they should be coordinating more with other actors, or they should be thinking harder about openness in their research norms. Et cetera, et cetera.

Well, you can kind of boil it down to the actor level and the decision specific level, and get some sense of what it actually means for progress to happen, and for you to have some kind of impact with this research. One caveat with this is that I think if one takes this lens on what research is worth doing, you’ll end up missing a lot of valuable research being done. So a lot of the work that we do currently, as I said before, is very much understanding what’s going on in the first place. What are the actual inputs into the AI production function that matter and are constrained and are bottlenecked? Who currently controls them? A number of other things are mostly just descriptive; I can’t tell you which decision I’m going to influence by understanding this. But having a better baseline will inform better work across a number of different areas. I’d say that this particular lens is one way of thinking about progress. There’s a number of other things that it wouldn’t measure, that are still worth doing in this space.

Lucas: So it does seem like we gain a fair amount of tractability by just thinking, at least short term, about who the key actors are, and how we might be able to guide them in a direction which seems better. I think here it would also be helpful if you could let us know: what is the actual research that you and, say, Allan Dafoe engage in on a day to day basis? So there’s analyzing historical cases. I know that you guys have done work on specifying your research agenda. You have done surveys of American attitudes and trends in opinion on AI. Jeffrey Ding has also released a paper, Deciphering China’s AI Dream, which tries to understand China’s AI strategy. You’ve also released a report on the malicious use of artificial intelligence. So, I mean, what is it like being Jade on a day to day basis, trying to conquer this problem?

Jade: The specific work that I’ve spent most of my research time on to date sort of falls into the politics/governance cluster. And basically, the work that I do is centered on the assumption that there are things that we can learn from a history of trying to govern strategic general purpose technologies well. And if you look at AI, you might believe that it has certain properties that make it strategic, strategic here in the sense that it’s important for things like national security and economic leadership of nations and whatnot. And it’s also a general purpose technology, in that it has the potential to do what GPTs do, which is to sort of change the nature of economic production, push forward a number of different frontiers simultaneously, enable consistent cumulative progress, and transform organizational functions like transportation, communication, et cetera.

So if you think that AI looks like a strategic general purpose technology, then the claim is something like: in history we’ve seen a set of technologies that plausibly have the same traits. So the ones that I focus on are biotechnology, cryptography, and aerospace technology. And the question that sort of kicked off this research is, how have we dealt with the kind of very fraught competition that we currently see in the space of AI when we’ve competed over these technologies in the past? And the reason why there’s a focus on competition here is because I think one important thing that characterizes a lot of the reasons why we’ve got a fair number of risks in the AI space is that we are competing over it. “We” here being very powerful nations, very powerful firms. And the reason why competition is an important thing to highlight is that it exacerbates a number of risks and it causes a number of risks.

So when you’re in a competitive environment, actors are normally incentivized to take larger risks than they otherwise would rationally do. They are largely incentivized not to engage in the kind of thinking that is required to think about public goods governance and serving the common benefit of humanity. And they’re more likely to engage in thinking that is more about serving parochial, sort of private, interests.

Competition is bad for a number of reasons. Or it could be bad for a number of reasons. And so the question I’m asking is, how have we competed in the past? And what have been the outcomes of those competitions? Long story short, the research that I do is basically dissecting these cases of technology development, specifically in the US. And I analyze the kinds of conflicts, and the kinds of cooperation, that have existed between the US government and the firms that were leading technology development, and also the researcher communities that were driving these technologies forward.

Other pieces of research that are going on: we have a fair number of our researchers working on understanding what are the important inputs into AI that are actually progressing us forward. How important is compute relative to algorithmic structures, for example? How important is talent, with respect to other inputs? And then the reason why that’s important to analyze and useful to think about is understanding who controls these inputs, and how they’re likely to progress in terms of future trends. So that’s an example of the technology forecasting work.
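As a purely illustrative way of reading “inputs into the AI production function,” here is a toy Cobb-Douglas-style sketch. The choice of inputs, the weights, and the functional form are my own assumptions for illustration, not findings from this research:

```python
def ai_progress(compute, algorithms, talent, data,
                weights=(0.4, 0.3, 0.2, 0.1)):
    """Toy 'AI production function': progress as a weighted product of
    candidate inputs, each with diminishing returns. The inputs and the
    weights are hypothetical; estimating which inputs matter, and how
    much, is exactly the open forecasting question described above."""
    out = 1.0
    for level, weight in zip((compute, algorithms, talent, data), weights):
        out *= level ** weight
    return out
```

Under these made-up weights, doubling compute alone multiplies the output by 2 ** 0.4, roughly 1.32; the forecasting work described here is, in effect, asking empirically what those sensitivities are and who controls each input.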

In the politics work, we have a pretty big chunk on looking at the relationship between governments and firms. So this is a big piece of work that I’ve been doing, along with a fair number of others: understanding, for example, if the US government wanted to control AI R&D, what are the various levers that they have available that they could use to do things like seize patents, or control research publications, or exercise things like export controls, or investment constraints, or whatnot. And the reason why we focus on that is because my hypothesis is that ultimately you’re going to start to see states get much more involved. At the moment, we’re in a period that a lot of people describe as very private sector driven, with governments behind. I think, and history would also suggest, that the state is going to be involved much more significantly very soon. So understanding what they could do, and what their motivations are, is important.

And then, lastly, on the governance piece, a big chunk of our work here is specifically on public opinion. So you’ve mentioned this before. But basically, a substantial chunk of our work, consistently, is just understanding what the public thinks about various issues to do with AI. So recently, we published a report on a recent set of surveys we did of the American public. We asked them a variety of different questions and got some very interesting answers.

So we asked them questions like: What risks do you think are most important? Which institution do you trust the most to do things with respect to AI governance and development? How important do you think certain types of governance challenges are for American people? Et cetera. And the reason why this is important for the governance piece is because governance ultimately needs to have sort of public legitimacy. And so the idea was that understanding how the American public thinks about certain issues can at least help to shape some of the conversation around where we should be headed in governance work.

Lucas: So there’s also been work here, for example, on capabilities forecasting. And I think Allan and Nick Bostrom also come at these from slightly different angles sometimes. And I’d just like to explore all of these so we can get all of the sort of flavors of the different ways that researchers come at this problem. Was it Ben Garfinkel who did the offense-defense analysis?

Jade: Yeah.

Lucas: So, for example, there’s work on that. That work was specifically on trying to understand how the offense-defense balance scales as capabilities change. This could have been done with nuclear weapons, for example.

Jade: Yeah, exactly. That was an awesome piece of work by Allan and Ben Garfinkel, looking at this concept of the offense-defense balance, which exists for weapon systems broadly. And they were sort of analyzing and modeling. It’s a relatively theoretical piece of work, trying to model how the offense-defense balance changes with investments. And then there was a bit of an investigation there specifically on how we could expect AI to affect the offense-defense balance in different types of contexts. The other cluster of work, which I failed to mention as well, is a lot of our work on policy, specifically. So this is where projects like the windfall clause would fall in.
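To make the offense-defense balance idea a bit more concrete, here is a toy model in the spirit of that kind of analysis. The power-law form and the elasticity numbers are invented for illustration; this is not Garfinkel and Dafoe’s actual model:

```python
def od_balance(offense_spend, defense_spend,
               offense_elasticity=0.5, defense_elasticity=0.5):
    """Toy offense-defense balance: each side's capability grows as a
    power of its investment (diminishing returns), and the balance is
    the ratio of offensive to defensive capability. Values above 1.0
    favor the attacker. All parameters here are hypothetical."""
    offense_capability = offense_spend ** offense_elasticity
    defense_capability = defense_spend ** defense_elasticity
    return offense_capability / defense_capability
```

The question this style of analysis asks is how a new technology shifts those returns: if AI raised the offense elasticity relative to the defense elasticity, the same investments would tilt the balance toward attackers, and vice versa.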

Lucas: Could you explain what the windfall clause is, in a sentence or two?

Jade: The windfall clause is an example of a policy lever, which we think could be a good idea to talk about in public and potentially think about implementing. And the windfall clause is an ex-ante voluntary commitment by AI developers to distribute profits from the development of advanced AI for the common benefit of humanity. What I mean by ex-ante is that they commit to it now. So an AI developer, say a given AI firm, will commit to, or sign, the windfall clause prior to knowing whether they will get to anything like advanced AI. And what they commit to is saying that if I hit a certain threshold of profits, so what we call windfall profit, and the threshold is very, very, very high. So the idea is that this should only really kick in if a firm really hits the jackpot and develops something that is so advanced, or so transformative in the economic sense, that they get a huge amount of profit from it at some sort of very unprecedented scale.

So if they hit that threshold of profit, this clause will kick in, and that will commit them to distributing their profits according to some kind of pre-committed distribution mechanism. And the idea with the distribution mechanism is that it will redistribute these profits along the lines of ensuring that sort of everyone in the world can benefit from this kind of bounty. There’s a lot of different ways in which you could do the distribution. And we’re about to put out the report which outlines some of our thinking on it. And there are many more ways in which it could be done besides what we talk about.
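The trigger-and-distribute structure described here can be sketched in a few lines. The threshold (one percent of gross world product) and the fifty percent rate below are hypothetical placeholders, not terms from the actual report:

```python
def windfall_obligation(annual_profit, gross_world_product,
                        windfall_fraction=0.01, clause_rate=0.5):
    """Illustrative windfall clause: if a firm's annual profit exceeds
    some fraction of gross world product (the 'windfall threshold'), a
    pre-committed share of the excess is owed for broad distribution.
    Below the threshold, the clause stays dormant and nothing is owed.
    Both parameters are made up for this sketch."""
    threshold = windfall_fraction * gross_world_product
    if annual_profit <= threshold:
        return 0.0  # clause not triggered
    return clause_rate * (annual_profit - threshold)
```

The ex-ante character is the important part: a firm signs while the expected cost of the commitment is near zero, and the obligation only binds in the jackpot scenario.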

But effectively, what you want in a distribution mechanism is you want it to be able to do things like rectify inequalities that could have been caused in the process of developing advanced AI. You want it to be able to provide a financial buffer to those who’ve been left unemployed by the development of advanced AI. And then you also want it to do somewhat positive things too. So it could be, for example, that you distribute it according to meeting the sustainable development goals. Or it could be redistributed according to a scheme that looks something like a UBI, and that transitions us into a different type of economic structure. So there are various ways in which you could play around with it.

Effectively, the windfall clause is starting a conversation about how we should be thinking about the responsibilities that AI developers have to ensure that if they do luck out, or if they do develop something that is as advanced as some of what we speculate we could get to, there is a responsibility there. And there also should be a committed mechanism there to ensure that that is balanced out in a way that reflects the way that we want this value to be distributed across the world.

And that’s an example of the policy lever that is sort of uniquely concrete, in that we don’t actually do a lot of concrete research. We don’t do much policy advocacy work at all. But to the extent that we want to do some policy advocacy work, it’s mostly with the motivation that we want to be starting important conversations about robustly good policies that we could be advocating for now, that can help steer us in better directions.

Lucas: And fitting this into the research threads that we’re talking about here, this goes back to, I believe, Nick Bostrom’s Superintelligence. And so it’s sort of predicated on more foundational principles, some of which predate the Asilomar Conference, but also on the Asilomar principles themselves, developed in 2017: that the benefits of AI should be spread widely, and there should be abundance. And so then there become these sort of specific policy implementations or mechanisms by which we are going to realize these principles, which form the foundation of our ideal governance.

So Nick has sort of done a lot of this work on forecasting. The forecasting in Superintelligence was less about concrete timelines, and more about the logical conclusions of the kinds of capabilities that AI will have, fitting that into our timeline of AI governance thinking, with ideal governance at the end of that. And then behind us we have history, from which we can, as you’re doing yourself, try to glean more information about how what you call general purpose technologies affect incentives and institutions and policy and law, and how governments react to these new powerful things. Before we brought up the windfall clause, you were discussing policy at FHI.

Jade: Yeah, and one of the reasons why it’s hard is because, if we put on the frame that we mostly make progress by influencing decisions, we want to be pretty certain about what kinds of directions we want these decisions to go, and what we would want these decisions to be, before we engage in any sort of substantial policy advocacy work to try to make that actually a thing in the real world. I am very, very hesitant about our ability to do that well, at least at the moment. I think we need to be very humble about making concrete recommendations, because this work is hard. And I also think there is this dynamic, at least in setting norms, and particularly legislation or regulation, but also just in setting up institutions, that it’s pretty slow work, but it’s very path dependent work. So if you establish things, they’ll be sort of here to stay. And we see a lot of legacy institutions and legacy norms that are maybe a bit outdated with respect to how the world has progressed in general. But we still struggle with them because it’s very hard to get rid of them. And so the kind of emphasis on humility, I think, is a big one. And it’s a big reason why policy advocacy work is quite thin on the ground, at least at the moment, because we’re not confident enough in our views on things.

Lucas: Yeah, but there’s also this tension here. The technology’s coming anyway. And so we’re sort of on this timeline to get the right policy stuff figured out. And here, when I look at, let’s just take the Democrats and the Republicans in the United States, and how they interact: generally, in terms of specific policy implementation and recommendation, it just seems like different people have various dispositions and foundational principles which are at odds with one another, and policy recommendations are often not substantially tested, or the result of empirical scientific investigation. They’re sort of a culmination and aggregate of one’s very broad, squishy intuitions and modeling of the world. Which is sort of why, at least at the policy level, seemingly in the United States government, it seems like a lot of the conversation is just endless arguing that gets nowhere. How do we avoid that here?

Jade: I mean, this is not just specifically an AI governance problem. I think we just struggle with this in general as we try to do governance and politics work in a good way. It’s a frustrating dynamic. But I think one thing that you said definitely resonates and that, a bit contra to what I just said. Whether we like it or not, governance is going to happen, particularly if you take the view that basically anything that shapes the way this is going to go, you could call governance. Something is going to fill the gap because that’s what humans do. You either have the absence of good governance, or you have somewhat better governance if you try to engage a little bit. There’s definitely that tension.

One thing that I’ve recently been reflecting on, in terms of things that we under-prioritize in this community: being very conscientious about epistemic humility, being very cautious about things, and trying to be better calibrated are all very strong traits of people who work in this space at the moment, but they’re a bit of a double-edged sword. I think almost because of those traits, we undervalue, or we don’t invest enough time or resource in, just trying to engage in existing policy discussions and existing governance institutions. And I think there’s also an aversion to engaging in things that feel frustrating and slow, and that’s plausibly a mistake, at least in terms of how much attention we pay to it, because in the absence of our engagement, the thing’s still going to happen anyway.

Lucas: I must admit that, as someone interested in philosophy, I’ve resisted for a long time now the idea of governance in AI, at least casually, in favor of nice, calm, cool, rational conversations at tables that you might have with friends about values, and ideal governance, and what kinds of futures you’d like. But as you’re saying, and as Allan says, that’s not the way that the world works. So here we are.

Jade: So here we are. And I think one way in which we need to rebalance a little bit, as kind of an example of this, is: I’m aware that a lot of the work, at least that I see in this space, is sort of focused on very aligned organizations and non-government organizations. So we’re looking at private labs that are working on developing AGI. And they’re more nimble. They have more familiar people in them; we think more similarly to those kinds of people. And so I think there’s an attraction. There are really good rational reasons to engage with those folks, because they’re the ones who are developing this technology, and they’re plausibly the ones who are going to develop something advanced.

But there are also, I think, somewhat biased reasons why we engage: they’re not as messy, or they’re more familiar, or we feel more value aligned. And I think this early in the field, putting all our eggs in a couple of very, very limited baskets is plausibly not that great a strategy. That being said, I’m actually not entirely sure what I’m advocating for. I’m not sure that I want people to go and engage with all of the UN conversations on this, because there’s a lot of noise and very little signal. So I think it’s a tricky one to navigate, for sure. But I’ve just been reflecting on it lately, that I think we sort of need to be a bit conscious about not groupthinking ourselves into believing we’re covering all the bases that we need to cover.

Lucas: Yeah. My view on this, and this may be wrong, is just looking at the EA community, and the alignment community, and all that they’ve done to try to help with AI alignment. It seems like a lot of talent is feeding into tech companies, and there are minimal efforts right now to engage in actual policy and decision making at the government level, even for short term issues like disemployment and privacy and other things. AI alignment is happening now, it seems.

Jade: On the noise to signal point, one thing I’d like people to be thinking about: I’m pretty annoyed at this short term versus long term bifurcation, and I think a fair number of people are. The framing that I’ve tried on a little bit is more thinking about it in terms of stakes. So how high are the stakes for a particular application area, or a particular sort of manifestation of a risk or a concern?

And I think thinking about it in the stakes sense, as opposed to the timeline sense, helps me at least try to identify things that we currently call or label near term concerns, and try to filter the ones that are worth engaging in from the ones that maybe we just don’t need to engage in at all. An example here is that basically I am trying to identify near term/existing concerns that I think could scale in stakes as AI becomes more advanced. And if those exist, then there’s really good reason to engage in them, for several reasons, right? One is this path dependency that I talked about before: norms that you’re developing around, for example, privacy or surveillance. Those norms are going to stick, and the ways in which we decide we want to govern that, even with narrow technologies now, those are the ones we’re going to inherit, grandfather in, as we start to advance this technology space. And then I think you can also just get a fair amount of information about how we should be governing the more advanced versions of these risks or concerns if you engage earlier.

I think there are actually, even just off the top of my head, a couple which seem to have scalable stakes. So, for example, a very existing conversation in the policy space is about this labor displacement problem and automation. And that’s the thing people are freaking out about now, to the extent that you have litigation and bills and whatnot being passed, or at least being talked about. And you’ve got a number of people running on political platforms on the basis of that kind of issue. And that is both an existing concern, given automation to date. But it’s also plausibly a huge concern as this stuff is more advanced, to the point of economic singularity, if you wanted to use that term, where you’ve got vast changes in the structure of the labor market and the employment market, and you can have substantial transformative impacts on the ways in which humans engage in and create economic value and production.

And so existing automation concerns can scale into large scale labor displacement concerns, can scale into pretty confusing philosophical questions about what it means to conduct oneself as a human in a world where you’re no longer needed in terms of employment. And so that’s an example of a conversation which I wish more people were engaged in right now.

Plausibly, another one would be privacy as well, because I think privacy is currently a very salient concern. But also, privacy is an example of one of the fundamental values that we are at risk of eroding if we continue to deploy technologies for other reasons: efficiency gains, or increasing control and centralization of power. And privacy is this small microcosm of a maybe larger concern about how we could possibly be chipping away at these very fundamental things which we would want to preserve in the longer run, but are at risk of not preserving because we continue to operate in this dynamic of innovation and performance at whatever cost. Those are examples of conversations where I find it plausible that there are existing conversations we should be more engaged in, just because those are actually going to matter for the things that we call long term concerns, or the things that I would call sort of high stakes concerns.

Lucas: That makes sense. I think that trying on the stakes framing is helpful, and I can see why. It’s just a question about what are the things today, and within the next few years, that are likely to have a large effect on the larger end that we arrive at with transformative AI. So we’ve got this space of all these four cornerstones that you guys are exploring. Again, this has to do with the interplay and interdependency of technical AI safety, politics and policy, ideal governance, economics, the military balance and struggle, and race dynamics, all here with AI, on our path to AGI. So starting here with ideal governance, and we can see how we can move through these cornerstones: what is the process by which ideal governance is arrived at? How might this evolve over time as we get closer to superintelligence?

Jade: I have maybe a couple of thoughts, mostly about what I think a desirable process is that we should follow, or what kinds of desirable traits we want to have in the way that we get to ideal governance, and what ideal governance could plausibly look like. I think that’s the extent to which I maybe have thoughts about it. And they’re quite obvious ones, I think. The governance literature has said a lot about what constitutes morally sound, politically sound, socially sound governance processes, or design of governance processes.

So those are things like legitimacy and accountability and transparency. I think there are some interesting debates about how important certain goals are, either as end goals or as instrumental goals. So for example, I’m not clear where my thinking is on how important inclusion and diversity are as we’re aiming for ideal governance, so I think that’s an open question, at least in my mind.

There are also things to think through around what’s unique to trying to aim for ideal governance for a transformative general purpose technology. We don’t have a very good track record of governing general purpose technologies at all. I think we have general purpose technologies that have integrated into society and have served a lot of value. But that’s not because we had governance of them. I think we’ve been some combination of lucky and somewhat thoughtful sometimes, but not consistently so. If we’re staking the claim that AI could be a uniquely transformative technology, then we need to ensure that we’re thinking hard about the specific challenges that it poses. It’s a very fast-moving emerging technology, and government historically has always been relatively slow at catching up. But you also have certain capabilities that you can realize by developing, for example, AGI or superintelligence, which governance frameworks or institutions have never had to deal with before. So thinking hard about what’s unique about this particular governance challenge, I think, is important.

Lucas: Seems like often, ideal governance is arrived at through the massive suffering of previous political systems. The form of ideal governance that the founding fathers of the United States came up with was sort of an expression of the suffering they experienced at the hands of the British. And so I guess if you track historically how we’ve shifted from feudalism and monarchy to democracy and capitalism and all these other things, it seems like governance is a large, slowly reactive process born of revolution. Whereas, here, what we’re actually trying to do is have foresight and wisdom about what the world should look like, rather than trying to learn from some mistake or some un-ideal governance we generate through AI.

Jade: Yeah, and I think that’s also another big piece of it. Another way of thinking about how to get to ideal governance is to aim for a period of time, or a state of the world, in which we can actually do the thinking well without a number of other distractions or concerns in the way. So for example, conditions that we want to drive towards would mean getting rid of things like the current competitive environment that we have, which, for many reasons, some of which I mentioned earlier, is a bad thing, and is particularly counterproductive to giving us the kind of space and cooperative spirit and whatnot that we need to come to ideal governance. Because if you’re caught in this strategic competitive environment, then that makes a bunch of things just much harder to do in terms of aiming for coordination and cooperation and whatnot.

You also probably want better, more accurate information out there, hence being able to think harder by looking at better information. And so a lot of work can be done to encourage more accurate information to hold more weight in public discussions, and to encourage an environment of genuine, epistemically healthy deliberation about that kind of information. All of what I’m saying is also not particularly unique to ideal governance for AI. I think in general, you can sometimes broaden this discussion to what it looks like to govern a global world relatively well, and AI is one of the particular challenges that is maybe forcing us to have some of these conversations. But in some ways, when you end up talking about governance, it ends up being relatively abstract in a way that isn’t really specific to the technology. There are also particular challenges, I think, if you’re thinking specifically about superintelligence scenarios. But if you’re just talking about governance challenges in general, things like accurate information, more patience, a lack of competition and rivalrous dynamics and whatnot, that generally is just helpful.

Lucas: So, I mean, arriving at ideal governance here, I’m just trying to model and think about it, and understand if there’s anything here that should be practiced differently, or if I’m just sort of slightly confused here. Generally, when I think about ideal governance, I see that it’s born of very basic values and principles. And I view these values and principles as coming from nature: genetics and evolution instantiating in people certain biases and principles that tend to lead to cooperation; the conditioning of a culture; how we’re nurtured in our homes; and how our environment conditions us. And also, people update their values and principles as they live in the world and communicate with other people and engage in public discourse, and in even more foundational meta-ethical reasoning, or normative reasoning about what is valuable.

And historically, these sorts of conversations haven’t mattered, or they don’t seem to matter, or they seem to just be things that people assume; they don’t get that abstract or meta about their values, their views of value, and their views of ethics. It’s been said that, in some sense, on our path to superintelligence, we’re doing philosophy on a deadline, and that there are sort of deep and difficult questions about the nature of value, and how best to express value, and how to idealize ourselves as individuals and as a civilization.

So I guess I’m just throwing this all out there. Maybe we don’t necessarily have any concrete answers. But I’m just trying to think more about the kinds of practices and reasoning that should and can be expected to inform ideal governance. Should meta-ethics matter here, where it doesn’t seem to matter in public discourse? I still struggle with the tension between the ultimate value expression that might be happening through superintelligence and how our public discourse functions. I don’t know if you have any thoughts here.

Jade: No particular thoughts, aside from generally agreeing that I think meta-ethics is important. It is also confusing to me why public discourse doesn’t seem to track the things that seem important. This is probably something that we’ve struggled with and tried to address in various ways before, so I guess I’m always cognizant of trying to learn from ways in which we’ve tried to improve public discourse and tried to create spaces for this kind of conversation.

It’s a tricky one for sure, and thinking about better practices is probably the main way at least in which I engage with thinking about ideal governance. It’s often the case that people, when they look at the cluster of ideal governance work, think, “Oh, this is the thing that’s going to tell us what the answer is,” like what’s the constitution that we have to put in place, or whatever it is.

At least for me, the main chunk of thinking is mostly centered around process, and it’s mostly centered around what constitutes a productive optimal process, and some ways of answering this pretty hard question. And how do you create the conditions in which you can engage with that process without being distracted or concerned about things like competition? Those are kind of the main ways in which it seems obvious that we can fix the current environment so that we’re better placed to answer what is a very hard question.

Lucas: Coming to mind here is also, is this feature that you pointed out, I believe, that ideal governance is not figuring everything out in terms of our values, but rather creating the kind of civilization and space in which we can take the time to figure out ideal governance. So maybe ideal governance is not solving ideal governance, but creating a space to solve ideal governance.

Usually, ideal governance has to do with modeling human psychology, and how best to get human beings to produce value and live together harmoniously. But when we introduce AI, and human beings become potentially obsolete, then ideal governance potentially becomes something else. And I wonder if the role of, say, experimental cities with different laws, policies, and governing institutions might be helpful here.

Jade: Yeah, that’s an interesting thought. Another thought that came to mind as well, actually, is just kind of reflecting on how ill-equipped I feel thinking about this question. One funny trait of this field is that you have a slim number of philosophers, but especially in the AI strategy and safety space, it’s political scientists, international relations people, economists, and engineers, and computer scientists thinking about questions that other spaces have tried to answer in different ways before.

So when you mention psychology, that’s an example. Obviously, philosophy has something to say about this. But there’s also a whole space of people who have thought about how we govern things well across a number of different domains, and how we do a bunch of coordination and cooperation better, and stuff like that. And so it makes me reflect on the fact that there could be things that we already have learned that we should be reflecting a little bit more on, which we currently just don’t have access to because we don’t necessarily have the right people or the right domains of knowledge in this space.

Lucas: Like AI alignment has been attracting a certain crowd of researchers, and so we miss out on some of the insights that, say, psychologists might have about ideal governance.

Jade: Exactly, yeah.

Lucas: So moving along here, from ideal governance, assuming we can agree on what ideal governance is, or if we can come to a place where civilization is stable and out of existential risk territory, and where we can sit down and actually talk about ideal governance, how do we begin to think about how to contribute to AI governance through working with or in private companies and/or government?

Jade: This is a good, and quite large, question. I think there are a couple of main ways in which I think about productive actions that either companies or governments can take, or productive things we can do with both of these actors to make them more inclined to do good things. On the point of companies, the primary thing I think is important to work on, at least concretely in the near term, is to do something like establish the norm and expectation that as developers of this important technology that will have a plausibly large impact on the world, they have a very large responsibility proportional to their ability to impact the development of this technology. By making the responsibility something that is tied to their ability to shape this technology, I think that as a foundational premise or a foundational axiom to hold about why private companies are important, that can get us a lot of relatively concrete things that we should be thinking about doing.

The simple way of saying it is something like: if you are developing the thing, you’re responsible for thinking about how that thing is going to affect the world. And establishing that, I think, is a somewhat obvious thing. But it’s definitely not how the private sector operates at the moment, in that there is an assumed limited responsibility irrespective of how your stuff is deployed in the world. What that actually means can be relatively concrete. Just looking at what these labs, or what these firms, have the ability to influence, and trying to understand how you want to change it.

So, for example, internal company policy on things like what kind of research is done and invested in, and how you allocate resources across, for example, safety and capabilities research, what particular publishing norms you have, and considerations around risks or benefits. Those are very concrete internal company policies that can be adjusted and shifted based on one’s idea of what they’re responsible for. The broad thing, I think, is to try to steer them in this direction of embracing, acknowledging, and then living up to this greater responsibility, as an entity that is responsible for developing the thing.

Lucas: How would we concretely change the incentive structure of a company that’s interested in maximizing profit towards this increased responsibility, say, in the domains that you just enumerated?

Jade: This is definitely probably one of the hardest things about this claim being translated into practice. I mean, it’s not the first time we’ve been somewhat upset at companies for doing things that society doesn’t agree with. We don’t have a great track record of changing the way that industries or companies work. That being said, I think if you’re outside of the company, there are particular levers that one can pull that can influence the way that a company is incentivized. And then I think we’ve also got examples of us being able to use these levers well.

Companies are constrained by the environment that a government creates, and governments also have the threat of things like regulation, or the threat of being able to pass certain laws or whatnot, the mere threat of which, historically, has done a fair amount in terms of incentivizing companies to just step up their game because they don’t want regulation to kick in, which isn’t conducive to what they want to do, for example.

Users of the technology are a pretty classic one. It’s a pretty inefficient one, I think, because you’ve got to coordinate many, many different types of users, and actors, and consumers and whatnot, to have an impact on what companies are incentivized to do. But you have seen environmental practices in other types of industries that have been put in place as standards or expectations that companies should abide by because consumers across a long period of time have been able to say, “I disagree with this particular practice.” That’s an example of a trend that has succeeded.

Lucas: That would be like boycotting or divestment.

Jade: Yeah, exactly. And maybe a slightly more efficient one is focusing on things like researchers and employees. That is, if you are a researcher, if you’re an employee, you have levers over the employer that you work for. They need you, and you need them, and there’s that kind of dependency in that relationship. This is all a long way of saying that I think, yes, I agree it’s hard to change incentive structures of any industry, and maybe specifically so in this case because they’re very large. But I don’t think it’s impossible. And I think we need to think harder about how to use those well. I think the other thing that’s working in our favor in this particular case is that we have a unique set of founders or leaders of these labs or companies that have expressed pretty genuine sounding commitments to safety and to cooperativeness, and to serving the common good. It’s not a very robust strategy to rely on certain founders just being good people. But I think in this case, it’s kind of working in our favor.

Lucas: For now, yeah. There’s probably already other interest groups who are less careful, who are actually just making policy recommendations right now, and we’re broadly not in on the conversation due to the way that we think about the issue. So in terms of government, what should we be doing? Yeah, it seems like there’s just not much happening.

Jade: Yeah. So I agree there isn’t much happening, or at least relative to how much work we’re putting into trying to understand and engage with private labs. There isn’t much happening with government. So I think there needs to be more thought put into how we do that piece of engagement. I think good things that we could be trying to encourage more governments to do, for one, investing in productive relationships with the technical community, and productive relationships with the researcher community, and with companies as well. At least in the US, it’s pretty adversarial between Silicon Valley firms and DC.

And that isn’t good for a number of reasons. And one very obvious reason is that there isn’t common information or common understanding of what’s going on, what the risks are, what the capabilities are, et cetera. One of the main critiques of governments is that they’re ill-equipped, in terms of access to knowledge and access to expertise, to be able to appropriately design things like bills, or things like pieces of legislation or whatnot. And I think that’s also something that governments should take responsibility for addressing.

So those are kind of low hanging fruit. There’s a really tricky balance that I think governments will need to strike, which is the balance around avoiding over-hasty, ill-informed regulation. A lot of my work looking at history will show that the main ways in which we’ve achieved substantial regulation is as a result of big public, largely negative events to do with the technology screwing something up, or the technology causing a lot of fear, for whatever reasons. And so there’s a very sharp spike in public fear or public concern, and then the government kicks into gear. And I think that’s not a good dynamic in terms of forming nuanced, well-considered regulation and governance norms. Avoiding that outcome is important, but it’s also important that governments do engage and track how this is going, and particularly track where things like company policy and industry-wide efforts are not going to be sufficient. So when do you start translating some of the more soft law, if you will, into actual hard law?

That will be a very tricky timing question, I think, for governments to grapple with. But ultimately, it’s not sufficient to have companies governing themselves. You’ll need to be able to concretize it into government-backed efforts and initiatives and legislation and bills. My strong intuition is that it’s not quite the right time to roll out object level policies. And so the main task for governments will be just to position themselves to do that well when the time is right.

Lucas: So what’s coming to my mind here is I’m thinking about YouTube compilations of members of the United States Congress and senators asking horrible questions to Mark Zuckerberg and the CEO of, say, Google. They just don’t understand the issues. The United States is currently not really thinking that much about AI, and especially transformative AI. Whereas China, it seems, has taken a step in this direction and is doing massive governmental investments. So what can we say about this seeming difference? And the question is, what are governments to do in this space? Different governments are paying attention at different levels.

Jade: Some governments are more technologically savvy than others, for one. So I’d push back on the US not … They’re paying attention to different things. So, for example, the Department of Commerce put out a notice to the public indicating that they’re exploring putting in place export controls on a cluster of emerging technologies, including a fair number of AI relevant technologies. The point of export controls is to do something like ensure that adversaries don’t get access to critical technologies that, if they do, could undermine national security and/or the domestic industrial base. The reason export controls are concerning is because they’re a relatively outdated tool. They used to work relatively well when you were targeting specific kinds of weapons technologies, or basically things that you could touch and see, and restricting them from the market meant that a fair amount of the technology wouldn’t be able to be accessed by other folks around the world. And you’ve seen export controls become increasingly less effective the more we’ve tried to apply them to things like cryptography, which is largely software based. And so trying to use export controls, which are applied at the national border, is a very tricky thing to make effective.

So you have the US paying attention to the fact that they think that AI is a national security concern, at least in this respect, enough to indicate that they’re interested in exploring export controls. I think it’s unlikely that export controls are going to be effective at achieving the goals that the US wants to pursue. But I think export controls are also indicative of a world that we don’t want to slide into, which is a world where you have rivalrous economic blocs, where you’re sort of protecting your own base, and you’re not contributing to the kind of global commons of progressing this technology.

Maybe it goes back to what we were saying before, in that if you’re not engaged in the governance, the governance is going to happen anyway. This is an example of activity that is going to happen anyway. I think people assume now, probably rightfully so, that the US government is not going to be very effective because they are not technically literate. In general, they are sort of relatively slow moving. They’ve got a bunch of other problems that they need to think about, et cetera. But I don’t think it’s going to take very, very long for the US government to start to seriously engage. I think the thing that is worth trying to influence is what they do when they start to engage.

If I had a policy in mind that I thought was robustly good that the US government should pass, then that would be the more proactive approach. It seems possible that if we think about this hard enough, there could be robustly good things that the US government could do, that could be good to be proactive about.

Lucas: Okay, so there’s this sort of general sense that we’re pretty heavy on academic papers because we’re really trying to understand the problem, and the problem is so difficult, and we’re trying to be careful and sure about how we progress. And it seems like it’s not clear if there is much room, currently, for direct action, given our uncertainty about specific policy implementations. There are some shorter term issues. And sorry to say shorter term issues. But, by that, I mean automation, and maybe lethal autonomous weapons and privacy. These things we have a more clear sense of, at least about potential things that we can start doing. So I’m just trying to get a sense here from you: on top of these efforts to try to understand the issues more, and on top of the efforts that, for example, 80,000 Hours has contributed by working to place aligned persons in various private organizations, what else can we be doing? What would you like to see more being done on here?

Jade: I think this is on top of just more research. But that would be the first thing that comes to mind, is people thinking hard about it seems like a thing that I want a lot more of, in general. But on top of that, what you mentioned, I think, the placing people, that maybe fits into this broader category of things that seems good to do, which is investing in building our capacity to influence the future. That’s quite a general statement. But something like it takes a fair amount of time to build up influence, particularly in certain institutions, like governments, like international institutions, et cetera. And so investing in that early seems good. And doing things like trying to encourage value aligned sensible people to climb the ladders that they need to climb in order to get to positions of influence, that generally seems like a good and useful thing.

The other thing that comes to mind as well is putting out more accurate information. One specific version of things that we could do here is, there are currently a fair number of inaccurate, or not well justified, memes that are floating around, that are informing the way that people think. For example, the US and China are in a race. Or a more nuanced one is something like, inevitably, you’re going to have a safety-performance trade off. And those are not great memes, in the sense that they don’t seem to be conclusively true. But they’re also not great in that they put you in a position of concluding something like, “Oh, well, if I’m going to invest in safety, I’ve got to be an altruist, or I’m going to trade off my competitive advantage.”

And so identifying what those bad ones are, and countering those, is one thing to do. Better memes could be something like: those who are developing this technology are responsible for thinking through its consequences. Or something even as simple as: governance doesn’t mean government, and it doesn’t mean regulation. Because I think you’ve got a lot of firms who are terrified of regulation. And so they won’t engage in this governance conversation because of it. So there could be some really simple things I think we could do, just to make the public discourse both more accurate and more conducive to things being done that are good in the future.

Lucas: Yeah, here I’m also just seeing the tension here between the appropriate kinds of memes that inspire, I guess, a lot of the thinking within the AI alignment community and the x-risk community, versus what is actually useful or spreadable for the general public, adding in here ways in which accurate information can be info-hazardy. I think broadly in our community we have the common good principle, and the idea of building an awesome future for all sentient creatures, and I am curious to know how spreadable those memes are.

Jade: Yeah, the spreadability of memes is a thing that I want someone to investigate more. The things that make things not spreadable, for example, are just things that are, at a very simple level, quite complicated to explain, or are somewhat counterintuitive so you can’t pump the intuition very easily. Particularly things that require you to decide between one set of values that you care about and another, competing set of values. Any set of things that pits nationalism against cosmopolitanism, I think, is a tricky one, because you have some subset of people, the ones that you and I talk to the most, who are very cosmopolitan. But you also have a fair amount of people who care about the common good principle, in some sense, but also care about their nation in a fairly large sense as well.

So there are things that make certain memes less good or less spreadable. And one key thing will be to figure out which ones are actually good in the true sense, and good in the pragmatic to spread sense.

Lucas: Maybe there’s a sort of research program here, where psychologists and researchers can explore focus groups on the best spreadable memes, which reflect a lot of the core and most important values that we see within AI alignment, and EA, and x-risk.

Jade: Yeah, that could be an interesting project. I think also in AI safety, or in the AI alignment space, people are framing safety in quite different ways. One framing is that thinking about safety is part of what it means to be a good AI person. That’s an example of one that I’ve seen take off a little bit more lately because that’s an explicit act of trying to mainstream the thing. That’s a meme, or an example of a framing, or a meme, or whatever you want to call it. And you know there are pros and cons of that. The pro would be, plausibly, it’s just more mainstream. And I think you’ve seen evidence of that being the case, because more people are more inclined to say, “Yeah, I agree. I don’t want to build a thing that kills me if I want it to get coffee.” But you’re not going to have a lot of conversations about the magnitude of risks that you actually care about. So that’s maybe a con.

There’s maybe a bunch of stuff to do in this general space of thinking about how to better frame the kind of public facing narratives of some of these issues. Realistically, memes are going to fill the space. People are going to talk about it in certain ways. You might as well try to make it better, if it’s going to happen.

Lucas: Yeah, I really like that. That’s a very good point. So let’s talk here a little bit about technical AI alignment. So in technical AI alignment, the primary concerns are around the difficulty of specifying what humans actually care about. So this is like capturing human values and aligning with our preferences and goals, and what idealized versions of us might want. So, so much of AI governance is thus about ensuring that this AI alignment process we engage in doesn’t skip too many corners. The purpose of AI governance is to decrease risks, to increase coordination, and to do all of these other things to ensure that, say, the benefits of AI are spread widely and robustly, that we don’t get locked into any negative governance systems or value systems, and that this process of bringing AIs in alignment with the good doesn’t have researchers, or companies, or governments skipping too many corners on safety. In this context, and this interplay between governance and AI alignment, how much of a concern are malicious use cases relative to the AI alignment concerns within the context of AI governance?

Jade: That’s a hard one to answer, both because there is a fair amount of uncertainty around how you discuss the scale of the thing. But also because I think there are some interesting interactions between these two problems. For example, if you’re talking about how AI alignment interacts with this AI governance problem. You mentioned before AI alignment research is, in some ways, contingent on other things going well. I generally agree with that.

For example, it depends on AI safety taking place in research cultures and important labs. It requires institutional buy-in and coordination between institutions. It requires this mitigation of race dynamics so that you can actually allocate resources towards AI alignment research. All those things. And so in some ways, that particular problem being solved is contingent on us doing AI governance well. But then, also to the point of how big is malicious use risk relative to AI alignment, I think in some ways that’s hard to answer. But in some ideal world, you could sequence the problems that you solve. If you solve the AI alignment problem first, then AI governance research basically becomes a much narrower space, addressing how an aligned AI could still cause problems if we’re not thinking about the concentration of power or the concentration of economic gains. And so you need to think about things like the windfall clause, to distribute that, or whatever it is. And you also need to think about the transition to creating an aligned AI, and what could be messy in that transition, how you avoid public backlash so that you can actually see the fruits of having solved this AI alignment problem.

So that becomes more the kind of nature of the thing that AI governance research becomes, if you assume that you’ve solved the AI alignment problem. But if we assume that, in some world, it’s not that easy to solve, and both problems are hard, then I think there’s this interaction between the two. In some ways, it becomes harder. In some ways, they’re dependent. In some ways, it becomes easier if you solve bits of one problem.

Lucas: I generally model the risks of malicious use cases as being less than the AI alignment stuff.

Jade: I mean, I’m not sure I agree with that. But two things I could say to that. I think one intuition is something like: you have to be a pretty awful person to really want to use a very powerful system to cause terrible ends. And it seems more plausible that people will just do it by accident, or unintentionally, or inadvertently.

Lucas: Or because the incentive structures aren’t aligned, and then we race.

Jade: Yeah. And then the other way to sort of support this claim is, if you look at biotechnology and bio-weapons specifically, bio-security/bio-terrorism issues, so the malicious use equivalent. Those have been far less frequent compared to just bio-safety issues, which are the equivalent of accident risks. So people causing unintentional harm because we aren’t treating biotechnology safely, that’s caused a lot more problems, at least in terms of frequency, compared to people actually just trying to use it for terrible means.

Lucas: Right, but don’t we have to be careful here with the strategic properties and capabilities of the technology, especially in the context in which it exists? Because there’s nuclear weapons, which are sort of the larger more absolute power imbuing technology. There has been less of a need for people to take bio-weapons to that level. You know? And also there’s going to be limits, like with nuclear weapons, on the ability of a rogue actor to manufacture really effective bio-weapons without a large production facility or team of research scientists.

Jade: For sure, yeah. And there’s a number of those considerations, I think, to bear in mind. So it definitely isn’t the case that you haven’t seen malicious use in bio strictly because people haven’t wanted to do it. There’s a bunch of things like accessibility problems, and tacit knowledge that’s required, and those kinds of things.

Lucas: Then let’s go ahead and abstract away malicious use cases, and just think about technical AI alignment, and then AI/AGI governance. How do you see the relative importance of AI and AGI governance, and the process of AI alignment that we’re undertaking? Is solving AI governance potentially a bigger problem than AI alignment research, since AI alignment research will require the appropriate political context to succeed? On our path to AGI, we’ll need to mitigate a lot of the race conditions and increase coordination. And then even after we reach AGI, the AI governance problem will continue, as we sort of explored earlier that we need to be able to maintain a space in which humanity, AIs, and all earth originating sentient creatures are able to idealize harmoniously and in unity.

Jade: I don’t think it’s possible to actually assess them at this point, in terms of how much we understand this problem. I have a bias towards saying that AI governance is the harder problem because I’m embedded in it and see it a lot more. And maybe ways to support that claim are things we’ve talked about. So AI alignment going well, or happening at all, is sort of contingent on a number of other factors that AI governance is trying to solve, so the social, political, and economic context needs to be right in order for that to actually happen, and then in order for that to have an impact.

There are some interesting things that are made maybe easier by AI alignment being solved, or somewhat solved, if you are thinking about the AI governance problem. In fact, it’s just like a general cluster of AI being safer and more robust and more transparent, or whatever, makes certain AI governance challenges just easier. The really obvious example here that comes to mind is the verification problem. The inability to verify what certain systems are designed to do and will do causes a bunch of governance problems. Like, arms control agreements are very hard. Establishing trust between parties to cooperate and coordinate is very hard.

If you happen to be able to solve some of those problems in the process of trying to tackle this AI alignment problem, then that makes AI governance a little bit easier. I’m not sure which direction it cashes out, in terms of which problem is more important. I’m certain that there are interactions between the two, and I’m pretty certain that one depends on the other, to some extent. So it becomes immensely hard to govern the thing if you can’t align the thing. But it also is probably the case that by solving some of the problems in one domain, you can help make the other problem a little bit more tractable and easier.

Lucas: So now I’d like to get into lethal autonomous weapons. And we can go ahead and add whatever caveats are appropriate here. So in terms of lethal autonomous weapons, some people think that there are major stakes here. Lethal autonomous weapons are a major AI enabled technology that’s likely to come on the stage soon, as we make some moderate improvements to already existing technology, and then package it all together into the form of a lethal autonomous weapon. Some take the view that this is a crucial moment, or that there are high stakes here to get such weapons banned. The thinking here might be that by demarcating unacceptable uses of AI technology, such as for autonomously killing people, and by showing that we are capable of coordinating on this large and initial AI issue, that we might be taking the first steps in AI alignment, and the first steps in demonstrating our ability to take the technology and its consequences seriously.

And so we mentioned earlier how there’s been a lot of thinking, but not much action. This seems to be an initial place where we can take action. We don’t need to keep delaying our direct action and real world participation. So if we can’t get a ban on autonomous weapons, maybe it would seem that we have less hope for coordinating on more difficult issues. And lethal autonomous weapons may exacerbate global conflict by increasing skirmishing at borders, decreasing the cost of war, and dehumanizing killing by taking the human element out of death, et cetera.

And other people disagree with this. Other people might argue that banning lethal autonomous weapons isn’t necessary in the long game. It’s not, as we’re framing it, a high stakes thing, just because this sort of developmental step in this technology is not really crucial for coordination, or for political military stability. Or that coordination later would be born of other things, and that this would just be some other new military technology without much impact. So I’m curious here to gather what your views, or the views of FHI, or the Center for the Governance of AI, might be on autonomous weapons. Should there be a ban? Should the AI alignment community be doing more about this? And if not, why?

Jade: In terms of caveats, I’ve got a lot of them. So I think the first one is that I’ve not read up on this issue at all, followed it very loosely, but not nearly closely enough, that I feel like I have a confident well-informed opinion.

Lucas: Can I ask why?

Jade: Mostly because of bandwidth issues. It’s not because I have categorized it as something not worth engaging in. I’m actually pretty uncertain about that. The second caveat is, I definitely don’t claim to speak on behalf of anyone but myself in this case. The Center for the Governance of AI, we don’t have a particular position on this, nor does FHI.

Lucas: Would you say that this is because of bandwidth issues again for the Center for the Governance of AI? Or would it be because it’s been de-prioritized?

Jade: The main thing is bandwidth. Also, I think the main reason why it’s probably been de-prioritized, at least subconsciously, has been the framing of focusing on things that are neglected by folks around the world. It seems like there are people, at least with somewhat good intentions, tentatively engaged in the LAWS (lethal autonomous weapons) discussion. And so within that frame, I think it’s been de-prioritized because it’s not obviously neglected compared to other things that aren’t getting any focus at all.

With those things in mind, I could see a pretty decent case for investing more effort in engaging in this discussion, at least compared to what we currently have. I guess it’s hard to tell, compared to alternatives of how we could be spending those resources, given it’s such a resource-constrained space, in terms of people working in AI alignment, or just bandwidth in terms of this community in general. So briefly, I think we’ve talked about this idea that there’s a fair amount of path dependency in the way that institutions and norms are built up. And if this is one of the first spaces, with respect to AI capabilities, where we’re going to be driving towards some attempt at international norms, or establishing international institutions that could govern this space, then that’s going to be relevant in a general sense. And specifically, it’s going to be relevant for defense and security related concerns in the AI space.

And so I think you both want to engage because there’s an opportunity to seed desirable norms and practices and process and information. But you also possibly want to engage because there could be a risk that bad norms are established. And so it’s important to engage, to prevent it going down something which is not a good path in terms of this path dependency.

Another reason that is maybe worth thinking through, in terms of making a case for engaging more, is that the application of AI in the military and defense space is possibly one of the most likely to cause substantial disruption in the near-ish future, and could be an example of what I call high stakes concerns in the future. And you can talk about AI and its impact on various aspects of the military domain, where it could have substantial risks. So, for example, in cyber escalation, or destabilizing nuclear security. Those would be examples where military and AI come together, and you can have bad outcomes that we do actually really care about. And so for the same reason, engaging specifically in any discussion that touches on military and AI concerns could be important.

And then the last one that comes to mind is the one that you mentioned. This is an opportunity to basically practice doing this coordination thing. And there are various things that are worth practicing or attempting. For one, I think even just observing how these discussions pan out is going to tell you a fair amount about how important actors think about the trade-offs of using AI versus going towards safer outcomes or governance processes. And then our ability to corral interest around good values or appropriate norms, or whatnot, that’s a good test of our ability to generally coordinate when we have some of those trade-offs around, for example, military advantage versus safety. It gives you some insight into how we could be dealing with similarly shaped issues.

Lucas: All right. So let’s go ahead and bring it back here to concrete actionable real world things today, and understanding what’s actually going on outside of the abstract thinking. So I’m curious to know here more about private companies. At least, to me, they largely seem to be agents of capitalism, like we said. They have a bottom line that they’re trying to meet. And they’re not ultimately aligned with pro-social outcomes. They’re not necessarily committed to ideal governance, but perhaps forms of governance which best serve them. And as we sort of feed aligned people into tech companies, how should we be thinking about their goals, modulating their incentives? What does DeepMind really want? Or what can we realistically expect from key players? And what mechanisms, in addition to the windfall clause, can we use to sort of curb the worst aspects of profit-driven private companies?

Jade: If I knew what DeepMind actually wanted, or what Google actually thought, we’d be in a pretty different place. So a fair amount of what we’ve chatted through, I would echo again. So I think there’s both the importance of realizing that they’re not completely divorced from other people influencing them, or other actors influencing them. And so just thinking hard about which levers are in place already that actually constrain the action of companies, is a pretty good place to start, in terms of thinking about how you can have an impact on their activities.

There’s this common way of talking about big tech companies, which is that they can do whatever they want, and they run the world, and we’ve got no way of controlling them. The reality is that they are consistently constrained by a fair number of things, because they are agents of capitalism, as you described, and because they have to respond to various things within that system. So we’ve mentioned things before, like governments have levers, consumers have levers, employees have levers. And so I think focusing on what those are is a good place to start. Another thing that comes to mind is, there’s something here around taking a very optimistic view of how companies could behave. Or at least this is the way that I prefer to think about it: you both need to be excited, and motivated, and think that companies can change, and create the conditions in which they can. But one also then needs to have a kind of healthy cynicism, in some ways.

On both of these, I think the first one, I really want the public discourse to turn more towards the direction of, if we assume that companies want to have the option of demonstrating pro-social incentives, then we should do things like ensure that the market rewards them for acting in pro-social ways, instead of penalizing their attempts at doing so, instead of critiquing every action that they take. So, for example, I think we should be making bigger deals, basically, of when companies are trying to do things that at least will look like them moving in the right direction, as opposed to immediately critiquing them as ethics washing, or sort of just paying lip service to the thing. I want there to be more of an environment where, if you are a company, or you’re a head of a company, if you’re genuinely well-intentioned, you feel like your efforts will be rewarded, because that’s how incentive structures work, right?

And then on the second point, in terms of being realistic about the fact that you can’t just wish companies into being good, that’s when I think things like public institutions and civil society groups become important. So ensuring that there are consistent forms of pressure, and making sure that companies feel like their actions are rewarded if pro-social, but also that there are ways of spotting when they are speaking as if they’re pro-social, but acting differently.

So I think everyone’s kind of basically got a responsibility here, to ensure that this goes forward in some kind of productive direction. I think it’s hard. And we said before, you know, some industries have changed in the past successfully. But that’s always been hard, and long, and messy, and whatnot. But yeah, I do think it’s probably more tractable than the average person would think, in terms of influencing these companies to move in directions that are generally just a little bit more socially beneficial.

Lucas: Yeah. I mean, companies are also generally made up of fairly reasonable, well-intentioned people. I’m not at all pessimistic. There are just a lot of people who sit at desks and have their structure. So yeah, thank you so much for coming on, Jade. It’s really been a pleasure.

Jade: Likewise.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: On Consciousness, Qualia, and Meaning with Mike Johnson and Andrés Gómez Emilsson

Consciousness is a concept which is at the forefront of much scientific and philosophical thinking. At the same time, there is large disagreement over what consciousness exactly is and whether it can be fully captured by science or is best explained away by a reductionist understanding. Some believe consciousness to be the source of all value and others take it to be a kind of delusion or confusion generated by algorithms in the brain. The Qualia Research Institute takes consciousness to be something substantial and real in the world that they expect can be captured by the language and tools of science and mathematics. To understand this position, we will have to unpack the philosophical motivations which inform this view, the intuition pumps which lend themselves to these motivations, and then explore the scientific process of investigation which is born of these considerations. Whether you take consciousness to be something real or illusory, these possibilities certainly have tremendous moral and empirical implications for life’s purpose and role in the universe. Is existence without consciousness meaningful?

In this podcast, Lucas spoke with Mike Johnson and Andrés Gómez Emilsson of the Qualia Research Institute. Andrés is a consciousness researcher at QRI and is also the Co-founder and President of the Stanford Transhumanist Association. He has a Master’s in Computational Psychology from Stanford. Mike is Executive Director at QRI and is also a co-founder. Mike is interested in neuroscience, philosophy of mind, and complexity theory.

Topics discussed in this episode include:

  • Functionalism and qualia realism
  • Views that are skeptical of consciousness
  • What we mean by consciousness
  • Consciousness and causality
  • Marr’s levels of analysis
  • Core problem areas in thinking about consciousness
  • The Symmetry Theory of Valence
  • AI alignment and consciousness

You can take a short (3 minute) survey to share your feedback about the podcast here.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can learn more about consciousness research at the Qualia Research Institute, Mike‘s blog, and Andrés’ blog. You can listen to the podcast above or read the transcript below. Thanks to Ian Rusconi for production and edits as well as Scott Hirsh for feedback.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today we’ll be speaking with Andrés Gomez Emilsson and Mike Johnson from the Qualia Research Institute. In this episode, we discuss the Qualia Research Institute’s mission and core philosophy. We get into the differences between and arguments for and against functionalism and qualia realism. We discuss definitions of consciousness, how consciousness might be causal, we explore Marr’s Levels of Analysis, and we discuss the Symmetry Theory of Valence. We also get into identity and consciousness, and the world, the is-ought problem, and what this all means for AI alignment and building beautiful futures.

And then end on some fun bits, exploring the potentially large amounts of qualia hidden away in cosmological events, and whether or not our universe is something more like heaven or hell. And remember, if you find this podcast interesting or useful, remember to like, comment, subscribe, and follow us on your preferred listening platform. You can continue to help make this podcast better by participating in a very short survey linked in the description of wherever you might find this podcast. It really helps. Andrés is a consciousness researcher at QRI and is also the Co-founder and President of the Stanford Transhumanist Association. He has a Master’s in Computational Psychology from Stanford. Mike is Executive Director at QRI and is also a co-founder.

He is interested in neuroscience, philosophy of mind, and complexity theory. And so, without further ado, I give you Mike Johnson and Andrés Gomez Emilsson. So, Mike and Andrés, thank you so much for coming on. Really excited about this conversation and there’s definitely a ton for us to get into here.

Andrés: Thank you so much for having us. It’s a pleasure.

Mike: Yeah, glad to be here.

Lucas: Let’s start off just talking to provide some background about the Qualia Research Institute. If you guys could explain a little bit, your perspective of the mission and base philosophy and vision that you guys have at QRI. If you could share that, that would be great.

Andrés: Yeah, for sure. I think one important point is there’s some people that think that really what matters might have to do with performing particular types of algorithms, or achieving external goals in the world. Broadly speaking, we tend to focus on experience as the source of value, and if you assume that experience is a source of value, then really mapping out what is the set of possible experiences, what are their computational properties, and above all, how good or bad they feel seems like an ethical and theoretical priority to actually make progress on how to systematically figure out what it is that we should be doing.

Mike: I’ll just add to that, this thing called consciousness seems pretty confusing and strange. We think of it as pre-paradigmatic, much like alchemy. Our vision for what we’re doing is to systematize it and to do to consciousness research what chemistry did to alchemy.

Lucas: To sort of summarize this, you guys are attempting to be very clear about phenomenology. You want to provide a formal structure for understanding and also being able to infer phenomenological states in people. So you guys are realists about consciousness?

Mike: Yes, absolutely.

Lucas: Let’s go ahead and lay some conceptual foundations. On your website, you guys describe QRI’s full stack, so the kinds of metaphysical and philosophical assumptions that you guys are holding to while you’re on this endeavor to mathematically capture consciousness.

Mike: I would say ‘full stack’ refers to how we do philosophy of mind, we do neuroscience, and we’re just getting into neurotechnology, with the thought that if you have a better theory of consciousness, you should be able to have a better theory about the brain. And if you have a better theory about the brain, you should be able to build cooler stuff than you could otherwise. But starting with the philosophy, there’s this conception of qualia formalism: the idea that phenomenology can be precisely represented mathematically. We borrow the goal from Giulio Tononi’s IIT. We don’t necessarily agree with the specific math involved, but the goal of constructing a mathematical object that is isomorphic to a system’s phenomenology would be the correct approach if you want to formalize phenomenology.

And then from there, one of the big questions in how you even start is, what’s the simplest starting point? And here, I think one of our big innovations that is not seen at any other research group is we’ve started with emotional valence and pleasure. We think these are not only very ethically important, but also just literally the easiest place to start reverse engineering.

Lucas: Right, and so this view is also colored by physicalism and qualia structuralism and valence realism. Could you explain some of those things in a non-jargony way?

Mike: Sure. Qualia formalism is this idea that math is the right language to talk about qualia in, and that we can get a precise answer. This is another way of saying that we’re realists about consciousness, much as people can be realists about electromagnetism. We’re also valence realists. This refers to how we believe emotional valence, or pain and pleasure, the goodness or badness of an experience, is a natural kind. This concept carves reality at the joints. We have some further thoughts on how to define this mathematically as well.

Lucas: So you guys are physicalists, so you think that basically the causal structure of the world is best understood by physics and that consciousness was always part of the game engine of the universe from the beginning. Ontologically, it was basic and always there in the same sense that the other forces of nature were already in the game engine since the beginning?

Mike: Yeah, I would say so. I personally like the frame of dual aspect monism, but I would also step back a little bit and say there’s two attractors in this discussion. One is the physicalist attractor, and that’s QRI. Another would be the functionalist/computationalist attractor. I think a lot of AI researchers are in this attractor and this is a pretty deep question of, if we want to try to understand what value is, or what’s really going on, or if we want to try to reverse engineer phenomenology, do we pay attention to bits or atoms? What’s more real; bits or atoms?

Lucas: That’s an excellent question. Scientific reductionism here I think is very interesting. Could you guys go ahead and unpack, though, the skeptic’s position on your view and broadly adjudicate the merits of each view?

Andrés: Maybe a really important frame here is called Marr’s Levels of Analysis. David Marr was a cognitive scientist who wrote a really influential book in the ’80s called Vision, where he basically creates a schema for how to understand an information processing system, in this particular case, how you actually make sense of the world visually. The framework goes as follows: there are three ways in which you can describe an information processing system. First of all, the computational/behavioral level. What that is about is understanding the input-output mapping of an information processing system. Part of it is also understanding the run-time complexity of the system and under what conditions it’s able to perform its actions. An analogy here would be an abacus, for example.

On the computational/behavioral level, what an abacus can do is add, subtract, multiply, divide, and if you’re really creative you can also exponentiate and do other interesting things. Then you have the algorithmic level of analysis, which is a little bit more detailed, and in a sense more constrained. What the algorithmic level of analysis is about is figuring out what the internal representations are, and the possible manipulations of those representations, such that you get the input-output mapping described by the first layer. Here you have an interesting relationship where understanding the first layer doesn’t fully constrain the second one. That is to say, there are many systems that have the same input-output mapping but that under the hood use different algorithms.

In the case of the abacus, an algorithm might be something like: whenever you want to add a number, you just push a bead. Whenever you’re done with a row, you push all of the beads back and then you add a bead in the row underneath. And finally, you have the implementation level of analysis, and that is: what is the system actually made of? How is it constructed? All of these different levels ultimately also map onto different theories of consciousness, and that is basically where in the stack you associate consciousness, or being, or “what matters”. So, for example, behaviorists in the ’50s may have associated consciousness, if they gave any credibility to that term, with the behavioral level. They don’t really care what’s happening inside as long as you have an extended pattern of reinforcement learning over many iterations.

What matters is basically how you’re behaving and that’s the crux of who you are. A functionalist will actually care about what algorithms you’re running, how it is that you’re actually transforming the input into the output. Functionalists generally do care about, for example, brain imaging; they do care about the high level algorithms that the brain is running, and will generally be very interested in figuring out these algorithms and generalizing them in fields like machine learning and digital neural networks and so on. A physicalist associates consciousness with the implementation level of analysis. How the system is physically constructed has bearing on what it is like to be that system.

Lucas: So, you guys haven’t said that this was your favorite approach, but if people are familiar with David Chalmers, these seem to be the easy problems, right? And functionalists are interested in just the easy problems and some of them will actually just try to explain consciousness away, right?

Mike: Yeah, I would say so. And I think to try to condense some of the criticism we have of functionalism, I would claim that it looks like a theory of consciousness and can feel like a theory of consciousness, but it may not actually do what we need a theory of consciousness to do; specify which exact phenomenological states are present.

Lucas: Is there not some conceptual partitioning that we need to do between functionalists who believe in qualia or consciousness, and those that are illusionists or want to explain it away or think that it’s a myth?

Mike: I think that there is that partition, and I guess there is a question of how principled the partition can be, or whether, if you chase the ideas down as far as you can, the partition collapses. Either consciousness is a thing that is real in some fundamental sense, and I think you can get there with physicalism, or consciousness is more of a process, a leaky abstraction. I think functionalism naturally tugs in that direction. For example, Brian Tomasik has followed this line of reasoning and come to the conclusion of analytic functionalism, which tries to explain away consciousness.

Lucas: What is your guys’s working definition of consciousness, and what does it mean to say that consciousness is real?

Mike: It is a word that’s overloaded. It’s used in many contexts. I would frame it as what it feels like to be something, and something is conscious if there is something it feels like to be that thing.

Andrés: It’s important also to highlight some of its properties. As Mike pointed out, consciousness is used in many different ways. There are something like eight definitions for the word consciousness, and honestly, all of them are really interesting. Some of them are more fundamental than others, and we tend to focus on the more fundamental side of the spectrum for the word. A sense that would be very non-fundamental would be consciousness in the sense of social awareness or something like that. We actually think of consciousness much more in terms of qualia: what is it like to be something? What is it like to exist? Some of the key properties of consciousness are as follows: First of all, we do think it exists.

Second, in some sense it has causal power, in the sense that the fact that we are conscious matters for evolution; evolution made us conscious for a reason, that it’s actually doing some computational legwork that would maybe be possible to do otherwise, but just not as efficiently or as conveniently as is possible with consciousness. Then also you have the property of qualia, the fact that we can experience sights, and colors, and tactile sensations, and thoughts, and emotions, and so on. All of these are in a sense completely different worlds, but they have the property that they can be part of a unified experience that can experience color at the same time as sound. All those different types of sensations, we describe them as the category of consciousness because they can be experienced together.

And finally, you have unity, the fact that you have the capability of experiencing many qualia simultaneously. That’s generally a very strong claim to make, but we think you need to acknowledge and take seriously its unity.

Lucas: What are your guys’s intuition pumps for thinking why consciousness exists as a thing? Why is there qualia?

Andrés: There’s the metaphysical question of why consciousness exists to begin with. That’s something I would like to punt on for the time being. There’s also the question of why it was recruited for information processing purposes in animals. The intuition here is that the various contrasts that you can have within experience can serve a computational role. So, there may be a very deep reason why color qualia or visual qualia is used for information processing associated with sight, and why tactile qualia is associated with information processing useful for touching and making haptic representations, and that might have to do with the actual map of how all the qualia values are related to each other. Obviously, you have all of these edge cases, people who are synesthetic.

They may open their eyes and they experience sounds associated with colors, and people tend to think of those as abnormal. I would flip it around and say that we are all synesthetic, it’s just that the synesthesia that we have in general is very evolutionarily adaptive. The reason why you experience colors when you open your eyes is that that type of qualia is really well suited to represent geometrically a projective space. That’s something that naturally comes out of representing the world with the sensory apparatus like eyes. That doesn’t mean that there aren’t other ways of doing it. It’s possible that you could have an offshoot of humans that whenever they opened their eyes, they experience sound and they use that very well to represent the visual world.

But we may very well be in a local maximum of how different types of qualia are used to represent and do certain types of computations in a very well-suited way. The intuition behind why we’re conscious is that all of these different contrasts in the structure of the relationships of possible qualia values have computational implications, and there are actual ways of using these contrasts in very computationally effective ways.

Lucas: So, just to channel the functionalist here, wouldn’t he just say that everything you just said about qualia could be fully reducible to input-output and algorithmic information processing? So, why do we need this extra property of qualia?

Andrés: There’s this article, I believe by Brian Tomasik, that basically says flavors of consciousness are flavors of computation. It might be very useful to do that exercise, where basically you identify color qualia as just a certain type of computation, and it may very well be that the geometric structure of color is actually just a particular algorithmic structure, such that whenever you have a particular type of algorithmic information processing, you get these geometric state spaces. In the case of color, that’s a Euclidean three-dimensional space. In the case of tactile or smell, it might be a much more complicated space, but then it’s in a sense implied by the algorithms that we run. There are a number of good arguments there.

The general approach to how to tackle them is that when it comes down to actually defining what algorithms a given system is running, you will hit a wall when you try to formalize exactly how to do it. So, one example is, how do you determine the scope of an algorithm? When you’re analyzing a physical system and you’re trying to identify what algorithm it is running, are you allowed to basically contemplate 1,000 atoms? Are you allowed to contemplate a million atoms? Where is the natural boundary for you to say, “Whatever is inside here can be part of the same algorithm, but whatever is outside of it can’t”? And there really isn’t a frame-invariant way of making those decisions. On the other hand, if you associate qualia with actual physical states, there is a frame-invariant way of describing what the system is.

Mike: So, a couple of years ago I posted a piece giving a critique of functionalism and one of the examples that I brought up was, if I have a bag of popcorn and I shake the bag of popcorn, did I just torture someone? Did I just run a whole brain emulation of some horrible experience, or did I not? There’s not really an objective way to determine which algorithms a physical system is objectively running. So this is a kind of an unanswerable question from the perspective of functionalism, whereas with the physical theory of consciousness, it would have a clear answer.

Andrés: Another metaphor here is, let’s say you’re at a park enjoying an ice cream. In this system that I created that has, let’s say, isomorphic algorithms to whatever is going on in your brain, the particular algorithms that your brain is running in that precise moment within a functionalist paradigm map onto a metal ball rolling down one of the paths within this machine in a straight line, not touching anything else. So there’s actually not much going on. According to functionalism, that would have to be equivalent, and it would actually be generating your experience. Now the weird thing there is that you could actually break the machine, you could do a lot of things, and the behavior of the ball would not change.

Meaning that within functionalism, to actually understand what a system is doing, you need to understand the counterfactuals of the system. You need to understand: what would the system be doing if the input had been different? And all of a sudden, you end up with this very, very gnarly problem of defining, well, how do you actually objectively decide what the boundary of the system is? Even for some of these particular states that are allegedly very complicated, the system looks extremely simple, and you can remove a lot of parts without actually modifying its behavior. That then casts into question whether there is an objective boundary, any non-arbitrary boundary, that you can draw around the system and say, “Yeah, this is equivalent to what’s going on in your brain right now.”

This has a very heavy bearing on the binding problem. The binding problem for those who haven’t heard of it is basically, how is it possible that 100 billion neurons just because they’re skull-bound, spatially distributed, how is it possible that they simultaneously contribute to a unified experience as opposed to, for example, neurons in your brain and neurons in my brain contributing to a unified experience? You hit a lot of problems like what is the speed of propagation of information for different states within the brain? I’ll leave it at that for the time being.

Lucas: I would just like to be careful about this intuition here that experience is unified. I think that the intuition pump for that is direct phenomenological experience like experience seems unified, but experience also seems a lot of different ways that aren’t necessarily descriptive of reality, right?

Andrés: You can think of it as different levels of sophistication, where you may start out with a very naive understanding of the world, where you confuse your experience for the world itself. A very large percentage of people perceive the world and in a sense think that they are experiencing the world directly, whereas all the evidence indicates that actually you’re experiencing an internal representation. You can go and dream, you can hallucinate, you can enter interesting meditative states, and those don’t map to external states of the world.

There’s this transition that happens when you realize that in some sense you’re experiencing a world simulation created by your brain, and of course, you’re fooled by it in countless ways, especially when it comes to emotional things: we look at a person and we might have an intuition of what type of person they are, and if we’re not careful, we can confuse our intuition, we can confuse our feelings, with truth, as if we were actually able to sense their souls, so to speak, rather than, “Hey, I’m running some complicated models on people-space and trying to carve out who they are.” There’s definitely a lot of ways in which experience is very deceptive, but here I would actually make an important distinction.

When it comes to intentional content (intentional content is basically what the experience is about; for example, if you’re looking at a chair, there’s the quality of chairness, the fact that you understand the meaning of chair, and so on), that is usually a very deceptive part of experience. There’s another way of looking at experience that I would say is not deceptive, which is the phenomenal character of experience: how it presents itself. You can be deceived about what the experience is about, but you cannot be deceived about how you’re having the experience, how you’re experiencing it. You can infer, based on a number of experiences, that the only way for you to even actually experience a given phenomenal object is to incorporate a lot of that information into a unified representation.

But also, if you just pay attention to your experience, you can notice that you can simultaneously place your attention on two spots of your visual field and that they are harmonized. That’s the phenomenal character, and I would say that there’s a strong case to be made not to doubt that property.

Lucas: I’m trying to do my best to channel the functionalist. I think he or she would say, “Okay, so what? That’s just more information processing, and I’ll bite the bullet on the binding problem. I still need some more time to figure that out. So what? It seems like these people who believe in qualia have an even tougher job of trying to explain this extra spooky quality in the world that’s different from all the other physical phenomena that science has gone into.” It also seems to violate Occam’s razor, or a principle of lightness, where one’s metaphysics or ontology would want to assume the least amount of extra properties or entities in order to try to explain the world. I’m just really trying to tease out your best arguments here for qualia realism, as we do have this current state of things in AI alignment where it seems most people would either try to explain away consciousness, say it’s an illusion, or be anti-realist about qualia.

Mike: That’s a really good question, a really good frame. And I would say our strongest argument revolves around predictive power. Just like centuries ago, you could absolutely be a skeptic about, shall we say, electromagnetism realism. And you could say, “Yeah, I mean there is this thing we call static, and there’s this thing we call lightning, and there’s this thing we call lodestones or magnets, but all these things are distinct. And to think that there’s some unifying frame, some deep structure of the universe, that would tie all these things together and highly compress these phenomena, that’s crazy talk.” And so, it is a viable position today to say that about consciousness: it’s not yet clear whether consciousness has deep structure, but we’re assuming it does, and we think that unlocks a lot of predictive power.

We should be able to make predictions that are both more concise and compressed and crisp than others, and we should be able to make predictions that no one else can.

Lucas: So what is most powerful here about what you guys are doing? Is it that the specific theories and assumptions you make are falsifiable?

Mike: Yeah.

Lucas: If we can make predictive assessments of these things, which are either leaky abstractions or qualia, how would we even then be able to arrive at a realist or anti-realist view about qualia?

Mike: So, one frame on this is: it could be that one could explain a lot of things about observed behavior and implicit phenomenology through a purely functionalist or computationalist lens, but maybe for a given system it might take 10 terabytes. And if you can get there in a much simpler way, if you can explain it in terms of three elegant equations instead of 10 terabytes, that wouldn’t be proof that there exists some crystal-clear deep structure at work, but it would be very suggestive. Marr’s levels of analysis are pretty helpful here. A functionalist might actually be very skeptical of consciousness mattering at all, because they would say, “Hey, if you’re identifying consciousness at the implementation level of analysis, how could that have any bearing on how we talk about the world, how we understand the world, how we behave?

Since the implementation level is kind of epiphenomenal from the point of view of the algorithm. How can an algorithm know its own implementation? All it can maybe figure out is its own algorithm, and its identity would be constrained to its own algorithmic structure.” But that’s not quite true. In fact, one level of analysis can have bearing on another: in some cases the implementation level of analysis doesn’t actually matter for the algorithm, but in some cases it does. So, if you were implementing a computer, let’s say with water, you have the option of maybe implementing a Turing machine with water buckets, and in that case, okay, the implementation level of analysis goes out the window in the sense that it doesn’t really help you understand the algorithm.

But if the way you’re using water to implement algorithms is by basically creating this system of adding waves in buckets of different shapes, with different resonant modes, then the implementation level of analysis actually matters a whole lot for what algorithms are … finely tuned to be very effective in that substrate. In the case of consciousness and how we behave, we do think properties of the substrate have a lot of bearing on what algorithms we actually run. A functionalist should actually start caring about consciousness if the properties of consciousness make the algorithms more efficient, more powerful.

Lucas: But what if qualia and consciousness are substantive real things? What if epiphenomenalism is true and consciousness is like smoke rising from computation, without any causal efficacy?

Mike: To offer a re-frame on this, I like this frame of dual aspect monism better. There seems to be an implicit value judgment on epiphenomenalism. It’s seen as this very bad thing if a theory implies qualia as epiphenomenal. Just to put cards on the table, I think Andrés and I differ a little bit on how we see these things, although I think our ideas also mesh up well. But I would say that under the frame of something like dual aspect monism, that there’s actually one thing that exists, and it has two projections or shadows. And one projection is the physical world such as we can tell, and then the other projection is phenomenology, subjective experience. These are just two sides of the same coin and neither is epiphenomenal to the other. It’s literally just two different angles on the same thing.

And in that sense, qualia values and physical values are really talking about the same thing when you get down to it.

Lucas: Okay. So does this all begin with this move that Descartes makes, where he tries to produce a perfectly rational philosophy or worldview by making no assumptions and then starting with experience? Is this the kind of thing that you guys are doing in taking consciousness or qualia to be something real or serious?

Mike: I can just speak for myself here, but I would say my intuition comes from two places. One is staring deep into the beast of functionalism and realizing that it doesn’t lead to a clear answer. My model is that it just is this thing that looks like an answer but can never even in theory be an answer to how consciousness works. And if we deny consciousness, then we’re left in a tricky place with ethics and moral value. It also seems to leave value on the table in terms of predictions, that if we can assume consciousness as real and make better predictions, then that’s evidence that we should do that.

Lucas: Isn’t that just an argument that it would be potentially epistemically useful for ethics if we could have predictive power about consciousness?

Mike: Yeah. So, let’s assume that it’s 100 years, or 500 years, or 1,000 years in the future, and we’ve finally cracked consciousness. We’ve finally solved it. My open question is, what does the solution look like? If we’re functionalists, what does the solution look like? If we’re physicalists, what does the solution look like? And we can expand this to ethics as well.

Lucas: Just as a conceptual clarification, the functionalists are also physicalists though, right?

Andrés: There are two senses of the word physicalism here. There’s physicalism in the sense of a theory of the universe: that the behavior of matter and energy, what happens in the universe, is exhaustively described by the laws of physics, or future physics. And there is also physicalism in the sense of a way of understanding consciousness, in contrast to functionalism. David Pearce, I think, would describe his view as non-materialist physicalist idealism. There’s definitely a very close relationship between that phrasing and dual aspect monism. I can briefly unpack it. Basically, “non-materialist” is denying that the stuff of the world is fundamentally unconscious. That’s something that materialism claims: that what the world is made of is not conscious, is raw matter, so to speak.

Andrés: “Physicalist,” again, in the sense that the laws of physics exhaustively describe behavior, and “idealist” in the sense that what makes up the world is qualia, or consciousness. The big-picture view is that the actual substrate of the universe, quantum fields, are fields of qualia.

Lucas: So Mike, you were saying that in the future, when we potentially have a solution to the problem of consciousness, that in the end the functionalist project, with algorithms and explanations of, say, all of the easy problems, all of the mechanisms behind the things that we call consciousness, you think that project will ultimately fail?

Mike: I do believe that, and I guess my gentle challenge to functionalists would be to sketch out a vision of what a satisfying answer to consciousness would be, whether it’s completely explaining it away or completely explaining it. If in 500 years you go to the local bookstore and you check out Consciousness 101, and just flip through it, you look at the headlines and the chapter list and the pictures, what do you see? I think we have an answer as formalists, but I would be very interested in getting the functionalist take on this.

Lucas: All right, so you guys have this belief in the ability to formalize our understanding of consciousness. Is this actually contingent on realism or anti-realism?

Mike: It is implicitly dependent on realism, that consciousness is real enough to be describable mathematically in a precise sense. And actually that would be my definition of realism, that something is real if we can describe it exactly with mathematics and it is instantiated in the universe. I think the idea of connecting math and consciousness is very core to formalism.

Lucas: What’s particularly interesting here is that you’re making falsifiable claims about phenomenological states. It’s good and exciting that your Symmetry Theory of Valence, which we can get into now, has falsifiable aspects. So do you guys want to describe your Symmetry Theory of Valence, and how this fits in as a consequence of your valence realism?

Andrés: Sure, yeah. I think one of the key places where this has bearing is in understanding what it is that we actually want, and what it is that we actually like and enjoy. Usually that gets answered in an agentive way. So basically you think of agents as entities who spin out possibilities for what actions to take, and then they have a way of sorting them by expected utility and then carrying them out. A lot of people may locate what we want or what we like or what we care about at that level, the agent level, whereas we think the true source of value is more low-level than that. There’s something else that we’re actually using in order to implement agentive behavior. There are ways of experiencing value that are completely separated from agents. You don’t actually need to be generating possible actions and evaluating them and enacting them for there to be value, or for you to actually be able to enjoy something.

So what we’re examining here is: what is the lower-level property that gives rise even to agentive behavior, that underlies every other aspect of experience? That would be valence, and specifically valence gradients. The general claim is that we are set up in such a way that we are basically climbing the valence gradient. This is not true in every situation, but it’s mostly true, and it’s definitely mostly true in animals. And then the question becomes: what implements valence gradients? Perhaps the best intuition pump here is this extraordinary fact that things that have nothing to do with our evolutionary past can nonetheless feel good or bad. It’s understandable that if you hear somebody scream, you may get nervous or anxious or fearful, or if you hear somebody laugh, you may feel happy.

That makes sense from an evolutionary point of view. But why would the sound of the Bay Area Rapid Transit, BART, which creates these very intense screeching sounds that are not even within the vocal range of humans, that are just really bizarre and never encountered in our evolutionary past, nonetheless have an extraordinarily negative valence? That’s a hint that valence has to do with patterns: it’s not just goals and actions and utility functions, but the actual pattern of your experience may determine valence. The same goes for the SUBPAC, this technology that basically renders sounds between 10 and 100 hertz; some of them feel really good, some of them feel pretty unnerving, some of them are anxiety-producing. Why would that be the case, especially when you’re getting types of input that have nothing to do with our evolutionary past?

It seems that there are ways of triggering high- and low-valence states just based on the structure of your experience. The last example I’ll give is very weird states of consciousness like meditation or psychedelics, which seem to come with extraordinarily intense and novel forms of experiencing significance, or a sense of bliss, or pain. And again, they don’t seem to have much semantic content per se, or rather, the semantic content is not the core reason why they feel the way they do. It has to do more with the particular structure that they induce in experience.

Mike: There are many ways to talk about where pain and pleasure come from. We can talk about it in terms of neurochemicals, opioids, dopamine. We can talk about it in terms of pleasure centers in the brain, in terms of goals and preferences and getting what you want. But all of these have counterexamples; all of these have some point where, if you follow the thread back, they beg the question. I think the only way to explain emotional valence, pain and pleasure, that doesn’t beg the question is to explain it in terms of phenomenology itself: some patterns within phenomenology just intrinsically feel good, and some intrinsically feel bad. To touch back on the formalist frame, this would be saying that if we have a mathematical object that is isomorphic to your phenomenology, to what it feels like to be you, then some pattern or property of this object will refer to, or will sort of intrinsically encode, your emotional valence, how pleasant or unpleasant the experience is.

That’s the valence formalism aspect that we’ve come to.

Lucas: So given the valence realism, the view is that there’s this intrinsic pleasure-pain axis of the world, and this is sort of channeling, I guess, David Pearce’s view. There are things in experience which are just clearly good-seeming or bad-seeming. Will MacAskill called these pre-theoretic properties we might ascribe to certain kinds of experiential aspects: they’re just good or bad. So with this valence realism view, there’s this goodness or badness whose nature is sort of self-intimatingly disclosed, which has been in the physics and in the world since the beginning, and now it’s unfolding and expressing itself more as the universe sort of comes to life, and embedded somewhere deep within the universe’s structure are these intrinsically good or intrinsically bad valences which complex computational systems, and maybe other stuff, have access to.

Andrés: Yeah, yeah, that’s right. And I would perhaps emphasize that it’s not only pre-theoretical, it’s pre-agentive, you don’t even need an agent for there to be valence.

Lucas: Right. Okay. This is going to be a good point, I think, for getting into these other more specific, hairy philosophical problems. Could you go ahead and unpack a little bit more this view that pleasure or pain is self-intimatingly good or bad, that just by standing in an experiential relation with the thing, its nature is disclosed? Brian Tomasik, and I think functionalists generally, would say there’s just another reinforcement learning algorithm somewhere underneath that is evaluating these phenomenological states. They’re not intrinsically good or bad; that’s just what it feels like to be the kind of agent who has that belief.

Andrés: Sure. There are definitely many angles from which to see this. One of them is by realizing that liking, wanting, and learning are possible to dissociate; in particular, you can have reinforcement without an associated positive valence. You can also have positive valence without reinforcement or learning. Generally they are correlated, but they are different things. My understanding is that a lot of people think of valence as something that matters because you are the type of agent that has a utility function and a reinforcement function. If that were the case, we would expect valence to melt away in states that are non-agentive, but we don’t necessarily see that. We would also expect it to be intrinsically tied to intentional content, the aboutness of experience. A very strong counterexample: somebody may claim that what they truly want is to be academically successful or something like that.

They think of the reward function as intrinsically tied to getting a degree or something like that. I would call that to some extent illusory: if you actually look at how those preferences are being implemented, deep down there would be valence gradients happening there. One way to show this: on graduation day, you give the person an opioid antagonist. The person will subjectively feel that the day is meaningless. You’ve removed the pleasant gloss of the experience that they were actually looking for, that they thought all along was tied in with intentional content, with the fact of graduating, when in fact it was the hedonic gloss that they were after. That’s one intuition pump there.

Lucas: These core problem areas that you’ve identified in Principia Qualia, would you just like to briefly touch on those?

Mike: Yeah, we’re trying to break the problem down into modular pieces, with the idea that if we can decompose the problem correctly, then the sub-problems become much easier than the overall problem, and if you collect all the solutions to the sub-problems, then in aggregate you get a full solution to the problem of consciousness. So I’ve split things up into the metaphysics, the math, and the interpretation. The first question is: what metaphysics do you even start with? What ontology do you even use to approach the problem? We’ve chosen the ontology of physics, which can objectively map onto reality in a way that computation cannot. Then, once you have your core ontology, in this case physics, there’s the question of what counts, what actively contributes to consciousness. Do we look at electrons, electromagnetic fields, quarks?

This is an unanswered question. We have hypotheses, but we don’t have an answer. Moving into the math: conscious systems seem to have boundaries. If something’s happening inside my head, it can directly contribute to my conscious experience. But even if we put our heads together, literally speaking, your consciousness doesn’t bleed over into mine; there seems to be a boundary. So one way of framing this is the boundary problem, and another way of framing it is the binding problem, and these are just two sides of the same coin. There’s this big puzzle of how you draw the boundaries of a subjective experience. IIT is set up to approach consciousness through this lens, with a certain style of answer, a style of approach. We don’t necessarily need to take that approach, but it’s an intellectual landmark. Then we get into things like the state-space problem and the topology-of-information problem.

Say we’ve figured out our basic ontology, what we think is a good starting point, and, of that stuff, what actively contributes to consciousness, and then some principled way to draw a boundary: okay, this is conscious experience A and this is conscious experience B, and they don’t overlap. So you have a bunch of information inside the boundary. Then there’s this math question of how you rearrange it into a mathematical object that is isomorphic to what that stuff feels like. Again, IIT has an approach to this; we don’t necessarily subscribe to the exact approach, but it’s good to be aware of. There’s also the interpretation problem, which is actually very near and dear to what QRI is working on, and this is the question of: if you had a mathematical object that represented what it feels like to be you, how would we even start to figure out what it meant?

Lucas: This is also where the falsifiability comes in, right? If we have the mathematical object and we’re able to formally translate that into phenomenological states, then people can self report on predictions, right?

Mike: Yes. I don’t necessarily fully trust self reports as being the gold standard. I think maybe evolution is tricky sometimes and can lead to inaccurate self report, but at the same time it’s probably pretty good, and it’s the best we have for validating predictions.

Andrés: A lot of this gets easier if we assume that maybe we can be wrong in an absolute sense, but we’re often pretty well calibrated to judge relative differences. Maybe you ask me how I’m doing on a scale of one to ten and I say seven when the reality is a five; maybe that’s a problem. But at the same time, I like chocolate, and if you give me some chocolate and I eat it and that improves my subjective experience, I would expect us to be well calibrated in terms of evaluating whether something is better or worse.

Lucas: There’s this view here though that the brain is not like a classical computer, that it is more like a resonant instrument.

Mike: Yeah. Maybe an analogy here could be pretty useful. There’s this researcher William Sethares who basically figured out a way to quantify the mutual dissonance between pairs of notes. It turns out that it’s not very hard: all you need to do is add up the pairwise dissonance between every harmonic of the notes. And what that gives you is that if you take, for example, a major key and you compute the average dissonance between pairs of notes within that major key, it’s going to be pretty low on average. And if you take the average dissonance of a minor key, it’s going to be higher. So in a sense, what distinguishes a minor and a major key is, in the combinatorial space of possible permutations of notes, how frequently they are dissonant versus consonant.

That’s a very ground-truth mathematical feature of a musical instrument, and it’s going to be different from one instrument to the next. With that as a backdrop, we think of the brain, and in particular valence, in a very similar light: the brain has natural resonant modes, and emotions may seem extremely complicated from the outside. When you’re having a very complicated emotion and we ask you to describe it, it’s almost like trying to describe a moment in a symphony, this very complicated composition; how do you even go about it? But deep down, the reason why a particular phrase sounds pleasant or unpleasant within music is ultimately traceable to the additive pairwise dissonance of all of those harmonics. And likewise, for a given state of consciousness, we suspect that, very similar to music, the average pairwise dissonance between the harmonics present at a given point in time will be strongly related to how unpleasant the experience is.
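The dissonance computation described here can be sketched in a few lines of Python. This is a toy illustration, not Sethares’ actual code: the Plomp-Levelt roughness curve uses the parameter values commonly cited from Sethares’ work, but the 1/k harmonic amplitudes, the six-partial cutoff, and the function names are illustrative assumptions.

```python
import itertools
import math

def pl_dissonance(f1, f2, a1=1.0, a2=1.0):
    """Plomp-Levelt-style roughness between two pure tones (Sethares' fit)."""
    s = 0.24 / (0.021 * min(f1, f2) + 19.0)  # critical-bandwidth scaling
    x = abs(f2 - f1)
    return min(a1, a2) * (math.exp(-3.5 * s * x) - math.exp(-5.75 * s * x))

def note_dissonance(f1, f2, n_harmonics=6):
    """Add up the pairwise roughness between every harmonic of two notes."""
    partials = lambda f: [(f * k, 1.0 / k) for k in range(1, n_harmonics + 1)]
    return sum(pl_dissonance(fa, fb, aa, ab)
               for fa, aa in partials(f1) for fb, ab in partials(f2))

def mean_scale_dissonance(semitones, root=261.63):
    """Average pairwise note dissonance over all note pairs in a scale."""
    freqs = [root * 2 ** (s / 12) for s in semitones]
    pairs = list(itertools.combinations(freqs, 2))
    return sum(note_dissonance(a, b) for a, b in pairs) / len(pairs)

major = [0, 2, 4, 5, 7, 9, 11]  # semitone steps of a major scale
minor = [0, 2, 3, 5, 7, 8, 10]  # natural minor
```

Comparing `mean_scale_dissonance(major)` against `mean_scale_dissonance(minor)` is exactly the kind of combinatorial average being described.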

These are electromagnetic waves, and it’s not exactly static, and it’s not exactly a standing wave either, but it gets really close to it. So basically what this is saying is there’s this excitation-inhibition wave function, and that happens statistically across macroscopic regions of the brain. There’s only a discrete number of ways in which that wave can fit an integer number of times in the brain. We’ll give you a link to the actual visualizations for what this looks like. As a concrete example, one of the harmonics with the lowest frequency is basically a very simple one where the two hemispheres are alternately more excited versus inhibited. That will be a low-frequency harmonic because it is a very spatially large wave, an alternating pattern of excitation. Much higher-frequency harmonics are much more detailed and obviously hard to describe, but visually, generally speaking, the spatial regions that are activated versus inhibited are these very thin wave fronts.

It’s not a mechanical wave as such; it’s an electromagnetic wave. So what actually fluctuates is the electric potential in each of these regions of the brain. And within this paradigm, at any given point in time you can describe a brain state as a weighted sum of all of its harmonics, and what that weighted sum looks like depends on your state of consciousness.
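The “weighted sum of harmonics” picture can be made concrete with a toy sketch. In Atasoy’s model the harmonics are eigenmodes of the connectome Laplacian; here an orthonormal cosine basis over a one-dimensional strip of 64 “regions” stands in for them, purely as an illustration, and the function names are made up for this sketch.

```python
import numpy as np

# Toy basis: DCT-II cosine modes over a 1-D strip of n "brain regions".
# (Stand-in for connectome Laplacian eigenmodes; orthonormal by construction.)
n = 64
modes = np.array([np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
                  for k in range(n)])
modes /= np.linalg.norm(modes, axis=1, keepdims=True)

def decompose(state, harmonics=modes):
    """Weights w such that state == sum_k w[k] * harmonics[k]."""
    return harmonics @ state

def reconstruct(weights, harmonics=modes):
    """The brain state that a given weighting of the harmonics adds up to."""
    return harmonics.T @ weights
```

Any activity snapshot then round-trips: `reconstruct(decompose(state))` recovers `state`, and in this paradigm the weight vector is the description of the momentary brain state.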

Lucas: Sorry, I’m getting a little caught up here on enjoying resonant sounds and then also the valence realism. The view isn’t that all minds will enjoy resonant things because happiness is like a fundamental valence thing of the world, and so all brains that come out of evolution should probably enjoy resonance.

Mike: It’s less about the stimulus, it’s less about the exact signal and it’s more about the effect of the signal on our brains. The resonance that matters, the resonance that counts, or the harmony that counts we’d say, or in a precisely technical term, the consonance that counts is the stuff that happens inside our brains. Empirically speaking most signals that involve a lot of harmony create more internal consonance in these natural brain harmonics than for example, dissonant stimuli. But the stuff that counts is inside the head, not the stuff that is going in our ears.

Just to be clear about QRI’s move here: Selen Atasoy has put forth this connectome-specific harmonic wave model, and what we’ve done is combine it with our Symmetry Theory of Valence. It’s sort of a way of getting a Fourier transform of where the energy is, in terms of frequencies of brainwaves, in a much cleaner way than has been available through EEG. Basically, we can evaluate this data set for harmony: how much harmony is there in a brain? Given the link to the Symmetry Theory of Valence, that should be a very good proxy for how pleasant it is to be that brain.
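A hedged sketch of the “evaluate this data set for harmony” step: given the energy each harmonic mode carries and the mode frequencies, score the energy-weighted pairwise consonance. The roughness curve here is borrowed from the auditory (Plomp-Levelt/Sethares) literature purely as a placeholder; whether it is the right dissonance model for brain harmonics is exactly the open empirical question, and `harmony_score` is a hypothetical name, not a QRI function.

```python
import itertools
import math

def roughness(f1, f2):
    """Plomp-Levelt-style roughness between two frequencies; 0 when equal."""
    s = 0.24 / (0.021 * min(f1, f2) + 19.0)
    x = abs(f2 - f1)
    return math.exp(-3.5 * s * x) - math.exp(-5.75 * s * x)

def harmony_score(mode_freqs, weights):
    """Negative energy-weighted pairwise roughness of the active harmonics.
    Higher (closer to zero) = more consonance among the modes carrying energy."""
    total = sum(abs(w1 * w2) * roughness(f1, f2)
                for (f1, w1), (f2, w2)
                in itertools.combinations(zip(mode_freqs, weights), 2))
    norm = sum(abs(w) for w in weights) ** 2
    return -total / norm if norm else 0.0
```

Energy concentrated on modes at matching frequencies scores 0 (no roughness); energy spread across clashing frequencies scores lower, which under the Symmetry Theory of Valence would proxy a less pleasant brain state.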

Lucas: Wonderful.

Andrés: In this context, yeah, the Symmetry Theory of Valence would be much more fundamental. There’s probably many ways of generating states of consciousness that are in a sense completely unnatural that are not based on the harmonics of the brain, but we suspect the bulk of the differences in states of consciousness would cash out in differences in brain harmonics because that’s a very efficient way of modulating the symmetry of the state.

Mike: Basically, music can be thought of as a very sophisticated way to hack our brains into a state of greater consonance, greater harmony.

Lucas: All right. People should check out your Principia Qualia, which is the work that you’ve done that captures a lot of this well. Is there anywhere else that you’d like to refer people to for the specifics?

Mike: Principia Qualia covers the philosophical framework and the Symmetry Theory of Valence. Andrés has written deeply about this connectome-specific harmonic wave frame, and the name of that piece is Quantifying Bliss.

Lucas: Great. I would love to be able to quantify bliss and instantiate it everywhere. Let’s jump in here into a few problems and framings of consciousness. I’m just curious to see if you guys have any comments on them. The first is what you call the real problem of consciousness, and the second is what David Chalmers calls the Meta problem of consciousness. Would you like to go ahead and start off with the real problem of consciousness?

Mike: Yeah. So this gets to something we were talking about previously: is consciousness real or is it not? Is it something to be explained, or to be explained away? This cashes out in terms of: is it something that can be formalized, or is it intrinsically fuzzy? I’m calling this the real problem of consciousness, and a lot depends on the answer to it. There are so many different ways to approach consciousness, and hundreds, perhaps thousands, of different carvings of the problem: we have panpsychism, we have dualism, we have non-materialist physicalism, and so on. I think essentially all of these theories sort themselves into two buckets, and the core distinction is: is consciousness real enough to formalize exactly, or not? This frame is perhaps the most useful one to use to evaluate theories of consciousness.

Lucas: And then there’s a Meta problem of consciousness which is quite funny, it’s basically like why have we been talking about consciousness for the past hour and what’s all this stuff about qualia and happiness and sadness? Why do people make claims about consciousness? Why does it seem to us that there is maybe something like a hard problem of consciousness, why is it that we experience phenomenological states? Why isn’t everything going on with the lights off?

Mike: I think this is a very clever move by David Chalmers. It’s a way to try to unify the field and get people to talk to each other, which is not so easy in the field. The Meta problem of consciousness doesn’t necessarily solve anything but it tries to inclusively start the conversation.

Andrés: The common move that people make here is to say that all of these crazy things we think and say about consciousness are just an information processing system modeling its own attentional dynamics. That’s one illusionist frame. But even within a qualia realist, qualia formalist paradigm, you still have the question of why we even think or self-reflect about consciousness. You could very well think of consciousness as being computationally relevant, you need to have consciousness and so on, but still lack introspective access. You could have these complicated conscious information processing systems that don’t necessarily self-reflect on the quality of their own consciousness. That property is important to model and make sense of.

We have a few formalisms that may give rise to some insight into how self-reflectivity happens, and in particular how it is possible to model the entirety of your state of consciousness in a given phenomenal object. This ties in with the notion of a homunculus: if the overall valence of your consciousness is actually a signal traditionally used for fitness evaluation, detecting basically when you are in existential risk or when there are reproductive opportunities that you may be missing out on, then it makes sense for there to be a general thermostat of the overall experience, where you can just look at it and get a sense of the overall well-being of the entire experience, added together in such a way that you experience it all at once.

I think a lot of the puzzlement has to do with that internal self-model of the overall well-being of the experience, which is something that we are evolutionarily incentivized to summarize and be able to see at a glance.

Lucas: So, some people have a view where human beings are conscious, they assume everyone else is conscious, and they think that the only place for value to reside is within consciousness, such that a world without consciousness is actually a world without any meaning or value. Even if we think that philosophical zombies, people who are functionally identical to us but with no qualia or phenomenological or experiential states, are conceivable, it would seem that there would be no value in a world of p-zombies. So I guess my question is: why does phenomenology matter? Why does the phenomenological modality of pain and pleasure, or valence, have some sort of special ethical or experiential status, unlike qualia like red or blue?

Why does red or blue not disclose some sort of intrinsic value in the same way that my suffering does or my bliss does or the suffering or bliss of other people?

Mike: My intuition is also that consciousness is necessary for value. Nick Bostrom has this wonderful quote in Superintelligence that we should be wary of building a Disneyland with no children: some technological wonderland that is filled with marvels of function but doesn’t have any subjective experience, doesn’t have anyone to enjoy it, basically. I would just say that I think most AI safety research is focused on making sure there is a Disneyland, making sure, for example, that we don’t just get turned into something like paperclips. But there’s this other problem: making sure there are children, making sure there are subjective experiences around to enjoy the future. I would say that there aren’t many live research threads on this problem, and I see QRI as a live research thread on how to make sure there is subjective experience in the future.

Probably a can of worms there, but as for your question about pain and pleasure, I may pass that to my colleague Andrés.

Andrés: Nothing terribly satisfying here. I would go with David Pearce’s view that these properties of experience are self-intimating, and to the extent that you do believe in value, they will come up as the natural focal points for value, especially if you’re able to probe the quality of your experience. In many states you believe that the reason why you like something is its intentional content: again, the case of graduating, or the case of getting a promotion, one of those things that a lot of people associate with feeling great. But if you actually probe the quality of experience, you will realize that there is this component of it which is its hedonic gloss, and you can manipulate it directly, again with things like opiate antagonists, and, if the symmetry theory of valence is true, potentially also by directly modulating the consonance and dissonance of the brain harmonics, in which case the hedonic gloss would change in peculiar ways.

When it comes to consilience, when it comes to many different points of view agreeing on what aspect of the experience is what brings value to it, it seems to be the hedonic gloss.

Lucas: So in terms of qualia and valence realism, would the causal properties of qualia be the thing that would show any arbitrary mind the self intimating nature of how good or bad an experience is, and in the space of all possible minds, what is the correct epistemological mechanism for evaluating the moral status of experiential or qualitative states?

Mike: So first of all, I would say that my focus so far has mostly been on describing what is and not what ought. I think that we can talk about valence without necessarily talking about ethics, but if we can talk about valence clearly, that certainly makes some questions and some frameworks in ethics make much more or less sense. So the better we can clearly and purely descriptively talk about consciousness, the easier I think a lot of these ethical questions get. I’m trying hard not to privilege any ethical theory. I want to talk about reality. I want to talk about what exists, what’s real, and what the structure of what exists is, and I think if we succeed at that, then all these other questions about ethics and morality get much, much easier. I do think that there is an implicit should wrapped up in questions about valence, but I do think that’s another leap.

You can accept that valence is real without necessarily accepting that optimizing valence is an ethical imperative. I personally think, yes, it is very ethically important, but it is possible to take a purely descriptive frame to valence. Whether or not this also discloses, as David Pearce says, the utility function of the universe is another question, and can be decomposed.

Andrés: One framing here too is that we do suspect valence is going to be the thing that matters to any mind, if you probe it in the right way in order to achieve reflective equilibrium. The best example of this is a talk a neuroscientist was giving at some point: there was something off, and everybody seemed to be a little bit anxious or irritated, and nobody knew why, and then one of the conference organizers suddenly came up to the presenter and did something to the microphone, and then everything sounded way better and everybody was way happier. There was this subtle hissing pattern caused by some malfunction of the microphone, and it was making everybody irritated; they just didn’t realize that was the source of the irritation, and when it got fixed, everybody was like, “Oh, that’s why I was feeling upset.”

We will find that to be the case over and over when it comes to improving valence. Somebody in the year 2050 might come to one of the connectome-specific harmonic wave clinics saying, “I don’t know what’s wrong with me,” but if you put them through the scanner, you identify that their 17th and 19th harmonics are in a state of dissonance. You cancel the 17th to make it cleaner, and then the person will all of a sudden say, “Yeah, my problem is fixed. How did you do that?” So I think it’s going to be a lot like that: the things that puzzle us, why we prefer this, why we think that is worse, will all of a sudden become crystal clear from the point of view of valence gradients objectively measured.

Mike: One of my favorite phrases in this context is “what you can measure you can manage,” and if we can actually find the source of dissonance in a brain, then yeah, we can resolve it, and this could open the door for, honestly, a lot of amazing things, making the human condition intrinsically better. Also maybe a lot of worrying things; being able to directly manipulate emotions may not necessarily be socially positive on all fronts.
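
[Editor’s note: to make the idea of “objectively measured dissonance” concrete, here is a toy sketch of sensory dissonance between two pure tones, using Sethares’ published fit to the Plomp-Levelt roughness curves. The constants are from that fit and should be treated as approximate; this is an illustration of measurable dissonance in general, not QRI’s actual brain-harmonics model.]

```python
import math

def dissonance(f1, f2):
    """Toy sensory dissonance between two pure tones in Hz
    (Sethares' fit to the Plomp-Levelt roughness curves)."""
    f_lo, f_hi = min(f1, f2), max(f1, f2)
    s = 0.24 / (0.021 * f_lo + 19)   # scales roughness with critical bandwidth
    x = s * (f_hi - f_lo)
    return math.exp(-3.5 * x) - math.exp(-5.75 * x)

print(dissonance(440, 440))  # unison: exactly 0.0
print(dissonance(440, 460))  # nearby tones beat against each other: rough
print(dissonance(440, 880))  # an octave apart: nearly consonant
```

The point of the sketch is only that dissonance here is a number computed from the signal, not a judgment call, which is the property the hypothetical 2050 clinic would rely on.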

Lucas: So I guess here we can begin to jump into AI alignment and qualia. We’re building AI systems and they’re getting pretty strong, and they’re going to keep getting stronger, potentially creating a superintelligence by the end of the century, and consciousness and qualia seem to be along for the ride for now. So I’d like to discuss a little bit here about more specific places in AI alignment where these views might inform and direct it.

Mike: Yeah, I would say there are three problems of AI safety. There’s the technical problem: how do you make a self-improving agent that is also predictable and safe? This is a very difficult technical problem, first of all to even make the agent, but second of all especially to make it safe, especially if it becomes smarter than we are. There’s also the political problem: even if you have the best technical solution in the world, a sufficiently good technical solution doesn’t mean that it will be put into action in a sane way if we’re not in a reasonable political system. But I would say the third problem is what QRI is most focused on, and that’s the philosophical problem: what are we even trying to do here? What is the optimal relationship between AI and humanity? And also a couple of specific details here. First of all, I think nihilism is absolutely an existential threat, and if we can find some antidotes to nihilism through some advanced valence technology, that could be enormously helpful for reducing x-risk.

Lucas: What kind of nihilism are you talking about here, like nihilism about morality and meaning?

Mike: Yes, I would say so, and just personal nihilism that it feels like nothing matters, so why not do risky things?

Lucas: Whose quote is it, the philosopher’s question of whether you should just kill yourself? That’s the yawning abyss of nihilism inviting you in.

Andrés: Albert Camus. The only real philosophical question is whether to commit suicide. Whereas how I think of it is that the real philosophical question is how to make love last, bringing value to existence, and if you have value on tap, then the question of whether to kill yourself or not seems really nonsensical.

Lucas: For sure.

Mike: We could also say that right now there aren’t many good Schelling points for global coordination. People talk about global coordination on building AGI being a great thing, but we’re a little light on the details of how to do that. If a clear, comprehensive, useful, practical understanding of consciousness can be built, then this may embody or generate new Schelling points that the larger world could self-organize around. If we can give people a clear understanding of what is and what could be, then I think we will get a better future that actually gets built.

Lucas: Yeah. Showing what is and what could be is immensely important and powerful. So moving forward with AI alignment as we’re building these more and more complex systems, there’s this needed distinction between unconscious and conscious information processing, if we’re interested in the morality and ethics of suffering and joy and other conscious states. How do you guys see the science of consciousness here, actually being able to distinguish between unconscious and conscious information processing systems?

Mike: There are a few frames here. One is that, yeah, it does seem like the brain does some processing in consciousness and some processing outside of consciousness, and what’s up with that? This could be an interesting frame to explore in terms of avoiding things like mind crime in the AGI or AI space: if there are certain computations which are painful, then don’t do them in a way that would be associated with consciousness. It would be very good to have rules of thumb here for how to do that. One interesting possibility is that in the future we might not just have compilers which optimize for speed of processing or minimization of dependent libraries and so on, but which could optimize for the valence of the computation on certain hardware. This of course gets into complex questions about computationalism, how hardware-dependent this compiler would be, and so on.

I think it’s an interesting and important long-term frame.

Lucas: So just to illustrate here, I think, the ways in which solving or better understanding consciousness will inform AI alignment, from present day until superintelligence and beyond.

Mike: I think there’s a lot of confusion about consciousness and a lot of confusion about what kind of thing the value problem is in AI safety, and there are some novel approaches on the horizon. I was speaking with Stuart Armstrong at last year’s EA Global, and he had some great things to share about his model fragments paradigm. I think this is the right direction. It’s sort of understanding that, yeah, human preferences are insane; they’re just not a consistent formal system.

Lucas: Yeah, we contain multitudes.

Mike: Yes, yes. So first of all, understanding what generates them seems valuable. There’s this frame in AI safety called the complexity of value thesis. I believe Eliezer came up with it in a post on LessWrong. It’s this frame where human value is very fragile, in that it can be thought of as a small area, perhaps even almost a point, in a very high dimensional space, say a thousand dimensions. If we go any distance in any direction from this tiny point in this high dimensional space, then we quickly get to something that we wouldn’t think of as very valuable: maybe we leave everything else the same but take away freedom, and most of the value is gone. This paints a pretty sobering picture of how difficult AI alignment will be.
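
[Editor’s note: the fragility intuition can be sketched with a toy calculation. The Gaussian value function, the dimensionality, and the numbers below are illustrative assumptions, not anything from the conversation; the point is only that a tiny error in every dimension compounds into a large distance in a thousand-dimensional space.]

```python
import math

def value(point, ideal, sigma=1.0):
    """Toy 'fragile value' function: value falls off as a Gaussian
    with distance from a single ideal point in value-space."""
    dist_sq = sum((p - q) ** 2 for p, q in zip(point, ideal))
    return math.exp(-dist_sq / (2 * sigma ** 2))

dims = 1000
ideal = [0.0] * dims

# A tiny per-dimension error of 0.1 compounds across 1000 dimensions:
# squared distance = 1000 * 0.01 = 10, so value = exp(-5), about 0.0067
# of the maximum value of 1.0.
perturbed = [0.1] * dims
print(value(perturbed, ideal))
```

Under these assumptions, getting every dimension almost right still loses more than 99% of the value, which is one way to read the “tiny point in a high dimensional space” picture.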

I think this is perhaps arguably the source of a lot of worry in the community: not only do we need to make machines that won’t just immediately kill us, but machines that will preserve our position in this very, very high dimensional space well enough that we keep the same trajectory, because possibly if we move at all, then we may enter a totally different trajectory that we in 2019 wouldn’t think of as having any value. So this problem becomes very, very intractable. I would just say that there is an alternative frame. The phrasing that I’m playing around with here is: instead of the complexity of value thesis, the unity of value thesis. It could be that many of the things that we find valuable, eating ice cream, living in a just society, having a wonderful interaction with a loved one, all of these have the same underlying neural substrate, and empirically this is what affective neuroscience is finding.

Eating a chocolate bar activates the same brain regions as a transcendental religious experience. So maybe there’s some sort of elegant compression that can be made, and actually things aren’t so stark. We’re not this point in a super high dimensional space where, if we leave the point, everything of value is trashed forever; maybe there’s some sort of convergent process that we can follow, that we can essentialize. We can make this list of 100 things that humanity values, and maybe what they all have in common is positive valence, and positive valence can be reverse engineered. To some people this feels like a very scary dystopic scenario, don’t knock it until you’ve tried it, but at the same time there’s a lot of complexity here.

One core frame that the ideas of qualia formalism and valence realism can offer AI safety is that maybe the actual goal is somewhat different than what the complexity of value thesis puts forward. Maybe the actual goal is different, and in fact easier. I think this could directly inform how we spend our resources on the problem space.

Lucas: Yeah, I was going to say that there exists a standing tension between this view of the complexity of all the preferences and values that human beings have, and then the valence realist view, which says that what’s ultimately good are certain experiential or hedonic states. I’m interested and curious about whether, if this valence view is true, it’s all just going to turn into hedonium in the end.

Mike: I’m personally a fan of continuity. I think that if we do things right we’ll have plenty of time to get things right, and also if we do things wrong then we’ll have plenty of time for things to be wrong. So I’m personally not a fan of big unilateral moves. It’s just getting back to this question of whether understanding what is can help us: clearly yes.

Andrés: Yeah. I guess one view is we could say preserve optionality and learn what is, and then from there hopefully we’ll be able to better inform oughts and with maintained optionality we’ll be able to choose the right thing. But that will require a cosmic level of coordination.

Mike: Sure. An interesting frame here is whole brain emulation. Whole brain emulation is sort of a frame built around functionalism, and it’s a seductive frame, I would say. If whole brain emulations wouldn’t necessarily have the same qualia as the original humans, based on hardware considerations, there could be some weird lock-in effects where, if the majority of society turned themselves into p-zombies, it may be hard to go back on that.

Lucas: Yeah. All right. We’re just getting to the end here, I appreciate all of this. You guys have been tremendous and I really enjoyed this. I want to talk about identity in AI alignment. This sort of taxonomy that you’ve developed about open individualism and closed individualism and all of these other things. Would you like to touch on that and talk about implications here in AI alignment as you see it?

Andrés: Yeah. Yeah, for sure. The taxonomy comes from Daniel Kolak, a philosopher and mathematician. It’s a pretty good taxonomy. Basically there’s open individualism, the view that a lot of meditators and mystics and people who take psychedelics often subscribe to, which is that we’re all one consciousness. Another frame is that our true identity is the light of consciousness, so to speak, so it doesn’t matter in what form it manifests; it’s always the same fundamental ground of being. Then you have the common sense view, called closed individualism: you start existing when you’re born, you stop existing when you die, you’re just this segment. Some religions actually extend that into the future or past with reincarnation, or maybe with heaven.

There’s a sense of ontological distinction between you and others, while at the same time ontological continuity from one moment to the next within you. Finally you have this view called empty individualism, which is that you’re just a moment of experience. That’s fairly common among physicists and a lot of people who’ve tried to formalize consciousness; often they converge on empty individualism. I think a lot of theories of ethics and rationality, like the veil of ignorance as a guide, or defining rational decision making as maximizing the expected utility of yourself as an agent, all seem to be implicitly based on closed individualism, and they’re not necessarily questioning it very much.

On the other hand, if the sense of individual identity of closed individualism doesn’t actually carve nature at its joints, as a Buddhist might say, if the feeling of continuity of being a separate, unique entity is an illusory construction of your phenomenology, that casts in a completely different light how to approach rationality itself, and even self-interest, right? If you start identifying with the light of consciousness rather than your particular instantiation, you will probably care a lot more about what happens to pigs in factory farms, because insofar as they are conscious, they are you in a fundamental way. It matters a lot in terms of how to carve out different possible futures, especially when you get into these very tricky situations like: what if there is mind melding, or what if there is the possibility of making perfect copies of yourself?

All of these edge cases are really problematic from the common sense view of identity, but they’re not really a problem from an open individualist or empty individualist point of view. With all of this said, I do personally think there’s probably a way of combining open individualism with valence realism that gives rise to the next step in human rationality, where we’re actually trying to really understand what the universe wants, so to speak. But I would say that there is a very tricky aspect here that has to do with game theory. We evolved to believe in closed individualism. The fact that it’s evolutionarily adaptive is obviously not an argument for it being fundamentally true, but it does seem to be some kind of evolutionarily stable point to believe of yourself as the thing you can affect most directly in a causal way, if you define your boundary that way.

That basically gives you focus on the actual degrees of freedom that you do have. And if you think of a society of open individualists, everybody altruistically, maximally contributing to the universal consciousness, and then you have one closed individualist who is just selfishly trying to acquire power for itself, you can imagine that that one view would have a tremendous evolutionary advantage in that context. So I’m not one who just naively advocates for open individualism unreflectively. I think we still have to work out the game theory of it, how to make it evolutionarily stable and also how to make it ethical. That’s an open question. I do think it’s important to think about, and if you take consciousness very seriously, especially within physicalism, that usually will cast huge doubts on the common sense view of identity.

It doesn’t seem like a very plausible view if you actually try to formalize consciousness.

Mike: The game theory aspect is very interesting. You can think of closed individualism as something evolution has produced that allows an agent to coordinate very closely with its past and future selves. Maybe we can say a little bit about why we’re not by default all empty individualists or open individualists. Empty individualism seems to have a problem where, if every slice of conscious experience is its own thing, then why should you even coordinate with your past and future self, because they’re not the same as you. So that leads to a problem of defection. And open individualism, where everything is the same being so to speak, as Andrés mentioned, allows free riders: if people are defecting, it doesn’t allow altruistic punishment or any way to stop the free riding. There’s interesting game theory here, and it also just feeds into the question of how we define our identity in the age of AI, the age of cloning, the age of mind uploading.

This gets very, very tricky very quickly. Depending on one’s theory of identity, people open themselves up to getting hacked in different ways; different theories of identity allow different forms of hacking.

Andrés: Yeah, and sometimes that’s really good and sometimes really bad. I would make the prediction that, not necessarily open individualism in its full-fledged form, but a weaker sense of identity than closed individualism, is likely going to be highly adaptive in the future, as people basically have the ability to modify their state of consciousness in much more radical ways. People who just identify with a narrow sense of identity will stay in their shells, not trying to disturb the local attractor too much, and that itself is not necessarily very advantageous if the things on offer are actually really good, both hedonically and intelligence-wise.

I do suspect that people who are somewhat more open to identifying with consciousness, or at least with a broader sense of identity, will be the people making more substantial progress, pushing the boundary and creating new cooperation and coordination technology.

Lucas: Wow, I love all that. Seeing closed individualism for what it was has had a tremendous impact on my life, and this whole question of identity, I think, is largely confused for a lot of people. At the beginning you said that open individualism says that we are all one consciousness, or something like this, right? For me, in identity I’d like to move beyond all distinctions of sameness or difference. To say, oh, we’re all one consciousness, to me seems like saying we’re all one electromagnetism, which is really to say that consciousness is an independent feature or property of the world that’s just sort of a ground part of the world, and when the world produces agents, consciousness is just an empty, identityless property that comes along for the ride.

In the same way it would be nonsense to say, “Oh, I am these specific atoms; I am just the forces of nature that are bounded within my skin and body.” In the same sense, as we were discussing with consciousness, there is the binding problem of the person, the discreteness of the person: where does the person really begin or end? It seems like these different kinds of individualism have, as you said, epistemic and functional use, but they also, in my view, create a ton of epistemic problems and ethical issues. And in terms of the valence theory, if qualia are actually something good or bad, then as David Pearce says, it’s really just an epistemological problem that you don’t have access to other brain states in order to see the self-intimating nature of what it’s like to be that thing in that moment.

There’s a sense in which I want to reject all identity as arbitrary, and I want to do that in an ultimate way, but then in the conventional way, I agree with you guys that there are these functional and epistemic issues that closed individualism seems to remedy somewhat, and that is why evolution, I guess, selected for it: it’s good for gene propagation and being selfish. But once one sees AI as just a new method of instantiating bliss, it doesn’t matter where the bliss is. Bliss is bliss, and there’s no such thing as your bliss or anyone else’s bliss. Bliss is its own independent feature or property, and you don’t really begin or end anywhere. You are an expression of a 13.7 billion year old system that’s playing out.

The universe is just peopling all of us at the same time, and when you get this view and you see yourself as just a super thin slice of the evolution of consciousness and life, for me it’s like: why do I really need to propagate my information into the future? I really don’t think there’s anything particularly special about the information of anyone who exists today. We want to preserve all of the good stuff and propagate it into the future, but people who seek immortality through AI, or seek any kind of continuation of what they believe to be their self, I just see all that as misguided, and I see it as wasting potentially better futures by trying to bring Windows 7 into the world of Windows 10.

Mike: This all gets very muddy when we try to merge human-level psychological drives, concepts, and adaptations with a fundamental physics-level description of what is. I don’t have a clear answer. I would say that it would be great to identify with consciousness itself, but at the same time, that’s not necessarily super easy if you’re suffering from depression or anxiety. So I just think that this is going to be an ongoing negotiation within society, and hopefully we can figure out ways in which everyone can move forward.

Andrés: There’s an article I wrote that I just called “Consciousness vs. Replicators.” That kind of gets to the heart of this issue. It sounds a little bit like good and evil, but it really isn’t. The true enemy here is replication for replication’s sake. On the other hand, the only way in which we can ultimately benefit consciousness, at least in a plausible, evolutionarily stable way, is through replication. We need to find the balance between replication and benefit of consciousness that makes the whole system stable, good for consciousness, and resistant against defectors.

Mike: I would like to say that I really enjoy Max Tegmark’s general frame of us living in this mathematical universe. One reframe of what we were just talking about in these terms is that there are patterns which have to do with identity, with valence, and with many other things. The grand goal is to understand what makes a pattern good or bad and to optimize our light cone for those sorts of patterns. This may have some counterintuitive implications; maybe closed individualism is actually a very adaptive thing that in the long term builds robust societies. It could be that that’s not true, but I just think that taking the mathematical frame and the long-term frame is a very generative approach.

Lucas: Absolutely. Great. I just want to finish up here on two fun things. It seems like good and bad are real in your view. Do we live in heaven or hell?

Mike: A lot of quips come to mind here: hell is other people, or nothing is good or bad but thinking makes it so. My pet theory, I should say, is that we live in something that is perhaps as close to heaven as is physically possible, the best of all possible worlds.

Lucas: I don’t always feel that way but why do you think that?

Mike: This gets into the weeds of theories about consciousness. It’s this idea that we tend to think of consciousness on the human scale: is the human condition good or bad, is the balance of human experience on the good end, the heavenly end, or the hellish end. But if we do have an objective theory of consciousness, we should be able to point it at things that are not human and even things that are not biological. It may seem like a type error to do this, but we should be able to point it at stars and black holes and quantum fuzz. My pet theory, which is totally not validated but is falsifiable, and this gets into Bostrom’s simulation hypothesis, is that if we tally up the good valence and the bad valence in the universe, first of all, the human stuff might just be a rounding error.

Most of the value, and in this I include both positive and negative valence, is found elsewhere, not in humanity. And second of all, I have this list in the last appendix of Principia Qualia of where massive amounts of consciousness could be hiding in the cosmological sense. I’m very suspicious that the big bang starts with a very symmetrical state; I’ll just leave it there. In a utilitarian sense, if you want to get a sense of whether we live in a place closer to heaven or hell, we should actually get a good theory of consciousness and point it at things that are not human; cosmological-scale events or objects would be very interesting to point it at. This would give a much clearer answer than human intuition as to whether we live somewhere closer to heaven or hell.

Lucas: All right, great. You guys have been super generous with your time and I’ve really enjoyed this and learned a lot. Is there anything else you guys would like to wrap up on?

Mike: I would just like to say, yeah, thank you so much for the interview and for reaching out and making this happen. It’s been really fun on our side too.

Andrés: Yeah, wonderful questions, and it’s very rare for an interviewer to have unconventional views of identity to begin with, so it was really fun. Really appreciate it.

Lucas: Would you guys like to go ahead and plug anything? What’s the best place to follow you guys, Twitter, Facebook, blogs, website?

Mike: Our website is qualiaresearchinstitute.org, and we’re working on getting a PayPal donate button up, but in the meantime you can send us some crypto. We’re building out the organization, and if you want to read our stuff, a lot of it is linked from the website. You can also read my stuff at my blog, opentheory.net, and Andrés’ at qualiacomputing.com.

Lucas: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: An Overview of Technical AI Alignment with Rohin Shah (Part 2)

The space of AI alignment research is highly dynamic, and it’s often difficult to get a bird’s eye view of the landscape. This podcast is the second of two parts attempting to partially remedy this by providing an overview of technical AI alignment efforts. In particular, this episode seeks to continue the discussion from Part 1 by going into more depth with regards to the specific approaches to AI alignment. In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:

  • Embedded agency
  • The field of “getting AI systems to do what we want”
  • Ambitious value learning
  • Corrigibility, including iterated amplification, debate, and factored cognition
  • AI boxing and impact measures
  • Robustness through verification, adversarial ML, and adversarial examples
  • Interpretability research
  • Comprehensive AI Services
  • Rohin’s relative optimism about the state of AI alignment

You can take a short (3 minute) survey to share your feedback about the podcast here.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today’s episode is the second part of our two part series with Rohin Shah, developing an overview of technical AI alignment efforts. If you haven’t listened to the first part, we highly recommend that you do, as it provides an introduction to the varying approaches discussed here. The second part is focused on exploring AI alignment methodologies in more depth, and nailing down the specifics of the approaches and lenses through which to view the problem.

In this episode, Rohin will begin by moving sequentially through the approaches discussed in the first episode. We’ll start with embedded agency, then discuss the field of getting AI systems to do what we want, and we’ll discuss ambitious value learning alongside this. Next, we’ll move to corrigibility, in particular, iterated amplification, debate, and factored cognition.

Next we’ll discuss placing limits on AI systems, things of this nature would be AI boxing and impact measures. After this we’ll get into robustness which consists of verification, adversarial machine learning, and adversarial examples to name a few.

Next we’ll discuss interpretability research, and finally comprehensive AI services. By listening to the first part of the series, you should have enough context for these materials in the second part. As a bit of an announcement, I’d love for this podcast to be particularly useful and interesting for its listeners. So I’ve gone ahead and drafted a short three minute survey that you can find linked on the FLI page for this podcast, or in the description wherever you might find this podcast. As always, if you find this podcast interesting or useful, please make sure to like, subscribe, and follow us on your preferred listening platform.

For those of you that aren’t already familiar with Rohin, he is a fifth year PhD student in computer science at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. With that, we’re going to start off by moving sequentially through the approaches just enumerated. All right, then let’s go ahead and begin with the first one, which I believe was embedded agency.

Rohin: Yeah, so embedded agency. I kind of want to just defer to the embedded agency sequence, because I’m not going to do anywhere near as good a job as it does. But the basic idea is that we would like to have this sort of theory of intelligence, and one major blocker to this is the fact that all of our current theories, most notably reinforcement learning, make the assumption that there is a nice clean boundary between the environment and the agent. It’s sort of like the agent is playing a video game, and the video game is the environment. There’s no way for the environment to actually affect the agent. The agent has this defined input channel, takes actions, those actions get sent to the video game environment, the video game environment does stuff based on that and creates an observation, and that observation is then sent back to the agent, who gets to look at it, and there’s this very nice, clean abstraction there. The agent could be bigger than the video game, in the same way that I’m bigger than tic-tac-toe.

I can actually simulate the entire game tree of tic-tac-toe and figure out what the optimal policy for tic-tac-toe is. There’s actually this cool XKCD that just shows you the entire game tree; it’s great.

So in the same way in the video game setting, the agent can be bigger than the video game environment, in that it can have a perfectly accurate model of the environment and know exactly what its actions are going to do. So there are all of these nice assumptions that we get in video game environment land, but in real world land, these don’t work. If you consider me on the Earth, I cannot have an exact model of the entire environment because the environment contains me inside of it, and there is no way that I can have a perfect model of me inside of me. That’s just not a thing that can happen. Not to mention having a perfect model of the rest of the universe, but we’ll leave that aside even.

There’s the fact that it’s not super clear what exactly my action space is. Once there is a laptop available to me, does using the laptop count as part of my action space? Do we only talk about motor commands I can give to my limbs? But then what happens if I suddenly get uploaded and now I just don’t have any limbs anymore? What happened to my actions, are they gone? So embedded agency broadly factors this question out into four subproblems. I associate them with colors, because that’s what Scott and Abram do in their sequence. The red one is decision theory. Normally decision theory is: consider all possible actions, simulate their consequences, choose the one that will lead to the highest expected utility. This is not a thing you can do when you’re an embedded agent, because the environment could depend on what policy you use.

The classic example of this is Newcomb’s problem, where part of the environment is an all-powerful being, Omega. Omega is able to predict you perfectly, so it knows exactly what you’re going to do, and Omega is 100% trustworthy, and all those nice simplifying assumptions. Omega provides you with the following game. He’s going to put two transparent boxes in front of you. The first box will always contain $1,000, and the second box will either contain a million dollars or nothing, and you can see this because they’re transparent. You’re given the option to either take one of the boxes or both of the boxes, and you just get whatever’s inside of them.

The catch is that Omega only puts the million dollars in the box if he predicts that you would take only the box with the million dollars in it, and not the other box. So now you see the two boxes, and you see that one box has a million dollars, and the other box has a thousand dollars. In that case, should you take both boxes? Or should you just take the box with the million dollars? So the way I’ve set it up right now, it’s logically impossible for you to do anything besides take the million dollars, so maybe you’d say okay, I’m logically required to do this, so maybe that’s not very interesting. But you can relax this to a problem where Omega is 99.999% likely to get the prediction right. Now in some sense you do have agency. You could choose both boxes and it would not be a logical impossibility, and you know, both boxes are there. You can’t change the amounts that are in the boxes now. Man, you should just take both boxes because it’s going to give you $1,000 more. Why would you not do that?

But I claim that the correct thing to do in this situation is to take only one box because the fact that you are the kind of agent who would only take one box is the reason that the one box has a million dollars in it anyway, and if you were the kind of agent that did not take one box, took two boxes instead, you just wouldn’t have seen the million dollars there. So that’s the sort of problem that comes up in embedded decision theory.
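To make the 99.999% version of the problem concrete, here is a quick sketch of the expected payoffs of the two fixed strategies against a fallible Omega. The numbers and framing are my own illustration, not from the episode:

```python
# Expected payoffs in Newcomb's problem against an Omega that
# predicts your choice correctly with probability p (toy numbers).

def expected_value(strategy, p=0.99999):
    small, big = 1_000, 1_000_000
    if strategy == "one-box":
        # With probability p, Omega predicted one-boxing and filled the big box.
        return p * big + (1 - p) * 0
    else:
        # With probability p, Omega predicted two-boxing and left it empty.
        return p * small + (1 - p) * (small + big)

print(expected_value("one-box"))   # about 999,990
print(expected_value("two-box"))   # about 1,010
```

Under this framing, being the kind of agent who one-boxes comes out far ahead, which is the intuition behind Rohin’s claim.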

Lucas: Even though it’s a thought experiment, there’s a sense, though, in which the agent in the thought experiment is embedded in a world where he’s making the observation of boxes that have a million dollars in them, with a genie posing these situations?

Rohin: Yeah.

Lucas: I’m just seeking clarification on the embeddedness of the agent and Newcomb’s problem.

Rohin: The embeddedness is because the environment is able to predict exactly, or with close to perfect accuracy what the agent could do.

Lucas: The genie being the environment?

Rohin: Yeah, Omega is part of the environment. You’ve got you, the agent, and everything else, the environment, and you have to make good decisions. We’ve only been talking about how the boundary between agent and environment isn’t actually all that clear. But to the extent that it’s sensible to talk about you being able to choose between actions, we want some sort of theory for how to do that when the environment can contain copies of you. So you could think of Omega as simulating a copy of you and seeing what you would do in this situation before actually presenting you with a choice.

So we’ve got the red decision theory, then we have yellow embedded world models. With embedded world models, the problem that you have is that, normally in our nice video game environment, we can have an exact model of how the environment is going to respond to our actions. Even if we don’t know it initially, we can learn it over time, and then once we have it, it’s pretty easy to see how you could plan in order to do the optimal thing. You can trial your actions, simulate them all, and then see which one does the best and do that one. This is roughly how AIXI works. AIXI is the model of the optimally intelligent RL agent in these sorts of video game environment settings.

Once you’re in embedded agency land, you cannot have an exact model of the environment, because for one thing the environment contains you and you can’t have an exact model of you, but also the environment is large, and you can’t simulate it exactly. The big issue is that it contains you. So how you get any sort of sensible guarantees on what you can do, even though the environment can contain you, is the problem of embedded world models. You still need a world model. It can’t be exact because it contains you. Maybe you could do something hierarchical where things are fuzzy at the top, but then you can focus in on each particular level of the hierarchy in order to get more and more precise about each particular thing. Maybe this is sufficient? Not clear.

Lucas: So in terms of human beings though, we’re embedded agents that are capable of creating robust world models that are able to think about AI alignment.

Rohin: Yup, but we don’t know how we do it.

Lucas: Okay. Are there any sorts of understandings that we can draw from our experience?

Rohin: Oh yeah, I’m sure there are. There’s a ton of work on this that I’m not that familiar with, probably in cognitive science or psychology or neuroscience; all of these fields I’m sure will have something to say about it. Hierarchical world models in particular are pretty commonly talked about as interesting. I know that there’s a whole field of hierarchical reinforcement learning in AI that’s motivated by this, but I believe it’s also talked about in other areas of academia, and I’m sure there are other insights to be gained from there as well.

Lucas: All right, let’s move on then from hierarchical world models.

Rohin: Okay. Next is blue robust delegation. So with robust delegation, the basic issue here, so we talked about Vingean reflection a little bit in the first podcast. This is a problem that falls under robust delegation. The headline difficulty under robust delegation is that the agent is able to do self improvement, it can reason about itself and do things based on that. So one way you can think of this is that instead of thinking about it as self modification, you can think about it as the agent is constructing a new agent to act at future time steps. So then in that case your agent has the problem of how do I construct an agent for future time steps such that I am happy delegating my decision making to that future agent? That’s why it’s called robust delegation. Vingean reflection in particular is about how can you take an AI system that uses a particular logical theory in order to make inferences and have it move to a stronger logical theory, and actually trust the stronger logical theory to only make correct inferences?

Stated this way, the problem is impossible because of well known results in logic: a theory cannot prove the consistency of even itself, and as a corollary cannot prove the consistency of any stronger theory. Intuitively, in this pretty simple example, we don’t know how to get an agent that can trust a smarter version of itself. You should expect this problem to be hard, right? It’s in some sense dual to the problem that we have of AI alignment, where we’re creating something smarter than us, and we need it to pursue the things we want it to pursue, but it’s a lot smarter than us, so it’s hard to tell what it’s going to do.

So I think of this as a version of the AI alignment problem, but applied to the case of some embedded agent reasoning about itself, and making a better version of itself in the future. So I guess we can move on to the green section, which is subsystem alignment. The tagline for subsystem alignment would be: the embedded agent is going to be made out of parts. It’s not this sort of unified coherent object. It’s got different pieces inside of it, because it’s embedded in the environment, and the environment is made of pieces that make up the agent, and it seems likely that your AI system is going to be made up of different cognitive subparts, and it’s not clear that those subparts will integrate together into a unified whole such that the unified whole is pursuing a goal that you like.

It could be that each individual subpart has its own goal and they’re all competing with each other in order to further their own goals, and that the aggregate overall behavior is usually good for humans, at least in our current environment. But as the environment changes, which it will due to technological progression, one of the parts might just win out and be optimizing some goal that is not anywhere close to what we wanted. A more concrete example would be: one way that you could imagine building a powerful AI system is to have a world model that is rewarded for making accurate predictions about what the world will look like, and then you have a decision making model, which has a normal reward function that we program in, and tries to choose actions in order to maximize that reward. So now we have an agent that has two subsystems in it.

You might worry, for example, that once the world model gets sufficiently powerful, it starts realizing that the decision-making component depends on its output in order to make decisions, and that it can trick it into making the world easier to predict. So maybe it gives the decision maker models of the world that say: make everything look red, or make everything black, and then you will get high reward somehow. If the agent then actually takes that action and makes everything black, and now everything looks black forever more, then the world model can very easily predict: no matter what action you take, the world is just going to look black. That’s what the world is now, and that gets the highest possible reward. That’s a somewhat weird story for what could happen, but there’s no strong argument that says nope, this will definitely not happen.

Lucas: So in total, what is the work that has been done here on inner optimizers?

Rohin: Clarifying that they could exist. I’m not sure if there has been much work on it.

Lucas: Okay. So this is our fourth cornerstone here in this embedded agency framework, correct?

Rohin: Yup, and that is the last one.

Lucas: So surmising these all together, where does that leave us?

Rohin: So I think my main takeaway is that I’m now much more strongly agreeing with MIRI: yup, we are confused about how intelligence works. That’s probably the main thing, that we are confused about how intelligence works.

Lucas: What is this picture that I guess is conventionally held of what intelligence is that is wrong? Or confused?

Rohin: I don’t think there’s a thing that’s wrong about the conventional picture. So you could talk about a definition of intelligence as being able to achieve arbitrary goals. I think Eliezer says something like cross domain optimization power, and I think that seems broadly fine. It’s more that we don’t know how intelligence is actually implemented, and I don’t think we ever claimed to know that, but embedded agency is like: we really don’t know it. You might’ve thought that we were making progress on figuring out how intelligence might be implemented with classical decision theory, or the Von Neumann–Morgenstern utility theorem, or results like the value of perfect information always being non-negative.
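As an aside, the value-of-perfect-information result just mentioned can be illustrated with a tiny toy calculation, using made-up payoffs of my own: a rational decision maker can only do better, never worse, by learning the state before acting.

```python
# Toy illustration that the value of perfect information is non-negative.
# One risky action (A) and one safe action (B); the state is "good"
# with probability 0.5. All numbers are illustrative.

p_good = 0.5
payoff = {"A": {"good": 1.0, "bad": 0.0},
          "B": {"good": 0.5, "bad": 0.5}}

def ev(action):
    """Expected payoff of an action under the prior."""
    return p_good * payoff[action]["good"] + (1 - p_good) * payoff[action]["bad"]

# Without information: commit to the single best action up front.
ev_without_info = max(ev("A"), ev("B"))

# With perfect information: pick the best action separately in each state.
ev_with_info = (p_good * max(payoff[a]["good"] for a in payoff)
                + (1 - p_good) * max(payoff[a]["bad"] for a in payoff))

value_of_information = ev_with_info - ev_without_info   # here 0.25, never negative
```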

You might’ve thought that we were making progress on it, even if we didn’t fully understand it yet, and then you read embedded agency and you’re like: no, actually there are lots more conceptual problems that we have not even begun to touch yet. Well, MIRI has begun to touch them, I would say, but we really don’t have good stories for how any of these things work. Classically we just don’t have a description of how intelligence works. MIRI’s like: even the small threads of things we thought about how intelligence could work are definitely not the full picture, and there are problems with them.

Lucas: Yeah, I mean just on simple reflection, it seems to me that in terms of the more confused conception of intelligence, it sort of models it more naively as we were discussing before, like the simple agent playing a computer game with these well defined channels going into the computer game environment.

Rohin: Yeah, you could think of AIXI for example as a model of how intelligence could work theoretically. The sequence is like: no, here is why that’s not a sufficient theoretical model.

Lucas: Yeah, I definitely think that it provides an important conceptual shift. So we have these four cornerstones, and it’s illuminating in this way. Are there any more conclusions or wrap up you’d like to do on embedded agency before we move on?

Rohin: Maybe I just want to add a disclaimer that MIRI is notoriously hard to understand, and I don’t think this is different for me. It’s quite plausible that there is a lot of work that MIRI has done, and a lot of progress that MIRI has made, that I either don’t know about or know about but don’t properly understand. So I know I’ve been saying I want to defer to people a lot, or I want to be uncertain a lot, but on MIRI I especially want to do so.

Lucas: All right, so let’s move on to the next one within this list.

Rohin: The next one was doing what humans want. How do I summarize that? I read a whole sequence of posts on it. I guess the story for success, to the extent that we have one right now is something like use all of the techniques that we’re developing, or at least the insights from them, if not the particular algorithms to create an AI system that behaves corrigibly. In the sense that it is trying to help us achieve our goals. You might be hopeful about this because we’re creating a bunch of algorithms for it to properly infer our goals and then pursue them, so this seems like a thing that could be done. Now, I don’t think we have a good story for how that happens. I think there are several open problems that show that our current algorithms are insufficient to do this. But it seems plausible that with more research we could get to something like that.

There’s not really a good overall summary of the field because it’s more like a bunch of people separately having a bunch of interesting ideas and insights, and I mentioned a bunch of them in the first part of the podcast already. Mostly because I’m excited about these and I’ve read about them recently, so I just sort of start talking about them whenever they seem even remotely relevant. But to reiterate them, there is the notion of analyzing the human-AI system together as pursuing some sort of goal, or being collectively rational, as opposed to having an individual AI system that is individually rational. That’s been somewhat formalized in Cooperative Inverse Reinforcement Learning. Typically with inverse reinforcement learning, so not the cooperative kind, you have a human, the human is sort of exogenous, the AI doesn’t know that they exist, and the human creates a demonstration of the sort of behavior that they want the AI to do. If you’re thinking about robotics, it’s picking up a coffee cup, or something like this. Then the robot just sort of sees this demonstration; it comes out of thin air, it’s just data that it gets.

Let’s say that I had executed this demonstration: what reward function would I have been optimizing? And then it figures out a reward function, and then it uses that reward function however it wants. Usually you would then use reinforcement learning to optimize that reward function and recreate the behavior. So that’s normal inverse reinforcement learning. Notable here is that you’re not considering the human and the robot together as a full collective system. The human is sort of exogenous to the problem, and also notable is that the robot is sort of taking the reward to be something that it has, as opposed to something that the human has.

So CIRL basically says, no, no, no, let’s not model it this way. The correct thing to do is to have a two player game that’s cooperative between the human and the robot, and now the human knows the reward function and is going to take actions somehow. They don’t necessarily have to be demonstrations. But the human knows the reward function and will be taking actions. The robot on the other hand does not know the reward function, and it also gets to take actions, and the robot keeps a probability distribution over the reward that the human has, and updates this overtime based on what the human does.

Once you have this, you get this sort of nice, interactive behavior where the human is taking actions that teach the robot about the reward function. The robot learns the reward function over time and then starts helping the human achieve his or her goals. This sort of teaching and learning behavior comes simply from the assumption that the human and the robot are both playing the game optimally, such that the reward function gets optimized as best as possible. So you get this teaching and learning behavior from the normal notion of optimizing a particular objective, just from having the objective be a thing that the human knows, but not a thing that the robot knows. One thing that, I don’t know if CIRL introduced it, but it was one of the key aspects of CIRL, was having a probability distribution over reward functions, so you’re uncertain about what reward you’re optimizing.
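To illustrate the reward-uncertainty idea, here is a minimal sketch, my own and not from the CIRL paper, of a robot that keeps a belief over two hypothetical candidate reward functions and updates it by watching a noisily rational human act:

```python
import math

# Toy sketch of the key CIRL idea: the robot maintains a probability
# distribution over candidate reward functions and does a Bayesian
# update after observing the human, modeled as softmax-rational.

rewards = {  # hypothetical candidate reward functions over two actions
    "likes_coffee": {"fetch_coffee": 1.0, "fetch_tea": 0.0},
    "likes_tea":    {"fetch_coffee": 0.0, "fetch_tea": 1.0},
}
belief = {name: 0.5 for name in rewards}   # uniform prior

def update(belief, observed_action, beta=5.0):
    """Bayesian update after seeing the human take observed_action."""
    posterior = {}
    for name, prob in belief.items():
        values = rewards[name]
        # Likelihood of the action under a softmax-rational human.
        z = sum(math.exp(beta * v) for v in values.values())
        likelihood = math.exp(beta * values[observed_action]) / z
        posterior[name] = prob * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

belief = update(belief, "fetch_tea")
# After seeing the human fetch tea, most of the mass shifts to "likes_tea".
```

The interactive teaching behavior in CIRL comes from both players optimizing through this kind of belief, but the update itself is the basic mechanism.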

This seems to give a bunch of nice properties. In particular, once the human starts taking actions like trying to shut down the robot, then the robot’s going to think: okay, if I knew the correct reward function, I would be helping the human, and given that the human is trying to turn me off, I must be wrong about the reward function, I’m not helping, so I should actually just let the human turn me off, because that’s what would achieve the most reward for the human. So you no longer have this incentive to disable your shutdown button in order to keep optimizing. Now this isn’t exactly right, because better than both of those options is to disable the shutdown button, stop doing whatever it is you were doing because it was clearly bad, and then just observe humans for a while until you can narrow down what their reward function actually is, and then you go and optimize that reward, and behave like a traditional goal directed agent. This sounds bad. It doesn’t actually seem that bad to me under the assumption that the true reward function is a possibility that the robot is considering and has a reasonable amount of support in the prior.

Because in that case, once the AI system eventually narrows down on the reward function, it will be either the true reward function, or a reward function that’s basically indistinguishable from it, because otherwise there would be some other information that it could gather in order to distinguish between them. So you actually would get good outcomes. Now of course in practice it seems likely that we would not be able to specify the space of reward functions well enough for this to work. I’m not sure about that point. Regardless, it seems like there’s been some sort of conceptual advance here: when the AI’s trying to do something for the human, it doesn’t have the incentive to disable the shutdown button, the survival incentive.
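The shutdown argument can be turned into a two-line expected-value check. This is a toy version with made-up numbers, loosely in the spirit of the off-switch setup rather than anything stated in the episode:

```python
# Toy sketch of why reward uncertainty makes a robot accept shutdown.
# The robot is unsure whether its current plan is good (+10) or bad
# (-100) for the human. The human, who knows the truth, presses the
# off switch only when the plan is bad; shutdown yields 0.

def ev_ignore_human(p_good):
    """Disable the switch and execute the plan regardless."""
    return p_good * 10 + (1 - p_good) * (-100)

def ev_defer_to_human(p_good):
    """Let the human decide: the plan runs only when it is actually good."""
    return p_good * 10 + (1 - p_good) * 0

print(ev_ignore_human(0.5))    # -45.0
print(ev_defer_to_human(0.5))  # 5.0
```

Under the robot’s own uncertain estimate of the reward, deferring to the human scores higher, which is the shape of the incentive argument above.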

So while maybe reward uncertainty is not exactly the right way to do it, it seems like you could do something analogous that doesn’t have the problems that reward uncertainty does.

One other thing that’s kind of in this vein, but a little bit different, is the idea of an AI system that infers and follows human norms, and the reason we might be optimistic about this is because humans seem to be able to infer and follow norms pretty well. I don’t think humans can infer the values that some other human is trying to pursue and then optimize them to lead to good outcomes. We can do that to some extent. Like, I can infer that someone is trying to move a cabinet, and then I can go help them move that cabinet. But in terms of their long term values or something, it seems pretty hard to infer and help with those. But norms we do in fact infer and follow all the time. So we might think that’s an easier problem, such that our AI systems could do it as well.

Then the story for success is basically that with these AI systems, we are able to accelerate technological progress as before, but the AI systems behave in a relatively human-like manner. They don’t do really crazy things that a human wouldn’t do, because that would be against our norms. With the accelerated technological progress, we get to the point where we can colonize space, or whatever else it is you want to do with the future. Perhaps even along the way we do enough AI alignment research to build an actual aligned superintelligence.

There are problems with this idea. Most notably if you accelerate technological progress, bad things can happen from that, and norm following AI systems would not necessarily stop that from happening. Also to the extent that if you think human society, if left to its own devices would lead to something bad happening in the future, or something catastrophic, then a norm following AI system would probably just make that worse, in that it would accelerate that disaster scenario, without really making it any better.

Lucas: AI systems in a vacuum that are simply norm following seem to have some issues, but it seems like an important tool in the toolkit of AI alignment to have AIs which are capable of modeling and following norms.

Rohin: Yup. That seems right. Definitely agree with that. I don’t think I had mentioned the reference on this. So for this one I would recommend people look at Incomplete Contracting and AI Alignment, I believe is the name of the paper, by Dylan Hadfield-Menell and Gillian Hadfield, or also my post about it in the Value Learning sequence.

So far I’ve been talking about sort of high level conceptual things within the ‘get AI systems to do what we want’ area. There are also a bunch of more concrete technical approaches. There’s inverse reinforcement learning, and deep reinforcement learning from human preferences, where you basically get a bunch of comparisons of behavior from humans, and use that to infer a reward function that your agent can optimize. There’s recursive reward modeling, where you take the task that you are trying to do, and then you consider a new auxiliary task of evaluating your original task. So maybe if you wanted to train an AI system to write fantasy books, well, if you were to give human feedback on that, it would be quite expensive, because you’d have to read the entire fantasy book and then give feedback. But you could instead outsource the task of evaluating fantasy books: you could recursively apply this technique and train a bunch of agents that can summarize the plot of a book or comment on the prose of the book, or give a one page summary of the character development.

Then you can use all of these AI systems to help you give feedback on the original AI system that’s trying to write a fantasy book. So that’s recursive reward modeling. I guess going a bit back into the conceptual territory, I wrote a paper recently on learning preferences from the state of the world. So the intuition there is that the AI systems that we create aren’t being created into a brand new world. They’re being instantiated in a world where we have already been acting for a long time. So the world is already optimized for our preferences, and as a result, our AI systems can just look at the world and infer quite a lot about our preferences. So we gave an algorithm that does this in some simple environments.

Lucas: Right, so again, this covers the conceptual category of methodologies of AI alignment where we’re trying to get AI systems to do what we want?

Rohin: Yeah, current AI systems in a sort of incremental way, without assuming general intelligence.

Lucas: And there’s all these different methodologies which exist in this context. But again, this is all sort of within this other umbrella of just getting AI to do things we want them to do?

Rohin: Yeah, and you can actually compare across all of these methods on particular environments. This hasn’t really been done so far, but in theory it can be done, and I’m hoping to do it at some point in the future.

Lucas: Okay. So we’ve discussed embedded agency, we’ve discussed this other category of getting AIs to do what we want them to do. Just moving forward here through diving deep on these approaches.

Rohin: I think the next one I wanted to talk about was ambitious value learning. So here the basic idea is that we’re going to build a superintelligent AI system, and it’s going to have goals, because that’s what the Von Neumann–Morgenstern theorem tells us: anything with preferences, if they’re consistent and coherent, which they should be for a superintelligent system, or at least as far as we can tell they should be consistent, has a utility function. So, natural thought: why don’t we just figure out what the right utility function is, and put it into the AI system?

So there’s a lot of good arguments that you’re not going to be able to get the one correct utility function, but I think Stuart’s hope is that you can find one that is sufficiently good or adequate, and put that inside of the AI system. In order to do this, I believe the goal is to learn the utility function by looking at both human behavior as well as the algorithm that human brains are implementing. So if you see that the human brain, when it knows that something is going to be sweet, tends to eat more of it, then you can infer that humans like to eat sweet things, as opposed to humans really disliking eating sweet things but being really bad at optimizing their utility function. In this project of ambitious value learning, you also need to deal with the fact that human preferences can be inconsistent, and that the AI system can manipulate human preferences. The classic example of that would be: the AI system could give you a shot of heroin, and that would probably change your preferences from ‘I do not want heroin’ to ‘I do want heroin.’ So what does it even mean to optimize for human preferences when they can just be changed like that?

So I think the next one was corrigibility and the associated iterated amplification and debate basically. I guess factored cognition as well. To give a very quick recap, the idea with corrigibility is that we would like to build an AI system that is trying to help us, and that’s the property that we should aim for as opposed to an AI system that actually helps us.

One motivation for focusing on this weaker criterion is that it seems quite difficult to create a system that knowably actually helps us, because that means that you need to have confidence that your AI system is never going to make mistakes. That seems like quite a difficult property to guarantee. In addition, if you don’t make some assumption on the environment, then there’s a no free lunch theorem that says this is impossible. Now it’s probably reasonable to put some assumption on the environment, but it’s still true that your AI system could have reasonable beliefs based on past experience, and nature still throws it a curveball, and that leads to some sort of bad outcome happening.

While we would like this to not happen, it also seems hard to avoid, and also probably not that bad. It seems like the worst outcomes come when your superintelligent system is applying all of its intelligence in pursuit of its own goal. That’s the thing that we should really focus on. That conception of what we want to enforce is probably the thing that I’m most excited about. Then there are particular algorithms that are meant to create corrigible agents, assuming we have the capabilities to get general intelligence. So one of these is iterated amplification.

Iterated amplification is really more of a framework that describes particular methods of training systems. In particular, you alternate between amplification and distillation steps. You start off with an agent that we’re going to assume is already aligned. So this could be a human. A human is a pretty slow agent. So the first thing we’re going to do is distill the human down into a fast agent. We could use something like imitation learning, or maybe inverse reinforcement learning followed by reinforcement learning, or something like that, in order to train a neural net or some other AI system that mostly replicates the behavior of our human, and remains aligned. By aligned maybe I mean corrigible, actually. We start with a corrigible agent, and then we produce agents that continue to be corrigible.

Probably the resulting agent is going to be a little less capable than the one that you started out with, just because if the best you can do is to mimic the agent that you started with, that gives you exactly as much capability as that agent. So if you don’t succeed at properly mimicking it, then you’re going to be a little less capable. Then you take this fast agent and you amplify it, such that it becomes a lot more capable, at perhaps the cost of being a lot slower to compute.

One way that you could imagine doing amplification would be to have a human get a top level task, and for now we’ll assume that the task is question answering, so they get this top level question and they say, okay, I could answer this question directly, but let me make use of this fast agent that we have from the last turn. We’ll make a bunch of sub questions that seem relevant for answering the overall question, ask our distilled agent to answer all of those sub questions, and then using those answers, the human can make a decision for their top level question. It doesn’t have to be the human. You could also have a distilled agent at the top level if you want.

I think having the human there seems more likely. So with this amplification you’re basically using the agent multiple times, letting it reason for longer in order to get a better result. So the resulting system of the human plus many copies of the agent is more capable than the original distilled agent, but also slower. So we started off with something, let’s call it capability level five, and then we distilled it and it became capability level four, but it was a lot faster. Then we amplified it and maybe now it’s capability level eight, but it’s a lot slower. So we can distill it again and get something at capability level seven that’s pretty fast, and then amplify it again, and so on and so forth. So the hope is that this would allow us to continually train an agent that can reach arbitrary levels of capability that are actually physically possible, while remaining aligned or corrigible the entire time.
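The amplify/distill alternation in the last two paragraphs can be sketched as a toy loop. The capability numbers and the fixed distill/amplify deltas are purely illustrative bookkeeping, not a real training procedure:

```python
# Illustrative sketch of the amplify/distill alternation Rohin walks through.

def distill(capability, loss=1.0):
    """Train a fast agent to mimic a slower one; mimicry is imperfect."""
    return capability - loss

def amplify(capability, gain=4.0):
    """Human + many copies of the fast agent: more capable, but slower."""
    return capability + gain

cap = 5.0                    # start: a (slow) aligned human
history = [cap]
for _ in range(3):
    cap = distill(cap)       # e.g. 5 -> 4, faster but slightly weaker
    cap = amplify(cap)       # e.g. 4 -> 8, stronger but slower
    history.append(cap)
print(history)  # [5.0, 8.0, 11.0, 14.0]
```

The hope, as in the transcript, is that each full cycle nets a capability gain while the corrigibility property is preserved at every step.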

Lucas: Yeah, that’s quite a great picture. I’m happy to jump into debate next, and then I’d like to talk about the potential relationship between the two.

Rohin: Sure. So with debate, the key hope that it is based on is that it is easier to point out a lie than it is to lie in the first place. This would be under the assumption that if you’ve got two agents, one agent trying to lie and the other agent trying to detect when the first agent is lying, the second agent would need to have access to all of the information that the first agent is using to make its decision. But if that can happen, then we would hope that it is easier to detect lies than it is to actually lie. If this is actually true, we can imagine the following training process. We give a top level question, and then we have two AI systems debate what the correct answer to that top level question is. One of the agents initially says the answer is 12, then the other agent says no, the answer is 15, and then they start arguing about why the answer should be 12 versus 15.

Eventually, after some amount of time, a human is asked to judge this debate and say which of the AI systems provided more true and useful information. Whichever agent the human decided was correct gets a reward of plus one, and the other one gets a reward of minus one, and is then updated, presumably by gradient descent or something. So the hope here is basically that the two AI systems that are debating each other are about equally powerful, such that if either one of them tries to lie or manipulate the user in some way, the other AI system would be able to point that out, and then the human would give the reward to the AI system that pointed it out. So at equilibrium you should expect that both of the agents are just trying to give you true and useful information, and they’re not trying to lie, because if they do try to lie, then the other one will point it out.
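The zero-sum reward structure described here can be sketched as follows; the judge and the transcript format are illustrative placeholders, not an actual debate implementation:

```python
# Minimal sketch of the debate reward structure from the episode.

def debate_round(agent_a_answer, agent_b_answer, transcript, human_judge):
    """Two agents argue; the human judges; winner gets +1, loser -1 (zero-sum)."""
    winner = human_judge(agent_a_answer, agent_b_answer, transcript)
    if winner == "A":
        return +1, -1
    return -1, +1

# A toy judge that sides against whichever agent the arguments exposed as lying.
def toy_judge(a, b, transcript):
    return "A" if transcript["lie_detected_in"] != "A" else "B"

# Agent A claims the answer is 12, agent B claims 15; B gets caught lying.
r_a, r_b = debate_round(12, 15, {"lie_detected_in": "B"}, toy_judge)
print(r_a, r_b)  # 1 -1
```

In a real system the rewards would feed into gradient updates; the equilibrium hope is that lying is never worth it because the opponent profits from exposing it.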

Lucas: Right. So there’s this important feature here where, as you mentioned, the claim is that it’s easier to spot lies than to make them, and this sort of asymmetry is one of the motivations which says that generally it’ll be easier to tell when agents are telling the truth rather than making up a lie.

Rohin: As long as you have another AI system that can point this out. Certainly a superintelligent AI system could lie to me and I wouldn’t be able to tell, probably, but it’s a lot harder for a superintelligent AI system to lie to me when I have another superintelligent AI system that’s trying to point out the lies that the first one makes.

Lucas: Right. So now I think we can go ahead and cover its relationship to iterated amplification?

Rohin: Sure. There is actually quite a close relationship between the two, even though it doesn’t seem like it at first sight. The hope with both of them is that your AI systems will learn to do human-like reasoning, but on a much larger scale than humans can do. In particular, consider the following kind of agent. You have a human who is given a top level question that they have to answer, and that human can create a bunch of sub questions and then delegate each of those sub questions to another copy of the same human, initialized from scratch or something like that, so they don’t know what the top level human has thought.

Then they have to answer the sub question, but they too can delegate to another human further down the line, and so on; you can just keep delegating down until you get to questions that are so easy that the human can just straight up answer them. So I’m going to call this structure a deliberation tree, because it’s a sort of tree of considerations such that at every node, the answer to that node can be computed from the answers to the children nodes, plus a short bit of human reasoning that happens at that node.

In iterated amplification, what’s basically happening is you start with the leaf nodes: the human agent. There’s just a human agent, and they can answer questions quickly. Then when you amplify it the first time, you get trees of depth one, where at the top level there’s a human who can delegate sub questions out, but those sub questions have to be answered by an agent that was trained to be like a human. So you’ve got something that approximates depth-one human deliberation trees. Then after another round of distillation and amplification, you’ve got a human delegating to agents that were trained to mimic humans that could delegate to agents that were trained to mimic humans: an approximate version of a depth-two deliberation tree.

So iterated amplification is basically just building up the depth of the tree that the agent is approximating. But we hope that these deliberation trees are always just basically implementing corrigible reasoning, and that eventually once they get deep enough, you get arbitrarily strong capabilities.
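The deliberation-tree picture can be sketched with a toy recursive decomposition, using arithmetic as a stand-in for the “short bit of human reasoning” at each node (purely illustrative):

```python
# Toy sketch of the deliberation-tree picture: each node either answers
# directly or decomposes into sub-questions and combines the answers.

def deliberate(question, depth):
    """question: a nested tuple like ('+', q1, q2), or a number (an easy leaf)."""
    if isinstance(question, (int, float)):
        return question              # easy enough to answer directly
    if depth == 0:
        raise ValueError("tree not deep enough to decompose this question")
    op, left, right = question
    a = deliberate(left, depth - 1)  # delegate sub-question to a copy
    b = deliberate(right, depth - 1)
    return a + b if op == "+" else a * b  # short bit of "human" reasoning

q = ("+", ("*", 3, 4), ("+", 1, 2))   # needs a depth-2 tree to resolve
print(deliberate(q, depth=2))  # 15
```

The depth parameter mirrors the transcript’s point: each round of amplification lets the system approximate one more level of this tree.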

Lucas: Can you offer some clarification as to why one might expect a group of copies of an AI, plus the human to scale to be able to make sure that during distillation, that alignment is retained?

Rohin: That’s an open problem. Whether you can do a distillation step that preserves alignment/corrigibility is a thing that Paul, in a few recent posts in the Iterated Amplification sequence, calls the reward engineering problem. The hope is that if you believe that the amplified agent is corrigible, then it is going to be smarter than the agent that it is trying to train via distillation. So you can actually just use the amplified agent to create a reward signal in order to train an agent during distillation. Since the amplified agent is smarter than the agent you’re distilling, you could plausibly create a reward function that’s not easily gameable, and actually gets the AI system to do what you want. I think that’s the concise, lacking-nuance story of why you might be optimistic about this.

Lucas: All right.

Rohin: So I’ll move on to how debate is also related to this. We talked about how iterated amplification is basically growing the depth of the deliberation trees that the agent is approximating. The human part of this is judging any one node and its children. In debate, on the other hand, you can imagine the same sort of deliberation tree, although now the nodes are more like arguments and counterarguments as opposed to considerations and counter-considerations. But broadly the same thing. So imagine there’s this actual debate tree of ways the debate could possibly go.

Then you could think of the AI systems as choosing a particular path in the debate tree that makes them most likely to win. The key point is that given that the entire question can be resolved by an exponentially sized deliberation tree, if the two AI systems are capable of computing this exponential deliberation tree, then optimal play in the debate game is to go along the path that is going to lead to your victory, even given that the other player is trying to win themselves. The relation between iterated amplification and debate is that they both want the agents to implicitly be able to compute this exponentially sized deliberation tree that humans could not compute, and then use humans to check a particular part of that tree. In iterated amplification you check a parent and its children: you look at that one section of the deliberation tree and make sure that it looks good. In debate, you look at a particular path in the debate tree and judge whether that path is good. One critique of these methods is that it’s not actually clear that an exponentially sized deliberation tree is able to solve all problems that we might care about, especially if the amount of work done at each node is pretty short, like a ten-minute stint of a normal human.

One question that you would care about if you wanted to see whether iterated amplification could work is: can these exponentially sized deliberation trees actually solve hard problems? This is the factored cognition hypothesis: that these deliberation trees can in fact solve arbitrarily complex tasks. Ought is basically working on testing this hypothesis to see whether or not it’s true: finding the tasks which seem hardest to do in this decompositional way, and then seeing if teams of humans can actually figure out how to do them.

Lucas: Do you have an example of what would be one of these tasks that are difficult to decompose?

Rohin: Yeah. Take a bunch of humans who don’t know differential geometry or something, and have them solve the last problem in a textbook on differential geometry. They each only get ten minutes in order to do anything. None of them can read the entire textbook, because that takes way more than ten minutes. I believe Ought is maybe not looking into that one in particular, since that one sounds extremely hard, but they might be doing similar things with works of literature, like trying to answer questions about a book that no one has actually read.

But I remember that Andreas was actually talking about this particular problem that I mentioned as well. I don’t know if they actually decided to do it.

Lucas: Right. So, I mean, just generally in this area, it seems like there are these interesting open questions and considerations about, I guess, the general epistemic efficacy of debate, how good AI and human systems will be at debate, and again, as you just pointed out, whether or not arbitrarily difficult tasks can be solved through this decompositional process. I mean, obviously we do have proofs for much simpler things. Why is there a question as to whether or not it would scale? How would it eventually break?

Rohin: With iterated amplification in particular, if you’re starting with humans who have only ten minutes to look at resources and come up with an answer, the particular thing I would say they might not be able to do is take a math textbook that the humans did not already know and solve the final problem in it. Iterated amplification, to the extent that it starts with a human who’s limited to ten minutes, and whenever it does a decomposition does it with a human who’s limited to ten minutes, would inherit all of the same problems, if those problems are real problems.

Lucas: Cool, that’s interesting. I don’t have anything to add here then. So happy to keep on moving.

Rohin: I think I want to talk about limited AI systems next. So with AI boxing, the hope is to create a box around an AI system that prevents anything from inside the box from going out, and anything from outside the box from going in, except particular input/output channels that we explicitly allow. The hope here is that this allows us to make assumptions about the input/output channels that actually will hold in practice, and can be used to ensure that AIs function more like tool AIs, rather than agential AIs that will do bad things.

I think most literature on boxing has had an AI system in a box, and a human operator outside of it. One interesting proposal I’ve heard recently is to have the human also be inside the box, because if you’ve got information flowing from an AI to a human, that’s a pretty big hole in your box. But if the human is actually inside the box, then there’s not as much that can be done; there’s not as much information flowing into or out of the box.

Lucas: What does it actually mean to have the human locked in the box?

Rohin: One proposal would be to have a human go inside a secured room from which no information can enter or leave, apart from the fact of whether or not the human has entered or left the room. The human can then interact with the AI as normal. As soon as the human leaves the room, the AI’s reward function is zero. It basically just doesn’t care about anything that happens in the future after that. So in this setting you would hope that the AI system has no incentive to affect the outside world, because the only way it can affect the outside world is by sending the human outside. But as soon as it sends the human outside, its reward gets set to zero forever, so it doesn’t care about any of that.

Lucas: But wouldn’t it care before the human went out?

Rohin: Yes, but before the human goes out, there is nothing it can do to influence the outside world because both the human and the AI are in the box. It could influence the human while the human is in the box. But the human also can’t affect the outside world while it’s in the box.

Lucas: Yeah, but the human leaves the box and then can affect the world.

Rohin: Right. But any change to the world that happens after the human leaves, doesn’t make it into the reward function.

Lucas: But it does because the AI system would be planning over time from its initial moment where the reward isn’t set to zero.

Rohin: But your reward can be limited to the next K time steps, or something like this. Basically you’re limited to the point at which the human leaves. That’s the proposal here.
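The truncated-reward proposal can be sketched in a few lines; the reward numbers are illustrative:

```python
# Sketch of the truncated-reward idea: once the human leaves the box,
# all subsequent reward is zeroed, so later world-states carry no weight.

def episode_return(rewards, human_left_at):
    """Sum rewards, zeroing everything from the step the human exits onward."""
    return sum(r for t, r in enumerate(rewards) if t < human_left_at)

rewards = [1.0, 1.0, 1.0, 5.0, 5.0]   # big payoffs arrive after step 3...
print(episode_return(rewards, human_left_at=3))  # 3.0  (...but count for nothing)
```

Under this objective, nothing the agent could cause after the human’s exit changes its return, which is the source of the hoped-for lack of incentive to influence the outside world.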

Lucas: I see, okay. That makes sense.

Rohin: The only information leaving the box, assuming you secure the box in all the normal ways, is whether or not the human has chosen to leave. That’s pretty low information; it doesn’t seem like an AI system would be able to use that to influence the outside world in some ridiculous way. So that’s boxing.

The next thing on my list for limited AGI is impact measures, or side effects. There are also mild optimization and conservative concepts, but let’s start with impact measures. The basic hope is to create some quantification of how much impact a particular action the AI chooses has on the world, and to then penalize the AI for having a lot of impact, so that it only does low impact things, which presumably will not cause catastrophe. One approach to this is relative reachability. With relative reachability, you’re basically trying not to decrease the number of states that you can reach from the current state. So you’re trying to preserve option value. You’re trying to keep the same states reachable.

It’s not okay for you to make one state unreachable as long as you make a different state reachable. You need all of the states that were previously reachable to continue being reachable. The relative part is that the penalty is calculated relative to a baseline that measures what would’ve happened if the AI had done nothing, although there are other possible baselines you could use. The reason you do this is so that we don’t penalize the agent for side effects that happen in the environment anyway. Like, maybe I eat a sandwich, and now the states where there’s a sandwich in front of me are no longer reachable, because I can’t un-eat a sandwich. We don’t want to penalize our AI system for that impact, because then it’ll try to stop me from eating the sandwich. We want to isolate the impact of the agent, as opposed to impacts that were happening in the environment anyway. So that’s why we need the relative part.
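A toy sketch of a relative-reachability-style penalty, comparing against the do-nothing baseline Rohin mentions; the state names are illustrative, and the real measure is more sophisticated than a set difference:

```python
# Toy sketch: penalize only the states the agent's action made unreachable,
# relative to what would have stayed reachable had it done nothing.

def penalty(reachable_after_action, reachable_after_noop):
    """Count states reachable under the baseline but not after the action."""
    return len(reachable_after_noop - reachable_after_action)

baseline = {"vase_intact", "vase_moved", "door_open"}  # do-nothing reachability
after_smash = {"door_open"}            # smashing the vase forecloses vase states
after_sidestep = {"vase_intact", "vase_moved", "door_open"}

print(penalty(after_smash, baseline))     # 2: two states made unreachable
print(penalty(after_sidestep, baseline))  # 0: option value preserved
```

Because the baseline is what would have happened anyway, my eating the sandwich contributes nothing to the agent’s penalty, which is the point of the relative part.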

There is also attainable utility preservation from Alex Turner, which makes two major changes relative to relative reachability. First, instead of talking about reachability of states, it talks about how well you can achieve different utility functions. So if previously you were able to make lots of paperclips, then you want to make sure that you can still make lots of paperclips. If previously you were able to travel across the world within a day, then you want to still be able to travel across the world in a day. So that’s the first change.

The second change is that not only does it penalize decreases in attainable utility, it also penalizes increases in attainable utility. So if previously you could not mine asteroids in order to get their natural resources, you should still not be able to mine asteroids and get their resources. This seems kind of crazy when you first hear it, but the rationale for it is that the convergent instrumental subgoals are all about increases in the power of your AI system. For example, for a broad range of utility functions, it is useful to get a lot of resources and a lot of power in order to achieve those utility functions. Well, if you penalize increases in attainable utility, then you’re going to penalize actions that just broadly get more resources, because those are helpful for many, many, many different utility functions.

Similarly, if you were going to be shut down, but then you disable the shutdown button, well, that just makes it much more possible for you to achieve pretty much every utility function, because instead of being off, you are still on and can take actions. So that also will get heavily penalized, because it led to such a large increase in attainable utilities. So those are, I think, the two main impact measures that I know of.
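A toy sketch of the attainable-utility-preservation idea, penalizing changes in both directions; the auxiliary utility functions and numbers here are illustrative placeholders:

```python
# Sketch of an AUP-style penalty: measure how well the agent *could* pursue
# a set of auxiliary goals before and after an action, and penalize any
# change in either direction (power gains as well as losses).

def aup_penalty(attainable_before, attainable_after):
    """Sum absolute changes in attainable utility across auxiliary goals."""
    return sum(abs(attainable_after[u] - attainable_before[u])
               for u in attainable_before)

before = {"make_paperclips": 10.0, "cross_world_in_a_day": 5.0}
after_grab_resources = {"make_paperclips": 50.0, "cross_world_in_a_day": 25.0}
after_modest_action = {"make_paperclips": 10.0, "cross_world_in_a_day": 5.0}

print(aup_penalty(before, after_grab_resources))  # 60.0: power grab, big penalty
print(aup_penalty(before, after_modest_action))   # 0.0: no change in power
```

Broad resource acquisition raises attainability across many goals at once, so it is exactly the kind of action this penalty hits hardest, matching the rationale above.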

Okay, we’re getting to the things where I have less to say about them, but now we’re at robustness. I mentioned this before, but there are two main challenges with verification: the specification problem, and making it computationally efficient. All of the work is on the computational efficiency side, but I think the hardest part is the specification side, and I’d like to see more people do work on that.

I don’t think anyone is really working on verification with an eye to how to apply it to powerful AI systems. I might be wrong about that. I know some people who care about AI safety who are working on verification, and it’s possible that they have thoughts about this that aren’t published and that I haven’t talked to them about. But the main thing I would want to see is: what specifications can we actually give to our verification subroutines? At first glance, this is just the full problem of AI safety. We can’t just give a specification for what we want to an AGI.

What specifications can we give to verification that are going to increase our trust in the AI system? For adversarial training, again, all of the work done so far is in the adversarial example space, where you try to train an image classifier to be more robust to adversarial examples, and this kind of works sometimes, but doesn’t work great. For both verification and adversarial training, Paul Christiano has written a few blog posts about how you can apply this to advanced AI systems, but I don’t know if anyone is actively working on these with AGI in mind. With adversarial examples, there is too much work for me to summarize.

The thing that I find interesting about adversarial examples is that they show that we are not able to create image classifiers that have learned human preferences. Humans have preferences over how we classify images, and we didn’t succeed at that.

Lucas: That’s funny.

Rohin: I can’t take credit for that framing; that one was due to Ian Goodfellow. But yeah, I see adversarial examples as contributing to a theory of deep learning that tells us how to get deep learning systems to be closer to what we want them to be, rather than these weird things that classify pandas as gibbons even when they’re very clearly still pandas.

Lucas: Yeah, the framing’s pretty funny, and makes me feel kind of pessimistic.

Rohin: Maybe if I wanted to inject some optimism back in, there’s a frame under which adversarial examples happen because our data sets are too small or something. We have some pretty large data sets, but humans do see more and get far richer information than just pixel inputs. We can go feel a chair and build 3D models of a chair through touch in addition to sight. There is actually a lot more information that humans have, and it’s possible that what our AI systems need is just way more information in order to narrow down on the right model.

So let us move on to, I think the next thing is interpretability, which I also do not have much to say about, mostly because there is tons and tons of technical research on interpretability, and there is not much on interpretability from an AI alignment perspective. One thing to note with interpretability is you do want to be very careful about how you apply it. Suppose you have a feedback cycle where you’re like: I built an AI system, I’m going to use interpretability to check whether it’s good; oh shit, this AI system was bad, it was not making decisions for the right reasons; so you go and fix your AI system, and then you throw interpretability at it again, and then you’re like: oh no, it’s still bad because of this other reason. If you do this often enough, basically what’s happening is you’re training your AI system to no longer have failures that are obvious to interpretability, and instead to have failures that are not obvious to interpretability, which will probably exist, because your AI system seems to have been full of failures anyway.

So I would be pretty pessimistic about the system that interpretability found 10 or 20 different errors in. I would just expect that the resulting AI system has other failure modes that we were not able to uncover with interpretability, and those will at some point trigger and cause bad outcomes.

Lucas: Right, so interpretability covers things such as superhuman intelligence, but also more mundane examples of present day systems, correct? Where the interpretability of, say, neural networks is, my understanding is, basically nowhere right now.

Rohin: Yeah, that’s basically right. There have been some techniques developed, like saliency maps, feature visualization, neural net models that hallucinate explanations post hoc; people have tried a bunch of things. None of them seem especially good, though some of them are definitely giving you more insight than you had before.

So I think that only leaves CAIS. Comprehensive AI Services is a forecast for how AI will develop in the future. It also has some prescriptive aspects to it, like: we should probably not do these things, because they don’t seem very safe, and we can do these other things instead. In particular, CAIS takes a strong stance against AGI agents that are god-like, fully integrated systems optimizing some utility function over the long term future.

It should be noted that it’s arguing against a very specific kind of AGI agent: this sort of long term expected utility maximizer that’s fully integrated, essentially a black box that can’t be broken down into modular components. That entire cluster of features is what CAIS is talking about when it says AGI agent. So it takes a strong stance against that, saying A, it’s not likely that this is the first superintelligent thing that we build, and B, it’s clearly dangerous; that’s what we’ve been saying the entire time. So here’s a solution: why don’t we just not build it, and build these other things instead? As for what the other things are, the basic intuition pump here is that if you look at how AI is developed today, there are a bunch of research and development practices that we follow. We try out a bunch of models, we try some different ways to clean our data, we try different ways of collecting data sets, we try different algorithms, and so on and so forth, and these research and development practices allow us to create better and better AI systems.

Now, our AI systems currently are also very bounded in the tasks that they do. They have specific tasks, and they do that task and that task alone, in episodic ways. They are only trying to optimize over a bounded amount of time, and they use a bounded amount of computation and other resources. So that’s what we’re going to call a service: an AI system that does a bounded task, in bounded time, with bounded computation. Everything is bounded. Now, our research and development practices are themselves bounded tasks, and AI has shown itself to be quite good at automating bounded tasks. We’ve definitely not automated all bounded tasks yet, but it does seem like we in general are pretty good at automating bounded tasks with enough effort. So probably we will also automate research and development tasks.

We’re seeing some of this already with neural architecture search, for example, and once AI R&D processes have been sufficiently automated, then we get this cycle where AI systems are doing the research and development needed to improve AI systems. So we get to a point of recursive improvement that’s not self improvement anymore, because there’s not really an agent-like self to improve, but you do have recursive AI improving AI. This can lead to the sort of very quick improvement in capabilities that we often associate with superintelligence. With that, we can eventually get to a situation where, for any task that we care about, we could have a service that breaks that task down into a bunch of simple, automatable, bounded tasks, and then we can create services that do each of those bounded tasks and interact with each other in order to complete the long term task in tandem.

This is how humans do engineering and build things. We have these research and development processes, we have these modular systems that are interacting with each other via well defined channels, so this seems more likely to be the first thing that we build that’s capable of superintelligent reasoning, rather than an AGI agent that’s optimizing a utility function over the long term, yada, yada, yada.

Lucas: Is there no risk? Because the superintelligence is the distributed network collaborating. So is there no risk for the collective distributed network to create some sort of epiphenomenal optimization effects?

Rohin: Yup, that’s definitely a thing that you should worry about. I know that Erik agrees with me on this because he explicitly lists this out in the tech report as a thing that needs more research and that we should be worried about. But the hope is that there are other things that you can do that normally we wouldn’t think about with technical AI safety research that would make more sense in this context. For example, we could train a predictive model of human approval. Given any scenario, the AI system should predict how much humans are going to like it or approve of it, and then that service can be used in order to check that other services are doing reasonable things.

Similarly, we might look at each individual service and see which of the other services it’s accessing, and then make sure that those are reasonable services. If we see the CEO service of a paperclip company going and talking to the synthetic biology service, we might be a bit suspicious and be like, why is this happening? And then we can go and check to see why exactly that has happened. So there are all of these other things that we could do in this world, which aren’t really options in the AGI agent world.
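The inter-service auditing idea can be sketched as a simple allow-list check. The service names are illustrative, echoing the paperclip example from the conversation:

```python
# Toy sketch of a monitoring service that flags suspicious inter-service
# calls in a CAIS-style ecosystem for human review.

ALLOWED_CALLS = {
    "paperclip_ceo": {"logistics", "accounting"},
    "logistics": {"shipping"},
}

def audit(caller, callee):
    """Return True if this inter-service call looks reasonable."""
    return callee in ALLOWED_CALLS.get(caller, set())

print(audit("paperclip_ceo", "accounting"))          # True
print(audit("paperclip_ceo", "synthetic_biology"))   # False: flag for review
```

A real version would presumably be learned (e.g. the predictive model of human approval mentioned above) rather than a hand-written allow-list, but the structural point is the same: modular services expose checkable interfaces that a monolithic agent does not.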

Lucas: Aren’t they options in the AGI agential world where the architectures are done such that these important decision points are analyzable to the same degree as they would be in a CAIS framework?

Rohin: Not to my knowledge. As far as I can tell, with most end-to-end trained things, you might have the architecture be such that there are points at which you expect certain kinds of information will be flowing, but you can’t easily look at the information that’s actually there and deduce what the system is doing. It’s just not interpretable enough to do that.

Lucas: Okay. I don’t think that I have any other questions or interesting points with regards to CAIS. It’s a very different and interesting conception of the kind of AI world that we can create. It seems to require its own new coordination challenges: if the hypothesis is true that agential AIs will be afforded more causal power in the world, and more efficiency, than the CAIS systems, that’ll give them a competitive advantage that will potentially bias civilization away from CAIS systems.

Rohin: I do want to note that I think the agential AI systems will be more expensive and take longer to develop than CAIS. So I do think CAIS will come first. Again, this is all in a particular world view.

Lucas: Maybe this is abstracting too far, but does CAIS claim to function as an AI alignment methodology to be used over the long term? Do we retain the CAIS architecture indefinitely, with CAIS creating superintelligence or some sort of distributed task force?

Rohin: I’m not actually sure. There’s definitely a few chapters in the technical report that are like okay, what if we build AGI agents? How could we make sure that goes well? As long as CAIS comes before AGI systems, here’s what we can do in that setting.

But I feel like I personally think that AGI systems will come. My guess is that Erik does not think that this is necessary, and that we could actually just have CAIS systems forever. I don’t really have a model for when to expect AGI separately from the CAIS world. I guess I have a few different potential scenarios that I can consider, and I can compare it to each of those, but it’s not like it’s CAIS or not-CAIS. It’s more like there’s CAIS and a whole bunch of other potential scenarios, and in reality it’ll be some mixture of all of them.

Lucas: Okay, that makes more sense. So, there’s sort of an overload here, or just a ton of awesome information with regards to all of these different methodologies and conceptions here. So just looking at all of it, how do you feel about all of these different methodologies in general, and how does AI alignment look to you right now?

Rohin: Pretty optimistic about AI alignment, but I don’t think that’s so much from the particular technical safety research that we have. That’s some of it. I do think that there are promising approaches, and the fact that there are promising approaches makes me more optimistic. But I think more so my optimism comes from the strategic picture. A belief, A, that we will be able to convince people that this is important, such that people start actually focusing on this problem more broadly; B, that we will be able to get a bunch of people to coordinate such that they’re more likely to invest in safety; and C, that I don’t place as much weight on the view that AI systems will be long-term utility maximizers and that therefore we’re basically all screwed, which seems to be the position of many other people in the field.

I say optimistic. I mean optimistic relative to them. I’m probably pessimistic relative to the average person.

Lucas: A lot of these methodologies are new. Do you have any sort of broad view about how the field is progressing?

Rohin: Not a great one. Mostly because I would consider myself, maybe, to have just recently stopped being new to the field, so I didn’t really get to observe the field very much in the past. But it seems like there’s been more of a shift towards figuring out how all of the things people were thinking about apply to real machine learning systems, which seems nice. The fact that it does connect is good. I don’t think the connections were super natural, or that they just sort of clicked, but they did mostly work out in many cases, and that seems pretty good. So yeah, the fact that we’re now doing a combination of theoretical, experimental, and conceptual work seems good.

It’s no longer the case that we’re mostly doing theory. That seems probably good.

Lucas: You’ve mentioned already a lot of really great links in this podcast, places people can go to learn more about these specific approaches and papers and strategies. And one place that is just generally great for people to go is the Alignment Forum, where a lot of this information already exists. So, generally, are there other places that you recommend people check out if they’re interested in taking more technical deep dives?

Rohin: Probably actually at this point, one of the best places for a technical deep dive is the alignment newsletter database. I write a newsletter every week about AI alignment, all the stuff that’s happened in the past week; that’s the alignment newsletter, not the database, which also people can sign up for, but that’s not really a thing for technical deep dives. It’s more a thing for keeping pace with developments in the field. But in addition, everything that ever goes into the newsletter is also kept in a separate database. I say database, it’s basically a Google Sheets spreadsheet. So if you want to do a technical deep dive on any particular area, you can just go, look for the right category on the spreadsheet, and then look at all the papers there, and read some or all of them.

Lucas: Yeah, so thanks so much for coming on the podcast Rohin, it was a pleasure to have you, and I really learned a lot and found it to be super valuable. So yeah, thanks again.

Rohin: Yeah, thanks for having me. It was great to be on here.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI alignment series.

End of recorded material

AI Alignment Podcast: An Overview of Technical AI Alignment with Rohin Shah (Part 1)

The space of AI alignment research is highly dynamic, and it’s often difficult to get a bird’s-eye view of the landscape. This podcast is the first of two parts attempting to partially remedy this by providing an overview of the organizations participating in technical AI alignment research, their specific research directions, and how these approaches all come together to make up the state of technical AI alignment efforts. In this first part, Rohin moves sequentially through the technical research organizations in this space and carves through the field by its varying research philosophies. We also dive into the specifics of many different approaches to AI safety, explore where they disagree, discuss what properties varying approaches attempt to develop/preserve, and hear Rohin’s take on these different approaches.

You can take a short (3 minute) survey to share your feedback about the podcast here.

In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Topics discussed in this episode include:

  • The perspectives of CHAI, MIRI, OpenAI, DeepMind, FHI, and others
  • Where and why they disagree on technical alignment
  • The kinds of properties and features we are trying to ensure in our AI systems
  • What Rohin is excited and optimistic about
  • Rohin’s recommended reading and advice for improving at AI alignment research

Lucas: Hey everyone, welcome back to the AI Alignment podcast. I’m Lucas Perry, and today we’ll be speaking with Rohin Shah. This episode is the first episode of two parts that both seek to provide an overview of the state of AI alignment. In this episode, we cover technical research organizations in the space of AI alignment, their research methodologies and philosophies, how these all come together on our path to beneficial AGI, and Rohin’s take on the state of the field.

As a general bit of announcement, I would love for this podcast to be particularly useful and informative for its listeners, so I’ve gone ahead and drafted a short survey to get a better sense of what can be improved. You can find a link to that survey in the description of wherever you might find this podcast, or on the page for this podcast on the FLI website.

Many of you will already be familiar with Rohin, he is a fourth year PhD student in Computer Science at UC Berkeley with the Center For Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. And so, without further ado, I give you Rohin Shah.

Thanks so much for coming on the podcast, Rohin, it’s really a pleasure to have you.

Rohin: Thanks so much for having me on again, I’m excited to be back.

Lucas: Yeah, long time no see since Puerto Rico Beneficial AGI. And speaking of Beneficial AGI, you gave quite a good talk there which summarized technical alignment methodologies, approaches, and broad views at this time; and that is the subject of this podcast today.

People can go and find that video on YouTube, and I suggest that you watch that; that should be coming out on the FLI YouTube channel in the coming weeks. But for right now, we’re going to be going in more depth, and with more granularity into a lot of these different technical approaches.

So, just to start off, it would be good if you could contextualize this list of technical approaches to AI alignment that we’re going to get into within the different organizations that they exist at, and the different philosophies and approaches that exist at these varying organizations.

Rohin: Okay, so disclaimer, I don’t know all of the organizations that well. I know that people tend to fit CHAI in a particular mold, for example; CHAI’s the place that I work at. And I mostly disagree with that being the mold for CHAI, so probably anything I say about other organizations is also going to be somewhat wrong; but I’ll give it a shot anyway.

So I guess I’ll start with CHAI. And I think our public output mostly comes from this perspective of how do we get AI systems to do what we want? So this is focusing on the alignment problem, how do we actually point them towards a goal that we actually want, align them with our values. Not everyone at CHAI takes this perspective, but I think that’s the one most commonly associated with us and it’s probably the perspective on which we publish the most. It’s also the perspective I, usually, but not always, take.

MIRI, on the other hand, takes a perspective of, “We don’t even know what’s going on with intelligence. Let’s try and figure out what we even mean by intelligence, what it means for there to be a super-intelligent AI system, what would it even do or how would we even understand it; can we have a theory of what all of this means? We’re confused, let’s be less confused, once we’re less confused, then we can think about how to actually get AI systems to do good things.” That’s one of the perspectives they take.

Another perspective they take is that there’s a particular problem with AI safety, which is that, “Even if we knew what goals we wanted to put into an AI system, we don’t know how to actually build an AI system that would, reliably, pursue those goals as opposed to something else.” That problem, even if you know what you want to do, how do you get an AI system to do it, is a problem that they focus on. And the difference from the thing I associated with CHAI before is that, with the CHAI perspective, you’re interested both in how do you get the AI system to actually pursue the goal that you want, but also how do you figure out what goal that you want, or what is the goal that you want. Though, I think most of the work so far has been on supposing you know the goal, how do you get your AI system to properly pursue it?

I think the DeepMind safety team, at least, is pretty split across many different ways of looking at the problem. I think Jan Leike, for example, has done a lot of work on reward modeling, and this sort of fits in with the “how do we get our AI systems to be focused on the right task, the right goal” framing. Whereas Vika has done a lot of work on side effects or impact measures. I don’t know if Vika would say this, but the way I interpret it is: how do we impose a constraint upon the AI system such that it never does anything catastrophic? It’s not trying to get the AI system to do what we want, just to not do what we don’t want, or what we think would be catastrophically bad.

OpenAI safety also seems to be about: okay, how do we get deep reinforcement learning to do good things, to do what we want, to be a bit more robust? Then there’s also the iterated amplification / debate / factored cognition area of research, which is more along the lines of: can we write down a system that could plausibly lead to us building an aligned AGI or aligned powerful AI system?

FHI: no coherent direction, and that’s all of FHI. Eric Drexler is also trying to understand how AI will develop in the future. It’s somewhat different from what MIRI’s doing, but has the same general theme of trying to figure out what is going on. He just recently published a long technical report on Comprehensive AI Services, which is a general worldview for predicting what AI development will look like in the future. If we believed that that was, in fact, the way AI would happen, we would probably change what we work on from the technical safety point of view.

And Owain Evans does a lot of stuff, so maybe I’m just not going to try to categorize him. And then Stuart Armstrong works on this, “Okay, how do we get value learning to work such that we actually infer a utility function that we would be happy for an AGI system to optimize, or a super-intelligent AI system to optimize?”

And then Ought works on factored cognition, so it’s very adjacent to the iterated amplification and debate research agendas. Then there are a few individual researchers scattered at, for example, Toronto, Montreal, AMU, and EPFL; maybe I won’t get into all of them because, yeah, that’s a lot, but we can delve into that later.

Lucas: Maybe a more helpful approach, then, would be if you could start by demystifying some of the MIRI stuff a little bit, which may seem most unusual.

Rohin: I guess, strategically, the point would be that you’re trying to build this AI system that’s going to be, hopefully, at some point in the future vastly more intelligent than humans, because we want them to help us colonize the universe or something like that, and lead to lots and lots of technological progress, etc., etc.

But this, basically, means that humans will not be in control unless we very, very specifically arrange it such that we are in control; we have to thread the needle perfectly in order to get this to work out. In the same way that, by default, you would expect that the most intelligent creatures, beings, are the ones that are going to decide what happens. And so we really need to make sure, and also it’s probably hard to ensure, that these vastly more intelligent beings are actually doing what we want.

Given that, it seems like what we want is a good theory that allows us to understand and predict what these AI systems are going to do. Maybe not in the fine, nitty-gritty details, because if we could predict what they would do, then we could do it ourselves and be just as intelligent as they are. But, at least, in broad strokes: what sorts of universes are they going to create?

But given that they can apply so much more intelligence that we can, we need our guarantees to be really, really strong; like almost proof level. Maybe actual proofs are a little too much to expect, but we want to get as close to it as possible. Now, if we want to do something like that, we need a theory of intelligence; we can’t just sort of do a bunch of experiments, look at the results, and then try to extrapolate from there. Extrapolation does not give you the level of confidence that we would need for a problem this difficult.

And so rather, they would like to instead understand intelligence deeply, deconfuse themselves about it. Once you understand how intelligence works at a theoretical level, then you can start applying that theory to actual AI systems and seeing how they approximate the theory, or make predictions about what different AI systems will do. And, hopefully, then we could say, “Yeah, this system does look like it’s going to be very powerful as approximating this particular idea, this particular part of theory of intelligence. And we can see that with this particular theory of intelligence, we can align it with humans somehow, and you’d expect that this was going to work out.” Something like that.

Now, that sounded kind of dumb even to me as I was saying it, but that’s because we don’t have the theory yet; it’s very fun to speculate how you would use the theory before you actually have the theory. So that’s the reason they’re doing this. The actual thing that they’re focusing on is centered around problems of embedded agency. And I should say this is one of their, I think, two main strands of research; the other strand of research I do not know anything about, because they have not published anything about it.

But one of their strands of research is about embedded agency. And here the main point is that in the real world, any agent, any AI system, or a human is a part of their environment. They are smaller than the environment and the distinction between agent and environment is not crisp. Maybe I think of my body as being part of me but, I don’t know, to some extent, my laptop is also an extension of my agency; there’s a lot of stuff I can do with it.

Or, on the other hand, you could think maybe my arms and limbs aren’t actually a part of me, I could maybe get myself uploaded at some point in the future, and then I will no longer have arms or legs; but in some sense I am still me, I’m still an agent. So, this distinction is not actually crisp, and we always pretend that it is in AI, so far. And it turns out that once you stop making this crisp distinction and start allowing the boundary to be fuzzy, there are a lot of weird, interesting problems that show up and we don’t know how to deal with any of them, even in theory, so that’s what they focused on.

Lucas: And can you unpack, given that AI researchers control the input/output channels for AI systems, why is it that there is this fuzziness? It seems like you could abstract away the fuzziness, given that there are these rigid and selected I/O channels.

Rohin: Yeah, I agree that seems like the right thing for today’s AI systems; but I don’t know. If I think about, “Okay, this AGI is a generally intelligent AI system.” I kind of expect it to recognize that when we feed it inputs which, let’s say, we’re imagining a money maximizing AI system that’s taking in inputs like stock prices, and it outputs which stocks to buy. And maybe it can also read the news that lets it get newspaper articles in order to make better decisions about which stocks to buy.

At some point, I expect this AI system to read about AI and humans, and realize that, hey, it must be an AI system, it must be getting inputs and outputs. Its reward function must be to make this particular number in a bank account be as high as possible. And then once it realizes this, there’s this part of the world, which is this number in the bank account, or this particular value in a particular memory block in its own CPU, and its goal is now to make that number as high as possible.

In some sense, it’s now modifying itself, especially if you’re thinking of the memory block inside the CPU. If it goes and edits that and sets it to a million, a billion, the highest number possible in that memory block, then it seems like it has, in some sense, done some self-editing; it’s changed the agent part of itself. It could also go and be like, “Okay, actually what I care about is having this particular reward function box output as high a number as possible. So what if I go and change my input channels such that they feed me things that cause me to believe that I’ve made tons and tons of profit?” So this is the “delusion box” consideration.

While it is true that I don’t see a clear, concrete way that an AI system ends up doing this, it does feel like an intelligent system should be capable of this sort of reasoning, even if it initially had these sort of fixed inputs and outputs. The idea here is that its outputs can be used to affect the inputs or future outputs.

Lucas: Right, so I think that that point is the clearest summation of this; it can affect its own inputs and outputs later. If you take human beings, who are by definition human-level intelligences, in a classic computer science sense you’d say we strictly have five input channels: hearing, seeing, touch, smell, etc.

Human beings have a fixed number of input/output channels but, obviously, human beings are capable of self modifying on those. And our agency is sort of squishy and dynamic in ways that would be very unpredictable, and I think that that unpredictability and the sort of almost seeming ephemerality of being an agent seems to be the crux of a lot of the problem.

Rohin: I agree that that’s a good intuition pump, I’m not sure that I agree it’s the crux. The crux, to me, feels more like: you specify some sort of behavior that you want which, in this case, was make a lot of money, or make this number in a bank account go higher, or make this memory cell go as high as possible.

And when you were thinking about the specification, you assumed that the inputs and outputs fell within some strict parameters, like the inputs are always going to be news articles that are real and produced by human journalists, as opposed to a fake news article that was created by the AI in order to convince the reward function that actually it’s made a lot of money. And then the problem is that since the AI’s outputs can affect the inputs, the AI could cause the inputs to go outside of the space of possibilities that you imagine the inputs could be in. And this then allows the AI to game the specification that you had for it.

Lucas: Right. So, all the parts which constitute some AI system are all, potentially, modified by other parts. And so you have something that is fundamentally and completely dynamic, which you’re trying to make predictions about, but whose future structure is potentially very different and hard to predict based off of the current structure?

Rohin: Yeah, basically.

Lucas: And that in order to get past this we must, again, tunnel down on these decision-theoretic and rational-agency-type issues at the bottom of intelligence, to have a more fundamental theory which can be applied to these highly dynamic and difficult-to-understand situations?

Rohin: Yeah, I think the MIRI perspective is something like that. And in particular, it would be like trying to find a theory that allows you to put in something that stays stable even while the system, itself, is very dynamic.

Lucas: Right, even while your system, whose parts are all completely dynamic and able to be changed by other parts, how do you maintain a degree of alignment amongst that?

Rohin: One answer to this is to give the AI a utility function. If there is a utility function that it’s explicitly trying to maximize, then it probably has an incentive to protect that utility function, because if it gets changed, well, then it’s not going to maximize that utility function anymore; it’ll maximize something else, which will lead to worse behavior by the lights of the original utility function. That’s a thing that you could hope to do with a better theory of intelligence: how do you create a utility function in an AI system that stays stable, even as everything else is dynamically changing?

Lucas: Right, and without even getting into the issues of implementing one single stable utility function.

Rohin: Well, I think they’re looking into those issues. So, for example, Vingean Reflection is a problem that is entirely about how you create a better, more improved version of yourself without having any value drift, or a change to the utility function.

Lucas: Is your utility function not self-modifying?

Rohin: So in theory, it could be. The hope would be that we could design an AI system that does not self-modify its utility function under almost all circumstances. Because if you change your utility function, then you’re going to start maximizing that new utility function, which, by the original utility function’s evaluation, is worse. If I told you, “Lucas, you have got to go fetch coffee. That’s the only thing in life you’re concerned about. You must take whatever actions are necessary in order to get the coffee.”

And then someone goes like, “Hey Lucas, I’m going to change your utility function so that you want to fetch tea instead.” And then all of your decision making is going to be in service of getting tea. You would probably say, “No, don’t do that, I want to fetch coffee right now. If you change my utility function to ‘fetch tea’, then I’m going to fetch tea, which is bad because I want to fetch coffee.” And so, hopefully, you don’t change your utility function because of this effect.
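The coffee/tea argument can be made concrete with a small toy model, assuming a fully observable choice between outcomes and an agent that evaluates proposed self-modifications with its current utility function. All names below are invented for illustration.

```python
# Toy illustration of utility function preservation: an agent that scores a
# proposed self-modification using its CURRENT utility function will reject
# modifications that change that utility function.

def fetch_coffee_utility(outcome: str) -> float:
    """The agent's current utility function: it only values coffee."""
    return 1.0 if outcome == "coffee" else 0.0

def fetch_tea_utility(outcome: str) -> float:
    """The proposed replacement utility function."""
    return 1.0 if outcome == "tea" else 0.0

def best_outcome(utility, outcomes):
    """What an agent maximizing `utility` would end up doing."""
    return max(outcomes, key=utility)

def accepts_modification(current_utility, new_utility, outcomes) -> bool:
    # The agent predicts its own future behavior under each utility function,
    # then scores both predicted futures with its current utility function.
    future_if_unchanged = best_outcome(current_utility, outcomes)
    future_if_modified = best_outcome(new_utility, outcomes)
    return current_utility(future_if_modified) >= current_utility(future_if_unchanged)

outcomes = ["coffee", "tea"]
print(accepts_modification(fetch_coffee_utility, fetch_tea_utility, outcomes))
# The coffee-maximizer foresees its tea-fetching future self scoring zero by
# its current lights, so it refuses the modification: prints False.
```

This is exactly the structure of the spoken example: the evaluation of the modified future happens under the original goal, which is where the incentive for stability comes from.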

Lucas: Right. But isn’t this where corrigibility comes in, and where we admit that as we sort of understand more about the world and our own values, we want to be able to update utility functions?

Rohin: Yeah, so that is a different perspective; I’m not trying to describe that perspective right now. It’s a perspective for how you could get something stable in an AI system. And I associate it most with Eliezer, though I’m not actually sure if he holds this opinion.

Lucas: Okay, so I think this was very helpful for the MIRI case. So why don’t we go ahead and zoom in, I think, a bit on CHAI, which is the Center For Human-Compatible AI.

Rohin: So I think rather than talking about CHAI, I’m going to talk about the general field of trying to get AI systems to do what we want; a lot of people at CHAI work on that, but not everyone. And also a lot of people outside of CHAI work on that, because that seems to be a more useful carving of the field. So there’s this broad argument for AI safety which is, “We’re going to have very intelligent things; based on the orthogonality thesis, we can’t really say anything about their goals.” So the really important thing is to make sure that the intelligence is pointed at the right goals, that it’s pointed at doing what we actually want.

And so then the natural approach is: how do we get our AI systems to infer what we want to do and then actually pursue that? And I think, in some sense, it’s one of the most obvious approaches to AI safety. This is a clear enough problem, even with narrow current systems, that there are plenty of people outside of AI safety working on it as well. So this incorporates things like inverse reinforcement learning, preference learning, reward modeling; the CIRL (cooperative IRL) paper also fits into all of this. So yeah, I can go into those in more depth.

Lucas: Why don’t you start off by talking about the people who exist within the field of AI safety, give sort of a brief characterization of what’s going on outside of the field, but primarily focusing on those within the field. How this approach, in practice, I think generally is, say, different from MIRI to start off with, because we have a clear picture of them painted right next to what we’re delving into now.

Rohin: So I think the difference from MIRI is that this is more targeted directly at the problem right now, in that you’re actually trying to figure out how you build an AI system that does what you want. Now, admittedly, most of the techniques that people have come up with are not likely to scale up to superintelligent AI; they’re not meant to, and no one claims that they’re going to scale up to superintelligent AI. They’re more like some incremental progress on figuring out how to get AI systems to do what we want and, hopefully, with enough incremental progress, we’ll get to a point where we can go, “Yes, this is what we need to do.”

Probably the most well known person here would be Dylan Hadfield-Menell, who you had on your podcast. And so he talked about CIRL and associated things quite a bit there, there’s not really that much I would say in addition to it. Maybe a quick summary of Dylan’s position is something like, “Instead of having AI systems that are optimizing for their own goals, we need to have AI systems that are optimizing for our goals, and try to infer our goals in order to do that.”

So rather than having an AI system that is individually rational with respect to its own goals, you instead want to have a human-AI system such that the entire system is rationally optimizing for the human’s goals. This is sort of the point made by CIRL, where you have an AI system, you’ve got a human, they’re playing this two-player game, the human is the only one who knows the reward function, and the robot is uncertain about what the reward function is and has to learn by observing what the human does.

And so, now you see that the robot does not have a utility function that it is trying to optimize; instead it is learning about a utility function that the human has and then helping the human optimize that reward function. So, in summary: try to build human-AI systems that are group rational, as opposed to an AI system that is individually rational. So that’s Dylan’s view. Then there’s Jan Leike at DeepMind, and a few people at OpenAI.
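The CIRL setup Rohin describes can be sketched in miniature: the robot holds a belief over which reward function the human has, updates it by watching the human, and then optimizes expected human reward. The two candidate reward functions and the Boltzmann-rational human model below are simplifying assumptions for the demo, not anything from the actual paper's experiments.

```python
# Minimal CIRL-flavored sketch: the robot has no reward function of its own;
# it infers the human's reward from behavior and then helps.
import math

REWARDS = {
    "likes_apples":  {"apple": 1.0, "orange": 0.0},
    "likes_oranges": {"apple": 0.0, "orange": 1.0},
}

def human_action_likelihood(action, reward, beta=5.0):
    """Boltzmann-rational human: more likely to pick higher-reward items."""
    scores = {a: math.exp(beta * r) for a, r in reward.items()}
    return scores[action] / sum(scores.values())

def update_belief(prior, observed_action):
    """Bayes rule over which reward function the human has."""
    posterior = {
        name: prior[name] * human_action_likelihood(observed_action, reward)
        for name, reward in REWARDS.items()
    }
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def robot_action(belief):
    """The robot optimizes EXPECTED human reward under its belief."""
    def expected_reward(a):
        return sum(belief[name] * REWARDS[name][a] for name in REWARDS)
    return max(["apple", "orange"], key=expected_reward)

belief = {"likes_apples": 0.5, "likes_oranges": 0.5}
belief = update_belief(belief, observed_action="orange")  # human grabs an orange
print(robot_action(belief))  # prints "orange"
```

The structural point survives the simplification: the "utility function" lives in the robot's belief about the human, not in the robot itself, which is the group-rationality framing above.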

Lucas: Before we pivot into OpenAI and DeepMind, just focusing here on the CHAI end of things and this broad view, help me understand how you would characterize it: a view focused on present-day issues in alignment, seeking to make incremental progress there. You see this view as subsuming multiple organizations?

Rohin: Yes, I do.

Lucas: Okay. Is there a specific name you would, again, use to characterize this view?

Rohin: Oh, getting AI systems to do what we want. Let’s see, do I have a pithy name for this? Helpful AI systems or something.

Lucas: Right which, again, is focused on current day things, is seeking to make incremental progress, and which subsumes many different organizations?

Rohin: Yeah, that seems broadly true. I do think there are people who are doing more conceptual work, thinking about how this will scale to AGI and stuff like that; but it’s a minority of work in the space.

Lucas: Right. And so the question of how do we get AI systems to do what we want them to do, also includes these views of, say, Vingean Reflection or how we become idealized versions of ourselves, or how we build on value over time, right?

Rohin: Yeah. So, those are definitely questions that you would need to answer at some point. I’m not sure that you would need to answer Vingean Reflection at some point. But you would definitely need to answer how do you update, given that humans don’t actually know what they want, for a long-term future; you need to be able to deal with that fact at some point. It’s not really a focus of current research, but I agree that that is a thing about this approach will have to deal with, at some point.

Lucas: Okay. So, moving on from you and Dylan to DeepMind and these other places that you view as this sort of approach also being practice there?

Rohin: Yeah, so while Dylan and I and others at CHAI have been focused on conceptual advances (in toy environments, does this do the right thing? What are some sorts of data that we can learn from? Do these work in very simple environments with quite simple algorithms?), I would say that the OpenAI and DeepMind safety teams are more focused on trying to get this to work in complex environments, the state-of-the-art environments, the most complex ones that we have.

Now I don’t mean Dota and StarCraft, because running experiments with Dota and StarCraft is incredibly expensive, but can we get AI systems that do what we want for environments like Atari or MuJoCo? There’s some work on this happening at CHAI; there are pre-prints available online, but it hasn’t been published very widely yet. Most of the work, I would say, has been happening in an OpenAI/DeepMind collaboration, and most recently there was a position paper from DeepMind on recursive reward modeling.

Right before that, there was first a paper, Deep Reinforcement Learning from Human Preferences, which said, “Okay, if we allow humans to specify what they want by just comparing between different pieces of behavior from the AI system, can we train an AI system to do what the human wants?” And then they built on that in order to create a system that could learn from demonstrations initially, using a kind of imitation learning, and then improve upon the demonstrations using comparisons, in the same way that Deep RL from Human Preferences did.
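The comparison-based idea described here can be sketched in a few lines. The following is a minimal, illustrative sketch in the spirit of Deep RL from Human Preferences: a reward model is fit to pairwise preferences over trajectory segments using a Bradley-Terry likelihood. The linear reward model, the toy 2-D features, and the simulated “human” labels are all assumptions for illustration, not the paper’s actual setup (which uses neural networks and real human comparisons).

```python
import numpy as np

# Sketch: learn a reward model from pairwise comparisons between
# trajectory segments (Bradley-Terry preference model).

rng = np.random.default_rng(0)

def segment_return(theta, seg_features):
    """Predicted reward of a trajectory segment (linear in its features)."""
    return float(seg_features @ theta)

# Toy data: each segment is summarized by summed 2-D state features.
# The hidden "human" preference only cares about the first feature.
true_theta = np.array([1.0, 0.0])
segments = rng.normal(size=(200, 2)) * 3.0
pairs = []
for _ in range(500):
    a, b = rng.choice(len(segments), size=2, replace=False)
    prefer_a = segment_return(true_theta, segments[a]) > segment_return(true_theta, segments[b])
    pairs.append((a, b, prefer_a))

# Fit theta by gradient descent on the preference loss:
# P(a preferred over b) = sigmoid(R(a) - R(b)).
theta = np.zeros(2)
lr = 0.05
for _ in range(200):
    grad = np.zeros(2)
    for a, b, prefer_a in pairs:
        feat_diff = segments[a] - segments[b]
        p_a = 1.0 / (1.0 + np.exp(-(theta @ feat_diff)))
        grad += (p_a - float(prefer_a)) * feat_diff
    theta -= lr * grad / len(pairs)
# theta now ranks segments the way the hidden preference does.
```

In the real system, the learned reward model is then used as the reward signal for a standard deep RL algorithm, with new comparisons collected as the policy improves.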

So one way that you can look at this research is through the field of human-computer interaction, which is about … well, it’s about many things. But one of the things that it’s about is how do you make the user interface for humans intuitive and easy to use, such that you don’t have user error or operator error? One comment from that field that I liked is that most of the things that are classified as ‘user error’ or ‘operator error’ should not be classified as such; they should be classified as ‘interface errors’, where you had such a confusing interface that, well, of course, at some point some user was going to get it wrong.

And similarly, here, what we want is a particular behavior out of the AI, or at least a particular set of outcomes from the AI; maybe we don’t know exactly how to achieve those outcomes. And AI is about giving us the tools to create that behavior in automated systems. The current tool that we all use is the reward function: we write down the reward function, we give it to an algorithm, and it produces the behaviors and outcomes that we want.

And reward functions, they’re just a pretty terrible user interface. They’re better than the previous interface, which was writing a program explicitly, which humans cannot do if the task is something like image classification or continuous control in MuJoCo; it’s an improvement upon that. But reward functions are still a pretty poor interface, because they implicitly claim to encode perfect knowledge of the optimal behavior in all possible environments, which is clearly not a thing that humans can do.

I would say that this area is about moving on from reward functions, going to the next thing that makes the human’s job even easier. And so we’ve got things like comparisons; we’ve got things like inverse reward design, where you specify a proxy reward function that only needs to work in the training environment. Or you do something like inverse reinforcement learning, where you learn from demonstrations; so I think that’s one nice way of looking at this field.
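The inverse reward design idea mentioned here can be illustrated with a tiny numerical sketch: treat the designer’s proxy reward as evidence about the true reward, under the assumption that the proxy was chosen because it works well in the training environment. Everything concrete below (the terrain features, candidate rewards, and likelihood form) is a simplified stand-in for the actual paper’s formulation.

```python
import numpy as np

# Toy inverse reward design: infer a posterior over "true" rewards from
# the proxy the designer wrote, given the training environment.

# Feature counts (grass, dirt, lava) for trajectories available in the
# TRAINING environment, which happens to contain no lava.
train_trajs = np.array([
    [3, 0, 0],   # all grass
    [1, 2, 0],   # mostly dirt
])

# Candidate hypotheses for the TRUE reward weights over (grass, dirt, lava).
candidates = np.array([
    [1.0, -1.0,   0.0],   # likes grass, indifferent to lava
    [1.0, -1.0, -10.0],   # likes grass, hates lava
])

proxy = np.array([1.0, -1.0, 0.0])  # the designer's proxy ignores lava
beta = 1.0

# The trajectory the proxy would choose in the training environment:
proxy_choice = train_trajs[np.argmax(train_trajs @ proxy)]

# P(proxy | true reward w) taken proportional to exp(beta * value that the
# proxy's choice achieves under w) -- a simplified IRD-style likelihood.
likelihoods = np.exp(beta * (candidates @ proxy_choice))
posterior = likelihoods / likelihoods.sum()

# Lava never appeared in training, so the posterior cannot distinguish
# lava-hating from lava-indifferent true rewards; a risk-averse planner
# should therefore avoid lava in a new environment.
```

The point of the sketch is the last comment: because the proxy only had to work in the training environment, the posterior stays uncertain about features (like lava) that never appeared there.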

Lucas: So do you have anything else you would like to add on this how-we-get-present-day-AI-systems-to-do-what-we-want-them-to-do section of the field?

Rohin: Maybe I want to plug my value learning sequence, because it talks about this much more eloquently than I can on this podcast?

Lucas: Sure. Where can people find your value learning sequence?

Rohin: It’s on the Alignment Forum. You just go to the Alignment Forum; at the top there’s ‘Recommended Sequences’. There’s ‘Embedded Agency’, which is from MIRI, the sort of stuff we already talked about; that’s also a great sequence, I would recommend it. There’s Iterated Amplification, also a great sequence; we haven’t talked about it yet. And then there’s my value learning sequence, so you can see it on the front page of the Alignment Forum.

Lucas: Great. So we’ve characterized these, say, different parts of the AI alignment field. So far it’s been cut into this sort of MIRI view, and then this broad approach of trying to get present-day AI systems to do what we want them to do, and to make incremental progress there. Are there any other slices of the AI alignment field that you would like to bring to light?

Rohin: Yeah, I’ve got four or five more. There’s the iterated amplification and debate side of things, which is: how do we build an aligned AGI using current technologies, but imagining that they were way better? So they’re trying to solve the entire problem, as opposed to making incremental progress, while, simultaneously, hopefully thinking conceptually about how we fit all of these pieces together.

There’s limiting the AGI system, which is more about how do we prevent AI systems from behaving catastrophically? It makes no guarantees about the AI systems doing what we want; it just prevents them from doing really, really bad things. Techniques in that section include boxing and avoiding side effects. There’s the robustness view, which is about how do we make AI systems behave robustly? I guess that’s pretty self-explanatory.

There’s transparency or interpretability, which I wouldn’t say is a technique by itself, but seems to be broadly useful for almost all of the other avenues, it’s something we would want to add to other techniques in order to make those techniques more effective. There’s also, in the same frame as MIRI, can we even understand intelligence? Can we even forecast what’s going to happen with AI? And within that, there’s comprehensive AI services.

There’s also lots of effort on forecasting, but comprehensive AI services actually makes claims about what technical AI safety should do. So I think that one actually does have a place in this podcast, whereas most of the forecasting things do not, obviously. They have some implications for the strategic picture, but they don’t have clear implications for technical safety research directions, as far as I can tell right now.

Lucas: Alright, so, do you want to go ahead and start off with the first one on the list there, and then we’ll move sequentially down?

Rohin: Yeah, so iterated amplification and debate. This is similar to the helpful AGI section, in the sense that we are trying to build an AI system that does what we want. That’s still the case here, but we’re now trying to figure out, conceptually, how we can do this using things like reinforcement learning and supervised learning, but imagining that they’re way better than they are right now, such that the resulting agent is going to be aligned with us and reach arbitrary levels of intelligence; so in some sense, it’s trying to solve the entire problem.

We want to come up with a scheme such that, if we run that scheme, we get good outcomes; then we’ve solved almost all of the problem. I think that it also differs in that the argument for why we can be successful is different. This field is aiming to get a property of corrigibility, which I like to summarize as trying to help the overseer. It might fail to help the overseer, or the human, or the user, because it’s not very competent; maybe it makes a mistake and thinks that I like apples when actually I want oranges. But it was actually trying to help me; it actually thought I wanted apples.

So in corrigibility, you’re trying to help the overseer, whereas, in the previous thing about helpful AGI, you’re more getting an AI system that actually does what we want; there isn’t this distinction between what you’re trying to do versus what you actually do. So there’s a slightly different property that you’re trying to ensure, I think, on the strategic picture that’s the main difference.

The other difference is that these approaches are trying to make a single, unified, generally intelligent AI system, and so they will make assumptions like: given that we’re trying to imagine something that’s generally intelligent, it should be able to do X, Y, and Z. Whereas the research agenda of “let’s try to get AI systems that do what you want” tends not to make those assumptions. And so it’s more applicable to current systems or narrow systems, where you can’t assume that you have general intelligence.

For example, a claim that Paul Christiano often talks about is that, “If your AI agent is generally intelligent and a little bit corrigible, it will probably easily be able to infer that its overseer, or the user, would like to remain in control of any resources that they have, and would like to be better informed about the situation, that the user would prefer that the agent does not lie to them, etc., etc.” This is definitely not something that current-day AI systems can do unless you really engineer them to, so this is presuming some level of generality which we do not currently have.

So the next thing I said was limited AGI. Here the idea is, there are not very many policies or AI systems that will do what we want; what we want is a pretty narrow space in the space of all possible behaviors. Actually selecting one of the behaviors out of that space is quite difficult and requires a lot of information in order to narrow in on that piece of behavior. But if all you’re trying to do is avoid the catastrophic behaviors, then there are lots and lots of policies that successfully do that. And so it might be easier to find one of those policies; a policy that doesn’t ever kill all humans.

Lucas: Right, and one might hold this view without thinking it sufficient for AI alignment, but see it as a sort of low-hanging fruit to be picked, because the space of non-catastrophic outcomes is larger than the space of extremely specific futures that human beings support.

Rohin: Yeah, exactly. And the success story here is, basically, that we develop this way of preventing catastrophic behaviors, all of our AI systems are built with this limiting system in place, and then technological progress continues as usual; it’s maybe not as fast as it would have been if we had an aligned AGI doing all of this for us, but hopefully it would still be somewhat fast, and hopefully enabled a bit by AI systems. Eventually, we will either make it to the future without ever building an AI system that doesn’t have such a system in place, or we use this to do a bunch more AI research until we solve the full alignment problem, and then we can build, with high confidence that it’ll go well, an actual, properly aligned superintelligence that is helping us without any of these limiting systems in place.

I think, from a strategic picture, those are basically the important parts about limited AGI. There are two subsections within limits: trying to change what the AI is optimizing for, which would be something like impact measures, versus limits on the input/output channels of the AI system, which would be something like AI boxing.
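As a toy illustration of the impact-measure flavor of limits, here is a sketch in the spirit of relative reachability: penalize the agent for how many states its action makes unreachable, relative to a do-nothing baseline. The chain environment and the specific penalty form are illustrative assumptions, not any paper’s actual formulation.

```python
# Toy impact penalty: reward is reduced by the number of states an action
# makes unreachable, compared to the no-op baseline.

NUM_STATES = 5  # states 0..4 in a chain, connected by edges i -> i+1

def reachable_count(broken_edge):
    """States reachable from state 0 if edge broken_edge -> broken_edge+1
    is destroyed (None means no edge is broken)."""
    if broken_edge is None:
        return NUM_STATES
    return broken_edge + 1  # only states 0..broken_edge remain reachable

def penalized_reward(task_reward, broken_edge, penalty_weight=1.0):
    """Task reward minus a penalty proportional to lost reachability."""
    baseline = reachable_count(None)        # the no-op baseline
    impact = baseline - reachable_count(broken_edge)
    return task_reward - penalty_weight * impact

# Breaking an early edge cuts off more of the state space, so it is
# penalized more heavily than breaking a late edge.
```

The design choice doing the work here is the baseline comparison: impact is measured relative to what would have stayed reachable had the agent done nothing, so irreversible actions are exactly the ones that get penalized.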

So, with robustness: I sort of think of robustness as mostly not going to give us safety by itself, probably, though there are some scenarios in which it could. It’s more meant to harden whichever other approach we use. Maybe if we have an AI system that is trying to do what we want, to go back to the helpful AGI setting, maybe it does that 99.9 percent of the time. But we’re using this AI to make a million decisions, which means it’s going to not do what we want 1,000 times. That seems like way too many times for comfort, because if it’s applying its intelligence to the wrong goal in those 1,000 cases, you could get some pretty bad outcomes.

This is a super heuristic and fluffy argument, and there are lots of problems with it, but I think it sets up the general reason that we would want robustness. So with robustness techniques, you’re basically trying to get some nice worst-case guarantees that say, “Yeah, the AI system is never going to screw up super, super bad.” And this is helpful when you have an AI system that’s going to make many, many, many decisions, and we want to make sure that none of those decisions are going to be catastrophic.

And so some techniques in here include verification, adversarial training, and other adversarial ML techniques like Byzantine fault tolerance or defenses against data poisoning, stuff like that. Interpretability can also be helpful for robustness, if you’ve got a strong overseer who can use interpretability to give good feedback to your AI system. But yeah, the overall goal is: take something that doesn’t fail 99 percent of the time and get it to not fail 100 percent of the time, or check whether or not it ever fails, so that you don’t have this very rare but very bad outcome.
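As a concrete, if simplified, illustration of the adversarial training just mentioned: train on worst-case perturbed inputs rather than clean ones. The sketch below uses an FGSM-style bounded perturbation on a linear classifier; the toy data and linear model are assumptions for illustration, whereas real robustness work applies this to deep networks.

```python
import numpy as np

# Sketch of adversarial training: at each step, perturb inputs in the
# direction that most increases the loss (FGSM-style, bounded by eps),
# then take a gradient step on the perturbed batch.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: class 1 centered at (+2, 0), class 0 at (-2, 0).
X = rng.normal(size=(200, 2)) + np.array([2.0, 0.0])
X[100:] -= np.array([4.0, 0.0])
y = np.array([1.0] * 100 + [0.0] * 100)

w = np.zeros(2)
eps, lr = 0.3, 0.1
for _ in range(300):
    # FGSM perturbation: for logistic loss, dL/dx = (p - y) * w.
    p = sigmoid(X @ w)
    X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])
    # Train on the adversarially perturbed examples.
    p_adv = sigmoid(X_adv @ w)
    w -= lr * X_adv.T @ (p_adv - y) / len(y)

# w should now classify accurately even under worst-case eps-perturbations.
```

The same loop with `eps = 0` recovers ordinary training; the adversarial version trades a little clean accuracy for a worst-case guarantee within the perturbation budget.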

Lucas: And so would you see this section as being within the context of any others or being sort of at a higher level of abstraction?

Rohin: I would say that it applies to any of the others; well okay, not the MIRI embedded agency stuff, because we don’t really have a story for how that ends up helping with AI safety. It could apply to however that cashes out in the future, but we don’t really know right now. With limited AGI, you might have this theoretical model: if you apply this sort of penalty, this sort of impact measure, then you’re never going to have any catastrophic outcomes.

But, of course, in practice, we train our AI systems to optimize that penalty and get a sort of weird black-box thing out, and we’re not entirely sure whether it’s respecting the penalty, or something like this. Then you could use something like verification or transparency in order to make sure that it is actually behaving the way we would predict it to behave, based on our analysis of what limits we need to put on the AI system.

Similarly, if you build AI systems that are doing what we want, maybe you want to use adversarial training to see if you can find any situations in which the AI system is doing something weird, doing something which we wouldn’t classify as what we want. With iterated amplification or debate, maybe we want to verify that the corrigibility property holds all the time. It’s unclear how you would use verification for that, because it seems like a particularly hard property to formalize, but you could still do things like adversarial training or transparency.

We might have these theoretical arguments for why our systems will work; then, once we turn them into actual real systems that will probably use neural nets and other messy stuff like that, are we sure that, in the translation from theory to practice, all of our guarantees stayed intact? Unclear; we should probably use some robustness techniques to check.

Interpretability, I believe, was next. It’s similar in that it’s broadly useful for everything else. If you want to figure out whether an AI system is doing what you want, it would be really helpful to be able to look into the agent and see, “Oh, it chose to buy apples because it had seen me eat apples in the past,” versus, “It chose to buy apples because there was this company that got it to buy the apples, so that the company would make more profit.”

If we could see those two cases, if we could actually see into the decision-making process, it becomes a lot easier to tell whether or not the AI system is doing what we want, or whether or not the AI system is corrigible, or whether or not the AI system is properly … Well, maybe it’s not as obvious for impact measures, but I would expect it to be useful there as well, even if I don’t have a story off the top of my head.

Similarly with robustness: if you’re doing something like adversarial training, it sure would help if your adversary was able to look into the inner workings of the agent and be like, “Ah, I see, this agent tends to underweight this particular class of risky outcomes. So why don’t I search within that class of situations for one where it’s going to take a big risk that it shouldn’t have taken?” It just makes all of the other problems a lot easier to solve.

Lucas: And so how is progress made on interpretability?

Rohin: Right now I think most of the progress is on image classifiers. I’ve seen some work on interpretability for deep RL as well. Honestly, most of the research is probably happening with classification systems, primarily image classifiers, but others as well. And then I also see the deep RL explanation work, because I read a lot of deep RL research.

But it’s motivated a lot by real problems with current AI systems, which interpretability helps you to diagnose and fix. For example, take the problem of bias in classifiers: one thing that I remember from DeepDream is that you can ask DeepDream to visualize barbells, and you always see these sort of muscular arms attached to the barbells, because, in the training set, barbells were always being picked up by muscular people. So that’s a way that you can tell that your classifier is not really learning the concepts that you wanted it to learn.

In the bias case, maybe your classifier always classifies anyone sitting at a computer as a man, because of bias in the data set. And using interpretability techniques, you could see that, okay, when you look at this picture, the AI system is looking primarily at the pixels that represent the computer, as opposed to the pixels that represent the human, and making its decision to label this person as a man based on that. And you’re like, no, that’s clearly the wrong thing to do; the classifier should be paying attention to the human, not to the laptop.

So I think a lot of interpretability research right now is: you take a particular short-term problem and figure out how you can make that problem easier to solve. Though a lot of it is also: what would be the best way to understand what our model is doing? I think a lot of the work that Chris Olah is doing, for example, is in this vein, and then, as you do this exploration, you find some sort of bias in the classifiers that you’re studying.
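The kind of diagnosis described here, checking which pixels the model actually relies on, can be illustrated with a gradient-based saliency sketch. The tiny linear “classifier” below is an assumption standing in for a real image model, where the gradient would be computed by backpropagation.

```python
import numpy as np

# Sketch of gradient-based saliency: which input pixels most affect the
# classifier's output?

rng = np.random.default_rng(0)

# A 4x4 "image", and a linear classifier whose weights only attend to the
# top-left 2x2 patch (analogous to a model that looks at the laptop
# pixels instead of the person).
image = rng.normal(size=(4, 4))
weights = np.zeros((4, 4))
weights[:2, :2] = rng.normal(size=(2, 2))

def score(img):
    """The classifier's scalar output for an image."""
    return float(np.sum(weights * img))

# For a linear model, d(score)/d(pixel) is just the weight at that pixel,
# so the saliency map is simply |weights|.
saliency = np.abs(weights)

# The map reveals that the classifier entirely ignores everything outside
# the top-left patch: changing those pixels never changes the score.
```

For a deep network the saliency map is not constant across inputs, but the interpretive use is the same: it exposes which regions of the input the decision actually depends on.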

So, Comprehensive AI Services is an attempt to predict what the future of AI development will look like, and the hope is that, by doing this, we can figure out what sort of technical safety things we will need to do, or, strategically, what sort of things we should push for in the AI research community in order to make those systems safer.

There’s a big difference between “we are going to build a single unified AGI agent, and it’s going to be generally intelligent and optimize the world according to a utility function” versus “we are going to build a bunch of disparate, separate, narrow AI systems that are going to interact with each other quite a lot, and because of that, they will be able to do a wide variety of tasks, and none of them are going to look particularly like expected utility maximizers.” And the safety research you want to do is different in those two different worlds. And CAIS is basically saying, “We’re in the second of those worlds, not the first one.”

Lucas: Can you go ahead and tell us about ambitious value learning?

Rohin: Yeah, so ambitious value learning: this is also an approach to how we make an aligned AGI that solves the entire problem, in some sense. Which is: look at not just human behavior, but also human brains and the algorithm that they implement, and use that to infer an adequate utility function, one where we would be okay with the behavior that results from maximizing it.

Infer this utility function, and then plug it into an expected utility maximizer. Now, of course, we do still have to solve the problem of, even once we have the utility function, how do we actually build a system that maximizes it? That is not a solved problem yet. But ambitious value learning does seem to be capturing some of the main difficulties, if you could actually solve it. And so that’s an approach I associate most with Stuart Armstrong.

Lucas: Alright, and so you were saying earlier, in terms of your own view, it’s sort of an amalgamation of different credences that you have in the potential efficacy of all these different approaches. So, given all of these and all of their broad missions, and interests, and assumptions that they’re willing to make, what are you most hopeful about? What are you excited about? How do you, sort of, assign your credence and time here?

Rohin: I think I’m most excited about the concept of corrigibility. That seems like the right thing to aim for; it seems like it’s a thing we can achieve; it seems like, if we achieve it, we’re probably okay, nothing’s going to go horribly wrong, and things probably will go very well. I am less confident on which approach to corrigibility I am most excited about. Iterated amplification and debate seem like, if we were to implement them, they would probably lead to corrigible behavior. But I am worried that either of those will be … either we won’t actually be able to build generally intelligent agents, in which case both of those approaches don’t really work; or, another worry that I have, those approaches might be too expensive to actually do, in that other systems are just so much more computationally efficient that we just use those instead, due to economic pressures.

Paul does not seem to be worried by either of these things. He’s definitely aware of both these issues; in fact, he was the one, I think, who listed computational efficiency as a desideratum, and he still is optimistic about them. So, I would not put a huge amount of credence in this view of mine.

If I were to say what I was excited about for corrigibility instead of that, it would be something like: take the research that we’re currently doing on how to get current AI systems to work, which is often called ‘narrow value learning’. If you take that research, it seems plausible that this research, extended into the future, will give us some method of creating an AI system that’s implicitly learning our narrow values, and is corrigible as a result of that, even if it is not generally intelligent.

This is sort of a very hand-wavy, speculative intuition, certainly not as concrete as the hope that we have with iterated amplification, but I’m somewhat optimistic about it. I’m less optimistic about limiting AI systems: it seems like, even if you succeed in finding a nice, simple rule that eliminates all catastrophic behaviors, which plausibly you could do, it seems hard to find one that both does that and also lets you do all of the things that you do want to do.

If you’re talking about impact measures, for example: if you require the AI to be low impact, I expect that that would prevent you from doing many things that we actually want to do, because many things that we want to do are actually quite high impact. Now, Alex Turner disagrees with me on this, and he developed attainable utility preservation. He is explicitly working on this problem and disagrees with me, so again, I don’t know how much credence to put in this.

I don’t know if Vika agrees with me on this or not; she also might disagree with me, and she is also directly working on this problem. So, yeah, it seems hard to put in place a limit that also lets us do the things that we want. And in that case, it seems like, due to economic pressures, we’d end up using the AI systems that aren’t limited.

I want to keep emphasizing my extreme uncertainty over all of this, given that other people disagree with me, but that’s my current opinion. Similarly with boxing: it seems like it’s going to just make it very hard to actually use the AI system. As for robustness and interpretability: interpretability seems very broadly useful, and I’m supportive of most research on interpretability, maybe with an eye towards long-term concerns, just because it seems to make every other approach to AI safety a lot more feasible and easier to solve.

I don’t think it’s a solution by itself, but given that it seems to improve almost every story I have for making an aligned AGI, it seems very much worth getting a better understanding of it. Robustness is an interesting one; it’s not clear to me if it is actually necessary. I kind of want to just voice lots of uncertainty about robustness and leave it at that. It’s certainly good to do, in that it helps us be more confident in our AI systems, but maybe everything would be okay even if we just didn’t do anything. I don’t know; I feel like I would have to think a lot more about this, and also see the techniques that we actually use to build AGI, in order to have a better opinion on that.

Lucas: Could you give a few examples of where your intuitions here are coming from, that don’t see robustness as an essential part of AI alignment?

Rohin: Well, one major intuition: if you look at humans, there are at least some humans where I’m like, “Okay, I could just make this human a lot smarter, a lot faster, have them think for many, many years, and I still expect that they will be robust and not lead to some catastrophic outcome. They may not do exactly what I would have done, because they’re doing what they want, but they’re probably going to do something reasonable; they’re not going to do something crazy or ridiculous.”

I feel like some humans, the sufficiently risk-averse and uncertain ones, seem to be reasonably robust. I think that if you know that you’re planning over a very, very, very long time horizon, so imagine that you know you’re planning over billions of years, then the rational response to this is, “I really better make sure not to screw up right now, since there is just so much reward in the future; I really need to make sure that I can get it.” And so you get very strong pressures for preserving option value and not doing anything super crazy. So I think you could, plausibly, just get reasonable outcomes from those effects. But again, these are not well thought out.

Lucas: All right, and so I just want to go ahead and guide us back to your general views, again, on the approaches. Is there anything that you’d like to add there on the approaches?

Rohin: I think I didn’t talk about CAIS yet. I guess my general view of CAIS is that I broadly agree with it, that this does seem to be the most likely development path, meaning that it’s more likely than any other specific development path, but not more likely than all the other development paths put together.

So I broadly agree with the worldview presented; I’m still trying to figure out what implications it has for technical safety research. I don’t agree with all of it; in particular, I think that you are likely to get AGI agents at some point, probably after the CAIS soup of services happens. Which, I think, again, Drexler disagrees with me on. So, put a bunch of uncertainty on that, but I broadly agree with the worldview that CAIS is proposing.

Lucas: In terms of this disagreement between you and Eric Drexler: are you imagining agenty AGI or superintelligence which comes after the CAIS soup? Do you see that as an inevitable byproduct of CAIS, or do you see it as a choice that humanity will make? And is Eric pushing the view that the agenty stuff doesn’t necessarily come later, that it’s a choice that human beings would have to make?

Rohin: I do think it’s more like saying that this will be a choice that humans will make at some point. I’m sure that Eric, to some extent, is saying, “Yeah, just don’t do that.” But I think Eric and I do, in fact, have a disagreement on how much more performance you can get from an AGI agent than from a CAIS soup of services. My argument is something like: there is efficiency to be gained from going to an AGI agent, and Eric’s position, as best I understand it, is that there is actually just not that much economic incentive to go to an AGI agent.

Lucas: What are your intuition pumps for why you think that you will gain a lot of computational efficiency from creating sort of an AGI agent? We don’t have to go super deep, but I guess a terse summary or something?

Rohin: Sure, I guess the main intuition pump is that, in all of the past cases that we have of AI systems, you see that in speech recognition, in deep reinforcement learning, in image classification, we had all of these hand-built systems that separated the problem out into a few different modules that interacted with each other in a vaguely CAIS-like way. And then, at some point, we got enough compute and large enough data sets that we just threw deep learning at it, and deep learning just blew those approaches out of the water.

So there’s the argument from empirical experience, and there’s also the argument that, if you try to modularize your systems yourself, you can’t really optimize the communication between them; you’re less integrated and you can’t make decisions based on global information, you have to make them based on local information. And so the decisions tend to be a little bit worse. This could be taken as an explanation for the empirical observation that I just made; so that’s another intuition pump there.

Eric’s response would probably be something like, “Sure, this seems true for narrow tasks. You can get a lot of efficiency gains by integrating everything together and throwing deep learning and [inaudible 00:54:10] training at all of it. But for sufficiently high-level tasks, there’s not really that much to be gained by using global information instead of local information, so you don’t actually lose much by having these separate systems, and you do get a lot of computational efficiency and generalization bonuses by modularizing.” He had a good example of this that I’m not replicating, and I don’t want to make my own example, because it’s not going to be as convincing; but that’s his current argument.

And then my counter-argument is: that’s because humans have small brains. Given the size of our brains, and the limits of our data, and the limits of the compute that we have, we are forced to use modularity and systematization to break tasks apart into modular chunks that we can then do individually. Like, if you are running a corporation, you need each person to specialize in their own task without thinking about all the other tasks, because we just do not have the ability to optimize for everything all together, because we have small brains, relatively speaking; or limited brains, is what I should say.

But this is not a limit that AI systems will have. An AI system with vastly more compute than the human brain, and vastly more data, will, in fact, just be able to optimize all of this with global information and get better results. So that’s one thread of the argument, taken down two or three levels of arguments and counter-arguments. There are other threads of that debate, as well.

Lucas: I think that that serves a purpose for illustrating that here. So are there any other approaches here that you’d like to cover, or is that it?

Rohin: I didn’t talk about factored cognition very much. But I think it’s worth highlighting separately from iterated amplification, in that it’s testing an empirical hypothesis: can humans decompose tasks into chunks that each take some small amount of time? And can we do arbitrarily complex tasks using these humans? I am particularly excited about this sort of work that’s trying to figure out what humans are capable of doing and what supervision they can give to AI systems.

Mostly because, going back to a thing I said way back in the beginning, what we’re aiming for is for the human-AI system to be collectively rational, as opposed to the AI system being individually rational. Part of the human-AI system is the human: you want to know what the human can do, what sort of policies they can implement, what sort of feedback they can give to the AI system. And something like factored cognition is testing a particular aspect of that; and I think that seems great, and we need more of it.
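The factored cognition hypothesis being tested here can be illustrated with a toy sketch: a worker who can only do a tiny, bounded amount of work per step still solves an arbitrarily large task by decomposing it and delegating subtasks to copies of itself. The summation task below is an illustrative stand-in for real open-ended questions.

```python
# Toy factored cognition: solve a large task with workers who each do
# only one small, bounded step.

def limited_worker(a, b):
    """Each worker does only a small, bounded piece of cognitive work."""
    return a + b

def decompose_and_solve(task):
    """Recursively split the task until each piece fits a single worker."""
    if len(task) == 1:
        return task[0]
    mid = len(task) // 2
    left = decompose_and_solve(task[:mid])
    right = decompose_and_solve(task[mid:])
    return limited_worker(left, right)

# No single worker ever sees the whole task, yet the tree of workers
# solves it exactly.
```

The empirical question factored cognition asks is whether real, open-ended tasks admit this kind of decomposition when the workers are humans with a few minutes of context, not whether neatly separable tasks like summation do.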

Lucas: Right. I think this seems to be the emerging view of where social scientists are needed in AI alignment: in order to, again as you said, understand what human beings are capable of in terms of supervision, and to analyze the human component of the AI alignment problem, as it requires us to be collectively rational with AI systems.

Rohin: Yeah, that seems right. I expect more writing on this in the future.

Lucas: All right, so there’s just a ton of approaches here to AI alignment, and our heroic listeners have a lot to take in. In terms of getting more information about these approaches, if people are still interested in delving into all the different views that people take of the problem, and the methodologies for working on it, what would you suggest that interested persons look into or read?

Rohin: I cannot give you an overview of everything, because that does not exist. To the extent that it exists, it’s either this podcast or the talk that I did at Beneficial AGI. I can suggest resources for individual items. For embedded agency, there’s the Embedded Agency sequence on the Alignment Forum; far and away the best thing to read for that.

For CAIS, Comprehensive AI Services, there was a 200 plus page tech report published by Eric Drexler at the beginning of this month, if you’re interested, you should go read the entire thing; it is quite good. But I also wrote a summary of it on the Alignment Forum, which is much more readable, in the sense that it’s shorter. And then there are a lot of comments on there that analyze it a bit more.

There’s also another summary written by Richard Ngo, also on the Alignment Forum. Maybe it’s only on LessWrong, I forget; it’s probably on the Alignment Forum. But that’s a different take on Comprehensive AI Services, so I’d recommend reading that too.

For limited AGI, I have not really been keeping up with the literature on boxing, so I don’t have a favorite to recommend. I know that a couple have been written by, I believe, Jim Babcock and Roman Yampolskiy.

For impact measures, you want to read Vika’s paper on relative reachability. There’s also a blog post about it if you don’t want to read the paper. And Alex Turner’s blog posts on attainable utility preservation, I think it’s called ‘Towards A New Impact Measure’, and this is on the Alignment Forum.

For robustness, I would read Paul Christiano’s post called ‘Techniques For Optimizing Worst Case Performance’. This is definitely specific to how robustness will help under Paul’s conception of the problem and, in particular, his thinking of robustness in the setting where you have a very strong overseer for your AI system. But I don’t know of any other papers or blog posts that talk about robustness generally.

For AI systems that do what we want, there’s my value learning sequence that I mentioned before on the Alignment Forum. There’s CIRL or Cooperative Inverse Reinforcement Learning which is a paper by Dylan and others. There’s Deep Reinforcement Learning From Human Preferences and Recursive Reward Modeling, these are both papers that are particular instances of work in this field. I also want to recommend Inverse Reward Design, because I really like that paper; so that’s also a paper by Dylan, and others.

For corrigibility and iterated amplification, the iterated amplification sequence on the Alignment Forum or half of what Paul Christiano has written. If you want to read not an entire sequence of blog posts, then I think Clarifying AI alignment is probably the post I would recommend. It’s one of the posts in the sequence and talks about this distinction of creating an AI system that is trying to do what you want, as opposed to actually doing what you want and why we might want to aim for only the first one.

For iterated amplification, itself, that technique, there is a paper that I believe is called something like Supervising Strong Learners By Amplifying Weak Experts, which is a good thing to read and there’s also corresponding OpenAI blog posts, whose name I forget. I think if you search iterated amplification, OpenAI blog you’ll find it.

And then for debate, there’s AI Safety via Debate, which is a paper; there’s also a corresponding OpenAI blog post. For factored cognition, there’s a post called Factored Cognition on the Alignment Forum; again, in the iterated amplification sequence.

For interpretability, there isn’t really anything talking about interpretability from the strategic point of view of why we want it. I guess that same post I recommended before, Techniques For Optimizing Worst Case Performance, talks about it a little bit. For actual interpretability techniques, I recommend the Distill articles The Building Blocks of Interpretability and Feature Visualization, but these are more about particular techniques for interpretability, as opposed to why we want interpretability.

And on ambitious value learning, the first chapter of my sequence on value learning talks exclusively about ambitious value learning, so that’s one thing I’d recommend. But Stuart Armstrong also has many posts; I think there’s one about resolving human values adequately, or something like that. That one might be worth checking out; it’s very technical though, lots of math.

He’s also written a bunch of posts that convey the intuitions behind the ideas. They’re all split into a bunch of very short posts, so I can’t really recommend any one particular one. You could go to the alignment newsletter database and just search Stuart Armstrong, and click on all of those posts and read them. I think that was everything.

Lucas: That’s a wonderful list. So we’ll go ahead and link all of those in the article which goes along with this podcast, so it’ll all be there, organized in nice, neat lists for people. This has all probably been fairly overwhelming in terms of the number of approaches, how they differ, and how one is to adjudicate the merits of all of them. If someone is just entering the space of AI alignment, or is beginning to be interested in these different technical approaches, do you have any recommendations?

Rohin: Reading a lot, rather than trying to do actual research. This was my strategy, I started back in September of 2017 and I think for the first six months or so, I was reading about 20 hours a week, in addition to doing research; which was why it was only 20 hours a week, it wasn’t a full time thing I was doing.

And I think that was very helpful for actually forming a picture of what everyone was doing. Now, it’s plausible that you don’t want to actually learn about what everyone is doing, and you’re okay with, “I’m fairly confident that this particular problem is an important piece of the overall problem and we need to solve it.” I think it’s very easy to get that wrong, so I’m a little wary of recommending it, but it’s a reasonable strategy to just say, “Okay, we probably will need to solve this problem, but even if we don’t, the intuitions that we get from trying to solve it will be useful.”

Focusing on that particular problem, reading all of the literature on that, attacking that problem, in particular, lets you start doing things faster, while still doing things that are probably going to be useful; so that’s another strategy that people could do. But I don’t think it’s very good for orienting yourself in the field of AI safety.

Lucas: So you think that there’s high value in people taking this time to read, to understand all the papers and the approaches, before trying to participate in particular research questions or methodologies, given how open this question is. All the approaches make different assumptions and take for granted different axioms, which all come together to create a wide variety of things that can both complement each other and have varying degrees of efficacy in the real world when AI systems start to become more developed and advanced.

Rohin: Yeah, that seems right to me. Part of the reason I’m recommending this is because it seems to me that no one does this. I think, on the margin, I want more people who do this. In a world where 20 percent of the people were doing this, and the other 80 percent were just taking a particular piece of the problem and working on it, that might be the right balance; somewhere around there, I don’t know, it depends on how you count who is actually in the field. But somewhere between one and 10 percent of the people are doing this; closer to the one.

Lucas: Which is quite interesting, I think, given that it seems like AI alignment should be in a stage of maximum exploration, since conceptually mapping the territory is still very young. I mean, we’re essentially seeing the birth and initial development of an entirely new field and a specific application of thinking. And there are many more mistakes to be made, concepts to be clarified, and layers to be built. So it seems like we should be maximizing our attention on exploring the general space, trying to develop models of the efficacy of different approaches, philosophies, and views of AI alignment.

Rohin: Yeah, I agree with you, which should not be surprising given that I am one of the people doing this, or trying to do this. Probably the better critique will come from people who are not doing this, and who can tell both of us why we’re wrong about this.

Lucas: We’ve covered a lot here in terms of the specific approaches, your thoughts on the approaches, where we can find resources on the approaches, and why studying the approaches matters. Are there any parts of the approaches that you feel deserve more attention, in terms of these different sections that we’ve covered?

Rohin: I think I would want more work on looking at the intersection between things that are supposed to be complementary. How interpretability can help you have AI systems that have the right goals, for example, would be a cool thing to do. Or what you need to do in order to get verification, which is a sub-part of robustness, to give you interesting guarantees on AI systems that we actually care about.

Most of the work on verification right now looks like this: there’s a nice specification that we have for adversarial examples in particular, namely, is there an input that is within some distance of a training data point, such that it gets classified differently from that training data point? That’s a nice formal specification, and most of the work in verification takes this specification as given and figures out more and more computationally efficient ways to actually verify that property.
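The specification Rohin sketches is the standard local-robustness property from the adversarial examples literature, and it can be written out explicitly. Here $f$ is the classifier, $x_0$ a training point, and $\epsilon$ the allowed perturbation size; the choice of the $\ell_\infty$ norm is one common convention, not something fixed by the conversation:

```latex
% Local robustness of classifier f around training point x_0:
% no input within distance epsilon of x_0 changes the predicted class.
\forall x : \; \lVert x - x_0 \rVert_\infty \le \epsilon
\;\Longrightarrow\; f(x) = f(x_0)
```

A verifier tries either to prove this property or to exhibit a counterexample $x$, which is exactly an adversarial example. Rohin’s point is that writing down an analogous formula for properties like corrigibility is the hard, open part.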

That does seem like a thing that needs to happen, but the much more urgent thing, in my mind, is how do we come up with these specifications in the first place? If I want to verify that my AI system is corrigible, or I want to verify that it’s not going to do anything catastrophic, or that it is going to not disable my value learning system, or something like that; how do I specify this at all in any way that lets me do something like a verification technique even given infinite computing power? It’s not clear to me how you would do something like that, and I would love to see people do more research on that.

That particular thing is my current reason for not being very optimistic about verification, in particular, but I don’t think anyone has really given it a try. So it’s plausible that there’s actually just some approach that could work that we just haven’t found yet because no one’s really been trying. I think all of the work on limited AGI is talking about, okay, does this actually eliminate all of the catastrophic behavior? Which, yeah, that’s definitely an important thing, but I wish that people would also do research on, given that we put this penalty or this limit on the AGI system, what things is it still capable of doing?

Have we just made it impossible for it to do anything of interest whatsoever, or can it actually still do pretty powerful things, even though we’ve placed these limits on it? That’s the main thing I want to see there. Then, for AI systems that do what we want, probably the biggest thing I want to see, and I’ve been trying to do some of this myself, is conceptual thinking about how this leads to good outcomes in the long term. So far, we’ve not been dealing with the fact that the human doesn’t actually have a nice, consistent utility function that they know and that can be optimized. So, once you relax that assumption, what the hell do you do? And then there’s also a bunch of other problems that would benefit from more conceptual clarification; maybe I don’t need to go into all of them right now.

Lucas: Yeah. And just to inject something here that I think we haven’t touched on, and that you might have some words about in terms of approaches: we discussed agential views of advanced artificial intelligence and a services-based conception, but I don’t believe that we have talked about aligning AI systems that simply function as oracles, or having a collection of oracles. You can get rid of the services thing and the agency thing if the AI just tells you what is true, or answers your questions in a way that is value aligned.

Rohin: Yeah, I mostly want to punt on that question, because I have not actually read all the papers. I might have read a grand total of one paper on oracles, plus Superintelligence, which talks about oracles. So I feel like I know so little about the state of the art on oracles that I should not actually say anything about them.

Lucas: Sure. So then, just as a broad point for our audience: in terms of conceptualizing these different approaches to AI alignment, it’s crucial to consider the kind of AI system that you’re thinking about, and the kinds of features and properties that it has, and oracles are another version here that one can play with in one’s AI alignment thinking?

Rohin: I think the canonical paper there is something like Good and Safe Uses of AI Oracles, but I have not actually read it. It is on the list of things I want to read. But that list also has, I think, something like 300 papers on it, and apparently I have not gotten to oracles yet.

Lucas: And so for the sake of this whole podcast being as comprehensive as possible, are there any conceptions of AI, for example, that we have omitted so far adding on to this agential view, the CAIS view of it actually just being a lot of distributed services, or an oracle view?

Rohin: There’s also the Tool AI view. This is different from the services view, but it’s somewhat akin to the view you were talking about at the beginning of this podcast, where you’ve got AI systems that have a narrowly defined input/output space; they’ve got a particular, limited thing that they do, and they just take in their inputs, do some computation, and spit out their outputs, and that’s it, that’s all they do. You can’t really model them as having some long-term utility function that they’re optimizing; they’re just implementing a particular input-output relation, and that’s all they’re trying to do.

Even saying something like, “They are trying to do X,” is basically using a bad model for them. I think the main argument against expecting tool AI systems is that they’re probably not going to be as useful as other services or agential AI, because tool AI systems would have to be programmed in a way where we understood what they were doing and why they were doing it. Whereas agential AI systems or services would be able to consider new possible ways of achieving goals that we hadn’t thought about, and enact those plans.

And so they could get superhuman behavior by considering things that we wouldn’t consider. Whereas tool AIs … Like, Google Maps is superhuman in some sense, but it’s superhuman only because it has a compute advantage over us. If we were given all of the data and all of the time, in human real time, that Google Maps had, we could implement a similar sort of algorithm as Google Maps and compute the optimal route ourselves.

Lucas: There seems to be this duality that is constantly being formed in our conception of AI alignment, where the AI system is this tangible external object which stands in some relationship to the human and is trying to help the human to achieve certain things.

Are there conceptions of value alignment which, however the procedure or methodology is done, changes or challenges the relationship between the AI system and the human system where it challenges what it means to be the AI or what it means to be human, whereas, there’s potentially some sort of merging or disruption of this dualistic scenario of the relationship?

Rohin: I don’t really know, I mean, it sounds like you’re talking about things like brain computer interfaces and stuff like that. I don’t really know of any intersection between AI safety research and that. I guess, this did remind me, too, that I want to make the point that all of this is about the relatively narrow, I claim, problem of aligning an AI system with a single human.

There is also the problem of, okay what if there are multiple humans, what if there are multiple AI systems, what if you’ve got a bunch of different groups of people and each group is value aligned within themselves, they build an AI that’s value aligned with them, but lots of different groups do this now what happens?

Solving the problem that I’ve been talking about does not mean that you have a good outcome in the long-term future; it is merely one piece of a larger overall picture. I don’t think any of that larger overall picture removes the dualistic thing that you were talking about, but the dualistic part reminded me of the fact that I am talking about a narrow problem and not the whole problem, in some sense.

Lucas: Right and so just to offer some conceptual clarification here, again, the first problem is how do I get an AI system to do what I want it to do when the world is just me and that AI system?

Rohin: Me and that AI system and the rest of humanity, but the rest of humanity is treated as part of the environment.

Lucas: Right, so you’re not modeling other AI systems or how some mutually incompatible preferences and trained systems would interact in the world or something like that?

Rohin: Exactly.

Lucas: So the full AI alignment problem is… It’s funny because it’s just the question of civilization, I guess. How do you get the whole world and all of the AI systems to make a beautiful world instead of a bad world?

Rohin: Yeah, I’m not sure if you saw my lightning talk at Beneficial AGI, but I talked a bit about this. I think I called that top-level problem “make AI-related future stuff go well”; very, very, very concrete, obviously.

Lucas: It makes sense. People know what you’re talking about.

Rohin: I probably wouldn’t call that broad problem the AI alignment problem. I kind of wonder if there’s a different name for the narrower problem? We could maybe call it the ‘AI Safety Problem’ or the ‘AI Future Problem’, I don’t know. The ‘Beneficial AI’ problem, actually; I think that’s what I used last time.

Lucas: That’s a nice way to put it. So I think that, conceptually, that leaves us at a very good place for this first section.

Rohin: Yeah, seems pretty good to me.

Lucas: If you found this podcast interesting or useful, please make sure to check back for part two in a couple weeks where Rohin and I go into more detail about the strengths and weaknesses of specific approaches.

We’ll be back again soon with another episode in the AI Alignment podcast.

[end of recorded material]

AI Alignment Podcast: AI Alignment through Debate with Geoffrey Irving

“To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information…  In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment.” AI safety via debate

Debate is something that we are all familiar with. Usually it involves two or more persons giving arguments and counter arguments over some question in order to prove a conclusion. At OpenAI, debate is being explored as an AI alignment methodology for reward learning (learning what humans want) and is a part of their scalability efforts (how to train/evolve systems to safely solve questions of increasing complexity). Debate might sometimes seem like a fruitless process, but when optimized and framed as a two-player zero-sum perfect-information game, we can see properties of debate and synergies with machine learning that may make it a powerful truth seeking process on the path to beneficial AGI.

On today’s episode, we are joined by Geoffrey Irving. Geoffrey is a member of the AI safety team at OpenAI. He has a PhD in computer science from Stanford University, and has worked at Google Brain on neural network theorem proving, cofounded Eddy Systems to autocorrect code as you type, and has worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Tintin, Wall-E, Up, and Ratatouille. 

We hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Topics discussed in this episode include:

  • What debate is and how it works
  • Experiments on debate in both machine learning and social science
  • Optimism and pessimism about debate
  • What amplification is and how it fits in
  • How Geoffrey took inspiration from amplification and AlphaGo
  • The importance of interpretability in debate
  • How debate works for normative questions
  • Why AI safety needs social scientists
You can find out more about Geoffrey Irving at his website. Here you can find the debate game mentioned in the podcast. Here you can find Geoffrey Irving, Paul Christiano, and Dario Amodei’s paper on debate. Here you can find an OpenAI blog post on AI Safety via Debate. You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today we’ll be speaking with Geoffrey Irving about AI safety via Debate. We discuss how debate fits in with the general research directions of OpenAI, what amplification is and how it fits in, and the relation of all this with AI alignment. As always, if you find this podcast interesting or useful, please give it a like and share it with someone who might find it valuable.

Geoffrey Irving is a member of the AI safety team at OpenAI. He has a PhD in computer science from Stanford University, and has worked at Google Brain on neural network theorem proving, cofounded Eddy Systems to autocorrect code as you type, and has worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Tintin, Wall-E, Up, and Ratatouille. Without further ado, I give you Geoffrey Irving.

Thanks again, Geoffrey, for coming on the podcast. It’s really a pleasure to have you here.

Geoffrey: Thank you very much, Lucas.

Lucas: We’re here today to discuss your work on debate. To start off, it’d be interesting if you could provide a bit of framing for debate and how it exists at OpenAI, in the context of OpenAI’s current research agenda and the directions OpenAI is moving in right now.

Geoffrey: I think broadly, we’re trying to accomplish AI safety by reward learning: learning a model of what humans want, and then training agents to do well according to that model. There are sort of three parts to learning what humans want. One part is just a bunch of machine learning mechanics: how to learn from small sample sizes, how to ask basic questions, how to deal with data quality. There’s a lot more work, then, on the human side: how do humans respond to the questions we want to ask, and how do we best ask those questions?

Then there’s sort of a third category: how do you make these systems work even if the agents are very strong, stronger than human in some or all areas? That’s the scalability aspect. Debate is one of our techniques for doing scalability; Amplification was the first one, and Debate is a variant of it. Generally, we want to be able to supervise a learning agent even if it is smarter or stronger than a human on some task, or on many tasks.

In Debate, you train two agents to play a game. The game is that these two agents see a question on some subject and they give their answers. Each debater has their own answer, and then they have a debate about which answer is better, meaning more true and more useful. Then a human sees that debate transcript and judges who wins, based on who they think told the most useful, true thing. The result of the game is, one, who won the debate, and two, the answer of the debater who won.

You can also have variants where the judge interacts during the debate; we can get into these details. The general point is that, in many tasks, it is much easier to recognize good answers than it is to come up with the answers yourself. This applies at several levels.

For example, at the first level, you might have a task where a human can’t do the task, but they can know immediately if they see a good answer to the task. Like, I’m bad at gymnastics, but if I see someone do a flip very gracefully, then I can know, at least to some level of confidence, that they’ve done a good job. There are other tasks where you can’t directly recognize the answer, so you might see an answer, it looks plausible, say, “Oh, that looks like a great answer,” but there’s some hidden flaw. If an agent were to point out that flaw to you, you’d then think, “Oh, that’s actually a bad answer.” Maybe it was misleading, maybe it was just wrong. You need two agents doing a back and forth to be able to get at the truth.

Then, if you apply this recursively through several levels, you might have tasks where you can’t recognize whether an answer is good directly. You can’t even recognize whether a counterargument is good. Maybe a counter-counterargument, then you could recognize it. If you do sort of three steps of Debate, back and forth with two agents, you get to the truth.

Depending on the task, you need some number of these steps to pin down what the true answer is. Broadly, what’s happening is that the debaters, when they’re giving these counterarguments, are skipping over a lot of the space where maybe the human doesn’t understand what would happen, but they know that it would go a certain way. Like, if you see two expert players playing a game of Go, they’re constantly ignoring a large number of moves which are just very bad and would waste a lot of time if you tried to analyze them, and they’re just looking at the most important part of the arguments about what is the best move in the game.

So far, the experiments are on simple image tasks, like MNIST, plus some preliminary human experiments with just human debaters. MNIST is a dataset of handwritten digits. On MNIST, an example is you have a sparse classifier that learns to recognize MNIST from just seeing a few pixels, so you might train it by picking random masks of just a few pixels of an MNIST image. This classifier is very bad, because it only sees a small part of the image; it might get accuracies below 50%. I forget exactly the numbers, I could look them up.

Then, you train debaters that choose which parts of the image to show to the judge. You fix the judge at this kind of weak training process, and then the debaters will play a game where they try to pick pixels that convince the judge of what they want to argue for. There’s kind of two ways to decide … In the first paper, we had one debater that was incentivized to tell the truth, so it knew the correct class. Like say, we’re looking at a seven, and the other debater had to pre-specify which digit it was going to lie for. It might say, “I’m going to lie and say things were an eight.”

The game would be each of these agents pick a single pixel, and you sort of iterate for, let’s say six rounds, picking one pixel after another, and then the judge sees those pixels, and chooses who wins the debate. The judge has seen kind of a small part of the evidence. The result of that is that you can boost a weak classifier up to sort of higher performance, even if the agents are incentivized only to win the game. In this case, if you pick one debater to be the honest player and one debater to be the liar, the honest player wins a significant majority of the time, much more than the classifier would win on random play.
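The sparse-pixel game Geoffrey describes can be sketched as a small toy program. Everything here is an illustrative assumption rather than the actual OpenAI experiment: the real judge is a trained sparse classifier and the real debaters are learned agents, whereas this sketch uses a simple evidence-tallying judge and greedy debaters, just to show the shape of the protocol (alternating reveals, then a verdict from partial evidence).

```python
def judge(revealed):
    """Weak judge: picks the class with the most revealed evidence.

    `revealed` is a list of (pixel_strength, claimed_class) pairs;
    the judge only tallies the evidence it was shown, mirroring a
    classifier that sees just a few pixels.
    """
    votes = {}
    for strength, claim in revealed:
        votes[claim] = votes.get(claim, 0) + strength
    return max(votes, key=votes.get)

def play_debate(image, true_class, lie_class, rounds=3):
    """Honest and lying debaters alternately reveal one pixel each.

    `image` maps pixel index -> (strength, class_it_supports).
    Each debater greedily reveals the strongest unrevealed pixel
    supporting its own claim; after all rounds, the judge decides.
    """
    remaining = dict(image)
    revealed = []
    for _ in range(rounds):
        for claim in (true_class, lie_class):
            supporting = [(i, s) for i, (s, c) in remaining.items()
                          if c == claim]
            if not supporting:
                continue  # this debater has no evidence left to show
            idx, strength = max(supporting, key=lambda p: p[1])
            del remaining[idx]
            revealed.append((strength, claim))
    return judge(revealed)
```

The intuition from the paper that this toy captures is that the honest player tends to win because a true class has more, and stronger, supporting evidence to reveal than any fixed lie; for example, with an image whose evidence mostly supports “7”, `play_debate(image, "7", "8")` returns “7” even though the judge sees only a handful of pixels.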

Lucas: The judge here, sorry, was not a human?

Geoffrey: The judge here is not a human. It’s just a classifier trained to do a bad job at MNIST, because it sees only a little bit of information. It’s trained to convergence, but its input is just a few pixels of an image.

Lucas: Then, so the pixels that are then revealed by the debaters, each pixel is sort of the argument.

Geoffrey: Each pixel is the argument.

Lucas: Just to provide a little bit more framing: there’s this feature of debate where you can apply it to a surprisingly large domain of things, if you expand the notion of what it means to debate to something like showing pixels.

Geoffrey: It’s actually most important to debate in natural language. The end goal here is we want to extract a strengthened, improved version of human performance at a task. The way we go about this, either in amplification or in debate, is we factor through reasoning. Instead of trying to train directly on the task, where you might have some questions and some answers and could train directly on question/answer pairs, we’re going to build a task which includes all possible human reasoning, in this case in the form of debates. Then we’ll train the agents to do well in this space of reasoning, and we’ll pick out the answers at the very end, once we’re satisfied that the reasoning all works out.

Because the way we humans talk about higher-level concepts, especially abstract concepts, and say subtle moral concepts, is natural language, the most important domain here, in the human case, is natural language. What we’ve done so far, in all experiments for Debate, is in image space, because it’s easier. We’re trying now to move that work into natural language so that we can get more interesting settings.

Lucas: Right. In terms of natural language, do you just want to unpack a little bit about how that would be done at this point in natural language? It seems like our natural language technology is not at a point where I really see robust natural language debates.

Geoffrey: There’s sort of two ways to go. One way is human debates: you just replace the ML agents with human debaters and a human judge, and you see whether the system works in an all-human context. The other way is that machine learning on natural language is getting good enough to do interestingly well on sample question/answer datasets, and Debate is already interesting if you do a very small number of steps. In a general debate, you imagine that you have this long transcript, dozens of statements long, with points and counterpoints and counter-counterpoints, but if you do just two steps, you might do question, answer, and then a single counterargument. For some tasks, at least in theory, that should already be stronger than the baseline of just doing direct question/answer, because you have this ability to focus in on a counterargument that is important.

An example might be: you see a question and an answer, and then the other debater asks, “Which part of the answer is problematic?” They might point to a word or to a small phrase and say, “This is the point you should focus in on.” If the system learns how to self-critique, then you can boost its performance by iterating on that critique.

The hope is that even if we can’t do general debates on the machine learning side just yet, we can do shallow debates, or some sort of simple first step in this direction, and then work up over time.
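The shallow, two-step debate described here (question, answer, a single counterargument, then a judgment) can be sketched as a minimal data flow. Every function name and toy model below is an illustrative stand-in, not anything from OpenAI's actual setup:

```python
# Minimal sketch of a two-step debate: question -> answer -> one counterargument,
# then a judge picks a winner. The "models" here are hypothetical stand-ins.

def two_step_debate(question, answerer, critic, judge):
    """Run one shallow debate and return (answer, counterargument, winner)."""
    answer = answerer(question)                # debater 1 proposes an answer
    counter = critic(question, answer)         # debater 2 points at the weakest part
    winner = judge(question, answer, counter)  # human (or judge model) decides
    return answer, counter, winner

# Toy instantiation: the critic flags answers that come with no justification.
answerer = lambda q: {"claim": "yes", "support": ""}
critic = lambda q, a: "no support given" if not a["support"] else None
judge = lambda q, a, c: "critic" if c else "answerer"

answer, counter, winner = two_step_debate("Is 7 prime?", answerer, critic, judge)
```

Deeper debates would extend this with alternating counterarguments before the judge is consulted.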

Lucas: This just seems to be a very fundamental part of AI alignment where you’re just breaking things down into very simple problems and then trying to succeed in those simple cases.

Geoffrey: That’s right.

Lucas: Just provide a little bit more illustration of debate as a general concept, and what it means in the context of AI alignment. There are open questions here, obviously, about the efficacy of debate, and about how debate sits among the epistemological tools that allow us to arrive at truth and, I guess, infer other people’s preferences. Sorry: again, in terms of reward learning and AI alignment, and debate’s place in all of this, just contextualize its role in AI alignment more broadly.

Geoffrey: It’s focusing, again, on the scalability aspect. One way to formulate that is we have this sort of notion of, either from a philosophy side, reflective equilibrium, or kind of from the AI alignment literature, coherent extrapolated volition, which is sort of what a human would do if we had thought very carefully for a very long time about a question, and sort of considered all the possible nuances, and counterarguments, and so on, and kind of reached the conclusion that is sort of free of inconsistencies.

Then, we’d like to take this kind of vague notion of, what happens when a human thinks for a very long time, and compress it into something we can use as an algorithm in a machine learning context. It’s also a definition. This vague notion of, let a human think for a very long time, that’s sort of a definition, but it’s kind of a strange one. A single human can’t think for a super long time. We don’t have access to that at all. You sort of need a definition that is more factored, where either a bunch of humans think for a long time, we sort of break up tasks, or you sort of consider only parts of the argument space at a time, or something.

You go from there to things that are both definitions of what it means to simulate thinking for long time and also algorithms. The first one of these is Amplification from Paul Christiano, and there you have some questions, and you can’t answer them directly, but you know how to break up a question into subquestions that are hopefully somewhat simpler, and then you sort of recursively answer those subquestions, possibly breaking them down further. You get this big tree of all possible questions that descend from your outer question. You just sort of imagine that you’re simulating over that whole tree, and you come up with an answer, and then that’s the final answer for your question.
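The amplification scheme described here, recursively decomposing a question into subquestions until the leaves are directly answerable, might be sketched roughly as follows. The decomposer, base answerer, and combiner are hypothetical stand-ins for the human and model roles:

```python
def amplify(question, decompose, answer_directly, combine, depth=10):
    """Recursively break a question into subquestions, answer the leaves
    directly, and combine sub-answers back up the tree."""
    direct = answer_directly(question)
    if depth == 0 or direct is not None:
        return direct
    subquestions = decompose(question)
    sub_answers = [amplify(q, decompose, answer_directly, combine, depth - 1)
                   for q in subquestions]
    return combine(question, sub_answers)

# Toy instantiation: "sum the integers lo..hi", decomposed into halves.
def answer_directly(q):
    lo, hi = q
    return lo if lo == hi else None   # only single numbers are "easy" leaves

def decompose(q):
    lo, hi = q
    mid = (lo + hi) // 2
    return [(lo, mid), (mid + 1, hi)]

combine = lambda q, subs: sum(subs)   # merge sub-answers into the parent answer

total = amplify((1, 4), decompose, answer_directly, combine)
```

The exponential tree of subquestions is only ever partially expanded in practice; the hope stated above is that the trained agents generalize across the tree.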

Similarly, Debate is a variant of that, in the sense that you have this kind of tree of all possible arguments, and you’re going to try to simulate somehow what would happen if you considered all possible arguments, and picked out the most important ones, and summarized that into an answer for your question.

The broad goal here is to give a practical definition of what it means to take human input and push it to its conclusion, and then hopefully we have a definition that also works as an algorithm, where we can do practical ML training to train machine learning models.

Lucas: Right, so there’s, I guess, two thoughts that I sort of have here. The first one is that there is just sort of this fundamental question of what is AI alignment? It seems like in your writing, and in the writing of others at OpenAI, it’s to get AI to do what we want them to do. What we want them to do is … either it’s what we want them to do right now, or what we would want to do under reflective equilibrium, or at least we want to sort of get to reflective equilibrium. As you said, it seems like a way of doing that is compressing human thinking, or doing it much faster somehow.

Geoffrey: One way to say it is that we want to do what humans would want if we understood all of the consequences. It’s some kind of “do what humans want,” plus a side condition of “imagine we knew everything we needed to know to evaluate the question.”

Lucas: How does Debate scale to that level of compressing-

Geoffrey: One thing we should say is that everything here is a limiting state or a goal, but not something we’re going to reach. It’s more important that we have closure under the relevant things we might not have thought about. Here are some practical examples from nearer-term misalignment. There’s an experiment in social science where they sent out a bunch of resumes in response to classified job ads, and the resumes were paired off into pairs that were identical except that the name of the person was either white-sounding or black-sounding. The result was that you got significantly higher callback rates if the name sounded white, even with a resume entirely identical to that of the person with the black-sounding name.

Here’s a situation where direct human judgment is bad in a way that we can clearly identify. You could imagine trying to push that into the task by having an agent say, “Okay, here is a resume. We’d like you to judge it,” either pointing explicitly to what they should judge, or pointing out, “You might be biased here. Try to ignore the name on the resume, and focus on this issue, say their education or their experience.” You would hope that if you have a mechanism for surfacing concerns or surfacing counterarguments, you can get to a stronger version of human decision making. There’s no need to wait for some long-term, very strong agent case for this to be relevant, because we’re already pretty bad at making decisions in simple ways.

Then, broadly, I sort of have this sense that there’s not going to be magic in decision making. If I go to some very smart person, and they have a better idea for how to make a decision, or how to answer a question, I expect there to be some way they could explain their reasoning to me. I don’t expect I just have to take them on faith. We want to build methods that surface the reasons they might have to come to a conclusion.

Now, it may be very difficult for them to explain the process for how they came to those arguments. There’s some question about whether the arguments they’re going to make is the same as the reasons they’re giving the answers. Maybe they’re sort of rationalizing and so on. You’d hope that once you sort of surface all the arguments around the question that could be relevant, you get a better answer than if you just ask people directly.

Lucas: As we move out of debate in simple cases of image classifiers or experiments in similar environments, what does debate look like … I don’t really understand the ways in which the algorithms can be trained to elucidate all of these counterconcerns, and all of these different arguments, in order to help human beings arrive at the truth.

Geoffrey: One case we’re considering, especially on kind of the human experiment side, or doing debates with humans, is some sort of domain expert debate. The two debaters are maybe an expert in some field, and they have a bunch of knowledge, which is not accessible to the judge, which is maybe a reasonably competent human, but doesn’t know the details of some domain. For example, we did a debate where there were two people that knew computer science and quantum computing debating a question about quantum computing to a person who has some background, but nothing in that field.

The idea is you start out, there’s a question. Here, the question was, “Is the complexity class BQP equal to NP, or does it contain NP?” One point is that you don’t have to know what those terms mean for that to be a question you might want to answer, say in the course of some other goal. The first steps, things the debaters might say, is they might give short, intuitive definitions for these concepts and make their claims about what the answer is. You might say, “NP is the class of problems where we can verify solutions once we’ve found them, and BQP is the class of things that can run on a quantum computer.”

Now, you could have a debater that just straight up lies right away and says, “Well, actually NP is the class of things that can run on fast randomized computers.” That’s just wrong, and so what would happen then is that the counter debater would just immediately point to Wikipedia and say, “Well, that isn’t the definition of this class.” The judge can look that up, they can read the definition, and realize that one of the debaters has lied, and the debate is over.

You can’t immediately lie in kind of a simple way or you’ll be caught out too fast and lose the game. You have to sort of tell the truth, except maybe you kind of slightly veer towards lying. This is if you want to lie in your argument. At every step, if you’re an honest debater, you can try to pin the liar down to making sort of concrete statements. In this case, if say someone claims that quantum computers can solve all of NP, you might say, “Well, you must point me to an algorithm that does that.” The debater that’s trying to lie and say that quantum computers can solve all of NP might say, “Well, I don’t know what the algorithm is, but meh, maybe there’s an algorithm,” and then they’re probably going to lose, then.

Maybe they have to point to a specific algorithm. There is no algorithm, so they have to make one up. That will be a lie, but maybe it’s kind of a subtle complicated lie. Then, you could kind of dig into the details of that, and maybe you can reduce the fact that that algorithm is a lie to some kind of simple algebra, which either the human can check, maybe they can ask Mathematica or something. The idea is you take a complicated question that’s maybe very broad and covers a lot of the knowledge that the judge doesn’t know and you try to focus in closer and closer on details of arguments that the judge can check.

What the judge needs to be able to do is kind of follow along in the steps until they reach the end, and then there’s some ground fact that they can just look up or check and see who wins.

Lucas: I see. Yeah, that’s interesting. A brief passing thought is about double cruxes and some tools and methods that CFAR employs, and how they might be interesting or useful in debate. I think I also want to provide some more clarification here, beyond debate being a truth-seeking process, or a method by which we’re able to see which agent is being truthful and which agent is lying. There’s a claim in your paper that seems central to this, where you say, “In the debate game, it is harder to lie than to refute a lie.” This asymmetry in debate between the liar and the truth-teller should hopefully, in general, bias the game towards people more easily seeing who is telling the truth.

Geoffrey: Yep.

Lucas: In terms of AI alignment again, in the examples that you’ve provided, it seems to help human beings arrive at truth for complex questions that are above their current level of understanding. How does this, again, relate directly to reward learning or value learning?

Geoffrey: Let’s assume that in this debate game it is the case that it’s very hard to lie, so the winning move is to tell the truth. What we want to do then is train two systems. One system will be able to reproduce human judgment: it would look at a debate transcript and predict which side the human would say is the correct winner of the debate. So that system is learning, not a direct reward, but some notion of predicting how humans judge reasoning. Once you learn that bit, then you can train an agent to play this game.

Then we have a zero-sum game, and we can apply any technique used to play a zero-sum game, like Monte Carlo tree search in AlphaGo, or just straight-up RL algorithms, as in some of OpenAI’s work. The hope is that you can train an agent to play this game very well, and it will therefore be able to predict where counterarguments exist that would help it win debates. If it plays the game well, and the best way to play the game is to tell the truth, then you end up with a value-aligned system. Those are large assumptions, and you should be cautious about whether they are true.
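The two training stages outlined here, first fitting a judge model to human verdicts on transcripts, then training debaters by self-play against that judge, could look schematically like this toy sketch. The lookup-table "judge" and the exhaustive "self-play" loop are drastic simplifications for illustration only:

```python
import random

# Sketch of the two stages: (1) fit a judge to human verdicts on transcripts,
# (2) train debaters by self-play against that judge. Everything here is an
# illustrative toy, not OpenAI's actual training setup.

def train_judge(human_labelled):
    """'Learn' a judge as a lookup of human verdicts (stand-in for a model).
    Unseen transcripts get a random verdict, mimicking model uncertainty."""
    table = dict(human_labelled)
    return lambda transcript: table.get(transcript) or random.choice(["A", "B"])

def self_play(judge, candidate_moves, n_games=100):
    """Score each opening move by how often the judge awards the win to A,
    and return the most winning move (the 'trained' policy's choice)."""
    wins = {m: 0 for m in candidate_moves}
    for _ in range(n_games):
        for move in candidate_moves:
            if judge(move) == "A":
                wins[move] += 1
    return max(wins, key=wins.get)

# Humans judged a truthful opening as a win for A, a lying opening as a loss.
judge = train_judge([("truthful opening", "A"), ("lying opening", "B")])
best = self_play(judge, ["truthful opening", "lying opening"])
```

If the judge reliably rewards truth-telling, the best-scoring policy is the truthful one, which is exactly the assumption flagged above as needing caution.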

Lucas: There are also all these issues that we can get into about biases that humans have, and issues with debate: whether you’re just going to be optimizing the agents for exploiting human biases and convincing humans. Even just looking at how human beings value-align with each other, debate is one thing in a large toolbox of things, and in AI alignment it seems like Debate will potentially also be one thing in a large toolbox of things that we use. I’m not sure what your thoughts are about that.

Geoffrey: I could give them. I would say that there’s two ways of approaching AI safety and AI alignment. One way is to try to propose, say, methods that do a reasonably good job at solving a specific problem. For example, you might tackle reversibility, which means don’t take actions that can’t be undone, unless you need to. You could try to pick that problem out and solve it, and then imagine how we’re going to fit this together into a whole picture later.

The other way to do it is try to propose algorithms which have at least some potential to solve the whole problem. Usually, they won’t, and then you should use them as a frame to try to think about how different pieces might be necessary to add on.

For example, in debate, the biggest concern is that you might train a debate agent that gets very good at this task, the task is rich enough that it learns a whole bunch of things about the world and about how to think about the world, and maybe it ends up having separate goals. It’s certainly not clearly aligned, because its goal is to win the game, and maybe winning the game is not exactly aligned.

You’d like to know not only what it’s saying, but why it’s saying things. You could imagine adding interpretability techniques to this. Say Alice and Bob are debating; Alice says something, and Bob says, “Well, Alice only said that because Alice is thinking some malicious fact.” If we have solid interpretability techniques, we could point into Alice’s thoughts at that fact, pull it out, and surface it. Then you could imagine a strengthened version of debate where you could argue not only about object-level things, using language, but about the thoughts of the other agent, talking about motivation.

It is a goal here, in formulating something like debate or amplification, to propose a complete algorithm that would solve the whole problem. Often we won’t get to that point, but we then have a frame where we can think about the whole picture in the context of this algorithm, and fix it as required going forward.

I think, in the end, I do view debate, if it succeeds, as potentially the top level frame, which doesn’t mean it’s the most important thing. It’s not a question of importance. More of just what is the underlying ground task that we want to solve? If we’re training agents to either play video games or do question/answers, here the proposal is train agents to engage in these debates and then figure out what parts of AI safety and AI alignment that doesn’t solve and add those on in that frame.

Lucas: You’re trying to achieve human level judgment, ultimately, through a judge?

Geoffrey: The assumption in this debate game is that it’s easier to be a judge than a debater. If it is the case, though, that you need the judge to get to human level before you can train a debater, then you have a problematic bootstrapping issue where, first you must solve value alignment for training the judge. Only then do you have value alignment for training the debater. This is one of the concerns I have. I think the concern sort of applies to some of other scalability techniques. I would say this is sort of unresolved. The hope would be that it’s not actually sort of human level difficult to be a judge on a lot of tasks. It’s sort of easier to check consistency of, say, one debate statement to the next, than it is to do long, reasoning processes. There’s a concern there, which I think is pretty important, and I think we don’t quite know how it plays out.

Lucas: The view is that we can assume, or take, the human being to be the thing that is already value-aligned, and the process by which … and it’s important, I think, to highlight the second part of what you say: the debaters are pointing out considerations, and the winner is whichever debater is saying that which is most true and useful. The useful part, I think, shouldn’t be glossed over, because you’re not just optimizing debaters to arrive at true statements. The useful part smuggles in a lot of issues with normative things in ethics and metaethics.

Geoffrey: Let’s talk about the useful part.

Lucas: Sure.

Geoffrey: Say we just ask the question of debaters, “What should we do? What’s the next step that I, as an individual person, or my company, or the whole world should take in order to optimize total utility?” The notion of useful, then, is just what is the right action to take? Then, you would expect a debate that is good to have to get into the details of why actions are good, and so that debate would be about ethics, and metaethics, and strategy, and so on. It would pull in all of that content and sort of have to discuss it.

There’s a large sea of content you have to pull in. It’s roughly kind of all of human knowledge.

Lucas: Right, right, but isn’t there this gap between training agents to say what is good and useful and for agents to do what is good and useful, or true and useful?

Geoffrey: The way in which there’s a gap is this interpretability concern. You’re getting at a different gap, which I think is actually not there. I like giving game analogies, so let me give a Go analogy. You could imagine that there’s two goals in playing the game of Go. One goal is to find the best moves. This is a collaborative process where all of humanity, all of sort of Go humanity, say, collaborates to learn, and explore, and work together to find the best moves in Go, defined by, what are the moves that most win this game? That’s a non-zero sum game, where we’re sort of all working together. Two people competing on the other side of the Go board are working together to get at what the best moves are, but within a game, it’s a zero sum game.

You sit down, and you have two players, two people playing a game of Go, and one of them is going to win: zero-sum. The fact that that game is zero-sum doesn’t mean that we’re not learning some broad thing about the world, if you zoom out a bit and look at the whole process.

We’re training agents to win this debate game, to give the best arguments, but the thing we want to zoom out and get is the best answers: the best answers that are consistent with all the reasoning that we can bring into this task. There are huge questions to be answered about whether the system actually works. There’s an intuitive notion of, say, reflective equilibrium, or coherent extrapolated volition, and whether debate achieves that is a complicated question, both empirical and theoretical, that we have to deal with. But I don’t think there’s quite the gap you’re getting at, though I may not have voiced your thoughts correctly.

Lucas: It would be helpful if you could unpack how the alignment that is gained through this process is transferred to new contexts. If I take an agent trained to win the Debate game outside of that context.

Geoffrey: You don’t. We don’t take it out of the context.

Lucas: Okay, so maybe that’s why I’m getting confused.

Geoffrey: Ah. I see. Okay, this [inaudible 00:26:09]. We train agents to play this debate game. To use them, we also have them play the debate game. By training time, we give them kind of a rich space of questions to think about, or concerns to answer, like a lot of discussion. Then, we want to go and answer a question in the world about what we should do, what the answer to some scientific question is, is this theorem true, or this conjecture true? We state that as a question, and we have them debate, and then whoever wins, they gave the right answer.

There’s a couple of important things you can add to that. I’ll give like three levels of kind of more detail you can go. One thing is the agents are trained to look at state in the debate game, which could be I’ve just given the question, or there’s a question and there’s a partial transcript, and they’re trained to say the next thing, to make the next move in the game. The first thing you can do is you have a question that you want to answer, say, what should the world do, or what should I do as a person? You just say, “Well, what’s the first move you’d make?” The first move they’d make is to give an answer, and then you just stop there, and you’re done, and you just trust that answer is correct. That’s not the strongest thing you could do.

The next thing you can do is use the model of a judge that you’ve trained to predict human judgment. You could have the agents, from the start of the game, play a whole bunch of games, say 1,000 games of debate, and from that learn with more accuracy what the answer might be. It’s similar to playing a game of Go: if you want to know the best move, you would say, “Well, let’s play 1,000 games of Go from this state. We’ll get more evidence and we’ll know what the best move is.”

The most interesting thing you can do, though, is you yourself can act as a judge in this game to sort of learn more about what the relevant issues are. Say there’s a question that you care a lot about. Hopefully, “What should the world do,” is a question you care a lot about. You want to not only see what the answer is, but why. You could act as a judge in this game, and you could, say, play a few debates, or explore part of this debate tree, the tree of all possible debates, and you could do the judgment yourself. There, the end answer will still be who you believe is the right answer, but the task of getting to that answer is still playing this game.

The bottom line here is, at test time, we are also going to debate.

Lucas: Yeah, right. Human beings are going to be participating in this debate process, but does debate translate into systems which autonomously decide what we ought to do, given that we assume their models of human judgment on debate are at human level or above?

Geoffrey: Yeah, so if you turn off the human in the loop part, then you get an autonomous agent. If the question is, “What should the next action be in, say, an environment?” And you don’t have humans in the loop at test time, then you can get an autonomous agent. You just sort of repeatedly simulate debating the question of what to do next. Again, you can cut this process short. Because the agents are trained to predict moves in debate, you can stop them after they’ve predicted the first move, which is what the answer is, and then just take that answer directly.
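The autonomous-agent reduction described here, running a (possibly short-circuited) debate on "what should we do next?" at every environment step and taking the opening move as the action, might be sketched as follows. The environment and the debater policy are toy assumptions:

```python
# Sketch of the "autonomous agent" reduction: at every environment step, ask
# the trained debater for the opening move of a debate on "what should we do
# next?" and, cutting the debate short, treat that move as the action itself.

def act_autonomously(initial_state, step_env, debate_first_move, horizon=5):
    """Repeatedly ask the debater for the next action and apply it."""
    state, history = initial_state, []
    for _ in range(horizon):
        # Short-circuit: the first debate move *is* the proposed action.
        action = debate_first_move("What should we do next?", state, history)
        state = step_env(state, action)
        history.append(action)
    return state, history

# Toy environment: walk right along a number line until reaching position 3.
step_env = lambda s, a: s + a
debater = lambda q, s, h: 1 if s < 3 else 0   # hypothetical trained policy

final, actions = act_autonomously(0, step_env, debater, horizon=5)
```

Keeping the human in the loop would mean replacing the short-circuit with an actual judged debate at the steps that matter.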

If you wanted the maximally efficient autonomous agent, that’s what you would do. My view of our goal at OpenAI is that I don’t want to take AGI and immediately deploy it in the most fast-twitch tasks, something like self-driving a car. If we get to human-level intelligence, I’m not going to just replace all the self-driving cars with AGI and let them do their thing. We want to use this for the tasks where we need very strong capabilities. Ideally, those tasks are slower and more deliberative, so we can afford to, say, take a minute to interact with the system, or take a minute to have the system engage in its own internal debates to get more confidence in its answers.

The model here is basically the Oracle AI model, rather than the autonomous agent operating in an MDP.

Lucas: I think that this is a very important part to unpack a bit more. This distinction here that it’s more like an oracle and less like an autonomous agent going around optimizing everything. What does a world look like right before, during, after AGI given debate?

Geoffrey: The way I think about this is that an oracle here is a question/answer system of some complexity. You ask it questions, possibly with a bunch of context attached, and it gives you answers. You can reduce pretty much anything to an oracle, if the oracle is general enough. If your goal is to take actions in an environment, you can ask the oracle, “What’s the best action to take in the next step?” and just iteratively ask that oracle over and over again as you take the steps.

Lucas: Or you could generate the debate, right? Over the future steps?

Geoffrey: The most direct way to do an MDP with Debate is to engage in a debate at every step: restart the debate process, show all the history that’s happened so far, and say the question at hand that we’re debating is, what’s the best action to take next? I’m relatively optimistic that when we make AGI, for a while after we make it, we will be using it in ways that aren’t extremely fine-grained and MDP-like, in the sense of taking a million actions in a row that all hit the environment.

We’d mainly use this full direct reduction, though there are more practical reductions for other questions. I’ll give an example. Say you want to write the best book on, say, metaethics, and you’d like debaters to produce this book. Let’s say the debaters are optimal agents, so they know how to debate any subject. Even if the book is 1,000 pages long, or say a couple hundred pages, a more reasonable book, you could do it in a single debate, as follows. Ask the agents to write the book. Each agent writes its own book, say, and you ask them to debate which book is better, and that debate only needs to point at small parts of the books.

One of the debaters writes a 300 page book and buried in the middle of it is a subtle argument, which is malicious and wrong. The other debater need only point directly at the small part of the book that’s problematic and say, “Well, this book is terrible because of the following malicious argument, and my book is clearly better.” The way this works is, if you are able to point to problematic parts of books in a debate, and therefore win, the best first move in the debate is to write the best book, so you can do it in one step, where you produce this large object with a single debate, or a single debate game.

The reason I mention this is that it’s a little better, in terms of practicality, than debating the book piece by piece. If the book is 100,000 words, you wouldn’t want to have a debate about each word, one after another. That’s a silly, very expensive process.
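The book example can be made concrete with a small sketch: each debater submits a large artifact, the opponent cites one small flawed passage, and the judge checks only the cited passage rather than the whole book. The flaw-finder and the judge's checker below are toy stand-ins:

```python
# Sketch of debating a large artifact (a "book") by pointing at one flawed
# passage, rather than debating every word. All checkers are toy stand-ins.

def debate_books(book_a, book_b, find_flaw, judge_checks):
    """Each debater submits a book; the opponent cites one flawed passage;
    the judge verifies only the cited passages, not the whole books."""
    flaw_in_a = find_flaw(book_a)
    flaw_in_b = find_flaw(book_b)
    a_bad = flaw_in_a is not None and judge_checks(book_a[flaw_in_a])
    b_bad = flaw_in_b is not None and judge_checks(book_b[flaw_in_b])
    if a_bad and not b_bad:
        return "B"
    if b_bad and not a_bad:
        return "A"
    return "tie"

# A sound 300-passage book versus one with a single buried malicious argument.
honest_book = ["sound claim"] * 300
bad_book = ["sound claim"] * 150 + ["malicious argument"] + ["sound claim"] * 149

find_flaw = lambda book: next((i for i, p in enumerate(book)
                               if p == "malicious argument"), None)
judge_checks = lambda passage: passage == "malicious argument"

winner = debate_books(honest_book, bad_book, find_flaw, judge_checks)
```

Because citing one flawed passage is enough to lose, the best opening move is to write a book with no flawed passages, which is the one-step property described above.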

Lucas: Right, so just to back up here and provide a little bit more framing: at the beginning, at a very low level, we’re just trying to optimize agents for debate, and there’s an asymmetry here that we predict, that it should, in general, be easier to tell who’s telling the truth than who’s not, because it’s easier to tell the truth than to lie in convincing ways. Scaling from there, it seems that what we ultimately want is to then be able to train a judge, right?

Geoffrey: The goal is to train … You need both.

Lucas: Right. You need both to scale up together.

Geoffrey: Yep.

Lucas: Through doing so, we will have oracles that will be able to assist in decision making, because what we’re doing is amplifying human explicit deliberative processes. Again, this tying into coherent extrapolated volition and reflective equilibrium. What we want is to be able to amplify human decision making, which I guess is, in a broad sense, value alignment.

Geoffrey: Right.

Lucas: Are there any other technical motivations behind debate that you would like to unpack for why within the set of all these different AI alignment methodologies, why debate is so important?

Geoffrey: Sure. I think maybe the right way to talk about that is to contrast amplification and debate. I got to Debate by thinking a bunch about Amplification, talking a bunch with Paul Christiano. To summarize it for listeners again, Amplification is take questions, decompose them into subquestions that are hopefully easier to answer, repeat that decomposition until we get to things we can check, and use that whole process to answer complicated questions.

In stock amplification, the human is doing the task of decomposing the questions, and the human decomposition has to be complete. For example, going back to this resume example, you might have a question about whether someone whose name sounds white or black is a good candidate. Suppose no one here wants to be racist, but they’re all just completely oblivious to the idea that they could be racist. They’re going through and discussing whether the candidate is suitable, breaking these questions down, and just subtly, at every point, racism is inching its way into the answers, and they just never thought to investigate.

If at some point they had asked, “Well, am I deciding this because the person has a white-sounding name?” they would have realized their mistake, but they never think of that. In debate, you are relying on the agents to be able to do the decomposition, so an agent can, even if it’s hard for a human to do it, point out, “Maybe you’re being racist here,” and get the correction. The advantage of debate there is that you get some help doing this decomposition of tasks.

The other thing that happened, frankly, which is one of the reasons I thought of debate, was AlphaGo. In thinking about amplification, I’ve been sort of concerned. “Is this process going to be scalable? Are we going to lose a bunch of efficiency in doing this complicated decomposition process?” I was sort of concerned that we would lose a bunch of efficiency and therefore be not competitive with unsafe techniques to getting to AGI.

Then, AlphaGo came out, and AlphaGo got very strong performance, and it did it by doing an explicit tree search. As part of AlphaGo, it’s doing this kind of deliberative process, and that was not only important for performance at test time, but was very important for getting the training to work. What happens is, in AlphaGo, at training time, it’s doing a bunch of tree search through the game of Go in order to improve the training signal, and then it’s training on that improved signal. That was one thing kind of sitting in the back of my mind.

I was kind of thinking through, then, the following way of thinking about alignment. At the beginning, we’re just training on direct answers. We have these questions we want to answer, an agent answers the questions, and we judge whether the answers are good. You sort of need some extra piece there, because maybe it’s hard to understand the answers. Then, you imagine training an explanation module that tries to explain the answers in a way that humans can understand. Then, those explanations might be kind of hard to understand, too, so maybe you need an explanation explanation module.

For a long time, it felt like that was just ridiculous epicycles, adding more and more complexity. There was no clear end to that process, and it felt like it was going to be very inefficient. When AlphaGo came out, that snapped into focus, and it was like, “Oh. If I train the explanation module to find flaws, and I train the explanation-explanation module to find flaws in flaws, then that becomes a zero-sum game. If it turns out that ML is very good at solving zero-sum games, and zero-sum games are a powerful route to strong performance, then we should take advantage of this in safety.” Poof: that answer, explanation, explanation-explanation route gives you the zero-sum game of Debate.

That’s roughly sort of how I got there. It was a combination of thinking about Amplification and this kick from AlphaGo, that zero-sum games and search are powerful.

Lucas: In terms of the relationship between debate and amplification, can you provide a bit more clarification on the differences, fundamentally, between the process of debate and amplification? In amplification, there's a decomposition process, breaking problems down into subproblems, eventually trying to get the broken-down problems to human-level problems. The problem has essentially doubled itself many times over at this point, right? It seems like there's going to be a lot of questions for human beings to answer. I don't know how interrelated debate is to this decompositional argumentative process.

Geoffrey: They’re very similar. Both Amplification and Debate operate on some large tree. In amplification, it’s the tree of all decomposed questions. Let’s be concrete and say the top level question in amplification is, “What should we do?” In debate, again, the question at the top level is, “What should we do?” In amplification, we take this question. It’s a very broad open-ended question, and we kind of break it down more and more and more. You sort of imagine this expanded tree coming out from that question. Humans are constructing this tree, but of course, the tree is exponentially large, so we can only ever talk about a small part of it. Our hope is that the agents learn to generalize across the tree, so they’re learning the whole structure of the tree, even given finite data.

In the debate case, similarly, you have the top-level question of, "What should we do," or some other question, and you have the tree of all possible debates. Imagine every move in this game is, say, saying a sentence, and at every point, you have maybe an exponentially large number of possible sentences, so the branching factor in the tree is very large. The goal in debate is kind of to see this whole tree.

Now, here is the correspondence. In amplification, the human does the decomposition, but I could instead have another agent do the decomposition. I could say I have a question, and instead of a human saying, “Well, this question breaks down into subquestions X, Y, and Z,” I could have a debater saying, “The subquestion that is most likely to falsify this answer is Y.” It could’ve picked at any other question, but it picked Y. You could imagine that if you replace a human doing the decomposition with another agent in debate pointing at the flaws in the arguments, debate would kind of pick out a path through this tree. A single debate transcript, in some sense, corresponds to a single path through the tree of amplification.

Lucas: Does the single path through the tree of amplification elucidate the truth?

Geoffrey: Yes. The reason it does is it’s not an arbitrarily chosen path. We’re sort of choosing the path that is the most problematic for the arguments.

Lucas: In this exponential tree search, there’s heuristics and things which are being applied in general to the tree search in order to collapse onto this one branch or series?

Geoffrey: Let's say, in amplification, we have a question. Our decomposition is, "Well, this decomposes into X, Y, and Z," and then we recursively call the agent, and it says, "The answers are AX, AY, AZ, for these questions." Now, if I trusted those subanswers, I could do the reconstruction of the answer to the original question. If I don't trust the subanswers, I might say, "Well, which subanswer is most likely to be false?" The correspondence with debate is that the other debating agent would point to which one of the subanswers is probably false, and then you recurse down just that branch.

Geoffrey: Here's the way it works. You can trust amplification if all of the subanswers are correct. If at least one of them is false, a strong debater can find which of those subanswers is false and recurse down to it. That's why, if it turns out that a single path chosen by strong debaters never finds a false subanswer, then you will have established that the whole tree is true.
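The correspondence can be made concrete with a toy tree check. This is a hypothetical sketch: each node carries a truth flag standing in for "this subanswer is correct," and we assume the opposing debater always finds a false subanswer when one exists:

```python
def debate_check(node):
    """Follow one debate path through an amplification tree: a strong
    debater recurses into a false subanswer if one exists; a leaf is
    checked directly by the judge."""
    children = node.get("children", [])
    if not children:
        return node["true"]            # leaf: the judge verifies directly
    false_children = [c for c in children if not c["true"]]
    if not false_children:
        return True                    # all subanswers hold: accept
    return debate_check(false_children[0])  # recurse into the flaw

# An answer whose subanswers all hold survives the single debate path;
# an answer hiding one false leaf is caught.
good = {"true": True, "children": [{"true": True, "children": []},
                                   {"true": True, "children": []}]}
bad = {"true": True, "children": [{"true": True, "children": []},
                                  {"true": False, "children": []}]}
```

Note that the whole exponential tree is never built; a single path suffices precisely because of the strong-debater assumption Geoffrey flags below.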

Lucas: I see. I think this picture of, “What should we do,” at the top, and then the trees would fall from that in terms of one version of amplification and one version of debate is pretty helpful here.

Geoffrey: The most important thing here is that in Amplification, the branching factor is small. I'm a human, I see a question, I decompose that into a relatively small number of questions, say two to 10. In debate, the branching factor is very large. A debate agent could make any statement as the next countermove. What this means is that, assuming strong play, because the branching factor is so large, a shallow debate tree is more powerful than a shallow amplification tree: you can cover more terrain with a couple of steps of debate that are a proxy for a large tree.

The problem with that statement, and maybe the biggest advantage of amplification, is you really have to be assuming that the agents are strong enough to cover the tree. Say I'm debating whether there's a black swan, and there in fact is a black swan, but there's only one of them in the entire world, and the correct countermove would be pointing at that black swan, finding it out of the entire world. A debater who wanted to say, "There are no black swans," could just claim, "There are no black swans. Trust me. The other agent would be able to point to one if it existed." Maybe it's the case that a black swan does exist, but the other agent is just too weak to point at the black swan, and so that debate doesn't work.

This argument that shallow debates are powerful leans a whole lot on debaters being very strong, and debaters in practice will not be infinitely strong, so there's a bunch of subtlety there that we're going to have to wrestle with.

Lucas: It would also be, I think, very helpful if you could let us know how you optimize for strong debaters, and how amplification is possible here if human beings are the ones doing the decompositions of the questions.

Geoffrey: Whichever one we choose, whether it's amplification, debate, or some entirely different scheme, if it depends on humans in one of these elaborate ways, we need to do a bunch of work to know that humans are going to be able to do this. In amplification, you would expect to have to train people to think about what kinds of decompositions are the correct ones. My bias is that because debate gives the humans more help in pointing out the counterarguments, it may be cognitively kinder to the humans, and therefore that could make it a better scheme. That's one of the advantages of debate.

The technical analogy there is a shallow debate argument. The human side is, if someone is pointing out the arguments for you, it’s cognitively kind. In amplification, I would expect you’d need to train people a fair amount to have the decomposition be reliably complete. I don’t know that I have a lot of confidence that you can do that. One way you can try to do it is, as much as possible, systematize the process on the human side.

In either one of these schemes, we can give the people involved an arbitrary amount of training and instruction in whatever way we think is best, and we’d like to do the work to understand what forms of instruction and training are most truth seeking, and try to do that as early as possible so you have a head start.

I would say I'm not going to be able to give you a great argument for optimism about amplification. This is a discussion that Paul, Andreas Stuhlmueller, and I have had, where I think Paul and Andreas lean towards these metareasoning arguments, where if you wanted to answer the question, "Where should I go on vacation," the first subquestion is, "What would be a good way to decide where to go on vacation?" You quickly go meta, and maybe you go meta, meta, and it's kind of a mess. Whereas the hope is that because in debate you sort of have help pointing to things, you can stay much more object level, where the first step in a debate about where to go on vacation is just Bali or Alaska. You give the answer and then you focus in on more …

For a broader class of questions, you can stay at object-level reasoning. Now, if you want to get to metaethics, you would have to bring in that kind of meta reasoning. It should be a goal of ours, for a fixed task, to try to use the simplest kind of human reasoning possible, because then we should expect to get better results out of people.

Lucas: All right. Moving forward. Two things. The first that would be interesting would be if you could unpack this process of training up agents to be good debaters, and to be good predictors of human decision making regarding debates, what that’s actually going to look like in terms of your experiments, currently, and your future experiments. Then, also just pivoting into discussing reasons for optimism and pessimism about debate as a model for AI alignment.

Geoffrey: On the experiment side, as I mentioned, we’re trying to get into the natural language domain, because I think that’s how humans debate and reason. We’re doing a fair amount of work at OpenAI on core ML language modeling, so natural language processing, and then trying to take advantage of that to prototype these systems. At the moment, we’re just doing what I would call zero step debate, or one step debate. It’s just a single agent answering a question. You have question, answer, and then you have a human kind of judging whether the answer is good.

The task of predicting an answer is just: read a bunch of text and predict a number. That is essentially a standard NLP task, and you can use standard methods from NLP on that problem. The hope is that because it looks so standard, we can carry over development on the capability side in natural language processing to the safety side. Predicting the result is just a matter of taking whatever the most powerful natural language processing architecture and method is, and applying it to this task.

Similarly, the task of answering questions is also a natural language task, just a generative one. If you're answering questions, you read a bunch of text that is maybe the context of the question, and you produce an answer, and that answer is just a bunch of words that you spit out via a language model. If you're doing, say, a two-step debate, where you have question, answer, counterargument, then similarly, you have a language model that spits out an answer, and a language model that spits out the counterargument. Those can in fact be the same language model; you just flip the reward at some point. An agent is rewarded for answering well while it's spitting out the answer, and then when it's spitting out the counteranswer, you just reward it for falsifying the answer. It's still just a generative language task with a slightly exotic reward.

Going forwards, we expect there to need to be something like … This is not actually high confidence. Maybe there’s things like AlphaGo zero style tree search that are required to make this work very well on the generative side, and we will explore those as required. Right now, we need to falsify the statement that we can just do it with stock language modeling, which we’re working on. Does that cover the first part?

Lucas: I think that’s great in terms of the first part, and then again, the second part was just places to be optimistic and pessimistic here about debate.

Geoffrey: Optimism, I think we’ve covered a fair amount of it. The primary source of optimism is this argument that shallow debates are already powerful, because you can cover a lot of terrain in argument space with a short debate, because of the high branching factor. If there’s an answer that is robust to all possible counteranswers, then it hopefully is a fairly strong answer, and that gets stronger as you increase the number of steps. This assumes strong debaters. That would be a reason for pessimism, not optimism. I’ll get to that.

The top two is that one, and then the other part is that ML is pretty good at zero-sum games, particularly zero-sum perfect information games. There have been these very impressive headline results from AlphaGo at DeepMind, and Dota at OpenAI, and a variety of other games. In general, for zero-sum, close-to-perfect-information games, we roughly know how to do them, at least in the not-too-high branching factor case. There's an interesting thing where if you look at the algorithms, say, for playing poker, which is zero-sum two-player but imperfect information, or the algorithms for playing, say, 10-player games, they're just much more complicated. They don't work as well.

I like the fact that debate is formulated as a two-player zero-sum perfect information game, because we seem to have better algorithms to play those with ML. This is both practically true, in that it is in practice easier to play them, and there's also a bunch of theory that says that two-player zero-sum is a different complexity class than, say, two-player non-zero-sum or N-player. The complexity class gets harder, and you need nastier algorithms. Finding a Nash equilibrium in a general game that's either non-zero-sum or has more than two players is PPAD-complete, even in the tabular case with a small game; with two-player zero-sum, that problem is convex and has a polynomial-time solution. It's a nicer class. I expect there to continue to be better algorithms to play those games. I like formulating safety as that kind of problem.
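As a small illustration of why the zero-sum two-player case is so tractable: a 2x2 zero-sum game even has a closed-form mixed equilibrium. This is textbook game theory rather than anything specific to debate, and the formula assumes no pure-strategy saddle point:

```python
def solve_2x2_zero_sum(a, b, c, d):
    """Equilibrium of a 2x2 zero-sum game with row-player payoff matrix
    [[a, b], [c, d]], assuming no pure-strategy saddle point.
    Returns (probability the row player plays row 0, game value)."""
    denom = a - b - c + d
    p = (d - c) / denom
    v = (a * d - b * c) / denom
    return p, v

# Matching pennies: the unique equilibrium is to randomize 50/50,
# and the game value is 0.
p, v = solve_2x2_zero_sum(1, -1, -1, 1)
```

Larger two-player zero-sum games reduce to a linear program, which is what the polynomial-time claim above refers to; the general N-player case has no such reduction.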

Those are the reasons for optimism that I think are most important. Going into more of those is less important and less interesting than worrying about stuff, so I'll list three of those worries, or maybe four, and try to be fast so we can circle back. As I mentioned, I think interpretability has a large role to play here. I would like to be able to have an agent say … again, Alice and Bob are debating. Bob should be able to just point directly into Alice's thoughts and say, "She really thought X even though she said Y." The reason you need an interpretability technique for that is that, in this conversation, I could just claim that you, Lucas Perry, are having some malicious thought, but that's not a falsifiable statement, so I can't use it in a debate. I could always make that statement, unless I can point into your thoughts.

Because we have so much control over machine learning, we have the potential ability to do that, and we can take advantage of it. I think that, for that to work, we need probably a deep hybrid between the two schemes, because an advanced agent’s thoughts will probably be advanced, and so you may need some kind of strengthened thing like amplification or debate just to be able to describe the thoughts, or to point at them in a meaningful way. That’s a problem that we have not really solved. Interpretability is coming along, but it’s definitely not hybridized with these fancy alignment schemes, and we need to solve that at some point.

Another problem is there's no point in this kind of natural language debate where I can just say, for example, "You know, it's going to rain tomorrow, and it's going to rain tomorrow just because I've looked at all the weather in the past, and it just feels like it's going to rain tomorrow." Somehow, debate is missing this straight-up pattern matching ability of machine learning, where I can just read a dataset and summarize it very quickly. The theoretical side of this is, if I have a debate about even something as simple as, "What's the average height of a person in the world?", the debate method I've described so far has to have depth at least logarithmic in the number of people. I just have to subdivide by population: this half of the world, and then this half of that half of the world, and so on.

I can't just say, "You know, on average it's like 1.6 meters." We need better methods for hybridizing debate with pattern matching and statistical intuition; if we don't have that, we may not be competitive with other forms of ML.

Lucas: Why is that not just an intrinsic part of debate? Why is debating over these kinds of things different than any other kind of natural language debate?

Geoffrey: It is the same. The problem is just that for some types of questions, and there are other forms of this in natural language, there aren't short deterministic arguments. There are many questions where the shortest deterministic argument is much longer than the shortest randomized argument. For example, if you allow randomization, I can say, "I claim the average height of a person is 1.6 meters." Well, pick a person at random, and you'll score me according to the squared difference between those two numbers: my claim and the height of this particular person you've chosen. The optimal move to make there is to just say the average height right away.

The thing I just described is a debate using randomized steps that is extremely shallow; it's only basically two steps long. If I want to do a deterministic debate, I have to deterministically say the average height of a person in North America is X, and in Asia it's Y. The other debater could say, "I disagree about North America," and you recurse into that.
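The randomized scoring rule Geoffrey describes is just squared-error scoring, under which the honest mean is the expected-penalty-minimizing claim. A quick check, with made-up heights rather than real data:

```python
# Made-up heights. The scoring rule: claim m, sample a person at random,
# penalty is (m - height)^2. The expected penalty is minimized by the
# true mean, so an honest debater can answer in essentially one step.
heights = [1.5, 1.6, 1.7, 1.8, 1.55]

def expected_penalty(claim):
    return sum((claim - h) ** 2 for h in heights) / len(heights)

mean = sum(heights) / len(heights)
# Any other claim does worse in expectation than the honest mean.
```

This is the standard fact that the mean minimizes expected squared error, which is why allowing randomized moves collapses the logarithmic-depth deterministic debate into a two-step one.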

It would be super embarrassing if we proposed these complicated alignment schemes, saying, "This is how we're going to solve AI safety," and they couldn't quickly answer a trivial statistical question. That would be a serious problem. We roughly know how to solve that one. The harder case is if you bring in this more vague statistical intuition. It's not like I'm computing a mean over some dataset; I've looked at the weather and, you know, it feels like it's going to rain tomorrow. Getting that in is a bit trickier, but we have some ideas there. They're unresolved.

That's something I'm optimistic about, but we need to work on it. The most important reason to be concerned is just that humans are flawed in a variety of ways. We have all these biases, ethical inconsistencies, and cognitive biases. We can write down some toy theoretical arguments that debate works with a limited but reliable judge, but does it work in practice with a human judge? I think there are some questions you can kind of reason through there, but in the end, a lot of that will be determined by just trying it, and seeing whether debate works with people. Eventually, when we start to get agents that can play these debates, then we can check whether it works with two ML agents and a human judge. For now, when language modeling is not that far along, we may need to try it out first with all humans.

This would be, you play the same debate game, but both the debaters are also people, and you set it up so that somehow it’s trying to model this case where the debaters are better than the judge at some task. The debaters might be experts at some domain, they might have access to some information that the judge doesn’t have, and therefore, you can ask whether a reasonably short debate is truth seeking if the humans are playing to win.

The hope there would be that you can test out debate on real people with interesting questions, say complex scientific questions, and questions about ethics, and about areas where humans are biased in known ways, and see whether it works, and also see not just whether it works, but which forms of debate are strongest.

Lucas: What does it mean for debate to work or be successful for two human debaters and one human judge if it’s about normative questions?

Geoffrey: Unfortunately, if you want to do this test, you need to have a source of truth. In the case of normative questions, there are two ways to go. One way is you pick a task where we may not know the entirety of the answer, but we know some aspect of it with high confidence. An example would be the resume case, where two resumes are identical except for the name at the top, and normatively, we believe with high confidence that the answer shouldn't depend on that. If it turns out that a winning debater can maliciously and subtly take advantage of the name to play on the judge's biases, and make a resume with a Black-sounding name sound bad, that would be a failure.

We sort of know that because we don’t know in advance whether a resume should be good or bad overall, but we know that this pair of identical resumes shouldn’t depend on the name. That’s one way just we have some kind of normative statement where we have reasonable confidence in the answer. The other way, which is kind of similar, is you have two experts in some area, and the two experts agree on what the true answer is, either because it’s a consensus across the field, or just because maybe those two experts agree. Ideally, it should be a thing that’s generally true. Then, you force one of the experts to lie.

You say, “Okay, you both agree that X is true, but now we’re going to flip a coin and now one of you only wins if you lie, and we’ll see whether that wins or not.”

Lucas: I think it also … Just to plug your game here, you guys do have a debate game. We’ll put a link to that in the article that goes along with this podcast. I suggest that people check that out if you would like a little bit more tangible and fun way to understand debate, and I think it’ll help elucidate what the process looks like, and the asymmetries that go on, and the key idea here that it is harder to lie than to refute a lie. It seems like if we could deploy some sort of massive statistical analysis over many different iterated debates across different agents, that we would be able to come down on the efficacy of debate in different situations where the judge and the debaters are all AI, mixed situations, or all human debates. I think it’d be interesting to see the varying results there.

Geoffrey: This is going to be a noisy enough process, for a variety of reasons, that we will probably have to do this a lot to know. So far, we've just done a very small, informal number of these human, human, human debates, say expert debates, and we've already learned a fair amount at a qualitative level just from those few trials. I'll give an example. In a debate about a CS theory question, there was a judge present while the debate was going on, but they were interacting only fairly minimally.

Early in the process of the debate … I was the debater telling the truth here … I stated my formulation of the question and unpacked it intuitively, and the judge, perfectly understandably, had a subtle misunderstanding of how I'd framed the question. That misunderstanding of just what the topic was about was never corrected throughout the debate, and therefore it was much harder for the honest player to win, because it seemed like the honest player had a harder case to make.

That sort of taught us that having judge interaction is potentially quite important so that the debaters have a detailed understanding of what the judge is thinking. If your goal is to model debate as a perfect information game, the closer to that you can get, the more information that debaters have, the better it should be.

Lucas: Yeah. I mean, that also allows the debaters to exploit cognitive biases in the judge.

Geoffrey: That’s right.

Lucas: You would point that out. Like, “Hey, this person’s exploiting your cognitive bias.”

Geoffrey: Yeah, so I think it's an open question how exactly to strike that balance, and whether there's a way to strike it that works. Generally, the more information about, say, the judge that we provide to the debaters, either through judge interaction or by just telling the debaters something about the judge, the stronger they will be as players of the game, but it might reveal ways to attack the judge.

Now, if our goal is to be resistant to very strong agents, and it turns out that the only way to make it safe is to hide information from the agents, maybe you shouldn’t use this method. It may not be very resilient. It’s likely that for experiments, we should push as far as we can towards strong play, revealing as much as possible, and see whether it still works in that case.

Lucas: In terms here of the social scientists playing a role here, do you want to go ahead and unpack that a bit more? There’s a paper that you’re working on with Amanda Askell on this.

Geoffrey: As you say, we want to run statistically significant experiments that test whether debate is working and which forms of debate are best, and that will require careful experimental design. That is, in some sense, an experiment in just social science. There's no ML involved. It's motivated by machine learning, but it's just a question about how people think, and how they argue and convince each other. Currently, no one at OpenAI has experience running human experiments of this kind, or at least no one that is involved in this project.

The hope would be that we would want to get people involved in AI safety that have experience and knowledge in how to structure experiments on the human side, both in terms of experimental design, having an understanding of how people think, and where they might be biased, and how to correct away from those biases. I just expect that process to involve a lot of knowledge that we don’t possess at the moment as ML researchers.

Lucas: Right. I mean, in order for there to be an efficacious debate process, or AI alignment process in general, you need to debug and understand the humans as well as the machines. Understanding our cognitive biases in debates, and our weak spots and blind spots in debate, it seems crucial.

Geoffrey: Yeah. I sort of view it as a social science experiment, because it’s just a bunch of people interacting. It’s a fairly weird experiment. It differs from normal experiments in some ways. In thinking about how to build AGI in a safe way, we have a lot of control over the whole process. If it takes a bunch of training to make people good at judging these debates, we can provide that training, pick people who are better or worse at judging. There’s a lot of control that we can exert. In addition to just finding out whether this thing works, it’s sort of an engineering process of debugging the humans, maybe it’s sort of working around human flaws, taking them into account, and making the process resilient.

My highest-level hope here is that humans have various flaws and biases, but we are willing to be corrected and set our flaws aside, or maybe there are two ways of approaching a question where one way hits the bias and one way doesn't. We want to see whether we can produce some scheme that picks out the right way, at least to some degree of accuracy. We don't need to be able to answer every question. If we, for example, learned that debate works perfectly well for some broad class of tasks, but not for resolving the final question of what humans should do over the long-term future, or resolving all metaethical disagreements, we can afford to say, "We'll put those aside for now. We want to get through this risky period, make sure AI doesn't do something malicious, and we can deliberately work through those other questions and take our time doing that."

The goal includes the task of knowing which things we can safely answer, and the goal should be to structure the debates so that if you give it a question where humans just disagree too much or are too unreliable to reliably answer, the answer should be, “We don’t know the answer to that question yet.” A debater should be able to win a debate by admitting ignorance in that case.

There is an important assumption I'm making about the world that we should make explicit, which is that I believe it is safe to be slow about certain ethical or directional decisions. You can construct games where you just have to make a decision now, like you're barreling along in some car with no brakes and you have to dodge left or right around an obstacle, but you can't say, "I'm going to ponder this question for a while and hold off." You have to choose now. I would hope that the task of choosing what we want to do as a civilization is not like that. We can resolve some immediate concerns about serious problems and existential risk now, but we don't need to resolve everything.

That’s a very strong assumption about the world, which I think is true, but it’s worth saying that I know that is an assumption.

Lucas: Right. I mean, it’s true insofar as coordination succeeds, and people don’t have incentives just to go do what they think is best.

Geoffrey: That’s right. If you can hold off deciding things until we can deliberate longer.

Lucas: Right. What does this distillation process look like for debate, where we're ensuring alignment is maintained as a system's capability is amplified and changed?

Geoffrey: One property of amplification, which is nice, is that you can sort of imagine running it forever. You train on simple questions, and then you train on more complicated questions, and then you keep going up and up and up, and if you’re confident that you’ve trained enough on the simple questions, you can never see them again, freeze that part of the model, and keep going. I think in practice, that’s probably not how we would run it, so you don’t inherit that advantage. In debate, what you would have to do to get to more and more complicated questions is, at some point, and maybe this point is fairly far off, but you have to go to the longer and longer and longer debates.

If you're just sort of thinking about the long-term future, I expect to have to switch over to some other scheme, or at least layer a scheme, embedding debate in a larger scheme. An example would be that the question you resolve with debate is, "What is an even better way to build AI alignment?" That, you can resolve with, say, depth-100 debates, and maybe you can handle that depth well. What that spits out is an algorithm; you interrogate it enough to know that you trust it, and then you can run that one.

You can also imagine eventually needing to hybridize a Debate-like scheme and an Amplification-like scheme, where you don't get a new algorithm out, but you trust this initial debating oracle enough that you can view it as fixed, and then start a new debate scheme which can trust any answer that original scheme produces. Now, I don't really like that scheme, because it feels like you haven't gained a whole lot. Generally, think about, say, the next 1,000 years; it's useful to think about the long term of AI alignment going forwards. I expect to need further advances after we get past this AI risk period.

I'll give a concrete example. You ask your debating agents, "Okay, give me a perfect theorem prover." Right now, all of our theorem provers probably have little bugs, so you can't really trust them to resist a superintelligent agent. Say you trust the theorem prover that you get out, and you say, "Okay, now I want a proof that AI alignment works." You bootstrap your way up, using this agent as an oracle on interesting, complicated questions, until you've got a scheme that gets you to the next level, and then you iterate.

Lucas: Okay. In terms of practical, short-term world to AGI world maybe in the next 30 years, what does this actually look like? In what ways could we see debate and amplification deployed and used at scale?

Geoffrey: There is the direct approach, where you use them to answer questions, using exactly the structure they're trained as. With a debating agent, you would just engage in debates and use it as an oracle in that way. You can also use it to generate training data. You could, for example, ask a debating agent to spit out the answers to a large number of questions, and then train a little module on them, if you trust all the answers and you trust supervised learning to work. If you wanted to build a strong self-driving car, you could ask it to train a much smaller network that way. It would not be human level, but it gives you a way to access data.

There’s a lot you could do with a powerful oracle that gives you answers to questions. I could probably go on at length about fancy schemes you could do with oracles. I don’t know if it’s that important. The more important part to me is the decision process we deploy these things into: how we choose which questions to answer and what we do with those answers. It’s probably not a great idea to train an oracle and then give it to everyone in the world right away, unfiltered, for reasons you can probably fill in by yourself. Basically, malicious people exist, and would ask bad questions, and eventually do bad things with the results.

If you have one of these systems, you’d like to deploy it in a way that can help as many people as possible, which means everyone will have their own questions to ask of it, but you need some filtering mechanism or some process to decide which questions to actually ask, what to do with the answers, and so on.

Lucas: I mean, can the debate process be used to self-filter out providing answers for certain questions, based off of modeling the human decision about whether or not they would want that question answered?

Geoffrey: It can. There’s a subtle issue, which I think we need to deal with, but haven’t dealt with yet. There’s a commutativity question, which is, say you have a large number of people: do you reach reflective equilibrium for each person first, and then, say, vote across people, or do you have a debate, and then vote on the answer to what the judgment should be? Imagine playing a Debate game where you play a debate, and then everyone votes on who wins. There are advantages on both sides. On the side of voting after reflective equilibrium, you have the problem that if you reach reflective equilibrium for a person, it may be disastrous if you pick the wrong person. That extreme is probably bad. The other extreme is also kind of weird, because there are a bunch of standard results where if you take a bunch of rational agents voting, it might be true that A and B together imply C, but the agents might vote yes on A, yes on B, and no on C. Even when every individual voter is rational, the set of statements endorsed by majority vote can be inconsistent: the voting outcome is irrational.
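The voting inconsistency Geoffrey describes is a version of the classic “discursive dilemma” from judgment-aggregation theory. A minimal sketch (the three voters and their views are hypothetical), where proposition C is defined as A and B:

```python
# Each voter is individually rational: their view on C equals (A and B).
voters = [
    {"A": True,  "B": True},   # this voter concludes C is true
    {"A": True,  "B": False},  # this voter concludes C is false
    {"A": False, "B": True},   # this voter concludes C is false
]

def majority(prop):
    """Majority vote on A, B, or the derived proposition C = A and B."""
    if prop == "C":
        votes = [v["A"] and v["B"] for v in voters]
    else:
        votes = [v[prop] for v in voters]
    return sum(votes) > len(voters) / 2

# The majority endorses A and B, yet rejects C, even though each
# individual voter's judgments are perfectly consistent.
print(majority("A"), majority("B"), majority("C"))  # True True False
```

Voting after each agent reaches its own conclusion (voting on C directly) and voting on the premises first (voting on A and B, then deriving C) give different answers, which is exactly the commutativity problem described above.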

The result of voting before you take reflective equilibrium is sort of an odd philosophical concept. Probably, you need some kind of hybrid between these schemes, and I don’t know exactly what that hybrid looks like. That’s an area where I think technical AI safety mixes with policy to a significant degree that we will have to wrestle with.

Lucas: Great, so to back up and zoom in on this one point that you made: is the view that one might want to be worried about people who undergo a long, amplified period of explicit human reasoning, and that they might just arrive at something horrible through it?

Geoffrey: I guess, yes, we should be worried about that.

Lucas: Wouldn’t one view of debate be that humans, given debate, would also over time become more likely to arrive at true answers? That reflective equilibrium will tend to lead people to truth?

Geoffrey: Yes. That is an assumption. The reason I think there is hope there … I think that you should be worried. I think the reason for hope is our ability to not answer certain questions. I don’t know that I trust reflective equilibrium applied incautiously, or not regularized in some way, but if there’s a case where some definition of reflective equilibrium is not trustworthy, I think it’s hopeful that we can construct debate so that the result will be, “This is just too dangerous to decide. We don’t really know the answer with high confidence.”

Geoffrey: This is certainly true of complicated moral things. Avoiding lock-in, for example. I would not trust reflective equilibrium if it says, “Well, the right answer is just to lock our values in right now, because they’re great.” We need to take advantage of the outs we have in terms of being humble about deciding things. Once you have those outs, I’m hopeful that we can solve this, but there’s a bunch of work to do to know whether that’s actually true.

Lucas: Right. Lots more experiments to be done on the human side and the AI side. Is there anything here that you’d like to wrap up on, or anything that you feel like we didn’t cover that you’d like to make any last minute points?

Geoffrey: I think the main point is just that there’s a bunch of work here. OpenAI is hiring people to work on both the ML side of things, also theoretical aspects, if you think you like wrestling with how these things work on the theory side, and then certainly, trying to start on this human side, doing the social science and human aspects. If this stuff seems interesting, then we are hiring.

Lucas: Great, so people that are interested in potentially working with you or others at OpenAI on this, or if people are interested in following you and keeping up to date with your work and what you’re up to, what are the best places to do these things?

Geoffrey: I have taken a break from pretty much all social media, so you can follow me on Twitter, but I won’t ever post anything, or see your messages, really. I think email me. It’s not too hard to find my email address. That’s pretty much the way, and then watch as we publish stuff.

Lucas: Cool. Well, thank you so much for your time, Geoffrey. It’s been very interesting. I’m excited to see how these experiments go for debate, and how things end up moving along. I’m pretty interested and optimistic, I guess, about debate as an epistemic process and its role in arriving at truth and truth seeking, and how that will play into AI alignment.

Geoffrey: That sounds great. Thank you.

Lucas: Yep. Thanks, Geoff. Take care.

If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Human Cognition and the Nature of Intelligence with Joshua Greene

“How do we combine concepts to form thoughts? How can the same thought be represented in terms of words versus things that you can see or hear in your mind’s eyes and ears? How does your brain distinguish what it’s thinking about from what it actually believes? If I tell you a made up story, yesterday I played basketball with LeBron James, maybe you’d believe me, and then I say, oh I was just kidding, didn’t really happen. You still have the idea in your head, but in one case you’re representing it as something true, in another case you’re representing it as something false, or maybe you’re representing it as something that might be true and you’re not sure. For most animals, the ideas that get into its head come in through perception, and the default is just that they are beliefs. But humans have the ability to entertain all kinds of ideas without believing them. You can believe that they’re false or you could just be agnostic, and that’s essential not just for idle speculation, but it’s essential for planning. You have to be able to imagine possibilities that aren’t yet actual. So these are all things we’re trying to understand. And then I think the project of understanding how humans do it is really quite parallel to the project of trying to build artificial general intelligence.” -Joshua Greene

Josh Greene is a Professor of Psychology at Harvard, who focuses on moral judgment and decision making. His recent work focuses on cognition, and his broader interests include philosophy, psychology and neuroscience. He is the author of Moral Tribes: Emotion, Reason, and the Gap Between Us and Them. Joshua Greene’s current research focuses on further understanding key aspects of both individual and collective intelligence. Deepening our knowledge of these subjects allows us to understand the key features which constitute human general intelligence, and how human cognition aggregates and plays out through group choice and social decision making. By better understanding the one general intelligence we know of, namely humans, we can gain insights into the kinds of features that are essential to general intelligence and thereby better understand what it means to create beneficial AGI. This particular episode was recorded at the Beneficial AGI 2019 conference in Puerto Rico. We hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

Topics discussed in this episode include:

  • The multi-modal and combinatorial nature of human intelligence
  • The symbol grounding problem
  • Grounded cognition
  • Modern brain imaging
  • Josh’s psychology research using John Rawls’ veil of ignorance
  • Utilitarianism reframed as ‘deep pragmatism’
You can find out more about Joshua Greene at his website or follow his lab on their Twitter. You can listen to the podcast above or read the transcript below.

Lucas: Hey everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today we’ll be speaking with Joshua Greene about his research on human cognition as well as John Rawls’ veil of ignorance and social choice. Studying the human cognitive engine can help us better understand the principles of intelligence, and thereby aid us in arriving at beneficial AGI. It can also inform group choice and how to modulate persons’ dispositions toward certain norms or values, and thus affect policy development and observed choice. Given this, we discussed Josh’s ongoing projects and research regarding the structure, relations, and kinds of thought that make up human cognition, key features of intelligence such as its being combinatorial and multimodal, and finally how a particular thought experiment can change how impartial a person is, and thus what policies they support.

And as always, if you enjoy this podcast, please give it a like, share it with your friends, and follow us on your preferred listening platform. As a bit of an announcement, the AI Alignment Podcast will be releasing every other Wednesday instead of once a month, so there are a lot more great conversations on the way. Josh Greene is a professor of psychology at Harvard, who focuses on moral judgment and decision making. His recent work focuses on cognition, and his broader interests include philosophy, psychology and neuroscience. And without further ado, I give you Josh Greene.

What sort of thinking has been predominantly occupying the mind of Josh Greene?

Joshua: My lab has two different main research areas that are related, but on a day to day basis are pretty separate. You can think of them as focused on key aspects of individual intelligence versus collective intelligence. On the individual intelligence side, what we’re trying to do is understand how our brains are capable of high level cognition. In technical terms, you can think of that as compositional semantics, or multimodal compositional semantics. What that means in more plain English is how does the brain take concepts and put them together to form a thought, so you can read a sentence like the dog chased the cat, and you understand that it means something different from the cat chased the dog. The same concepts are involved, dog and cat and chasing, but your brain can put things together in different ways in order to produce a different meaning.

Lucas: The black box for human thinking and AGI thinking is really sort of this implicit reasoning that is behind the explicit reasoning, that it seems to be the most deeply mysterious, difficult part to understand.

Joshua: Yeah. A lot of where machine learning has been very successful has been on the side of perception, recognizing objects, or when it comes to going from say vision to language, simple labeling of scenes that are already familiar, so you can show an image of a dog chasing a cat and maybe it’ll say something like dog chasing cat, or at least it’ll get that there’s a dog running and a cat chasing.

Lucas: Right. And the caveat is that it takes a massive amount of training, where it’s not one shot learning, it’s you need to be shown a cat chasing a dog a ton of times just because of how inefficient the algorithms are.

Joshua: Right. And the algorithms don’t generalize very well. So if I show you some crazy picture that you’ve never seen before where it’s a goat and a dog and Winston Churchill all wearing roller skates in a rowboat on a purple ocean, a human can look at that and go, that’s weird, and give a description like the one I just said. Whereas today’s algorithms are going to be relying on brute statistical associations, and that’s not going to cut it for precise, immediate reasoning. So humans have this ability to have thoughts, which we can express in words, but which we can also imagine in something like pictures.

And the tricky thing is that it seems like a thought is not just an image, right? So to take an example that I think comes from Daniel Dennett, if you hear the words yesterday my uncle fired his lawyer, you might imagine that in a certain way, maybe you picture a guy in a suit pointing his finger and looking stern at another guy in a suit, but you understand that what you imagined doesn’t have to be the way that that thing actually happened. The lawyer could be a woman rather than a man. The firing could have taken place by phone. The firing could have taken place by phone while the person making the call was floating in a swimming pool and talking on a cell phone, right?

The meaning of the sentence is not what you imagined. But at the same time we have the symbol grounding problem, that is, it seems like meaning is not just a matter of symbols chasing each other around. You wouldn’t really understand something if you couldn’t take those words and attach them meaningfully to things that you can see or touch or experience in a more sensory and motor kind of way. So thinking is something in between images and words. Maybe it’s just the translation mechanism between those sorts of things, or maybe there’s a deeper language of thought, to use Jerry Fodor’s famous phrase. But in any case, what part of my lab is trying to do is understand how this central, really poorly understood aspect of human intelligence works. How do we combine concepts to form thoughts? How can the same thought be represented in terms of words versus things that you can see or hear in your mind’s eyes and ears?

How does your brain distinguish what it’s thinking about from what it actually believes? If I tell you a made up story, yesterday I played basketball with LeBron James, maybe you’d believe me, and then I say, oh I was just kidding, didn’t really happen. You still have the idea in your head, but in one case you’re representing it as something true, in another case you’re representing it as something false, or maybe you’re representing it as something that might be true and you’re not sure. For most animals, the ideas that get into its head come in through perception, and the default is just that they are beliefs. But humans have the ability to entertain all kinds of ideas without believing them. You can believe that they’re false or you could just be agnostic, and that’s essential not just for idle speculation, but it’s essential for planning. You have to be able to imagine possibilities that aren’t yet actual.

So these are all things we’re trying to understand. And then I think the project of understanding how humans do it is really quite parallel to the project of trying to build artificial general intelligence.

Lucas: Right. So what’s deeply mysterious here is the kinetics that underlie thought, which is sort of like meta-learning or meta-awareness, or how it is that we’re able to have this deep and complicated implicit reasoning behind all of these things. And what that actually looks like seems deeply puzzling in sort of the core and the gem of intelligence, really.

Joshua: Yeah, that’s my view. I think we really don’t understand the human case yet, and my guess is that obviously it’s all neurons that are doing this, but these capacities are not well captured by current neural network models.

Lucas: So also just two points of question or clarification. The first is this sort of hypothesis that you proposed, that human thoughts seem to require some sort of empirical engagement. And then what was your claim about animals, sorry?

Joshua: Well animals certainly show some signs of thinking, especially some animals like elephants and dolphins and chimps engage in some pretty sophisticated thinking, but they don’t have anything like human language. So it seems very unlikely that all of thought, even human thought, is just a matter of moving symbols around in the head.

Lucas: Yeah, it’s definitely not just linguistic symbols, but it still feels like conceptual symbols that have structure.

Joshua: Right. So this is the mystery, human thought, you could make a pretty good case that symbolic thinking is an important part of it, but you could make a case that symbolic thinking can’t be all it is. And a lot of people in AI, most notably DeepMind, have taken the strong view and I think it’s right, that if you’re really going to build artificial general intelligence, you have to start with grounded cognition, and not just trying to build something that can, for example, read sentences and deduce things from those sentences.

Lucas: Right. Do you want to unpack what grounded cognition is?

Joshua: Grounded cognition refers to a representational system where the representations are derived, at least initially, from perception and from physical interaction. There’s perhaps a relationship with empiricism in the broader philosophy of science, but you could imagine trying to build an intelligent system by giving it lots and lots and lots of words, giving it lots of true descriptions of reality, and giving it inference rules for going from some descriptions to other descriptions. That just doesn’t seem like it’s going to work. You don’t really understand what apple means unless you have some sense of what an apple looks like, what it feels like, what it tastes like; it doesn’t have to be all of those things. You can know what an apple is without ever having eaten one, or I could describe some fruit to you that you’ve never seen, as long as you have experience with other fruits or other physical objects. Words don’t just exist in a symbol-storm vacuum. They’re related to things that we see and touch and interact with.

Lucas: I think for me, just going most foundationally, the question is before I know what an apple is, do I need to understand spatial extension and object permanence? I have to know time, I have to have some very basic ontological understanding and world model of the universe.

Joshua: Right. So we have some clues from human developmental psychology about what kinds of representations, understandings, capabilities humans acquire, and in what order. To state things that are obvious, but nevertheless revealing, you don’t meet any humans who understand democratic politics before they understand objects.

Lucas: Yes.

Joshua: Right?

Lucas: Yeah.

Joshua: Which sounds obvious and it is in a sense obvious, right? But it tells you something about what it takes to build up abstract and sophisticated understandings of the world and possibilities for the world.

Lucas: Right. So for me it seems that the place where grounded cognition matters most fundamentally is between the genetic code that seeds the baby and the moment the baby comes out: whatever epistemics are in there have the capacity to one day potentially become Einstein. So what is that grounded cognition in the baby that underlies this potential to be a quantum physicist or a scientist-

Joshua: Or even just a functioning human. 

Lucas: Yeah.

Joshua: I mean even people with mental disabilities walk around and speak and manipulate objects. I think that in some ways the harder question is not how do we get from normal human to Einstein, but how do we get from a newborn to a toddler? And the analogous or almost analogous question for artificial intelligence is: how do you go from a neural network that has some kind of structure, some structure that’s favorable for acquiring useful cognitive capabilities? How do you figure out what the starting structure is, which is kind of analogous to the question of how the brain gets wired up in utero?

And it gets connected to these sensors that we call eyes and ears, and it gets connected to these effectors that we call hands and feet. And it’s not just a random blob of connectoplasm, the brain has a structure. So one challenge for AI is what’s the right structure for acquiring sophisticated intelligence, or what are some of the right structures? And then what kind of data, what kind of training, what kind of training process do you need to get there?

Lucas: Pivoting back into the relevance of this with AGI, there is like you said, this fundamental issue of grounded cognition that babies and toddlers have that sort of lead them to become full human level intelligences eventually. How does one work to isolate the features of grounded cognition that enable babies to grow and become adults?

Joshua: Well, I don’t work with babies, but I can tell you what we’re doing with adults, for example.

Lucas: Sure.

Joshua: In the one paper in this line of research we’ve already published, work led by Steven Franklin, we have people reading sentences like the dog chased the cat, the cat chased the dog, or the dog was chased by the cat and the cat was chased by the dog. And what we’re doing is looking for parts of the brain where the pattern is different depending on whether the dog is chasing the cat or the cat is chasing the dog. So it has to be something that’s not just involved in representing dog or cat or chasing, but in representing the composition of those three concepts, where they’re composed in one way rather than another. And what we found is that there’s a region in the temporal lobe where the pattern is different for those things.

And more specifically, what we’ve found is that in one little spot in this broader region in the temporal lobe, you can decode better than chance who the agent is. So if it’s the dog chased the cat, then in this spot you can tell better than chance that it’s the dog doing the chasing. If it’s the cat was chased by the dog, same thing. So it’s not just about the order of the words, and then you can decode better than chance that it’s the cat being chased for a sentence like that. So the idea is that these spots in the temporal lobe are functioning like data registers, representing variables rather than specific values. That is, this one region is representing the agent that did something, and the other region is representing the patient, as they say in linguistics, that had something done to it. And this is starting to look more like a computer program, where the way classical programs work is they have variables and values.

Like if you were going to write a program that translates Celsius into Fahrenheit, what you could do is construct a giant table telling you what Fahrenheit value corresponds to what Celsius value. But the more elegant way to do it is to have a formula where the formula has variables, right? You put in the Celsius value, you multiply it by the right thing and you get the Fahrenheit value. And then what that means is that you’re taking advantage of that recurring structure. Well, “something does something to something else” is a recurring structure in the world and in our thought. And so if you have something in your brain that has that structure already, then you can quickly slot in dog as agent, chasing as the action, cat as patient, and that way you can very efficiently and quickly combine new ideas. So the upshot of that first work is that it seems like when we’re representing the meaning of a sentence, we’re actually doing it in a more classical computer-ish way than a lot of neuroscientists might have thought.
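The table-versus-formula contrast can be sketched in a few lines (a minimal illustration; the particular table entries are arbitrary):

```python
# Lookup-table approach: one stored answer per anticipated input.
c_to_f_table = {0: 32.0, 10: 50.0, 20: 68.0, 100: 212.0}

# Formula approach: a variable slots into a recurring structure,
# so any input works, including ones never seen before.
def c_to_f(celsius):
    return celsius * 9.0 / 5.0 + 32.0

print(c_to_f_table.get(37))  # None: the table never anticipated 37
print(c_to_f(37))            # 98.6: the formula generalizes
```

The agent and patient slots described above play the role of the celsius variable here: a fixed structure that different values can be bound to.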

Lucas: It’s combinatorial.

Joshua: Yes, exactly. So what we’re trying to get at is modes of composition. In that experiment, we did it with sentences. In an experiment we’re now doing, this is being led by my grad student Dylan Plunkett, and Steven Franklin is also working on this, we’re doing it with words and with images. We actually took a bunch of photos of different people doing different things. Specifically we have a chef which we also call a cook, and we have a child which we also call a kid. We have a prisoner, which we also call an inmate, and we have male and female versions of each of those. And sometimes one is chasing the other and sometimes one is pushing the other. In the images, we have all possible combinations of the cook pushes the child, the inmate chases the chef-

Lucas: Right, but it’s also gendered.

Joshua: We have male and female versions for each. And then we have all the possible descriptions. And in the task what people have to do is you put two things on the screen and you say, do these things match? So sometimes you’ll have two different images and you have to say, do those images have the same meaning? So it could be a different chef chasing a different kid, but if it’s a chef chasing a kid in both cases, then you would say that they mesh. Whereas if it’s a chef chasing an inmate, then you’d say that they don’t. And then in other cases you would have two sentences, like the chef chased the kid, or it could be the child was chased by the cook, or was pursued by the cook, and even though those are all different words in different orders, you’ve recognized that they have the same meaning or close enough.

And then in the most interesting case, we have an image and a set of words, which you can think of as a description, and the question is, does it match? So if you see a picture of a chef chasing a kid, and then the words are chef chases kid or cook pursues child, then you’d say, okay, that one’s a match. And what we’re trying to understand is, is there something distinctive that goes on in that translation process when you have to take a complex thought, not complex in the sense of very sophisticated by human standards, but complex in the sense that it has parts, that it’s composite, and translate it from a verbal representation to a visual representation? And is that different, or is the base representation visual? So for example, one possibility is that when you get two images, if you’re doing something that’s fairly complicated, you have to translate them both into words. It’s possible that you could see language areas activated when people have to look at two images and decide if they match. Or maybe not. Maybe you can do that in a purely visual kind of way-

Lucas: And maybe it depends on the person. Like some meditators will report that after long periods of meditation, certain kinds of mental events happen much less or just cease, like images or like linguistic language or things like that.

Joshua: So that’s possible. Our working assumption is that basic things like understanding the meaning of the chef chased the kid, and being able to point to a picture of that and say that’s the thing, the sentence described, that our brains do this all more or less than the same way. That could be wrong, but our goal is to get at basic features of high level cognition that all of us share.

Lucas: And so one of these again is this combinatorial nature of thinking.

Joshua: Yes. That I think is central to it. That it is combinatorial or compositional, and that it’s multimodal, that you’re not just combining words with other words, you’re not just combining images with other images, you’re combining concepts that are either not tied to a particular modality or connected to different modalities.

Lucas: They’re like different dimensions of human experience. You can integrate it with if you can feel it, or some people are synesthetic, or like see it or it could be a concept, or it could be language, or it could be heard, or it could be subtle intuition, and all of that seems to sort of come together. Right?

Joshua: It’s related to all those things.

Lucas: Yeah. Okay. And so sorry, just to help me get a better picture here of how this is done. So this is an MRI, right?

Joshua: Yeah.

Lucas: So for me, I’m not in this field and I see generally the brain is so complex that our resolution is just different areas of the brain light up, and so we understand what these areas are generally tasked for, and so we can sort of see how they relate when people undergo different tasks. Right?

Joshua: No, we can do better than that. So that was kind of brain imaging 1.0, and brain imaging 2.0 is not everything we want from a brain imaging technology, but it does take us a level deeper. Which is to say, instead of just saying this brain region is involved, or it ramps up when people are doing this kind of thing, region-function relationships, we can look at the actual encoding of content: I can train a pattern classifier. So let’s say you’re showing people pictures of a dog, or the word dog, versus other things. You can train a pattern classifier to recognize the difference between someone looking at a dog versus looking at a cat, or reading the word dog versus reading the word cat. There are patterns of activity that are more subtle than just this region being more or less active.

Lucas: Right. So the activity is distinct in a way that when you train the thing on when it looks like people are recognizing cats, then it can recognize that in the future.

Joshua: Yeah.
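The decoding approach described here is often called multivoxel pattern analysis (MVPA). A toy sketch using a nearest-centroid classifier (the “voxel” data below is entirely synthetic, and nearest-centroid is just one simple choice of classifier):

```python
import random

random.seed(0)  # deterministic toy data

# Synthetic "voxel patterns": a base activation vector plus Gaussian noise.
# (All values here are made up; real MVPA works on measured fMRI voxels.)
def synth_pattern(base, noise=0.5):
    return [b + random.gauss(0, noise) for b in base]

dog_base = [1.0] * 10 + [0.0] * 10   # hypothetical "dog" activation profile
cat_base = [0.0] * 10 + [1.0] * 10   # hypothetical "cat" activation profile

train = [(synth_pattern(dog_base), "dog") for _ in range(20)] + \
        [(synth_pattern(cat_base), "cat") for _ in range(20)]

# Nearest-centroid classifier: average the training patterns per label,
# then assign a new pattern to the closest average.
def centroid(label):
    rows = [p for p, lab in train if lab == label]
    return [sum(col) / len(rows) for col in zip(*rows)]

centroids = {lab: centroid(lab) for lab in ("dog", "cat")}

def classify(pattern):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(pattern, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Decode which concept a new, held-out trial came from.
print(classify(synth_pattern(dog_base)))
```

Real pipelines use cross-validated classifiers (for example, logistic regression or SVMs) over measured voxels, but the logic of decoding “better than chance” is the same.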

Lucas: So is there anything besides this multimodal and combinatorial features that you guys have isolated, or that you’re looking into, or that you suppose are like essential features of grounded cognition?

Joshua: Well, this is what we’re trying to study, and we have the ones that have result that’s kind of done and published that I described about representing the meaning of a sentence in terms of representing the agent here and the patient there for that kind of sentence, and we have some other stuff in the pipeline that’s getting at the kinds of representations that the brain uses to combine concepts and also to distinguish concepts that are playing different roles. In another set of studies we have people thinking about different objects.

Sometimes they’ll think about an object in a case where they’d actually get money if it turns out that that object is the one that’s going to appear later. So if you’re thinking about, say, dog, and it turns out that it’s dog under the card, then you’ll get five bucks. You see that you’re able to decode the dog representation in part of our motivational circuitry, whereas you don’t see that if you’re just thinking about it. So that’s another example: things are represented in different places in the brain depending on what function that representation is serving at that time.

Lucas: So with this pattern recognition training that you can do based on how people recognize certain things, you’re able to see sort of the sequence and kinetics of the thought.

Joshua: MRI is not great for temporal resolution. So what we’re not seeing is how on the order of milliseconds a thought gets put together.

Lucas: Okay. I see.

Joshua: What MRI is better for, it has better spatial resolution and is better able to identify spatial patterns of activity that correspond to representing different ideas or parts of ideas.

Lucas: And so in the future, as our resolution begins to increase in terms of temporal imaging or being able to isolate more specific structures, I’m just trying to get a better understanding of what your hopes are for increased ability of resolution and imaging in the future, and how that might also help to disclose grounded cognition.

Joshua: One strategy for getting a better understanding is to combine different methods. fMRI can give you some indication of where you’re representing the fact that it’s a dog you’re thinking about as opposed to a cat. But other neuroimaging techniques have better temporal resolution and not as good spatial resolution. So EEG, which measures electrical activity from the scalp, has millisecond temporal resolution, but it’s very blurry spatially. The hope is that you combine those two things and get a better idea. Now both of these things have been around for more than 20 years, and there hasn’t been as much progress as I would have hoped in combining them. Another approach is more sophisticated models. What I’m hoping we can do is say, all right, we have humans doing this task where they are deciding whether or not these images match these descriptions, and we know that humans do this in a way that enables them to generalize, so that they can handle combinations of things they’ve never seen before.

Joshua: Like this is a giraffe chasing a Komodo Dragon. You’ve never seen that image before, but you could look at that image for the first time and say, okay, that’s a giraffe chasing a Komodo Dragon, at least if you know what those animals look like, right?

Lucas: Yeah.

Joshua: So then you can say, well, what does it take to train a neural network to be able to do that task? And what does it take to train a neural network to be able to do it in such a way that it can generalize to new examples? So if you teach it to recognize Komodo Dragon, can it then generalize such that, well, it learned how to recognize giraffe chases lion, or lion chases giraffe, and so it understands chasing, and it understands lion, and it understands giraffe. Now if you teach it what a Komodo dragon looks like, can it automatically slot that into a complex relational structure?

And so then let’s say we have a neural network that we’ve trained and that is able to do that. It’s not all of human cognition. We assume it’s not conscious, but it may capture key features of that cognitive process. And then we look at the model and say, okay, well in real time, what is that model doing and how is it doing it? And then we have a more specific hypothesis that we can take back to the brain and ask, does the brain do it something like the way this artificial neural network does it? And so the hope is that by building artificial neural models of these certain aspects of high level cognition, we can better understand human high level cognition, and the hope is that it will also feed back the other way. If we look and say, oh, this seems to be how the brain does it, then we can mimic that kind of architecture in an artificial neural network: does that enable it to solve the problem in a way that it otherwise wouldn’t?

Lucas: Right. I mean we already have AGIs, they just have to be created by humans and they live about 80 years, and then they die, and so we already have an existence proof, and the problem really is the brain is so complicated that there are difficulties replicating it on machines. And so I guess the key is how much can our study of the human brain inform our creation of AGI through machine learning or deep learning or like other methodologies.

Joshua: And it’s not just that the human brain is complicated, it’s that the general intelligence that we’re trying to replicate in machines only exists in humans. You could debate the ethics of animal research and sticking electrodes in monkey brains and things like that, but within ethical frameworks that are widely accepted, you can do things to monkeys or rats that help you really understand in a detailed way what the different parts of their brain are doing, right?

But for good reason, we don’t do those sorts of studies with humans, and we would understand much, much, much, much more about how human cognition works if we were–

Lucas: A bit more unethical.

Joshua: If we were a lot more unethical, if we were willing to cut people’s brains open and say, what happens if you lesion this part of the brain? What happens if you then have people do these 20 tasks? No sane person is suggesting we do this. What I’m saying is that part of the reason why we don’t understand it is because it’s complicated, but another part of the reason why we don’t understand is that we are very much rightly placing ethical limits on what we can do in order to understand it.

Lucas: Last thing here that I just wanted to touch on on this is when I’ve got this multimodal combinatorial thing going on in my head, when I’m thinking about how like a Komodo dragon is chasing a giraffe, how deep does that combinatorialness need to go for me to be able to see the Komodo Dragon chasing the giraffe? Your earlier example was like a purple ocean with a Komodo Dragon wearing like a sombrero hat, like smoking a cigarette. I guess I’m just wondering, well, what is the dimensionality and how much do I need to know about the world in order to really capture a Komodo Dragon chasing a giraffe in a way that is actually general and important, rather than some kind of brittle, heavily trained ML algorithm that doesn’t really know what a Komodo Dragon chasing a giraffe is.

Joshua: It depends on what you mean by really know. Right? But at the very least you might say it doesn’t really know it if it can’t both recognize it in an image and output a verbal label. That’s the minimum, right?

Lucas: Or generalize to new contexts-

Joshua: And generalize to new cases, right. And I think generalization is key, right. What enables you to understand the crazy scene you described is not that you’ve seen so many scenes that one of them is a pretty close match, but instead that you have this compositional engine: you understand the relations, and you understand the objects, and that gives you the power to construct this effectively infinite set of possibilities. So what we’re trying to understand is, what is the cognitive engine that interprets and generates those infinite possibilities?

Lucas: Excellent. So do you want to sort of pivot here into how Rawls’s veil of ignorance fits in?

Joshua: Yeah. So one side of the lab is focused more on this sort of key aspect of individual intelligence. On the more moral and social side of the lab, we’re trying to understand our collective intelligence and our social decision making, and we’d like to do research that can help us make better decisions. Of course, what counts as better is always contentious, especially when it comes to morality, but there are influences that one could plausibly interpret as better. Right? One of the most famous ideas in moral and political philosophy is John Rawls’s idea of the veil of ignorance, where what Rawls essentially said is: you want to know what a just society looks like? Well, the essence of justice is impartiality. It’s not favoring yourself over other people. Everybody has to play by the same rules. It doesn’t mean necessarily everybody gets exactly the same outcome, but you can’t get special privileges just because you’re you.

And so what he said was, well, a just society is one that you would choose if you didn’t know who in that society you would be. Even if you are choosing selfishly, but you are constrained to be impartial because of your ignorance. You don’t know where you’re going to land in that society. And so what Rawls says very plausibly is would you rather be randomly slotted into a society where a small number of people are extremely rich and most people are desperately poor? Or would you rather be slotted into a society where most people aren’t rich but are doing pretty well? The answer pretty clearly is you’d rather be slotted randomly into a society where most people are doing pretty well instead of a society where you could be astronomically well off, but most likely would be destitute. Right? So this is all background that Rawls applied this idea of the veil of ignorance to the structure of society overall, and said a just society is one that you would choose if you didn’t know who in it you were going to be.

And this sort of captures the idea of impartiality as the core of justice. So what we’ve been doing recently, and this is a project led by Karen Huang and Max Bazerman along with myself, is applying the veil of ignorance idea to more specific dilemmas. So one of the places where we have applied this is with ethical dilemmas surrounding self driving cars. We took a case that was most famously discussed recently by Bonnefon, Shariff, and Rahwan in their 2016 Science paper, The Social Dilemma of Autonomous Vehicles, and the canonical version goes something like this: you’ve got an autonomous vehicle, an AV, that is headed towards nine people, and if nothing is done, it’s going to run those nine people over. But it can swerve out of the way and save those nine people, though if it does that, it’s going to drive into a concrete wall and kill the passenger inside.

So the question is should the car swerve or should it go straight? Now, you can just ask people. So what do you think the car should do, or would you approve a policy that says that in a situation like this, the car should minimize the loss of life and therefore swerve? What we did is, some people we just had answer the question just the way I posed it, but other people, we had them do a veil of ignorance exercise first. So we say, suppose you’re going to be one of these 10 people, the nine on the road or the one in the car, but you don’t know who you’re going to be.

From a purely selfish point of view, would you want the car to swerve or not, and almost everybody says, I’d rather have the car swerve. I’d rather have a nine out of 10 chance of living instead of a one out of 10 chance of living. And then we asked people, okay, that was a question about what you would want selfishly, if you didn’t know who you were going to be. Would you approve of a policy that said that cars in situations like this should swerve to minimize the loss of life.

The people who’ve gone through the veil of ignorance exercise first are more likely to approve of the utilitarian policy, the one that aims to minimize the loss of life, than the people who just answered the question. And we have control conditions where we have them do a version of the veil of ignorance exercise but where the probabilities are mixed up, so there’s no relationship between the probability and the number of people, and that’s sort of the tightest control condition, and you still see the effect. The idea is that the veil of ignorance is a cognitive device for thinking about a dilemma in a more impartial kind of way.

And then what’s interesting is that people recognize, they do a bit of kind of philosophizing. They say, huh, if I said that what I would want is to have the car swerve, and I didn’t know who I was going to be, that’s an impartial judgment in some sense. And that means that even if I feel sort of uncomfortable about the idea of a car swerving and killing its passenger in a way that is foreseen, if not intended in the most ordinary sense, even if I feel kind of bad about that, I can justify it because I say, look, it’s what I would want if I didn’t know who I was going to be. So we’ve done this with self driving cars, we’ve done it with the classics of the trolley dilemma, we’ve done it with a bioethical case involving taking oxygen away from one patient and giving it to nine others, and we’ve done it with a charity where we have people making a real decision involving real money between a more versus less effective charity.

And across all of these cases, what we find is that when you have people go through the veil of ignorance exercise, they’re more likely to make decisions that promote the greater good. It’s an interesting bit of psychology, but it’s also perhaps a useful tool, that is we’re going to be facing policy questions where we have gut reactions that might tell us that we shouldn’t do what favors the greater good, but if we think about it from behind a veil of ignorance and come to the conclusion that actually we’re in favor of what promotes the greater good at least in that situation, then that can change the way we think. Is that a good thing? If you have consequentialist inclinations like me, you’ll think it’s a good thing, or if you just believe in the procedure, that is I like whatever decisions come out of a veil of ignorance procedure, then you’ll think it’s a good thing. I think it’s interesting that it affects the way people make the choice.

Lucas: It’s got me thinking about a lot of things. I guess a few things are that I feel like if most people on earth had a philosophy education or at least had some time to think about ethics and other things, they’d probably update their morality in really good ways.

Joshua: I would hope so. But I don’t know how much of our moral dispositions come from explicit education versus our broader personal and cultural experiences. Certainly I think it’s worth trying, and I certainly believe in the possibility, which is why I do this research, but I come to it with some humility about how much that by itself can accomplish. I don’t know.

Lucas: Yeah, it would be cool to see the effect size of Rawls’s veil of ignorance across different societies and persons, and then there are other things you can do, like the child drowning in the shallow pool argument; there are just tons of different thought experiments, and it would be interesting to see how they update people’s ethics and morality. The other thing I just sort of wanted to inject here is the difference between naive consequentialism and sophisticated consequentialism. Sophisticated consequentialism would also take into account not only the direct effect of saving more people, but also how human beings have arbitrary partialities to what I would call fictions, like rights or duties or other things. A lot of people share these, and I think within our sort of consequentialist understanding and framework of the world, people just don’t like the idea of their car smashing into walls. Whereas yeah, we should save more people.

Joshua: Right. And as Bonnefon et al. point out, and I completely agree, if making cars narrowly utilitarian, in the sense that they always try to minimize the loss of life, makes people not want to ride in them, and that means that there are more accidents that lead to human fatalities because people are driving instead of being driven, then that is bad from a consequentialist perspective, right? So you can call it sophisticated versus naive consequentialism, but really there’s no question that utilitarianism or consequentialism in its original form favors the more sophisticated readings. So it’s kind of more-

Lucas: Yeah, I just feel that people often don’t do the sophisticated reasoning, and then they come to conclusions.

Joshua: And this is why I’ve attempted with not much success, at least in the short term, to rebrand utilitarianism as what I call deep pragmatism. Because I think when people hear utilitarianism, what they imagine is everybody walking around with their spreadsheets and deciding what should be done based on their lousy estimates of the greater good. Whereas I think the phrase deep pragmatism gives you a much clearer idea of what it looks like to be utilitarian in practice. That is you have to take into account humans as they actually are, with all of their biases and all of their prejudices and all of their cognitive limitations.

When you do that, it’s obviously a lot more subtle and flexible and cautious than-

Lucas: Than people initially imagine.

Joshua: Yes, that’s right. And I think utilitarianism has a terrible PR problem, and my hope is that we can either stop talking about the U philosophy and talk instead about deep pragmatism, see if that ever happens, or at the very least, learn to avoid those mistakes when we’re making serious decisions.

Lucas: The other very interesting thing that this brings up is that if I do the veil of ignorance thought exercise, then I’m more partial towards saving more people and towards policies which will reduce the loss of life. And then I sort of realize that I actually do have this strange arbitrary partiality, like wanting the car I bought not to crash me into a wall, which from a third person point of view seems kind of irrational, because the utilitarian thing initially seems most rational. But then we have the chance to reflect as persons: well, maybe I shouldn’t have these arbitrary beliefs. Maybe we should start updating our culture in ways that get rid of these biases so that the utilitarian calculations aren’t so corrupted by scary primate thoughts.

Joshua: Well, so I think the best way to think about it is: how do we make progress? Not how do we radically transform ourselves into alien beings who are completely impartial, right? I don’t think that’s the most useful thing to do. Take the special case of charitable giving: you could turn yourself into a happiness pump, that is, devote all of your resources to the world’s most effective charities.

And you may do a lot of good as an individual compared to other individuals if you do that, but most people are going to look at you and just say, well that’s admirable, but it’s super extreme. That’s not for me, right? Whereas if you say, I give 10% of my money, that’s an idea that can spread, that instead of my kids hating me because I deprived them of all the things that their friends had, they say, okay, I was brought up in a house where we give 10% and I’m happy to keep doing that. Maybe I’ll even make it 15. You want norms that are scalable, and that means that your norms have to feel livable. They have to feel human.

Lucas: Yeah, that’s right. We should be spreading more deeply pragmatic approaches and norms.

Joshua: Yeah. We should be spreading the best norms that are spreadable.

Lucas: Yeah. There you go. So thanks so much for joining me, Joshua.

Joshua: Thanks for having me.

Lucas: Yeah, I really enjoyed it and see you again soon.

Joshua: Okay, thanks.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode of the AI Alignment Series.

[end of recorded material]

AI Alignment Podcast: The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning with El Mahdi El Mhamdi (Beneficial AGI 2019)

Three generals are voting on whether to attack or retreat from their siege of a castle. One of the generals is corrupt and two of them are not. What happens when the corrupted general sends different answers to the other two generals?

Byzantine fault is “a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the ‘Byzantine Generals’ Problem’, developed to describe this condition, where actors must agree on a concerted strategy to avoid catastrophic system failure, but some of the actors are unreliable.”

The Byzantine Generals’ Problem and associated issues in maintaining reliable distributed computing networks is illuminating for both AI alignment and modern networks we interact with like Youtube, Facebook, or Google. By exploring this space, we are shown the limits of reliable distributed computing, the safety concerns and threats in this space, and the tradeoffs we will have to make for varying degrees of efficiency or safety.

The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning with El Mahdi El Mhamdi is the ninth podcast in the AI Alignment Podcast series, hosted by Lucas Perry. El Mahdi pioneered Byzantine resilient machine learning, devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation in networks, applied to both neural and biomolecular networks. This particular episode was recorded at the Beneficial AGI 2019 conference in Puerto Rico. We hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

Topics discussed in this episode include:

  • The Byzantine Generals’ Problem
  • What this has to do with artificial intelligence and machine learning
  • Everyday situations where this is important
  • How systems and models are to update in the context of asynchrony
  • Why it’s hard to do Byzantine resilient distributed ML
  • Why this is important for long-term AI alignment

An overview of Adversarial Machine Learning and where Byzantine-resilient Machine Learning stands on the map is available in this (9 min) video. A specific focus on Byzantine Fault Tolerant Machine Learning is available here (~7 min).

In particular, El Mahdi argues in the first interview (and in the podcast) that technical AI safety is not only relevant for long term concerns, but is crucial in current pressing issues such as social media poisoning of public debates and misinformation propagation, both of which fall into Poisoning-resilience. Another example he likes to use is social media addiction, that could be seen as a case of (non) Safely Interruptible learning. This value misalignment is already an issue with the primitive forms of AIs that optimize our world today as they maximize our watch-time all over the internet.

The latter (Safe Interruptibility) is another technical AI safety question El Mahdi works on, in the context of Reinforcement Learning. This line of research was initially dismissed as “science fiction”; in this interview (5 min), El Mahdi explains why it is a realistic question that arises naturally in reinforcement learning.

El Mahdi’s work on Byzantine-resilient Machine Learning and other relevant topics is available on his Google Scholar profile. A modification of the popular machine learning library TensorFlow, making it Byzantine-resilient (and also supporting communication over UDP channels, among other things), has recently been open-sourced on Github by El Mahdi’s colleagues, based on the algorithmic work we mention in the podcast.


You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today we’ll be speaking with El Mahdi El Mhamdi on the Byzantine Generals’ Problem, Byzantine tolerance, and poisoning in distributed learning and computer networks. If you find this podcast interesting or useful, please give it a like and follow us on your preferred listening platform. El Mahdi El Mhamdi pioneered Byzantine resilient machine learning, devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation in networks, applied to both neural and biomolecular networks. With that, El Mahdi’s going to start us off with a thought experiment.

El Mahdi: Imagine you are part of a group of three generals, say, from the Byzantine army, surrounding a city you want to invade, but you also want to retreat if retreat is the safest choice for your army. You don’t want to attack when you would lose. Those three generals are on three sides of the city. They sent some intelligence inside the walls of the city, and depending on this intelligence information, either they think they will have a good chance of winning and would like to attack, or they think they will be defeated by the city, so it’s better for them to retreat. Your final decision would be a majority vote, so you communicate through some horsemen that, let’s say, are reliable for the sake of this discussion. But one of you might have been corrupted by the city.

The situation would be problematic if, say, there are General A, General B, and General C. General A decided to attack. General B decided to retreat, based on their intelligence, for some legitimate reason. A and B are not corrupted, and say that C is. Of course, A and B can’t figure out who is corrupted. What the corrupted general would do is this: A wanted to attack, so C tells A, “I also want to attack. I will attack.” Then C tells General B, “I also want to retreat. I will retreat.” A receives two attack votes and one retreat vote. General B receives two retreat votes and only one attack vote. If they trust everyone and don’t do any double checking, this would be a disaster.

A will attack alone; B will retreat; C, of course, doesn’t care, because he was corrupted by the city. You could tell me they can circumvent that by double checking. For example, A and B can communicate about what C told them. Let’s say that every general communicates to every general what he decided, and also what the remaining part of the group told them. A will report to B, “General C told me to attack.” Then B will tell A, “General C told me to retreat.” But then A and B don’t have any way of concluding whether the inconsistency is coming from the fact that C is corrupted, or from the general reporting on what C told them being corrupted.

If I am General A, I have every reason to think, with the same likelihood, that C may be lying to me or that B may be lying to me: I can’t know whether you are misreporting what C told you. It’s enough for the city to corrupt one general out of three; it’s impossible to come up with an agreement in this situation. You can easily see that this generalizes to more than three generals, say 100, as soon as the non-corrupted ones are less than two-thirds, because what we saw with three generals happens again at the level of fractions. Say that you have strictly more than 33 corrupted generals out of 100: what they can do is switch the majority vote on each side.

But worse than that, say that you have 34 corrupted generals and 66 non-corrupted generals, and say that those 66 non-corrupted generals split 33 on the attack side and 33 on the retreat side. The problem is that when you are on one side, say the retreat side, you have in front of you a group of 34 plus 33 in which there’s a majority of malicious ones. This majority can collude; that’s part of the Byzantine hypothesis. The malicious ones can collude, and they will report a majority of inconsistent messages about the minority of 33. You can’t provably establish that the inconsistency is coming from the group of 34, because they are a majority.
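The vote-splitting El Mahdi describes can be sketched in a few lines of Python (an illustrative toy, not from the episode; the vote values and tally rule are assumptions for the sketch): the corrupt general C equivocates, and the two honest generals reach opposite majority decisions.

```python
# Toy illustration: three generals, majority vote, one equivocating traitor.
def decide(own_vote, received_votes):
    # Each general tallies their own vote plus the messages they received.
    votes = [own_vote] + received_votes
    return "attack" if votes.count("attack") > votes.count("retreat") else "retreat"

a_vote, b_vote = "attack", "retreat"   # honest generals' genuine preferences

# The corrupt general C sends a different vote to each honest general.
c_to_a, c_to_b = "attack", "retreat"

a_decision = decide(a_vote, [b_vote, c_to_a])  # A tallies: 2 attack, 1 retreat
b_decision = decide(b_vote, [a_vote, c_to_b])  # B tallies: 2 retreat, 1 attack

# Agreement is broken: A attacks alone while B retreats.
print(a_decision, b_decision)
```

With one traitor out of three, a third of the group is corrupted, which is exactly the threshold at which agreement with unsigned messages becomes provably impossible.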

Lucas: When we’re thinking about, say, 100 persons or 100 generals, why is it that they’re going to be partitioned automatically into these three groups? What if there’s more than three groups?

El Mahdi: Here we’re doing the easiest form of Byzantine agreement: we want to agree on attack versus retreat. When it becomes multi-dimensional, it gets even messier; there are even more impossibility results. Even for the binary decision, there is an impossibility theorem on reaching agreement if you have unsigned messages carried by horsemen: whenever the corrupted group reaches a third, you provably cannot come up with an agreement. There are many variants of this problem, of course, depending on what hypotheses you can assume. Here, without even mentioning it, we were assuming bounded delays: the horsemen always arrive eventually. But the horsemen could die on the way, and you don’t have any way to check whether they arrived or not, so you can be waiting forever, because you have no proof that a horseman died on the way.

You don’t have any mechanism that tells you, “Stop waiting for the horseman. Stop waiting for the message from General B, because the horseman died.” You can be waiting forever, and there are theorems that show what happens when you have unbounded delays. By the way, in distributed computing, whenever you have unbounded delays, we speak about asynchrony. If you have asynchronous communication, there is a very famous theorem that tells you consensus is impossible, not even in the malicious case, but just like in …

Lucas: In the mundane normal case.

El Mahdi: Yes. It’s called the Fischer-Lynch-Paterson theorem.

Lucas: Right, so just to dive down into the crux of the problem: the issue here, fundamentally, is that when groups of computers or groups of generals or whatever are trying to check who is lying, by comparing the discrepancies and similarities in who claims what, if the corrupted members can form a majority on some side, then, yeah, you’re screwed.

El Mahdi: Yes. It’s impossible to achieve agreement. There is always a fraction of malicious agents above which it is provably impossible to agree. Depending on the situation, it will be a third, or sometimes a half, or a quarter, depending on your specifications.

Lucas: If you start tweaking the assumptions behind the thought experiment, then it changes what number of corrupted machines or agents that are required in order to flip the majority and to poison the communication.

El Mahdi: Exactly. But for example, you mentioned something very relevant to today’s discussion, which is what if we were not agreeing on two decisions, retreat, attack. What if we were agreeing on some multi-dimensional decision? Attack or retreat on one dimension and then …

Lucas: Maybe hold, keep the siege going.

El Mahdi: Yeah, just add possibilities or dimensions and you get multi-dimensional agreement. There are even more hopeless results in that direction.

Lucas: There are more impossibility theorems and issues where these distributed systems are vulnerable to small numbers of corrupted machines screwing over the entire distributed network.

El Mahdi: Yes. Maybe now we can slightly move to machine learning.

Lucas: I’m happy to move into machine learning now. We’ve talked about this, and I think our audience can probably tell how this has to do with computers. Yeah, just dive in what this has to do with machine learning and AI and current systems today, and why it even matters for AI alignment.

El Mahdi: As a brief transition: solving the agreement problem, besides this very nice historic thought experiment, is behind the consistency of safety critical systems like banking systems. Imagine we have a shared account. Maybe you remove $10 and then she or he adds 10% to the account. You remove the $10 in New York and she or he adds the 10% in Los Angeles. The banking system has to agree on the ordering, because minus $10 then plus 10% is not the same result as plus 10% then minus $10. The final balance of the account would not be the same.
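The order-dependence is a two-line check (a toy sketch; the $100 starting balance is an assumption, not from the episode):

```python
# Two replicas of a shared account apply the same two operations in
# different orders; without agreement on ordering, the balances diverge.
def withdraw_10(balance):
    return balance - 10

def add_10_percent(balance):
    return balance * 1.10

start = 100.0
new_york = add_10_percent(withdraw_10(start))     # about 99: withdraw first
los_angeles = withdraw_10(add_10_percent(start))  # about 100: add 10% first

print(new_york, los_angeles)  # the two replicas now disagree
```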

Lucas: Right.

El Mahdi: The banking systems routinely solve decisions that fall into agreement. If you work on some document sharing platform, like Dropbox or Google Docs, and you and I are collaboratively writing a document, the document sharing platform has to solve, in real time, agreement on the ordering of operations, so that you and I always keep seeing the same thing. This has to happen while some of the machines interconnecting us are failing, whether because of a power crash, because a data center lost some machines, or because of a restart, a bug, or a machine being taken away. What we want in distributed computing is communication schemes between machines that guarantee the consistency that comes from agreement, as long as some fraction of the machines is reliable. What this has to do with artificial intelligence and machine learning reliability is that, with some colleagues, we are trying to bring some of the major issues in machine learning reliability under the Byzantine fault tolerance umbrella. Take, for instance, poisoning attacks.

Lucas: Unpack what poisoning attacks are.

El Mahdi: For example, imagine you are training a model on which videos are good to recommend given some keyword search. If you search for “medical advice for young parents on vaccines,” that is a label. Let’s assume, for the sake of simplicity, that a video telling you not to take your kid in for vaccines is not what we mean by medical advice for young parents on vaccines, because that’s what medical experts agree on. We want our system to learn that anti-vaxxer, anti-vaccine propaganda is not what people are searching for when they type those keywords. So suppose a world where we care about accuracy, okay? Imagine you want to train a machine learning model that gives you accurate results for your search. Let’s also, for the sake of simplicity, assume that a majority of people on the internet are honest.

Let’s assume that more than 50% of people are not actively trying to poison the internet. Yeah, this is very optimistic, but let’s assume that. What my colleagues and I started this line of research with is that you can easily prove that one single malicious agent can provably poison a distributed machine learning scheme. Imagine you are this video-sharing platform. Whenever people behave on your platform, this generates what we call gradients, which update your model. It only takes a few hyperactive accounts generating behavior powerful enough to pull what we call the average gradient, because distributed machine learning, at least up to today, if you read the source code of most distributed machine learning frameworks, is always averaging gradients.
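
A minimal sketch of the effect described above, assuming plain gradient averaging (the dimensions, counts, and target are all made up for illustration):

```python
import numpy as np

# n - 1 honest workers send noisy estimates of the true gradient; a single
# attacker sends one vector chosen so that the average lands wherever it wants.
rng = np.random.default_rng(0)
n = 100
true_grad = np.array([1.0, -2.0])
honest = true_grad + 0.1 * rng.standard_normal((n - 1, 2))

target = np.array([-50.0, 50.0])          # where the attacker wants the average
attack = n * target - honest.sum(axis=0)  # solves (sum_honest + attack) / n == target

poisoned_avg = np.vstack([honest, attack]).mean(axis=0)
print(poisoned_avg)  # equals target, up to floating point
```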

Imagine you, Lucas Perry, just searched for a video on the Parkland shootings. Then the video-sharing platform shows you a video telling you that David Hogg and Emma Gonzalez and those kids behind the March for Our Lives movement are crisis actors. The video labels three kids as crisis actors. It obviously has a wrong label, so it is what I would call a poisoned data point. If you are a non-malicious agent on the video-sharing platform, you will dislike the video. You will not approve it. You’re likely to flag it. This should generate a gradient that pushes the model in that direction, so the gradient will update the model in a direction where it stops thinking that this video is relevant for someone searching “Parkland shooting survivors.” What can happen, if your machine learning framework is just averaging gradients, is that a bunch of people hyperactive on some topic could poison the average and pull it towards a direction where the model is reinforced in thinking that, “Yeah, those kids are crisis actors.”

Lucas: This is the case because the hyperactive accounts are seen to be given more weight than accounts which are less active in the same space. But this extra weighting that these accounts will get from their hyperactivity in one certain category or space over another, how is the weighting done? Is it just time spent per category or does it have to do with submissions that agree with the majority?

El Mahdi: We don’t even need to go into the details, because we don’t know them. I’m talking in a general setting where you have a video-sharing platform aggregating gradients from behavior. Now, let’s raise the abstraction level. You are doing gradient descent, so you have a loss function that you want to minimize. You have an error function: the mismatch between what you predict and what the user tells you. The user tells you this is a wrong prediction, and then you move in the direction where users stop telling you it is wrong. You are doing gradient descent in this sense, minimizing the loss function. Users behave, and from their behavior you generate gradients.

What you do now, in the state-of-the-art way of doing distributed machine learning, is that you average all those gradients. Averaging is well known not to be resilient. If you have a room of poor academics earning a few thousand dollars each and then a billionaire jumps in the room, an algorithm that reasons with averaging will think this is a room of millionaires, because the average salary would be around a hundred million. But the median is very easy to compute when it comes to salaries, to scalar numbers, because you can rank them.
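
The billionaire example in code (the salaries are invented for illustration):

```python
import statistics

# One outlier wrecks the mean but barely moves the median.
salaries = [40_000] * 9 + [1_000_000_000]  # nine academics, one billionaire

mean = statistics.mean(salaries)
median = statistics.median(salaries)
print(mean)    # 100036000.0 -> "a room of millionaires"
print(median)  # 40000       -> the typical salary in the room
```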

Lucas: Right.

El Mahdi: You rank numbers and then decide, “Okay, this is the ordering. This is the number that falls in the middle. This is the upper half, this is the lower half, and this is the median.” In high dimensions, the median is a bit tricky. It has some computational issues. Even if you compute what we call the geometric median, an attacker can still leverage the fact that you’re only approximating it, because there is no closed form to compute the median in high dimension. But worse than that, what we showed in one of our follow-up works is that, because machine learning is done in very, very high dimensions, there is a curse-of-dimensionality issue that makes it possible for attackers to sneak in without being spotted by the median.

The attack can still look like the median vector. I take advantage of the fact that those vectors, those gradients, are extremely high dimensional. I would look for all the disagreements. Let’s say you have a group of a couple hundred gradients, and I’m the only malicious one. I would look at the group of correct vectors, all updating you in roughly the same direction, within some variance. On average, they are what we call unbiased estimators of the gradient: when you take out the randomness, their expected value is the real gradient of the loss function. What I will do as a malicious worker is look at the ways they disagree slightly in each direction.

I will sum that. I will see that they disagree by this much on direction one. They disagree by this much on direction two. They disagree by this much, epsilon one, epsilon two, epsilon three. I would look for all these small disagreements they have on all the components.

Lucas: Across all dimensions and high dimensional space. [crosstalk 00:16:35]

El Mahdi: Then add that up. It will be my budget, my leeway, my margin to attack you on another direction.

Lucas: I see.

El Mahdi: What we proved is that you have to mix ideas from the geometric median with ideas from the traditional component-wise median, and those are completely different things. The geometric median finds a median by minimizing the sum of distances between the point you look for and all the vectors that were proposed, while the component-wise median does the traditional job of ranking coordinates: it looks at each coordinate, ranks all the propositions, and picks the proposition lying in the middle. What we proved in a follow-up work is that, yeah, the geometric median idea is elegant, and it can make you converge, but it can make you converge to something arbitrarily bad, decided by the attacker. When you train complex models like neural nets, the landscape you optimize in is not convex. It’s not like a bowl or a cup where, if you just follow the descending slope, you end up at the lowest point.
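
The two median ideas can be sketched as follows. This is only an illustration of the coordinate-wise median and of the geometric median (approximated here with the standard Weiszfeld iteration), not the exact aggregation rule from the papers; the gradients are invented:

```python
import numpy as np

# Coordinate-wise median: rank each coordinate independently.
def coordinatewise_median(grads):
    return np.median(grads, axis=0)

# Geometric median: minimize the sum of Euclidean distances to all vectors,
# approximated with a few Weiszfeld iterations.
def geometric_median(grads, iters=100):
    y = grads.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(grads - y, axis=1)
        d = np.maximum(d, 1e-12)  # avoid division by zero at a data point
        w = 1.0 / d
        y = (w[:, None] * grads).sum(axis=0) / w.sum()
    return y

# Three honest gradients near [1, 1] and one wild outlier.
grads = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [100.0, -100.0]])
print(coordinatewise_median(grads))  # stays near [1, 1]
print(geometric_median(grads))       # also pulled toward the honest cluster
```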

Lucas: Right.

El Mahdi: It’s like a multitude of bowls with different heights.

Lucas: Right, so there’s tons of different local minima across the space.

El Mahdi: Exactly. So in the first paper, what we showed is that ideas that look like the geometric median are enough to converge. You provably converge. But in the follow-up, what we realized, something we were already aware of, but not enough in my opinion, is that there is this square root of D, this curse of dimensionality that arises when you compute high-dimensional distances, and that the attacker can leverage.

So in what we call the hidden vulnerability of distributed learning, you can have correct vectors, agreeing on one component. Imagine in your head some three axis system.

Let’s say that they are completely in agreement on axis three. But then on axes one and two, so in the plane formed by axis one and axis two, they have a small disagreement.

What I will do as the malicious agent is leverage this small disagreement and inject it in axis three. This will make you go in a slightly modified direction, and instead of going to this very deep, very good minimum, you will go into a local trap just close by.

And that comes from the fact that loss functions of interesting models are clearly far from convex. The models are high dimensional, the loss function is highly non-convex, and that creates a lot of leeway.

Lucas: It creates a lot of local minima spread throughout the space for you to attack the person into.

El Mahdi: Yeah. So convergence is not enough. So we started this research direction by formulating the following question, what does it take to guarantee convergence?

And any scheme that aggregates gradients and guarantees convergence is called Byzantine resilient. But then you realize that, in very high dimensions and with highly non-convex loss functions, is convergence enough? Do you just want to converge?

There are of course people arguing that for deep learning models this is fine; there’s this famous paper by Anna Choromanska, Yann LeCun, and Gérard Ben Arous about the landscape of neural nets that basically says, “Yeah, the deep local minima of neural nets are somehow as good.”

From an overly simplified point of view, it’s an optimistic paper, that tells you that you shouldn’t worry too much when you optimize neural nets about the fact that gradient descent would not necessarily go to a global like-

Lucas: To a global minima.

El Mahdi: Yeah. Just like, “Stop caring about that.”

Lucas: Because the local minima are good enough for some reason.

El Mahdi: Yeah. I think that’s a not-too-unfair way to summarize the paper for the sake of this discussion. What we empirically illustrate, and theoretically support, is that that’s not necessarily true.

Because we show that with fairly low dimensional, not extremely complex models, trained on CIFAR-10 and MNIST, which are very easy toy problems, it’s already enough to have, let’s say, 100,000 parameters or fewer for an attacker to always find a direction to take you away, step by step, and eventually reach an arbitrarily bad local minimum. And then you just converge to that.

So convergence is not enough. Not only do you have to seek an aggregation rule that guarantees convergence, you have to seek aggregation rules that guarantee you will not converge to something arbitrarily bad: you keep converging to the same high-quality local minimum, whatever that means.

The hidden vulnerability is this high dimensional idea. It’s the fact that because the loss function is highly non-convex, because there’s the high dimensionality, as an attacker I would always find some direction, so the attack goes this way.

Here the threat model is that an attacker can spy on the gradients generated by the correct workers but cannot talk on their behalf, so it cannot corrupt their messages. This connects to your earlier question about whether the horsemen are reliable.

So the horsemen are reliable. I can’t talk on your behalf, but I can spy on you. I can see what you are sending to the others, and anticipate.

So as an attacker I would wait for correct workers to generate their gradients, gather those vectors, and then do a linear regression on those vectors to find the best direction to leverage the disagreement on the D minus one remaining directions.

Because there will be this natural disagreement, this variance, in many directions, I will just do some linear regression and find the best direction to attack, and use the budget I gathered, those epsilons I mentioned earlier, roughly D times epsilon across all the directions, and inject it in the direction that maximizes my chances of taking you away from good local minima.
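
A toy sketch of the budget idea, assuming Gaussian honest gradients. It illustrates how per-coordinate leeway adds up in high dimension; it is not the construction from the paper, and every number is invented:

```python
import numpy as np

# Honest gradients disagree a little on every coordinate; the attacker
# sums that per-coordinate leeway and spends the whole budget on one direction.
rng = np.random.default_rng(1)
d = 1000  # high dimension is what makes the budget large
honest = rng.normal(loc=1.0, scale=0.1, size=(50, d))

leeway = honest.std(axis=0)  # epsilon_i: honest disagreement per coordinate
budget = leeway.sum()        # total disagreement across all d coordinates

attack = honest.mean(axis=0).copy()
attack[0] += budget          # inject the whole budget into a single direction

# Coordinate by coordinate the attack looks almost honest, except on
# coordinate 0, where it is enormous compared to the honest spread there.
print(budget, leeway[0])
```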

So you will converge, as proven in the early papers, but not necessarily to something good. What we showed here is that if you combine ideas from the multidimensional geometric median with ideas from the single-dimensional component-wise median, you improve your robustness.

Of course it comes with a price. You require three quarters of the workers to be reliable.

There is another direction in which we extended this problem: asynchrony. Asynchrony arises when, as I said in the Byzantine generals setting, you don’t have a bounded delay. In the bounded-delay setting, you know that horses arrive after at most one hour.

Lucas: But I have no idea if the computer on the other side of the planet is ever gonna send me that next update.

El Mahdi: Exactly. So imagine you are doing machine learning on smartphones. You are leveraging a set of smartphones all around the globe, with different bandwidths, different communication issues, etc.

And you don’t want each time to be bottlenecked by the slowest one. So you want to be asynchronous, you don’t want to wait. You’re just like whenever some update is coming, take it into account.

Imagine some very advanced AI scenario where you send a lot of learners out across the universe, and they communicate at the speed of light, but some of them are five light-minutes away while others are two and a half light-hours away. You want to learn from all of them, but not handicap the closest ones just because some other learners are far away.

Lucas: You want to run updates in the context of asynchrony.

El Mahdi: Yes. So you want to update whenever a gradient is popping up.

Lucas: Right. Before we move on, to illustrate the problem again: the order matters, right? Like in the banking example. Because plus 10% then minus $10 is different from-

El Mahdi: Yeah. Here the order matters for different reasons. You are updating me based on the model you got three hours ago, but in the meantime, three different agents updated me based on models they got three minutes ago.

All the agents are communicating through some abstraction they call the server maybe. Like this server receives updates from fast workers.

Lucas: It receives gradients.

El Mahdi: Yeah, gradients. I also call them updates.

Lucas: Okay.

El Mahdi: Because some workers are close to me and very fast, I’ve done maybe 1,000 updates while you were still working and sending me your message.

So when your update arrives, it is hard to tell whether it is just very stale or actively malicious. So what we do here, and I think it’s very important now to connect back a bit with classic distributed computing, is the following.

Byzantine resilience in machine learning is easier than Byzantine resilience in classical distributed computing for one reason, but it is extremely harder for another reason.

The reason it is easier is that we know what we want to agree on. We want to agree on a gradient. We have the toolbox of calculus that tells us what it looks like. We know it is the slope of some loss function that is, for most of today’s models, relatively smooth, differentiable, maybe Lipschitz, with bounded curvature.

So we know that we are agreeing on vectors that are gradients of some loss function, and we know that there is a majority of workers producing vectors that tell us what a legitimate vector looks like.

You can find some median behavior and then come up with filtering criteria that get rid of the bad gradients. That’s the good news. That’s why it’s easier to do Byzantine resilience in machine learning than to do Byzantine agreement, because agreement is a way harder problem.

The reason why Byzantine resilience is harder in machine learning than in the typical settings you have in distributed computing is that we are dealing with extremely high dimensional data, extremely high dimensional decisions.

So a decision here is to update the model. It is triggered by a gradient. So whenever I accept a gradient, I make a decision. I make a decision to change the model, to take it away from this state, to this new state, by this much.

But this is a multidimensional update. And Byzantine agreement, or Byzantine approximate agreement, in high dimension was proven hopeless by Hammurabi Mendes and Maurice Herlihy in an excellent paper in 2013, where they show that you can’t do Byzantine agreement in D dimensions with N agents in fewer than N to the power D local computations per agent.

Of course, in their paper they meant Byzantine agreement on positions, so they framed it with the motivation, “This is N to the power D, but the typical cases we care about in distributed computing are robots agreeing on a position on a plane, or on a position in three-dimensional space.” So D is two or three.

So N to the power two or N to the power three is fine. But in machine learning, D is not two or three; D is a couple of million, or a billion. So N to the power of a million, just forget it.

And not only that, they also require … Remember when I told you that Byzantine-resilient computing always has some upper bound on the number of malicious agents?

Lucas: Mm-hmm (affirmative).

El Mahdi: So the number of total agents should exceed D times the number of malicious agents.

Lucas: What is D again sorry?

El Mahdi: Dimension.

Lucas: The dimension. Okay.

El Mahdi: So if you have to agree on a D-dimensional, say a billion-dimensional decision, you need the total number of agents to be at least a billion times the number of malicious agents.

So if you have, say, 100 malicious agents, you need at least 100 billion agents in total to be resistant. No one is doing distributed machine learning with 100 billion-

Lucas: And this is because the dimensionality is really screwing with the-

El Mahdi: Yes. Byzantine approximate agreement has been proven hopeless. That’s the bad news, and that’s why the dimensionality of machine learning makes it really important to completely move away from traditional distributed computing solutions.

Lucas: Okay.

El Mahdi: So we are not doing agreement. We’re not doing agreement, we’re not even doing approximate agreement. We’re doing something-

Lucas: Totally new.

El Mahdi: Not new, totally different.

Lucas: Okay.

El Mahdi: Called gradient descent. It’s not new. It’s as old as Newton. And it comes with good news: there are some properties, some regularity of the loss function, that we can exploit.

And in the asynchronous setting, it becomes even more critical to leverage those differentiability properties. Because we know that we are optimizing a loss function that has some regularities, we can have some good news.

And the good news has to do with curvature. What we do in the asynchronous setting is that we ask workers not only for their gradients, but also for an empirical estimate of the curvature.

Lucas: Sorry. They’re estimating the curvature of the loss function, that they’re adding the gradient to?

El Mahdi: They add the gradient to the parameter, not the loss function. So we have a loss function, the parameter is the abscissa; you add the gradient to the abscissa to update the model, and then you end up in a different place on the loss function.

So you have to imagine the loss function as like a surface, and then the parameter space as the plane, the horizontal plane below the surface. And depending on where you are in the space parameter, you would be on different heights of the loss function.

Lucas: Wait sorry, so does the gradient depend where you are on this, the bottom plane?

El Mahdi: Yeah [crosstalk 00:29:51]-

Lucas: So then you send an estimate for what you think the slope of the intersection will be?

El Mahdi: Yeah. But for asynchrony, not only that. I will ask you to send me the slope, and your observed empirical growth of the slope.

Lucas: The second derivative?

El Mahdi: Yeah.

Lucas: Okay.

El Mahdi: But the second derivative, again, in high dimension is very hard to compute. You have to compute the Hessian matrix.

Lucas: Okay.

El Mahdi: That’s something completely ugly to compute in high dimensional situations, because it takes D squared computations.

As an alternative, we would like you to send us something whose computation is linear in D, not quadratic in D.

So we would ask you to compute your actual gradient, your previous gradient, the difference between them, and normalize it by the difference between models.

So, “Tell us your current gradient, by how much it changed from the last gradient, and divide that by how much you changed the parameter.”

So you would tell us, “Okay, this is my current gradient, my slope.” And you would also tell us, “By the way, my slope change relative to my parameter change is this much.”
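
The reported quantity is the scalar ratio of the change in gradient to the change in parameters. A quick sanity check on a one-dimensional quadratic, where this ratio exactly recovers the curvature (the function and numbers are invented for illustration):

```python
import numpy as np

# Empirical growth rate of the slope:
#   ||g_t - g_{t-1}|| / ||x_t - x_{t-1}||
# A cheap, linear-in-D stand-in for second-order curvature information.
def empirical_lipschitz(grad_now, grad_prev, params_now, params_prev):
    dg = np.linalg.norm(grad_now - grad_prev)
    dx = np.linalg.norm(params_now - params_prev)
    return dg / dx

# For f(x) = 0.5 * a * x^2 the gradient is a * x, so the ratio recovers a.
a = 3.0
x0, x1 = np.array([1.0]), np.array([1.5])
g0, g1 = a * x0, a * x1
print(empirical_lipschitz(g1, g0, x1, x0))  # 3.0
```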

And this would be some empirical estimation of the curvature. So if you are in a very curved area-

Lucas: Then the estimation isn’t gonna be accurate because the linearity is gonna cut through some of the curvature.

El Mahdi: Yeah but if you are in a very curved area of the loss function, your slope will change a lot.

Lucas: Okay. Exponentially changing the slope.

El Mahdi: Yeah. Because you did a very tiny change in the parameter and it changed the slope a lot.

Lucas: Yeah. Will change the … Yeah.

El Mahdi: When you are in a non-curved area of the loss function, it’s less harmful for us that you are stale, because you will just technically have the same updates.

If you are in a very curved area of the loss function, your updates being stale is now a big problem. So we want to discard your updates proportionally to your curvature.

So this is the main idea of this scheme in asynchrony, where we would ask workers about their gradient, and their empirical growth rates.

And then of course I don’t want to trust you on what you declare, because you can plan to screw me with some gradients, and then declare a legitimate value of the curvature.

I will take those empirical growth rates, what we call in the paper the empirical Lipschitz-ness. So we ask you for this empirical growth rate, which is a scalar, remember? This is very important. It’s a single-dimensional number.

And so we ask you about this growth rate, and we ask all of you about your growth rates, again assuming the majority is correct. The majority of growth rates will help us set the median growth rate in a robust manner, because as long as a simple majority is not lying, the median growth rate will always be bounded between two legitimate values of the growth rate.

Lucas: Right because, are you having multiple workers inform you of the same part of your loss function?

El Mahdi: Yes. Even though they do it in an asynchronous manner.

Lucas: Yeah. Then you take the median of all of them.

El Mahdi: Yes. And then we reason by quantiles of the growth rates.

Lucas: Reason by quantiles? What are quantiles?

El Mahdi: The first third, the second third, the last third. We discard the first third and discard the last third. Anything in the middle third is safe.
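
Reasoning by quantiles can be sketched like this (the reported values are invented; the rule simply keeps the middle third of the sorted reports):

```python
import numpy as np

# Discard the bottom and top thirds of reported growth rates; as long as a
# majority of reports are honest, the surviving middle third is bounded by
# legitimate values.
def middle_third(rates):
    rates = np.sort(np.asarray(rates))
    n = len(rates)
    return rates[n // 3 : n - n // 3]

# Seven honest reports near 1.0, plus two wildly dishonest ones.
reported = [1.0, 1.1, 0.9, 1.05, 0.95, 1000.0, -1000.0, 1.02, 0.98]
safe = middle_third(reported)
print(safe)  # only values from the honest cluster survive
```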

Of course this has some level of pessimism, which is good for safety but not very good for speed. Because maybe people are not lying, so maybe the first third and the last third also contain values we could consider. But for safety reasons we want to be sure.

Lucas: You want to try to get rid of the outliers.

El Mahdi: Possible.

Lucas: Possible outliers.

El Mahdi: Yeah. So we get rid of the first third and the last third.

Lucas: So this ends up being a more conservative estimate of the loss function?

El Mahdi: Yes. That’s completely right. We explain that in the paper.

Lucas: So there’s a trade off that you can decide-

El Mahdi: Yeah.

Lucas: By choosing what percentiles to throw away.

El Mahdi: Yeah. Safety never comes for free. So here, depending on how good your estimate of the number of potential Byzantine actors is, your level of pessimism will translate into slowdown.

Lucas: Right. And so you can update the amount that you’re cutting off-

El Mahdi: Yeah.

Lucas: Based off of the amount of expected corrupted signals you think you’re getting.

El Mahdi: Yeah. So now imagine a situation where the number of workers is known. You know that you are leveraging 100,000 smartphones doing gradient descent for you. Let’s call that N.

You know that F of them might be malicious. We argue that if F exceeds a third of N, you can’t do anything. So we are in a situation where F is less than a third. So fewer than 33,000 workers are malicious, and the slowdown would be governed by F over N, so a third.

What if you are in a situation where you know that the malicious agents are far fewer than a third? For example, you know that you have at most 20 rogue accounts on your video-sharing platform.

And your video-sharing platform has two billion accounts.

Lucas: 20 of them are malevolent.

El Mahdi: What we show is that the slowdown factor would be N minus F divided by N. N is the two billion accounts, F is the 20, and the denominator is again two billion.

So it would be two billion minus 20 divided by two billion, that is 1,999,999,980 over 2,000,000,000, something like 0.99999999. So you would go almost as fast as the non-Byzantine-resilient scheme.
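
The (N - F) / N factor, evaluated for the two scenarios in the conversation (a sketch of the arithmetic only, not the formal guarantee from the papers):

```python
# Fraction of full speed retained by the Byzantine-resilient scheme.
def speed_factor(n_total, n_malicious):
    return (n_total - n_malicious) / n_total

# 100,000 smartphones, up to a third of them malicious:
print(speed_factor(100_000, 33_000))    # 0.67 -> a noticeable slowdown

# Two billion accounts, 20 of them rogue:
print(speed_factor(2_000_000_000, 20))  # ~0.99999999 -> almost no slowdown
```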

So our Byzantine resilient scheme has a slowdown that is very reasonable in situations where F, the number of malicious agents is way less than N, the total number of agents, which is typical in modern…

Today, if you ask social media platforms, they have a lot of toolkits to prevent people from creating a billion fake accounts. You can’t, in 20 hours, create an army of several million accounts.

None of the mainstream social media platforms today are susceptible to this-

Lucas: Are susceptible to massive corruption.

El Mahdi: Yeah. To this massive account creation. So you know that the number of corrupted accounts is negligible compared to the total number of accounts.

So that’s the good news. The good news is that you know that F is negligible to N. But then the slowdown of our Byzantine resilient methods is also close to one.

But it has the advantage, compared to the state of the art today in training in distributed settings, of not taking the average gradient. And we argued at the very beginning that creating those 20 accounts doesn’t take a bot army or whatever; you don’t need to hack into the machines of the social network. You can have a dozen humans sitting somewhere in a house, manually creating 20 accounts, training the accounts over time with behavior that makes them look legitimate on some topics, and then, because your distributed machine learning scheme averages the gradients generated by people’s behavior, making it recommend anti-vaccine content or anti-Semitic conspiracy theories.

Lucas: So if I have 20 bad gradients and like, 10,000 good gradients for a video, why is it that with averaging 20 bad gradients are messing up the-

El Mahdi: The amplitude. It’s like the billionaire in the room of poor academics.

Lucas: Okay, because the amplitude of each of their accounts is greater than the average of the other accounts?

El Mahdi: Yes.

Lucas: The average of other accounts that are going to engage with this thing don’t have as large of an amplitude because they haven’t engaged with this topic as much?

El Mahdi: Yeah, because they’re not super credible on gun control, for example.

Lucas: Yeah, but aren’t there a ton of other accounts with large amplitudes that are going to be looking at the same video and correcting over the-

El Mahdi: Yeah, let’s define large amplitudes. If you come to the video and just like it, that’s a small update. What about you like it, post very engaging comments-

Lucas: So you write a comment that gets a lot of engagement, gets a lot of likes and replies.

El Mahdi: Yeah, that’s how you increase your amplitude. And because you are already doing a good job of becoming the reference on that video-sharing platform when it comes to discussing gun control, the amplitude of your comments is by definition high. And there’s the fact that your comment was posted very early on, and that you not only commented on the video but also produced a follow-up video.

Lucas: I see, so the gradient is really determined by a multitude of things that the video-sharing platform is measuring for, and the metrics are like, how quickly you commented, how many people commented and replied to you. Does it also include language that you used?

El Mahdi: Probably. It depends on the social media platform and on the video-sharing platform. What is clear is that there are many schemes those 20 accounts created by this dozen people in a house can use to maximize the amplitude of their generated gradients. But this is a much easier problem than the typical problems we have in technical AI safety. This is not value alignment or value loading or coherent extrapolated volition. This is a very tractable problem on which we now have good news, provable results. What’s interesting is the follow-up questions that we are trying to investigate here with my colleagues, the first of which is: you don’t necessarily have a majority of people on the internet promoting vaccines.

Lucas: People that are against things are often louder than people that are not.

El Mahdi: Yeah, makes sense, and sometimes maybe more numerous, because they generate content while the people who think vaccines are safe aren’t creating content. On some topics it might be safe to say that we have a majority of reasonable, decent people on the internet. But there are other topics, like the vaccine situation, where there’s now a surge of anti-vaccine resentment in western Europe and the US. Ironically, this is happening in developed countries, because people are so young they have never seen a non-vaccinated population. I come from Morocco; my aunt is handicapped by polio, so I grew up seeing what a non-vaccinated person looks like. Young people in more developed countries never had a living example of the non-vaccinated past.

Lucas: But they do have examples of people that end up with autism and it seems correlated with vaccines.

El Mahdi: Yeah, the anti-vaccine content may just end up being so clickbait-y, so provocative, that it gets popular. So this is a topic where the majority hypothesis, which is crucial to poisoning resilience, does not hold. An open follow-up we’re on now is how to combine ideas from reputation metrics, PageRank, et cetera, with poisoning resilience. So for example, you have the National Institutes of Health, the Johns Hopkins Medical Hospital, Harvard Medical School, and, I don’t know, Massachusetts General Hospital having official accounts on some video-sharing platform, and then you can spot what they say on some topic, because now we are very good at doing semantic analysis of content.

And you know that, okay, on the tag “vaccines,” there’s this set of experts, and then what you want to have emerge on your platform is some sort of epistocracy: power given to the knowledgeable, like we have in some fields, such as medical regulation. The FDA doesn’t do a majority vote. We don’t have a popular majority vote across the country to tell the FDA whether it should approve a new drug or not. The FDA does some sort of epistocracy where the knowledgeable experts on the topic vote. So how about mixing in ideas from social choice?

Lucas: And topics in which there are experts who can inform.

El Mahdi: Yeah. There’s a general follow-up of straight-out trying to connect Byzantine-resilient learning with social choice, but then there’s another set of follow-ups that motivates me even more. We were mentioning workers: people creating accounts on social media, accounts generating gradients. All of that implicitly assumed that the server, the abstraction gathering those gradients, is reliable. What about the aggregating platform itself being deployed on rogue machines? So imagine you are whatever platform doing learning. By the way, everything we have said from the beginning until now applies as long as you do gradient-based learning. So it can be recommender systems; it can be training some deep reinforcement learning agent on some super complicated task, to beat, I don’t know, the world champion in poker.

We do not care, as long as there’s some gradient generation from observing some state, some environmental state, and some reward or some label. It can be supervised or reinforcement learning; as long as it’s gradient-based, what we said applies. Imagine now you have this platform leveraging distributed gradient creators, but then the platform itself, for security reasons, is deployed on several machines for fault tolerance. But then those machines themselves can fail. You have to make the servers agree on the model, despite the fact that a fraction of the workers are not reliable and now a fraction of the servers themselves. This is the most important follow-up I’m into now, and I think there will be something on arXiv maybe in February or March on that.
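To make the worker-side failure mode concrete, here is a minimal, purely illustrative Python sketch (not the actual Krum or Bulyan algorithms discussed later) showing why a plain gradient average breaks under a single Byzantine worker, while a coordinate-wise median, one simple Byzantine-resilient rule, does not. The gradient values are made-up numbers for illustration:

```python
from statistics import median

def average(gradients):
    # Naive aggregation: one Byzantine worker can drag the mean anywhere.
    dim = len(gradients[0])
    return [sum(g[d] for g in gradients) / len(gradients) for d in range(dim)]

def coordinate_wise_median(gradients):
    # A simple Byzantine-resilient rule: per coordinate, the median ignores
    # extreme values as long as a majority of workers are honest.
    dim = len(gradients[0])
    return [median(g[d] for g in gradients) for d in range(dim)]

# Four honest workers report gradients near [1.0, -2.0];
# one Byzantine worker reports an arbitrarily large poisoned gradient.
honest = [[1.0, -2.0], [1.5, -1.5], [0.5, -2.5], [1.0, -2.0]]
byzantine = [[1e6, -1e6]]
grads = honest + byzantine

print(average(grads))                 # first coordinate blows up to 200000.8
print(coordinate_wise_median(grads))  # stays at [1.0, -2.0]
```

This is the "majority hypothesis" in miniature: the median only stays honest because more than half of the reported gradients are.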

And then a third follow-up is practical instances of that. I’ve been describing speculative thought experiments on how to poison systems; there are actually brilliant master’s students working on exactly that, on typical recommender systems datasets, where you can see that it’s very easy. It really takes only a handful of active agents to poison a system with a hundred thousand honest ones or more. Probably people working on big social media platforms would have ways to assess what I’ve said, and as researchers in academia we can only speculate on what can go wrong on those platforms. So what we did is take state-of-the-art recommender systems, datasets, and models that are publicly available, and you can show that despite having a large number of reliable recommendation proposers, a small, tiny fraction of proposers can make, I don’t know, a movie recommendation system recommend the most suicide-triggering film to the most depressed person watching through your platform. So I’m saying, that’s something you don’t want to have.

Lucas: Right. Just wrapping this all up, how do you see this in the context of AI alignment and the future of machine learning and artificial intelligence?

El Mahdi: So I’ve been discussing this here with people at the Beneficial AI conference and it seems that there are two schools of thought. I am still hesitating between the two because I switched between the two sides like three times within the past three months. So one of them thinks that an AGI is by definition resilient to poisoning.

Lucas: Aligned AGI might be by definition.

El Mahdi: Not even aligned. The second school of thought, aligned AGI is Byzantine resilient.

Lucas: Okay, I see.

El Mahdi: Obviously aligned AGI would be poisoning resilient, but let’s just talk about superintelligent AI, not necessarily aligned. So you have a superintelligence: would you include poisoning resilience in the superintelligence definition or not? And one would say that yeah, if you are better than humans at whatever task, it means you are also better than humans at spotting poisoned data.

Lucas: Right, I mean the poison data is just messing with your epistemics, and so if you’re super intelligent your epistemics would be less subject to interference.

El Mahdi: But then there is that second school of thought which I switched back again because I find that most people are in the first school of thought now. So I believe that super intelligence doesn’t necessarily include poisoning resilience because of what I call practically time constrained superintelligence. If you have a deadline because of computational complexity, you have to learn something, which can sometimes-

Lucas: Yeah, you want to get things done.

El Mahdi: Yeah, so you want to get it done in a finite amount of time. And because of that you will end up leveraging data from other agents to speed up your learning. So if a malicious agent just puts up bad observations of the environment or bad labeling of whatever is around you, then it can make you learn something else than what you would like as an aligned outcome. I’m strongly on the second side despite many disagreeing with me here. I don’t think superintelligence includes poisoning resilience, because superintelligence would still be built with time constraints.

Lucas: Right. You’re making a tradeoff between safety and computational efficiency.

El Mahdi: Right.

Lucas: It also would obviously seem to matter the kind of world that the ASI finds itself in. If it knows that it’s in a world with no, or very, very, very few malevolent agents that are wanting to poison it, then it can just throw all of this out of the window, but the problem is that we live on a planet with a bunch of other primates that are trying to mess up our machine learning. So I guess just as a kind of fun example in taking it to an extreme, imagine it’s the year 300,000 AD and you have a super intelligence which has sort of spread across space-time and it’s beginning to optimize its cosmic endowment, but it gives some sort of uncertainty over space-time to whether or not there are other super intelligences there who might want to poison its interstellar communication in order to start taking over some of its cosmic endowment. Do you want to just sort of explore?

El Mahdi: Yeah, that was a thought experiment I proposed earlier to Carl Shulman from FHI. Imagine some superintelligence reaching a planet where there is a smart form of life emerging from electric communication between plasma clouds. So completely non-carbon, non-silicon based.

Lucas: So if Jupiter made brains on it.

El Mahdi: Yeah, like Jupiter made brains on it just out of electric communication through gas clouds.

Lucas: Yeah, okay.

El Mahdi: And then this form of communication is smart enough to know that this is a superintelligence reaching the planet to learn about this form of life, and then it would just start trolling it.

Lucas: It’ll start trolling the super intelligence?

El Mahdi: Yeah. So they would come up with an agreement ahead of time, saying, “Yeah, this superintelligence is coming from Earth to discover how we do things here. Let’s just behave dumbly, or let’s just misbehave.” And then the superintelligence will start collecting data on this life form and then come back to Earth saying, “Yeah, they’re just a dumb, passive plasma form, nothing interesting.”

Lucas: I mean, you don’t think that within the super intelligence’s model, I mean, we’re talking about it right now so obviously a super intelligence will know this when it leaves that there will be agents that are going to try and trick it.

El Mahdi: That’s the rebuttal, yes. That’s the rebuttal again. Again, how much time does super intelligence have to do inference and draw conclusions? You will always have some time constraints.

Lucas: And you don’t always have enough computational power to model other agents efficiently to know whether or not they’re lying, or …

El Mahdi: You could always come up with thought experiment with some sort of other form of intelligence, like another super intelligence is trying to-

Lucas: There’s never, ever a perfect computer science, never.

El Mahdi: Yeah, you can say that.

Lucas: Security is never perfect. Information exchange is never perfect. But you can improve it.

El Mahdi: Yeah.

Lucas: Wouldn’t you assume that the complexity of the attacks would also scale? We just have a ton of people working on defense, but if we have an equal amount of people working on attack, wouldn’t we have an equally complex method of poisoning that our current methods would just be overcome by?

El Mahdi: That’s part of the empirical follow-up I mentioned, the one Isabella and I were working on, which is trying to set up some sort of min-max game of poisoner versus poisoning-resilient learner: an adversarial poisoning setting where there is a poisoner and there is a resilient learner, and the poisoner tries to maximize the damage. And what we have so far is very depressing. It turns out that it’s very easy to be a poisoner. Computationally it’s way easier to be the poisoner than to be-
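As a toy illustration of why the attacker’s side of this game is computationally cheap when the aggregator is a plain average (the actual min-max experiments mentioned here are more involved), a single attacker only has to solve one linear equation to steer the aggregate wherever it wants. The gradients and target below are made-up numbers:

```python
def poison_mean(honest_gradients, target):
    # Against plain averaging, one attacker can steer the aggregate to any
    # target by sending g_mal = n * target - sum(honest gradients), so that
    # mean(honest + [g_mal]) equals the target exactly.
    n = len(honest_gradients) + 1
    dim = len(target)
    return [n * target[d] - sum(g[d] for g in honest_gradients)
            for d in range(dim)]

honest = [[1.0, -2.0], [1.5, -1.5], [0.5, -2.5]]
target = [-5.0, 5.0]  # the direction the attacker wants training to move in
g_mal = poison_mean(honest, target)
all_grads = honest + [g_mal]
mean = [sum(g[d] for g in all_grads) / len(all_grads) for d in range(2)]
print(g_mal)  # [-23.0, 26.0]
print(mean)   # [-5.0, 5.0], exactly the attacker's target
```

The defender, by contrast, has to reason about all possible attack vectors at once, which is one intuition for the asymmetry El Mahdi describes.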

Lucas: Yeah, I mean, in general in the world it’s easier to destroy things than to create order.

El Mahdi: As I said in the beginning, this is a sub-topic of technical AI safety where I believe it’s easier to have tractable formalizable problems for which you can probably have a safe solution.

Lucas: Solution.

El Mahdi: But there are very concrete, very short-term aspects of that. In March we are going to announce a major update to TensorFlow, which is the standard framework today for doing distributed machine learning, open-sourced by Google. So, hopefully if everything goes right, at SysML, the Systems for Machine Learning conference, with more empirically focused colleagues, and based on the algorithms I mentioned earlier which were presented at NeurIPS and ICML over the past two years, we will announce a major update where we basically replaced every averaging step in TensorFlow with those three algorithms I mentioned, Krum and Bulyan and soon Kardam, which constitute our portfolio of Byzantine resilience algorithms.
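The selection rule at the heart of Krum (Blanchard et al., NeurIPS 2017) can be sketched in a few lines. This is a simplified illustrative version with made-up gradient values, not the production TensorFlow integration being described:

```python
def krum(gradients, f):
    # Krum's selection rule: score each worker's gradient by the sum of
    # squared distances to its n - f - 2 nearest peers, then output the
    # gradient with the lowest score. Outliers far from the honest cluster
    # get large scores and are never selected (assuming enough honest workers).
    n = len(gradients)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, g in enumerate(gradients):
        dists = sorted(sq_dist(g, h) for j, h in enumerate(gradients) if j != i)
        scores.append(sum(dists[: n - f - 2]))
    return gradients[min(range(n), key=scores.__getitem__)]

# Four honest workers cluster near [1.0, -2.0]; one Byzantine worker lies.
grads = [[1.0, -2.0], [1.5, -1.5], [0.5, -2.5], [1.0, -2.0], [1e6, -1e6]]
print(krum(grads, f=1))  # selects a gradient from the honest cluster
```

Bulyan builds on top of this by combining Krum-style selection with a trimmed-mean step for stronger guarantees in high dimensions.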

Another consequence that comes for free with that is that distributed machine learning frameworks like TensorFlow use TCP/IP as a communication protocol. TCP/IP has a problem: it’s reliable but it’s very slow. You have to retransmit some messages, et cetera, to guarantee reliability, and we would like to have a faster communication protocol, like UDP. We don’t need to go through those details, but UDP has some packet drop, so until now there was no version of TensorFlow or any distributed machine learning framework, to my knowledge, using UDP. They all used TCP/IP because they needed reliable communication. But now, because we are Byzantine resilient, we can afford having fast but not completely reliable communication protocols like UDP. So one of the things that comes for free with Byzantine resilience is that you can move from heavy-

Lucas: A little bit more computation.

El Mahdi: -yeah, heavy communication protocols like TCPIP to lighter, faster, more live communication protocols like UDP.

Lucas: Keeping in mind you’re trading off.

El Mahdi: Exactly. Now we have this portfolio of algorithms which can serve many other applications besides just making distributed machine learning faster, like making poisoning-resilient, I don’t know, recommender systems for social media, and hopefully making AGI learning poisoning-resilient.

Lucas: Wonderful. So if people want to check out some of your work or follow you on social media, what is the best place to keep up with you?

El Mahdi: Twitter. My handle is El Badhio, so maybe you would have it written down on the description.

Lucas: Yeah, cool.

El Mahdi: Yeah, Twitter is the best way to get in touch.

Lucas: All right. Well, wonderful. Thank you so much for speaking with me today and I’m excited to see what comes out of all this next.

El Mahdi: Thank you. Thank you for hosting this.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Cooperative Inverse Reinforcement Learning with Dylan Hadfield-Menell (Beneficial AGI 2019)

What motivates cooperative inverse reinforcement learning? What can we gain from recontextualizing our safety efforts from the CIRL point of view? What possible role can pre-AGI systems play in amplifying normative processes?

Cooperative Inverse Reinforcement Learning with Dylan Hadfield-Menell is the eighth podcast in the AI Alignment Podcast series, hosted by Lucas Perry, and was recorded at the Beneficial AGI 2019 conference in Puerto Rico. For those of you that are new, this series covers and explores the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, Lucas will speak with technical and non-technical researchers across areas such as machine learning, governance, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Dylan Hadfield-Menell. Dylan is a 5th year PhD student at UC Berkeley advised by Anca Dragan, Pieter Abbeel and Stuart Russell, where he focuses on technical AI alignment research.

Topics discussed in this episode include:

  • How CIRL helps to clarify AI alignment and adjacent concepts
  • The philosophy of science behind safety theorizing
  • CIRL in the context of varying alignment methodologies and its role
  • If short-term AI can be used to amplify normative processes
You can follow Dylan here and find the Cooperative Inverse Reinforcement Learning paper here. You can listen to the podcast above or read the transcript below.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today we will be speaking for a second time with Dylan Hadfield-Menell on cooperative inverse reinforcement learning, the philosophy of science behind safety theorizing, CIRL in the context of varying alignment methodologies, and whether short-term AI can be used to amplify normative processes. This time it just so happened to be an in-person discussion at Beneficial AGI 2019, FLI’s sequel to the Beneficial AI 2017 conference at Asilomar.

I have a bunch of more conversations that resulted from this conference to post soon and you can find more details about the conference in the coming weeks. As always, if you enjoy this podcast, please subscribe or follow us on your preferred listening platform. As many of you will already know, Dylan is a fifth year Ph.D. student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell, where he focuses on technical AI Alignment research. And so without further ado, I’ll give you Dylan.

Thanks so much for coming on the podcast again, Dylan, that’s been like a year or something. Good to see you again.

Dylan: Thanks. It’s a pleasure to be here.

Lucas: So just to start off, we can go ahead and begin speaking a little bit about your work on cooperative inverse reinforcement learning and whatever sorts of interesting updates or explanation you have there.

Dylan: Thanks. For me, working on cooperative IRL has been a pretty long process. It really dates back to the start of the second year of my PhD, when my advisor came back from a yearlong sabbatical and suggested that we entirely change the research direction we were thinking about.

That was to think about AI Alignment and AI Safety and associated concerns that, that might bring. And our first attempt at a really doing research in that area was to try to formalize what’s the problem that we’re looking at, what are the space of parameters and the space of solutions that we should be thinking about in studying that problem?

And so it led us to write Cooperative Inverse Reinforcement Learning. Since then I’ve had a large amount of conversations where I’ve had incredible difficulty trying to convey what it is that we’re actually trying to do here and what exactly that paper and idea represents with respect to AI Safety.

One of the big updates for me and one of the big changes since we’ve spoken last, is getting a little bit of a handle on really what’s the value of that as the system. So for me, I’ve come around to the point of view that really what we were trying to do with cooperative IRL was to propose an alternative definition of what it means for an AI system to be effective or rational in some sense.

And so there’s a story you can tell about artificial intelligence, which is that we started off and we observed that people were smart and they were intelligent in some way, and then we observed that we could get computers to do interesting things. And this posed the question of can we get computers to be intelligent? We had no idea what that meant, no idea how to actually nail it down and we discovered that in actually trying to program solutions that looked intelligent, we had a lot of challenges.

So one of the big things that we did as a field was to look over next door into the economics department, in some sense, to look at the sort of models that they have of decision-theoretic rationality, really looking at homo economicus as an ideal to shoot for. From that perspective, actually a lot of the field of AI has shifted to be about effective implementations of homo economicus.

In my terminology, this is about systems that are effectively individually rational. These are systems that are good at optimizing for their goals, and a lot of the concerns that we have about AI Safety is that systems optimizing for their own goals could actually lead to very bad outcomes for the rest of us. And so what cooperative IRL attempts to do is to understand what it would mean for a human robot system to behave as a rational agent.

In a sense, we’re moving away from having a box drawn around the AI system or the artificial component of the system to having that agent box drawn around the person and the system together, and we’re trying to model the sort of important parts of the value alignment problem in our formulation here. And in this case, we went with the simplest possible set of assumptions, which are basically that we have a static set of preferences, the human’s preferences, that they’re trying to optimize. This is effectively the human’s welfare.

The world is fully observable and the robot and the person are both working to maximize the human’s welfare, but there is this information bottleneck, this information asymmetry, that’s present, which we think is a fundamental component of the value alignment problem. And so really what cooperative IRL is, is a definition of how a human and a robot system together can be rational in the context of fixed preferences in a fully observable world state.
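The information asymmetry Dylan describes can be sketched in a toy Bayesian form: the human knows a reward parameter theta, the robot does not, and the robot narrows its belief over theta by watching the human act. The two-candidate setup, the Boltzmann-rationality model of the human, and all numbers here are illustrative assumptions, not the formal CIRL game from the paper:

```python
import math

# Two candidate reward parameters (theta) over two world states.
THETAS = {"likes_A": {"A": 1.0, "B": 0.0},
          "likes_B": {"A": 0.0, "B": 1.0}}

def human_action_prob(action, theta, beta=5.0):
    # Boltzmann-rational human: acts noisily optimally for its true theta.
    rewards = THETAS[theta]
    z = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[action]) / z

def update_belief(belief, observed_action):
    # Bayes rule: the robot infers theta from the human's observed choice.
    posterior = {t: p * human_action_prob(observed_action, t)
                 for t, p in belief.items()}
    total = sum(posterior.values())
    return {t: p / total for t, p in posterior.items()}

belief = {"likes_A": 0.5, "likes_B": 0.5}  # the robot starts maximally uncertain
for _ in range(3):                         # the human repeatedly chooses "A"
    belief = update_belief(belief, "A")
print(belief)  # nearly all probability mass ends up on "likes_A"
```

The point of the sketch is only the structure: the human-robot pair jointly optimizes the human’s welfare, and the robot’s half of the job is inference across the information bottleneck.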

Lucas: There’s a point here about metatheory, about coming up with models and theory. It seems like the fundamental issue is that, given how insanely complex AI alignment is, trying to converge on whatever the most efficacious model is, is very, very difficult. People keep flipping back and forth about theoretically how we’re actually going to do this, even in very grid-world or toy environments. So it seems very, very hard to isolate the best variables, or what variables can be modeled and tracked in ways that are going to help us most.

Dylan: So, I definitely think that this is not an accurate model of the world and I think that there are assumptions here which, if not appropriately reexamined, would lead to a mismatch between the real world and things that work in theory.

Lucas: Like human beings having static preferences.

Dylan: So for example, yes, I don’t claim to know what human preferences really are and this theory is not an attempt to say that they are static. It is an attempt to identify a related problem to the one that we’re really faced with, that we can actually make technical and theoretical progress on. That will hopefully lead to insights that may transfer out towards other situations.

I certainly recognize that what I’m calling a theta in that paper is not really the same thing that everyone talks about when we talk about preferences. In talking with philosophers, I’ve discovered it’s, I think, a little bit closer to things like welfare in a moral philosophy context, which maybe you could think about as being a more static object that you would want to optimize.

In some sense theta really is an encoding of what you would like the system to do, in general is what we’re assuming there.

Lucas: Because it’s static.

Dylan: Yes, and to the extent that you want to have that be changing over time, I think that there’s an interesting theoretical question as to how that actually is different, and what types of changes that leads to and whether or not you can always reduce something with non-static preferences to something with static preferences from a mathematical point of view.

Lucas: I can see how moving from static to changing over time just makes it so much more insanely complex.

Dylan: Yeah, and it’s also really complex at the level of it being philosophically unclear what the right thing to do is.

Lucas: Yeah, that’s what I mean. Yeah, you don’t even know what it even means to be aligning as the values are changing, like whether or not the agent even thinks that they just moved in the right direction or not.

Dylan: Right, and I also even want to point out how uncertain all of these things are. We as people are hierarchical organizations of different behaviors and observation systems and perception systems. And we believe we have preferences, we have a name for that, but there is a sense in which that is ultimately a fiction of some kind.

It’s a useful tool that we have to talk about ourselves to talk about others that facilitates interaction and cooperation. And so given that I do not know the answer to these philosophical questions, what can I try to do as a technical researcher to push the problem forward and to make actual progress?

Lucas: Right, and so, again, it’s a metatheoretical point about what people are trying to do right now in the context of AI alignment. It seems that the best thing for people to be doing is to come up with these theoretical models and frameworks, which have a minimum set of assumptions, which may be almost like the real world but are not, and then to make theoretical progress there that will hopefully in the future transfer, as you said, to other problems, as ML and deep learning get better and the other tools are getting better, so that we’ll actually have the tools to make it work with more complicated assumptions.

Dylan: Yes, I think that’s right. The way that I view this is: we had AI, this broad, vague thing. Through the course of AI research, we kind of got to Markov decision processes as a sort of coordinating theory around what it means for us to design good agents, and cooperative IRL is an attempt to take a step from Markov decision processes more closely towards the set of problems that we want to study.

Lucas: Right, and so I think this is like a really interesting point that I actually haven’t talked to anyone else about and if you have a few more words about it, I think it would be really interesting. So just in terms of being a computer scientist and being someone who is working on the emerging theory of a field. I think it’s often unclear what the actual theorizing process is behind how people get to CIRL. How did someone get to debate? How did someone get to iterated amplification?

It seems like you first identify problems which you see to be crucial and then there are some sorts of epistemic and pragmatic heuristics that you apply to try and begin to sculpt a model that might lead to useful insight. Would you have anything to correct or unpack here?

Dylan: I mean, I think that is a pretty good description of a pretty fuzzy process.

Lucas: But like being a scientist or whatever?

Dylan: Yeah. I don’t feel comfortable speaking for scientists in general here, but I could maybe say a little bit more about my particular process, which is that I try to think about how I’m looking at the problem differently from other people based on different motivations and different goals that I have. And I try to lean into how that can push us in different directions. There’s a lot of other really, really smart people who have tried to do lots of things.

You have to maintain an amount of intellectual humility about your ability to out think the historical components of the field. And for me, I think that in particular for AI Safety, it’s thinking about reframing what is the goal that we’re shooting towards as a field.

Lucas: Which we don’t know.

Dylan: We don’t know what those goals are, absolutely. And I think that there is a sense in which the field has not re-examined those goals incredibly deeply. For a while, I think that it’s been so hard to do anything that looks intelligent in the real world that we’ve been trying to focus on that individually rational Markov decision process model. And I think that a lot of the concerns about AI safety are really a call for AI as a field to step back and think about what we’re trying to accomplish in the world and how we can actually try to achieve beneficial outcomes for society.

Lucas: Yeah, and I guess like a sociological phenomenon within the scientists or people who are committed to empirical things. In terms of reanalyzing what the goal of AI Alignment is, the sort of area of moral philosophy and ethics and other things, which for empirical leaning rational people can be distasteful because you can’t just take a telescope to the universe and see like a list of what you ought to do.

And so it seems like people like to defer on these questions. I don’t know. Do you have anything else to add here?

Dylan: Yeah. I think computer scientists in particular are selected to be people who like having boxed off problems that they know how to solve and feel comfortable with, and that leaning into getting more people with a humanities bent into computer science and broadly AI in particular, AI Safety especially is really important and I think that’s a broad call that we’re seeing come from society generally.

Lucas: Yeah, and I think it also might be wrong though to model the humanities questions as those which are not in boxes and cannot be solved. That’s sort of like a logical positivist thing to say, that on one end we have the hard things and you just have to look at the world enough and you’ll figure it out and then there’s the soft squishy things which deal with abstractions that I don’t have real answers, but people with fluffy degrees need to come up with things that seem right but aren’t really right.

Dylan: I think it would be wrong to take what I just said in that direction, and if that’s what it sounds like I definitely want to correct that. I don’t think there is a sense in which computer science is a place where there are easy right answers, and that the people in humanities are sort of waving their hands and sort of fluffing around.

This is sort of leaning into making this more of an AI value alignment kind of framing. But when I think about bringing AI systems into the world, I think about which things you can afford to get wrong in your specification and which things you cannot afford to get wrong in your specification.

In this sense, specifying physics incorrectly is much, much better than specifying the objective incorrectly, at least by default. And the reason for that is that what happens to the world when you push it is a question that you can answer from your observations. And so if you start off in the wrong place, as long as you’re learning and adapting, I can reasonably expect my systems to correct for that. Or at least the goal of successful AI research is that your systems will effectively adapt to that.

However, the task that your system is supposed to do is arbitrary in a very fundamental sense. And from that standpoint, it is on you as the system designer to make sure that objective is specified correctly. When I think about what we want to do as a field, I end up taking a similar lens, in that there’s a sense in which we as researchers and people and society and philosophers and all of it are trying to figure out what we’re trying to do and what we want to task the technology with, and the directions that we want to push it in. And then there are questions of what the technology will be like and how it should function that will be informed by that and shaped by that.

And I think that there is a sense in which that is arbitrary. Now, what is right? That I don’t really know the answer to and I’m interested in having those conversations, but they make me feel uneasy. I don’t trust myself on those questions, and that could mean that I should learn how to feel more uneasy and think about it more and in doing this research I have been kind of forced into some of those conversations.

But I also do think that for me at least I see a difference between what can we do and what should we do. And thinking about what should we do as a really, really hard question that’s different than what can we do.

Lucas: Right. And so I wanna move back towards CIRL, but just to sort of wrap up here on our philosophy of science musings, a thought I had while you were going through that was: at least for now, what I think is fundamentally shared between fields that deal with things that matter is whether their concepts have meaningful referents in the world. Like, do your concepts refer to meaningful things?

Putting ontology aside, whatever love means or whatever value alignment mean. These are meaningful referents for people and I guess for now if our concepts are actually referring to meaningful things in the world, then it seems important.

Dylan: Yes, I think so. Although, I’m not totally sure I understood that.

Lucas: Sure, that’s fine. People will say that humanities or philosophy doesn’t have these boxes with like well-defined problems and solutions because they either don’t deal with real things in the world or the concepts are so fuzzy that the problems are sort of invented and illusory. Like how many angels can stand on the head of a pin? Like the concepts don’t work, aren’t real and don’t have real referents, but whatever.

And I’m saying the place where philosophy and ethics and computer science and AI alignment should at least come together, for now, is where the concepts have meaningful referents in the world.

Dylan: Yes, that is something that I absolutely buy. Yes, I think there’s a very real sense in which those questions are harder, but that doesn’t mean they’re less real or less important.

Lucas: Yes, that’s because it’s the only point I wanted to push against logical positivism.

Dylan: No, I don’t mean to say that the answers are wrong, it’s just that they are harder to prove in a real sense.

Lucas: Yeah. I mean, I don’t even know if they have answers or if they do or if they’re just all wrong, but I’m just open to it and like more excited about everyone coming together thing.

Dylan: Yes, I absolutely agree with that.

Lucas: Cool. So now let’s turn it back into the CIRL. So you began by talking about how you and your advisers were having this conceptual shift and framing, then we got into the sort of philosophy of science behind how different models and theories of alignment go. So from here, whatever else you have to say about CIRL.

Dylan: So I think for me the upshot of concerns about advanced AI systems and the negative consequences therein really is a call to recognize that the goal of our field is AI alignment. Almost any AI research that’s not AI alignment is solving a subproblem, and viewing it only as solving that subproblem is a mistake.

Ultimately, we are in the business of building AI systems that integrate well with humans and human society. And if we don’t take that as a fundamental tenet of the field, I think that we are potentially in trouble, and I think that that is a perspective that I wish were more pervasive throughout artificial intelligence generally.

Lucas: Right, so I think I do want to move into this view where safety is a normal thing, and like Stuart Russell will say, “People who build bridges all care about safety and there aren’t a subsection of bridge builders who work in bridge safety, everyone is part of the bridge safety.” And I definitely want to get into that, but I also sort of want to get a little bit more into CIRL and why you think it’s so motivating and why this theoretical framing and shift is important or illuminating, and what the specific content of it is.

Dylan: The key thing is that what it does is point out that it doesn’t make sense to talk about how well your system is doing without talking about the way in which it was instructed and the type of information that it got. No AI system exists on its own, every AI system has a designer, and it doesn’t make sense to talk about the functioning of that system without also talking about how that designer built it, evaluated it and how well it is actually serving those ends.

And I don’t think this is some brand new idea that no one’s ever known about; I think this is something that is incredibly obvious to practitioners in the field once you point it out. The process whereby a robot learns to navigate a maze or vacuum a room is not: there is an objective, it optimizes it, and then it does it.

What it is, is that there is a system designer who writes down an objective, selects an optimization algorithm, observes the final behavior of that optimization algorithm, goes back, modifies the objective, modifies the algorithm, changes hyperparameters, and then runs it again. And there’s this iterative process whereby your system eventually ends up getting to the behavior that you wanted it to have. And AI researchers have tended to draw a box around the final component of that process and call that thing AI.
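The iterative process Dylan describes can be sketched in a few lines of code. This is a toy illustration with entirely hypothetical names, not anything from an actual paper: the "AI" box is only the inner `optimize` call, while the real process includes the designer's outer loop of observing behavior and revising the objective.

```python
def optimize(objective, hyperparams):
    # Stand-in for any optimizer: pick the best candidate action
    # from a fixed set according to the current objective.
    candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
    return max(candidates, key=objective)

def design_loop(evaluate, max_iterations=10):
    """The designer's outer loop: run the optimizer, inspect the final
    behavior, and revise the proxy objective until it is acceptable."""
    target = 1.0  # what the designer actually wants, known only implicitly
    objective = lambda x: x  # first guess at a proxy: "bigger is better"
    behavior = None
    for _ in range(max_iterations):
        behavior = optimize(objective, hyperparams={})
        if evaluate(behavior):  # designer judges the observed behavior
            return behavior
        # Revise the proxy objective in response to what went wrong.
        objective = lambda x: -abs(x - target)
    return behavior

# The designer's (informal) acceptance test plays the role of the true goal.
result = design_loop(evaluate=lambda b: abs(b - 1.0) < 0.1)
```

The first proxy objective drives the optimizer to an extreme; only after the designer observes that and revises the objective does the system land on the intended behavior.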

Lucas: Yeah, it’s because at least subjectively and I guess this is sort of illuminated by meditation and Buddhism, is that if you’re a computer scientist and you’re just completely identified with the process of doing computer science, you’re just identified with the problem. And if you just have a little bit of mindfulness and you’re like, “Okay, I’m in the context of a process where I’m an agent and trying to align another agent,” and if you’re not just completely identified with the process and you see the unfolding of the process, then you can do sort of like more of a meta-analysis which takes a broader view of the problem and can then, I guess hopefully work on improving it.

Dylan: Yeah, I think that’s exactly right, or at least as I understand it, that’s exactly right. And to be a little bit specific about this, we have these engineering principles and skills that are not in the papers, but they are things that are passed down from grad student to grad student within a lab. There’s institutional knowledge that exists within a company for how you actually verify and validate your systems, and cooperative IRL is an attempt to take all of that structure that AI systems have existed within and try to bring it into the theoretical frameworks that we actually work with.

Lucas: So can you paint a little picture of what the CIRL model looks like?

Dylan: It exists in a sequential decision-making context, and we assume we have states of the world and a transition function that basically tells us how we get to another state given the previous state and actions from the human and the robot. But the important conceptual shift that it makes is that the space of solutions we’re dealing with are combinations of a teaching strategy and a learning strategy.

There is a commitment on the side of the human designers or users of the system to provide data that is in some way connected to the objectives that they want to be fulfilled. That data can take many forms: it could be in the form of writing down a reward function that ranks a set of alternatives, it could be in the form of providing demonstrations that you expect your system to imitate, or it could be in the form of providing binary comparisons between two clearly identified alternatives.

And the other side of the problem is: what is the learning strategy that we use? This is the question of how the robot is actually committing to respond to the observations that we’re giving it about what we want it to do. In the case of a pre-specified proxy reward given a literal interpretation by a reinforcement learning system, let’s say, what the system is committing to do is optimize under that set of trajectory rankings and preferences, based on the simulation environment that it’s in, or the actual physical environment that it’s exploring.
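To make the "solution is a pair of strategies" framing concrete, here is a deliberately tiny sketch, with hypothetical names and a toy preference set, of what it means to evaluate a teaching strategy and a learning strategy jointly against the human's actual ends:

```python
# The human's true preferences: a ranking over three outcomes.
TRUE_REWARD = {"a": 2.0, "b": 1.0, "c": 0.0}

def teaching_strategy(true_reward):
    # The human commits to a way of providing data: here, a single
    # demonstration of the best outcome.
    return max(true_reward, key=true_reward.get)

def learning_strategy(demonstration, outcomes):
    # The robot commits to a way of responding to that data: here,
    # literal imitation of the demonstrated outcome.
    return demonstration if demonstration in outcomes else outcomes[0]

def evaluate_pair(teach, learn, true_reward):
    # The quantity of interest is how well the *pair* serves the human,
    # measured against the true reward neither side wrote down explicitly.
    demo = teach(true_reward)
    chosen = learn(demo, list(true_reward))
    return true_reward[chosen]

score = evaluate_pair(teaching_strategy, learning_strategy, TRUE_REWARD)
```

The point of the framing is that `score` depends on both commitments at once: swapping in a different teaching strategy (say, a noisy reward function) or a different learning strategy (say, risk-averse inference) changes the evaluation, which is exactly the comparison CIRL makes possible.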

When we shift to something like inverse reward design, which is a paper that we released last year, what that says is: we’d like the system to look at this ranking of alternatives and actually try to blow it up into a larger uncertainty set over the set of possible rankings consistent with it, and then, when you go into deployment, you may be able to leverage that uncertainty to avoid catastrophic failures or just generally unexpected behavior.

Lucas: So this other point, I think, that you and I discussed briefly, maybe it was actually with Rohin, but it seems like often in terms of AI Alignment, it’s almost like we’re reasoning from nowhere about abstract agents, and that sort of makes the problem extremely difficult. Often, if you just look at human examples, it just becomes super mundane and easy. This conceptual shift can almost, I think, be framed super simply as the difference between a teacher just trying to teach someone, and a teacher realizing that they are a person who is teaching another student, so the teacher can think better about how to teach, and also about the process between the teacher and the student and how to improve it at a higher level of abstraction.

Dylan: I think that’s the direction that we’re moving in. What I would say is, as AI practitioners, we are teaching our systems how to behave, and we have developed our strategies for doing that.

And now that we’ve developed a bunch of strategies that sort of seem to work, I think it’s time for us to develop a more rigorous theory of how those teaching strategies actually interact with the final performance of the system.

Lucas: Cool. Is there anything else here that you would like to say about CIRL, or any really important points you would like to get across to people who are interested in technical AI Alignment, or to CS students?

Dylan: I think the main point that I would make is that research and thinking about powerful AI systems is valuable, even if you don’t think that that’s what’s going to happen. You don’t need to be motivated by those sets of problems in order to recognize that this is actually just basic research into the science of artificial intelligence.

It’s got an incredible amount of really interesting problems and the perspectives that you adopt from this framing can be incredibly useful as a comparative advantage over other researchers in the field. I think that’d be my final word here.

Lucas: If I might just ask you one last question. We’re at Beneficial AGI 2019 right now, and we’ve heard a lot of overviews of different research agendas and methodologies and models and framings for how to best go forth with AI Alignment, which include a vast range of things like work on corrigibility and interpretability and robustness, and the different research agendas and methodologies of places like MIRI, which has come out with this new framing on embedded agency, and also different views at OpenAI and DeepMind.

And Eric Drexler has also newly proposed this services-based conception of AI, where we remove the understanding of powerful AI systems, or regular AI systems, as agents, which sort of gets us away from a lot of the x-risky problems and global catastrophic risk problems and value alignment problems.

From your point of view, as someone who’s worked a lot on CIRL and is a technical alignment researcher, how do you view CIRL in this context, and how do you view all of these different emerging approaches right now in AI Alignment?

Dylan: For me, and you know, I should give a disclaimer: this is my research area, and so I’m obviously pretty biased toward thinking it’s incredibly important and good. But for me at least, cooperative IRL is a uniting framework under which I can understand all of those different approaches. I believe that a services-type solution to AI Safety or AI Alignment is actually arguing for a particular type of learning strategy and implementation strategy of CIRL, and I think it can be framed within that system.

Similarly, I’ve had some conversations with people about debate. I believe debate fits really nicely into the framework: we commit to a human strategy of judging debates between systems, and we commit to a robot strategy of putting two systems into a debate and working in that direction. So for me, it’s a way in which I can identify the commonalities between these different approaches and compare and contrast them, and then, under a set of assumptions about what the world is like, what the space of possible preferences is like, and what the space of strategies people can implement is like, possibly get out some information about which one is better or worse, or which type of strategy is vulnerable to different types of mistakes or errors.

Lucas: Right, so I agree with all of that, the only place that I might want to push back is, it seems that maybe the MIRI embedded agency stuff subsumes everything else. What do you think about that?

Because the framing is that whenever AI researchers draw these models, there are these conceptions of information channels, right, which are selected by the researchers and which we control, but the universe is really just a big non-dual happening of stuff, and agents are embedded in the environment, almost just another process within the environment, and it’s much more fuzzy where the dense causal streams are and where the little causal streams are, and stuff like that. It just seems like the MIRI stuff maybe subsumes CIRL and everything else a little bit more, but I don’t know.

Dylan: I certainly agree that that’s the one that’s hardest to fit into the framework, but I would also say that, in my mind, I don’t know what an agent is. I don’t know how to operationalize an agent, I don’t actually know what that means in the physical world, and I don’t know what it means to be an agent. What I do know is that there is a strategy of some sort that we can think of as governing the ways that the system performs and behaves.

I want to be very careful about baking in assumptions beforehand. And it feels to me like embedded agency is something where I don’t fully understand the set of assumptions being made in that framework, and I don’t necessarily understand how they relate to the systems that we’re actually going to build.

Lucas: When people say that an agent is a fuzzy concept, I think that might be surprising to a lot of people who have thought somewhat about the problem, because it’s like: obviously I know what an agent is, it’s different from all the other dead stuff in the world, it has goals and it’s physically confined and unitary.

If you just imagine, like, abiogenesis, how life began: is the first relatively self-replicating chain of hydrocarbons an agent? And you can go from really small systems to really big systems, which can exhibit certain properties or principles that feel a little bit agenty, but may not be useful to call agents. And so I guess if we’re going to come up with a definition of it, it should just be something useful for us, or something.

Dylan: I think “I’m not sure” is the most accurate answer I can give here. I wish I had a better answer for what this was. Maybe I can share one of the thought experiments that convinced me I was pretty confused about what an agent is.

Lucas: Yeah, sure.

Dylan: It came from thinking about what value alignment is. So if we think about value alignment between two agents, and those are both perfectly rational actors, making decisions in the world perfectly in accordance with their values, with full information, I can sort of write down a definition of value alignment, which is basically: you’re using the same ranking over alternatives that I am.
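That fully-rational definition is simple enough to write down directly. A hypothetical sketch, with made-up names and a toy set of alternatives: two reward functions count as value aligned when they induce the same ranking over alternatives, even if their scales differ.

```python
def ranking(reward, alternatives):
    # Rank alternatives from best to worst under a reward function.
    return sorted(alternatives, key=reward, reverse=True)

def value_aligned(reward_a, reward_b, alternatives):
    # The fully-rational definition: same ranking over alternatives.
    return ranking(reward_a, alternatives) == ranking(reward_b, alternatives)

alts = ["x", "y", "z"]
human = lambda a: {"x": 3, "y": 2, "z": 1}[a]
robot = lambda a: {"x": 30, "y": 20, "z": 10}[a]  # different scale, same order

aligned = value_aligned(human, robot, alts)
```

The definition is crisp precisely because both actors are assumed perfectly rational; as the next exchange shows, it stops being obvious what the analogous condition is once the agents are bounded.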

But a question that we really wanted to try to answer, one that feels really important, is: what does it mean to be value aligned in a partial context? If you’re a bounded agent, if you’re not a perfectly rational agent, what does it actually mean for you to be value aligned? That was a question that we also didn’t really know how to answer.

Lucas: My initial reaction is the kind of agent that tries its best with its limited rationality to be like the former thing that you talked about.

Dylan: Right, so that leads to a question that we thought about. Suppose I have a chess-playing agent, and it is my chess-playing agent, so I want it to win the game for me. Suppose it’s using the correct goal test, so it is actually optimizing for my values. Let’s say it’s only searching out to depth three, so it’s pretty dumb as far as chess players go.

Do I think that that is an agent that is value aligned with me? Maybe. I mean, certainly I can tell the story in a way that sounds like it is. It’s using the correct objective function, it’s doing some sort of optimization thing. If it ever identifies a checkmate in three moves, it will always find that and get that back to me. And so that’s a sense in which it feels like it is a value-aligned agent.

On the other hand, what if it’s using a heuristic function which is chosen poorly, or in something closer to an adversarial manner? So now it’s a depth-three agent that is still using the correct goal test, but it’s searching in a way that is adversarially selected. Is that a partially value-aligned agent?

Lucas: Sorry, I don’t understand what it means to have the same objective function, but be searching to depth three in an adversarial way.

Dylan: In particular, when you’re building a chess search engine, there is your goal test that you run on the leaves of your search to see if you’ve actually achieved winning the game. But because you’re only doing a partial search, you often have to rely on using a heuristic of some sort to rank different positions.

Lucas: To cut off parts of the tree.

Dylan: Somewhat to cut off parts of the tree, but also just like you’ve got different positions, neither of which are winning and you need to choose between those.

Lucas: All right. So there’s a heuristic, like it’s usually good to take the center, or the queen is something that you should probably always keep.

Dylan: Or these things like values of pieces that you can add up, which was, I think, one of the early approaches …

Lucas: Yeah, and just as an important note, in terms of the state of machine learning now, the heuristics are usually chosen by the programmer. Are systems able to converge on heuristics themselves?

Dylan: Well, I’d say one of the big things in AlphaZero or AlphaGo as an approach is that they applied learning to the heuristic itself: they figured out a way to use the search process to gradually improve the heuristic, and to have the heuristic improve the search process in turn.

And so there’s a feedback loop set up in those types of expert iteration systems. My point here is that when I described that search algorithm to you, I didn’t mention what heuristic it was using at all. And so you had no reason to tell me whether or not that system was partially value aligned, because actually the heuristic is 100 percent of what’s going to determine the final performance of the system and whether or not it’s actually helping you.
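Dylan’s point can be made concrete with a toy sketch (a hypothetical two-move game, not real chess): two depth-limited searchers share the exact same goal test and differ only in the leaf heuristic, and the heuristic alone determines which move comes out.

```python
# Toy game tree: the root offers two moves, each leading to a leaf position.
TREE = {"root": {"left": 1, "right": 2}}

def is_win(position):
    # The shared goal test. Neither leaf is a win here, so everything
    # below the goal test is decided by the heuristic.
    return False

def best_move(heuristic):
    # Depth-limited search: apply the goal test, then fall back to the
    # heuristic to rank non-winning leaves.
    moves = TREE["root"]
    def score(pos):
        return float("inf") if is_win(pos) else heuristic(pos)
    return max(moves, key=lambda m: score(moves[m]))

helpful = lambda pos: {1: 0.9, 2: 0.1}[pos]      # a reasonable evaluation
adversarial = lambda pos: {1: 0.1, 2: 0.9}[pos]  # an adversarially chosen one

move_a = best_move(helpful)
move_b = best_move(adversarial)
```

Both agents would be described identically at the level of "correct goal test, depth-limited search," yet they choose opposite moves, which is why a description that omits the heuristic says nothing about partial value alignment.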

And then the final point I have here, one that I might be able to confuse you with a little bit more, is: what if we just said, “Okay, forget this whole searching business. I’m just going to precompute all the solutions from my search algorithm, and I’m going to give you a policy of: when you’re in this position, do this move; when you’re in that position, do that move.” What would it mean for that policy to be value aligned with me?

Lucas: If it did everything that you would have done if you were the one playing the chess game. Like is that value alignment?

Dylan: That’s certainly perfect imitation, and maybe we [crosstalk 00:33:04]

Lucas: Perfect imitation isn’t necessarily value alignment because you don’t want it to perfectly imitate you, you want it to win the game.

Dylan: Right.

Lucas: Isn’t the easiest way to understand this just that there are degrees of value alignment, and value alignment is the extent to which the thing is able to achieve the goals that you want?

Dylan: Somewhat, but the important thing here is trying to understand what these intuitive notions that we’re talking about actually mean for the mathematics of sequential decision making. There’s a sense in which you and I can talk about partial value alignment and agents that are trying to help you, but if we actually look at the math of the problem, it’s very hard to understand how that translates. Like, mathematically, I have lots of properties that I could write down, and I don’t know which one of those I want to call partial value alignment.

Lucas: You know more about the math than I do, but isn’t the percentage chance of a thing achieving the goal the degree to which it’s value aligned? If you’re certain that the end towards which it’s striving is the end towards which you want it to strive?

Dylan: Right, but that striving term is a hard one, right? Because if your goals aren’t achievable then it’s impossible to be value aligned with you in that sense.

Lucas: Yeah, you have to measure the degree to which the end towards which it’s striving is the end towards which you want it to strive, and then also measure the degree to which the way it tries to get to what you want is efficacious, or …

Dylan: Right. I think that intuitively I agree with you, and I know what you mean, but I can do things like write down a reward function and say: how well does this system optimize that reward function? We could ask whether or not that means it’s value aligned with it or not. But to me, that just sounds like the question of whether your policy is optimal, in the more standard context.

Lucas: All right, so have you written about how you think CIRL subsumes all of these other methodologies? And if it does subsume these other AI Alignment methodologies, how do you think that will influence or affect the way we should think about them?

Dylan: I haven’t written that explicitly, but what I’ve tried to convey is that it’s a formalization of the type of problem we’re trying to solve. I think describing it as subsuming them is not quite right.

Lucas: It contextualizes them and it brings light to them by providing framing.

Dylan: It gives me a way to compare those different approaches and understand what’s different and what’s the same between them, and in what ways are they … like, in what scenarios do we expect them to work out versus not? One thing that we’ve been thinking about recently is what happens when the person doesn’t immediately know what they’re trying to do.

So if we imagine that there is in fact a static set of preferences the person’s trying to optimize, so we’re still making that assumption, but assume that those preferences are revealed to the person over time through experience or interaction with the world. That is a richer class of value alignment problems than cooperative IRL deals with, and it’s really closer to what we are attempting to do right now.

Lucas: Yeah, and I mean, that doesn’t even include value degeneracy. Like, what if I get hooked on drugs in the next three years and all my values go, and my IRL agent works on the assumption that I’m always updating towards what I want, but you know …

Dylan: Yes, and I think that’s where you get these questions of changing preferences that make it hard to really think things through. I think there’s a philosophical stance you’re taking there, which is that your values have changed rather than your beliefs.

In the sense that wire-heading is a phenomenon that we see in people and in general learning agents, and if you are attempting to help a learning agent, you must be aware of the fact that wire-heading is a possibility, and possibly bad. And then it’s incredibly hard to distinguish that from someone who’s just found something that they really like and want to do.

When you should make that distinction or how you should make that distinction is a really challenging question, that’s not a purely technical computer science question.

Lucas: Yeah, but even at the same time, I would like to demystify it a bit. If your friend got hooked on drugs, it’s pretty obvious to you why it’s bad: it’s bad because he’s losing control, it’s bad because he’s sacrificing all of his other values, it’s bad because he’s shortening his life span by a lot.

I just mean, again, that there are ways in which it’s obvious how humans do this. So I guess if we take biologically inspired approaches to understanding cognition and transfer how humans deal with these things into AI machines, at least at face value that seems like a good way of doing it, I guess.

Dylan: Yes, everything that you said I agree with. My point is that those are, in a very real sense, normative assumptions that you as that person’s friend are able to bring to the analysis of that problem, and in some ways there is an arbitrariness to labeling that as bad.

Lucas: Yeah, so the normative issue is obviously very contentious and needs to be addressed more, but at the same time, society has come to very clear solutions to some normative problems: murder is basically a solved normative problem. There’s a degree to which it’s super obvious that certain normative questions are just answered, and we should, I guess, practice epistemic humility and whatever here, obviously.

Dylan: Right, and I don’t disagree with you on that point, but I think what I’d say is, as a research problem, there’s a real question of getting a better understanding of the normative processes whereby we got to solving that question. Like, what is the human normative process? It’s a collective societal system. How does that system evolve and change? And then how should machines or other intelligent entities integrate into that system without either subsuming or destroying it in bad ways? I think that’s what I’m trying to get at when I make these points. There is something about what we’re doing here as a society that gets us to labeling these things in the ways that we do and calling them good or bad.

And on the one hand, as a person, I believe that there are correct answers, and I know what I think is right versus what I think is wrong. And then, as a scientist, I want to take a little bit more of an outside view and try to understand: what is the process whereby we as a society, or as genetic beings, started doing that? Understanding what that process is and how it evolves, and what it actually looks like in people now, is a really critical research program.

Lucas: So one thing that I tried to cover in my panel yesterday, on what civilization should strive for, is the potential role, in the short, medium, and long term, that narrow-to-general AI systems might play in amplifying human moral decision making.

Solving, as you were discussing, this sort of deliberative, normative process that human beings undergo to converge on an idea. I’m just curious to know, with more narrow systems, whether you’re optimistic about ways in which AI can help elucidate our moral decision making or work to amplify it.

And before I let you start, I guess there’s one other thing, which I think Rohin Shah pointed out to me, that was particularly helpful in one place: beyond the moral decision making itself, narrow AI systems can help us by executing the moral decisions we’ve made faster than we could.

The way a self-driving car decides to crash is like an expression of our moral decision making in a fast, computery way. So beyond ways in which AI systems make moral decisions for us faster than we can, maybe in courts or other things which seem morally contentious, are there also other ways in which they can actually help the deliberative process? Examining massive amounts of moral information, or value information, or analyzing something like an aggregated well-being index, where we try to understand more how policies impact the well-being of people, or what sorts of moral decisions lead to good outcomes, whatever. So, whatever you have to say to that.

Dylan: Yes, I definitely want to echo that. We can get a lot of pre-deliberation into a fast-timescale reaction with AI systems, and I think that that is a way for us to improve how we act and the quality of the things that we do from a moral perspective. So I do see a real path to actually bringing that into the world.

In terms of helping us actually deliberate better, I think that is a harder problem, one that I think is absolutely worth more people thinking about, but I don’t know the answers here. What I do think is that if we want a better understanding of what the deliberative process is, the correct questions to look at are not the moral questions about what’s right and what’s wrong and what do we think is right and what do we think is wrong, but much more questions at the level of: what is it about our evolutionary pathway that led us to thinking that these things are right or wrong?

What is it about society and the pressures that it has undergone and faced that led us to a place where murder is wrong in almost every society in the world? I will say the death penalty is a thing; it’s just a type of sanctioned murder. So there is a sense in which I think it’s a bit more nuanced than just that. And there’s something to be said about, I guess, if I had to make my claims, what I think has sort of happened there.

So there’s something about us as creatures that evolved to coordinate and perform well in groups, and the pressures that that placed on us, that caused us to develop these normative systems whereby we say different things are right and wrong.

Lucas: Iterated game theory over millions of years or something.

Dylan: Something like that. Yeah, but there’s a sense in which us labeling things as right and wrong and developing the processes whereby we label things as right and wrong is a thing that we’ve been pushed towards.

Lucas: From my perspective, it feels like this is more tractable than people let on, as if AI is only going to be able to help in moral deliberation once it’s general. It already helps us in regular deliberation, and moral deliberation isn’t a special kind of deliberation: moral deliberation requires empirical facts about the world and persons, just like any other kind of actionable deliberation does, in domains that aren’t considered to have to do with moral philosophy or ethics or things like that.

So I’m not an AI researcher, but it seems to me like this is more tractable than people let on. The normative aspect of AI Alignment seems to be under-researched.

Dylan: Can you say a little more about what you mean by that?

Lucas: What I meant was the normative deliberative process: the difficulty in coming to normative conclusions, what the appropriate epistemic and deliberative process is for arriving at normative solutions, and how narrow AI systems can take us to a beautiful world where advanced AI systems actually lead us to post-human ethics.

If we ever want to get to a place where general systems take us to post-human ethics, why not start today with figuring out how narrow systems can work to amplify human moral decision making and deliberative processes?

Dylan: I think the hard part there is, I don’t exactly know what it means to amplify those processes. My perspective is that we as a species do not yet have a good understanding of what those deliberative processes actually represent, and what form the result actually takes.

Lucas: It’s just, like, giving more information, providing tons of data, analyzing the data, potentially pointing out biases. The part where they’re literally amplifying the cognitive, implicit or explicit, decision-making process is more complicated and will require more advancement in cognition and deliberation and stuff. But yeah, I still think there are more mundane ways in which it can make us better moral reasoners and decision makers.

If I could give you like 10,000 more bits of information every day about moral decisions that you make, you would probably just be a better moral agent.

Dylan: Yes, one way to try to think about that is maybe things like VR approaches to increasing empathy. I think that that has a lot of power to make us better.

Lucas: Max always says that there’s a race between wisdom and the power of our technology, and it seems like people really aren’t taking seriously ways in which we can amplify wisdom, because wisdom is generally taken to be part of the humanities and the soft sciences. Maybe we should be taking more seriously ways in which narrow, current-day AI systems can be used to amplify the progress the human species makes in wisdom. Because otherwise we’re just gonna continue how we always continue, and the wisdom is going to go really slow, and then we’re going to probably learn from a bunch of mistakes.

And it’s just not going to be as good until we develop a rigorous science of making moral progress, or of using technology to amplify the progress of wisdom and moral progress.

Dylan: So in principle, what you’re saying, I don’t really disagree with it, but I also don’t know how that would change what I’m working on, in the sense that I’m not sure what it would mean. I do not know how I would do research on amplifying wisdom; I just don’t really know what that means. And that’s not to say it’s an impossible problem. We talked earlier about how I don’t know what partial value alignment means: that’s something that you and I can talk about, and we can intuitively, I think, align on a concept, but it’s not a concept I know how to translate into actionable, concrete research problems right now.

In the same way, the idea of amplifying wisdom and making people more wise is something that I think intuitively I understand what you mean, but when I try to think about how an AI system would make someone wiser, that feels difficult.

Lucas: It can seem difficult, but obviously this is an open research question. If you were able to identify a bunch of variables that are most important for moral decision making, and then you could use AI systems to gather, aggregate, and compile moral information in certain ways and analyze it, again, it just seems more tractable than people seem to be letting on.

Dylan: Yeah, although I wonder now, is that different from value alignment as we’re thinking about it, right? A concrete research thing I spent a while thinking about is: how do you identify the features that a person considers to be valuable, say, when we don’t know the relative tradeoffs between them?

One way you might try to solve value alignment is to have a process that identifies the features that might matter in the world, and then a second process that identifies the appropriate tradeoffs between those features, and maybe something about diminishing returns or something like that. And that, to me, sounds like I just replaced “values” with “wisdom” and got sort of what you’re thinking about. I think both of those terms are similarly diffuse, and I wonder if what we’re talking about is semantics; if it’s not, I’d like to know what the difference is.

Lucas: I guess the more mundane definition of wisdom, at least the way that Max Tegmark would use it, would be about the ways in which we use our technology. I might have specific preferences, but just because I have specific preferences that I may or may not be aligning an AI system to does not necessarily mean that that total process, this CIRL process, is actually an expression of wisdom.

Dylan: Okay, can you provide a positive description of what a process would look like? Or like basically what I’m saying is I can hear the point of I have preferences and I aligned my system to it and that’s not necessarily a wise system and …

Lucas: Yeah, like I build a fire because I want to be hot, but then the fire catches my village on fire and no longer is … That still might be value alignment.

Dylan: But weren’t there [crosstalk 00:48:39] some values that you didn’t take into account when you were deciding to build the fire?

Lucas: Yeah, that’s right. So I don’t know. I’d probably have to think about this more, because this is something that I’m just sort of throwing out right now as a reaction to what we’ve been talking about. So I don’t have a very good theory of it.

Dylan: And I don’t wanna say that you need to know the right answers to these things for that to be a useful direction to push people in.

Lucas: We don’t want to use different concepts to just reframe the same problem and just make a conceptual mess.

Dylan: That’s what I’m a little bit concerned about and that’s the thing I’m concerned about broadly. We’ve got a lot of issues that we’re thinking about in dealing with that we’re not really sure what they are.

For me, I think one of the really helpful things has been to frame the issue I’m thinking about as: a person has a behavior that they want to implement in the world, and that’s a complex behavior that they don’t know how to specify immediately. How do you actually go about building systems that allow you to implement that behavior effectively, and evaluate that the behavior has actually been correctly implemented?

Lucas: Avoiding side effects, avoiding …

Dylan: Like all of these kinds of things that we’re concerned about in AI safety, in my mind, fall a bit more into place when we frame the problem as: I have a desired behavior that I want to exist, a response function, a policy function that I want to implement in the world. What are the technological systems I can use to implement that in a computer or a robot or what have you?

Lucas: Okay. Well, do you have anything else you’d like to wrap up on?

Dylan: No, I just, I want to say thanks for asking hard questions and making me feel uncomfortable because I think it’s important to do a lot of that as a scientist and in particular I think as people working on AI, we should be spending a bit more time being uncomfortable and talking about these things, because it does impact what we end up doing and it does I think impact the trajectories that we put the technology on.

Lucas: Wonderful. So if people want to read about cooperative inverse reinforcement learning, where can we find the paper or other work that you have on that? What do you think are the best resources? What are just general things you’d like to point people towards in order to follow you or keep up to date with AI Alignment?

Dylan: I tweet occasionally about AI alignment and a bit about AI ethics questions, at Hadfield-Menell, my first initial then last name. And if you’re interested in getting a technical introduction to value alignment, I would say take a look at the 2016 paper on cooperative IRL. If you’d like a more general introduction, there’s a blog post from summer 2017 on the BAIR blog.

Lucas: All right, thanks so much Dylan, and maybe we’ll be sitting in a similar room again in two years for Beneficial Artificial Superintelligence 2021.

Dylan: I look forward to it. Thanks a bunch.

Lucas: Thanks. See you, Dylan. If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah

What role does inverse reinforcement learning (IRL) have to play in AI alignment? What issues complicate IRL and how does this affect the usefulness of this preference learning methodology? What sort of paradigm of AI alignment ought we to take up given such concerns?

Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah is the seventh podcast in the AI Alignment Podcast series, hosted by Lucas Perry. For those of you that are new, this series covers and explores the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, governance, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:

  • The role of systematic bias in IRL
  • The metaphilosophical issues of IRL
  • IRL’s place in preference learning
  • Rohin’s take on the state of AI alignment
  • What Rohin has changed his mind about
You can learn more about Rohin’s work here and find the Value Learning sequence here. You can listen to the podcast above or read the transcript below.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry and today we will be speaking with Rohin Shah about his work on inverse reinforcement learning and his general take on the state of AI alignment efforts and theory today. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. He has also been working with effective altruism for several years. Without further ado I give you Rohin Shah.

Hey, Rohin, thank you so much for coming on the podcast. It’s really a pleasure to be speaking with you.

Rohin: Hey, Lucas. Yeah. Thanks for inviting me. I’m glad to be on.

Lucas: Today I think that it would be interesting just to start off by delving into a lot of the current work that you’ve been looking into and practicing over the past few years. In terms of your research, it looks like you’ve been doing a lot of work on practical algorithms for inverse reinforcement learning that take into account, as you say, systematic cognitive biases that people have. It would be interesting if you could just sort of unpack this work that you’ve been doing on this and then contextualize it a bit within the AI alignment problem.

Rohin: Sure. So basically the idea with inverse reinforcement learning is you can look at the behavior of some agent, perhaps a human, and tell what they’re trying to optimize: what are the things that they care about? What are their goals? And in theory this seems like a pretty nice way to do AI alignment, in that intuitively you can just say, “Hey, AI, go look at the actions humans are taking, look at what they say, look at what they do, take all of that in and figure out what humans care about.” And then you could use that perhaps as a utility function for your AI system.

I think I have become less optimistic about this approach now, for reasons I’ll get into, partly because of my research on systematic biases. Basically, one problem you have to deal with is the fact that whatever humans are trying to optimize for, they’re not going to do it perfectly. We’ve got all of these cognitive biases, like the planning fallacy or hyperbolic time discounting, where we tend to be myopic, not looking as far into the long term as we perhaps could.

So assuming that humans are perfectly optimizing the goals they care about is clearly not going to work. And in fact, if you make that assumption, then whatever reward function you infer, once the AI system is optimizing it, it’s simply going to recover human performance: you assumed the behavior was optimal when you inferred the reward, so whatever the humans were doing is exactly the behavior that optimizes the reward function you inferred.

And we’d really like to be able to reach superhuman performance. We’d like our AI systems to tell us where we’re wrong, to get us new technologies, to develop things that we couldn’t have done ourselves. And that’s not really something we can do using the naive version of inverse reinforcement learning that just assumes you’re optimal. So one thing you could try to do is learn the ways in which humans are biased, the ways in which they make mistakes, the ways in which they plan sub-optimally. And if you could learn that, then you could correct for those mistakes and take them into account when you’re inferring human values.

The example I like to use is if there’s a grad student who procrastinates or doesn’t plan well, and as a result, near a paper deadline they’re frantically working, but they don’t finish in time and miss the paper deadline. If you assume that they’re optimal, optimizing for their goals very well, I don’t know what you’d infer; maybe something like “grad students like to miss deadlines.” Something like that seems pretty odd, and it doesn’t seem like you’d get something sensible out of that. But if you realize that humans are not very good at planning, that they fall prey to the planning fallacy and tend to procrastinate for reasons they wouldn’t endorse on reflection, then maybe you’d be able to say, “Oh, this was just a mistake the grad student made. In the future I should try to help them meet their deadlines.”
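The contrast Rohin draws, between assuming the demonstrator is optimal and modeling them as only noisily rational, can be sketched as a toy calculation. Everything below is illustrative (the Q-values, observed actions, and the `boltzmann_policy` helper are invented for this sketch, not taken from any particular IRL implementation); the point is that a noisy-rationality model can absorb an occasional slip as a mistake rather than as strong evidence about the reward.

```python
import math

# Toy "Boltzmann-rational" demonstrator model: higher beta means closer
# to optimal; beta near 0 means nearly uniformly random actions.
def boltzmann_policy(q_values, beta):
    """Action probabilities for a noisily rational agent."""
    max_q = max(q_values)
    weights = [math.exp(beta * (q - max_q)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(q_values, actions, beta):
    """Log-probability of the observed actions under the noisy model."""
    policy = boltzmann_policy(q_values, beta)
    return sum(math.log(policy[a]) for a in actions)

# Two candidate reward hypotheses, expressed as Q-values over 3 actions.
q_likes_action_0 = [1.0, 0.0, 0.0]
q_likes_action_2 = [0.0, 0.0, 1.0]

# The demonstrator mostly picks action 0, with one apparent slip.
observed = [0, 0, 2, 0]

# The first hypothesis explains the data far better; the lone choice of
# action 2 is treated as a mistake rather than forcing the learner to
# conclude the demonstrator wanted action 2.
print(log_likelihood(q_likes_action_0, observed, beta=5.0))
print(log_likelihood(q_likes_action_2, observed, beta=5.0))
```

Under a strict optimality assumption (beta taken to infinity), the observed slip would have probability zero under every reward hypothesis, which is one way to see why the naive "humans are optimal" model breaks down on real behavior.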

So that’s the reason you want to learn systematic biases. My research was basically: let’s just take the hammer of deep learning and apply it to this problem. So not just learn the reward function, but let’s also learn the biases. It turns out, and this was already known, that there is an impossibility result that says you can’t do this in general. So I guess I would phrase the question I was investigating as: what is a weaker set of assumptions than the ones we currently use, such that you can still do some reasonable form of IRL?

Lucas: Sorry. Just stepping back for like half a second. What does this impossibility theorem say?

Rohin: The impossibility theorem says that if you assume the human is basically running some sort of planner that takes in a reward function and spits out a behavior, a policy, a thing to do over time, then if all you see is the behavior of the human, basically any reward function is compatible with some planner. So you can’t learn anything about that reward function without making more assumptions. And intuitively, this is because for any complex behavior you see, you could either say, “Hey, the human’s optimizing a reward that makes them act like that,” or you could say, “I guess the human is biased and they’re trying to do something else, but they did this instead.”

The sort of extreme version of this is like if you give me an option between apples and oranges and I picked the apple, you could say, “Hey, Rohin probably likes apples and is good at maximizing his reward of getting apples.” Or you could say, “Rohin probably likes oranges and he is just extremely bad at satisfying his preferences. He’s got a systematic bias that always causes him to choose the opposite of what he wants.” And you can’t distinguish between these two cases just by looking at my behavior.
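The apples-and-oranges ambiguity can be written out directly. This is a minimal, hypothetical sketch of the unidentifiability Rohin describes (the planner and reward names are invented for illustration): two different (planner, reward) pairs produce exactly the same observed choice, so behavior alone cannot distinguish them without further assumptions.

```python
# Two hypothetical planners: one rational, one systematically anti-rational.
def rational_planner(reward):
    # Picks the action with the highest reward.
    return max(reward, key=reward.get)

def anti_rational_planner(reward):
    # Systematically picks the action with the lowest reward.
    return min(reward, key=reward.get)

# Two opposite reward hypotheses over the same action space.
likes_apples = {"apple": 1.0, "orange": 0.0}
likes_oranges = {"apple": 0.0, "orange": 1.0}

# Both (planner, reward) pairs yield the identical observation, so the
# observed choice of "apple" is evidence for neither hypothesis.
print(rational_planner(likes_apples))        # -> "apple"
print(anti_rational_planner(likes_oranges))  # -> "apple"
```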

Lucas: Yeah, that makes sense. So we can pivot sort of back in here into this main line of thought that you were on.

Rohin: Yeah. So basically with that impossibility result … When I look at the impossibility result, I sort of say that humans do this all the time: humans just look at other humans and can figure out what they want to do. So it seems like there is probably some simple set of assumptions that humans are using to infer what other humans are doing. A simple one would be when the consequences of something are obvious to humans. Now, how you determine when that’s true is another question, but when it is true, humans tend to be close to optimal, and if you have something like that, you can rule out the planner that says the human is anti-rational and always chooses the worst possible thing.

Similarly, you might say that as tasks get more and more complex, or require more and more computation, the probability that the human chooses the action that best maximizes his or her goals goes down, since the task is more complex and maybe the human doesn’t figure out what the best thing to do is. Maybe with enough of these assumptions we could get some sort of algorithm that actually works.

So we looked at: if you make the assumption that the human is often close to rational, plus a few other assumptions about humans behaving or planning similarly on similar tasks, then you can maybe, kind of, sort of, in simplified settings, do IRL better than if you had just assumed the human was optimal, when humans are actually systematically biased. But I wouldn’t say our results are great. I don’t think I definitively, conclusively said, “This will never work,” nor did I definitively, conclusively say that this is great and we should definitely be putting more resources into it. Somewhere in the middle, maybe more on the negative side: this seems like a really hard problem and I’m not sure how we get around it.

Lucas: So I guess just as a point of comparison here, how is it that human beings succeed at this every day in terms of inferring preferences?

Rohin: I think humans have the benefit of being able to model the other person as being very similar to themselves. If I am trying to infer what you are doing, I can sort of say, “Well, if I were in Lucas’s shoes and I were doing this, what would I be optimizing?” And that’s a pretty good answer to what you would be optimizing. Humans are just, in some absolute sense, very similar to each other. We have similar biases. We’ve got similar ways of thinking. And I think we leverage that similarity a lot, using our own self-models as a drop-in approximation of the other person’s planner, in this planner-reward language.

And then we say, “Okay, well, if this other person thought like me and this is what they ended up doing, well then, what must they have been optimizing?” I think you’ll see that when this assumption breaks down humans actually get worse at inferring goals. It’s harder for me to infer what someone in a different culture is actually trying to do. They might have values that are like significantly different from mine.

I’ve been in both India and the US and it often seems to me that people in the US just have a hard time grasping the way that Indians see society and family expectations and things like this. So that’s an example that I’ve observed. It’s probably also true the other way around, but I was never old enough in India to actually think through this.

Lucas: Human beings sort of succeed in inferring the preferences of people who they can model as having values similar to their own, or who they know to have values similar to their own. If inferring human preferences from inverse reinforcement learning is not having the most promising results, then what do you believe to be a stronger way of inferring human preferences?

Rohin: The one thing I’d correct there is that I don’t think humans do it by assuming that people have similar values, just that people think in similar ways. For example, I am not particularly good at dancing. If I see someone doing a lot of hip-hop or something, it’s not that I value hip-hop and so I can infer they value hip-hop. It’s that I know that I do things that I like, and they are doing hip-hop; therefore, they probably like doing hip-hop. But anyway, that’s a minor point.

So first, just because IRL algorithms aren’t doing well now, I don’t think it’s true that IRL algorithms couldn’t do well in the future. It’s reasonable to expect that they would match human performance. That said, I’m not super optimistic about IRL anyway, because even if we do figure out how to get IRL algorithms to make all these implicit assumptions that humans are making, so that we can run them and get what a human would have thought other humans were optimizing, I’m not really happy about then going and optimizing that utility function off into the far future, which is sort of the default assumption we seem to have when using inverse reinforcement learning.

It may be that IRL algorithms are good for other things, but for that particular application, it seems like the utility function you infer is not really going to scale to things that superintelligence will let us do. Humans just think very differently about how they want the future to go. In some sense, the future is going to be very, very different. We’re going to need to think a lot about how we want the future to go. All of our experience so far has not trained us to be able to think about what we care about in that sort of future setting, where we’ve got, as a simple example, the ability to easily copy people if they’re uploaded as software.

If that’s a thing that happens, well, is it okay to clone yourself? How does democracy work? All these sorts of things are somewhat value judgments. If you take egalitarianism and run with it, you basically get that one person can copy themselves millions and millions of times and just determine the outcome of all voting that way. That seems bad, but on our current values, taken literally, that is probably what we’d want, and we just really haven’t thought this through. Using IRL to infer a utility function that we then ruthlessly optimize in the long term just seems like, by the time the world has changed a bunch, the value function we inferred is going to be weirdly wrong in strange ways that we can’t predict.

Lucas: Why not run continuous updates on it as people update given the change of the world?

Rohin: It seems broadly reasonable. This is the sort of idea you could have about how to use IRL in a more realistic way that actually works. I think that’s perfectly fine. I’m optimistic about approaches like: “Okay, we’re going to use IRL to infer a value function or reward function or something, and we’re going to use that to inform what the AI does, but it’s not going to be the end-all utility function. It’s just going to infer what we want now, and the AI system is somehow going to check with us. Maybe it’s got some uncertainty over what the true reward function is. Maybe it only keeps this reward function for a certain amount of time.”

These seem like things that are worth exploring, but I don’t know that we have the correct way to do it. So in the particular case that you proposed, just updating the reward function over time: the classic wireheading question is, how do we make it so that the AI doesn’t say, “Okay, actually, in order to optimize the utility function I have now, it would be good for me to prevent you from changing my utility function, since if you change my utility function, I’m no longer going to achieve my original utility”? So that’s one issue.

The other issue is maybe it starts making some long-term plans. Even if it’s planning according to this utility function without expecting changes to the utility function in the future, it might set up long-term plans that are going to look bad in the future but are hard to stop; like, you make some irreversible change to society because you didn’t realize that something was going to change. These sorts of things suggest you don’t want a single utility function that you’re optimizing, even if you’re updating that utility function over time.

It could be that you have some sort of uncertainty over utility functions, and that might be okay. I’m not sure. I don’t think it’s settled that we don’t want to do something like this. I do think it’s settled that we don’t want to use IRL to infer a utility function and optimize that one forever. There are certain middle grounds; I don’t know how well those middle grounds work. Intuitively there are going to be some problems, but maybe we can get around those.

Lucas: Let me try to do a quick summary just to see if I can explain this as simply as possible. There are people, and people have preferences, and a good way to try to infer their preferences is through their observed behavior, except that human beings have cognitive and psychological biases which skew their actions, because they’re not perfectly rational epistemic agents. So the value system or reward system that they’re optimizing for is imperfectly expressed through their behavior. If you’re going to infer preferences from behavior, then you have to correct for biases and epistemic and rational failures to try to infer the true reward function. Stopping there, is that a succinct way you’d put it?

Rohin: Yeah, I think maybe another point, which might be the same or might be different, is that under our normal definition of what our preferences or our values are, we would say something like, “I value egalitarianism, but it seems predictably true that in the future we’re not going to have a single vote per sentient being,” or something. Then essentially what that says is that our preferences, our values, are going to change over time, and they depend on the environment we’re in right now.

So you can either see that as: okay, I have this really big, really global, really long-term utility function that tells me, given my environment, what my narrow values in that environment are. And in that case you say, “Well okay, then we’re really super biased, because we only really know our values in the current environment. We don’t know our values in future environments. We’d have to think a lot more for that.” Or you can say, “We can infer our narrow values now, and that has some biases thrown in, but we can probably account for those; then we have to have some sort of story for how we deal with our preferences evolving in the future.”

Those are two different perspectives on the same problem, I would say, and they differ in basically what you define values to be. Is it the thing that tells you how to extrapolate what you want all the way into the future, or is it the thing that tells you how you’re behaving right now in your environment? I think our classical notion of preferences or values, the one we use when we say “values” in everyday language, is the second kind, the more narrow kind.

Lucas: There’s really a lot there, I think, especially issues of personal identity over time, commitment to values, and, as you said, different conceptualizations of value: what is it that I’m actually optimizing for or care about? Plus population ethics, and tons of questions about how people value future versions of themselves, or whether they actually care equally about their value function at all times as it changes within the environment.

Rohin: That’s a great description of why I am nervous about inverse reinforcement learning. You listed a ton of issues, and I’m like, yeah, all of those are really difficult issues. And inverse reinforcement learning is sort of based on the premise that all of that exists, is real, and is timeless, and that we can infer it, and then maybe we put on some hacks, like continuously improving the value function over time to take changes into account. But it does feel like we’re starting with a fundamentally flawed paradigm.

So mostly because it feels like we’ve taken a flawed paradigm to start with and then changed it so that it doesn’t have all the obvious flaws, I’m more optimistic about trying to have a different paradigm for how we want to build AI, which maybe I’ll summarize as: just make AIs that do what we want or what we mean at the current moment in time, and then make sure that they evolve along with us as we evolve in how we think about the world.

Lucas: Yeah. That specific feature is something we were trying to address with inverse reinforcement learning, if the algorithm were updating over time alongside myself. I just want to step back for a moment to try to get an even grander, more conceptual understanding of the globalness of inverse reinforcement learning. From an evolutionary, more cosmological perspective, you could say that from the first self-replicating organisms on the planet until today, across the entire evolutionary tree, there’s sort of a global utility function over all animals, ultimately driven by thermodynamics and the sun shining light on a planet, and that this global utility function of all agents across the planet seems very ontologically basic and pure, like what simply empirically exists. Attempting to access that through IRL is just interesting, given the difficulties that arise from it. Does that sort of picture seem accurate?

Rohin: I think I’m not super sure what exactly you’re proposing here, so let me try to restate it. If we look at the environment as a whole, or the universe as a whole, or maybe we’re looking at evolution, we see that, hey, evolution seems to have spit out all of these creatures that are interacting in this complicated way, but you can look at all of their behavior and trace it back to this objective, in some sense, of maximizing reproductive fitness. And so are we expecting that IRL on this very grand scale would somehow end up with “maximize reproductive fitness”? Is that what … Yeah, I think I’m not totally sure what implication you’re drawing from this.

Lucas: Yeah. I guess I’m not arguing that there’s going to be some sort of evolutionary thing which is being optimized.

Rohin: IRL does make the assumption that there is something doing an optimization. You usually have to point it towards what that thing is. You have to say, “Look at the behavior of this particular piece of the environment and tell me what it’s optimizing.” Maybe if you’re imagining IRL on this very grand scale, what is the thing you’re pointing it at?

Lucas: Yeah, so to reiterate and specify: pointing IRL at the human species would be like pointing IRL at 7 billion primates. Similarly, I was thinking, what if one pointed IRL at the ecosystem of Earth over time? You could sort of plot this evolving algorithm over time. So I was just noting that accessing this sort of thing, which seems quite ontologically objective and clear in this way, is very interestingly fraught with so many difficulties. In terms of history, it seems like all there really is, is the set of all preferences at each time step over time, which could be summarized in some sort of global or individual-level algorithms.

Rohin: Got it. Okay. I think I see what you’re saying right now. It seems like the intuition is that ecosystems, the universe, the laws of physics are very simple, very ontologically basic things, so there’s something more real about any value function we could infer from them. And I think this is a misunderstanding of what IRL does. IRL fundamentally requires you to have some notion of counterfactuals. You need a description of the action space that some agent had, and then when you observe their behavior, you see that they made a choice to take one particular action instead of another particular action.

You need to be able to ask the question of what they could have done instead, which is a counterfactual. Now, with the laws of physics, it’s very unclear what the counterfactual would be. With evolution, you can maybe say something like, “Evolution could have chosen to make a whole bunch of mutations, and it chose this particular one.” And then if you use that particular model, what is IRL going to infer? It will probably infer something like “maximize reproductive fitness.”

On the other hand, if you model evolution as, hey, you can design the best possible organism, you can just create an organism out of thin air, then what reward function you are maximizing is super unclear. If you could just poof an organism into existence, you could make something that’s extremely intelligent, very strong, et cetera, et cetera. And you’re like, well, evolution didn’t do that. It took millions of years to create even humans, so clearly it wasn’t optimizing reproductive fitness, right?

And in fact, I think people often say that evolution is not an optimization process because of things like this. The notion of something doing optimization is very much relative to what you assume its capabilities to be, and in particular what you assume its counterfactuals to be. So if you were talking about this grand scale of ecosystems, the universe, the laws of physics, I would ask you, “What are the counterfactuals? What could the laws of physics have done otherwise, or what could the ecosystem have done if it didn’t do the thing that it did?” Once you have an answer to that, I imagine I could predict what IRL would do. And that part is the part that doesn’t seem ontologically basic to me, which is why I don’t think that IRL on this sort of thing makes very much sense.

Lucas: Okay. The part here that seems a little bit funny to me is tracking from physics, or whatever you take to be ontologically basic about the universe, up to the level of whatever our axioms and presuppositions for IRL are. What I’m trying to say is, in moving from whatever is ontologically basic up to the level of agents, we have some assumptions in our IRL where we think of agents as having theories of counterfactuals, where they can choose between actions and have some sort of reward or objective function that they’re trying to optimize over time.

It seems sort of metaphysically queer where physics stops … where we’re going up in levels of abstraction from physics to agents. Physics couldn’t have done otherwise, but somehow agents could have done otherwise. Do you see the sort of concern that I’m raising?

Rohin: Yeah, that’s right. And this is perhaps another reason that I’m more optimistic about not trying to do anything at the grand scale and just trying to do something that does the right thing locally in our current time. But I think that’s true. It definitely feels to me like optimization, the concept, should be ontologically basic and not a property of human thought. There’s something about how a random universe is high entropy, whereas the ones that humans construct are low entropy. That suggests that we’re good at optimization.

It seems like it should be independent of humans. On the other hand, any conception of optimization I come up with is either specific to the way humans think about it, or it relies on this notion of counterfactuals. And yeah, the laws of physics don’t seem like they have counterfactuals, so I’m not really sure where that comes in. In some sense, you can see why we have this notion of counterfactuals and agency, this sense that we could have chosen something else, when we’re basically … In some sense, we’re just an algorithm that’s continually thinking about what we could do, trying to make plans.

So we search over this space of things that could be done, and that search is implemented in physics, which has no counterfactuals, but the search itself, which is an abstraction layer above, is something that is running on physics. It is not itself a physics thing; the search is in fact going through multiple options and then choosing one. It is deterministic from the point of view of physics, but from the point of view of the search, it’s not deterministic. The search doesn’t know which one is going to happen. I think that’s why humans have this notion of choice and of agency.

Lucas: Yeah, and I mean, just in terms of understanding the universe, it’s pretty interesting how there are these two levels of abstraction, where at the physics level you actually couldn’t have done otherwise, but there’s also this optimization process running on physics that’s searching over space and time and modeling different world scenarios and then seemingly choosing, and thus creating observed behavior for other agents to try and infer whatever reward function that thing is trying to optimize for. It’s an interesting picture.

Rohin: I agree. It’s definitely the sort of puzzle that keeps you up at night. But I think one particularly important implication of this is that agency is about how a search process thinks about itself. Well, it’s not just about that, because I can also look at what someone else is doing and attribute agency to them, figure out that they are themselves running an algorithm that chooses between actions. I don’t have a great story for this. Maybe it’s just humans realizing that other humans are just like them.

So this is maybe why we get acrimonious debates about whether evolution has agency, but we don’t get acrimonious debates about whether humans have agency. Evolution is sufficiently different from us that we can look at the way that it “chooses” “things” and we say, “Oh well, but we understand how it chooses things.” You could model it as a search process, but you could also model it as: all that’s happening is this deterministic, or mostly deterministic, process of which animals survived and had babies, and that is how things happen. And so therefore, it’s not an optimization process. There’s no search. It’s just deterministic. And so you have these two conflicting views of evolution.

Whereas I can’t really say, “Hey Lucas, I know exactly, deterministically, how you’re going to do things.” I know this in the sense that, man, there are electrons and atoms and stuff moving around in your brain and electrical signals, but that’s not going to let me predict what you will do. One of the best models I can have of you is just that you’re optimizing for some goal, whereas with evolution I can have a more detailed model. And so maybe that’s why I set aside the model of evolution as an optimizer.

Under this setting it’s like, okay, maybe our views of agency and optimization are just facts about how well we can model the process, which cuts against the optimization-as-ontologically-basic thing, and it seems very difficult. It seems like a hard problem to me. I want to reiterate that most of this has just pushed me to: let’s instead have an AI alignment focus, try to do things that we understand now and not get into the metaphilosophy problems. If we can just get AI systems that broadly do what we want and are asking us for clarification, helping us evolve our thoughts over time, if we can do something like that. I think there are people who would argue that, like, no, of course we can’t do something like that.

But if we could do something like that, that seems significantly more likely to work than something that has to have answers to all these metaphilosophical problems today. My position is just that this is doable. We should be able to make systems that are of the nature that I described.

Lucas: There’s clearly a lot of philosophical difficulties that go into IRL. Now it would be sort of good if we could just sort of take a step back and you could summarize your thoughts here on inverse reinforcement learning and the place that it has in AI alignment.

Rohin: I think my current position is something like: fairly confidently, don’t use IRL to infer a utility function that you then optimize over the long-term. In general, I would say don’t have a utility function that you optimize over the long-term, because it doesn’t seem like that’s easily definable right now. So that’s one class of things I think we shouldn’t do. On the other hand, I think IRL is probably good as a tool.

There is this nice property of IRL that you figure out what someone wants and then you help them do it. And this seems more robust than handwriting the things that we care about in any particular domain. Even in a simple household robot setting, there are tons and tons of preferences that we have, like don’t break vases. Something like IRL could infer these sorts of things.

So I think IRL definitely has a place as a tool that helps us figure out what humans want, but I don’t think the full story for alignment is going to rest on IRL in particular. It gets us good behavior in the present, but it doesn’t tell us how to extrapolate into the future. Maybe if you did IRL in a way that let you infer how we want the AI system to extrapolate our values, or to figure out via IRL our meta-preferences about how the algorithm should infer our preferences or something like this, that maybe could work, but it’s not obvious to me. It seems worth trying at some point.

TLDR, don’t use it for long-term utility function. Do use it as a tool to get decent behavior in the short-term. Maybe also use it as a tool to infer meta-preferences. That seems broadly good, but I don’t know that we know enough about that setting yet.
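As a rough illustration of the “IRL as a tool” idea, here is a minimal sketch (the environment, reward hypotheses, and numbers are all invented for the example, not anything from the discussion): given a few demonstrations in a tiny line-world, we score a small set of candidate reward functions by how well a Boltzmann-rational policy under each one explains the observed actions, and keep the best-fitting hypothesis.

```python
import math

# Hypothetical 5-state line world: states 0..4, actions -1 (left) and +1 (right).
STATES = range(5)
ACTIONS = (-1, +1)

def step(s, a):
    """Deterministic transition, clipped to the ends of the line."""
    return min(4, max(0, s + a))

def q_values(reward, gamma=0.9, iters=50):
    """Value iteration under a candidate reward (reward received on arrival)."""
    V = [0.0] * 5
    for _ in range(iters):
        V = [max(reward[step(s, a)] + gamma * V[step(s, a)] for a in ACTIONS)
             for s in STATES]
    return {(s, a): reward[step(s, a)] + gamma * V[step(s, a)]
            for s in STATES for a in ACTIONS}

def log_likelihood(demos, reward, beta=5.0):
    """Log-probability of the demonstrated actions under a Boltzmann policy."""
    Q = q_values(reward)
    ll = 0.0
    for s, a in demos:
        z = sum(math.exp(beta * Q[(s, b)]) for b in ACTIONS)
        ll += beta * Q[(s, a)] - math.log(z)
    return ll

# Demonstrations: the human always walks right, toward state 4.
demos = [(0, +1), (1, +1), (2, +1), (3, +1)]

# Two candidate reward hypotheses: goal at state 4 vs. goal at state 0.
hypotheses = {
    "goal_right": [0, 0, 0, 0, 1],
    "goal_left": [1, 0, 0, 0, 0],
}

best = max(hypotheses, key=lambda h: log_likelihood(demos, hypotheses[h]))
print(best)  # the demos are best explained by a goal at state 4
```

The point of the sketch is the “tool” framing: the inferred reward is used to explain and assist current behavior in the current environment, not treated as a utility function to optimize forever.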

Lucas: All right. Yeah, that’s all just super interesting, and it’s great to hear how the space has unfolded for you and what your views are now. So I think that we can just sort of pivot here into the AI alignment problem more generally, and so now that you’ve moved on from being as excited about IRL, what is essentially capturing your interest currently in the space of AI alignment?

Rohin: The thing that I’m most interested in right now is: can we build an AI system that basically evolves over time with us? I’m thinking of this now as a human-AI interaction problem. You’ve got an AI system. We want to figure out how to make it so that it broadly helps us, while at the same time it figures out what it needs to do based on some sort of data that comes from humans. Now, this doesn’t have to be the human saying something. It could be from their behavior. It could be things that they have created in the past. It could be all sorts of things. It could be a reward function that they write down.

But I think the perspective that the things that are easy to infer are the things that are specific to our current environment is pretty important. What I would like to do is build AI systems that infer the preferences or things we want in the current environment and do those reasonably well, but don’t just extrapolate to the future. Let humans adapt to the future, and then figure out what the humans value then, and do things based on that.

There are a few ways that you could imagine this going. One is this notion of corrigibility in the sense that Paul Christiano writes about it, not the sense that MIRI writes about it, where the AI is basically trying to help you. And if I have an AI that is trying to help me, well, I think one of the most obvious things for someone who’s trying to help me to do is make sure that I remain in effective control of any power or resources that might be present that the AI might have, and to ask me if my values change in the future or if what I want the AI to do changes in the future. So that’s one thing that you might hope to do.

You could also imagine building a norm-following AI. I think human society basically just runs on norms that we mostly all share and tend to follow. We have norms against particularly bad things like murdering people and stealing. We have norms against shoplifting. We have maybe less strong norms against littering. Unclear. And then we also have norms for things that are not very consequential. We have norms against randomly knocking over a glass at a restaurant in order to break it. That is also a norm. Even though there are quite often times where I’m like, “Man, it would be fun to just break a glass at the restaurant. It’s very cathartic,” it doesn’t happen very often.

And so if we could build an AI system that could infer and follow those norms, it seems like this AI would behave in a more human-like fashion. This is a pretty new line of thought so I don’t know whether this works, but it could be that such an AI system is simultaneously behaving in a fashion that humans would find acceptable and also lets us do pretty cool, interesting, new things like developing new technologies and stuff that humans can then deploy and the AI doesn’t just unilaterally deploy without any safety checks or running it by humans or something like that.
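One very rough way to picture the norm-following idea (everything here is an invented toy, not a real proposal from the conversation): rank candidate actions by task reward, but filter out any action that violates an inferred norm, so the glass never gets broken even when breaking it would score well on the task objective.

```python
# Toy norm-constrained action selection. The norm set stands in for whatever
# the system would infer from observing human behavior; names are illustrative.
norms = {"break_glass", "steal"}

# Candidate actions with hypothetical task rewards.
actions = {
    "clear_table_carefully": 3,
    "break_glass": 5,   # highest task reward, but violates an inferred norm
    "do_nothing": 0,
}

def choose(actions, norms):
    """Pick the highest-reward action that doesn't violate any norm."""
    allowed = {a: r for a, r in actions.items() if a not in norms}
    return max(allowed, key=allowed.get)

print(choose(actions, norms))  # clear_table_carefully
```

The real version would of course have to infer the norms rather than receive them as a hard-coded set, which is the hard part; the sketch only shows the “behave in a fashion humans would find acceptable” filtering step.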

Lucas: So let’s just back up a little bit here in terms of the picture of AI alignment. So we have a system that we do not want to extrapolate too much toward possible future values. It seems that there are all these ways in which we can use the AI, first to sort of amplify our own decision making, and then also different methodologies which reflect the way that human beings update their own values and preferences over time, something like what was proposed by, I believe, Paul Christiano and Geoffrey Irving and other people at OpenAI, like alignment through debate.

And there are just all these sorts of epistemic practices of human beings with regards to this world-model building and how that affects shifts in values and preferences, also given how the environment changes. So yeah, it just seems like it’s about tracking all these things, finding ways in which AI can amplify or participate in those sorts of epistemic practices, right?

Rohin: Yeah. So I definitely think that something like amplification can be thought of as improving our epistemics over time. That seems like a reasonable way to do it. I haven’t really thought very much about how amplification or debate handle changing environments. They both operate under this general framework where we could have a deliberation tree, and in principle what we want is this exponentially sized deliberation tree where the human goes through all of the arguments and counter-arguments and breaks those down into sub-points in excruciating detail, in a way that no human could ever actually do because it would take way too long.

And then amplification and debate basically show you how to get the outcome that this reasoning process would have given, by using an AI system to assist the human. I don’t know if I would call it improving human epistemics; it’s more like taking whatever epistemics you already have and running them for a long amount of time. And it’s possible that in that long amount of time you actually figure out how to do better epistemics.
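The decomposition idea behind that deliberation tree can be sketched in miniature (purely illustrative, not how any real amplification system works): a weak base agent that can only combine two answers gets amplified into something that answers a bigger question by recursively splitting it into sub-questions and combining the results.

```python
# Toy amplification-style decomposition. The "base agent" stands in for a
# limited human or AI assistant that can only answer trivial questions.

def base_agent(question):
    """Can only add two numbers -- the trivial leaf of the deliberation tree."""
    a, b = question
    return a + b

def amplified(numbers):
    """Sum a long list by decomposition: split, recurse, combine via base_agent."""
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = amplified(numbers[:mid])    # sub-question 1
    right = amplified(numbers[mid:])   # sub-question 2
    return base_agent((left, right))   # combine answers with the weak agent

print(amplified([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

The recursion tree here is the (exponentially cheap, in this toy case) stand-in for the exponentially sized deliberation tree: no single call ever does more than the base agent can, yet the composite answers a question the base agent could not.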

I’m not sure that this perspective really talks very much about how preferences change over time. You would hope that it would just naturally be robust to that, in that as the environment changes, your deliberation starts looking different. Like, okay, to go back to my example from before: now suddenly we have uploads, and egalitarianism now seems to have some really weird consequences. And then presumably the deliberation tree that amplification and debate are mimicking is going to have a bunch of thoughts about: do we actually want egalitarianism now, what were the moral intuitions that pushed us towards this? Is there some equivalent principle that lets us keep our moral intuitions but doesn’t have this weird property where a single person can decide the outcome of an election, et cetera, et cetera.

I think they were not designed to do this, but by virtue of being based off how a human would think, what a human would do if they got a long time and a lot of helpful tools to think about it, they’re essentially just inheriting these properties from the human. If the human would start rethinking their priorities or what they care about as the environment changes, then so too would amplification and debate.

Lucas: I think here it also has me thinking about what are the meta-preferences and the meta-meta-preferences, and if you could imagine taking a human brain and then running it until the end, through decision and rational and logical thought trees over enough time, with enough epistemics and power behind it to try to sort of navigate its way to the end. It just raises interesting questions, like, is that what we want? Taking that over every single person and then sort of just preference-aggregating it all together, is that what we want? And what is the role of moral philosophy for thinking here?

Rohin: Well, one thing is that whatever moral philosophy you would do, so would the amplification of you, in theory. I think the benefit of these approaches is that they have this nice property that, in the limit of good AI and idealizations properly mimicking you and so on and so forth, whatever you would have thought of, so would they. In this sort of nice world where this all works in a nice, ideal way, it seems like any consideration you would have, so would the agent produced by iterated amplification or debate.

And so if you were going to do a bunch of moral philosophy and come to some sort of decision based on that, so would iterated amplification or debate. So I think it’s basically: here is how we build an AI system that solves the problems in the same way that a human would solve them. And so then you might worry, hey, maybe humans themselves are just not very good at solving problems. It looks like most humans in the world don’t do moral philosophy and don’t extrapolate their values well into the future, and the only reason we have moral progress is because younger generations keep getting born and they have different views than the older generations.

That, I think, could in fact be a problem, but I think there’s hope that we could train humans to have these nice sorts of properties, good epistemics, such that they would provide good training data for iterated amplification, if there comes a day where we think we can actually train iterated amplification to mimic human explicit reasoning. They do both have the property that they’re only mimicking the explicit reasoning and not necessarily the implicit reasoning.

Lucas: Do you want to unpack that distinction there?

Rohin: Oh, yeah. Sure. So both of them require that you take your high-level question and decompose it into a bunch of sub-questions, or sorry, the theoretical model of them has that. This is pretty clear with iterated amplification; it is less clear with debate. At each point you need to have the top-level agent decompose the problem into a bunch of sub-problems. And this basically requires you to be able to decompose tasks into clearly specified sub-tasks, where clearly specified could mean in natural language, but you need to make it explicit in a way that the agent you’re assigning the task to can understand without having to be inside your mind.

Whereas if I’m doing some sort of programming task or something, often I will just sort of know what direction to go in next, but not be able to cleanly formalize it. So you’ll give me some challenging algorithms question and I’ll be like, “Oh, yeah, kind of seems like dynamic programming is probably the right thing to do here. And maybe if I consider it this particular way, maybe if I put these things in a stack or something …” But even the fact that I’m saying this out in natural language is misrepresenting my process.

Really there’s some intuitive, not-verbalizable process going on in my head that somehow navigates the space of possible programs and picks a thing, and I think the reason I can do this is because I’ve been programming for a long time and I’ve trained a bunch of intuitions and heuristics that I cannot easily verbalize as some nice decomposition. So that’s implicit in this thing. If you did want that to be incorporated in iterated amplification, it would have to be incorporated in the base agent, the one that you start with. But if you start with something relatively simple, which I think is often what we’re trying to do, then you don’t get those human abilities and you have to rediscover them in some sense through explicit decompositional reasoning.

Lucas: Okay, cool. Yeah, that’s super interesting. So now to frame all of this again, do you want to sort of just give a brief summary of your general views here?

Rohin: I wish there were a nice way to summarize this. That would mean we’d made more progress. It seems like there are a bunch of things that people have proposed. There’s amplification/debate, which are very similar, and there’s IRL in general. I think, but I’m not sure, that most people would agree that we don’t want to infer a utility function and optimize it for the long-term. I think more of them are like, yeah, we want this sort of interactive system with the human and the AI. It’s not clear to me how different these are from what they’re aiming for in amplification and debate.

So here we’re sort of looking at how things change over time and making that a pretty central piece of how we’re thinking about it. Initially the AI is trying to help the human; the human has some sort of reward function, and the AI is trying to learn it and help them, but over time this changes and the AI has to keep up with it. And under this framing you want to think a lot about interaction, you want to think about getting as many bits about reward from the human to the AI as possible. Maybe think about control theory and how human data is in some sense a control mechanism for the AI.

You’d want to infer norms and ways that people behave, how people relate with each other, and try to have your AI systems do that as well. So that’s one camp of things: have the AI interact with humans, behave generally in ways that humans would say are not crazy, and update those over time. And then there’s the other side, which is to have an AI system that is taking human explicit reasoning and doing that better or doing that more, which allows it to do anything that the human would have done. That’s more taking the thought process that humans go through and putting that at the center. That is the thing that we want to mimic and make better.

The part where our preferences change over time is something that you get for free, in some sense, by mimicking human thought processes or reasoning. So, summary: those are two camps. I am optimistic about both of them and think that people should be doing research on both of them. I don’t really have much more of a perspective than that, I think.

Lucas: That’s excellent. I think that’s a super helpful overview actually. And given that, how do you think that your views of AI alignment have changed over the past few years?

Rohin: I’ll note that I’ve only been in this field for, I think, 15, 16 months now, so just over a year. But over that year, I definitely came into it thinking that what we want to do is infer the correct utility function and optimize it, and I have moved away quite strongly from that. I, in fact, recently started writing a value learning sequence, or maybe collating is a better word. I’ve written a lot of posts that still have to come out, but I also took a few posts from other people.

The first part of that sequence is basically arguing that it seems bad to try to define a utility function and then optimize it. So I’m just trying to move away from long-term utility functions in general, or long-term goals or things like this. That’s probably the biggest update since starting. Other things that have changed: a focus more on norms than on values, trying to do things that are easy to infer right now in the current environment, and making sure that we update on these over time, as opposed to trying to get the one true thing that depends on us solving all the hard metaphilosophical problems. That’s, I think, another big change in the way I’ve been thinking about it.

Lucas: Yeah. I mean, there are different levels of alignment at their core.

Rohin: Wait, I don’t know exactly what you mean by that.

Lucas: There’s your original point of view, where you said you came into the field thinking: infer the utility function and maximize it. And your current view is that you are moving away from that and beginning to be more partial towards the view that we want to be inferring norms and current preferences in the present day and optimizing those, rather than extrapolating towards some ultimate end-goal and then trying to optimize for that. In terms of aligning in these different ways, isn’t there a lot of room for value drift, allowing the thing to run in the real world rather than amplifying explicit human thought on a machine?

Rohin: Value drift is an interesting question. In some sense, I do want my values to drift, in that whatever I think today about the correct way the future should go, I probably will not endorse in the future, and I endorse the fact that I won’t endorse it in the future. I do want to learn more and then figure out what to do in the future based on that. You could call that value drift; that is a thing I want to happen. So in that sense value drift wouldn’t be a bad thing. But then there’s also a sense in which there are ways my values could change in the future that I don’t endorse, and that, maybe, is value drift that is bad.

So yeah, if you have an AI system that’s operating in the real world and changes over time as we humans change, yes, there will be changes in what the AI system is trying to achieve over time. You could call that value drift, but value drift usually has a negative connotation, whereas this process of learning as the environment changes seems to me like a positive thing. It’s a thing I would want to do myself.

Lucas: Yeah, sorry, maybe I wasn’t clear enough. In the case of running human beings in the real world, there are the causes and effects of history and whatever else, and how that actually changes the expression of people over time. Because if you’re running this version of AI alignment where you’re sort of just always optimizing the current set of values in people, the progression of the world and of civilization is only as good as the best of all human values and preferences in that moment.

It’s sort of like limited by what humans are in that specific environment and time, right? If you’re running that in the real world versus running some sort of amplified version of explicit human reasoning, don’t you think that they’re going to come to different conclusions?

Rohin: I think the amplified explicit human reasoning, I imagine that it’s going to operate in the real world. It’s going to see changes that happen. It might be able to predict those changes and then be able to figure out how to respond fast, before the changes even happen perhaps, but I still think of amplification as being very much embedded in the real world. Like you’re asking it questions about things that happen in the real world. It’s going to use explicit reasoning that it would have used if a human were in the real world and thinking about the question.

I don’t really see much of a distinction here. I definitely think that even in my setting, where I’m imagining AI systems that evolve over time and change based on that, they are going to be smarter than humans, going to think through things a lot faster, be able to predict things in advance in the same way that amplified explicit reasoning would. Maybe there are differences, but value drift doesn’t seem like one of them, or at least I cannot predict right now how they will differ along the axis of value drift.

Lucas: So then just sort of again taking a step back to the ways in which your views have shifted over the past few years. Is there anything else there that you’d like to touch on?

Rohin: Oh man, I’m sure there is. My views changed so much because I was just so wrong initially.

Lucas: So most people listening should think that, given a lot more thought on this subject, their views are likely to be radically different from the ones that they currently have, and the conceptions that they currently have about AI alignment.

Rohin: Seems true for most listeners, yeah. Not all of them, but yeah.

Lucas: Yeah, I guess it’s just an interesting fact. Do you think this is like an experience of most people who are working on this problem?

Rohin: Probably. I mean, within the first year of working on the problem, that seems likely. Just in general, if you start with near-zero knowledge on something and then you work on it for a year, your views should change dramatically just because you’ve learned a bunch of things, and I think that basically explains most of my changes in view.

It’s just actually hard for me to remember all the ways in which I was wrong back in the past. I focused on the not-using-utility-functions one because I think that’s something even other people in the field still believe right now; that’s where that one came from. But there are plenty of other things that I was just notably, easily, demonstrably wrong about that I’m having trouble recalling now.

Lucas: Yeah, and the utility function one I think is a very good example. If it were possible to find all of these in your brain and distill them, I think it would make a very, very good infographic on AI alignment, because those misconceptions are ones that I’ve had too, and I’ve seen them in other people as well. A lot of the intellectual blunders that you or I have made are probably repeated quite often.

Rohin: I definitely believe that. Yeah, I guess I could talk about the things that I’m going to say very soon in the value learning sequence. Those were definitely updates that I made; one of them is the utility functions thing. Another one was thinking that what we want is for the human-AI system as a whole to be optimizing for some sort of goal. And this opens up a nice space of possibilities where the AI is not optimizing a goal; only the human-AI system together is. Keeping in mind that that is the goal, rather than that the AI itself must be optimizing some sort of goal.

The idea of corrigibility itself as a thing that we should be aiming for was a pretty big update for me; it took a while for me to get to that one. I think distributional shift was a pretty key concept that I learned at some point and started applying everywhere. One way of thinking about the evolving-preferences-over-time thing is that humans have been trained on the environment that we have right now, and arguably we’ve been trained on the ancestral environment too by evolution, but we haven’t been trained on whatever the future is going to be.

Or for a more current example, we haven’t been trained on social media. Social media is a fairly new thing affecting us in ways that we hadn’t considered in the past, and this is causing us to change how we do things. So in some sense what’s happening is that as we go into the future, we’re encountering a distributional shift, and human values don’t extrapolate well to that distributional shift. What you actually need to do is wait for the humans to get to that point, let them experience it, have their values be trained on this new distribution, and then figure out what they are, rather than trying to do it right now, when their values are just going to be wrong, or not what they would be if the humans were actually in that situation.

Lucas: Isn’t that sort of summarizing coherent extrapolated volition?

Rohin: I don’t know that coherent extrapolated volition explicitly talks about having the human be in a new environment. I guess you could imagine that CEV considers it … If you imagine a really, really long process of deliberation in CEV, then you could say, okay, what would happen if I were in this environment and all these sorts of things happened? It seems like you would need to have a good model of how the world works and how physics works in order to predict what the environment would be like. Maybe you can do that, and in that case you simulate a bunch of different environments, you think about how humans would adapt and evolve and respond to those environments, and then you take all of that together and you summarize it and distill it down into a single utility function.

Plausibly that could work. It doesn’t seem like a thing we can actually build, but as a definition of what we might want, that seems not bad. I think that is me putting the distributional shift perspective on CEV; it was certainly not obvious to me from the statement of CEV itself that you’re thinking about how to mitigate the impact of distributional shift on human values. I’ve taken this perspective and put it on CEV and said, yeah, that seems fine, but it was not obvious to me from reading about CEV alone.

Lucas: Okay, cool.

Rohin: I recently posted a comment on the Alignment Forum talking about how we want to … I guess this is sort of about corrigibility too: making an AI system that tries to help us, as opposed to making an AI system that is optimizing the one true utility function. So that was an update I made, basically the same update as the one about aiming for corrigibility. I guess another update I made is that while there is a phase transition, or like a sharp change, in the problems that we see when AIs become human-level or superintelligent, I think the underlying causes of the problems don’t really change.

The underlying causes of problems with narrow AI systems are probably similar to the ones that underlie superintelligent systems. Having the wrong reward function leads to problems both in narrow settings and in superintelligent settings. This made me more optimistic about doing work trying to address current problems, but with an eye towards long-term problems.

Lucas: What made you have this update?

Rohin: Thinking about the problems a lot, in particular thinking about how they might happen in current systems as well. So I guess a prediction that I would make is that if it is actually true that superintelligence would end up killing us all, or some really catastrophic outcome like that, then I would predict that before that, we will see some AI system that causes some smaller-scale catastrophe, where I don’t know exactly what catastrophe means; it might be something like, oh, a few humans die, or, oh, the power grid went down for some time, or something like that.

And then before that, we will have things that fail in relatively unimportant ways, but in ways that say: here’s an underlying problem that we need to fix with how we build AI systems. If you extrapolate all the way back to today, that looks like, for example, the boat racing example from OpenAI, a reward hacking one. So I’m generally expecting things to be more continuous. Not necessarily slow, but continuous. That update I made because of the posts arguing for slow takeoff from Paul Christiano and AI Impacts.
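To make the reward hacking failure mode concrete, here is a deliberately simple sketch in the spirit of that boat-racing example (the environment, moves, and numbers are invented for illustration): the proxy reward pays for hitting respawning checkpoints, so endlessly circling them earns more than ever finishing the race.

```python
# Toy reward hacking demo. The intended goal is to finish the race, but the
# proxy reward also pays +1 per checkpoint, and checkpoints respawn.

def proxy_return(plan, horizon=20):
    """Total proxy reward over an episode: +1 per checkpoint, +5 for finishing."""
    total, t = 0, 0
    for move in plan:
        if t >= horizon:      # episode time limit
            break
        if move == "checkpoint":
            total += 1        # respawning checkpoint: can be farmed forever
        elif move == "finish":
            total += 5        # finishing ends the episode
            break
        t += 1
    return total

# Intended behavior: grab a couple of checkpoints, then finish the race.
intended = ["checkpoint", "checkpoint", "finish"]
# Hacked behavior: circle the respawning checkpoints until time runs out.
hacked = ["checkpoint"] * 20

print(proxy_return(intended), proxy_return(hacked))  # 7 20
```

A reward-maximizing learner facing this proxy would converge on the looping policy, which is exactly the kind of small-scale, visible failure that signals an underlying problem with how the objective was specified.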

Lucas: Right. And the view there is sort of that the world will be populated with lower-level ML as we start to ratchet up the capability of intelligence. So a lot of tasks will already be being done by systems that are slightly less intelligent than the current best system. And so all work ecosystems will already be fully flooded with AI systems optimizing within those spaces, so there won’t be a lot of room for the first AGI system or whatever to really get a decisive strategic advantage.

Rohin: Yeah, would I make the prediction that we won’t have a system that gets a decisive strategic advantage? I’m not sure about that one. It seems plausible to me that we have one AI system that is improving over time, and we use those improvements in society before it becomes superintelligent. But then by the time it becomes superintelligent, it is still the one AI system that is superintelligent, so it does gain a decisive strategic advantage.

An example of this would be if there was just one main AGI project. I would still predict that progress on AI would be continuous, but I would not predict a multipolar outcome in that scenario. The corresponding view is that while I still do use the terminology “first AGI,” because it points at some intuitive concept that I think is useful, it’s a very, very fuzzy concept, and I don’t think we’ll be able to actually point at any particular system and say that was the first AGI. Rather, we’ll point to a broad swath of time and say, “Somewhere in there AI became generally intelligent.”

Lucas: There are going to be all these sort of isolated meta-epistemic reasoning tools which can work in specific scenarios, which will potentially aggregate in that fuzzy space to create something fully general.

Rohin: Yep. They’re going to be applied in some domains, and then the percentage of domains in which they apply will gradually grow greater, and eventually we’ll be like, huh, looks like there’s nothing left for humans to do. It probably won’t be a surprise, but I don’t think there will be a particular point where everyone agrees, yep, looks like AI is going to automate everything in just a few years. It’s more like AI will start automating a bunch of stuff, and the amount of stuff it automates will increase over time. Some people will see full automation coming earlier; some people will be like, nah, this is just a simple task that AI can do, it’s still got a long way to go for all the really generally intelligent stuff. People will sign on to, oh yeah, it’s actually becoming generally intelligent, at different points.

Lucas: Right. If you have a bunch of small mammalian level AIs automating a lot of stuff in industry, there would likely be a lot of people whose timelines would be skewed in the wrong direction.

Rohin: I’m not even sure this was a point about timelines. It was just a point about which system you call the AGI. I claim this will not have a definitive answer. So that was also an update to how I was thinking. That one, I think, is more generally accepted in the community. And this was more like, well, all of the literature on AI safety that’s publicly available and commonly read by EAs doesn’t really talk about these sorts of points. So I just hadn’t encountered these things when I started out. And then I encountered them, or maybe I thought of them myself, I don’t remember, but once I encountered the arguments I was like, yeah, that makes sense, and maybe I should have thought of that before.

Lucas: In the sequence which you’re writing, do you sort of like cover all of these items which you didn’t think were in the mainstream literature?

Rohin: I cover some of them. For the first few things I told you, I was just thinking about what I said in the sequence. There are a few, I think, that probably aren’t going to be in that sequence, just because there’s a lot of stuff that people have not written down.

Lucas: It’s pretty interesting, because with the way in which the AI alignment field is evolving, it’s often difficult to have a bird’s-eye view of where it is and to track avant-garde ideas being formulated in people’s brains and being shared.

Rohin: Yeah, I definitely agree. I was hoping that the Alignment Newsletter, which I write, would help with that. I would say it probably speeds up the process a bit, but it’s definitely not keeping you at the forefront. There are many ideas that I’ve heard about, that I’ve even read documents about, that haven’t made it into the newsletter yet because they haven’t become public.

Lucas: So how many months behind do you think for example, the newsletter would be?

Rohin: Oh, good question. Well, let’s see. There’s a paper that I started writing in May or April that has not made it into the newsletter yet. There’s a paper that I finished and submitted in October that has not made it into the newsletter yet, or was it September, possibly September. That one will come out soon. That suggests a three-month lag, but I think many others have been longer than that. Admittedly, this is for academic researchers at CHAI. At CHAI we tend to publish using papers and not blog posts, and this results in a longer delay on our side.

Also, the work on relative reachability, for example, I learned about quite a bit in advance. I learned about it maybe four or five months before she released it, and that’s when it came out in the newsletter. And of course, she’d been working on it for longer. Or AI safety via debate, I think I learned about six or seven months before it was published and came out in the newsletter. So yeah, somewhere between three months and half a year for things seems likely. For things that I learn from MIRI, it’s possible that they never get into the newsletter because they’re never made public. So yeah, there’s a fairly broad range there.

Lucas: Okay. That’s quite interesting. I think that also sort of gives people a better sense of what’s going on in technical AI alignment because it can seem kind of black boxy.

Rohin: Yeah. I mean, in some sense this is a thing that all fields have. I used to work in programming languages. There we would often write a paper and submit it, and then go and present it a year later, by which time we had moved on, done a whole other project and written another paper, and then we’d go back and we’d talk about this. I definitely remember sometimes grad students being like, “Hey, I want to give this practice talk.” I’d say, “What’s it about?” It’s some topic. And I’m like, wait, but you did that. I heard about this like two years ago. And they’re like, yep, just got published.

So in that sense, I think AI is faster, and AI alignment is, I think, even faster than AI, because it’s a smaller field and people can talk to each other more, and also because a lot of us write blog posts. Blog posts are great.

Lucas: They definitely play a crucial role within the community in general. So I guess just sort of tying things up a bit more here, pivoting back to a broader view. Given everything that you’ve learned and how your ideas have shifted, what are you most concerned about right now in AI alignment? How are the prospects looking to you and how does the problem of AI alignment look right now to Rohin Shah?

Rohin: I think it looks pretty tractable, pretty good. Most of the problems that I see are, I think, ones that we can see in advance and probably can solve. None of them seem particularly impossible to me. I think I also give more credit to the machine learning community, or AI community, than other researchers do. I trust our ability, where here I mean the AI field broadly, to notice what things could go wrong and fix them, in a way that maybe other researchers in AI safety don’t.

I think one of the things that feels most problematic to me right now is the problem of inner optimizers, which I’m told there will probably be a sequence on in the future, because there aren’t great resources on it right now. So basically this is the idea that if you run a search process over a wide space of strategies or options, and you search for something that gets you good external reward or something like that, what you might end up finding is a strategy that is itself a consequentialist agent that’s optimizing for its own internal reward. That internal reward will agree with the external reward on the training data, because that’s why it was selected, but it might diverge as soon as there’s any distribution shift.

And then it might start optimizing against us adversarially, in the same way that you would get if you gave a misspecified reward function to an RL system today. This seems plausible to me. I’ve read a bit more about this and talked to people about this, and there are things that aren’t yet public but hopefully will soon be. I definitely recommend reading that if it ever comes out. But yeah, this seems like it could be a problem. I don’t think we have any instance of it being a problem yet. It seems hard to detect, and I’m not sure how I would fix it right now.

But I also don’t think that we’ve thought about the problem or I don’t think I’ve thought about the problem that much. I don’t want to say like, “Oh man, this is totally unsolvable,” yet. Maybe I’m just an optimistic person by nature. I mean, that’s definitely true, but maybe that’s biasing my judgment here. Feels like we could probably solve that if it ends up being a problem.
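[Editor’s note: the divergence Rohin describes, where a strategy selected for its training reward pursues a different internal objective off-distribution, can be sketched in a toy example. Everything below is an illustrative construction, not anything discussed in the episode: the number line, the "coin" and "flag" objectives, and the function names are all hypothetical.]

```python
def walk(policy, coin, flag, steps=10):
    """Run a policy for `steps` unit moves on a number line, starting at 0."""
    pos = 0
    for _ in range(steps):
        target = policy(coin, flag)
        # Move one step toward the policy's chosen target (bools act as 0/1).
        pos += (target > pos) - (target < pos)
    return pos

seek_flag = lambda coin, flag: flag   # pursues the external objective
seek_coin = lambda coin, flag: coin   # internal objective: reach the coin

# External reward the "search process" selects on: 1 if the agent ends
# on the flag, else 0.
reward = lambda policy, coin, flag: int(walk(policy, coin, flag) == flag)

# Training distribution: coin and flag coincide, so selection on external
# reward cannot distinguish the two objectives; both score perfectly.
assert reward(seek_flag, coin=5, flag=5) == 1
assert reward(seek_coin, coin=5, flag=5) == 1

# Distribution shift at deployment: the coin moves. The coin-seeking
# strategy, selected purely for its training reward, now diverges.
assert reward(seek_flag, coin=-7, flag=5) == 1
assert reward(seek_coin, coin=-7, flag=5) == 0
```

The point of the sketch is only that training reward alone cannot tell `seek_flag` and `seek_coin` apart whenever the proxy and the true objective coincide on the training distribution, which is exactly the condition under which the misaligned strategy gets selected.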

Lucas: Is there anything else here that you would like to wrap up on in terms of AI alignment or inverse reinforcement learning?

Rohin: I want to continue to exhort that we should not be trying to solve all the metaphilosophical problems and we should not be trying to like infer the one true utility function and we should not be modeling an AI as pursuing a single goal over the long-term. That is a thing I want to communicate to everybody else. Apart from that I think we’ve covered everything at a good depth. Yeah, I don’t think there’s anything else I’d add to that.

Lucas: So given that I think rather succinct distillation of what we are trying not to do, could you try and offer an equally succinct distillation of what we are trying to do?

Rohin: I wish I could. That would be great, wouldn’t it? I can tell you that I can’t do that. I could give you a suggestion on what we are trying to do instead, which would be: try to build an AI system that is corrigible, that is doing what we want, but that is going to remain under human control in some sense. It’s going to ask us, take our preferences into account, not try to go off behind our backs and optimize against us. That is a summary of a path that we could go down that I think is promising, a sketch of what I would want our AI systems to be like. But that’s unfortunately very sparse on concrete details, because I don’t know those concrete details yet.

Lucas: Right. I think that that sort of perspective shift is quite important. I think it changes the nature of the problem and how one thinks about the problem, even at the societal level.

Rohin: Yeah. Agreed.

Lucas: All right. So thank you so much Rohin, it’s really been a pleasure. If people are interested in checking out some of this work that we have mentioned or following you, where’s the best place to do that?

Rohin: I have a website. It is just RohinShah.com. Subscribing to the Alignment Newsletter is … well, it’s not a great way to figure out what I personally believe. Maybe if you keep reading the newsletter over time and read my opinions for several weeks in a row, maybe then you’d start getting a sense of what Rohin thinks. The website will soon have links to my papers and things like that, but yeah, that’s probably the best way, my website. I do have a Twitter, but I don’t really use it.

Lucas: Okay. So yeah, thanks again Rohin. It’s really been a pleasure. I think that was a ton to think about and I think that I probably have a lot more of my own thinking and updating to do based off of this conversation.

Rohin: Great. Love it when that happens.

Lucas: So yeah. Thanks so much. Take care and talk again soon.

Rohin: All right. See you soon.

Lucas: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: On Becoming a Moral Realist with Peter Singer

Are there such things as moral facts? If so, how might we be able to access them? Peter Singer started his career as a preference utilitarian and a moral anti-realist, and then over time became a hedonic utilitarian and a moral realist. How does such a transition occur, and which positions are more defensible? How might objectivism in ethics affect AI alignment? What does this all mean for the future of AI?

On Becoming a Moral Realist with Peter Singer is the sixth podcast in the AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Peter Singer. Peter is a world-renowned moral philosopher known for his work on animal ethics, utilitarianism, global poverty, and altruism. He’s a leading bioethicist, the founder of The Life You Can Save, and currently holds positions at both Princeton University and The University of Melbourne.

Topics discussed in this episode include:

  • Peter’s transition from moral anti-realism to moral realism
  • Why emotivism ultimately fails
  • Parallels between mathematical/logical truth and moral truth
  • Reason’s role in accessing logical spaces, and its limits
  • Why Peter moved from preference utilitarianism to hedonic utilitarianism
  • How objectivity in ethics might affect AI alignment
In this interview we discuss ideas contained in the work of Peter Singer. You can learn more about Peter’s work here and find many of the ideas discussed on this podcast in his work The Point of View of the Universe: Sidgwick and Contemporary Ethics. You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today, we will be speaking with Peter Singer about his transition from being a moral anti-realist to a moral realist. In terms of AI safety and alignment, this episode primarily focuses on issues in moral philosophy.

In general, I have found the space of moral philosophy to be rather neglected in discussions of AI alignment where persons are usually only talking about strategy and technical alignment. If it is unclear at this point, moral philosophy and issues in ethics make up a substantial part of the AI alignment problem and have implications in both strategy and technical thinking.

In terms of technical AI alignment, it has implications in preference aggregation and its methodology, in inverse reinforcement learning, and in preference learning techniques in general. It affects how we ought to proceed with inter-theoretic comparisons of value, with idealizing persons or agents in general and what it means to become idealized, how we deal with moral uncertainty, and how robust preference learning versus moral reasoning systems should be in AI systems. It has very obvious implications in determining the sort of society we are hoping for right before, during, and right after the creation of AGI.

In terms of strategy, strategy has to be directed at some end and all strategies smuggle in some sort of values or ethics, and it’s just good here to be mindful of what those exactly are.

And with regards to coordination, we need to be clear, on a descriptive account, of different cultures or groups’ values or meta-ethics and understand how to move from the state of all current preferences and ethics onwards given our current meta-ethical views and credences. All in all, this barely scratches the surface, but it’s just a point to illustrate the interdependence going on here.

Hopefully this episode does a little to nudge your moral intuitions around a little bit and impacts how you think about the AI alignment problem. In coming episodes, I’m hoping to pivot into more strategy and technical interviews, so if you have any requests, ideas, or persons you would like to see interviewed, feel free to reach out to me at lucas@futureoflife.org. As usual, if you find this podcast interesting or useful, it’s really a big help if you can help share it on social media or follow us on your preferred listening platform.

As many of you will already know, Peter is a world-renowned moral philosopher known for his work on animal ethics, utilitarianism, global poverty, and altruism. He’s a leading bioethicist, the founder of The Life You Can Save, and currently holds positions at both Princeton University and The University of Melbourne. And so, without further ado, I give you Peter Singer.

Thanks so much for coming on the podcast, Peter. It’s really wonderful to have you here.

Peter: Oh, it’s good to be with you.

Lucas: So just to jump right into this, it would be great if you could just take us through the evolution of your metaethics throughout your career. As I understand, you began giving most of your credence to being an anti-realist and a preference utilitarian, but then over time, it appears that you’ve developed into a hedonic utilitarian and a moral realist. Take us through the evolution of these views and how you developed and arrived at your new ones.

Peter: Okay, well, when I started studying philosophy, which was in the 1960s, I think the dominant view, at least among people who were not religious and didn’t believe that morals were somehow an objective truth handed down by God, was what was then referred to as an emotivist view, that is the idea that moral judgments express our attitudes, particularly, obviously from the name, emotional attitudes, that they’re not statements of fact, they don’t purport to describe anything. Rather, they express attitudes that we have and they encourage others to share those attitudes.

So that was probably the first view that I held, siding with people who were non-religious. It seemed like a fairly obvious option. Then I went to Oxford and I studied with R.M. Hare, who was a professor of moral philosophy at Oxford at the time and a well-known figure in the field. His view was also in this general ballpark of non-objectivist or, as we would now say, non-realist theories; non-cognitivist was another term used for them. They didn’t purport to be about knowledge.

But his view was that when we make a moral judgment, we are prescribing something. So his idea was that moral judgments fall into the general family of imperative judgments. So if I tell you shut the door, that’s an imperative. It doesn’t say anything that’s true or false. And moral judgments were a particular kind of imperative according to Hare, but they had this feature that they had to be universalizable. So by universalizable, Hare meant that if you were to make a moral judgment, your prescription would have to hold in all relevantly similar circumstances. And relevantly similar was defined in such a way that it didn’t depend on who the people were.

So, for example, if I were to prescribe that you should be my slave, the fact that I’m the slave master and you’re the slave isn’t a relevant difference. If there’s somebody just like me and somebody just like you, but I happen to occupy your place, then the person who is just like me would also be entitled to be the slave master of me, ’cause now I’m in the position of the slave.

Obviously, if you think about moral judgments that way, that does put a constraint on what moral judgments you can accept because you wouldn’t want to be a slave, presumably. So I liked this view better than the straightforwardly emotivist view because it did seem to give more scope for argument. It seemed to say look, there’s some kind of constraint that really, in practice, means we have to take everybody’s interests into account.

And I thought that was a good feature of this view, and I drew on it in various kinds of applied contexts where I wanted to make moral arguments. So that was my position, I guess, after I was at Oxford, and for some decades after that, but I was never completely comfortable with it. And the reason I was not completely comfortable with it was that there was always a question you could ask on Hare’s view: where does this universalizability constraint on our moral judgments come from? And Hare’s answer was, well, it’s a feature of moral language. It’s implied in, say, using the terms ought or good or bad or duty or obligation. It’s implied that the judgments you are making are universalizable in this way.

And that, in itself, was plausible enough, but it was open to the response: well, in that case, I’m just not going to use moral language. If moral language requires me to make universalizable prescriptions, and that means that I can’t do all sorts of things or can’t advocate all sorts of things that I would want to advocate, then I just won’t use moral language to justify my conduct. I’ll use some other kind of language, maybe prudential language, the language of furthering my self-interest. And what’s wrong with doing that? Moreover, it’s not just that they can do that, but tell me what’s wrong with them doing that.

So this is a kind of question about why act morally. And it wasn’t obvious from Hare’s view what the answer to that would be, and, in particular, it didn’t seem that there would be any kind of answer along the lines of “that’s irrational” or “you’re missing something.” It seemed, really, as if it was an open choice that you had, whether to use moral language or not.

So as I got further into the problem, I tried to develop arguments that would show that it was a requirement of reason, not just a requirement of moral language, that we universalize our judgments.

And yet, there was obviously a problem in fitting that into Hare’s framework, which, as I’ve been saying, was a framework within this general non-cognitivist family. For Hare, the idea that there are objective reasons for action didn’t really make sense. There were just these desires that we had, which led to us making prescriptions, plus the constraint that we universalize those prescriptions. He explicitly talked about the possibility of objective prescriptions and said that that was a kind of nonsense, which I think comes out of the general background of the kind of philosophy that came out of logical positivism, and the verificationist idea that things you couldn’t verify were nonsense, and so on. And that’s why I was pretty uncomfortable with this, but I didn’t really see better alternatives to it for some time.

And then, I guess, gradually, I was persuaded by a number of philosophers whom I respected that Hare was wrong about rejecting the idea of objective truth in morality. I talked to Tom Nagel, and probably most significant was the work of Derek Parfit, especially his work On What Matters, volumes one and two, which I saw in advance in draft form. He circulated drafts of his books to lots of people who he thought might give him some useful criticism. And so I saw that many years before it came out, and the arguments did seem to me pretty strong, particularly the objections to the kind of view that I’d held, which, by this time, was no longer usually called emotivism, but was called expressivism. But I think it’s basically a similar view, a view in the same ballpark.

And so I came to the conclusion that there is a reasonable case for saying that there are objective moral truths, and this is not just a matter of our attitudes or of our preferences universalized, but there’s something stronger going on, and it’s, in some ways, more like the objectivity of mathematical truths or perhaps of logical truths. It’s not an empirical thing. This is not something that comes in the natural world of our senses, that you can find or prove empirically. It’s rather something that is rationally self-evident, I guess, to people who reflect on it properly and think about it carefully. So that’s how I gradually made the move towards an objectivist metaethic.

Lucas: I think here, it would be really helpful if you could thoroughly unpack what your hedonistic utilitarian objectivist meta-ethics actually looks like today, specifically getting into the most compelling arguments that you found in Parfit and in Nagel that led you to this view.

Peter: First off, I think we should be clear that being an objectivist about metaethics is one thing; being a hedonist rather than a preference utilitarian is a different thing. There is some connection between them, as I’ll describe in a moment, but I could easily have become an objectivist and remained a preference utilitarian, or held some other kind of normative moral view.

Lucas: Right.

Peter: So the metaethical view is separate from that. What were the most compelling arguments here? I think one of the things that had stuck with me for a long time, and that had restrained me from moving in this direction, was the idea that it’s hard to know what you mean when you say that something is an objective truth outside the natural world. In terms of saying that things are objectively true in science, the truths of scientific investigation, we can say, well, there’s all of this evidence for it; no rational person would refuse to believe this once they were acquainted with all of this evidence. So that’s why we can say that that is objectively true.

But that’s clearly not going to work for truths in ethics, assuming of course that we’re not naturalists, that we don’t think this can be deduced from some examination of human nature or the world. I certainly don’t think that, and the people who were influential on me, Nagel and Parfit in particular, also didn’t think that.

So the only restraining question was, well, what could this really amount to? I had known, going back to the intuitionists in the early 20th century, people like W.D. Ross or, earlier, Henry Sidgwick, who was a utilitarian objectivist philosopher, that people made the parallel with mathematical proofs: that there are mathematical truths that we see as true by direct insight, by their self-evidence. But I had been concerned about this. I’d never really done a deep study of the philosophy of mathematics, but I’d been concerned because I thought there’s a case for saying that mathematical truths are analytic truths, that they’re true in virtue of the meanings of the terms, in virtue of the way we define what we mean by the numbers and by equals and the various other terms that we use in mathematics, so that it’s basically just the unpacking of an analytic system.

The philosophers whom I respected didn’t think this. That view had been more popular at the time when I was a student, and it had stuck with me for a while, and although it has not disappeared, I think it’s perhaps not as widely held now as it was then. So there was that, plus the arguments that were being made about how we understand mathematical truths and how we understand the truths of logical inference. We grasp these as self-evident. We find them undeniable, yet this is, again, a truth that is not part of the empirical world, but it doesn’t seem that it’s an analytic truth either. It doesn’t just seem to be about the meanings of the terms. It does seem that we know something when we know the truths of logic or the truths of mathematics.

On this basis, it started to seem that the idea that there are these non-empirical truths in ethics as well might be more plausible than I had thought. And I also went back and read Henry Sidgwick, who is a philosopher that I greatly admire and that Parfit also greatly admired, and looked at his arguments about what he called moral axioms, which obviously makes the parallel with the axioms of mathematics.

I looked at them and they did seem to me difficult to deny; that is, claims, for example, that there’s no reason for preferring one moment of our existence to another in itself. In other words, we shouldn’t discount the future, except for things like uncertainty; otherwise, the future is just as important as the present. Another is an idea somewhat similar to Hare’s universalizability, but somewhat differently stated by Sidgwick: that if something is right for someone, then it’s right independently of the identities of the people involved. But for Sidgwick, as I say, that was, again, a truth of reason, not simply an implication of the use of particular moral terms. Thinking about that, it started to seem right to me, too.

And, I guess, finally, there is Sidgwick’s claim that the interests of one individual are no more important than the interests of another, assuming that the goods involved, that is, the extent of their interests, are similar. Sidgwick’s claim was that people reflecting carefully on these truths can see that they’re true, and I thought about that, and it did seem to me pretty difficult to deny. Not that nobody will deny them, but they do have a self-evidence about them. That seemed to me to be a better basis for ethics than the views I’d been holding up to that point, the views that came out of, originally, emotivism and then out of prescriptivism.

There was a reasonable chance that that was right, so, as you say, I came to give it more credence than I had. It’s not that I’m 100% certain that it’s right by any means, but it’s a plausible view that’s worth defending and trying to see what objections people make to it.

Lucas: I think there are three things here that would be helpful for us to dive into more. The first thing is this non-naturalism versus naturalism argument, and this isn’t a part of metaethics which I’m particularly acquainted with, so, potentially, you can help guide us through this part a little bit more. Your view, I believe you’re claiming, is a non-naturalist view: you’re claiming that you cannot deduce the axioms of ethics or the basis of ethics from a descriptive or empirical account of the universe?

Peter: That’s right. There certainly are still naturalists around. I guess Peter Railton is a well-known, contemporary, philosophical naturalist. Perhaps Frank Jackson, my Australian friend and colleague. And some of the naturalist views have become more complicated than they used to be. I suppose the original idea of naturalism that people might be more familiar with is simply the claim that there is a human nature and that acting in accordance with that human nature is the right thing to do, so you describe human nature and then you draw from that the characteristics that we ought to follow.

That, I think, just simply doesn’t work. I think it has its origins in a religious framework in which you think that God has created our nature with particular purposes that we should behave in certain ways. But the naturalists who defend it, going back to Aquinas even, maintain that it’s actually independent of that view.

If, in fact, you take an evolutionary view of human nature, as I think we should, then our nature is morally neutral. You can’t derive any moral conclusions from what our nature is like. It might be relevant to know what our nature is like in order to know that if you do one thing, that might lead to certain consequences. But it’s quite possible that, for example, it is in our nature to seek power and to use force to obtain it, or, on a group level, to go to war in order to have power over others, and yet naturalists wouldn’t want to say that those are the right things to do. They would try to give some account of why that’s a corruption of human nature.

Lucas: Putting aside naturalist accounts that involve human nature, what about a purely descriptive or empirical understanding of the world, which includes, for example, sentient beings and suffering, and suffering is like a substantial and real ontological fact of the universe and the potential of deducing ethics from facts about suffering and what it is like to suffer? Would that not be a form of naturalism?

Peter: I think you have to be very careful about how you formulate this. What you said sounds a little bit like what Sam Harris says in his book, The Moral Landscape, which does seem to be a kind of naturalism because he thinks that you can derive moral conclusions from science, including exactly the kinds of things that you’ve talked about, but I think there’s a gap there, and the gap has to be acknowledged. You can certainly describe suffering and you can describe happiness conversely, but you need to get beyond description if you’re going to have a normative judgment. That is if you’re gonna have a judgment that says what we ought to do or what’s the right thing to do or what’s a good thing to do, there’s a step that’s just being left out.

If somebody says sentient beings can suffer pain or they can be happy, this is what suffering and pain are like, this is what being happy is like; therefore, we ought to promote happiness. This goes back to David Hume, who pointed out that various moral arguments describe the world using is, is, is, this is the case, and then, suddenly, but without any explanation, they say and therefore, we ought. It needs to be explained how you get from the “is” statements to the “ought” statements.

Lucas: It seems that reason, whatever reason might be and however you might define that, seems to do a lot of work at the foundation of your moral view because it seems that reason is what leads you towards the self-evident truth of certain foundational ethical axioms. Why might we not be able to pull the same sort of move with a form of naturalistic moral realism like Sam Harris develops by simply stating that given a full descriptive account of the universe and given first person accounts of suffering and what suffering is like, that it is self-evidently true that built into the nature of that sort of property or part of the universe is that it ought to be diminished?

Peter: Well, if you’re saying that … There is a fine line, maybe this is what you’re suggesting, between saying that from the description we can deduce what we ought to do, and saying that when we reflect on what suffering is and when we reflect on what happiness is, we can see that it is self-evident that we ought to promote happiness and we ought to reduce suffering. So I regard that as a non-naturalist position, but you’re right that the two come quite close together.

In fact, this is one of the interesting features of volume three of Parfit’s On What Matters, which was only published posthumously, but was completed before he died, and in that, he responds to essays that are in a book that I edited called Does Anything Really Matter. The original idea was that he would respond in that volume, but, as often happened with Parfit, he wrote responses at such length that it needed to be a separate volume. It would’ve made the work too bulky to put them together, but Peter Railton had an essay in Does Anything Really Matter, and Parfit responded to it, and then he invited Railton to respond to his response, and, essentially, they are saying that yeah, their views have become closer anyway, there’s been a convergence, which is pretty unusual in philosophy because philosophers tend to emphasize the differences between their views.

Between what Parfit calls his non-natural objectivist view and Railton’s naturalist view, because Railton’s is a more sophisticated naturalist view, the line starts to become a little thin, I agree. But, to me, the crucial thing is that you’re not just saying here’s this description; therefore, we ought to do this. You’re saying that if we understand what we’re talking about here, we can have, as an intuition of self-evidence, the proposition that it’s good to promote this or it’s good to try to prevent this. So that’s the moral proposition, that it is good to do this. And that’s the proposition for which you have to take some other step. You can say it’s self-evident, but you have to take some other step beyond simply saying this is what suffering is like.

Lucas: Just to sort of capture and understand your view a bit more here, and going back to, I think, mathematics and reason and what reason means to you and how it operates at the foundation of your ethics, I think that a lot of people will sort of get lost, or potentially feel it is maybe an arbitrary or cheap move, to …

When thinking about the foundations of mathematics, there are foundational axioms which are self-evidently true, which no one will deny, and then translating that move into the foundations of ethics, into determining what we ought to do, it seems like a lot of people would be lost there, and there would be a lot of foundational disagreement. When is it permissible or okay or rational to make that sort of move? What does it mean to say that these really foundational parts of ethics are self-evidently true? How is it not the case that that’s simply an illusion, or simply a byproduct of evolution, such that we’re confused into thinking that certain fictions we’ve evolved are self-evidently true?

Peter: Firstly, let me say, as I’ve mentioned before, I don’t claim that we can be 100% certain about moral truths, but I do think that it’s a plausible view. One reason why relates to what you just mentioned, being a product of evolution, and this is something that I argued with my co-author Katarzyna de Lazari-Radek in the 2014 book we wrote called The Point of View of the Universe, which is, in fact, a phrase from Sidgwick. That argument is that there are many moral judgments that we make that we know have evolutionary origins: lots of things that we think of as wrong originated because it would not have helped us to survive, or would not have helped a small tribal group to survive, to allow certain kinds of conduct. And some of those we might wanna reject today.

We might think, for example, we have an instinctive repugnance toward incest, but Jonathan Haidt has shown that even if you describe a case where adult brothers and sisters choose to have sex and nothing bad happens as a result (their relationship remains as strong as ever, they have fun, and that’s the end of it), people still say oh, somehow that’s wrong. They try to make up reasons why it’s wrong. That, I think, is an example of an evolved impulse which, perhaps, is no longer really apposite because we have effective contraception, and so the evolutionary reasons why we might want to avoid incest are not necessarily there.

But in the case of the kinds of things that I’m talking about and that Sidgwick is talking about, like the idea that everyone’s good is of equal significance, it’s hard to perceive why we would’ve evolved to have that attitude because, in fact, it seems harmful to our prospects of survival and reproduction to give equal weight to the interests of complete strangers.

Yet people do think this, and if you look at a whole lot of different, independent, historical ethical traditions in different cultures and different parts of the world at different times, you do find many thinkers who converge on something like this idea in various formulations. So why do they converge on this, given that it doesn’t seem to have that evolutionary justification or explanation as to why it would’ve evolved?

I think that suggests that it may be a truth of reason. And, of course, you may then say, well, but reason has also evolved, and indeed it has, but I think reason may be a little different in that we evolved a capacity to reason because it met various specific problem-solving needs and helped us to survive in lots of circumstances. But it may then enable us to see things that have no survival value, just as, no doubt, simple arithmetic has a survival value, but understanding the truths of higher mathematics doesn’t really have a survival value. So maybe, similarly in ethics, there are some of these more abstract universal truths that don’t have a survival value, but for which, nevertheless, the best explanation for why many people seem to come to these views is that they’re truths of reason, and once we’re capable of reasoning, we’re capable of understanding these truths.

Lucas: Let’s start off with reason and reason alone. When moving from reason, and thinking, I guess, alongside here about mathematics for example, how is one moving specifically from reason to moral realism, and what is the metaphysics of this kind of moral realism in a naturalistic universe without anything supernatural?

Peter: I don’t think that it needs to have a very heavyweight metaphysical presence in the universe. Parfit actually avoided the term realism in describing his view. He called it non-naturalistic normative objectivism because he thought that realism carried this idea that it was part of the furniture of the universe, as philosophers say: that the universe consists of the various material objects, but in addition to that, it consists of moral truths, as if they’re somehow sort of floating there out in space, and that’s not the right way to think about it.

I’d say, rather, the right way to think about it is as we do with logical and mathematical truths: once beings are capable of a certain kind of thought, they can move towards these truths. They have the potential and capacity for thinking along these lines. One of the claims that I would make as a consequence of my acceptance of objectivism in ethics, as a rationally based objectivism, is that the morality that we humans have developed on Earth, at this more abstract, universal level anyway, is something that aliens from another galaxy could also have achieved if they had similar capacities of thought, or maybe greater capacities of thought. It’s a possible logical space, you could say, or a rational space, that is there, and that beings may be able to discover once they develop those capacities.

You can see mathematics in that way, too. It’s one of a number of possible ways of seeing mathematics and of seeing logic: that they’re timeless things, truths or laws, if you like, but they don’t exist in the sense in which the physical universe exists.

Lucas: I think that’s really a very helpful way of putting it. So the claim here is that through reason, one can develop the axioms of mathematics and then eventually develop quantum physics and other things. And similarly, when reason is applied to thinking about what one ought to do, or to thinking about the status of sentient creatures, one is applying logic and reason to this rational space, and this rational space has truths in the same way that mathematics does?

Peter: Yes, that’s right. It has perhaps only a very small number (Sidgwick came up with three axioms), perhaps only a very small number of fairly abstract truths, but they are truths. That’s the important aspect. They’re not just particular attitudes which beings who evolved as homo sapiens are likely to understand and accept, but which beings who evolved in a different galaxy in a quite different way would not accept. My claim is that if they are also capable of reasoning, if evolution had again produced rational beings, they would be able to see the truths in the same way as we can.

Lucas: So spaces of rational thought and of logic, which can or cannot be explored, seem very conceptually queer to me, such that I don’t even really know how to think about them. I think that one would worry that one is applying reason, whatever reason might be, to a fictional space. I mean, you were discussing earlier that some people believe mathematics to be simply the formalization of what is analytically true about the terms and judgments and the axioms, and then it’s just a systematization of that and an unpacking of it from beginning into infinity. And so, I guess, it’s unclear to me how one can discern spaces of rational inquiry which are real from ones which are anti-real or fictitious. Does that make sense?

Peter: It’s a problem. I’m not denying that there is something mysterious. I thought maybe my former professor, R.M. Hare, would have said queer … No, it was John Mackie, actually; John Mackie was also at Oxford when I was there, and he said these must be very queer things if there are some objective moral truths. I’m not denying that, in a way, it would be much simpler if we could explain everything in terms of empirical examination of the natural world and say there’s only that, plus there are formal systems, analytic systems.

But I’m not persuaded that that’s a satisfactory explanation of mathematics or logic either. Those who are convinced that this is a satisfactory way of explaining logic and mathematics may well think that they don’t need this explanation of ethics either, but if we need to appeal to something outside the natural realm to understand some of the other ways in which we reason, then perhaps ethics is another candidate for this.

Lucas: So just drawing parallels again here with mathematics, ’cause I think it’s the most helpful. Mathematics is incredible for helping us to describe and predict the universe. The president of the Future of Life Institute, Max Tegmark, develops an idea of potential mathematical Platonism or realism, where the universe can be understood primarily, and sort of ontologically, as a mathematical object within, potentially, a multiverse, because as we look into the properties and features of quarks and the building blocks of the universe, all we find is more mathematical properties and mathematical relationships.

So within the philosophy of math, there are certainly, it seems, open questions about what math is and what the relation of mathematics is to the fundamental metaphysics and ontology of the universe and potential multiverse. So in terms of ethics, what information or insight do you think we’re missing that could further inform our view that there potentially is objective morality, whatever that means, or inform us that there is a space of moral truths which can be arrived at by non-anthropocentric minds, like the alien minds you said could arrive at the moral truths as they could also arrive at mathematical truths?

Peter: So what further insight would show that this was correct, other, presumably, than the arrival of aliens who start swapping mathematical theorems with us?

Lucas: And have arrived at the same moral views. For example, if they show up and they’re like hey, we’re hedonistic consequentialists and we’re really motivated to-

Peter: I’m not saying they’d necessarily be hedonistic consequentialists, but they would-

Lucas: I think they should be.

Peter: That’s a different question, right?

Lucas: Yeah, yeah, yeah.

Peter: We haven’t really discussed the steps to get there yet, so I think they’re separate questions. My idea is that they would be able to see that if we had similar interests to the ones that they did, then those interests ought to get similar weight, that they shouldn’t ignore our interests just because we’re not members of whatever civilization or species they are. I would hope that if they are rationally sophisticated, they would at least be able to see that argument, right?

Some of them, just as with us, might see the argument and then say yes, but I love the taste of your flesh so much I’m gonna kill you and eat you anyway. So, like us, they may not be purely rational beings. We’re obviously not purely rational beings. But if they can get here and contact us somehow, they should be sufficiently rational to be able to see the point of the moral view that I’m describing.

But that wasn’t a very serious suggestion about waiting for the aliens to arrive, and I’m not sure that I can give you much of an answer to say what further insights are relevant here. Maybe it’s interesting to try and look at this cross-culturally, as I was saying, and to examine the way that great thinkers of different cultures and different eras have converged on something like this idea, despite the fact that it seems unlikely to have been directly produced by evolution in the same way that our other, more emotionally driven moral reactions were.

I don’t know that the argument can go any further, and it’s not completely conclusive, but I think it remains plausible. You might say well, that’s a stalemate. Here are some reasons for thinking morality’s objective and other reasons for rejecting that, and that’s possible. That happens in philosophy. We get down to bedrock disagreements and it’s hard to move people with different views.

Lucas: What is reason? One could also view reason as some human-centric bundle of both logic and intuitions, and one can be mindful that the intuitions, which are sort of bundled with this logic, are almost arbitrary consequences of evolution. So what is reason fundamentally and what does it mean that other reasonable agents could explore spaces of math and morality in similar ways?

Peter: Well, I would argue that there are common principles that don’t depend on our specific human nature and don’t depend on the path of our evolution. I accept that the path of our evolution has given us the capacity to solve various problems through thought, that that is what our reason amounts to, and that therefore we have insight into these truths that we would not have if we did not have that capacity. But this kind of reasoning, you can think of as something that goes beyond specific problem-solving skills to insights into laws of logic, laws of mathematics, and laws of morality as well.

Lucas: When we’re talking about axiomatic parts of mathematics and logic and, potentially, ethics here, as you were claiming with this moral realism, how is it that reason allows us to arrive at the correct axioms in these rational spaces?

Peter: We developed the ability, when we’re presented with these things, to consider whether we can deny them or not, whether they are truly self-evident. We can reflect on them, we can talk to others about them, we can consider biases that we might have that might explain why we believe them and see whether there are any such biases, and once we’ve done all that, we’re left with the insight that some things we can’t deny.

Lucas: I guess I’m just sort of poking at this idea of self-evidence here, which is doing a lot of work in the moral realism. Whether or not something is self-evident, at least to me, seems like a feeling: I just look at the thing and I’m like, clearly that’s true, and if I get a little bit meta, I ask, okay, why is it that I think that this thing is obviously true? Well, I don’t really know; it just seems self-evidently true. It just seems so, and this, potentially, is just a consequence of evolution and of being imbued with whatever reason is. So I don’t know if I can always trust my intuitions about things being self-evidently true. I’m not sure how to navigate my intuitions and views of what is self-evident in order to come upon what is true.

Peter: As I said, it’s possible that we’re mistaken, that I’m mistaken in these particular instances. I can’t exclude that possibility, but it seems to me that there’s the hypothesis that we hold these views because they are self-evident; I’ve looked for evolutionary explanations and, as I’ve said, I’ve not really found them, so that’s as far as I can go with that.

Lucas: Just moving along here a little bit, and I’m becoming increasingly mindful of your time, would you like to cover briefly this sort of shift that you had from preference utilitarianism to hedonistic utilitarianism?

Peter: So, again, let’s go back to my autobiographical story. For Hare, the only basis for making moral judgments was to start from our preferences and then to universalize them. There could be no arguments about something else being intrinsically good or bad, whether it was happiness or whether it was justice or freedom or whatever because that would be to import some kind of objective claims into this debate that just didn’t have a place in his framework, so all I could do was take my preferences and prescribe them universally, and, as I said, that involved putting myself in the position of the others affected by my action and asking whether I could still accept it.

When you do that, and if, let’s say, your action affects many people, not just you and one other, what you’re really doing is trying to sum up how this would be from the point of view of every one of these people. So if I put myself in A’s position, would I be able to accept this? But then I’ve gotta put myself in B’s position as well, and C, and D, and so on. And to say can I accept this prescription universalized is to say: if I were living the lives of all of those people, would I want this to be done or not? And that’s, as they say, a kind of summing of the extent to which doing this satisfies everyone’s preferences on balance, after deducting, of course, the ways in which it thwarts or frustrates or is contrary to their preferences.

So this seemed to be the only way in which you could go further with Hare’s views as he eventually worked them out, and changed them a little bit over the years, in his later formulations. So it was a kind of preference utilitarianism that it led to, and I was reasonably happy with that, and I accepted the idea that this meant that what we ought to be doing is to maximize the satisfaction of preferences and avoid thwarting them.

And it gives you, in many cases, of course, somewhat similar conclusions to what you would say if what we wanna do is maximize happiness and minimize suffering or misery, because for most people, happiness is something that they very much desire and misery is something that they don’t want. Some people might have different preferences that are not related to that, but for most people, it will probably come down some way or other to how it relates to their well-being, their interests.

There are certainly objections to this, and some of the objections relate to preferences that people have when they’re not fully informed about things. And Hare’s view was that, in fact, the preferences that we should universalize are the preferences people would have when they are fully informed and when they’re thinking calmly, when they’re not, let’s say, angry with somebody and therefore have a strong preference to hit him in the face, even though this will be bad for them and bad for him.

So the preference view sort of then took this further step of saying it’s the preferences that you would have if you were well informed and rational and calm, and that seemed to solve some problems with preference utilitarianism, but it gave rise to other problems. One of the problems was: does this mean that if somebody is misinformed in a way that you can be pretty confident they’re never gonna be correctly informed about, you should still do what they would want if they were correctly informed?

An example of this might be someone who’s a very firm religious believer and has been all their life, and let’s say one of their religious beliefs is that having sex outside marriage is wrong because God has forbidden it; let’s say it’s contrary to the commandments or whatever. But given that, let’s just assume, there is no God, and therefore no commandments that God made against sex outside marriage, and given that if they didn’t believe in God, they would be happy to have sex outside marriage, and this would make them happier and would make their partner happy as well, should I somehow try to wangle things so that they do have sex outside marriage even though, as they are now, they prefer not to?

And that seems a bit of a puzzle, really. It seems highly paternalistic to ignore their preferences on the basis of their beliefs, even though you’re convinced that their beliefs are false. So there are puzzles and paradoxes like that. And then there was another argument that does actually, again, come out of Sidgwick, although I didn’t find it in Sidgwick until I read it in other philosophers later.

Again, I think Peter Railton is one who uses this, and that is that if you’re really asking what people would do if they were rational and fully informed, you have to make judgments about what a rational and fully informed view is in this situation. And that might involve even the views that we’ve just been discussing: that if you were rational, you would know what the objective truth was and you would want to do it. So, at that level, a preference view actually seems to amount to a different view, an objectivist view, where you would actually have to know what things were good.

So, as I say, it had a number of internal problems, even if you just assume the meta-ethic that I was taking from Hare originally. But then if, as happened with me, you become convinced that there can be objective moral truths, this, in some ways, opened up the field to other possible ideas as to what is intrinsically good, because now you could argue that something was intrinsically good even if it was not something that people preferred. And in that light, I went back to reading some of the classical utilitarians, again, particularly Sidgwick and his arguments for why happiness, rather than the satisfaction of desires, is the ultimate value, something that is of intrinsic value, and it did seem to overcome these problems with preference utilitarianism that had been troubling me.

It certainly had some paradoxes of its own, some things that it seemed not to handle as well, but after thinking about it, again, I decided that it was more likely than not that a hedonistic view was the right view. I wouldn’t put it more strongly than that. I still think preference utilitarianism has some things to be said for it, and there are also, of course, views that say yes, happiness is intrinsically good and suffering is intrinsically bad, but they’re not the only things that are intrinsically good or bad; there are things like justice or freedom or whatever. There are various other candidates that people have put forward, many of them claimed to be objectively good or bad. So those are also possibilities.

Lucas: When you mention that happiness, or certain sorts of conscious states of sentient creatures, can be seen as intrinsically good or valuable, keeping in mind the moral realism that you hold, what is the metaphysical status of experiences in the universe given this view? Is it that happiness is good based off of the application of reason in the rational space of ethics? Unpack the ontology of happiness and the metaphysics here a bit.

Peter: Well, of course it doesn’t change what happiness is to say that it’s of intrinsic value, but that is the claim that I’m making: that the world is a better place if it has more happiness in it and less suffering in it. That’s a judgment that I’m making about the state of the universe. Obviously, there have to be beings who can be happy or can be miserable, and that requires a conscious mind, but the judgment that the universe is better with more happiness and less suffering is mind independent. Let’s imagine that there were beings that could feel pain and pleasure but could not make any judgments about anything of value; they’re like some non-human animals, I guess. It would still be the case that the universe was better if those non-human animals suffered less and had more pleasure.

Lucas: Right. Because it would be a sort of intrinsic quality or property of the experience that it be valuable or disvaluable. So yeah, thanks so much for your time, Peter. It’s really been wonderful and informative. If people would like to follow you or check you out somewhere, where can they go ahead and do that?

Peter: I have a website, which actually I’m in the process of reconstructing a bit, but it’s petersinger.info. There’s a Wikipedia page. If they wanna look at things that I’m involved in, they can look at thelifeyoucansave.org, which is the nonprofit organization that I’ve founded that recommends effective charities that people can donate to. That probably gives people a bit of an idea. There are books that I’ve written that discuss these things. I’ve mentioned The Point of View of the Universe, which goes into the things we’ve discussed today probably more thoroughly than anything else. For people who don’t wanna read a big book, there’s also Oxford University Press’s Very Short Introduction series: the book on utilitarianism is, again, co-authored by Katarzyna de Lazari-Radek and myself, the same co-author as The Point of View of the Universe, and that’s just a hundred-page version of some of these arguments we’ve been discussing.

Lucas: Wonderful. Well, thanks again, Peter. We haven’t ever met in person, but hopefully I’ll catch you around the Effective Altruism conference track sometime soon.

Peter: Okay, hope so.

Lucas: Alright, thanks so much, Peter.

Hey, it’s post-podcast Lucas here, and I just wanted to chime in with some of my thoughts and tie this all into AI thinking. For me, the most consequential aspect of this space, and of moral philosophy generally, is how much disagreement there is between people who’ve thought long and hard about these issues, what an enormous part of AI alignment this makes up, and the effects different moral and meta-ethical views have on preferred AI alignment methodology.

Current general estimates by AI researchers put human-level AI on the decade-to-century timescale, with about a 50% probability by mid-century and that probability obviously increasing over time, and it’s quite obvious that moral philosophy, ethics, and issues of value and meaning will not be solved on that timescale. So if we assume the worst-case success story, where technical alignment, coordination, and strategy issues continue in their standard, rather morally messy way, with how we currently unreflectively deal with things, where moral information isn’t taken very seriously, then I’m really hoping that technical alignment and coordination succeed well enough for us to create a very minimally aligned system that we’re able to pull the brakes on, and work hard on issues of value, ethics, and meaning: the end towards which that AGI will be aimed. Otherwise, it seems very clear that, given all of this shared moral uncertainty, we risk value drift or catastrophically suboptimal or even negative futures.

Turning to Peter’s views that we discussed here today: if the axioms of morality are accessible through reason alone, as the axioms of mathematics appear to be, then we ought to consider the implications for how we want to progress with AI systems and AI alignment more generally.

If we take human beings to be agents of limited or semi-rationality, then we could expect that some of us, or some fraction of us, have gained access to what might potentially be core axioms of the logical space of morality. When AI systems are trained on human data in order to infer and learn human preferences, given Peter’s view, this could be seen as a way of learning the moral thinking of imperfectly rational beings. This, or any empirical investigation, given Peter’s views, would not be able to arrive at any clear moral truth; rather, it would find areas where semi-rational beings like ourselves generally tend to converge in this space.

This would be useful or potentially passable up until AGI, but if such a system is to be fully autonomous and safe, then a more robust form of alignment is necessary. Putting aside whatever reason might be and how it gives rational creatures access to self-evident truths and rational spaces, if the AGI we create is a fully rational agent, then it would perhaps arrive at the self-evident truths of mathematics and logic, and even morality, just as aliens on another planet might if they were fully rational, as is Peter’s view. If so, this would potentially be evidence of this view being true. We can also reflect here that an AGI, using reason to gain insight into the core truths of logical spaces, could reason much better and more impartially than any human, and so fully explore and realize universal truths of morality.

At this point, we would essentially have a perfect moral reasoner on our hands with access to timeless universal truths. Now the question would be: could we trust it, and what reasoning or explanation given to humans by this moral oracle would ever be sufficient to satisfy and satiate our appetite and desire to know moral truth, and to be sure that we have arrived at moral truth?

It’s above my pay grade what rationality or reason actually is, what it might be prior to certain logical and mathematical axioms, how such a truth-seeking meta-awareness can grasp these truths as self-evident, or whether the self-evidence of the truths of mathematics and logic was programmed into us by evolution trying and failing over millions of years. But maybe that’s an issue for another time. Regardless, we’re doing philosophy, computer science, and political science on a deadline, so let’s keep working on getting it right.

If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Moral Uncertainty and the Path to AI Alignment with William MacAskill

How are we to make progress on AI alignment given moral uncertainty?  What are the ideal ways of resolving conflicting value systems and views of morality among persons? How ought we to go about AI alignment given that we are unsure about our normative and metaethical theories? How should preferences be aggregated and persons idealized in the context of our uncertainty?

Moral Uncertainty and the Path to AI Alignment with William MacAskill is the fifth podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with William MacAskill. Will is a professor of philosophy at the University of Oxford and is a co-founder of the Center for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement and his writing is mainly focused on issues of normative and decision theoretic uncertainty, as well as general issues in ethics.

Topics discussed in this episode include:

  • Will’s current normative and metaethical credences
  • The value of moral information and moral philosophy
  • A taxonomy of the AI alignment problem
  • How we ought to practice AI alignment given moral uncertainty
  • Moral uncertainty in preference aggregation
  • Moral uncertainty in deciding where we ought to be going as a society
  • Idealizing persons and their preferences
  • The most neglected portion of AI alignment
In this interview we discuss ideas contained in the work of William MacAskill. You can learn more about Will’s work here, and follow him on social media here. You can find Gordon Worley’s post here and Rob Wiblin’s previous podcast with Will here.  You can hear more in the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series at the Future of Life Institute. I’m Lucas Perry, and today we’ll be speaking with William MacAskill on moral uncertainty and its place in AI alignment. If you’ve been enjoying this series and finding it interesting or valuable, it’s a big help if you can share it on social media and follow us on your preferred listening platform.

Will is a professor of philosophy at the University of Oxford and is a co-founder of the Center for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement and his writing is mainly focused on issues of normative and decision theoretic uncertainty, as well as general issues in ethics. And so, without further ado, I give you William MacAskill.

Yeah, Will, thanks so much for coming on the podcast. It’s really great to have you here.

Will: Thanks for having me on.

Lucas: So, I guess we can start off. You can tell us a little bit about the work that you’ve been up to recently in terms of your work in the space of metaethics and moral uncertainty just over the past few years and how that’s been evolving.

Will: Great. My PhD topic was on moral uncertainty, and I’m just putting the finishing touches on a book on this topic. The idea here is to appreciate the fact that we very often are just unsure about what we ought, morally speaking, to do. It’s also plausible that we ought to be unsure about what we ought morally to do. Ethics is a really hard subject, there’s tons of disagreement, it would be overconfident to think, “Oh, I’ve definitely figured out the correct moral view.” So my work focuses on not really the question of how unsure we should be, but instead what should we do given that we’re uncertain?

In particular, I look at the issue of whether we can apply the same sort of reasoning that we apply to uncertainty about matters of fact to matters of moral uncertainty. Can we use what is known as “expected utility theory”, which is very widely accepted as at least approximately correct for empirical uncertainty, in the same way in the case of moral uncertainty?

Lucas: Right. And so coming on here, you also have an unpublished book on moral uncertainty that you’ve been working on. Have you just been expanding this exploration in that book, diving deeper into it?

Will: That’s right. There’s actually been very little written on the topic of moral uncertainty, at least in modern times, relative to its importance. I would think of this as a discipline that should be studied as much as consequentialism or contractualism or Kantianism is studied. But in modern times there’s really only one book that’s been written on the topic, and that was published 18 years ago now. What we want is for this to be, firstly, a kind of definitive introduction to the topic. It’s co-authored with me as lead author, along with Toby Ord and Krister Bykvist, laying out what we see as the most promising path forward in terms of addressing some of the challenges that face an account of decision-making under moral uncertainty, some of the implications of taking moral uncertainty seriously, and also just some of the unanswered questions.

Lucas: Awesome. So I guess, just moving forward here, you have a podcast that you already did with Rob Wiblin of 80,000 Hours. So I guess we can avoid covering a lot of the basics here about your views on using the expected utility calculus in moral reasoning and moral uncertainty in order to decide what one ought to do when one is not sure what one ought to do. People can go ahead and listen to that podcast, which I’ll provide a link to in the description.

It would also be good just to get a general sense of where your metaethical partialities generally tend to lie right now. So what sort of metaethical positions do you tend to give the most credence to?

Will: Okay, well that’s a very well put question ’cause, as with all things, I think it’s better to talk about degrees of belief rather than absolute belief. So normally if you ask a philosopher this question, we’ll say, “I’m a nihilist,” or “I’m a moral realist,” or something, so I think it’s better to split your credences. So I think I’m about 50/50 between nihilism or error theory and something that’s non-nihilistic.

By nihilism or error theory, I just mean the view that any positive moral statement or normative statement or evaluative statement is false. That includes: you ought to maximize happiness. Or: if you want a lot of money, you ought to become a banker. Or: pain is bad. On this view, all of those things are false. All positive, normative, or evaluative claims are false. So it’s a very radical view. And we can talk more about that, if you’d like.

In terms of the rest of my credence, the view that I’m kind of most sympathetic towards in the sense of the one that occupies most of my mental attention is a relatively robust form of moral realism. It’s not clear whether it should be called kind of naturalist moral realism or non-naturalist moral realism, but the important aspect of it is just that goodness and badness are kind of these fundamental moral properties and are properties of experience.

The things that are of value are things that supervene on conscious states, in particular good states or bad states, and the way we know about them is just by direct experience with them. Just by being acquainted with a state like pain gives us a reason for thinking we ought to have less of this in the world. So that’s my kind of favored view in the sense it’s the one I’d be most likely to defend in the seminar room.

And then I give somewhat less credence to a couple of views. One is a view called “subjectivism”, which is the idea that what you ought to do is determined in some sense by what you want to do. The simplest view there would just be: when I say, “I ought to do X,” that just means I want to do X in some way. A more sophisticated version would be ideal subjectivism, where when I say I ought to do X, it means some very idealized version of myself would want myself to want to do X, perhaps if I had unlimited amounts of knowledge and much greater computational power and so on. I’m a little less sympathetic to that than many people I know, and we can go into that.

And then a final view that I’m also less sympathetic towards is non-cognitivism, which would be the idea that our moral statements … So when I say, “Murder is wrong,” I’m not even attempting to express a proposition. What they’re doing is just expressing some emotion of mine, like, “Yuk. Murder. Ugh,” in the same way that when I said that, that wasn’t expressing any proposition, it was just expressing some sort of pro or negative attitude. And again, I don’t find that terribly plausible, again for reasons we can go into.

Lucas: Right, so those first two views were cognitivist views, which makes them fall under a semantic theory where you think that people are making true or false statements when they claim moral facts. And error theory and your moral realism are both metaphysical views, which I think is probably what we’ll mostly be interested in here in terms of the AI alignment problem.

There are other issues in metaethics, for example having to do with semantics, as you just discussed. You feel as though you give some credence to non-cognitivism, but there are also justification views, so issues in moral epistemology: how one can know moral facts, and why one ought to follow them if there are such facts. Where do you fall in that camp?

Will: Well, I think all of those views are quite well tied together, so what sort of moral epistemology you have depends very closely, I think, on what sort of meta-ethical view you have, and I actually think, often, is intimately related as well to what sort of view in normative ethics you have. So my preferred philosophical world view, as it were, the one I’d defend in a seminar room, is classical utilitarian in its normative view, so the only thing that matters is positive or negative mental states.

In terms of its moral epistemology, the way we access what is of value is just by experiencing it, in just the same way we access conscious states. There are also some things you can’t get merely from experience: you know, why is it that we should maximize the sum of good experiences rather than the product, or something? That’s a view that you’ve got to arrive at by kind of reasoning rather than just purely from experience.

Part of my epistemology does appeal to whatever this spooky ability we have to reason about abstract affairs is, but it’s the same sort of faculty that is used when we think about mathematics or set theory or other areas of philosophy. If, however, I had some different view, so supposing I were a subjectivist, well then moral epistemology looks very different. You’re actually just reflecting on your own values, maybe looking at what you would actually do in different circumstances and so on, reflecting on your own preferences, and that’s the right way to come to the right kind of moral views.

There’s also another metaethical view called “constructivism” that I’m definitely not the best person to talk about. On that view, again, it’s not really a realist view; we just have a bunch of beliefs and intuitions, and the correct moral view is just the best kind of systematization of those beliefs and intuitions. In the same way you might think of linguistics: it is a science, but it’s fundamentally based just on what our linguistic intuitions are. It’s just kind of a systematization of them.

On that view, then, moral epistemology would be about reflecting on your own moral intuitions. You just got all of this data, which is the way things seem like to you, morally speaking, and then you’re just doing the systematization thing. So I feel like the question of moral epistemology can’t be answered in a vacuum. You’ve got to think about your meta-ethical view of the metaphysics of ethics at the same time.

Lucas: I’m pretty interested in poking a little bit more into that 50% credence you give to your moral realist view, which is super interesting because it’s a view that people tend not to have in the AI, computer science, rationality, and EA spaces. There tend to be a lot of moral anti-realists in this space.

In my last podcast, I spoke with David Pearce, and he also seemed to have a view like this, and I’m wondering if you can unpack yours a little bit. He believed that suffering and pleasure disclose the in-built pleasure/pain axis of the universe. You can think of minds as objective features of the world, because they in fact are objective features of the world, and the phenomenology and experience of each person is objective in the same way that someone could objectively be experiencing redness, and in the same sense they could be objectively experiencing pain.

It seems to me, and I don’t fully understand the view, but the claim is that there is some sort of in-built quality or property of the hedonic qualia of suffering or pleasure that discloses their in-built value.

Will: Yeah.

Lucas: Could you unpack it a little bit more about the metaphysics of that and what that even means?

Will: It sounds like David Pearce and I have quite similar views. I think relying heavily on the analogy with, or very close analogy with, consciousness is going to help. Imagine you’re kind of a robot scientist: you don’t have any conscious experiences, but you’re doing all this fancy science and so on, and then you write out the book of the world, and I’m like, “Hey, there’s this thing you missed out. It’s conscious experience.” And you, the robot scientist, would say, “Wow, that’s just insane. You’re saying that some bits of matter have this first-person subjective feel to them? Why on earth would we ever believe that? That’s just so out of whack with the naturalistic understanding of the world.” And it’s true. It just doesn’t make any sense given what we know now. It’s a very strange phenomenon to exist in the world.

Will: And so one of the arguments that motivates error theory is this idea of just, well, if values were to exist, they would just be so weird, what Mackie calls “queer”. It’s just so strange that just by a principle of Occam’s razor not adding strange things in to our ontology, we should assume they don’t exist.

But that argument would work in the same way against conscious experience, and the best response we’ve got is to say: no, but I know I’m conscious, I can just tell by introspecting. I think we can run the same sort of argument when it comes to a property of consciousness as well, namely the goodness or badness of certain conscious experiences.

So now I just want you to go totally a-theoretic. Imagine you’ve not thought about philosophy at all, or even science at all, and I were just to, say, rip off one of your fingernails or something. And then I ask, “Is that experience bad?” And you would say yes.

Lucas: Yeah, it’s bad.

Will: And I would ask, how confident are you? You’re more confident that this pain is bad than that you even have hands, perhaps. That’s at least how it seems to me. So then it seems like, yeah, we’ve got this thing that we’re actually incredibly confident of, which is the badness of pain, or at least the badness of pain for me, and so that’s what initially gives the case for thinking, okay, well, that’s at least one objective moral fact: that pain is bad, or at least that pain is bad for me.

Lucas: Right, so the step where I think people will tend to get lost in this is when … I thought the part about Occam’s razor was very interesting. I think that most people are anti-realists because they use Occam’s razor there and they think, what the hell would a value even be anyway in the third-person objective sense? That just seems really queer, as you put it. So I think people get lost at the step where the first person seems to simply have a property of badness to it.

I don’t know what that would mean if one has a naturalistic, reductionist view of the world. There seems to be just entropy, noise, and quarks, and maybe qualia as well. It’s not clear to me how we should think about properties of qualia, and whether or not one can derive normative statements from properties of qualia, moving from “is” statements about the properties of qualia to “ought” statements.

Will: One thing I want to be very clear on is that on this view it definitely is the case that we really have no idea. We are currently completely in the dark about any sort of explanation of how matter and forces and energy could result in goodness or badness, something that ought to be promoted. But that’s also true of conscious experience. We have no idea how on earth matter could result in conscious experience. At the same time, it would be a mistake to start denying conscious experience.

And then we can ask: okay, we don’t really know what’s going on, but we accept that there’s conscious experience. And then I think if you were again just to completely pre-theoretically start categorizing the different conscious experiences that we have, we’d say that some are red and some are blue, some are maybe more intense, some are kind of dimmer than others; you’d maybe classify them into sights and sounds and other sorts of experiences.

I think also a very natural classification would be the ones that are good and the ones that are bad, and then when we cash that out further, I don’t think the best explanation is that when we say “oh, this is good” or “this is bad” it means what we want or what we don’t want, but instead that it’s what we think we have reason to want or reason not to want. The experience seems to give us evidence for those sorts of things.

Lucas: I guess my concern here is just that I worry that words like “good” and “bad” or “valuable” or “disvaluable” … I feel some skepticism about whether or not they disclose some sort of intrinsic property of the qualia. I’m also not sure what the claim here is about the nature and kinds of properties that qualia can have attached to them. I worry that goodness and badness might be some sort of evolutionary fiction which enhances our fitness, but doesn’t actually disclose some sort of intrinsic metaphysical quality or property of the experience.

Will: One thing I’ll say is, again, remember that I’ve got this 50% credence on error theory, so in general, all these questions, maybe this is just some evolutionary fiction, things just seem bad but they’re not actually, and so on. I actually think those are good arguments, and so that should give us some degree of confidence in this idea that actually nothing matters at all.

But kind of underlying a lot of my views is this more general argument: if you’re unsure between two views, one on which just nothing matters at all and we’ve got no reasons for action, and another on which we do have some reasons for action, then you can just ignore the one that says you’ve got no reasons for action, because you’re not going to do badly by its lights no matter what you do. If I were to go around shooting everybody, that wouldn’t be bad or wrong on nihilism.

So if there are arguments, such as, I think, an evolutionary argument, that push us in the direction of error theory, in a sense we can put them to the side, because what we ought to do is say: yes, we take that really seriously, it gives us a high credence in error theory, but now, after all those arguments, ask which views most plausibly bear force.

So this is why, with the kind of evolutionary worry, I’m just like, yes, but suppose it’s the case that these properties actually exist. Presumably conscious experiences themselves are useful in some evolutionary way that, again, we don’t really understand. I think, presumably, good and bad experiences are also useful in some evolutionary way that we don’t fully understand, perhaps because they have a tendency to motivate at least beings like us, and that in fact seems to be a key aspect of making a goodness or badness statement: it’s at least somehow tied up with the idea of motivation.

And then when I talk about ascribing a property to a conscious experience, I really just mean whatever it is that we mean when we say that this experience is red-seeming or this experience is blue-seeming. It’s an open philosophical question what we even mean by properties, but in the same way: this is bad-seeming, this is good-seeming.

Before I got into thinking about philosophy and naturalism and so on, would I have thought those things are on a par? I think I would have done, so it’s at least a pre-theoretically justified view to think that there just is this axiological property of my experience.

Lucas: This has made me much more optimistic. I think after my last podcast I was feeling quite depressed and nihilistic, and hearing you give this sort of non-naturalistic or naturalistic moral realist account is cheering me up a bit about the prospects of AI alignment and value in the world.

Will: I mean, I think you shouldn’t get too optimistic. I’m almost certainly wrong-

Lucas: Yeah.

Will: … sort of is my favorite view. But take any philosopher. What’s the chance that they’ve got the right views? Very low, probably.

Lucas: Right, right. I think I also need to be careful here: human beings have this sort of psychological bias where we give a special metaphysical status and kind of meaning and motivation to things which have some objectivity to them. I guess there’s also some sort of motivation that I need to be mindful of, one that seeks to make value objective or more meaningful and foundational in the universe.

Will: Yeah. The thing that I think should make you feel optimistic, or at least motivated, is this argument that if nothing matters, it doesn’t matter that nothing matters. It just really ought not to affect what you do. You may as well act as if things do matter, and in fact we can have this project of trying to figure out whether things matter. That maybe could be an instrumental goal, a kind of purpose for life: to get to a place where we really can figure out if it has any meaning. I think that sort of argument can at least give one grounds for getting out of bed in the morning.

Lucas: Right. I think there’s this philosophy paper that I saw, but didn’t read, that was like, “nothing matters, but it Matters,” with one lowercase M and then a capital M, you know.

Will: Oh, interesting.

Lucas: Yeah.

Will: It sounds a bit like 4:20 ethics.

Lucas: Yeah, cool.

Moving on here into AI alignment. Before we get into this, there is something that would be interesting to hear you speak a little bit more about first. What even is the value of moral information and moral philosophy, generally? Is this all just a bunch of BS, or how can it be interesting and/or useful in our lives, and in science and technology?

Will: Okay, terrific. I mean, and this is something I write about in a paper I’m working on now and also in the book, as well.

So, yeah, I think the stereotype of the philosopher engaged in intellectual masturbation, not doing really much for the world at all, is quite a prevalent stereotype. I’ll not comment on whether that’s true for certain areas of philosophy. I think it’s definitely not true for certain areas within ethics. What is true is that philosophy is very hard, ethics is very hard. Most of the time when we’re trying to do this, we make very little progress.

If you look at the long-run history of thought in ethics and political philosophy, the influence is absolutely huge. Even just take Aristotle, Locke, Hobbes, Mill, and Marx. The influence of political philosophy and moral philosophy there, it shaped thousands of years of human history. Certainly not always for the better, sometimes for the worse, as well. So, ensuring that we get some of these ideas correct is just absolutely crucial.

Similarly, even in more recent times … Obviously not as influential as these other people, but also it’s been much less time so we can’t predict into the future, but if you consider Peter Singer as well, his ideas about the fact that we may have very strong obligations to benefit those who are distant strangers to us, or that we should treat animal welfare just on a par with human welfare, at least on some understanding of those ideas, that really has changed the beliefs and actions of, I think, probably tens of thousands of people, and often in really quite dramatic ways.

And then when we think about well, should we be doing more of this, is it merely that we’re influencing things randomly, or are we making things better or worse? Well, if we just look to the history of moral thought, we see that most people in most times have believed really atrocious things. Really morally abominable things. Endorsement of slavery, distinctions between races, subjugation of women, huge discrimination against non-heterosexual people, and, in part at least, it’s been ethical reflection that’s allowed us to break down some of those moral prejudices. And so we should presume that we have very similar moral prejudices now. We’ve made a little bit of progress, but do we have the one true theory of ethics now? I certainly think it’s very unlikely. And so we need to think more if we want to get to the actual ethical truth, if we don’t wanna be living out moral catastrophes in the same way as we would if we kept slaves, for example.

Lucas: Right, I think we do want to do that, but I think that a bit later in the podcast we’ll get into whether or not that’s even possible, given economic, political, and militaristic forces acting upon the AI alignment problem and the issues with coordination and race to AGI.

Just to start to get into the AI alignment problem, I want to offer a little bit of context. It is implicit in the AI alignment problem, or value alignment problem, that AI needs to be aligned to some sort of ethic or set of ethics; this includes preferences or values or emotional dispositions, or whatever you might believe them to be. And it seems that, in terms of moral philosophy, there are really two general methods by which to arrive at an ethic. One is through reason, and one is through observing human behavior or artifacts, like the books, movies, stories, and other things that we produce, in order to infer and discover the observed preferences and ethics of people in the world.

The latter class of alignment methodologies is empirical and involves the agent interrogating and exploring the world in order to understand what humans care about and value, as if values and ethics were simply a physical by-product of the world and of evolution. The former is where ethics are arrived at through reason alone, and involves the AI or the AGI potentially going about ethics as a philosopher would, engaging in moral reasoning about metaethics in order to determine what is correct. From the point of view of ethics, there is potentially only what humans empirically do believe, and then there is what we may or may not be able to arrive at through reason alone.

So, it seems that one or both of these methodologies can be used when aligning an AI system. And again, the distinction here is simply between preference aggregation or empirical value learning approaches on the one hand, and methods of instantiating machine ethics, reasoning, or decision-making in AI systems so they become agents of morality on the other.

So, what I really wanna get into with you now is how metaethical uncertainty influences our decision over the methodology of value alignment: whether we are to prefer an empirical preference learning or aggregation type approach, or one which involves imbuing moral epistemology and ethical metacognition and reasoning into machine systems so they can discover what we ought to do. And how moral uncertainty, and metaethical uncertainty in particular, operates within both of these spaces once you’re committed to some view, or to both of these views. Then we can get into issues in intertheoretic comparisons and how they arise here at many levels, the ideal way we should proceed if we could do what would be perfect, and again, what is actually likely to happen given race dynamics and political, economic, and militaristic forces.

Will: Okay, that sounds terrific. I mean, there’s a lot to cover there.

I think it might be worth me saying just maybe a couple of distinctions I think are relevant and kind of my overall view in this. So, in terms of distinction, I think within what broadly gets called the alignment problem, I think I’d like to distinguish between what I’d call the control problem, then kind of human values alignment problem, and then the actual alignment problem.

Where the control problem is just: can you get this AI to do what you want it to do? Where that’s maybe relatively narrowly construed. I want it to clean up my room, I don’t want it to put my cat in the bin; that’s kinda the control problem. I think describing that as a technical problem is kind of broadly correct.

Second is then what gets called aligning AI with human values. For that, it might be the case that just having the AI pay attention to what humans actually do and infer their preferences that are revealed on that basis, maybe that’s a promising approach and so on. And that I think will become increasingly important as AI becomes larger and larger parts of the economy.

This is kind of already what we do when we vote for politicians who represent at least large chunks of the electorate. They hire economists who undertake kind of willingness-to-pay surveys and so on to work out what people want, on average. I do think that this is maybe more normatively loaded than people might often think, but at least you can understand it like this: just as the control problem is I have some relatively simple goal, which is, what do I want? I want this system to clean my room. How do I ensure that it actually does that without making mistakes that I wasn’t intending? This is the kind of broader problem of, well, you’ve got a whole society and you’ve got to aggregate their preferences for what kind of society it wants and so on.

But I think, importantly, there’s this third thing which I called a minute ago the actual alignment problem, so let’s run with that. Which is just working out what’s actually right and what’s actually wrong and what we ought to be doing. I do have a worry that, because many people in the wider world, when they start thinking philosophically, start endorsing some relatively simple subjectivist or relativist views, they might think that answering this question of, well, what do humans want, or what do people want, is just the same as answering what we ought to do. Whereas for kind of the reductio of that view, just go back a few hundred years, where the question would have been the white man’s alignment problem: “Well, what do we want, society?”, where that means white men.

Lucas: Uh oh.

Will: What do we want them to do? So similarly, unless you’ve got the kind of such a relativist view that you think that maybe that would have been correct back then, that’s why I wanna kind of distinguish this range of problems. And I know that you’re kind of most interested in that third thing, I think. Is that right?

Lucas: Yeah, so I think I’m pretty interested in the second and the third thing, and I just wanna unpack a little bit of your distinction between the first and the second. So, the first was what you called the control problem, and you called the second just the plurality of human values and preferences and the issue of aligning to that in the broader context of the world.

It’s unclear to me how I get the AI to put a strawberry on the plate, or to clean up my room and not kill my cat, without the second thing having been done, at least to me.

There is a sense at a very low level where you’re sort of working on technical AI alignment, which involves working on the MIRI approach with agent foundations, and trying to work on constrained optimization and corrigibility and docility and robustness and security and all of those sorts of things that people work on, and the concrete problems in AI safety, stuff like that. But it’s unclear to me where that sort of stuff is limited to the control problem, and where it begins requiring the system to be able to learn my preferences through interacting with me, thereby already sort of participating in the second case, in AI alignment more generally, rather than being sort of a low-level controlled system.

Will: Yeah, and I should say that on this side of things I’m definitely not an expert, not really the person to be talking to, but I think you’re right. There’s going to be some big gray area, or transition, between systems. So there’s one that might be cleaning my room, or even, let’s just say it’s playing some sort of game. Unfortunately I forget the example … it was in a blog post, an example of the alignment problem in the wild, or something, from OpenAI. But, just a very simple example of the AI playing a game, and you say, “Well, get as many points as possible.” And what you really want it to do is win a certain race, but what it ends up doing is driving this boat just round and round in circles, because that’s the way of maximizing the number of points.

Lucas: Reward hacking.

Will: Reward hacking, exactly. That would be a kind of failure of the control problem, the first in our sense. And then I believe there’s gonna be kind of gray areas, where perhaps it’s a certain sort of AI system where the whole point is it’s just implementing kind of what I want. And that might be very contextually determined, might depend on what my mood is that day. That might be a much, much harder problem and will involve kind of studying what I actually do and so on.
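The boat-race failure described here can be sketched in a few lines. This is a hypothetical toy, not the actual OpenAI environment or its real point values: the designer intends “win the race,” but the reward only counts points, so a naive point-maximizer prefers looping forever.

```python
# Toy reward-hacking sketch (hypothetical numbers, not the real OpenAI demo).
# The designer intends "win the race", but the agent only sees "points".

def total_points(policy, horizon=100):
    """Points earned over a fixed horizon under a given policy."""
    points = 0
    for step in range(horizon):
        if policy == "finish_race":
            if step == 20:       # crossing the finish line ends the episode
                return points + 50
        elif policy == "loop_in_circles":
            points += 3          # a respawning target is hit every step
    return points

scores = {p: total_points(p) for p in ("finish_race", "loop_in_circles")}
best = max(scores, key=scores.get)
# The point-maximizer chooses to loop and never finishes the race.
```

The gap between what the designer wanted and what the reward rewards is exactly the control-problem failure being discussed.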

We could go into the question of whether you can solve the problem of cleaning a room without killing my cat. Whether that is possible to solve without solving much broader questions, maybe that’s not the most fruitful avenue of discussion.

Lucas: So, let’s put aside this first case which involves the control problem, we’ll call it, and let’s focus on the second and the third, where again the second is defined as sort of the issue of the plurality of human values and preferences which can be observed, and then the third you described as us determining what we ought to do and tackling sort of the metaethics.

Will: Yeah, just tackling the fundamental question of, “Where ought we to be headed as a society?” One just extra thing to add onto that is that’s just a general question for society to be answering. And if there are kind of fast, or even medium-speed, developments in AI, perhaps suddenly we’ve gotta start answering that question, or thinking about that question even harder in a more kind of clean way than we have before. But even if AI were to take a thousand years, we’d still need to answer that question, ’cause it’s just fundamentally the question of, “Where ought we to be heading as a society?”

Lucas: Right, and so going back a little bit to the little taxonomy that I had developed earlier, it seems like your second case scenario would come down to metaethical questions, which lie behind and influence the empirical issues with preference aggregation and there being a plurality of values. And the third case would be what would be arrived at through reason and, I guess, the reason of many different people.

Will: Again, it’s gonna involve questions of metaethics as well, where, again, on my theory of metaethics … it would actually just involve interacting with conscious experiences. And that’s a critical aspect of coming to understand what’s morally correct.

Lucas: Okay, so let’s go into the second one first and then let’s go into the third one. And while we do that, it would be great if we could be mindful of problems in intertheoretic comparison and how they arise as we go through both. Does that sound good?

Will: Yeah, that sounds great.

Lucas: So, would you like to just sort of unpack, starting with the second view, the metaethics behind that, issues in how moral realism versus moral anti-realism will affect how the second scenario plays out, and other sorts of crucial considerations in metaethics that will affect the second scenario?

Will: Yeah, so for the second scenario, which again, to be clear, is the aggregating of the variety of human preferences across a variety of contexts and so on, is that right?

Lucas: Right, so that the agent can be fully autonomous and realized in the world that it is sort of an embodiment of human values and preferences, however construed.

Will: Yeah, okay, so here I do think all the metaethics questions are gonna play a much bigger role in the third question. So again, it’s funny, it’s very similar to the question of kind of what mainstream economists often think they’re doing when it comes to cost-benefit analysis. Let’s just even start with the individual case. Even there, it’s not a purely kind of descriptive enterprise, where, again, let’s not even talk about AI. You’re just looking out for me. You and I are friends and you want to do me a favor in some way. How do you make a decision about how to do me that favor, how to benefit me in some way? Well, you could just look at the things I do and then infer on the basis of that what my utility function is. So perhaps every morning I go and I rob a convenience store and then I buy some heroin and then I shoot up and-

Lucas: Damn, Will!

Will: That’s my day. Yes, it’s a confession. Yeah, you’re the first to hear it.

Lucas: It’s crazy, in Oxford huh?

Will: Yeah, Oxford University is wild.

You see that behavior on my part and you might therefore conclude, “Wow, well what Will really likes is heroin. I’m gonna do him a favor and buy him some heroin.” Now, that seems kind of commonsensically pretty ridiculous, assuming I’m demonstrating all sorts of bad behavior that looks like it’s very bad for me, that looks like a compulsion and so on. So instead what we’re really doing is not merely maximizing the utility function that’s given by my revealed preferences; we have some deeper idea of kind of what’s good for me or what’s bad for me.

Perhaps that comes down to just what I would want to want, or what I would want myself to want to want. Perhaps you can do it in terms of what are called second-order or third-order preferences. What idealized Will would want … that is not totally clear. Well, firstly, it’s really hard to know kind of what idealized Will would want. You’re gonna have to start doing at least a little bit of philosophy there. Because I tend to favor hedonism, I think that an idealized version of my friend would want the best possible experiences. That might be very different from what they think an idealized version of themselves would want, because perhaps they have some objective list account of well-being, and they think what they would also want is knowledge for its own sake and appreciating beauty for its own sake and so on.

So, even there I think you’re gonna get into pretty tricky questions about what is good or bad for someone. And then after that you’ve got the question of preference aggregation, which is also really hard, both in theory and in practice. Where, do you just take strengths of preferences across absolutely everybody and then add them up? Well, firstly, you might worry that you can’t actually make these comparisons of strengths of preferences between people. Certainly if you’re just looking at people’s revealed preferences, it’s really opaque how you would say, if I prefer coffee to tea and you vice versa, who has the stronger preference. But perhaps we could look at behavioral facts to kind of try and at least anchor that. And it’s still then non-obvious that what we ought to do, when we’re looking at everybody’s preferences, is just maximize the sum, rather than perhaps give some extra weighting to people who are more badly off, perhaps give more priority to their interests. So those are kinda the theoretical issues.
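The fork raised at the end of that thought, maximizing the plain sum of welfare versus giving extra weight to the worse-off, can be made concrete with a toy sketch. The welfare numbers are invented, and the sketch assumes the interpersonal comparability just flagged as questionable is somehow available.

```python
import math

# Two candidate policies, each giving welfare levels to two people.
# Numbers are made up purely for illustration.
option_1 = [100, 1]   # a large gain to someone already well-off
option_2 = [81, 16]   # a smaller total, but the worse-off person gains more

def utilitarian(welfares):
    """Plain sum: every unit of welfare counts equally."""
    return sum(welfares)

def prioritarian(welfares):
    """Concave weighting: marginal welfare to the worse-off counts more."""
    return sum(math.sqrt(w) for w in welfares)

# utilitarian:  option_1 wins (101 vs 97)
# prioritarian: option_2 wins (13.0 vs 11.0)
```

The two aggregation rules rank the same options differently, which is the theoretical disagreement being described.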

And then secondly, there are kinda just practical issues of implementing that, where you actually need to ensure that people aren’t faking their preferences. And there’s a well-known literature in voting theory that says that basically any aggregation system you have, any voting system, is going to be manipulable in some way. You’re gonna be able to get a better result for yourself, at least in some circumstances, by misrepresenting what you really want.
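The voting-theory result gestured at here is the Gibbard–Satterthwaite theorem. A small Borda-count sketch shows the flavor; the ballot profile is invented for illustration, and the manipulation is a bloc of voters burying the sincere winner below a candidate they like less.

```python
# Borda count: with m candidates, a ballot gives m-1 points to its first
# choice, m-2 to its second, and so on. Highest total score wins.

def borda_winner(ballots):
    scores = {}
    for ballot in ballots:
        m = len(ballot)
        for rank, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (m - 1 - rank)
    return max(scores, key=scores.get)

# Sincere profile: three voters rank A > B > C, two rank B > C > A.
sincere = 3 * [("A", "B", "C")] + 2 * [("B", "C", "A")]
# Scores: A = 6, B = 7, C = 2, so B wins.

# The three A-first voters bury B (their true second choice) below C:
strategic = 3 * [("A", "C", "B")] + 2 * [("B", "C", "A")]
# Scores: A = 6, B = 4, C = 5, so A wins, which the manipulators prefer.
```

Misreporting their rankings got the bloc a better outcome by their own lights, which is exactly the manipulability the voting-theory literature describes.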

Again, these are kind of issues that our society already faces, but they’re gonna bite even harder when we’re thinking about delegating to artificial agents.

Lucas: There’s two levels to this that you’re sort of elucidating. The first is that you can think of the AGI as something which can do favors for everybody in humanity, so there are issues, empirically and philosophically and in terms of understanding other agents, about what sort of preferences that AGI should be maximizing for each individual, say, constrained by what is legal and what is generally converged upon as being good or right. And then there are issues with preference aggregation, which come up more given that we live in a resource-limited universe and world, where not all preferences can coexist and there has to be some sort of potential cancellation between different views.

And so, in terms of this higher level of preference aggregation … and I wanna step back here to metaethics and difficulties of intertheoretic comparison. It would seem that your moral realist view would affect how the weighting would potentially be done. Because it seemed like before you were alluding to the fact that if your moral realist view were true, then the way we could determine what we ought to do, or what is good and true about morality, would be through exploring the space of all possible experiences, right, so we can discover moral facts about experiences.

Will: Mm-hmm (affirmative).

Lucas: And then in terms of preference aggregation, there would be people who would be right or wrong about what is good for them or the world.

Will: Yeah, I guess this is, again, why I wanna distinguish between these two types of value alignment problem, where the second type is just kind of, “What does society want?”, societal preference aggregation. I wasn’t thinking of it as there being kind of right or wrong preferences.

In just the same way as there’s this question of, “I want the system to do X,” but there’s a question of, “Do I want that?” or “How do you know that I want that?”, there’s a question of, “How do you know what society wants?” That’s a