
Elliott Thornley
Why do you care about AI existential safety?
I think that reducing existential risk is extremely important, and that working on AI existential safety is the most effective way to do so.
Please give at least one example of your research interests related to AI existential safety.
I use ideas from decision theory to design and train artificial agents: a project that I call ‘constructive decision theory.’
My main focus so far has been solving the shutdown problem: the problem of ensuring that powerful artificial agents never resist shutdown. My proposed solution is training agents to satisfy a condition I call ‘Preferences Only Between Same-Length Trajectories’ (or ‘POST’ for short). POST-agents have preferences between same-length trajectories (and so can be useful) but lack a preference between every pair of different-length trajectories (and so are neutral about when they get shut down). I’ve been working on both the theoretical and practical aspects of this proposed solution.

On the theoretical side, I’ve proved that POST, together with other plausible conditions, implies Neutrality+: the agent maximizes expected utility, ignoring the probability of each trajectory-length. Such an agent behaves much as you would if you were absolutely certain that you couldn’t affect the probability of your dying at each moment. I’ve argued that agents satisfying Neutrality+ would be shutdownable and useful.

On the practical side, my coauthors and I have trained simple reinforcement learning agents to satisfy POST using my proposed reward function, and we’re currently scaling up these experiments.
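To make the POST condition above concrete, here is a rough sketch in notation chosen for illustration, writing T_n for the set of length-n trajectories and ≻ for the agent's strict preference relation; the formal statement in the theoretical work is more careful than this.

```latex
% Rough formalization of POST (notation chosen for illustration only).
% T_n is the set of length-n trajectories; \succ is strict preference.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
\textbf{POST:}\quad
\begin{cases}
\exists\, n \text{ and } t_1, t_2 \in T_n \text{ such that } t_1 \succ t_2
  & \text{(preferences within a length: the agent can be useful)}\\[4pt]
\forall\, m \neq n,\ t_1 \in T_m,\ t_2 \in T_n:\
  t_1 \not\succ t_2 \text{ and } t_2 \not\succ t_1
  & \text{(no preference across lengths: neutral about when shutdown comes)}
\end{cases}
\]
\end{document}
```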
I’ve also been considering the promise of keeping powerful agents under control by training them to be risk-averse. Here’s the basic idea. For misaligned artificial agents, trying to take over the world is risky. If these agents are risk-averse, trying to take over the world will seem less appealing to them. In the background here is a famous calibration theorem from the economist Matthew Rabin, which says, in effect, that if an expected-utility maximizer is even slightly risk-averse when the stakes are low, it is extremely risk-averse when the stakes are high. This theorem suggests that it won’t be too hard to find a degree of risk-aversion satisfying two conditions: (i) any aligned agents will be bold enough to be useful, and (ii) any misaligned agents will be timid enough to be safe.
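Here is a toy numerical sketch of the ‘bold enough to be useful, timid enough to be safe’ pattern. The exponential (CARA) utility function, payoffs, and probabilities below are illustrative assumptions of mine, not part of the research itself.

```python
# Toy sketch: mild risk-aversion leaves ordinary work attractive but makes a
# high-stakes takeover gamble unattractive. The CARA utility, payoffs, and
# probabilities are illustrative assumptions only.
import math

def cara_utility(x, a):
    """Constant absolute risk aversion: u(x) = (1 - exp(-a*x)) / a."""
    return (1 - math.exp(-a * x)) / a

def expected_utility(lottery, a):
    """lottery: list of (probability, payoff) pairs."""
    return sum(p * cara_utility(x, a) for p, x in lottery)

def expected_value(lottery):
    return sum(p * x for p, x in lottery)

a = 0.01  # a mild degree of risk-aversion

small_gamble = [(0.5, 11.0), (0.5, -10.0)]         # low stakes
useful_task  = [(0.9, 10.0), (0.1, -1.0)]          # low stakes, mostly upside
takeover     = [(0.05, 10_000.0), (0.95, -100.0)]  # high stakes: rare huge win, likely big loss
comply       = [(1.0, 0.0)]                        # take no risky action

# Slightly risk-averse at low stakes: the small gamble has positive expected
# value (0.5) but is turned down by a whisker (expected utility ~ -0.05).
print(expected_value(small_gamble), expected_utility(small_gamble, a))

# Bold enough to be useful: ordinary tasks still beat doing nothing.
print(expected_utility(useful_task, a) > expected_utility(comply, a))   # True

# Timid enough to be safe: the takeover gamble has a large positive expected
# value (405), yet the risk-averse agent prefers to comply.
print(expected_value(takeover))
print(expected_utility(takeover, a) > expected_utility(comply, a))      # False
```

The Rabin-style pattern shows up in the numbers: the same utility function that is only barely risk-averse over the small gamble decisively rejects the high-stakes gamble despite its much larger expected payoff.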
I’m also considering how helpful it could be to train artificial agents to be indifferent between pairs of options. Current training techniques make it easy to train agents to prefer one option to another, but not to make them indifferent between options. My proposed technique might make that easy, and my coauthors and I are trying to figure out whether it does. If we can train agents to be indifferent between pairs of options, that could be a big boost to our ability to avoid goal misgeneralization. After all, a preference imposes only an inequality constraint on the agent’s utility function, whereas indifference imposes an equality constraint. We’re also trying to work out just how big a boost this could be.
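To spell out that last point, here is how the two kinds of constraint look if training shapes a utility function U over options (notation chosen for illustration):

```latex
% Constraints that trained comparisons place on a learned utility function U
% (notation chosen for illustration only).
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align*}
a \succ b \;&\Longrightarrow\; U(a) > U(b)
  && \text{(a preference: an inequality constraint on } U\text{)}\\
a \sim b \;&\Longrightarrow\; U(a) = U(b)
  && \text{(indifference: an equality constraint, which pins } U \text{ down far more tightly)}
\end{align*}
\end{document}
```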