Three generals are voting on whether to attack or retreat from their siege of a castle. One of the generals is corrupt and two of them are not. What happens when the corrupted general sends different answers to the other two generals?
A Byzantine fault is “a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the “Byzantine Generals’ Problem”, developed to describe this condition, where actors must agree on a concerted strategy to avoid catastrophic system failure, but some of the actors are unreliable.“
The Byzantine Generals’ Problem and associated issues in maintaining reliable distributed computing networks is illuminating for both AI alignment and modern networks we interact with like Youtube, Facebook, or Google. By exploring this space, we are shown the limits of reliable distributed computing, the safety concerns and threats in this space, and the tradeoffs we will have to make for varying degrees of efficiency or safety.
The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning with El Mahdi El Mhamdi is the ninth podcast in the AI Alignment Podcast series, hosted by Lucas Perry. El Mahdi pioneered Byzantine resilient machine learning devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation and networks applied to both neural and biomolecular networks. This particular episode was recorded at the Beneficial AGI 2019 conference in Puerto Rico. We hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.
If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.
Topics discussed in this episode include:
- The Byzantine Generals’ Problem
- What this has to do with artificial intelligence and machine learning
- Everyday situations where this is important
- How systems and models are to update in the context of asynchrony
- Why it’s hard to do Byzantine resilient distributed ML.
- Why this is important for long-term AI alignment
An overview of Adversarial Machine Learning and where Byzantine-resilient Machine Learning stands on the map is available in this (9min) video . A specific focus on Byzantine Fault Tolerant Machine Learning is available here (~7min)
In particular, El Mahdi argues in the first interview (and in the podcast) that technical AI safety is not only relevant for long term concerns, but is crucial in current pressing issues such as social media poisoning of public debates and misinformation propagation, both of which fall into Poisoning-resilience. Another example he likes to use is social media addiction, that could be seen as a case of (non) Safely Interruptible learning. This value misalignment is already an issue with the primitive forms of AIs that optimize our world today as they maximize our watch-time all over the internet.
The latter (Safe Interruptibility) is another technical AI safety question El Mahdi works on, in the context of Reinforcement Learning. This line of research was initially dismissed as “science fiction”, in this interview (5min), El Mahdi explains why it is a realistic question that arises naturally in reinforcement learning
“El Mahdi’s work on Byzantine-resilient Machine Learning and other relevant topics is available on his Google scholar profile. A modification of the popular machine learning library TensorFlow, to make it Byzantine-resilient (and also support communication over UDP channels among other things) has been recently open-sourced on Github by El Mahdi’s colleagues based on his algorithmic work we mention in the podcast.
Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today we’ll be speaking with El Mahdi El Mhamdi on the Byzantine problem, Byzantine tolerance, and poisoning in distributed learning and computer networks. If you find this podcast interesting or useful, please give it a like and follow us on your preferred listing platform. El Mahdi El Mhamdi pioneered Byzantine resilient machine learning devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation and networks applied to both neural and biomolecular networks. With that, El Mahdi’s going to start us off with a thought experiment.
El Mahdi: Imagine you are part of a group of three generals, say, from the Byzantine army surrounding a city you want to invade, but you also want to retreat if retreat is the safest choice for your army. You don’t want to attack when you will lose, so those three generals that you’re part of are in three sides of the city. They sent some intelligence inside the walls of the city, and depending on this intelligence information, they think they will have a good chance of winning and they would like to attack, or they think they will be defeated by the city, so it’s better for them to retreat. Your final decision would be a majority vote, so you communicate through some horsemen that, let’s say, are reliable for the sake of this discussion. But there might be one of you who might have been corrupt by the city.
The situation would be problematic if, say, there are General A, General B, and General C. General A decided to attack. General B decided to retreat based on their intelligence for some legitimate reason. A and B are not corrupt, and say that C is corrupt. Of course, A and B, they can’t figure out who was corrupt. Say C is corrupt. What this general would do they thing, so A wanted to attack. They will tell them, “I also want to attack. I will attack.” Then they will tell General B, “I also want to retreat. I will retreat.” A receives two attack votes and one retreat votes. General B receives two retreat votes and only one attack votes. If they trust everyone, they don’t do any double checking, this would be a disaster.
A will attack alone; B would retreat; C, of course, doesn’t care because he was corrupt by the cities. You can tell me they can circumvent that by double checking. For example, A and B can communicate on what C told them. Let’s say that every general communicates with every general on what he decides and on also what’s the remaining part of the group told them. A will report to B, “General C told me to attack.” Then B would tell C, “General C told me to retreat.” But then A and B wouldn’t have anyway of concluding whether the inconsistency is coming from the fact that C is corrupt or that the general reporting on what C told them is corrupt.
I am General A. I have all the valid reasons to think with the same likelihood that C is maybe lying to me or also B might also be lying to me. I can’t know if you are misreporting what C told you enough for the city to corrupt one general if there are three. It’s impossible to come up with an agreement in this situation. You can easily see that this will generalize to having more than three generals, like I say 100, as soon as the non-corrupt one are less than two-thirds because what we saw with three generals would happen with the fractions that are not corrupt. Say that you have strictly more than 33 generals out of 100 who are corrupt, so what they can do is they can switch the majority votes on each side.
But worse than that, say that you have 34 corrupt generals and the remaining 66 not corrupt generals. Say that those 66 not corrupt generals were 33 on the attack side, 33 on the retreat side. The problem is that when you are in some side, say that you are in the retreat side, you have in front of you a group of 34 plus 33 in which there’s a majority of malicious ones. This majority can collude. It’s part of the Byzantine hypothesis. The malicious ones can collude and they will report a majority of inconsistent messages on the minority on the 33 ones. You can’t provably realize that the inconsistency is coming from the group of 34 because they are a majority.
Lucas: When we’re thinking about, say, 100 persons or 100 generals, why is it that they’re going to be partitioned automatically into these three groups? What if there’s more than three groups?
El Mahdi: Here we’re doing the easiest form of Byzantine agreement. We want to agree on attack versus retreat. When it’s become multi-dimensional, it gets even messier. There are more impossibility results and impossibility results. Just like with the binary decision, there is an impossibility theorem on having agreement if you have unsigned messages to horsemen. Whenever the corrupt group exceeds 33%, you provably cannot come up with an agreement. There are many variants to this problem, of course, depending on what hypothesis you can assume. Here, without even mentioning it, we were assuming bounded delays. The horsemen would always arrive eventually. If the horsemen could die on the way and you don’t have any way to check whether they arrive or not or you can be waiting forever because you don’t have any proof that the horsemen died on the way.
You don’t have any mechanism to tell you, “Stop waiting for the horsemen. Stop waiting for the message from General B because the horsemen died.” You can be waiting forever and there are theorems that shows that when you have unbounded delays, and by the way, like in distributed computing, whenever you have in bounded delays, we speak about asynchrony. If you have a synchronous communication, there is a very famous theorem that tells you consensus is impossible, not even in the malicious case, but just like in …
Lucas: In the mundane normal case.
El Mahdi: Yes. It’s called the Fischer Lynch Patterson theorem theorem .
Lucas: Right, so just to dive down into the crux of the problem, the issue here fundamentally is that when groups of computers or groups of generals or whatever are trying to check who is lying amongst discrepancies and similarities of lists and everyone who’s claiming what is when there appears to be a simple majority within that level of corrupted submissions, then, yeah, you’re screwed.
El Mahdi: Yes. It’s impossible to achieve agreement. There are always fractions of malicious agents above which is provably impossible to agree. Depending on the situation, it will be a third or sometimes or a half or a quarter, depending on your specifications.
Lucas: If you start tweaking the assumptions behind the thought experiment, then it changes what number of corrupted machines or agents that are required in order to flip the majority and to poison the communication.
El Mahdi: Exactly. But for example, you mentioned something very relevant to today’s discussion, which is what if we were not agreeing on two decisions, retreat, attack. What if we were agreeing on some multi-dimensional decision? Attack or retreat on one dimension and then …
Lucas: Maybe hold, keep the siege going.
El Mahdi: Yeah, just like add possibilities or dimensions and multi-dimensional agreements. They’re even more hopeless results in that direction
Lucas: There are more like impossibility theorems and issues where these distributed systems are vulnerable to small amounts of systems being corrupt and screwing over the entire distributed network.
El Mahdi: Yes. Maybe now we can slightly move to machine learning.
Lucas: I’m happy to move into machine learning now. We’ve talked about this, and I think our audience can probably tell how this has to do with computers. Yeah, just dive in what this has to do with machine learning and AI and current systems today, and why it even matters for AI alignment.
El Mahdi: As a brief transition, solving the agreement problem besides this very nice historic thought experiment is behind consistencies of safety critical systems like banking systems. Imagine we have a shared account. Maybe you remove 10% of the amount and then she or he added some $10 to the accounts. You remove the $10 in New York and she or he put the $10 in Los Angeles. The banking system has to agree on the ordering because minus $10 plus 10% is not the same result as plus 10% then minus $10. The final balance of the account would not be the same.
El Mahdi: The banking systems routinely are solving decisions that fall into agreement. If you work on some document sharing platform, like Dropbox or Google Docs, whatever, and we collaboratively are writing the document, me and you. The document sharing platform has to, on real time, solve agreements about the ordering of operations so that me and you always keep seeing the same thing. This has to happen while some of the machines that are interconnecting us are failing, whether just like failing because there was a electric crash or something. Data center has lost some machines or if it was like restart, a bug or a take away. What we want in distributed computing is that we would like communications schemes between machines that’s guarantee this consistency that comes from agreement as long as some fraction of machines are reliable. What this has to do with artificial intelligence and machine learning reliability is that with some colleagues, we are trying to encompass one of the major issues in machine learning reliability inside the Byzantine fault tolerance umbrella. For example, you take, for instance, poisoning attacks.
Lucas: Unpack what poisoning attacks are.
El Mahdi: For example, imagine you are training a model on what are good videos to recommend given some key word search. If you search for “medical advice for young parents on vaccine,” this is a label. Let’s assume for the sake of simplicity that a video that tells you not to take your kid for vaccines is not what we mean by medical advice for young parents on vaccine because that’s what medical experts agree on. We want our system to learn that anitvaxers, like anti-vaccine propaganda is not what people are searching for when they type those key words, so I suppose a world where we care about accuracy, okay? Imagine you want to train a machine learning model that gives you accurate results of your search. Let’s also for the sake of simplicity assume that a majority of people on the internet are honest.
Let’s assume that more than 50% of people are not actively trying to poison the internet. Yeah, this is very optimistic, but let’s assume that. What we can show and what me and my colleagues started this line of research with is that you can easily prove that one single malicious agent can provably poison a distributed machine learning scheme. Imagine you are this video sharing platform. Whenever people behave on your platform, this generates what we call gradients, so it updates your model. It only takes a few hyperactive accounts that could generate behavior that is powerful enough to pull what we call the average gradient because what distributed machine learning is using, at least up to today, if you read the source code of most distributed machine learning frameworks. Distributed machine learning is always averaging gradients.
Imagine you Lucas Perry just googled a video on the Parkland shootings. Then the video sharing platform shows you a video telling you that David Hogg and Emma Gonzalez and those kids behind the March for Our Lives movement are crisis actors. The video labels three kids as crisis actors. It obviously has a wrong label, so it is what I will call a poisoned data point. If you are non-malicious agents on the video sharing platform, you will dislike the video. You will not approve it. You’re likely to flag it. This should generate a gradient that pushes the model in that direction, so the gradient will update the model into a direction where it stops thinking that this video is relevant for someone searching “Parkland shooting survivors.” What can happen if your machine learning framework is just averaging gradients is that a bunch of hyperactive people on some topic could poison the average and pull it towards the direction where the models is enforcing this thinking that, “Yeah, those kids are crisis actors.”
Lucas: This is the case because the hyperactive accounts are seen to be given more weight than accounts which are less active in the same space. But this extra weighting that these accounts will get from their hyperactivity in one certain category or space over another, how is the weighting done? Is it just time spent per category or does it have to do with submissions that agree with the majority?
El Mahdi: We don’t even need to go into the details because we don’t know. I’m talking in a general setting where you have a video sharing platform aggregating gradients for behavior. Now, maybe let’s raise the abstraction level. You are doing gradient descents, so you have a lost function that you want to minimize. You have an error function. The error function is the mismatch between what you predict and what the user tells you. The user tells you this is a wrong prediction, and then you move to the direction where the users stop telling you this is the wrong direction. You are doing great in this sense minimizing the lost function. User behaves, and with their behavior, you generate gradients.
What you do now in the state of the arts way of distributed machine learning is that you average all those gradients. Averaging is well known not to be resilient. If you have a room of poor academics earning a few thousand dollars and then a billionaire jumps in the room, if your algorithm reasons with averaging, it will think that this is a room of millionaires because the average salary would be a couple of hundred millions. But then million is very obvious to do when it comes to salaries and numbers scalers because you can rank them.
El Mahdi: You rank numbers and then decide, “Okay, this is the ordering. This is the number that falls in the middle. This is the upper half. This is the lower half and this is the median.” When it becomes high dimensional, the median is a bit tricky. It has some computational issues. Then even if you compute what we call the geometric median, an attacker can still know how to leverage the fact that you’re only approximating it because there’s no closed formula. There’s no closed form to compute the median in that dimension. But worse than that, what we showed in one of our follow up works is because of the fact that machine learning is done in very, very, very high dimensions, you would have a curse of the dimensionality issue that makes it possible for attackers to sneak in without being spot as a way of the median.
It can still look like the median vector. I take benefits from the fact that those vectors, those gradients, are extremely high dimensional. I would look for all the disagreements. Let’s say you have a group of a couple hundred gradients, and I’m the only malicious one. I would look at the group of correct vectors all updating you somehow in the same direction within some variants. On average, they’re like what we call unbiased estimators of the gradient. When you take out the randomness, the expected value they will give you is the real gradient of the loss function. What I will do as a malicious worker is I will look at the way they are disagreeing slightly on each direction.
I will sum that. I will see that they disagree by this much on direction one. They disagree by this much on direction two. They disagree by this much, epsilon one, epsilon two, epsilon three. I would look for all these small disagreements they have on all the components.
Lucas: Across all dimensions and high dimensional space.
El Mahdi: Then add that up. It will be my budget, my leeway, my margin to attack you on another direction.
Lucas: I see.
El Mahdi: What we proved is that you have to mix ideas from geometric median with ideas from the traditional component-wise median, and that those are completely different things. The geometric median is a way to find a median by just minimizing the sum of distances between what you look for and all the vectors that were proposed, and then the component-wise median will do a traditional job of ranking the coordinates. It looks at each coordinate, and then rank all the propositions, and then look for the proposition that lies in the middle. Once we proved enough follow up work is that, yeah, the geometric median idea is elegant. It can make you converge, but it can make you converge to something arbitrarily bad decided by the attacker. When you train complex models like neural nets, the landscape you optimize inside is not convex. It’s not like a bowl or a cup where you just follow the descending slope you would end up in the lowest point.
El Mahdi: It’s like a multitude of bowls with different heights.
Lucas: Right, so there’s tons of different local minima across the space.
El Mahdi: Exactly. So in the first paper what we showed is that ideas that look like the geometric median are enough to just converge. You converge. You provably converge, but in the follow up what we realized, like something we were already aware of, but not enough in my opinion, is that there is this square root D, this curse of dimensionality that will arise when you compute high dimensional distances. That the attacker can leverage.
So in what we call the hidden vulnerability of distributed learning, you can have correct vectors, agreeing on one component. Imagine in your head some three axis system.
Let’s say that they are completely in agreement on axis three. But then in axis one, two, so in the plane formed by the axis one and axis two, they have a small disagreement.
What I will do as the malicious agent, is that I will leverage this small disagreement, and inject it in axis three. And this will make you go to a bit slightly modified direction. And instead of going to this very deep, very good minima, you will go into a local trap that is just close ahead.
And that comes from the fact that loss functions of interesting models are clearly like far from being convex. The models are highly dimensional, and the loss function is highly un-convex, and creates a lot of leeway.
Lucas: It creates a lot of local minima spread throughout the space for you to attack the person into.
El Mahdi: Yeah. So convergence is not enough. So we started this research direction by formulating the following question, what does it take to guarantee convergence?
And any scheme that aggregates gradients, and guarantee convergence is called Byzantine resilient. But then you can realize that in very high dimensions, and highly non-convex loss functions, is convergence enough? Would you just want to converge?
There are of course people arguing the deep learning models, like there’s this famous paper by Anna Choromanska, and Yann LeCun, and Gérard Ben Arous, about the landscape of neural nets, that basically say that, “Yeah, very deep local minimum of neural nets are some how as good.”
From an overly simplified point of view, it’s an optimistic paper, that tells you that you shouldn’t worry too much when you optimize neural nets about the fact that gradient descent would not necessarily go to a global like-
Lucas: To a global minima.
El Mahdi: Yeah. Just like, “Stop caring about that.”
Lucas: Because the local minima are good enough for some reason.
El Mahdi: Yeah. I think that’s a not too unfair way to summarize the paper for the sake of this talk, for the sake of this discussion. What we empirically illustrate here, and theoretically support is that that’s not necessarily true.
Because we show that with very low dimensional, not extremely complex models, trained on CIFAR-10 and MNIST, which are toy problems, very easy toy problems, low dimensional models etc. It’s already enough to have those amounts of parameters, let’s say 100,000 parameters or less, so that an attacker would always find a direction to take you each time away, away, away, and then eventually find an arbitrarily bad local minimum. And then you just converge to that.
So convergence is not enough. Not only you have to seek an aggregation rule that guarantees convergence, but you have to seek some aggregation rules that guarantee that you would not converge to something arbitrarily bad. You would keep converging to the same high quality local minimum, whatever that means.
The hidden vulnerability is this high dimensional idea. It’s the fact that because the loss function is highly non-convex, because there’s the high dimensionality, as an attacker I would always find some direction, so the attack goes this way.
Here the threat model is that an attacker can spy on your gradients, generated by the correct workers but cannot talk on their behalf. So I cannot corrupt the messages. Since you asked about the reliability of horsemen or not.
So horsemen are reliable. I can’t talk on your behalf, but I can spy on you. I can see what are you sending to the others, and anticipate.
So I would as an attacker wait for correct workers to generate their gradients, I will gather those vectors, and then I will just do a linear regression on those vectors to find the best direction to leverage the disagreement on the D minus one remaining directions.
So because there would be this natural disagreement, this variance in many directions, I will just do some linear regression and find what is the best direction to keep? And use the budget I gathered, those epsilons I mentioned earlier, like this D time epsilon on all the directions to inject it the direction that will maximize my chances of taking you away from local minima.
So you will converge, as proven in the early papers, but not necessarily to something good. But what we showed here is that if you combine ideas from multidimensional geometric medians, with ideas from single dimensional component-wise median, you improve your robustness.
Of course it comes with a price. You require three quarters of the workers to be reliable.
There is another direction where we expanded this problem, which is asynchrony. And asynchrony arises when as I said in the Byzantine generals setting, you don’t have a bounded delay. In the bounded delay setting, you know that horses arrive at most after one hour.
Lucas: But I have no idea if the computer on the other side of the planet is ever gonna send me that next update.
El Mahdi: Exactly. So imagine you are doing machine learning on smartphones. You are leveraging a set of smartphones all around the globe, and in different bandwidths, and different communication issues etc.
And you don’t want each time to be bottlenecked by the slowest one. So you want to be asynchronous, you don’t want to wait. You’re just like whenever some update is coming, take it into account.
Imagine some very advanced AI scenario, where you send a lot of learners all across the universe, and then they communicate with the speed of light, but some of them are five light minutes away, but some others are two hours and a half. And you want to learn from all of them, but not necessarily handicap the closest one, because there are some other learners far away.
Lucas: You want to run updates in the context of asynchrony.
El Mahdi: Yes. So you want to update whenever a gradient is popping up.
Lucas: Right. Before we move on to illustrate the problem again here is that the order matters, right? Like in the banking example. Because the 10% plus 10 is different from-
El Mahdi: Yeah. Here the order matters for different reasons. You update me so you are updating me on the model you got three hours ago. But in the meanwhile, three different agents updated me on the models, while getting it three minutes ago.
All the agents are communicating through some abstraction they call the server maybe. Like this server receives updates from fast workers.
Lucas: It receives gradients.
El Mahdi: Yeah, gradients. I also call them updates.
El Mahdi: Because some workers are close to me and very fast, I’ve done maybe 1000 updates, while you were still working and sending me the message.
So when your update arrive, I can tell whether it is very stale, very late, or malicious. So what we do in here is that, and I think it’s very important now to connect a bit back with classic distributed computing.
Is that Byzantine resilience in machine learning is easier than Byzantine resilience in classical distributed computing for one reason, but it is extremely harder for another reason.
The reason is that we know what we want to agree on. We want to agree on a gradient. We have a toolbox of calculus that tells us how this looks like. We know that it’s the slope of some loss function that is most of today’s models, relatively smooth, differentiable, maybe Lipschitz, bounded, whatever curvature.
So we know that we are agreeing on vectors that are gradients of some loss function. And we know that there is a majority of workers that will produce vectors that will tell us what does a legit vector look like.
You can find some median behavior, and then come up with filtering criterias that will get away with the bad gradients. That’s the good news. That’s why it’s easier to do Byzantine resilience in machine learning than to do Byzantine agreement. Byzantine agreement, because agreement is a way harder problem.
The reason why Byzantine resilience is harder in machine learning than in the typical settings you have in distributed computing is that we are dealing with extremely high dimensional data, extremely high dimensional decisions.
So a decision here is to update the model. It is triggered by a gradient. So whenever I accept a gradient, I make a decision. I make a decision to change the model, to take it away from this state, to this new state, by this much.
But this is a multidimensional update. And Byzantine agreement, or Byzantine approximate agreement in higher dimension has been provably hopeless by Hammurabi Mendes, and Maurice Herlihy in an excellent paper in 2013, where they show that you can’t do Byzantine agreement in D dimension with N agents in less than N to the power D computations, per agent locally.
Of course in their paper, they were meaning Byzantine agreement on positions. So they were framing it with a motivations saying, “This is N to the power D, but the typical cases we care about in distributed computing are like robots agreeing on a position on a plane, or on a position in a three dimensional space.” So D is two or three.
So N to the power two or N to the power three is fine. But in machine learning D is not two and three, D is a billion or a couple of millions. So N to the power a million is just like, just forget.
And not only that, but also they require … Remember when I tell you that Byzantine resilience computing would always have some upper bound on the number malicious agents?
Lucas: Mm-hmm (affirmative).
El Mahdi: So the number of total agents should exceed D times the number of malicious agents.
Lucas: What is D again sorry?
El Mahdi: Dimension.
Lucas: The dimension. Okay.
El Mahdi: So if you have to agree on D dimension, like on a billion dimensional decision, you need at least a billion times the number of malicious agents.
So if you have say 100 malicious agents, you need at least 100 billion total number of agents to be resistant. No one is doing distributed machine learning on 100 billion-
Lucas: And this is because the dimensionality is really screwing with the-
El Mahdi: Yes. Byzantine approximate agreement has been provably hopeless. That’s the bad, that’s why the dimensionality of machine learning makes it really important to go away, to completely go away from traditional distributed computing solutions.
El Mahdi: So we are not doing agreement. We’re not doing agreement, we’re not even doing approximate agreement. We’re doing something-
Lucas: Totally new.
El Mahdi: Not new, totally different.
El Mahdi: Called gradient decent. It’s not new. It’s as old as Newton. And it comes with good news. It comes with the fact that there are some properties, like some regularity of the loss function, some properties we can exploit.
And so in the asynchronous setting, it becomes even more critical to leverage those differentiability properties. So because we know that we are optimizing a loss functions that has some regularities, we can have some good news.
And the good news has to do with curvature. What we do here in asynchronous setting, is not only we ask workers for their gradients, we ask them for their empirical estimate of the curvature.
Lucas: Sorry. They’re estimating the curvature of the loss function, that they’re adding the gradient to?
El Mahdi: They add the gradient to the parameter, not the loss function. So we have a loss function, parameter is the abscissa, you add the gradient to the abscissa to update the model, and then you end up in a different place of the loss function.
So you have to imagine the loss function as like a surface, and then the parameter space as the plane, the horizontal plane below the surface. And depending on where you are in the space parameter, you would be on different heights of the loss function.
Lucas: Wait sorry, so does the gradient depend where you are on this, the bottom plane?
El Mahdi: Yeah –
Lucas: So then you send an estimate for what you think the slope of the intersection will be?
El Mahdi: Yeah. But for asynchrony, not only that. I will ask you to send me the slope, and your observed empirical growth of the slope.
Lucas: The second derivative?
El Mahdi: Yeah.
El Mahdi: But the second derivative again in high dimension is very hard to compute. You have to computer the Hessian matrix.
El Mahdi: That’s something like completely ugly to compute in high dimensional situations because it takes D square computations.
As an alternative we would like you to send us some linear computation in D, not a square computation in D.
So we would ask you to compute your actual gradient, your previous gradient, the difference between them, and normalize it by the difference between models.
So, “Tell us your current gradient, by how much it changed from the last gradient, and divide that by how much you changed the parameter.”
So you would tell us, “Okay, this is my current slope, and okay this is the gradient.” And you will also tell us, “By the way, my slope change relative to my parameter change is this much.”
And this would be some empirical estimation of the curvature. So if you are in a very curved area-
Lucas: Then the estimation isn’t gonna be accurate because the linearity is gonna cut through some of the curvature.
El Mahdi: Yeah but if you are in a very curved area of the loss function, your slope will change a lot.
Lucas: Okay. Exponentially changing the slope.
El Mahdi: Yeah. Because you did a very tiny change in the parameter and it takes a lot of the slope.
Lucas: Yeah. Will change the … Yeah.
El Mahdi: When you are in a non-curved area of the loss function, it’s less harmful for us that you are stale, because you will just technically have the same updates.
If you are in a very curved area of the loss function, your updates being stale is now a big problem. So we want to discard your updates proportionally to your curvature.
So this is the main idea of this scheme in asynchrony, where we would ask workers about their gradient, and their empirical growth rates.
And then of course I don’t want to trust you on what you declare, because you can plan to screw me with some gradients, and then declare a legitimate value of the curvature.
I will take those empirical, what we call in the paper empirical Lipschitz-ness. So we ask you for this empirical growth rate, that it’s a scalar, remember? This is very important. It’s a single dimensional number.
And so we ask you about this growth rate, and we ask all of you about growth rates, again assuming the majority is correct. So the majority of growth rates will help us set the median growth rate in a robust manner, because as long as a simple majority is not lying, the median growth rates will always be bounded between two legitimate values of the growth rate.
Lucas: Right because, are you having multiple workers inform you of the same part of your loss function?
El Mahdi: Yes. Even though they do it in an asynchronous manner.
Lucas: Yeah. Then you take the median of all of them.
El Mahdi: Yes. And then we reason by quantiles of the growth rates.
Lucas: Reason by quantiles? What are quantiles?
El Mahdi: The first third, the second third, the third third. Like the first 30%, the second 30%, the third 30%. We will discard the first 30%, discard the last 30%. Anything in the second 30% is safe.
Of course this has some level of pessimism, which is good for safety, but not very good for being fast. Because maybe people are not lying, so maybe the first 30%, and the last 30% are also values we could consider. But for safety reasons we want to be sure.
Lucas: You want to try to get rid of the outliers.
El Mahdi: Possible.
Lucas: Possible outliers.
El Mahdi: Yeah. So we get rid of the first 30%, the last 30%.
Lucas: So this ends up being a more conservative estimate of the loss function?
El Mahdi: Yes. That’s completely right. We explain that in the paper.
Lucas: So there’s a trade off that you can decide-
El Mahdi: Yeah.
Lucas: By choosing what percentiles to throw away.
El Mahdi: Yeah. Safety never comes for free. So here, depending on how good your estimates about the number of potential Byzantine actors is, your level of pessimism with translate into slowdown.
Lucas: Right. And so you can update the amount that you’re cutting off-
El Mahdi: Yeah.
Lucas: Based off of the amount of expected corrupted signals you think you’re getting.
El Mahdi: Yeah. So now imagine a situation where you know the number of workers is know. You know that you are leveraging 100,000 smartphones doing gradient descent for you. Let’s call that N.
You know that F of them might be malicious. We argue that if F is exceeding the third of N, you can’t do anything. So we are in a situation where F is less than a third. So less than 33,000 workers are malicious, then the slowdown would be F over N, so a third.
What if you are in a situation where you know that your malicious agents are way less than a third? For example you know that you have at most 20 rogue accounts in your video sharing platform.
And your video sharing platform has two billion accounts. So you have two billion accounts.
Lucas: 20 of them are malevolent.
El Mahdi: What we show is that the slowdown would be N minus F divided by N. N is the two billion accounts, F is the 20, and is again two billion.
So it would be two billion minus 20, so one million nine hundred billion, like something like 0.999999. So you would go almost as fast as the non-Byzantine resilient scheme.
So our Byzantine resilient scheme has a slowdown that is very reasonable in situations where F, the number of malicious agents is way less than N, the total number of agents, which is typical in modern…
Today, like if you ask social media platforms, they have a lot of a tool kits to prevent people from creating a billion fake accounts. Like you can’t in 20 hours create an army of several million accounts.
None of the mainstream social media platforms today are susceptible to this-
Lucas: Are susceptible to massive corruption.
El Mahdi: Yeah. To this massive account creation. So you know that the number of corrupted accounts are negligible to the number of total accounts.
So that’s the good news. The good news is that you know that F is negligible to N. But then the slowdown of our Byzantine resilient methods is also close to one.
But it has the advantage compared to the state of the art today to train distributed settings of not taking the average gradient. And we argued in the very beginning that those 20 accounts that you could create, it doesn’t take a bot army or whatever, you don’t need to hack into the machines of the social network. You can have a dozen human, sitting somewhere in a house manually creating 20 accounts, training the accounts over time, doing behavior that makes the legitimate for some topics, and then because you’re distributing machine learning scheme would average the gradients generated by people behavior and that making your command anti-vaccine or controversies, anti-Semitic conspiracy theories.
Lucas: So if I have 20 bad gradients and like, 10,000 good gradients for a video, why is it that with averaging 20 bad gradients are messing up the-
El Mahdi: The amplitude. It’s like the billionaire in the room of core academics.
Lucas: Okay, because the amplitude of each of their accounts is greater than the average of the other accounts?
El Mahdi: Yes.
Lucas: The average of other accounts that are going to engage with this thing don’t have as large of an amplitude because they haven’t engaged with this topic as much?
El Mahdi: Yeah, because they’re not super credible on gun control, for example.
Lucas: Yeah, but aren’t there a ton of other accounts with large amplitudes that are going to be looking at the same video and correcting over the-
El Mahdi: Yeah, let’s define large amplitudes. If you come to the video and just like it, that’s a small update. What about you like it, post very engaging comments-
Lucas: So you write a comment that gets a lot of engagement, gets a lot of likes and replies.
El Mahdi: Yeah, that’s how you increase your amplitude. And because you are already doing some good job in becoming the reference on that video-sharing platform when it comes to discussing gun control, the amplitude of your commands is by definition high and the fact that your command was very early on posted and then not only you commented the video but you also produced a follow-up video.
Lucas: I see, so the gradient is really determined by a multitude of things that the video-sharing platform is measuring for, and the metrics are like, how quickly you commented, how many people commented and replied to you. Does it also include language that you used?
El Mahdi: Probably. It depends on the social media platform and it depends on the video-sharing platform and, what is clear is that there are many schemes that those 20 accounts created by this dozen people in a house can try to find good ways to maximize the amplitude of their generated gradients, but this is a way easier problem than the typical problems we have in technical AI safety. This is not value alignment or value loading or coherent extrapolated volition. This is a very easy, tractable problem on which now we have good news, provable results. What’s interesting is the follow-up questions that we are trying to investigate here with my colleagues, the first of which is, don’t necessarily have a majority of people on the internet promoting vaccines.
Lucas: People that are against things are often louder than people that are not.
El Mahdi: Yeah, makes sense, and sometimes maybe numerous because they generate content, and the people who think vaccines are safe not creating content. In some topics it might be safe to say that we have a majority of reasonable, decent people on the internet. But there are some topics in which now even like polls, like the vaccine situation, there’s a surge now of anti-vaccine resentment in western Europe and the US. Ironically this is happening in the developed country now, because people are so young, they don’t remember the non-vaccinated person. My aunt, I come from Morocco. my aunt is handicapped by polio, so I grew up seeing what a non-vaccinated person looks like. So young people in the more developed countries never had a living example of non-vaccinated past.
Lucas: But they do have examples of people that end up with autism and it seems correlated with vaccines.
El Mahdi: Yeah, the anti-vaccine content may just end up being so click baits, and so provocative that it gets popular. So this is a topic where the majority hypothesis which is crucial to poisoning resilience does not hold. An open follow up we’re onto now is how to combine ideas from reputation metrics, PageRank, et cetera, with poisoning resilience. So for example you have the National Health Institute, the John Hopkins Medical Hospital, Harvard Medical School, and I don’t know, the Massachusetts General Hospital having official accounts on some video-sharing platform and then you can spot what they say on some topic because now we are very good at doing semantic analysis of contents.
And know that okay, on the tag vaccines, I know that there’s this bunch of experts and then what you want to make emerge on your platform is some sort of like epistocracy. The power is given to the knowledgeable, like we have in some fields, like in medical regulation. The FDA doesn’t do a majority vote. We don’t have a popular majority vote across the country to tell the FDA whether it should approve this new drug or not. The FDA does some sort of epistocracy where the knowledgeable experts on the topic would vote. So how about mixing ideas from social choice?
Lucas: And topics in which there are experts who can inform.
El Mahdi: Yeah. There’s also a general fall-off of just straight out trying to connect Byzantine resilient learning with social choice, but then there’s another set of follow ups that motivates me even more. We were mentioning workers, workers, people generate accounts on social media, accounts generation gradients. That’s all I can implicitly assume in that the server, the abstraction that’s gathering those gradients is reliable. What about the aggregated platform itself being deployed on rogue machines? So imagine you are whatever platform doing learning. By the way, whatever always we have said from the beginning until now applies as long as you do gradient-based learning. So it can be recommended systems. It can be training some deep reinforcement learning of some super complicated tasks to beat, I don’t know the word, champion in poker.
We do not care as long as there’s some gradient generation from observing some state, some environmental state, and some reward or some label. It can be supervised, reinforced, as long as gradient based or what you say apply. Imagine now you have this platform leveraging distributed gradient creators, but then the platform itself for security reasons is deployed on several machines for fault tolerance. But then those machines themselves can fail. You have to make the servers agree on the model, so despite the fact that a fraction of the workers are not reliable and now a fraction of the servers themselves. This is the most important follow up i’m into now and I think there would be something on archive maybe in February or March on that.
And then a third follow up is practical instances of that, so I’ve been describing speculative thought experiments on power poisoning systems is actually brilliant master students working which means exactly doing that, like on typical recommended systems, datasets where you could see that it’s very easy. It really takes you a bunch of active agents to poison, a hundred thousand ones or more. Probably people working on big social media platforms would have ways to assess what I’ve said, and so as researchers in academia we could only speculate on what can go wrong on those platforms, so what we could do is just like we just took state of the art recommender systems, datasets, and models that are publicly available, and you can show that despite having a large number of reliable recommendation proposers, a small, tiny fraction of proposers can make, I don’t know, like a movie recommendation system recommend the most suicidal triggering film to the most depressed person watching through your platform. So I’m saying, that’s something you don’t want to have.
Lucas: Right. Just wrapping this all up, how do you see this in the context of AI alignment and the future of machine learning and artificial intelligence?
El Mahdi: So I’ve been discussing this here with people in the Beneficial AI conference and it seems that there are two schools of thought. I am still hesitating between the two because I switched within the past three months from the two sides like three times. So one of them thinks that an AGI is by definition resilient to poisoning.
Lucas: Aligned AGI might be by definition.
El Mahdi: Not even aligned. The second school of thought, aligned AGI is Byzantine resilient.
Lucas: Okay, I see.
El Mahdi: Obviously aligned AGI would be poisoning resilience, but let’s just talk about super intelligent AI, not necessarily aligned. So you have a super intelligence, would you include poisoning resilience in the super intelligence definition or not? And one would say that yeah, if you are better than human in whatever task, it means you are also better than human into spotting poison data.
Lucas: Right, I mean the poison data is just messing with your epistemics, and so if you’re super intelligent your epistemics would be less subject to interference.
El Mahdi: But then there is that second school of thought which I switched back again because I find that most people are in the first school of thought now. So I believe that super intelligence doesn’t necessarily include poisoning resilience because of what I call practically time constrained superintelligence. If you have a deadline because of computational complexity, you have to learn something, which can sometimes-
Lucas: Yeah, you want to get things done.
El Mahdi: Yeah, so you want to get it done in a finite amount of time. And because of that you will end up leveraging to speed up your learning. So if a malicious agent just put up bad observations of the environment or bad labeling of whatever is around you, then it can make you learn something else than what you would like as an aligned outcome. I’m strongly on the second side despite many disagreeing with me here. I don’t think super intelligence includes poisoning resilience, because super intelligence would still be built with time constraints.
Lucas: Right. You’re making a tradeoff between safety and computational efficiency.
El Mahdi: Right.
Lucas: It also would obviously seem to matter the kind of world that the ASI finds itself in. If it knows that it’s in a world with no, or very, very, very few malevolent agents that are wanting to poison it, then it can just throw all of this out of the window, but the problem is that we live on a planet with a bunch of other primates that are trying to mess up our machine learning. So I guess just as a kind of fun example in taking it to an extreme, imagine it’s the year 300,000 AD and you have a super intelligence which has sort of spread across space-time and it’s beginning to optimize its cosmic endowment, but it gives some sort of uncertainty over space-time to whether or not there are other super intelligences there who might want to poison its interstellar communication in order to start taking over some of its cosmic endowment. Do you want to just sort of explore?
El Mahdi: Yeah, that was like a closed experiment I proposed earlier to Carl Shulman from the FHI. Imagine some super intelligence reaching the planets where there is a smart form of life emerging from electric communication between plasma clouds. So completely non-carbon, non-silicon based.
Lucas: So if Jupiter made brains on it.
El Mahdi: Yeah, like Jupiter made brains on it just out of electric communication through gas clouds.
Lucas: Yeah, okay.
El Mahdi: And then this turned to a form of communication is smart enough to know that this is a super intelligence reaching the planet to learn about this form of life, and then it would just start trolling it.
Lucas: It’ll start trolling the super intelligence?
El Mahdi: Yeah. So they would come up with an agreement ahead of time, saying, “Yeah, this super intelligence coming from earth throughout our century to discover how we do things here. Let’s just behave dumbly, or let’s just misbehave. And then the super intelligence will start collecting data on this life form and then come back to earth saying, Yeah, they’re just a dumb plasma passive form of nothing interesting.
Lucas: I mean, you don’t think that within the super intelligence’s model, I mean, we’re talking about it right now so obviously a super intelligence will know this when it leaves that there will be agents that are going to try and trick it.
El Mahdi: That’s the rebuttal, yes. That’s the rebuttal again. Again, how much time does super intelligence have to do inference and draw conclusions? You will always have some time constraints.
Lucas: And you don’t always have enough computational power to model other agents efficiently to know whether or not they’re lying, or …
El Mahdi: You could always come up with thought experiment with some sort of other form of intelligence, like another super intelligence is trying to-
Lucas: There’s never, ever a perfect computer science, never.
El Mahdi: Yeah, you can say that.
Lucas: Security is never perfect. Information exchange is never perfect. But you can improve it.
El Mahdi: Yeah.
Lucas: Wouldn’t you assume that the complexity of the attacks would also scale? We just have a ton of people working on defense, but if we have an equal amount of people working on attack, wouldn’t we have an equally complex method of poisoning that our current methods would just be overcome by?
El Mahdi: That’s part of the empirical follow-up I mentioned. The one Isabella and I were working on, which is trying to do some sort of min-max game of poisoner versus poisoning resilience learner, adversarial poisoning setting where like a poisoner and then there is like a resilient learner and the poisoner tries to maximize. And what we have so far is very depressing. It turns out that it’s very easy to be a poisoner. Computationally it’s way easier to be the poisoner than to be-
Lucas: Yeah, I mean, in general in the world it’s easier to destroy things than to create order.
El Mahdi: As I said in the beginning, this is a sub-topic of technical AI safety where I believe it’s easier to have tractable formalizable problems for which you can probably have a safe solution.
El Mahdi: But in very concrete, very short term aspects of that. In March we are going to announce a major update in Tensor Flow which is the standout frameworks today to do distributed machine learning, open source by Google, so we will announce hopefully if everything goes right in sys ML in the systems for machine learning conference, like more empirically focused colleagues, so based on the algorithms I mentioned earlier which were presented at NuerIPS and ICML from the past two years, they will announce a major update where they basically changed every averaging insight in terms of flow by those three algorithms I mentioned, Krum and Bulyan and soon Kardam which constitute our portfolio of Byzantine resilience algorithms.
Another consequence that comes for free with that is that distributed machinery frameworks like terms of flow use TCPIP as a communication protocol. So TCPIP has a problem. It’s reliable but it’s very slow. You have to repeatedly repeat some messages, et cetera, to guarantee reliability, and we would like to have a faster communication protocol, like UDP. We don’t need to go through those details. But it has some package drop, so so far there was no version of terms of flow or any distributed machine learning framework to my knowledge using UDP. The old used TCPIP because they needed reliable communication, but now because we are Byzantine resilient, we can afford having fast but not completely reliable communication protocols like UDP. So one of the things that come for free with Byzantine resilience is that you can move from heavy-
Lucas: A little bit more computation.
El Mahdi: -yeah, heavy communication protocols like TCPIP to lighter, faster, more live communication protocols like UDP.
Lucas: Keeping in mind you’re trading off.
El Mahdi: Exactly. Now we have this portfolio of algorithms which can serve many other applications besides just making faster distributed machine learning, like making poisoning resilience. I don’t know, recommended systems for social media and hopefully making AGI learning poisoning resilience matter.
Lucas: Wonderful. So if people want to check out some of your work or follow you on social media, what is the best place to keep up with you?
El Mahdi: Twitter. My handle is El Badhio, so maybe you would have it written down on the description.
Lucas: Yeah, cool.
El Mahdi: Yeah, Twitter is the best way to get in touch.
Lucas: All right. Well, wonderful. Thank you so much for speaking with me today and I’m excited to see what comes out of all this next.
El Mahdi: Thank you. Thank you for hosting this.
Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.