Adversarial Examples, meet Computational Complexity
This blog post summarizes my notes on Adversarial Examples from Computational Constraints by Sébastien Bubeck, Eric Price, and Ilya Razenshteyn (2018). The paper asks a fundamental question in machine learning: Why are adversarial examples a thing, anyway? What’s stopping us from building the robust classifiers of our dreams?
There are several schools of thought on this, and they’re not mutually exclusive.
This paper is the first one that I’m aware of to propose option (3), and the authors present a setting in which it is actually the case! They construct a family of distributions such that that, even with enough data so that it is information-theoretically possible to learn a robust classifier, it is impossible to find a robust classifier in polynomial time under the SQ (“Statistical Query”) model of computation. This suggests a new trade-off between computational complexity and robustness in classification algorithms.
But before we get there, let’s review what the goal is: robust learning.
Now we can roughly understand the main theorems from the paper. Here I’m writing them in qualitative terms, but we’ll get to the rigorous ones in a bit.
Now, let’s talk a bit about that SQ (“Statistical Query”) model of computation. It’s one I hadn’t heard of before this, so maybe you haven’t either. The gist isn’t too complicated, though: There’s a secret distribution that the SQ oracle knows, but you don’t. You get to ask it a question in the form of a [0,1]-valued function, and get back an expected value over the distribution (plus a little noise). Very reasonable!
Of course, it does matter quite a bit how much noise we get in our answers. To make this a fair fight, since the paper is going to show a hardness result (that the SQ model isn’t able to find our robust classifier efficiently), we’ll assume its quite a precise oracle, with only a little noise.
That’s a lot of precision! You’d need more than a polynomial number of samples to achieve it (polynomial in d, the problem dimension).
So it’s actually pretty surprising that the authors find that even with this high precision, there’s still a task out there that requires an exponential number of queries for robust learning… and yet is easy to learn without robustness… and yet robust learning is information-theoretically possible!
Now, the SQ model is not the be-all end-all of computational hardness, of course. In fact, there are some examples we know of that have exponential SQ hardness, but polynomial time algorithms using other methods. They’re rare (so far), but we don’t really know when to expect it to happen. We’re really just grazing the tip of the iceberg on SQ here, but suffice it to say an efficient algorithm for this problem going outside of the SQ model would be quite the breakthrough of its own.
Okay, now is a great time to skip to the takeaways at the end if you’ve only got a few minutes to spare on this blog post, because we’re about to get into the nitty-gritty. It’s time for everyone’s favorite: a ~definition dump~. (Don’t worry, I annotated it!)
So far, so good: We’ve got a loss function that takes into account an epsilon-ball around each data point, and a definition for a robust classifier based on that loss function. Now, let’s set up our classification task.
Whoo, okay! So we’ve got two types of robustness: robustly learnable and robustly feasible. Feasible’s the easier one: it just says that there’s a classifier out there somewhere. Learnable says we can actually find it with a classification algorithm and the right amount of data. But don’t get too attached to that distinction — we’re about to find out that they’re one and the same!
Neat! Let’s prove it. This commences the “Information Theory” part of our program. As far as I could tell from the paper, the proof is really just a bunch of Chernoff bounds daisy-chained together.
Great! Well, almost. There’s a pesky little note up there on the right side of the claim: Assuming a finite set of classifiers. That’s quite the assumption, because usually when we’re talking robust classifiers we’re looking at neural networks. And neural networks can generate an (approximately) continuous space of distributions.
Now it turns out we can handle this, although I haven’t included the full details in my notes for the sake of space. The basic idea is that you use this thing called a “covering number”, which more-or-less tells you about how many epsilon-balls you need to cover the continuous space. If the covering number is small enough, everything works out — you get a result of the same form as before, but now n has a log dependence on the covering number instead of the total number of classifiers. I’m hiding the hand-waving behind an html tag, but you’re welcome to take a peek.
Okay, we’ve been good, eaten our vegetables, and proved learnable == feasible. Let’s get to the fun stuff! What’s the distribution that’s learnable, but not efficiently?
It’s this guy! More or less. We’ve got two discretized Gaussian looking things, which are beautifully interleaved with each other. I bet that won’t make it super hard to learn using our expected value SQ oracle or anything… *shifty eyes*.
We’ve got some technical properties (proofs of which I’m not including), but the main thing to understand is that the two distributions are well-separated from each other, and each one has its own non-overlapping set of intervals where almost all of its sample points will fall. Oh, and a whole bunch of their moments match. (Remember that telling-things-apart-by-expected-value thing? Yeah, it’s not subtle.)
DA and DB aren’t actually our distributions, though — there’s one more step to extend this to higher dimensions. First we set up a family of k-dimensional subspaces that are pairwise close-to-orthogonal.
And that’s all there is to it!
Now, we can prove a few lemmas to show that the properties we wanted all hold:
And all together, this means our fancy distribution does, in fact, admit a robust classifier.
Let’s recap. What was the point of all this again?
This last bit is new! This part makes some intuitive sense to me: our distribution looks a lot like a (discretized) Gaussian, so it stands to reason it should take lots of queries to tell it apart from one, if all we get to look at are expected values. The specifics of this are actually completely nontrivial to prove, and it goes much deeper into the SQ model — but for that, you’ll need to read the paper. 😉 (Check out Appendix B.)
As promised, it’s also easy to learn without robustness. A hand-wavy sketch, but there’s not much else to see here. We’ve accomplished what we came for!
I’ll leave you with some closing thoughts and a few vague pointers to other work in this space. And always feel free to let me know if you have any papers you’d like to see in this blog next, and maybe it’ll make it into a post!