Hello everyone, welcome to Robustly Beneficial. Today we will discuss recent advances in robust high-dimensional statistics. This is a book chapter written by Ilias Diakonikolas and Daniel Kane, which is based on a series of recent papers that they, their coauthors, and some other people in other groups published in the past three or so years. The objective is to provide solutions for robust statistical estimation, despite the presence of outliers or maliciously crafted data, like the paper we discussed yesterday. We should say this is more or less your research area, or at least strongly connected to your research area, and it's important. Why is it important for robustly beneficial algorithms? For instance, if you do statistical learning, you are trying to infer something from data, and you will only do this job in a safe way if the data is reliable, in the sense that it reflects some ground truth, something that is really observed. But if someone is hiding part of the data or corrupting part of the data, you might learn something else than what you were aiming for. So for instance, imagine you're trying to estimate the mean salary of Lausanne, or of New York, or any place. You get data about salaries, but if a fraction of this data is corrupted, or is not representative of the mean salary of New York, then you would have a wrong estimation of the mean salary of New York. A simple example: you want to learn the mean salary of New York, you just collect salaries and take the average, and if you happen to have a few billionaires in the data set, then you would think that the average salary of New York is 300 million. A simple solution, historically the first solution, was to take the median, which works as long as the majority of samples are fine. But the high-dimensional framing of this problem is more challenging. It's interesting to see how long the high-dimensional version of this problem had to wait, until recent years, to really take off. The very first paper that was cited, or one of the first papers, was a paper by Tukey. So there was this idea of Tukey's median, which is a robust estimator, meaning that if you have a small fraction of poisoned data, then it still recovers the mean of the inliers, the true data. But Tukey's median is NP-hard to compute. So it's definitely not tractable; it can be very hard in the dimension of the space. So maybe you can insist on the dimension thing, because it's something new. I think like five years ago you still had papers, even in machine learning sometimes, that discussed high dimension with D equal to 20 or 100. These days D, the number of parameters, is extremely huge. Yeah, if you take the literature in statistics and search for when statisticians used big data as a motivation for a paper, up to say 2012 or 2013, they would mean a lot of data points. So let's say there is N, the number of people you're collecting data on. But then for every person, you collect D features. So for example, for Louis, I collect salary, age, school, etc. That's like four features. But if I keep collecting features, it can go to 20. If I look at, for example, national statistics, you can have tables where you have 1 million citizens and then 100 columns for every citizen. So D would be 100 and N would be 1 million.
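To make the salary example above concrete, here is a tiny numerical sketch (the salary figures and sample sizes are made up purely for illustration): a handful of billionaire entries wrecks the naive average but barely moves the median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" salaries: 10,000 people around 60k (illustrative numbers only).
salaries = rng.normal(60_000, 15_000, size=10_000)

# Corrupt 10 of the 10,000 entries with billionaire-level values.
corrupted = salaries.copy()
corrupted[:10] = 1e9

print(f"true mean  : {salaries.mean():>14,.0f}")
print(f"naive mean : {corrupted.mean():>14,.0f}")      # pulled up by roughly eps * 1e9
print(f"median     : {np.median(corrupted):>14,.0f}")   # essentially unchanged
```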
And, to come back to N and D: up to the 2010s, people would care about big N and very rarely care about big D, about very, very large dimension. And today in machine learning, you have situations where you aggregate models, for example parameters of a neural network, where D could be of the same order as N or even much larger. Yeah, I may be discussing this later, but D here is the number of parameters, and these days neural networks usually have at least a million, and sometimes it can go as far as 10 to the 12 or 10 to the 11, I think, for some papers. So now you could say, why is it a problem? Well, here is why it is a problem, and it's actually something that surprised me when I read it; I didn't know this. There are these generalizations of the median. The median, for those who don't know what we're talking about: the median of a set of numbers is, you rank the numbers in order and you pick the one in the middle. That's the median. And if you want to do this for several lists of numbers, in high dimensions, one thing you can do is to take the median for each dimension. This is called the coordinate-wise median: you take the median for each coordinate. Well, I was not that surprised that this is not that great in high dimensions, because you're making a small error for each dimension, and these add up, and in the end you get a total error which is epsilon times the square root of D. The square root of D is very classical in statistics; it has to do with the variance, like the square root of the variance. But anyway, you have this problem. And the thing is that you have other generalizations of the median, including the geometric median, which for a long time I thought was a great candidate for a robust estimator. But it turns out that the geometric median is not performing well either; it still has this epsilon square root of D. And if you have D equal to 10 to the 12, then the square root of D is a million. So it's very, very far from the actual mean. Yeah, and we haven't explained what epsilon is. Yeah, I think we didn't frame the problem correctly yet. So the problem we are working with is: we collect data and we try to estimate the mean of these data points, and some bad actor has poisoned the data, meaning that it possibly removed a proportion epsilon of the data and replaced it with other data points. The goal of the poisoning is to make you learn a different mean, and this way it can manipulate what you learn for your model, or what you think. And usually the poisoning would not just be adding random points, because if the fake points are spread randomly around the mean, they would average out to zero and you would still learn roughly the right mean. The poisoner would rather attack in one direction, pushing all the fake data points in one direction. So that's why the algorithms for robust statistics then try to learn, from the data that is collected, both inliers and outliers: is there one direction in which the data set I'm looking at has a surprisingly large variance? And then we would take action to reduce the variance in that direction. Yeah, the idea of the algorithm, if I just rephrase it, is that you try to see among the data points if there's one direction along which things look very suspicious.
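Before getting to the filtering idea just mentioned, here is a small simulation of that epsilon-times-square-root-of-D effect, assuming Gaussian inliers with identity covariance and a crude attacker who shifts an epsilon fraction of the points by the same amount on every coordinate (all sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 20_000, 1_000, 0.1           # hypothetical sizes
true_mean = np.zeros(d)

# Inliers: N(0, I_d).  Outliers: an eps fraction pushed far along every coordinate.
data = rng.normal(size=(n, d))
n_bad = int(eps * n)
data[:n_bad] = 10.0                       # adversarial points, same shift on each coordinate

cw_median = np.median(data, axis=0)       # coordinate-wise median

err = np.linalg.norm(cw_median - true_mean)
print(f"coordinate-wise median error: {err:.2f}")
print(f"eps * sqrt(d) for reference : {eps * np.sqrt(d):.2f}")
# Each coordinate is only off by O(eps), but those small errors add up
# to an l2 error on the order of eps * sqrt(d).
```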
And, to continue with the algorithm: if you do find such a suspicious direction, and there are actually good reasons, under the right assumptions, why you can identify this bad behavior in one dimension, then you know that something is wrong with high probability, and you can just remove the extreme points, essentially. Then you have a better data set, and you just repeat this procedure. Maybe it's better, maybe it's still not good enough; maybe there is another direction where things go wrong. So you have to repeat and repeat. And what they showed is that this way of data cleaning is efficient. But maybe before we go into this, we can discuss more the threat model, the poisoning model. Because in the paper they discuss a strong version of contamination, of poisoning, meaning that the attacker gets to see the true data set, which is arguable in practice. And then he gets to remove some of these points, but not all of them; somehow it's a bit weird, he can only remove a fraction, but he can choose which fraction he gets to remove. And then he gets to replace it with some other data. Yeah, it's true, in practice this would be a very surprising case. If someone has access to the whole data set and can replace up to 50% of the data, it's hard to see why he couldn't replace 75% of the data. But these rules are necessary for the problem of robust statistics, because in the case of mean estimation, if more than 50% of the data can be poisoned, then there is nothing we can do about estimating the mean: with more than half, the poisoner could build a totally different distribution, and since we observe two separated distributions, we don't know which one comes from the real data. In practice, for today's deployed algorithms, the different poisoning attacks map onto different threat models. The most common, I think, would be that the poisoner can only add new data points. If you think of a recommender system, for example, someone trying to poison it would create some fake accounts to like some specific types of posts, with the hope that those posts get recommended more. The other type of attack, where someone can remove inputs, is quite rare. I think it's just semantics. I'm sure the authors don't literally mean that the attacker would go and remove a fraction; it's just a way to model that someone would participate with mislabeled data, with wrong data. If someone participates with wrong data, the total data set has a fraction that is corrupted, regardless of how the wrong data was injected. And the second very interesting point was about whether the attacker knows the whole input data that the algorithm is using, or whether it doesn't. To come back to what we were saying before: if we try to estimate the mean of a Gaussian and the attacker gives me points that are very far from the rest of the distribution, then I would very easily identify them as outliers. But if he gives me points on the margin of the true distribution whose mean I'm supposed to measure, then I would have a much harder time finding out which points these are. That's where the high-dimensional part becomes tricky, because she or he can give you a small fraction of points close to the true mean, and then, because of the square root of D, because of the high-dimensional aspect of the problem, that is already a hard enough attack.
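Going back to the filtering procedure sketched a moment ago (look for a direction with suspiciously large variance, trim the extreme points along it, repeat), here is a deliberately simplified Python sketch. It assumes the inliers have roughly identity covariance and uses an ad hoc trimming rule, so it illustrates the idea rather than the exact algorithm analyzed in the chapter.

```python
import numpy as np

def filtered_mean(points, expected_var=1.0, trim_frac=0.01, max_iter=50):
    """Crude iterative filter for robust mean estimation (illustrative only).

    Assumes inlier coordinates have variance around `expected_var`.
    """
    pts = np.array(points, dtype=float)
    for _ in range(max_iter):
        mu = pts.mean(axis=0)
        centered = pts - mu
        cov = centered.T @ centered / len(pts)

        # Direction of largest variance = top eigenvector of the empirical covariance.
        eigvals, eigvecs = np.linalg.eigh(cov)
        top_var, top_dir = eigvals[-1], eigvecs[:, -1]

        # If no direction is suspiciously spread out, stop and return the mean.
        if top_var <= 1.5 * expected_var:        # ad hoc threshold
            return mu

        # Otherwise remove the points most extreme along that direction and repeat.
        scores = np.abs(centered @ top_dir)
        keep = scores <= np.quantile(scores, 1.0 - trim_frac)
        pts = pts[keep]
    return pts.mean(axis=0)
```

Run on the corrupted data set from the previous snippet, a filter of this kind typically brings the error well below the epsilon-square-root-of-D level of the coordinate-wise median, at the price of repeated covariance computations and eigendecompositions, which is where the polynomial, rather than linear, running time comes from.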
So for example, on that point about small deviations: we demonstrated in the context of distributed machine learning that an attacker who puts data points very close to the true mean, but with a small deviation that adds up over several coordinates, or, the opposite, one deviation on a single coordinate, can already do a lot of harm. So depending on what you're trying to estimate, small deviations could already be a very big problem. Yeah, exactly. That's the point: we would detect large deviations, and that's why poisoning, to be able to manipulate you in a way that you will have difficulty detecting, needs to use small deviations. And that's why it's interesting to study the two cases, where the attacker knows your input data and where the attacker doesn't know it. If he doesn't know it, then it's much harder for him to know how to manipulate. Maybe to reframe, my remark was that in high dimension, small is already big; a small deviation in high dimension is already something... When you say small, do you mean square root of D or not? Yeah, because of the square root of D, you have high margins of attack. Yeah, and in the context of the paper, they consider that the square root of D is big. So to me, what's interesting also about having this very strong model is not that you're necessarily going to face a strong adversary; it's that you're preparing for things that can go very bad. Maybe one of the databases was more important than the others and was biased in some sense, or it crashed and you lost that data. Maybe you have some genes on Excel that mutated: there's this story that something like 20% of genetics papers have flawed data because Excel automatically modified the values of some gene names. And you want your algorithms to still perform correctly, even though there were likely some distortions in your data that you did not anticipate. So I think it's interesting also in this regard. I've been trying to think about, for instance, what's really going on in recommender systems; you have this debate about whether that counts as poisoning or not, it's hard to say. Maybe a more general form of poisoning, or data distortion, is that there is this sort of true data that you want to collect, but for many possible reasons you observe something else. Maybe there were biases in the sampling, because it was a form on the internet and you only receive data from people who actually answer forms. Maybe there was a distortion because of the people who reported it, maybe some journalist reported what they saw. So there is this data distortion. More generally, the problem is that you have this true data you want to learn from, you want to apply your state-of-the-art machine learning on the true data, but what you actually observe is some distorted version of this data, and you don't know what the distortion is. You want to have some robust learning, but you have to apply it to the distorted data because you don't have access to the true data. And in the end, you want your robust machine learning, learning from the distorted data, to be close to the machine learning algorithm that would have learned from the true data. I think this is a more general setting of the problem. Okay. So from what I understand, your setting contains the setting of strong poisoning that we discussed in the paper, where an attacker can change an epsilon proportion of the data. Yeah.
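One way to write down this more general framing (our notation here, not the chapter's): the learner only sees a distorted data set f(S), for some unknown distortion f within a family F, and wants its output to stay close to what it would have learned on the true data S.

```latex
\[
  S' = f(S), \quad f \in \mathcal{F} \ \text{unknown};
  \qquad \text{find } \mathcal{A} \ \text{keeping} \
  \sup_{f \in \mathcal{F}} \; d\!\left(\mathcal{A}(f(S)),\, \mathcal{A}^{\star}(S)\right) \ \text{small.}
\]
```

Here A-star stands for the algorithm you would have run on the true data, and the strong contamination model of the paper is the special case where F contains every map that replaces at most an epsilon fraction of the n points.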
So the idea is all in the distortion. And if you allow full power on the distortion, of course, you cannot do anything. But then, what's done in the paper is that you consider different types of distortions, more precisely different sets of possible distortions by the attacker. Okay. So maybe a set of distortions is like a set of functions that remove data points and add other data points, but you can have more sophisticated kinds, and you want your robust machine learning algorithm to work for all of these: no matter what the distortion is within that set of distortions, you want to be learning something close to what you wanted to learn. Okay. So the robust algorithm would depend on what we know about the set of possible distortions. Yeah, definitely. From the attacker. Yeah, definitely, or at least the guarantees you can provide do. How important do you think is the work to build systems that are not subject to distortions, compared to accepting that the system will receive distorted data and working on still learning something correct out of that distorted data? Do you see these as two different problems that can be explored? Well, if you frame it very generally, I guess they can be matched up. I guess it's interesting then to add more details. One thing that's interesting, maybe, taking a step back, is that in practice, for instance if you go on social media, you have a lot of metadata about the data. Like, for instance, this video came out of the Robustly Beneficial account, so probably it was good and it was not poisoned, hopefully. So you do have this other data, these other things, and you can assume safely, or more or less safely (you can also have a probability distribution over the possible threat models), that the attacker cannot modify this kind of data coming from this account, for instance. So you can have a more detailed attack model by doing so. I'm just trying to put this forward as an interesting research direction, which I'm not aware of being explored so far. Because so far, it's really just adding and removing points, and you're assuming that we don't know the origin of the data, for instance. And if you're doing journalism or research, you should not trust data just because it's there; the source of the data is a very important piece of metadata, and you want the machine learning algorithm to be using this. Yes, definitely. I guess Twitter is fighting a lot of fake accounts that try to manipulate the Twitter feed and how much some tweets go viral or not, and I'm nearly sure they use a lot of metadata about the accounts they're looking at to detect whether they are fake accounts: when were they created, were they all created yesterday and all liking the same tweets? That's an example of using some metadata, some aspects of the data, to adjust their credibility, their reliability, how much you trust them. There's a counterexample, not exactly on using metadata: there was this conspiracy video that went on the front page of YouTube, and when YouTube apologized, they said it's because the video contained sequences from CNN. So the algorithm detected that the sequences came from CNN, and the algorithm updated the reliability of the video upward, because it contained frames from CNN. But that's exactly not what this should mean.
What it should mean is that the video really comes from the real account of CNN. But then there's a blurry zone, because... Would you say it was wrong to update because it contains CNN images? It doesn't feel that bad to have a few... I don't know. It makes sense, but it can be exploited. If I know that you use this metadata to increase the reliability of the data, then of course I can inject metadata: I can produce an anti-vax video and insert frames from the Pasteur Institute and the CDC and other research institutes inside it, so the algorithm says, oh, this is a video about vaccines, it contains frames from the Pasteur Institute and from the CDC in the U.S., the Centers for Disease Control, so it's likely reliable, and then I promote it. I feel like researchers are doing this sometimes: they first look at the bibliography of the paper, or at the authors. There are many studies showing that if you have MIT or, I don't know, Stanford or EPFL in your affiliation, you're more likely to get accepted if the review is not double-blind. But that's harder to hack from an outsider; if you're not at MIT... Yeah, but for now it's easier to do with algorithms, because YouTube receives, what, 50,000 hours of content every hour, so you can't scale, you can't check every video manually. So if people start injecting frames and your algorithm gives reliability labels based on where the content comes from, this might be hacked in the same way: I just inject CNN frames, or, I don't know, reliable Centers for Disease Control videos, inside my anti-vax video. Now, we discussed also that there are some kinds of data that are much harder to manipulate. If the data is cryptographically authenticated, for example with our Google accounts: no one can create the same Google account as mine, and if you create a new Google account, you are not able to create an old Google account. True, true. That's why I say it's not exactly that: someone posting frames from the Centers for Disease Control should not be up-weighted, but a video coming from the account of the Centers for Disease Control should be up-weighted. And I think that's probably what YouTube or any other platform is already doing in some sense. But I'd say there needs to be research about how to do this in a very robust manner, with a well-defined threat model. It's not evident; it's not easy to implement. Yeah, with these things I feel like there's a high potential for false good ideas, ideas that sound good but are counterproductive. So, yeah, theorems are more reliable. But in the end, with theorems you have to look at the assumptions. Yeah, as well. So, yeah, maybe we can talk about the results of the paper, well, the results discussed in the paper. Essentially, what they say is that they did it: they found an algorithm, for the mean estimator. I guess it seems like a very specific problem in statistics, but actually the mean is very much everywhere, especially when you're doing machine learning, like gradient descent, even if it's just a mean. And they successfully provide a polynomial-time algorithm to compute a robust estimation of the mean whose error scales with epsilon and does not depend on D; that's the big innovation. But they do have, yeah, I'm going to criticize a little bit. Overall, the paper is really impressive; what they've done is really amazing.
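As a rough way to state the contrast being referred to here (up to constants and logarithmic factors, and assuming Gaussian data with identity or otherwise known covariance, a point that comes up next):

```latex
\[
  \underbrace{\;\lVert \hat{\mu}_{\mathrm{coord.\ median}} - \mu \rVert_2 \;\asymp\; \varepsilon\sqrt{d}\;}_{\text{coordinate-wise / geometric median}}
  \qquad \text{versus} \qquad
  \underbrace{\;\lVert \hat{\mu}_{\mathrm{filter}} - \mu \rVert_2 \;\lesssim\; \varepsilon\sqrt{\log(1/\varepsilon)}\;}_{\text{filtering algorithm, } \mathrm{poly}(n, d)\ \text{time}}
\]
```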
But all along the paper, I had this strong feeling that, okay, but you need to know the covariance matrix. Which is D by D, so D squared entries. Well, yeah, that's not that much of a problem, because actually the algorithm runs in time polynomial in N and D, and they need N larger than D, because otherwise you don't get any improvement. So N times D is already more than D squared, and it's already more than we can handle with current computers. I'm just saying, in contexts where D is very large compared to N, that's what you are afraid of. Yeah, but that's not the setting of the paper; in the paper, N is larger than D. I was already going to the setting where we care about D larger than N. Yeah, we'll get back to that; let's just discuss the results first and then we can move to this. Yeah, so you need to know the covariance matrix, and it's actually a big flaw, because in practice, if you're doing gradient descent, say, you don't know the covariance matrix. And it's really critical to know the covariance matrix, because you then do this outlier removal in directions, and you need to know the covariance precisely, well, precisely enough to do this. But they also cite another paper, which is a very recent paper from last June, which I was very, very impressed by. It's a paper by Depersin and Lecué, not sure of the names, I'll put it in the description. And it's a paper that achieves this best possible guarantee, in an even more refined way. It's in the threat model where the attacker can only add poisoned data, so it cannot remove data points. And they essentially achieve the best possible bound using a linear-time algorithm, linear in the total input size of the problem, so N times D. So that was really, really impressive, and I was almost sad when I read the paper, because I felt like it closed the question, like a problem solved, nothing left to research. But it's great news that this important problem has been solved. Yeah, so you want to move on to the problem of D larger than N? So one of the main motivations, the reason why people started researching robust statistics in high dimension in the past few years: remember my first comment about a million citizens and a hundred columns? That's what you would do in classic statistics, I would say. You have statistics on people, you have data sets, and then you build a model whose dimensionality is not much larger than the initial data you had on people. But now in machine learning, we have models that are themselves very high-dimensional. If you look at a neural network, or a matrix factorization problem, which is used in recommender systems a lot, or just neural networks, which most people in machine learning know: a neural network today has a lot of parameters. And when you train it with gradient descent, you are working with stochastic estimations of the gradient. And that is a problem where you care about mean estimation. So if you want to use, as a building block, a black box that does mean estimation, you would care a lot about its complexity in D, because if you do gradient estimation, typically you have many more dimensions than estimators of the gradient. Let's say you're doing distributed machine learning on a hundred machines or a thousand machines; your model has a million parameters, or sometimes even a billion parameters. So D would be one billion and N would be a few hundred or a thousand in the best case.
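To make the distributed numbers concrete, here is a minimal sketch of a single aggregation round: each of N workers sends a D-dimensional stochastic gradient, and the server combines them robustly instead of averaging. The coordinate-wise median used here is just the simplest linear-in-N-times-D robust aggregator, not the method of either paper we mentioned, and the sizes are scaled-down, hypothetical values.

```python
import numpy as np

def robust_aggregate(worker_grads):
    """Combine one D-dimensional gradient per worker via the coordinate-wise median.

    Runs in time linear in N*D, but, as discussed above, its error can grow
    like sqrt(D) when a fraction of the workers is adversarial.
    """
    return np.median(np.stack(worker_grads), axis=0)

rng = np.random.default_rng(0)
n_workers, dim, lr = 100, 10_000, 0.01   # D much larger than N, scaled down to stay runnable
params = np.zeros(dim)

# One toy round: honest workers send noisy versions of the same gradient,
# while 10 compromised workers all push hard in one direction.
true_grad = rng.normal(size=dim)
grads = [true_grad + rng.normal(scale=0.1, size=dim) for _ in range(n_workers)]
for i in range(10):
    grads[i] = true_grad + 100.0

params -= lr * robust_aggregate(grads)   # robust step instead of a plain average
```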
So these are situations where D exceeds N by a lot, and this makes any solution that requires N to be larger than D impractical. In particular, if D is 10 to the power 9, one billion, you have to keep in mind that D squared is 10 to the power 18, and you don't want to run a computation with 10 to the power 18 operations: that's 10 to the power 9 seconds, something like 30 years, on a one-gigahertz CPU. So what we discussed yesterday is that in that case, simply by the statistical complexity of the problem, we can't have a good estimation of the average anyway, even without poisoning? No, no, here you have to go back to the solutions that are linear in D but have the square root of D error. So you would go back to things like the geometric median, or medians in general, the median family; there are many ways of doing it, they run in linear time in the dimension, but they have an error that is square root of the dimension. No, but I guess these are asymptotic behaviors. The question you can ask is, in practice, how does something like the geometric median compare to, for instance, the new paper that came out, for N equal to 100 and D equal to 10 to the 9? Is there a gain from doing something more sophisticated? I think the jury is still out, because the paper is just new. But maybe there can be improvement. Though you shouldn't expect big improvements? I don't know. In your model, you're assuming that what can be compromised is a machine, right? But if you assume that the data on the machines can be compromised as well, then N is more like the batch size. And what batch sizes do you usually use? I guess it's in the thousands? No, it depends; sometimes in the 30s, in the 10s. So N is very small. Yeah, so it's not clear that there's a gain. One thing the new paper does is that it's actually not a square root of D that they have, it's a square root of the trace of the covariance matrix, which is sort of like the effective dimension of the data, and maybe this can be much smaller than the actual dimension of the data. Again, I guess the jury is still out. But I think there are a lot of exciting new research directions around this, both implementation-wise (I think these ideas need to be implemented at some point, if you want to test them in practice and also if you want to use them in practice) and on the theory side, where there are lots of interesting questions about all of this. Yeah, I'm particularly curious about different threat models where you completely change the data, but not in a simple way: the distortion can be not just adding and removing, there can be a more sophisticated structure in the data. And maybe in this case, even if most of the data is corrupted, maybe you can still learn a lot, if you have the right hypotheses. Yeah, so this concludes this discussion about robust statistics. Yeah, I think the scale of it is going to be more and more important; well, it's already extremely important, these algorithms are deployed. And as algorithms are becoming more and more powerful, more and more influential, then more and more actors will want to use them for their own purposes. So robust machine learning is becoming very critical, and there's a growing community doing research on this, but I think it needs to be bigger. So if you're interested in this, we recommend the papers.
You can also check the wiki, where we have brief descriptions of the results. And I guess we'll see you next time. Next time we'll be discussing emotion contagion on social media, which is a paper by Facebook that was very interesting and led to a lot of controversy.