My name is Kate Saenko. I'm a professor at Boston University and at the MIT-IBM Watson AI Lab, and I'm very happy to be here today to talk about my research on dataset bias.

I'll start with the success of AI and computer vision. Computer vision is AI technology that can analyze visual scenes; you can see here an example of it applied to detecting cars, buses, and pedestrians in images, and it's quite good and getting better. Here's an example of computer vision for object detection in a different scene. We can also train computer vision models to classify other objects, even cartoon characters, and we have quite accurate models for face recognition and emotion recognition. A lot of this is becoming a product: your phone might have Face ID, which verifies your face to unlock the device. So that's very exciting. However, this technology also has a problem, one that affects computer vision but also machine learning in general, which is the problem of dataset bias. That's what I want to talk about today.

So what do I mean by dataset bias? Suppose you're training a model to recognize pedestrians. You collect a dataset that looks something like this, you train your neural network, and it seems to work really well on held-out test data drawn from the same kind of data you collected. Now you deploy your model in a product on a car, but this car is in New England, whereas your training data came from California. Immediately you see a very different visual domain, with weather conditions like snow that were missing from your training data, because there's not much snow in California. Pedestrians also look different, because they're wearing heavy coats and so on. All of a sudden, the model that worked really well on the source data you trained on no longer works well. We call this problem dataset bias, or domain shift.

The problem of dataset bias is essentially that the training data looks different from the test data you actually face: the two differ in the distribution of the data. That's the general way of putting it (made precise in the short formal note below). Concretely, it might be the difference between the city you trained in and the new city you're testing in, or a bias toward images collected from the web when at test time the images come from a robot, with different backgrounds, lighting, and poses. Another common dataset shift in machine learning is from simulation to real images: if you train your machine learning algorithm for robotics on simulated data, it won't generalize well to real data. It can also happen with demographics: if your training data is biased so that light-skinned faces are over-represented, but at test time you apply the model to darker-skinned faces, you again have a dataset bias issue, and the model won't work as well on the test data. And it can happen across cultures: say you're classifying weddings and you trained on weddings from Western cultures; at test time, given an image of a wedding from a different culture, your classifier won't generalize and may fail to recognize it. So there are lots of different ways dataset bias can arise; that's my point.
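To make "different distributions" precise, here is the standard way the domain adaptation literature writes it; this notation is an addition for the written version, not something from my slides:

```latex
% Dataset bias / domain shift: the source (training) and target (test)
% joint distributions over inputs x and labels y differ.
\[
p_S(x, y) \;\neq\; p_T(x, y)
\]
% A common special case, covariate shift: the inputs look different
% (snowy streets vs. sunny ones), while the labeling rule is unchanged.
\[
p_S(x) \neq p_T(x), \qquad p_S(y \mid x) = p_T(y \mid x)
\]
```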
Now let's look at what this actually means for the accuracy of a machine learning model. Here is a very simple example on the famous MNIST dataset, which everyone knows: just ten handwritten digits. If we train on this dataset, we know that with modern deep learning we can get very high accuracy, above 99%. However, suppose we train on the same ten digit classes but our training data looks like this instead; this is the Street View House Numbers (SVHN) dataset. That model, tested on MNIST, achieves much lower performance, 67% accuracy, which is really bad for this problem. And even when the dataset bias is less extreme, for example if we train on USPS digits, which to the human eye look quite similar to MNIST, the bias in the data leads to a similar drop in performance. If you're curious, swapping the two, training on MNIST and testing on USPS, gives similarly poor performance. So that's an example of how dataset bias can affect accuracy even in a simple digit classification task; a small code sketch of this kind of cross-domain evaluation follows below.
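Here is a minimal PyTorch sketch of that experiment: train a small CNN on SVHN, converted to 28x28 grayscale so the two domains share an input format, then evaluate on both the SVHN and MNIST test sets. The architecture and training budget are arbitrary choices of mine, so the exact numbers will differ from the figures I quoted, but the gap between the two test accuracies should be unmistakable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# SVHN is 32x32 RGB; convert to 28x28 grayscale so it matches MNIST's format.
svhn_tf = transforms.Compose([
    transforms.Grayscale(), transforms.Resize((28, 28)), transforms.ToTensor()])
mnist_tf = transforms.ToTensor()

train_set = datasets.SVHN("data", split="train", transform=svhn_tf, download=True)
svhn_test = datasets.SVHN("data", split="test", transform=svhn_tf, download=True)
mnist_test = datasets.MNIST("data", train=False, transform=mnist_tf, download=True)

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 7 * 7, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train on the source domain (SVHN) only.
for epoch in range(2):
    for x, y in DataLoader(train_set, batch_size=128, shuffle=True):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

def accuracy(test_set):
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in DataLoader(test_set, batch_size=256):
            correct += (model(x).argmax(1) == y).sum().item()
    model.train()
    return correct / len(test_set)

# In-domain accuracy is high; the same model drops sharply on MNIST.
print(f"SVHN test:  {accuracy(svhn_test):.3f}")
print(f"MNIST test: {accuracy(mnist_test):.3f}")
```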
Now, what about real-world implications of dataset bias? Have we seen this in the real world? Well, yes, I believe we have. One example that's quite famous now concerns face recognition, or rather gender classification: researchers evaluated how well existing commercial systems from Amazon, IBM, and other companies actually work, measuring the accuracy they achieve on different demographics. According to one study, they work much worse on African-American and female faces than on Caucasian and male faces, and that is in large part due to dataset bias. Another very sad example of potential dataset bias is the accident a while back involving the Uber self-driving car, which according to some reports did not recognize the pedestrian because it was not designed to detect pedestrians outside of a crosswalk. If that is your dataset bias, if all the pedestrians in your dataset are on a crosswalk, then your machine learning algorithm will have trouble recognizing pedestrians outside that context, as in this case, where the person was crossing away from a crosswalk.

You might be wondering: wait a minute, can't we just fix this by collecting more data? If we don't have pedestrians outside crosswalks, let's just collect more data like that, right? There are a few problems with that. The first is that some types of events are simply rare; jaywalking pedestrians may be very rare events, and we don't necessarily want to force people to jaywalk so we can collect more data. But another really big problem is the cost of data collection. Imagine we wanted to label images from cars like the example you see here, from the Berkeley DeepDrive (BDD) dataset. Labeling 1,000 pedestrians with the per-pixel segmentation labels you see here, where the labeler has to identify every pixel belonging to each pedestrian, is quite expensive: it costs maybe about a thousand dollars per 1,000 pedestrians. Now imagine the sheer variety of visual data we would want to cover in our dataset: multiple poses, genders, ages, races, clothing styles, and so on, and somewhere in there we want people riding bicycles, or not riding bicycles, or maybe riding tricycles. If you think about how many factors of variation we would have to cover, this very quickly becomes untenable; it is simply too expensive to collect labeled data that is balanced across all of these factors.

So what actually causes the poor performance? You might be wondering about that as well: can't my deep learning algorithm just get better? Maybe I just need a better algorithm that will generalize and do better on test data. Well, there are a couple of problems caused by dataset bias that current models cannot handle. The first is that the training and test data distributions are different. Here you have an example of two digit domains; the blue points and the red points are from these two domains. We visualize the data by extracting features from the images using the deep learning model we trained and then plotting them with a t-SNE visualization (code for this kind of plot is sketched below). You can see clearly that the distribution of the blue training points is very different from the distribution of the red test points. This is actually a theoretical problem: when the distributions are different, we can show there are bounds on how well the model will generalize. The second problem is that a model trained on the blue points is not as discriminative for the red target domain: the features it learned do not separate the target categories well. You can see this because the blue points are much better clustered into categories than the red points; the model may simply not be learning good features for the test points in the target domain.
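Here is roughly what producing that visualization looks like in code. It is a minimal sketch in which random matrices stand in for the features you would actually extract from the penultimate layer of your trained network, and the names feats_src and feats_tgt are mine, not from the talk.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features; in practice these come from your trained encoder,
# e.g. the activations of the layer just before the classifier.
rng = np.random.default_rng(0)
feats_src = rng.normal(0.0, 1.0, size=(500, 64))   # blue: source domain
feats_tgt = rng.normal(2.0, 1.0, size=(500, 64))   # red: target domain (shifted)

# Embed both domains together so they share one 2-D space.
emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(
    np.vstack([feats_src, feats_tgt]))

plt.scatter(emb[:500, 0], emb[:500, 1], s=5, c="blue", label="source")
plt.scatter(emb[500:, 0], emb[500:, 1], s=5, c="red", label="target")
plt.legend()
plt.title("t-SNE of source vs. target features")
plt.show()
```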
Fortunately, there are quite a few techniques we can use to alleviate this; I've listed a bunch here. What I want to talk about today is domain adaptation, though data augmentation and techniques like batch normalization can also help in the case of dataset bias. In domain adaptation, we design a new machine learning approach that tries to adapt the knowledge from the labeled source data to the unlabeled target domain. Our goal is to learn a classifier that achieves a low expected loss under the target distribution. Importantly, we assume we have a lot of labeled data in the source domain, and we also get to see unlabeled data from the target domain; we just don't get to see its labels, because labels are expensive to collect.

So what can we do? The first technique is fairly common and fairly standard in the literature by now: adversarial domain alignment. We take a neural network, shown here as an encoder CNN (we're dealing with images, so we always use convolutional networks), and some training data with labels. If we train it with a regular classification loss, we can generate features from the encoder CNN (I'm showing just two classes for clarity), and the last layer is the classifier layer, so we can visualize the decision boundary it learns between one class and the other. Now suppose we also get some unlabeled data from the target domain; say we put a camera on the robot and let it explore its environment and snap some photos, so it has data, just no labels. If we apply the source-trained encoder CNN directly to this data, we already know we'll see a dataset shift like this: the distribution of the target points is shifted with respect to the distribution of the blue source points.

In adversarial domain alignment, our goal is to align these two distributions, the blue source distribution and the orange target distribution. How can we do this? A very standard approach is to add another piece to the neural network, which we call the domain discriminator: a classifier that tries to assign a domain label to the input examples. We train it with a GAN-style adversarial loss, iterating between two steps: the domain discriminator tries to separate the two distributions, and then we update the encoder in a way that fools the discriminator, so the discriminator's accuracy goes down. In the process, the encoder learns to align the two distributions, so that if everything goes well the discriminator can no longer tell the difference between the domains; the features have essentially become domain invariant. That's adversarial alignment. Here's an example of it working on the two digit domains I showed you earlier: after adaptation with adversarial alignment, the distributions of the red and blue points are aligned almost perfectly, and classification accuracy also goes up considerably. So it's not just that the distributions are aligned; it actually improves classification accuracy.
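Here is a minimal sketch of that alternating training loop, in the spirit of adversarial adaptation methods such as ADDA. The toy linear layers are my stand-ins for the encoder CNN and heads on the slides; the two-step logic (train the discriminator to separate domains, then update the encoder to fool it while still classifying the labeled source data) is the part that matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: in practice the encoder is a CNN and x_* are image batches.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
classifier = nn.Linear(64, 10)            # category head, trained on source labels
discriminator = nn.Sequential(            # domain head: source vs. target
    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)

def train_step(x_src, y_src, x_tgt):
    # Step 1: train the discriminator to tell the two domains apart.
    with torch.no_grad():                  # freeze the encoder for this step
        f_src, f_tgt = encoder(x_src), encoder(x_tgt)
    d_loss = (
        F.binary_cross_entropy_with_logits(
            discriminator(f_src), torch.ones(len(x_src), 1))
        + F.binary_cross_entropy_with_logits(
            discriminator(f_tgt), torch.zeros(len(x_tgt), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: update the encoder to FOOL the discriminator, while the
    # classifier keeps learning the actual categories from source labels.
    f_src, f_tgt = encoder(x_src), encoder(x_tgt)
    cls_loss = F.cross_entropy(classifier(f_src), y_src)
    fool_loss = F.binary_cross_entropy_with_logits(
        discriminator(f_tgt), torch.ones(len(x_tgt), 1))  # target posing as source
    opt_g.zero_grad(); (cls_loss + fool_loss).backward(); opt_g.step()
    return cls_loss.item(), d_loss.item()

# One step with random stand-in batches: 32 labeled source, 32 unlabeled target.
print(train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)),
                 torch.randn(32, 784)))
```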
Another technique I want to mention is alignment in pixel space. Suppose again we have source data with labels and some unlabeled target data. Instead of just doing adaptation by feature alignment as I just showed you, what if we first translate our source data in image space? We generate new images from the originals, but these new images look like they come from the target domain. It's a similar idea of aligning the two datasets, but now we're aligning them in pixel space, because we're actually generating the images themselves rather than just features. The advantage is that once we've trained this generative adversarial network that translates from the source to the target domain, we have data that looks like it came from the target domain but has labels, because the original data is from the source and is labeled with the categories we need for training. And by the way, we can still add feature-space alignment on top of this overall architecture; we experimented with that in our paper, cited at the bottom, if you're interested. Doing both feature-space and pixel-space alignment can further improve performance on the target domain.

Okay, this pixel-space alignment seems pretty neat, but so far we've been assuming that we have unlabeled target data. In fact, what I didn't tell you is that for that method to work, it needed to see quite a lot of unlabeled data from the target domain. What if we only get one image, or a couple of images, from our target domain? Unfortunately, existing methods like CycleGAN, or the CyCADA approach I just showed you, don't quite work. Instead, we want to take a source-domain image, which is essentially our content, and translate it to a new visual domain of which we have only one example. In this example our content is a dog, and we want to preserve the pose of the dog but change its style, its domain, into this other breed (I unfortunately don't know what breed this is; maybe you do), and we have just one example of this new breed.

We proposed a method that can do this, and this is the result: in the generated image, we took the original source image and added the style of the target image while preserving the pose of the dog, so the content is preserved. We call this method COCO-FUNIT, and it was published recently at ECCV 2020. I'm not going to go through the details because I don't have time, but essentially the model takes a content image and a style image, encodes them using a content encoder and a style encoder, and then combines the two encodings using an image decoder to generate the output image (a simplified sketch of this data flow follows below). Here are more examples: the style image is on top, the content image below it, and the resulting image generated with our COCO-FUNIT approach at the bottom. Using just a few, sometimes just one, images of the target domain, where the domain here is the breed of an animal, the pose is taken from the content image while the breed is taken from the style image. You can see this works quite well. And if you're curious, compared to the previous approach we build on, called FUNIT, we improve quite a bit: FUNIT is not able to translate images using just a single style image and generates fairly poor results in this case, and on average, when we evaluate on a large dataset, we also see a significant gain with our COCO-FUNIT approach. So that's another example of pixel-domain translation.
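To make the data flow concrete, here is a heavily simplified, hypothetical sketch of a content-encoder / style-encoder / decoder pipeline. Every layer size here is made up by me, and the real COCO-FUNIT uses adaptive instance normalization, GAN losses, and much deeper networks, so treat this only as an illustration of how the three pieces connect.

```python
import torch
import torch.nn as nn

class FewShotTranslator(nn.Module):
    """Hypothetical sketch: content keeps pose/structure, style comes from
    a few target-domain examples, and a decoder combines the two."""
    def __init__(self):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
        self.style_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, content_img, style_imgs):
        c = self.content_enc(content_img)           # (B, 128, H/4, W/4)
        # Average the style codes over the (few) style examples.
        s = self.style_enc(style_imgs).mean(dim=0)  # (128,)
        # Inject style by modulating content features channel-wise.
        z = c * s.view(1, -1, 1, 1)
        return self.decoder(z)

# One content image, two style examples of the new "breed" (random stand-ins).
out = FewShotTranslator()(torch.randn(1, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```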
One other example I want to show you really quickly uses this idea for adaptation in robotics. Here we have a robot that's trying to insert an object into another object, let's say a peg into a hole; more generally, we can apply this to other manipulation tasks. Our input data comes from a depth sensor, so it looks like this: there's an RGB image, but what we actually use is the depth image, which you can see in the middle here. We want to train a computer vision model, a neural network, that will control the robot arm to perform the task, but for training we want to use simulated images, so we simulate this kind of problem, generate synthetic depth images, and train the neural network on those. The problem we run into, of course, is the gap between the training domain of simulated data and the target domain of real depth images. So we tried using pixel-level domain translation to solve this dataset bias problem without collecting any labeled data in the real target domain. You can see here an example real depth image, then a similar simulated image, and in the last one we take the real image and translate it into the simulated domain. You can see that it now looks a lot more like the simulated data, so we're closing the domain gap.

Okay, great. I'm going to wrap up here and just recap what I talked about. Dataset bias is a pretty major problem for machine learning in general, and for computer vision specifically, which is mostly what I work on and what I focused on today. I showed you a couple of ways we can mitigate this problem, using either feature-space domain alignment or pixel-space domain alignment.

I also think we could discuss, if we have time after this, some even more general ethical issues related to datasets. For example, there was recently a paper, generating quite a lot of interest, that looks at the dangers of large language models and points out that language models are being trained on progressively larger and larger datasets. It's almost the opposite of the problem I talked about: here we have a huge dataset that we're training on, and the problem the authors point out is that this dataset might contain all kinds of bad data, offensive content or even private data, and by training the model on it we don't know what biases or other undesirable things it's learning. So that's a related but different ethical issue. One of the co-authors of that paper, by the way, is Timnit Gebru, who, as you might have heard, was forced to leave Google over this specific paper. So there are quite a few ethical issues here, and I'm happy to discuss those or anything related to what I talked about. Thank you very much for your attention.