Good morning, everyone. I am Akanksha, your moderator for the day. Next up we have Sanjay Arora and Uli Drepper, data scientists from Red Hat. Over to Sanjay and Uli.

So yeah, I'm Sanjay. This is a joint talk with Uli; we are both in the AI Center of Excellence at Red Hat. Can you guys hear me at the back?

This is mostly Sanjay's work, so I just delegated to the Indian in the room. We have given a couple of these talks at various conferences, and the feedback in general was good, but it very quickly went over people's heads if they weren't into this topic. So Sanjay tried really, really hard to make it tangible so that people can take something away from it. Just don't assume that afterwards you can put "data scientist" on your resume.

Okay, so let's get started. This is a talk about unsupervised deep learning, as you all know. So what is unsupervised learning? Very simply, unsupervised learning is finding patterns in unlabeled data. The next question is: what is unlabeled data? It's data that's not labeled. Labeled data is data where you have some annotations. As an example, if I give you a bunch of images and say this image is a cat, this image is a dog, then "dog" and "cat" are labels. Or I can give you a bunch of log files and tell you which ones are anomalous and which ones are not. Those are labels. But in real life, most of your data is not labeled. Imagine Red Hat, imagine any company. Forget companies, imagine the data on your phone. That's not labeled. You have a ton of pictures and you just collect them.

So one of the holy grails in machine learning is figuring out how to do unsupervised learning, learning from data that doesn't have labels, in an efficient way. Because if you can do that, you can actually use the data you collect; otherwise you have to manually annotate it and then feed it to what's called a supervised learning algorithm. You can imagine that many of the companies in this new economy wouldn't exist except for unsupervised learning, because they're all collecting gigabytes, terabytes, exabytes of data. No one could annotate all of that. So it's all about unsupervised learning. Supervised learning is a luxury which hardly anyone has.

Even if you don't do machine learning, there are some examples you might have seen or at least heard of. The simplest possible one: I give you a bunch of numbers, say the heights of people I randomly sample on the street, and you plot them. You make a histogram of how many people there are at each height, and what you'll see is something like a bell-shaped curve: a peak that then falls off. Technically that's a normal distribution, or Gaussian distribution, and you can fit it. What's the mean? What's the standard deviation? Can I compute those from the data? That's the simplest possible unsupervised learning algorithm: you just measure something in the data set you have. Of course you can make it more complex. Real data doesn't have to be normally distributed; it can be messier, more complex. So you can use a mixture of Gaussians, where you take a weighted combination of several Gaussians, and you can make it more complicated still. But the fundamental idea is that there are no labels there. It's just the raw data.
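A minimal sketch of that first example, fitting a single Gaussian and then a mixture of Gaussians, assuming NumPy and scikit-learn; the height numbers are synthetic, purely for illustration:

```python
# A sketch of the height example: "fit" one Gaussian by measuring mean and
# standard deviation, then fit a mixture of Gaussians to messier data.
import numpy as np
from sklearn.mixture import GaussianMixture

heights = np.random.normal(loc=170, scale=8, size=(1000, 1))  # fake survey, in cm
print("mean:", heights.mean(), "std:", heights.std())  # simplest possible model

# Messier data: two overlapping populations, fit as a weighted mixture.
messy = np.vstack([np.random.normal(160, 6, (500, 1)),
                   np.random.normal(178, 7, (500, 1))])
gmm = GaussianMixture(n_components=2).fit(messy)
print("component means:", gmm.means_.ravel(), "weights:", gmm.weights_)
```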
Another example is clustering. Clustering is: give me some data and put it in groups. Make these disjoint groups that don't overlap, and there are no labels there. No one tells me what cluster each point should belong to. I just get the data, and I tell you these data points are similar, that other group is similar, and each group gets an ID.

Yet another example is finding latent, or hidden, dimensions in your data. What does that mean? The simplest example I can give: imagine images of dogs and cats again; I'll use that example a lot. You have 100 by 100 pixel images, so you have 10,000 pieces of data for every image. But as a human being, if you come to me and ask what's in the image, I won't read off 10,000 pixel values. I'll say: that's a dog, that's a cat. So the dogginess or the cattiness of the image is a hidden variable. One can make it slightly more complicated and say it's a dog where the image was taken during the day, or a cat where the image was taken at night. Then what animal it is and when the picture was taken are two latent variables. Humans are very good at this; we are very good at summarizing information and transmitting it to each other. Computers are not. So if you have algorithms that can find these hidden variables, maybe you can interpret them in physical ways, or maybe they are complex variables you just write down. But that's an example of unsupervised learning. I'll throw some buzzwords out there: the simplest possible technique is principal component analysis, where you're looking for linear subspaces. Don't worry about the details, just know that it exists. There are also generalizations called manifold learning where, very abstractly, you're learning nonlinear versions of the same thing. But again, the techniques don't matter; it's the idea of extracting hidden variables.

Yet another thing you can do with unlabeled data is ask: can I use supervised learning, the techniques Uli was talking about for labeled data sets, on an unsupervised problem? The simplest supervised example is linear regression: predict the price of a house from variables for the house, location, age, and so on. Can I use those supervised techniques to solve an unsupervised problem? Here things get a bit more hacky. You could say something like: okay, give me all the data, and I'll predict one of the features of my data set from the other ones. It's not really an annotation, it's not really a target variable, I'm just making it up. But the idea is that if I predict, say, the age of the house from the rest of the variables, I can then inspect the learned model to learn something about the data set. That's a technique that's often used.

Another one is the anomaly detection example. We have a lot of log files for some system, OpenShift or something, and I want to find which ones are anomalous and which ones are not. It's unsupervised; no one gave me labels for these log files. But I can label all of them as zeros; I pick that arbitrarily, this is class zero. Then I can pick some log files from a completely different system and label all of those as ones. Now I can train a model to distinguish between log files of my system and the other system. Once it's trained, where my model is very confident an example is a zero, it's okay; the less confident it gets, the less okay. This is sometimes called semi-supervised learning. But again, the crucial part is that you can use supervised learning techniques to do unsupervised learning.
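To make a couple of the examples above concrete, here is a minimal sketch of clustering with k-means and of extracting linear latent dimensions with principal component analysis; the data is random noise, scikit-learn is assumed, and no labels are used anywhere:

```python
# Clustering and latent dimensions on unlabeled data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(500, 10)  # 500 unlabeled points with 10 features each

cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # group similar points
latent = PCA(n_components=2).fit_transform(X)  # 10 features -> 2 hidden dimensions
print(cluster_ids[:10], latent.shape)
```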
So what is unsupervised deep learning? That's the title of the talk. First, what's deep learning? Deep learning is, I guess the only thing you could say is, the art and science of training neural networks, or deep neural networks. And the "art" is important: it's not always well defined in practice, and there are many tricks and heuristics. But you can think of a neural network as a machine, a box, that can learn any reasonable function to arbitrary accuracy. If you give me images and labels, imagine writing code by hand that says: here are the pixels of an image that has a cat, let me predict that it's a cat. Writing that code by hand over raw pixel values is messy. But a neural network says: give me enough examples and I'll tell you what the function is, and it gives you the mathematical function. So if I have a device like that, which can learn arbitrary functions very well, how can I attack unsupervised learning? What can I possibly do?

So hopefully I'll get a world record entry from this; this should be the shortest neural networks course, but maybe it's not. I just want to give you an overview, because the core ideas are always very simple. The details are not, but the core ideas are.

So what's a neural network? Well, first, what's a model? In this case we are looking at supervised learning, just for these two minutes. That blue box in the center can be any statistical model. It has one job in life: it takes an input x, where x can be pixels from an image, tokens from a language (words or characters from a piece of text), or any features, such as the housing data we were talking about. And there's an output y, which is what I want to predict; that's my target, and it depends on the problem you're solving. If you're doing fraud detection, it's whether my transaction is fraudulent or not. If it's house prices, it's the price of the house. If it's an image data set, maybe it's what the image contains.

A neural network makes a specific choice when it comes to building that box. One way to look at it is that a neural network is a very simple iterative procedure. It takes a vector (by vector, I mean a list or an array of numbers), which is your input. For an image, the vector would be all the pixels listed in an array. That vector comes in, and the neural network says: I'll just map it to another vector v1, then v2, then v3; I iteratively keep mapping the vector, and at the very end I predict a number. So the question is: what are these sequentially mapped vectors? You make one of the simplest possible choices. You say: I don't want to make this messy. To go from one vector to the next, I take the previous vector and multiply it by a matrix, a linear algebra object whose one job in life is to map vectors to vectors; that's what it's constructed for. If you look at that equation, that's the M (M stands for matrix): v1 gets multiplied by the matrix, then you add a constant vector called b. And then you need one more thing; this alone is not enough. You feed the result to a nonlinear function f, so the rule is v2 = f(M v1 + b). There's a technical reason for the nonlinearity, but basically it's this simple procedure that you repeat. For the rest of the talk, forget the equation. What's important is that there's a matrix in it, and that b vector, neither of which depends on your data; they come from outside your data set.

And those things, the matrices and the b's, are called weights. If you ever see a talk on deep learning or AI or anything like that, they'll often say "weights and biases"; that's what they mean. Your neural network takes the data, but it also has parameters inside it that decide what this mapping from vectors to vectors will be. And the whole point of training neural networks, if you want to summarize all this buzz about distributed training and GPUs and so on, is that eventually all we are doing is finding those weights. What are the weights I should put in there? That's the whole point of training.
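As a minimal sketch of that iterative procedure, here is the forward pass written out in NumPy; the weights are random placeholders, since finding good values for them is exactly what training is for:

```python
# The iterative mapping v_next = f(M v + b), repeated layer by layer.
import numpy as np

def f(v):
    return np.maximum(v, 0.0)  # a common choice of nonlinearity (ReLU)

v = np.random.rand(10000)         # e.g. the 10,000 pixels of a 100x100 image
widths = [10000, 1000, 500, 100]  # sizes of the successive vectors
for n_in, n_out in zip(widths, widths[1:]):
    M = np.random.randn(n_out, n_in) * 0.01  # the matrix (weights)
    b = np.zeros(n_out)                      # the bias vector
    v = f(M @ v + b)                         # map to the next vector
print(v.shape)  # (100,)
```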
So of course I skipped a lot of things, and I'll go through them quickly, just to plant a seed. We didn't talk about the exact layers. These blue bars are called layers, and there are specialized layers depending on the kind of data you have: for images you have convolutional networks, and for sequential data there are recurrent neural nets, vanilla ones, LSTMs, GRUs, lots of buzzwords. The point is that you can have specialized layers for special data.

The other thing is: how do you actually find those weights? We know we need to find them, but how do you do it? The dominant method is something called backpropagation with gradient descent. Again, we won't get into that; we don't need to.

The third piece, which is important, is something called the loss or cost function. The idea is very simple. You have a neural net, you take an image, and you predict what the image is. Maybe your prediction is that it's a dog, but the actual label is that it's a cat. I need to be able to measure the difference between your prediction and the actual label, and that is measured by the loss function. Very simply: if I give you two numbers, say four and five, and ask what's the difference, you can subtract and take the absolute value, or subtract and take the square. That's a loss function. And the goal of training in deep learning is to minimize that cost function, to minimize the difference between your predictions and the actual values. Then there's a lot of theory and heuristics that, of course, we won't get into.

One thing to add: these kinds of choices are examples of what could be done. They are there because people found them useful, and 99% of the people using this just copy the useful things. It's not the only way; there are millions of different ways this can be done, each with its own merits. There are just a couple of constraints. Backpropagation, for instance, has the requirement that your functions be differentiable. That puts a certain requirement on the design, but it does not mean it has to be exactly like this. If you want to use existing software, you follow this; but if you are clever, you can come up with something yourself, and it might work just as well.
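Putting those pieces together, here is a minimal sketch of a training loop in PyTorch (one common, Python-based choice, not anything prescribed by the talk); the data is random, purely to show the mechanics of the loss, backpropagation, and gradient descent:

```python
# A model, a loss function measuring prediction-vs-target difference, and
# backpropagation plus gradient descent to find the weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()  # squared difference between prediction and target
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(128, 20)  # inputs (features)
y = torch.randn(128, 1)   # targets (labels)
for step in range(100):
    loss = loss_fn(model(x), y)  # how far off are the predictions?
    opt.zero_grad()
    loss.backward()  # backpropagation computes gradients of the loss
    opt.step()       # gradient descent nudges the weights downhill
```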
And at that stage, you can come back to the original question. If I give you this thing that maps vectors to vectors sequentially and finds the weights so that it can make accurate predictions, how do I solve my real problem, unsupervised learning? And of course, when I say "solve unsupervised learning", I probably learned that from Silicon Valley or something; you don't solve a whole field. It's mostly about coming up with techniques to deal with unlabeled data. How can I possibly do something useful with it?

When I started writing this talk, I wanted to be ambitious and cover two different techniques. The one we'll talk about is called autoencoders; the other is generative adversarial networks. That would be a two-hour talk, and we would like to eat lunch, so we'll just talk about autoencoders. I think they capture a lot of the essential bits of using deep neural networks for unsupervised learning.

Autoencoders face the same problem that we have in unsupervised learning in general: you are giving me some data, log files, images, something structured, but I don't have a target to predict, so I can't do supervised learning. What should I do? You make the easiest choice: if you don't have anything to predict, predict the input itself. And you say: wait, you're giving me four and you want me to predict four back? Why is that useful? In this picture, we take that y, cut it off, and predict x back. In slightly more mathematical language, we learn the identity function. The identity function does exactly what I said: you give me something, I give it back to you.

So let's see how this would work. You say: I'll trust you for a second, let's learn this identity function, even though it seems useless; who wants a model that predicts back the data I put in? As an example, let's say the input x is a 100 by 100 image, so you have 10,000 pixels. You do the sequential mapping to vectors as before, but I get to choose how big each vector is. The first one I choose to be a thousand, smaller than 10,000. Then 500, then 100. The vectors keep getting smaller; that's why the bars get narrower. Then we reverse the trend: we went from 10,000 to 1,000 to 500 to 100, and now we go back to 500, back to 1,000, and back to 10,000 for x.

So what you're doing is taking this vector of pixels and squeezing it, and then, from that squeezed version, coming back to the original one. The central part, the squeezing part, acts as an information bottleneck. Why? Because I had 10,000 pieces of information. If you give me 10,000 memory locations, I can of course give the image back to you; that's easy. But if you force me to store only 100 numbers, or 10 numbers, then I'm forced to pick out features of the images that are actually useful for reconstructing them, because I eventually have to give you back the image, and I have to reconstruct it from the point where I only have 100 numbers. So those 100 numbers act as a compression, or as latent variables. That's what makes it hard to learn the identity mapping: you're compressing something and then decompressing it. There are many other ways of imposing this information bottleneck, and I might mention them at the end if we have time, but this is one of the most common and simplest.

So how do I actually use this thing? Okay, fine, you squeeze the image and reconstruct it; what do I do with that? One thing you can do is go to that red bar in the center. That's your compressed representation; it has only 100 numbers for the image. So you give me an image, I pass it through this thing, I go to that red bar and read off the 100 numbers, and I say: I don't need the image, I'll keep these 100 numbers. That's my proxy for that image. And hopefully those 100 numbers are something meaningful; maybe if I give it lots of images, they will capture that this is a cat, this is a dog, this is a plate, and so on.
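Here is a minimal sketch of that architecture in PyTorch, with the layer widths from the talk (10,000 squeezed to 100 and expanded back); the activation choices are common defaults, not anything from the slides:

```python
# An autoencoder: encoder squeezes to a 100-number bottleneck, decoder expands.
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(10000, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 100),                   # the red bar: the bottleneck
)
decoder = nn.Sequential(
    nn.ReLU(),
    nn.Linear(100, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 10000), nn.Sigmoid(),  # reconstructed pixels in [0, 1]
)
autoencoder = nn.Sequential(encoder, decoder)
```

Passing an image through `encoder` alone reads off the red bar: the 100-number compressed representation.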
Just a technical point. When you do something like this, you technically have x going in, which is your input, and what comes out is x_recon, the reconstructed x. It won't be exactly the same as x in general, but an autoencoder tries to minimize the difference between the input x and the output x_recon. In the language of deep learning, you have a loss function which measures the difference between your input and your output, and you try to minimize it. That's what those equations say: there's a loss that depends on capital R, which is a reconstruction loss, and R literally takes the two images, subtracts them, squares each pixel value, and sums (or averages) over the pixels. So the loss is L = R(x, x_recon) = mean over pixels of (x - x_recon)^2, and you minimize that.

So all this is fine, but how do we actually use it in practice? An autoencoder, in principle, is something very simple. To summarize: it's a neural network that squeezes your data down and then reconstructs it, with the idea that at the compressed part you capture some salient features of your input. So let's look at a few applications; let's look at four.

The first one is anomaly detection. This is a big thing for any company, especially for us: log files, telemetry data, and I'm sure all kinds of low-level signals from the CPU and from other systems. Can I find anomalies? Well, first, what's an outlier? There are many definitions. One could say: if you give me data, whatever is statistically unlikely in my data set, I declare to be an outlier. A better definition, and I forget who came up with it, is this: your data is generated by a process; log files are generated by your system. When some data is generated by a different process, the data from the new process is an outlier. Something changed, and I want to capture that. And what's an anomaly? An anomaly is a judgment call for the data scientist and the domain expert. An outlier is just statistically unusual; I can't really tell you whether it's something bad or not. My job is to give you outliers; you, as the domain expert, tell me whether they're anomalous.

So how do you actually use an autoencoder for anomaly detection? You say: give me data that you know is normal. You carefully make sure there are no anomalies in it, everything's okay, and I train my autoencoder on that. As an example, give me images of cats and dogs, and I train my autoencoder on them until it gets very good at reconstructing those images. If you then give me such an image, I feed it through my autoencoder, and because it's well trained, it gives the image back with only minor distortions from the compression. But if you give me an image that is not a dog or a cat, something else, my autoencoder says: I never saw this thing before. What is that, a plane? I've never seen this. So it cannot reconstruct it very well. If I look at the difference between the output and the input, for normal data it will be low: I know how to reconstruct this. For abnormal data it will be high. So I can use that loss, my reconstruction error, as a metric for anomaly detection: the higher it is, the more likely you are to be anomalous.
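A minimal sketch of that recipe, reusing the `autoencoder` from the earlier sketch and assuming it has been trained only on known-normal data; the 0.2 threshold is purely illustrative:

```python
# Reconstruction error as an anomaly score.
import torch

def anomaly_score(autoencoder, x):
    """Per-example mean squared reconstruction error; higher = more anomalous."""
    with torch.no_grad():
        x_recon = autoencoder(x)
    return ((x - x_recon) ** 2).mean(dim=1)

scores = anomaly_score(autoencoder, torch.rand(32, 10000))
flags = scores > 0.2  # threshold chosen from scores on held-out normal data
```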
Let me just note that the definition of an anomaly is, of course, context sensitive. For instance, one other use of anomaly detection is spam filtering. If I train my system on my legitimate email, then, since no legitimate email I will ever receive is written in Chinese, my reconstructed emails will only ever contain Latin characters, and Chinese spam stands out. But if you actually speak Chinese, your legitimate mail will have those characters, so for someone else, in a different context, it will be completely different.

Here are two papers, one from 2007 and one from 2018, and they're both doing the same thing. They're taking time series from some systems, taking the values in the last few minutes, say five or ten, and reconstructing that window of the time series. You give me the last five minutes' worth of measurements, I feed that to my autoencoder, it compresses it, then decompresses it and tries to predict the input again. When it cannot reconstruct the input, it says something's wrong; when it can, it says you're okay.

If you look at the two plots to the right, the smaller ones, one says "normal" and one says "anomaly", and the x-axis is the reconstruction error. This is for a high-performance computing system, an HPC system, and they have 166 metrics: fan speeds, core loads, temperatures, all kinds of things. They're trying to reconstruct that data. What you see in the central plot, the one that says normal, is that the x-axis, which measures the reconstruction error, is pretty low; all the values are less than 0.2. The y-axis is the number of counts, the number of examples. Then they artificially inject anomalies into their system, and if you look at the histogram on the right, suddenly the reconstruction error increases. Of course there's some overlap between the two histograms, but you get a lot more points above 0.2. That's the fundamental idea: in a real-life system, you keep running your autoencoder, and as soon as the reconstruction error increases, that's the signal that something's going wrong.

Yeah, let me also add something. Everyone who works around machines knows that things change over time. You usually have drift in the data itself, which means the models you build, in this case the neural network learned by the autoencoder, have to be retrained regularly. Without that, sooner or later everything will look more like the right figure there.

And the same goes for the left plot. On the left plot, the x-axis is now real time, during 2001, and the y-axis is reconstruction error, so the higher it is, the more anomalous you are. I know nothing about routers and networking, but they took time series of BGP router updates and announcements. Apparently these were internet worms: routes would die, and routers would signal to each other that a new route was open, something like that. I think the spike corresponds to one of the famous internet worms; I forget its name, so you can look at the paper. They were trying to see if they could identify these large-scale internet worms, which were disrupting network traffic, purely from router data. Same technique: reconstruct the data and see if the reconstruction loss is higher.
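A minimal sketch of the windowing setup those papers use, cutting a signal into fixed-length windows ("the last five minutes") that each become one input vector for the autoencoder; the signal is synthetic and the window width is illustrative:

```python
# Sliding windows over a time series, to feed the same train/score loop.
import numpy as np

def make_windows(series, width):
    """Stack overlapping windows of a 1-D series as rows of a matrix."""
    return np.stack([series[i:i + width] for i in range(len(series) - width)])

signal = np.sin(np.linspace(0, 100, 5000)) + 0.05 * np.random.randn(5000)
windows = make_windows(signal, width=60)  # shape (4940, 60)
# Each row now goes through the same train / reconstruct / score loop as
# above; a jump in reconstruction error flags that window as anomalous.
```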
Another example, and this is not something most of us care about unless we're doing deep learning or machine learning, but this is what really triggered the change in how deep neural nets are trained after 2006. If you went back to 2005, no one was training deep neural nets, because everyone who tried found it was actually worse. I should first define what I mean by deep. Deep just means more layers: the more layers, the more of those vertical bars you add, the deeper the network. And we know, from first principles and from simple toy experiments, that the deeper a network is, the more expressive it gets; it becomes easier to learn more complex functions or data sets. But in practice, when you tried this before 2006, it didn't work. A deep network would be much worse than one with just two or three layers, and everyone was confused: why is this not working?

Then people figured out the following. When I train these networks, I have all these weights, the matrices M and the vectors b, and they have to be initialized to some values before training starts. If I initialize them differently, sometimes the training works beautifully, and sometimes, with just a slightly different initialization, the neural network doesn't learn anything. So you have these two hints: you know that deeper networks should do well, and you know that the weight initialization matters. The next question is: how do I initialize these weights? I'm not going to randomly try everything; let's find better initializations.

Geoffrey Hinton and his students first did this, I think in 2006. They did not use autoencoders; they used something called restricted Boltzmann machines, but that doesn't matter, because you can use autoencoders for the same thing. The idea is: can I find these matrices, these weights, using autoencoders? One way to do it: give me a data set with inputs and outputs, so this is supervised learning, but forget the output; I only look at the input x. Let me create an autoencoder, which is what that picture shows: x getting mapped back to x. There's one more twist: every time you go from one layer to another, you have a matrix, and you make sure the encoder and decoder matrices are basically the same (that T stands for transpose; let's not worry about it). What you do is: I give you images, you train this autoencoder, and then you take the second half and drop it. We don't need that part; we just need the first part. Now all my images map to some vector v1. Let me build one more autoencoder, where v1 gets mapped back to v1 with some other matrix; that's okay. Let's train the second autoencoder, take the last bit, drop it. You see the pattern: every time I play this game, I get one more layer, with weights that are at least slightly meaningful, because they were part of an autoencoder. At the very end, maybe after 100 such layers, you say: let me now finally use the targets I'm trying to predict. Let me add one last layer, which predicts whether it's a cat or a dog or something else, and that last layer's weights are trained using your targets. This is called unsupervised pre-training, and this version is greedy: you're doing one layer at a time.
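A minimal sketch of that greedy layer-wise scheme: train a one-layer autoencoder, keep only its encoder, encode the data, repeat, then add the supervised output layer at the end. Sizes and epoch counts are arbitrary, for illustration:

```python
# Greedy layer-wise pre-training with small autoencoders.
import torch
import torch.nn as nn

def pretrain_layer(h, n_out, epochs=10):
    enc = nn.Linear(h.shape[1], n_out)
    dec = nn.Linear(n_out, h.shape[1])  # the half we drop afterwards
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
    for _ in range(epochs):
        loss = ((dec(torch.relu(enc(h))) - h) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return enc

x = torch.rand(256, 784)  # unlabeled inputs
layers, h = [], x
for width in [256, 64]:   # each round adds one pre-trained layer
    enc = pretrain_layer(h, width)
    layers += [enc, nn.ReLU()]
    h = torch.relu(enc(h)).detach()
model = nn.Sequential(*layers, nn.Linear(64, 10))  # finally, use the targets
```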
There are other versions of this, but this is what made it possible to train very deep networks: all your weights are initialized to reasonably good values instead of being randomly selected. For people who want to do these experiments: in today's world, as far as I know, no one does this anymore. We have other techniques, buzzword time: batch norm, ReLU, dropout, which make it easier to train deep networks. But I don't know if anyone has tried combining those with this kind of greedy pre-training to see if it works even better. So it's not an application most people outside deep learning would care about, but it really triggered this revolution in training deeper networks.

This next one, I think a lot of us would care about. There's something called semantic hashing: you give me images, log files, or similar, and I can create a hash, or store them in memory. What I really want is to store things that are similar to each other close to each other in memory. That's where the "semantic" comes in: I want to take the meaning into account, the meaning of the images, the log files, or anything else. So can I find a mapping from my documents to locations in memory where similar documents get mapped to similar locations? If that happens, searching becomes super easy. You say: find me an image that's similar to this image. I say: let me just look in the neighborhood around it, and here are more images.

This was also done by Hinton and his grad student. I won't get into many of the details, but they took a lot of documents, articles from Reuters and from newsgroups, with no labels. They just trained an autoencoder on them, took the central part of the autoencoder, the compressed part, and said: let me throw the documents out and keep this compressed piece. You can do tricks so that most of those entries are zeros or ones, so it looks like a bit vector. Okay, this is my bit vector for every document. After training, they plotted where each document lies; that's what these plots are. Each point is a document, and they're projecting the bit vectors into two dimensions. After the fact, they colored each dot by the topic it came from. The topics were never used during training; training was just the autoencoder, take the bit vectors, map them into 2D. Then they colored the dots, and, maybe surprisingly or not, they found that documents from the same topic actually group together. The blues are together, the reds are together. It doesn't have to be that way: if I hash randomly, it will all be mixed. The fact that it can do this is enormously important for better search, for information retrieval. So autoencoders are great for this kind of work too.

They didn't do it with log files, but with documents all you have to do is take every unique word and create a long vector, which is called a bag of words: a count of every unique word. You say this document has five instances of the word "openshift", three instances of the word "linux"; you create this vector, train an autoencoder to predict that vector back, and take the compressed piece. That is your hash function.
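A minimal sketch of that bag-of-words step, assuming scikit-learn; the documents are made up, and the autoencoder part is the same pattern as before:

```python
# Bag-of-words vectors, the input to a semantic-hashing autoencoder.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["openshift pod restarted openshift",
        "kernel panic in the linux log",
        "openshift pod scheduled on a linux node"]
bow = CountVectorizer().fit_transform(docs).toarray()  # one count vector per doc
# Train an autoencoder to reconstruct `bow`, then threshold the bottleneck
# activations (e.g. at 0.5) to get a near-binary bit vector per document;
# documents with nearby bit vectors are semantically similar.
```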
So what you also see here, obviously, is that there are certain classes of application to keep in mind. Anomaly detection, especially when things get complex, images or log files, is a useful application. Training deep networks: if you are interested in machine learning, you should definitely try it out. And if you're not, semantic hashing is a very useful technique; it's very useful to be able to do this on any kind of search space.

Another very useful one is graph problems. You have graphs everywhere; we actually have a talk tomorrow on graph neural networks. Graphs as in nodes and edges, not charts or plots. So I have this collection of nodes and edges, and I want to answer all kinds of interesting questions. It can be a social network, it can be a physical network, it doesn't matter. One of the interesting problems: if I give you two nodes, are they connected? For some pairs of nodes, I know they're connected for sure. For some, I know they're not connected for sure. For others, I have no idea, and I would like some indication of whether they're connected, maybe a probability of connectedness. Can I answer that? One way to use autoencoders for that is to create what's called an adjacency matrix from the graph: if you have n nodes, you build an n by n matrix, with a one where two nodes are connected and a zero where they are not. The job of the autoencoder is to try to recreate that matrix, but fill in the blanks: say, this link has probability 0.3, that one 0.7. Here's a paper that does exactly that, with a lot of experiments, and it seems to work very well. They actually do two tasks. They say: not only will I tell you which links are likely and which are not; some of these nodes also have data associated with them. If the nodes are physical routers, they have properties; in a social network, every one of us has data associated with us. You can actually predict data about a node based on the data of the neighboring nodes. So they train a neural network to do both link prediction and node classification. Again, don't worry about the equation on the left, but that's the basic idea. That's very useful too.
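A minimal sketch of that adjacency-matrix idea: reconstruct each node's row of connections, and read the reconstructed values for unknown pairs as link probabilities. This is a toy random graph illustrating the general idea, not the exact model from the paper:

```python
# Autoencoding an adjacency matrix for link prediction.
import torch
import torch.nn as nn

n = 100
A = (torch.rand(n, n) < 0.05).float()  # n-by-n adjacency matrix: 1 = known edge
model = nn.Sequential(nn.Linear(n, 16), nn.ReLU(),
                      nn.Linear(16, n), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    loss = nn.functional.binary_cross_entropy(model(A), A)
    opt.zero_grad(); loss.backward(); opt.step()
link_prob = model(A)  # entry (i, j): how plausible an edge between i and j looks
```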
So these are just four applications. I have something like 200 papers with interesting applications in a spreadsheet; if you're interested, let me know and I'll send it to you. But the core idea is: train a neural network that learns the identity mapping, and make that hard by squeezing the central part. And there are other ways to make it hard; I'll quickly mention a few. The first is what we did: squeeze that red bit. Another is that you not only squeeze the red bit, but also try to force most of its entries to be zeros. That's called sparsity; you can impose it, and it helps. There's another where you add noise to your input and try to recreate the original, un-noised input; that's called a denoising autoencoder. There's a whole slew of techniques: you can penalize derivatives, you can do all kinds of things. But at its core, you're learning the identity mapping while making it hard to learn, so the network has to throw away irrelevant information. And then what you see in the center is, hopefully, telling you something meaningful about the data you have.

And for applications, there are many more: anomaly detection, semantic hashing, better hashing, graph problems, link prediction; people have done clustering on graphs with autoencoders, and of course there's training deep networks. And lastly, it depends on whether you find this interesting or not. If you're interested in unsupervised learning, feel free to contact us. There are many resources; we don't know everything, but we'll work together. What I think is really important is that in your day jobs you all see some kind of data: log files, telemetry data. Just keep this at the back of your mind, and if you think some of this might be useful, or even suspect it might be, get in touch with us, or start working on it yourself; that's even better. Some of these techniques can give very surprising results, in a good way, when you try them out. So that's it. The last thing: when these files get uploaded, you'll find a whole appendix on another technique. So again, if you find it interesting, skim through it, and if you have questions, just email us.

I just want to add one thing to the last slide, basically, for those who are interested in this. Implementing these kinds of things is nowadays very simple. You can just use some of the existing tools, which are Python based, and which let you implement neural networks of arbitrary depth. So that's actually not that hard. What might help you is to work through an illustrative example, and one of the most famous ones out there, which has been used by generations of data scientists now, is the MNIST dataset, which is just tiny little pictures of hand-drawn digits. It's also not that large a dataset. Get it, you can find it everywhere; just type MNIST into your favorite search engine, and train a network on it. The dataset itself is labeled: every picture has a digit attached to it. But ignore that for the time being. Feed these tiny little pictures into the neural network, pick some architecture, but make the compressed part in the middle of size 10. This will basically force the autoencoder to work with ten numbers, which roughly correspond to the ten possible digits, and you can see what comes out of it. This will give you an impression of what you can do with this and how it should work. And at any point in time you can take the reconstruction, translate it back into graphical form, and look at it. This example specifically I found extremely useful for understanding what is actually going on in these kinds of systems, and it's very easy to do.
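A minimal sketch of that suggested exercise: an autoencoder on MNIST with a ten-unit bottleneck, ignoring the labels entirely. torchvision is assumed for the data; the hidden sizes are illustrative:

```python
# The MNIST autoencoder exercise, with a compressed middle of size 10.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

data = datasets.MNIST(".", download=True, transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10), nn.ReLU(),   # the compressed middle of size 10
    nn.Linear(10, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(model.parameters())
for x, _ in loader:  # the labels (second element) are deliberately ignored
    loss = ((model(x) - x.view(-1, 784)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# Reshape rows of model(x) to 28x28 and plot them to inspect reconstructions.
```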
That's it. Any questions?

Yeah, really nice talk. I had a few questions regarding the anomaly detection part. You said that it predicts anomalies, but what about the interpretability of the neural net? Basically, why does it think those anomalies are anomalies? If you wanted the neural net to tell you what parts it is looking at when it calls something an anomaly, can it do that, compared to something like classical machine learning, which kind of tells you which parameters are affecting its score?

Yeah, I would argue that even in classical machine learning, depending on the technique, you have to do more work to get more information out. At the surface, the neural net doesn't tell you why the reconstruction error is lower on this example versus another. But you can always do more careful experiments, where you say: I have these inputs; what happens when I vary just one of them, one of the elements? Or: what are the activations in the intermediate layers? But that is almost always more manual. The interpretability is never straightforward. It's not going to tell you "this is why I think it's anomalous". All you can do is explore the input space close to the input you gave it and try to see what exactly is triggering the loss. But that's a general problem with deep learning, I think.

Yeah, and specifically for autoencoders, you have the advantage that you're controlling the compressed layer. If you manage to create an autoencoder with a very restricted inner layer, you can look at a limited set of your examples and try to see whether you can recognize a pattern in the values of the most constricted part of the autoencoder. Oftentimes, and again I'm going back to this MNIST example, you will see something, especially if you're forcing many of the entries of the inner vector to zero, so that it looks almost like a binary code. The neural network will not tell you what the code means, but humans are extremely good at recognizing patterns. So you look at a couple of examples, look at the compressed representation, and assign the appropriate label yourself: oh, it's this variation, it's that. So I would argue autoencoders are actually better than many other machine learning techniques at answering the question you just asked, because a high-dimensional regression doesn't really tell you anything either; you cannot envision a hundred dimensions of regression coefficients.

Anything else?

For deep neural networks, at what point do you decide whether the network has generalized enough versus having overfit to specific details of the given data set? Until and unless you get test data, that seems to be the only way to tell. Is there any other way you could determine that?

No, I mean, the universal test always is: you have a held-out data set and you look at performance there. What you can do is control the expressivity of your neural net in some ways: you add, like I said, dropout, or all kinds of regularizations, which you also do in other machine learning techniques; you do it in linear regression with regularization. So you can impose constraints on your model that make it harder for it to learn, in the hope that it generalizes better. But the real golden test is always: have a hold-out set and look at performance there.

This may sound abstract at this point, "make it harder for the model to learn", but that is really the thing. If you want to learn about this, just read about dropout; that's the simplest example of these regularization techniques. Dropout means that during training you randomly force certain values, certain connections in the network, to zero. There's a random process going on, which means you're basically introducing noise into the system while still forcing it to learn. So even if you feed in the same input over and over, millions of times, it will never actually produce exactly the same activations going through the system. And therefore you can imagine that, ideally, it will not lead to something that is overfitting.
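A minimal sketch of dropout as just described; note that standard dropout, as implemented in common libraries, zeroes random activations (effectively cutting connections) rather than literal weight-matrix entries:

```python
# Dropout: random activations zeroed during training, switched off at eval.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Dropout(p=0.5))
layer.train()                      # training mode: dropout is active
print(layer(torch.rand(1, 100)))   # different entries zeroed on every call
layer.eval()                       # evaluation mode: dropout switched off
```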
Hi. When you were explaining how you're going to train your autoencoder for anomaly detection, you said you would first train it on good data, right? Like data which only contains cats and dogs. So how is that unsupervised? Isn't it semi-supervised or something like that? I mean, it's not totally unsupervised, right?

Yeah, I think there's a smooth spectrum. When I say it's unsupervised, I mean the labels are never used by the model explicitly. Now, one can argue that unsupervised learning in general will never work, because what if I put in bad data? What if, in your log files, I start typing nonsensical things? But the idea is that your model is not looking at the labels when it computes the gradients. The model doesn't know it's cats and dogs.

Yeah, but that's the only data we have for training, so it trains accordingly.

Yeah. And I think the second thing, which is more relevant to GANs, the second part we skipped, is this: one of the interesting experiments is whether you can understand the process generating a data set, which GANs do very well. You can give a GAN MNIST images or cat images and it generates novel ones. You'll see all these, what's it called, the celebrity faces data set or something; they generate new celebrity faces and new dog faces. The idea is that a GAN looks at your data set and tries to learn the joint distribution of the variables. So a central question there is: if I give you a small amount of labeled data, so it's supervised, but a tiny data set, can you learn enough to then use something like a GAN to generate more of each class? Technically that's supervised too, but there's a smooth spectrum between how much labeled data you need and how much unlabeled data you need.

Are you talking about few-shot learning or something?

Yeah, it's basically few-shot learning, but using a GAN.

And maybe a VAE can also be used? You mean a variational autoencoder, right? Yes, there's a whole family of variational autoencoders, and they perform very well too. Thank you.

But you're pointing at a real problem there as well. Sanjay mentioned that you're just feeding it the logs of a good system and learning from that. Well, in reality, if you ask any sysadmin, that's hard, because who guarantees that of the million logs which are collected, all of them are good? So you might actually not be learning what is good and what is bad, but just how one particular system behaves. There's a real problem in getting good data in the first place, and there has to be some supervision of some sort going on there, and therefore you could argue that even this is not completely unsupervised learning.
And as an example, for the internet worm paper, those guys literally went and said: let's gather all the data around and pick a time period when there are no known worms. But you don't know; maybe there's something else in there. So there's also a healthy dose of hope sometimes. But it's better than nothing; that's the argument.

It seems like, if what you're really doing is just compression, and if your anomalies are rare enough, you might not even need known-good data.

I think a lot of people find it unsettling when you're trusting a neural network to help you predict things or learn something, but you have no way of understanding why it's saying what it's saying. I'm curious: has there been any work on, say, taking the neural network, getting the output, then running it in reverse, maybe dropping out some of the intermediate nodes, essentially projecting the output back onto the input? Does that help indicate where some of the signal came from? Has anyone learned anything from that?

There's one very famous example, and again, back to what Sanjay mentioned a couple of times. There was a network trained on pictures taken randomly from the internet, and they simply said: well, let's reverse the whole thing. And it turned out certain parts of the network corresponded to cat pictures; you could clearly recognize the cat at the end. So in theory, yes, this is possible to some extent, but not in general; they had to search really, really hard. The topic of interpretability of models is an ongoing research area, and it will hopefully result in much better explainability in the future. But so far, yeah, we're trusting, and that is why I mentioned before that you have to be careful when you're actually using these kinds of techniques. If it's only about search results or something like that, you don't care much. If it's about driving your car, it's something else. There are these examples where people put stickers on a stop sign, and all of a sudden the classification completely changes, because to the network the picture has completely changed. And that's not going away any time soon, I think.

That's it, so thank you all very much, and please feel free to email us questions or talk to us after this. Thank you.