Good evening everyone. My name is Kiran. You can call me KK, or today, K squared. Today I will be sharing my experience working with the Cortex library. It's not yet 1.0, but it's fairly mature. Just to get an idea of who's here today: how many of you do not have any experience with machine learning? And who has experience with neural networks? So today's discussion is tailored to people who are not familiar with machine learning or neural networks. I'll take a digression into supervised machine learning and a little bit of neural networks, and then we'll take a look at Cortex. Again, I'm looking at Cortex not from a researcher's perspective, but as somebody who is familiar with Clojure and wants to stay in Clojure and do machine learning in a fairly black-box fashion.

So let's talk about why you would do this. There are probably 50 or more neural network tools today, backed by great engineering teams. But if you look at what happens in a data science or machine learning pipeline, these are the common tasks. You first either have access to data, or you have to grab your data from a database or do a web crawl or something like that. Then there's a stage of data preprocessing and cleaning: maybe removing outliers, maybe doing some data transformations, bringing it into memory. After that you go to the model-building phase, where you give this data to a model, train the model, and iterate. When you're happy with the result, you take the model out and use it in production; you want to run what's called inference on it.

Now, Clojure, and Java as well, has great support for almost all parts of this pipeline. You can do data cleaning, connect to databases, do string manipulation, numerics, outlier removal and so on. You also have support in Java and Clojure for non-neural-network machine learning. For example, there are tools like Weka, which has all the classifiers and clustering you need. But if you and your team are familiar with Clojure and you want to use neural networks or deep learning, at that point you are forced to jump out to a different language, to TensorFlow or Caffe or something like that. What's problematic is that you either have to build cross-language bindings if you want the whole pipeline, or you end up dumping your training data to a file, having one team of people build the model, and then taking that model and trying to use it in production. So there's a break in the flow there.

So Cortex fills an important need: it's a pure-JVM, pure-Clojure neural network toolkit, so you can do all parts of your pipeline, cleaning, training, inference and running in production, all in Clojure. However, there are cases where you might not want to do this. Notably, there are some excellent neural networks, for image recognition say, which often need training to the tune of days, weeks or sometimes even months. With such networks it doesn't make sense to retrain from scratch, because that's obviously quite expensive. In such situations you might just want to take a pre-trained TensorFlow model and use it for inference. Does anybody not follow what inference is?
So you could train the model in, say, TensorFlow or Deeplearning4j. With Deeplearning4j, because it's Java, you can use the model for inference directly from Clojure. TensorFlow has some rough Java bindings right now, and maybe that will improve, so you could do that too if you have a pre-trained model. To sum up: if you have your own data, and the data is not too big and not too complex, so that training should finish in minutes or hours, you can do everything in Cortex along with whatever other Clojure libraries you're using.

I'm going to take a short digression into machine learning, really supervised machine learning. There's a popular post that describes what supervised machine learning is, and the example is straightforward. When you're young and going out on a walk for the first time, you see an animal and you ask your parent what it is, and your father points out: this is a dog, or this is a cat. After your father has pointed out 10 or 15 different kinds of dogs and cats, when you go out for a walk on your own you realize, okay, this is probably a dog and this is probably a cat. That's what supervised machine learning is; the supervision is somebody teaching you what you're expected to see.

What are the common kinds of applications, just so you understand the targets? One of the earliest was handwriting recognition, specifically zip code recognition. You have digits written in many different ways, and what the classifier is trying to learn is, given an image, which digit between 0 and 9 it shows. This was one of the earliest successes in supervised machine learning, from 1992 in fact. The dataset behind it is called MNIST: a publicly available dataset containing images of these digits. Since the 1990s, for almost 20 years, it has been commonly used to test how well supervised machine learning algorithms work.

The initial example I gave you about cats and dogs, incidentally, turns out to be a real competition, held by Kaggle in 2014. Teams trained on 25,000 images of dogs and cats, and the classifier's output should tell you whether an image shows a dog or a cat. The leading teams got to 98.9%. You can also see the same thing implemented in Cortex. Just to give you a flavor, it has images like this. Any guesses what this might be? Dog? Okay, apparently it's a cat, and the classifier does quite well: it is 95% positive that it's a cat. Other applications of supervised machine learning: you can detect faces, you can detect whether an expression is happy or sad, and you can detect the bounding box that tells you the face is in this area and not in that one.

So that was a bit about supervised machine learning. Why should we use neural networks, when we know there are several excellent classifiers like SVMs, for example? It turns out that in the past five or six years, neural networks have really beaten state-of-the-art performance on many problems. We saw the MNIST dataset of handwritten digits earlier; currently the error rate with neural networks is 0.21%. So it's about 99.8% accurate, and it's largely considered a solved problem.
So now you don't even think of reading digits as a difficult problem; it's considered solved. What's also nice is that you can combine layers. You can combine different building blocks, and because of the algorithm neural network training is based on, specifically backpropagation, you can just add additional layers and stack them. You don't need a math PhD to check that the math works out, which is not the case with other classifiers like SVMs, where you absolutely need to know how things connect and whether you can stack them in certain ways. Here it's not that difficult. That's not the same as being easy, but given a toolkit you can easily play around: you can add layers, remove layers, change layers, and you don't have to do any math to see whether it will actually work.

Why else? Distributed learning is feasible with neural networks, and thanks to GPUs: since 2012, GPUs have enabled breakthroughs on big machine learning problems that were not possible when the hardware was less powerful. Also, flexibility. What we've seen so far is supervised learning where the target is binary, for example spam versus non-spam. But you can also have multiple targets. You saw the earlier image: there isn't just one person in it, there are multiple people. You could have an image with five cats and ten dogs, and the network should be able to tell you where each cat and each dog is, with a bounding box around it. That's possible with neural networks and not so easy with something like an SVM. You can also have outputs that are sentences. In the earlier cases you have one output or multiple outputs, but a sentence is a sequence, and neural networks can consume sequences and generate sequences, which is a great property. And when you talk of composability, you can span multiple domains: say an image going in and a sentence coming out. The im2txt model is an example of this. You see an image, and the caption on top of the image is actually generated by the im2txt model. Now, most of the time when people present machine learning results, they show the best results, and of course this result is excellent. Sometimes it does worse. Take this room, for example: im2txt wouldn't do very well here, because you can't quite say this is just a carpet. So that was a best case. I believe it has been trained on about 10,000 different objects, so if you give it a new object, it might output something similar from what it has seen, but it won't get it right.

Let me briefly talk about the architecture of a network, because we'll need that when we talk about Cortex. We're only focusing on what are called feedforward networks. A feedforward network is one with no cycles: activations flow from the input to the output, and during training the errors flow back from the output towards the input. According to the universal approximation theorem you need just one hidden layer, but that doesn't always help you in training; that's the ideal case where the data is perfect. So most of the time we use feedforward networks with more than one hidden layer. You can see that layers one, two and three here are hidden, followed by the output layer, and these lines are the actual weighted connections between the units.
To help you understand what gets calculated: assume we have inputs X1, X2 and X3, and the green boxes, the Ws, hold the weights. If the input is 1 and the weight W2 is 0.5, you multiply 1 by 0.5 and that contributes to the value here. You can see the arrows coming in, one here, two here and three here; you add up those three contributions, along with the bias, to get the value at this unit. It's not terribly important to understand this fully, but it will help. Any questions so far?

This slide explains the same thing as the previous one: you multiply each weight by its input and sum the results. Say we have two inputs and two weights, and assume everything is 1. The input is 1 and the weight is 1, so it's 1 times 1 plus 1 times 1, and you get the value 2. That's what this value is: the sum of the Wi times Xi terms, plus the bias. There's also a term called the transfer function. The transfer function is used to change the output after you add everything up. One common transfer function is ReLU, which only passes positive values through: if the value coming in is minus 5, it returns 0; if the value coming in is 4, it returns 4. That's one common example.
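To make that concrete, here is a minimal sketch of the arithmetic in plain Clojure. This is just the math from the slide, not the Cortex API:

    (defn relu
      "One common transfer function: pass positive values through, clamp negatives to zero."
      [x]
      (max 0.0 x))

    (defn unit-output
      "Weighted sum of the inputs, plus the bias, passed through the transfer function."
      [weights inputs bias]
      (relu (+ bias (reduce + (map * weights inputs)))))

    ;; Two inputs and two weights, everything 1, bias 0: 1*1 + 1*1 => 2.0
    (unit-output [1.0 1.0] [1.0 1.0] 0.0)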
Okay, now that we're done with our digressions, we return to Cortex. Cortex is relatively new; it's at a 0.9-something version and has been in development for probably 8 to 10 months. Features: first of all, it's written in pure Clojure. It can run on both the GPU and the CPU. It supports the most common network types: feedforward networks, and convolutional neural networks, which are primarily used in image analysis. It turns out that a lot of the contributors to Cortex come from an image processing background. Other cell types, say GRU and LSTM, which are commonly used for language analysis, are not implemented yet, but it's fairly straightforward to implement them within the Cortex architecture. You can serialize models to multiple formats, EDN or Nippy files, and the project is fairly active: about 900 commits, 11 releases and 24 contributors so far, hopefully more after today.

What we'll do now is walk through the steps of using Cortex in a machine learning pipeline. These are the basic steps, and we'll go into each in detail. We first define a network with an input layer, one or more hidden layers and an output layer. We create a dataset; if it has a test split, we use that, otherwise we arbitrarily pick some instances and place them in the test set. We create a listener if required, and then we train the classifier using one of the APIs. While training, we can evaluate whether it's performing well or not learning at all. Finally, we use it to run inference on test instances.

This is how we define a network in Cortex. It's basically a vector, and each element of the vector is a network layer. The arguments depend on the type of layer; it's not the case that every layer takes the same kind of arguments. The first layer is the input and the last layer is the output. In this particular network we have just one hidden layer, which is unusual, but I've kept it that way for the sake of the example. Here's how we define an input layer. The input layer takes the x, y and z dimensions of a volume; this is a reference to loading an image. If we have an image that is 28 pixels by 28 pixels, we give 28 and 28 for x and y, and if it's monochrome, it has just one channel. If it's color, it will have three channels, and if it also encodes depth, it will probably have four or maybe five. Now, if you don't have image data, like me, and you have, say, just two inputs, then you give two as the first dimension and one for the rest. And :id :data basically says where the layer should find the input data when it's presented with a map.

Here are some examples of hidden layers. You can have a linear layer with just one unit; the first argument to that constructor is the number of hidden units you want. Some layers take no arguments. The batch-normalization layer basically says: whatever I'm getting from the previous layer, normalize it. Similarly, the dropout layer says: of whatever inputs you're getting, keep only 75% of them. If you give a dropout layer 1 it will keep everything, and if you give it 50% it will drop half of them. You need not know exactly what dropout or batch normalization is implementing; you can just try them out in a network. You can create 10 different network variants, place these layers arbitrarily, and later check whether your network is learning or not. It's very straightforward to do that.

Here are two examples of output layers. In the first example, the layer just before the output is a linear layer with a single output, and the logistic layer converts that number to a probability. The linear layer can technically give you any number from minus infinity to infinity, and logistic converts that to a probability between zero and one. That's the setup for a case where you're trying to detect whether, say, an email is spam or not: you expect an answer between one and zero, one being spam and zero being non-spam, or vice versa. The second example is for MNIST: you have 10 digits and you want probabilities over all 10 of them. If I'm expecting a one, then I should see a probability of, say, 70% on one and small probabilities on the rest. So here the linear layer has 10 units, and that feeds into a layer that squashes them into probabilities. Any questions so far?

Having defined the network, we need to create a dataset. This is the basic form of a dataset, with two training instances. I'm assuming there are two inputs, two independent variables, and one output, the dependent variable. The network consumes 0.9 and 0.1 and we try to teach it that the value is one; similarly, the second instance teaches it that for that input the target is zero.
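Putting those pieces together, here is a sketch of the network description and dataset for this two-input case. The constructor names follow the Cortex examples, but treat this as illustrative; the exact names and options may differ between versions:

    (require '[cortex.nn.layers :as layers])

    ;; Two scalar inputs, one small hidden layer, one output unit
    ;; squashed to a probability.
    (def network-description
      [(layers/input 2 1 1 :id :data)
       (layers/linear->relu 10)
       (layers/linear 1)
       (layers/logistic :id :labels)])

    ;; A dataset is just a sequence of maps whose keys match the :id
    ;; values declared in the layers above.
    (def train-ds
      [{:data [0.9 0.1] :labels [1.0]}
       {:data [0.1 0.9] :labels [0.0]}])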
Having created the dataset, you might want to consider creating a listener. What is a listener? To step back a bit: neural networks often take time to train. Small networks might finish training in a few minutes, or maybe 30 to 40 minutes; the big ones run for days and weeks. So you want a window into the training, to check that it's doing the right thing. Usually you start out with a network that doesn't train, you want to see why it's not training, you fix some bugs, and over time it trains up to an acceptable level of performance. What a listener does is let you look inside the training process. You can stop training when it's done: for example, if you're expecting 80% accuracy, your listener can tell you, hey, I've reached 80%, and stop there. You can also save the state of the network at different points in training. If you look at networks released by, say, the TensorFlow team, they release checkpoints of the network after it has trained for one or two million steps. You could likewise release networks that have reached, say, 80% or 90% accuracy. That's what we use a listener for.

Currently the Cortex project has two listeners. The default listener is this function; I've included it just so you can read what it takes. This listener creates an image from an input. The observation is just a vector; it can contain image data, so for an image of 28 by 28 the vector will be 28 times 28 pixels long. The function accepts that vector and creates an actual image that can be displayed, and a second function reports the actual target, the 1 or the 0, so the digit can be displayed alongside the image. So this is how you would create, for example, the image listener: you take an observation vector and you return an image.

And this is what the live web server reports. Although I'd like to do a demo, it takes real time to train the network, so this is a screenshot of the web page while it is training. These numbers will vary; it's the confusion matrix as it trains. In the beginning the numbers are roughly equal everywhere, and slowly the numbers on the diagonal start to get higher and higher, which tells you that, for example, expected value 4 with actual value 4 is really frequent, which means the network has trained quite a bit. It also picks some sample images from the test set and shows you how well the network is doing on them. In fact, even though I've trained this for a very short time, I really can't find mistakes. It gets 7s without the dash as well as with the dash. Both of these, for example, are 6s, written quite badly. This one doesn't look like a 9, but I think it could be a 9. This page will constantly refresh and show you images.

[Audience] Is it possible that it does not return a value? That it says "I don't know" instead of guessing it's a 4? No, it will always show a value. It could be wrong; it never says "I don't know". That's because the last layer of the network has learned a probability distribution. Even if you give it an image of a cat, what's likely to happen, because you have 10 outputs, is that all of the outputs will be around 10%, and one of them may be slightly higher, so you just pick that one and say, okay, this is the highest.
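Coming back to the default listener for a moment: the observation-to-image function it takes is ordinary code. Here is a hedged sketch of one for MNIST-style data using plain Java interop; the exact signature Cortex expects may differ:

    (import '[java.awt.image BufferedImage])

    (defn observation->image
      "Turn a flat 28x28 grayscale vector (values 0.0 to 1.0) into a
      BufferedImage that the web listener can display."
      [observation]
      (let [img (BufferedImage. 28 28 BufferedImage/TYPE_BYTE_GRAY)]
        (doseq [y (range 28)
                x (range 28)]
          (let [v    (int (* 255.0 (nth observation (+ (* y 28) x))))
                gray (bit-or v (bit-shift-left v 8) (bit-shift-left v 16))]
            (.setRGB img x y gray)))
        img))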
We also have a TensorBoard listener. For those of you who are not familiar with TensorBoard: it's part of the TensorFlow project. TensorFlow itself is a very well-regarded neural network library released by the Google Brain team, and TensorBoard is its suite of tools to visualize how your network trains. You can see plots of various things, like accuracy and F1 scores and so on, you can see how the weights change over time, and you can see the distributions of the weights and the gradients. It can help you debug problems like the vanishing gradient problem.

[Audience] What is the vanishing gradient problem? Good question. The vanishing gradient problem shows up in deep networks. The first layer passes some influence to the second layer, the second layer passes some influence to the third, and so on. If I were to use an analogy, the network has knobs on the outside, and training is basically asking: if I turn this knob to the left, does my error go down or not? What happens in a deep network is that the influence just doesn't reach the far end of the network, so the gradients you see there are essentially all zero. When that happens, that's the vanishing gradient problem. The reverse is the exploding gradient problem, where the values are so high that the network can't learn either.

To give you an arbitrary example I made up: assume you're driving a car you've never driven before and you're unfamiliar with the steering. You're not sure whether you should turn the wheel 10% to go right, or turn it a lot. As you drive, you slowly get used to it. Now assume there isn't just one steering wheel: one steering wheel is connected to a second steering wheel, that's connected to a third, and so on. If each wheel passes on just 1% of its influence, you can turn the first one a lot to the right, and by the time the motion reaches the third wheel, it has barely received any input; it just can't learn. The reverse is if each wheel passes on 110% of its input: the input is so high that you turn a little bit and the car shoots off to the right. That's a somewhat awkward example, but that's vanishing and exploding gradients. Any questions?

So this is what the TensorBoard output looks like. You can see a couple of graphs. The one here is the accuracy; you should expect the accuracy to go up as you train. Similarly, this one is the loss, and you should expect the loss to decrease as you train. If you start doing neural networks, you'll see that this is a really great situation to be in. When you start with a new network, it goes all over the place and doesn't train, and debugging that is where both the fun and the difficulty are.

So, back to Cortex. The TensorBoard listener basically creates TensorBoard-compatible events from Cortex. Cortex obviously has its own data formats and internal data structures, and TensorBoard has its own format, so the listener converts training events and writes them to a file in a way that TensorBoard can read. We have support for training loss and cross-validation loss, and you can also see how the weights and the biases change over time. This is how you actually create a TensorBoard listener: you basically pass a file-path argument, and then you run tensorboard and point it at the log directory. Just a small note here: TensorBoard expects its events in per-run directories. So under a log directory like mnist-tf-logs you'll have multiple subdirectories; linear would be one kind of network, and other kinds of networks go into their own directories.
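As a rough sketch of that wiring: the namespace, constructor name and option key below are my assumptions based on what the listener is described as taking, so check the Cortex-to-TensorBoard connector project for the real API:

    ;; Hypothetical namespace and constructor; the listener is described as
    ;; taking a file path under a per-network log directory.
    (require '[cortex-tensorboard.core :as tb])

    (def tb-listener
      (tb/tensorboard-listener {:file-path "mnist-tf-logs/linear/events.log"}))

Then, from a shell, you point TensorBoard at the parent directory, and it picks up each subdirectory as a separate run:

    tensorboard --logdir mnist-tf-logs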
Finally, after... [Audience] May I? Yeah, please. [Audience] How does Cortex integrate with TensorFlow? Is it a library, a web service, an import? Just to clarify, there is no integration between Cortex and TensorFlow. The integration is between Cortex and TensorBoard, and TensorBoard is a sub-project of TensorFlow that deals only with visualization. All that you need to give TensorBoard is a file in a specific format; it's basically a graphing library.

[Audience] So could you use the TensorFlow back end for your model, with Cortex? No, you can't, and you don't need to. If you're using TensorFlow and TensorBoard, you don't need Cortex.

[Audience] How would you write Clojure against TensorFlow? Let me give you the short answer first. Cortex and TensorFlow are both neural network libraries, and TensorBoard is just a visualization engine. Technically you can visualize any kind of graph in TensorBoard, but TensorFlow uses it to display TensorFlow events: TensorFlow emits certain events, and TensorBoard reads and displays them. The slightly longer answer is that there is a Java wrapper for TensorFlow, and there's a Clojure wrapper on top of that, so yes, you could use both. But I don't see the value, because what Cortex can do, TensorFlow can do, and probably do better at this point. If you're writing code against TensorFlow, you would just use TensorFlow. [Audience] In which language? That would be Python, because the Java and Clojure wrappers for TensorFlow are not yet full-fledged; the primary support in TensorFlow is for Python and Go. The core is a C library, and the Java binding is a SWIG-generated wrapper; SWIG is a tool that generates wrappers. So no, TensorFlow is not Java-native. Again, to clarify for everybody here: if you're writing your networks in Python with TensorFlow, you don't need Cortex. If you want to stay with Clojure and build networks in Clojure, you use Cortex, and you don't need TensorFlow. Sorry to insist on that.

[Audience] That's a good point. Is Cortex at the same level of power and quality as TensorFlow? Not by a long shot. Again, there's a short and a long answer. The short answer is that Cortex implements some types of layers; we've seen that it can implement a convolution layer, for example, but it doesn't have support for the things that language tasks need, like sequences. As far as I know there are no recurrent neural networks implemented in Cortex yet, whereas they are implemented in TensorFlow. However, I should also point out that most modern machine learning, especially neural network learning, derives its power from running on a GPU. So the host language is not as critical, because eventually even the Cortex code has to run via a Java wrapper around a linear algebra library on a GPU. That said, while it's conceptually easy to say the GPU is at the heart of it, in engineering projects where training algorithms run for weeks, you want to squeeze out as much performance as possible, so you do a lot of performance profiling and so on, and there I think TensorFlow is definitely far ahead. The second thing is that for most deep networks nowadays, people don't start training from scratch. They use a network that has already been trained, in some cases for weeks. The im2txt model that we saw has been released by the TensorFlow team and has been trained for two or three million iterations, which, if you were to go buy a GPU and train it today, would take a couple of weeks, I think.
[Audience] Actually, depending on what you said: these networks, if they're of the same type, say convolutional or feedforward, and they start from the same point, with no pre-trained weights, they're quite similar from a developer's perspective? Yes, they should be quite similar. And also, from a developer's perspective, neural network code is actually quite small. You define a network, which is really five to ten lines of code, you fire off a function called train, and that's it. Most of the time you're sitting there looking at it and wondering why it's not training. It's not writing hundreds and thousands of lines of code and doing plumbing and so on. So the difference is more about the ecosystem: what is already implemented there.

[Audience] Actually, another question. What do you do about the gradient problem? If it happens, how do you handle it, given that you already have the network structure and you've already picked the training data? If it stops learning, how do you normally handle it? I think you're trying to balance a couple of things. One is that you want the best possible algorithm. Assume you're in a competition, or assume there's a cost to not finding the right solution. Take MNIST for an example: you're the postmaster, and you know that for every letter that's misdirected, there's a cost, so you want to get as close as you can to, say, 95 or 99% accuracy. That's one. Second, you want the neural network to train quickly. You don't want to see the results after a month and only then figure out that it's not good enough. And you want something you can understand. So you try to find a point in the middle and balance all of these. I don't know if that answers your question, but you try to balance all of these.

[Audience] Where do you look first when a network does not train? Let's take MNIST; for a dataset like that, some of this will come later. But honestly, I think that question could probably fill a course. Simple networks are easy, but I know, for example, that Microsoft has tried networks that have 600 to 800 layers, and you can't begin to imagine what can go wrong with those. It's not easy at all. Of course, there are some rules of thumb that you can start with.

So, finally: after we have created the network and set up a listener, we are ready to train the model, and this is one of the APIs we can use. It's called perform-experiment. We pass it the network description and the training and test datasets, and we tell it to train for 10 epochs. An epoch is a complete run through all the training instances: if we have 10,000 training instances, we make sure the algorithm sees all of them at least once, and we say, do that 10 times. And finally, after training the network, we perform inference. We take data in the same shape and ask the network: what do you think the output is? Predict 1 or 0.
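In code, the training and inference steps look roughly like this. This is a hedged sketch: train-n is the lower-level training call that sits alongside the perform-experiment API, and the exact argument order and option names may differ between Cortex versions:

    (require '[cortex.experiment.train :as train]
             '[cortex.nn.execute :as execute])

    ;; Train the network description for 10 epochs over the training set,
    ;; evaluating against the test set as it goes.
    (def trained-network
      (train/train-n network-description train-ds test-ds
                     :batch-size 10
                     :epoch-count 10))

    ;; Inference: hand the trained network observations in the same shape
    ;; and read the predicted :labels out of the result maps.
    (execute/run trained-network [{:data [0.9 0.1]}])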
So what we're going to do now is look at a loan dataset. How much time do I have? I'm not sure. Continue? It's 9 p.m. The loan dataset is a very simple dataset. It basically tries to predict whether a customer is going to default on their credit card, and there are two situations. Assume a customer has spent $10,000 on their credit card and has very little money in the bank: that's a high probability of a default. On the other hand, if the customer has a bill of just $10 on their credit card and $10,000 in their bank, they're unlikely to default. So that's the dataset, and we'll try to train on it.

This is how we create the data. We basically read the CSV file and split the lines, making sure we get the fields we need: the two input fields, the balance on the card and the income. And finally, if the default column is a yes, we make it a one, otherwise a zero. We define three layers; we've seen how to define the layers before: one input, one layer in the middle, and one output. And finally we run training. You can see that we took the dataset and cut it at a 90-10 split: we first shuffle the dataset arbitrarily, take the first 90% of the instances and put them in the training set, and put the last 10% in the test set. We create a listener, and then we run training, for 10 epochs specifically.
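Spelled out, that data preparation might look like the following, a minimal sketch; the column names and their order are assumptions about the CSV layout:

    (require '[clojure.string :as str])

    (defn line->instance
      "Parse one CSV row into a Cortex-style map.
      Assumed layout: default,balance,income."
      [line]
      (let [[default balance income] (str/split line #",")]
        {:data   [(Double/parseDouble balance)
                  (Double/parseDouble income)]
         :labels [(if (= default "Yes") 1.0 0.0)]}))

    (defn load-loan-dataset [path]
      (->> (str/split-lines (slurp path))
           (drop 1)                     ; skip the header row
           (map line->instance)
           shuffle))

    ;; The 90-10 train/test split described above.
    (let [ds (load-loan-dataset "loan.csv")
          n  (int (* 0.9 (count ds)))]
      (def loan-train-ds (take n ds))
      (def loan-test-ds  (drop n ds)))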
Let me see if I can show you. What I've done here is run an example that you can find from the slides. I created four networks. I wasn't sure what kind of layers to use, so I somewhat randomly picked four variants and trained each of them for 10 epochs, and then I wanted to visualize how it's going. You can see that on the left side are the networks themselves; each of them is a folder, with an events file behind it that backs the graph. And you can see this one, the flat blue line: it has stayed at around 25% and never gotten better, so we know this network is not training at all. It's not uncommon in neural networks to put together some common-sense configuration of layers and see if it trains, which is what I did, and I figured out early on that this one doesn't train, so I will not use it in future. You can see these three networks train fairly well, and the green line, which is in fact the second-best network, is still trending upwards. That tells us that maybe if we trained it longer, it might even overtake the best network, which is the linear network. This is just one metric, accuracy on the test set; you can have multiple such metrics. If you have multiple classes, you can have F1 scores. If you have the MNIST dataset and, say, you're really concerned about accuracy on the number 9, because you believe people write it in very different ways, you can have a metric that shows only the accuracy on the number 9 and investigate that.

So that was the loan dataset example. I also trained on the MNIST dataset, which we spoke about: handwritten digits, about 10,000 images. On the left here I can choose what I want to see. Right now I have chosen runs 1, 2, 4 and 5, which were for the loan dataset example, but now I'll toggle everything off and view only MNIST. For MNIST we're only recording test loss, for example, and we expect the loss to go down while training. We can see that the network we have chosen is fairly good; the loss keeps going down, and I have only trained this for some 10 minutes. And it's not unusual that networks sometimes train well early and then do poorly later. [Audience] Sorry? They do? Yes, they do. That's a phenomenon called overfitting. Every dataset has some amount of noise in it. As the algorithm learns, it first learns to fit the data, but it can also fit the noise in it. Overfitting is when it starts to fit the noise more than the actual data.

So that was just one metric. To give you some other things that you can see: it's also useful to look at the actual weights. You expect the weights to move as the network learns, and to be distributed in a certain pattern. Without going into too many details, what this plot is saying is that the thickest line is two standard deviations from the mean, and the widest band is three standard deviations from the mean. So 99.9% of the weights are between, I think, 0.8 and minus 0.8, but most of them are between 0.2 and minus 0.4, and we expect this band to be as narrow as possible. We can also see how they change over time. This one, for example, is the bias: the oldest generation is at the back and the youngest generation is in front, and we should expect to see it change over time. What I'm saying here is hugely oversimplified, and I'll be honest in admitting that I don't always know how to interpret these plots, because the figures can be very different from what you've seen before. So this is a lot of trial and error, where you try a lot of networks and you see them fail, and then you learn that this particular image or this particular representation is telling you that it's failing in this particular way.

[Audience] Does that mean that if you have another problem to solve, you need to start again from scratch? Not always from the start. One way to think about it is as a search problem. You start with 100 networks, you train all of them for, say, one hour, then you choose the most promising one, and then you vary that a bit and train another 10 versions of it for one day. [Audience] Is it possible that some network starts out performing badly and then suddenly improves? That's usually unlikely. You can have networks that train slowly but end up with good performance, so it takes a judgment call to know whether a network has potential. Some networks train very quickly and then plateau and don't improve at all; some networks train slowly but continue to improve for a long time. It's not normal to see curves that go straight down and then flat, nor curves that just keep going deeper and deeper. That comes with experience.

So that's it. These are the resources for the project. The links are the Cortex project, the TensorBoard documentation, and the presentation itself, which is available here, so you can look at the source for each of these. And this particular project is what connects Cortex to TensorBoard. Any questions? That's it. Thanks to the...