For our first talk this morning, we'll have classification based on missing features in deep convolutional neural networks, and Nemanja Milosevic will present it to us. Give him a big hand. Thank you. Hello, everyone. You can all hear me, I guess, so there is no need to, okay. First of all, it's a great honor to be at EuroPython. This is definitely my favorite Python conference and, in general, my favorite conference. And I hope you have fun today in my presentation, and you learn something about some weird neural network models that we are going to cover. Okay, so first of all, a short introduction. My name is Nemanja. I come from Serbia, from the University of Novi Sad. Novi Sad is the second largest city in Serbia. I am a PhD student there in my third year, and I'm also a teaching assistant at the Faculty of Sciences. Okay, so my research topic is neural network robustness, which has been sort of a hot topic recently. Basically, I'm trying to make the neural network models that we have today more robust, or less prone to error, in difficult situations or when someone is actually trying to make our models go wrong. You can find me at these addresses. This is my email. I also have a blog where I write occasionally about some fun machine learning related stuff. I recently wrote an article. Is there anyone from Google here? Okay, then I can talk about it, I guess. I wrote an article about how you can SSH into Google Colab notebooks if you don't want to use the Jupyter interface. I guess that's not breaking their terms of service, but if it is, well, I'm glad they're not here, okay. You can also find me on GitHub. I have lots of projects there that you can look at, side projects, you know, some games and things like that. On most social media you can find me at this handle, which is eight letters because when I was making my first email, that was the limit. So it kind of stuck.
Okay, so this presentation, just to go briefly over it, I'm giving you a chance to run away if you don't like it. It's going to be about this very weird and, let's say, unreasonably working neural network model; I say unreasonably because I cannot explain the exact reason why it works. It has to do with classification based on missing features. So we'll talk about image classification in general, but it can be used in any convolutional model. What it tries to do is to mimic something we all do every day, which is deduction, which neural network models still don't know how to do. So when I say deduction, I simply mean: if you know what all the possible values are and you know what something isn't, you can deduce what it is. So that's what we're trying to do here. It helps in certain scenarios, so we'll talk a little bit about occlusion, when the object you're classifying is behind, for example, another object which you're not interested in. And the general way this presentation is going to go: I'm going to talk about some implementation details, then show you some code and things like that. Okay, so the full source code is available, but more about that later. Okay, another word of warning. This is all very experimental, and I cannot claim with certainty that everything I'm telling you is true. It's true to the best of my knowledge. We sent a proof-of-concept academic paper to a neural network journal based at the University of Prague, and I'm looking forward to their comments. It's an academic journal that specializes in, let's say, weird neural network models, among other things. I'm also looking forward to your comments, so you can say, you know, this doesn't make sense at all, and I will agree with you. And of course, you know that some machine learning models, especially deep neural network models, are very hard to interpret, so they may work, but we don't know why.
And this is the case today. Okay, so what I'm basically saying is: don't believe me and question everything I say. I'm looking forward to your comments. It would mean a lot. Okay, so let's just go over this briefly, if you are not that familiar with convolutional neural networks and how they work. When you work with neural networks for a while, they sort of become very simple, because once you understand how they work, they are really not that complicated. Okay, so if we look at the picture at the bottom of the slide, you can see several layers in a convolutional network, where the first couple of layers are these layers called convolutional layers. So convolution in general, the operation of convolution, is not complicated, and it's not related only to deep learning or neural networks. Convolution operations on images have been around for some time now. You can use them, for example, for edge detection on images. But what made them really work well with neural network models is this: before the whole neural network thing exploded, let's say a couple of years before, you had to handcraft or hand-make these convolutional filters or kernels. So convolution as an operation works as you can see on the image of the dog; there are these small squares. These are called convolutional kernels or filters. And the operation of convolution is basically taking this filter, sliding it over the image, and looking for matches, just multiplying some matrices, basically. So before neural networks, you had to make your own filters, but with neural networks, you can learn these filters. So in these filters or kernels, whatever you want to call them, there are some low-level and high-level features of the images that we are classifying. Okay, so basically in convolutional networks, you're using these convolutional layers for extracting features from an image.
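The sliding-filter idea can be sketched in a few lines of NumPy, using a classic hand-crafted vertical-edge kernel of the kind people made before CNNs learned them automatically (a toy sketch, not how frameworks actually implement convolution):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution: slide the kernel over the image and,
    at each position, sum the elementwise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted vertical-edge-detection kernel.
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

# An image with a vertical edge: bright left half, dark right half.
img = np.hstack([np.ones((5, 3)), np.zeros((5, 3))])
response = conv2d(img, edge_kernel)
print(response)  # strong response at the edge columns, zero elsewhere
```

The point of a CNN is that kernels like `edge_kernel` are learned from data instead of written by hand.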
Okay, so for example, on images of dogs, you can say that the features in these images are, you know, if it has ears, if it has eyes, if it has a waggly tail, for example, then you can say it's a dog. So these are some high-level features, for example. After these convolutional layers come the fully connected layers, which are the traditional, let's say, neural network layers, which contain the weights and biases for actually training your classification algorithm. So to sum it up, you use features from an image to classify it. Okay, so you can say if something has eyes, ears, and a waggly tail, then you can say it's a dog and not a cat, because cats don't have waggly tails, most of them. Okay, so what about missing-feature classification? My idea with this neural network model and algorithm was: what happens if we try to classify something based on the features that are not in the image? So instead of describing a dog the way we did before, with some features, you can say, okay, a dog doesn't have headlights, if you're also classifying cars. Okay, so here is a motivational example from the MNIST dataset; you've probably heard of it, it's a dataset of handwritten digits. Okay, so we have a digit 5 here and two very high-level features. We have one circle-like feature and one corner-line feature that goes from left to right and then down. Now imagine if you couldn't see the 5 and I tell you: we are classifying an image of a digit, and I'm 100% sure it doesn't have these two features. It doesn't have the circle-like feature and it doesn't have the corner-line feature. So because it doesn't have the circle-like feature, we can safely assume it's not a 0, because a 0 is basically a circle. We can also say that it's not a 6, 8 or 9, because these are the digits that have the circle-like feature.
Okay, we can also say, as written on the slide, that it's not a 1, 2, 3, 4 or 7, because all of these digits have this corner-line feature. So what we're left with is that we are looking at the digit 5, even though we didn't use the features of the digit 5 to classify it. Okay, this is the main point. So if there are some questions, please, okay, all clear here? Awesome. Okay, so now, why would you do this? I mean, if you know what the features of the digit 5 are, why would you go the other way around? Okay, so the main reason, going back to adversarial learning and occlusion: what happens if we have partial inputs? For example, a digit in our example is damaged somehow, or half of the pixels are missing, or one part of the image is corrupted somehow, or blurry, and things like that. The classifier that I'm going to show you, which works with missing features, works much, much better on these damaged images than the classic neural network classifier. Okay, so how do we implement this? Well, we mostly went through how you can implement it already, but we're going to go more in depth. We're going to implement this by negating or inverting the output of the last convolutional layer. At that point in the network, we are getting all the features and their positions in the image. So if we invert that, which features are there and which are not, we get which features are missing and where they are missing. Okay, we'll go into more detail. So we have several steps. After we invert or negate this vector, and we'll talk about what we need to do for that, we can train the rest of the network normally. Okay, so let's go through the steps. The first step is to get the features. We could handcraft these features, you know, just draw them, but that's difficult and boring, because when you change the dataset, you have to do it all over again.
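The deduction-by-elimination argument from the digit-5 example can be sketched in plain Python (the feature sets here are illustrative, not taken from a trained model):

```python
# Which high-level features each digit contains (toy example from the slide):
features = {
    0: {"circle"}, 1: {"corner_line"}, 2: {"corner_line"},
    3: {"corner_line"}, 4: {"corner_line"}, 5: set(),
    6: {"circle"}, 7: {"corner_line"}, 8: {"circle"}, 9: {"circle"},
}

def classify_by_missing(missing):
    """Keep only the digits that contain NONE of the missing features."""
    return [d for d, f in features.items() if not (f & missing)]

# "I'm 100% sure the image has neither a circle nor a corner line":
print(classify_by_missing({"circle", "corner_line"}))  # [5]
```

We never looked at the features of the 5 itself; we only eliminated everything the image is not.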
What we can do, and what my algorithm is doing, is simply training the network normally for a number of epochs, let's say 10, and then just taking the weights from the convolutional layers. Very simple. So it's basically transfer learning. You're just taking a snapshot of your model and applying it to your new model, which you change somehow. It's automatic, so you don't have to do the boring stuff, and it's much faster and easier. Okay, so for step two, we need to talk a little bit about activation functions. If you are familiar with neural networks, I'm guessing you know them. So the activation function is what you apply to your neuron's activation, which is the weighted sum of the inputs plus the bias. And we need to be careful here, because we want to invert or negate the output of the last convolutional layer, so we need to be aware of what activation function we are using in that layer. The transformation of this feature-position vector that we are using will depend largely on the activation function in our last convolutional layer. Simple example: if we have a sigmoid function, the sigmoid outputs a number between zero and one. We can say zero means the feature is not there, and one means the feature is there. Okay? So if we want to invert or negate this vector, we simply apply this formula to each element in the vector: we just say one minus x. So if a feature was present and the value was one, it will become zero, and vice versa. Very simple. And that's really nice, but it's 2019 and we shouldn't be using sigmoids in our neural network models anymore. So what if we want to use the popular choice, which is the rectified linear unit, ReLU? You need to be aware, and I'm speaking from experience, because this was one of the, let's say not bugs, but one of the gotchas in implementing this model: the ReLU activation function, or rather its output, is difficult to negate because it can go to infinity.
So you cannot just say one minus something. Okay? It's very difficult to know where the upper bound is. So there are some solutions. If you're using PyTorch, there is, let's say, a hard-upper-limit version of the ReLU function. I don't know if you've heard about it. It's called ReLU6. It goes not from zero to infinity, but just from zero to six. So when I get this vector, I can just say six minus x, and then I will get the missing features. Don't ask me why it goes to six. I'm not really sure why it goes to six, but it goes to six. I actually implemented my own version that just goes to one, but it works largely the same. You can use Leaky ReLU. You can use the new activation function Swish; that could work, based on how the function's graph looks, but I haven't tried it. So, you know, if you try it, let me know if it works. You can use the hyperbolic tangent function, but beware with it also: it goes from minus one to one, so the formula will be a little bit different. It won't be one minus x, it will be just minus x. So, you know, if it was minus one, it will become one, and if it was one, it will become minus one. It's very difficult to make activation functions exciting, sorry, you have to bear with me. So, let's see some code. This is the negative learning network implemented in PyTorch. In PyTorch, it's very easy to make weird neural network models because you have full control over what happens: you basically write a function for the forward pass through the network. You can see in the first two lines we have just some normal convolutional and max-pooling layers with some dropout, nothing too spectacular. In the third row, you see the x.view call; we flatten the vector, so we have 320 features, or 320 positional features, because the positions are important. And then, in the class that you implemented, actually, in my class, I have this net type field.
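The negation rules just listed, 1 − x for sigmoid, −x for tanh, and 6 − x for ReLU6, can be collected into a small helper (a sketch, not the author's exact code):

```python
import numpy as np

def negate(x, activation):
    """Invert a feature-presence vector, given the bounded activation
    that produced it (presence becomes absence and vice versa)."""
    if activation == "sigmoid":   # output in [0, 1]
        return 1.0 - x
    if activation == "tanh":      # output in [-1, 1]
        return -x
    if activation == "relu6":     # output in [0, 6], a bounded ReLU
        return 6.0 - x
    raise ValueError("plain ReLU is unbounded: no upper bound to negate against")

print(negate(np.array([0.0, 1.0]), "sigmoid"))  # [1. 0.]
```

The common requirement is a known upper bound; that is exactly what plain ReLU lacks.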
And if it's set to negative, I just negate the vector. If it's set to negative ReLU, I do one minus x, which we talked about on the previous slide. Okay. Now, an interesting thing, just going back for one second: if you try to negate the ReLU activation function with the one minus x formula, it will work even though it shouldn't. This was, as I said, one of the gotchas, because ReLU goes from zero to infinity. So if the value is zero, it means the feature is not there, and when you do one minus zero, it's one, so the feature is there. But when you have some large number, I don't know, a thousand, you do one minus a thousand and get minus 999. And because I had some ReLU activation functions after the convolutional layer, and ReLU ignores all the negative values, it just worked. So you can probably get away with, you know, running it. This code will work even though it shouldn't work. Okay. So you can see, if we are using the net type negative ReLU, we just use the function ones_like, which basically makes a tensor with the same dimensions as x, with all ones in it, and we just add the negative of the original vector, which is basically the same as doing one minus x. And then the rest of the network is completely, completely normal. Okay. So we've covered two steps: we got the features, and we now know how to extract the missing features from an image. Okay. So it's almost ready to be trained, but we have a, let's say, little issue. It's actually a big issue. When we modified our forward pass through the network by activating the negation part, we also, without wanting to, affected the convolutional layers during training. Remember, we have pre-trained convolutional layers, and now we are seeing some really weird patterns.
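Pieced together from the description above, the forward pass might look roughly like this in PyTorch (a sketch: the layer sizes are chosen to reproduce the 320 flattened features on 28x28 MNIST input, and the `net_type` values follow the talk, not the exact repository code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegativeNet(nn.Module):
    def __init__(self, net_type="negative_relu"):
        super().__init__()
        self.net_type = net_type
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        # Normal conv + max-pool layers with some dropout.
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)             # flatten: 320 positional features
        if self.net_type == "negative":
            x = -x                      # plain negation
        elif self.net_type == "negative_relu":
            x = torch.ones_like(x) - x  # same as 1 - x: "which features are missing"
        # The rest of the network is completely normal.
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.fc2(x), dim=1)

out = NegativeNet()(torch.randn(2, 1, 28, 28))
```

With `net_type` set to anything else, the same class behaves as an ordinary classifier, which is handy for comparing the two.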
Because convolutional layers are also learned as part of training the neural network, all the filters will get corrupted. They will no longer be the features from the digits dataset or whatever dataset we're using; the negation in the network will affect these convolutional layers in a very weird way. I don't have a visualization at hand, but it looked like junk, basically. It didn't look like features from the image anymore. Simple solution: you can freeze the convolutional layers. It's very simple in PyTorch. When you freeze them, they will no longer be modified during training; you just use them as they are. Optionally, we can also reset the other layers, the ones which are going to contain all the weights for the negative network. It's an optional step, but it helps with convergence. If you don't reset them, the network will still achieve the same accuracy, but in a larger number of epochs. So it's just an optional thing which helps a little bit. Here's the code. In the first two lines, below the comment, you can see we are reinitializing the fully connected layers. The HIDDEN doesn't mean I'm hiding something from you; it's just a constant, I think it's 50. The freeze-convolutional-layers part is also simple. You just go through the convolutional layers in your model, we have only two here, conv1 and conv2, and we just say that their weights, all their parameters, don't require gradients, so autograd won't mess with them. Also, another gotcha, ask me how I know: you need to reinitialize the optimizer if you're changing your layers, because otherwise it will still attempt to modify them and throw an error if you don't do this step. Okay. So we have completed our model. Now we need to test it out. And for testing, we introduced what we called the PMNIST, or partial MNIST, dataset, which is very simple. So, the original MNIST dataset had 60,000 training samples.
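The freeze-and-reinitialize steps just described can be sketched like this in PyTorch (a stand-in model with the layer names from the talk; the sizes mirror the 320-feature example and are assumptions, not the repository code):

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model with the layer names the speaker mentions.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc1 = nn.Linear(320, 50)   # 50 is the HIDDEN constant
        self.fc2 = nn.Linear(50, 10)

model = Net()  # imagine this has already been trained normally

# Optional: reinitialize the fully connected layers (helps convergence).
model.fc1 = nn.Linear(320, 50)
model.fc2 = nn.Linear(50, 10)

# Freeze the convolutional layers so training no longer corrupts
# the pre-trained filters: autograd will leave them alone.
for layer in (model.conv1, model.conv2):
    for p in layer.parameters():
        p.requires_grad = False

# The gotcha: recreate the optimizer AFTER changing the layers, passing
# only the parameters that still require gradients.
optimizer = optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)
```

If you keep the old optimizer, it still holds references to the replaced layers' parameters, which is exactly the error the speaker ran into.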
We didn't mess with these. And we also had 10,000, I think, validation samples, test samples. So we just extended it a little bit: from these 10,000 validation set images, we introduced 40,000 new images in a very simple way, just to test it out. You can see it on the image at the bottom. All the way to the left is the complete validation set image. Then we have something we call the vertical cut, because a vertical half, 50% of the image, is missing. Then we have the horizontal cut, where the left side of the image is missing. We have the diagonal cut, because we were running out of ideas, where you just cut, you know, some squares from the image. And we also have what we call the triple cut, because we just removed three small squares from the image. So, very simple to make. Just an additional remark: it would probably be easier to just train on partial samples. If you want a neural network which can classify half images, you just train your network on half images. But this is just, let's say, a proof of concept, and we want to emphasize that you are not always going to have these easy ways to get to the partial input sets. For example, think of traffic signs. You want to classify traffic signs, and what if a tree is in front of the traffic sign? So you only have a partial view of it. A human will, you know, see: okay, it's not red, so it's probably not a stop sign, things like that, and will immediately be able to tell there's no need to stop. So we're trying to mimic this. Another remark is that on the unmodified validation set, we still have top accuracy, let's say great accuracy, just as much or even a little bit more than the traditional model. So this method doesn't break your network when the input is still in one piece, or whole. It doesn't only help with the partial inputs.
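The PMNIST-style cuts are indeed very simple to generate; here is a sketch with NumPy (which half or which squares get removed is my assumption, the paper's exact cuts may differ):

```python
import numpy as np

def vertical_cut(img):
    out = img.copy()
    out[:, img.shape[1] // 2:] = 0   # one vertical half removed
    return out

def horizontal_cut(img):
    out = img.copy()
    out[img.shape[0] // 2:, :] = 0   # one horizontal half removed
    return out

def triple_cut(img, size=7, seed=0):
    out = img.copy()
    rng = np.random.default_rng(seed)
    for _ in range(3):               # remove three small squares
        r = int(rng.integers(0, img.shape[0] - size))
        c = int(rng.integers(0, img.shape[1] - size))
        out[r:r + size, c:c + size] = 0
    return out

img = np.ones((28, 28))  # stand-in for a 28x28 MNIST digit
```

Applying each cut to the 10,000 validation images gives the extra 40,000 partial images described above.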
It even helps a little bit with the whole inputs. And PyTorch makes implementing weird models really a treat. If you're not using PyTorch, you should really try it. There is a talk, I think, about TensorFlow 2.0, which has some of the new dynamic functional features, later on, after this one. I'm going to go to that one. Okay. So, some results. When you train this model, it's not a very big network; it's basically just the example network from the PyTorch repository as a baseline. So, not a huge network, I think it has four layers. These are the results. You can see we have our five validation datasets and our accuracies, and in the delta column, delta is how our model improved upon the original model. On the unmodified validation set, our model improved just a little bit, some 0.31%. But you can see that on all of the other, let's say, partial inputs, we have some improvements, and that's with a very simple modification, which you can do very easily. For the vertical cut, for some reason, and it's very difficult to interpret what's going on, we are seeing 9%, which is a big increase. We have 10,000 images, so 9.05% means that our new network can now classify 905 more images correctly compared to the previous network. Okay. Just briefly, to go over the future work, we want to try this on different, sorry? It's just, on your numbers there... Mm-hmm. Yes, but this model is not a state-of-the-art model. Okay, so. I get it, but I mean, it's a good improvement. Yeah, yeah, it's a good improvement, you know, but it's smaller than the rest of... Yeah, but it improves. Yes. Yes, of course. Yes, that's the point. We already experimented with the CIFAR datasets, CIFAR-10 and CIFAR-100, and we are seeing similar results there. And we want to try different architectures. I can tell you now that deeper models which have higher-level features, so more convolutional layers, yield better results.
So, the higher the level of the features, the better the results we can get for negative classification. We also want to try adversarial attacks. We already played a little bit with DeepFool, which can, you know, modify the inputs based on the output of your neural network, and we want to see how it affects our network. And at this very easy to remember link, you can find the complete PyTorch implementation, which is basically the same implementation as in the paper we sent. Okay, so, five minutes early. Thank you so much for your attention. I hope you had fun. Okay. Thank you, Nemanja. So, any questions? Don't be shy. So, yeah, that was a very interesting talk. Thank you very much. Yeah, one thing that kind of struck me is that it's quite similar to the idea of, you know, the 20 questions approach of figuring something out. Oh, yes. You mean the game? Yeah, exactly. Is there any way to kind of try to use that methodology to kind of... That's a really good idea. Yes. Because maybe that would like it. Yes, thank you. Yes, that's a very good idea. We can try that, yes. That's the principle of deduction, so it would probably work really well here, if you can model it somehow. But thank you, yeah. That's a great suggestion. Okay, anyone else? Don't be shy. Okay, so then I guess... Thank you again. Thank you. For our next talk this morning, we'll have Michel de Simone talking about TensorFlow 2.0. Was it? Yep. Yep. So give him a warm welcome. So let me see. Okay, can you see it clearly? Okay. Okay, here you go. Is the sound okay? Everyone can hear me? Okay, perfect. So let's start. Well, welcome to my talk. I am Michel de Simone, and this talk is TensorFlow Strikes Back. Many of you may have tried TensorFlow, PyTorch or Keras. Raise your hand if you tried TensorFlow in the past. Okay, many of you. Raise your hand if you tried PyTorch. Also many of you. And how many of you tried Keras in the past? Okay.
How many of you liked Keras? Okay, perfect. So you will probably like this talk too. So just a quick word about me. My name is Michel de Simone. Online I usually go by the name Yubik. You can find me on Twitter, Reddit, GitHub and so on as Mr. Yubik. And I've done a couple of things in the past. I work as a machine learning engineer and researcher at ZuruTech, which is a Chinese company that also does a lot of R&D in the fields of deep learning, computer vision, stuff like that. And I also work as a freelancer through my own company, which is Yubik.Tech; you will find the website if you are interested in contacting me for anything. I am also the founder and organizer of the PyData chapter for the Emilia-Romagna region in Italy. We will be having talks and meetups starting in September, so yeah, yay. And I'm also the manager of GDG Bologna, which is the Google Developer Group in Bologna. So if you want to reach me, the contacts will also be on the site of the conference. So let's get into it. These are some other contacts; these personal sites are still under construction, but they will be up by the end of the conference. Follow them; I sometimes post interesting stuff, especially things like conference recaps. So, okay. Let's dive into the talk now. So, thanks, TensorFlow 1.x. It was amazing. It had a lot of very nice things, like phenomenal computational power. It was very fast. It had a nice Python layer on top of a very performant C++ core. Amazing performance. It was very easy to deploy in production, and it still is to this day. Like, if you try PyTorch, I think one of the main drawbacks of PyTorch for now is still the production story. You can deploy a Caffe2 model over ONNX, but I believe that the production support for TensorFlow is still much, much better.
Luckily, things are improving on the PyTorch side. Also, TensorFlow had a beautiful static graph which allowed you to basically export your model and then do all sorts of cool things with it. You could save your trained model with Python and serve it with any other language you wanted. You could do optimization. You could do a lot of things. And it was very nice. And also, a very high-performance input pipeline in the form of the tf.data module. It was built by Google themselves for managing the input throughput for the TPUs, if I remember correctly, because one of the problems was that they had this amazing hardware, but they weren't able to use it properly because they didn't have a high-performance input pipeline. So there you have it. But it had a very ugly and clunky API. Yes, there was Python on top of the C++ code, but it was almost not really Python. I mean, you had to forget everything you knew about Python when you came to the TensorFlow 1.x experience. The graph was statically defined, because it was a static graph: basically, you describe the computation, and only later do you run it, which, if you're familiar with Python, is not at all Pythonic. Together with that, you had the problem of variables, which were added to a global namespace and were not cleaned up or garbage collected. So if you lost the pointer to your variable, it stayed there; you had to retrieve it manually. You had to work with scopes explicitly. It was really not a Python experience. And that, to me, is probably one of the main reasons why PyTorch grew in popularity so much: it actually offered a proper Python experience for deep learning. So it was nicer. The point is, when you tried your hand at both TensorFlow 1.x and PyTorch, you would see that PyTorch was really usable.
PyTorch had problems scaling in production, because you didn't have support for things like TF Serving, and in the beginning the cloud providers were only providing support for TensorFlow models and things like that. Over time, the PyTorch story on the production side got a lot better, while TensorFlow stayed behind in user experience and usability; it was a mess. And it was a mess until very recently. Yeah, this is a bit of a greentext; it's impossible to read on the screen now, unfortunately, but you will find it online. It's a joke about the usual experience of a TensorFlow developer over the years. You start, you're really loving it, the ML craze, very nice. Then PyTorch comes along, and you start seeing: oh my God, I can actually program in this stuff, even though I haven't spent hours and hours trying to learn the special syntax required to work with variables and graphs and stuff like that. And so you also see a lot of people migrating towards PyTorch. And the idea is basically that you start thinking to yourself: oh, maybe I should try PyTorch too, but I have a lot of code in production. How the hell do I port everything? You start panicking, you vomit, you start trying, you don't want to port everything. And then maybe you decide to wait. Summer 2018 comes along. You try TF eager mode, which should have been the first sign, you know, the advance party of this new era of TensorFlow. You try it, but no, it doesn't work very well. There are a lot of bugs. There is not really that much support for it yet. So you panic maybe even more. You wait some more. PyTorch 1.0 releases, and it's amazing. It's blazingly fast. The production story got really better. It now also has the possibility of interacting with a form of static graph. And you're really wondering if you should stay on the TensorFlow boat. But then the announcement came, and TF 2.0 was a reality.
And so maybe you did a good thing waiting before jumping ship. So what is this 2.0? Well, 2.0 basically has a new logo. It's sleek, it's more modern looking, it's beautiful. And maybe the same thing happened to the API, because the API is actually sleeker, more friendly, and overall less in your face with "you are not comprehending the graph, you are stupid, you cannot touch the global namespace" and stuff like that. So let's see what is really changing. First of all: you either die a static graph, or you live long enough to become eager by default. In TensorFlow 2.0, you no longer have the static graph. Actually, you still have it, but you don't know it. And there will be a presentation later on from my colleague, who is in the first row, that will show you why there is still a static graph powering everything in TF 2.0. So the idea is that with TensorFlow, eager execution is an imperative programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing computational graphs to run later. Basically, what this means is that you don't need a PhD anymore to do TensorFlow. You can simply do computation as if it were plain Python. If you have a variable called a, which is a float of 1.0, and a variable called b, which is a float of 1.0, and you sum them together, you're seeing 2.0. If you did this in TensorFlow 1.x, it didn't work: what you saw was a TensorFlow operation node, and you still had to run your session and evaluate it. Well, this is gone, luckily. In TF 1.x, you were required to manually chain together your operations into a graph, and so you had to construct your dataflow graph. Now in 2.0, you don't need that anymore. Also, eager is enabled by default. Eager is now the default behavior. So it is basically PyTorch.
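The a + b example can be shown concretely (this assumes TensorFlow 2.x is installed; in 1.x the same sum only produced a graph node):

```python
import tensorflow as tf

# TF 2.x is eager by default: operations return concrete values immediately.
a = tf.constant(1.0)
b = tf.constant(1.0)
c = a + b
print(float(c))  # 2.0 -- no session, no graph building needed

# The TF 1.x equivalent only built a graph node; you then had to run it:
#   with tf.compat.v1.Session() as sess:
#       print(sess.run(c))
```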
What you also don't need to do anymore, especially with the stuff that I will show you later, is use things like tf.control_dependencies or the special TensorFlow control flow, like the TensorFlow if and for loop and while, because you can use Python now. I mean, if you're in Python, use the Python stuff. So, yay. So yes, we can now debug, and especially the debugging: in TF 1.x, the debugging experience was really a pain. Now we can debug and write code in vanilla Python. Yes. Finally, after, I don't know, four years. So, hasta la vista, globals. Second thing: no more global namespace which keeps your variables and does not recycle them. In a decision which, I believe, was due to environmental awareness and responsibility, the TensorFlow team decided that it was time to start recycling variables. So in TF 2.0, whenever you lose track of your variable, whenever you lose your Python pointer to a variable, you lose it: the garbage collector comes in and recycles it. So you'd better keep track of your variables. This may seem like something which can be a bit of a pain for the end user, but actually it is really simple, because on one hand it forces you to really know what is going on and to keep track of everything, and secondly, if you use Keras, which is now the default API, you do not need to track this manually unless you do really, you know, very custom things. But even then, it's not that difficult to track them. If you use Keras, you basically don't notice, and you can delete stuff as if it were a Python object, and the object is actually gone. So once again, it feels like using Python, not some sort of alien domain-specific language which happens to run in Python files. So it's okay. And then also: functions, not sessions.
So we no longer have session.run, which was the construct you used in TF 1.x to run a particular graph. If you remember what I told you earlier, if you sum two things together in TensorFlow 1.x, you don't get the result; what you get is basically a graph with an end node, which is your result, but to get the value, you have to run the session. This is no longer the case. You can simply write a normal Python function and get the values. Not only that: as I told you, the static graph is not gone. TensorFlow is not really dynamic in the sense of having a dynamic-first approach; the static graph is still there, and you can use it to leverage its performance, because the static graph is faster: it gets better optimizations, for example for the GPU, so it runs faster. So you can still use it, and to use it, you just need a decorator. This decorator is the tf.function decorator. You put it on top of a Python function, and the code inside it will be parsed and converted into a static graph definition, and your function will then be run as that static graph, so it will be faster. And of course you can also export it, with many things I will show you in the last part of the talk. Now, the nice thing about this is that you can define, for instance, a custom training loop, which I will be showing you later on, as if it were pure eager code, and then, when you're done, you slap the decorator on it, and basically you turn everything that you defined eagerly into a static graph. And the static graph runs very fast, so it's amazing. This uses something called tf.autograph, which is basically a magic black box that eats a pure Python function and spits out the equivalent TensorFlow static-graph code.
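A minimal sketch of the tf.function decorator described above; the function name and shapes are my own:

```python
import tensorflow as tf

@tf.function  # AutoGraph traces this into a static graph on first call
def dense_step(x, w, b):
    # Written as ordinary eager-style code, but executed as a graph
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.ones((2, 3))
w = tf.ones((3, 4))
b = tf.zeros((4,))
y = dense_step(x, w, b)
print(y.shape)  # (2, 4)
```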
If you want to know more about how this thing works, as I told you earlier, see my colleague's talk this afternoon; if I remember correctly, it's probably the last talk of the day or something like that. It is very interesting; it does a lot of magic underneath, it manipulates the Python syntax tree. It's nice even from a pure programming-language-theory kind of perspective. Also: the death of some APIs. tf.layers, which was the package you used if you wanted to create neural network layers in TensorFlow, is dead. tf.Graph has gone into hiding, as I told you earlier. tf.contrib, which was this huge module containing everything third-party, half-implemented, or even very cutting-edge research stuff built on top of TensorFlow, is gone. And it was about time this API polishing happened, because there was a lot of redundancy, and contrib was a mess. So what do we have now? Well, now we have Keras. Long live Keras. So what is Keras? Keras, at a glance, is not a framework in itself, even though it has now basically become a synonym for TensorFlow. Keras is a set of API specifications for deep learning libraries. In the beginning, Keras was framework-agnostic, meaning that you could plug and play different frameworks as its backend. However, those frameworks were TensorFlow, CNTK, and Theano. Theano is deprecated; CNTK was deprecated by Microsoft very recently. So we are now stuck with TensorFlow if you want the cutting-edge stuff. Basically, Keras now only lives on top of TensorFlow; unless someone ports it to PyTorch, I don't think we will see any other backend for the time being. It is very high level. Sometimes it can feel a little too magic, like you don't really know what is going on unless you really open the engine and start looking under the hood. But it is very, very simple to use, and it offers basically two, actually three, sets of APIs.
Two are for beginners, let's say; almost everything you do probably falls into the first two categories, which are the Keras sequential and functional model APIs, and I will show you those later on. And then you also have the training API, which is more for experts or for custom models. There's a nice website, and you should also look at tf.keras in the TensorFlow docs. One clarification: if you install Keras without any TensorFlow, you only get Keras as the specification library, let's call it that. If you want the Keras optimized for TensorFlow, you simply install TensorFlow, and Keras for TensorFlow comes with it; you access it through the tensorflow.keras module when you program. So, first things first: before we do anything with models and training and things like that, let's see what layers are. Keras layers are the most basic structure you use when defining your model with Keras. They are now the one and only layer API: this is the API to use to define layers. If you want to create a custom layer, you subclass from a Keras layer, and there is a whole guide on how to implement your own Keras layer. But out of the box, you get almost anything you may want to use. So this is the only API you have now. It's available under tensorflow.keras.layers, and since we usually import tensorflow as tf, it's usually tf.keras.layers. Layers are Pythonic objects, meaning they behave like plain Python and not like the computational operations of TensorFlow 1.x, and they are very simple to use. They are basically all class constructors, so you have to initialize them; you initialize them with a configuration, and once they are initialized, they expose a call method, so you can use them as callables, passing them the input and so on. So, very simple to use. Then we have Keras losses.
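Before moving on, the init-then-call pattern for layers just described can be sketched like this; the layer sizes are my own choice:

```python
import tensorflow as tf

# Step 1: configure the layer via its constructor
dense = tf.keras.layers.Dense(units=4, activation="relu")

# Step 2: use it as a callable; the first call builds the weights
out = dense(tf.ones((2, 8)))
print(out.shape)  # (2, 4)
```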
There are no more TensorFlow losses or anything like that; everything now lives under the tf.keras.losses module. With TF 2, out of the box you already have a very wide selection, and implementing custom losses is actually really simple: you subclass from tf.keras.losses.Loss, which is the base interface for everything related to losses, and then, in order to have a valid loss, you just need to define a call method on it which accepts y_true and y_pred, which are the things used to calculate the loss and which will be passed to the loss when it is invoked inside a model. Then we have Keras optimizers. Long gone are the TensorFlow optimizers, I can't remember if they were under tf.optimizer or tf.train, but they are gone. You now have only Keras optimizers, and they live, of course, inside tf.keras.optimizers. There are a bunch of them, and once again, if you want to create a custom one, there are guides on how to extend them on your own. So let's dive into the core, the very interesting part, which is the model API. We now have three model APIs. The first one is the sequential. Sequential is the most straightforward: you create a model by stacking layers on top of each other, and they feed one into the other in a straight line. So it's the simplest one. You can either specify the layers you want by passing them as a list when you construct the Sequential model, or you can instantiate the model and then repeatedly call the .add method and pass layers to it. Here are two examples: in the first one we pass a list of layers to the constructor; in the second one we create the model and then repeatedly call the .add method to add the layers we want.
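The two Sequential construction styles just mentioned can be sketched like this; the layer sizes are my own choice:

```python
import tensorflow as tf

# Style 1: pass the layers as a list to the constructor
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Style 2: start empty and .add() the layers one by one
model_b = tf.keras.Sequential()
model_b.add(tf.keras.Input(shape=(16,)))
model_b.add(tf.keras.layers.Dense(32, activation="relu"))
model_b.add(tf.keras.layers.Dense(10))
```

Both models are equivalent; the list form is more compact, while `.add` is handy when the architecture is built up programmatically.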
Also note that activations, and things like batch normalization, softmax, and so on, can be passed explicitly as a layer, or they may sometimes be configured as properties of a layer; you will see some examples later on of how this is done. Then we have the functional API. Now, the sequential API has a limit, which is that your stack of layers has to be linear, in the sense that, for instance, you cannot have multiple inputs, or an input that comes in later in the computation. So it's somewhat limited in scope and in use. The successor, what you should use instead of the sequential whenever it feels too restrictive, is the functional API. The functional API is called that because you initialize a layer and then call it on things, and the graph of the model is created by this chaining of calls to the various layers. It's actually pretty simple to use. In the words of the documentation, the functional API can handle models with non-linear topology, models with shared layers, and models with multiple inputs or outputs. It's based on the idea that a deep learning model is usually a directed acyclic graph, a DAG, of layers, and the functional API is a set of tools for building graphs of layers. So it's basically a more advanced set of tools on top of the sequential. Here is how it works. We start by creating the input node; we usually never specify the batch size. What you get in return is an input object containing information about the shape and type of the input you expect to feed to your model. You can inspect this information by calling several methods on the object you get, which is an instance of tf.keras.Input, so you can check that everything you are passing is correct. Then we add nodes by simply calling the various layers, or even models themselves.
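The functional-API flow just described, create an input node, chain layer calls, looks like this in practice; the layer sizes are my own:

```python
import tensorflow as tf

# Input node: shape only, no batch size
inputs = tf.keras.Input(shape=(784,))

# Each layer is initialized, then called on the previous node
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(10)(x)

# Package the graph into a model by naming its input and output nodes
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()  # prints the layer-by-layer summary table
```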
So you can plug and play different models as if they were layers, by using the functional API on the inputs we defined earlier. Layers and models, as I told you, have to be initialized before they can be called: first you initialize them, and then you can call them on things to create the graph. Once everything is done, you simply package everything together using the keras.Model object. With keras.Model, you basically construct it so that the inputs are the input nodes: in this case, I don't know if you can see it, but here we have our input node, and then we have our output node down there. So with keras.Model, in the end, what we do is simply specify which inputs are the first nodes of the computational graph we defined by calling the various layers, and which node is the output of that graph. If we have multiple inputs or multiple outputs, the only difference is that instead of passing a single object, we pass a list. So it's really that simple. Once you have the model assembled, you can inspect it by calling the model.summary method, which prints a very nice-looking summary in your console. There are also ways to generate visualizations of your model's graph, so there are plenty of tools already baked into Keras to explore your model. Then we have the final API, which is the Chainer API, as I like to call it, or the subclassing API, which is the official name. Why do I call it the Chainer API? Because Chainer was the framework that popularized this sort of API. The idea is that you subclass from an interface, which is a primitive model, and then you basically handle everything yourself, in terms of initializing your layers and then defining your forward pass.
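The subclassing pattern just described, layers created in the constructor, forward pass written by hand, can be sketched like this; the class name and layer sizes are my own:

```python
import tensorflow as tf

class TwoLayerNet(tf.keras.Model):
    def __init__(self, num_classes=10):
        super().__init__()
        # Layers are created here, in __init__ ...
        self.hidden = tf.keras.layers.Dense(64, activation="relu")
        self.out = tf.keras.layers.Dense(num_classes)

    def call(self, inputs):
        # ... and the forward pass is defined by hand in call()
        x = self.hidden(inputs)
        return self.out(x)

model = TwoLayerNet()
y = model(tf.ones((2, 16)))
print(y.shape)  # (2, 10)
```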
So it basically gives you the most power in terms of customizing your model, but of course it comes at a cost: you don't have the checks that are built into the functional and sequential APIs. Also, you usually tend to have more bugs this way, because you are defining your own forward pass, so if you do something strange, bugs may arise. It's more error-prone, so the advice is: use it only when strictly necessary. And as I told you, it's actually simple to use. All you need to do is, for instance in this case, subclass tf.keras.Model, then define your own __init__, which calls the superclass's __init__; you can pass it a lot of parameters to construct it properly. There you define the pieces that will be available for the forward pass, and then you define your own forward pass. This is useful because you may have strange models, like GANs or something like that, which are usually better done this way than with the functional or sequential APIs. Or maybe you want to define your own model. You can also use it to define a small model and then plug that model into the functional API, because, as I told you earlier, you can mix and match layers and models, which basically operate in the same way; a model is just a collection of layers in this case, so you can use them interchangeably. So, here is the beautiful pipeline diagram from Google: the high-performance input pipeline. We have seen the layers, the losses, the optimizers, the models. What do we need for a proper training? Well, we still need the data. How do we fetch data? We use the tf.data package. Not a lot has changed since TF 1.x; it's just that now, as with the overall TensorFlow experience, it is more usable, more intuitive.
In the older TensorFlow, you needed to manually create your initializers and pass them to the model and so on. Now everything you create with tf.data is Pythonic, meaning you can iterate over it in a simple way. The cool thing about the module is that the API introduces this object, tf.data.Dataset, which is an abstraction you can see as a sequence of elements. You can also use it to define computational pipelines with reusable elements and various transformations, and everything is optimized under the hood to be extremely fast; you can customize it even further to basically maximize your hardware utilization. The basic idea is that everything has to start from a source, and you can have two kinds of sources: in memory or from a file. If you're working with in-memory data, you have tf.data.Dataset.from_tensors or tf.data.Dataset.from_tensor_slices. If the input is stored in the recommended TFRecord format, which is a file format devised by Google to be extremely performant when used together with tf.data, you can use tf.data.TFRecordDataset to construct everything. Once you have done that, the dataset object exposes a series of methods you can call to create a computational pipeline that does transformation, mapping, everything you may want to do. There are already pre-built functions and transformations you can apply, or you can define your own, for instance with the map function: you use map together with Python callables, like a lambda function, and you define your own conversions and so on. So it's very nice to use.
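A small sketch of the in-memory source plus transformation pipeline just described; the numbers and transformations are my own:

```python
import tensorflow as tf

# In-memory source -> transformation pipeline -> plain Python iteration
dataset = (
    tf.data.Dataset.from_tensor_slices(tf.range(10))
    .map(lambda x: x * 2)    # element-wise transformation
    .shuffle(buffer_size=10)
    .batch(4)
)

for batch in dataset:  # datasets are Python iterables in TF 2.x
    print(batch.numpy())
```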
There is documentation, because, as I told you, there are a lot of methods already implemented: you can batch, you can repeat, you can shuffle, everything you need in a proper deep learning input pipeline. Also, in TF 2.0 the dataset is a Python iterable, which means you can consume its elements in a for loop, or you can create a proper Python iterator object and consume it with the next function. Now we also have the input, so it's time to do some training. Before we can do the training, we have another step. Let's say there are two ways of training a model in this new API. You can use the pure Keras approach, which is very performant; it's not so customizable, but, as with the model APIs that come out of the box, if you use the one that doesn't require tinkering, you are basically safe that no bugs should arise. If you want this pure Keras approach, you have to know what model.compile does. What model.compile does is this: after you create your model, you call this magic method compile, you provide it a loss function and an optimizer, and it configures your model for training; it's basically the preparation for the training. As you see, you can pass it several arguments, and there are three important ones. The first is the optimizer: as I told you, it's basically an instance of a Keras optimizer. You can either pass the object or specify a string with the name of the optimizer, which is some Keras magic that I don't particularly like, because you mix and match strings and Python objects. I don't really like it personally; I strongly advise against it, but you can do it.
Same thing with the loss function: you can pass it as an object, or as a string if its name matches a pre-built one. Again, I don't like passing strings; I prefer passing the object directly, because I like to see what I'm actually passing, rather than trusting the magic under the hood. Your mileage may vary, but it's nice to have the option. Then you also have metrics. Metrics are used for logging and evaluation purposes: they are simply the quantities you care about and want to monitor. Additionally, you can use run_eagerly if you want to force the model to run eagerly; otherwise Keras does all its optimizations and everything becomes a static graph. This is how you compile a model. For instance, here we have a sequential model. As I told you earlier, even activations can either be expressed as a layer, so your model has more layers in it, or specified as a string in the layer definition. Again, I don't like using strings; I prefer passing things explicitly. This is an example taken from the TensorFlow website. After that, we compile it. As you see here, we pass, for instance, a proper object, an instance of the Adam optimizer, but the loss and the metrics are defined in this stringy way. I mean, it's okay; it's very simple. Now that we have done this call, the model is ready to be trained. How do we train it? Well, we fit it. If you're somewhat familiar with scikit-learn, it's not that different; Keras has a proper scikit-learn-style API. Unless you're in a very particular use case, this is the function to use to train your model, and it's vastly preferable to the custom training loops that I will show you later on. It is fast, it is optimized, and it is reliable, in the sense that it's much less buggy than defining your own training loop.
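Putting compile and fit together, here is a hedged sketch in the objects-over-strings style the speaker recommends; the architecture, hyperparameters, and random toy data are my own:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(3),
])

# Pass objects, not "magic strings": it is explicit about what actually runs
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# Random toy data standing in for a real dataset
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 3, size=(64,))

# With NumPy inputs we specify the batch size explicitly
history = model.fit(x, y, epochs=2, batch_size=16, verbose=0)
```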
model.fit has a ton of arguments, and the most important ones are x and y, the input data and the target data; epochs, how many epochs we want the model to run for; and the batch size. As you see here, there are two different ways. For instance, if we have data in NumPy form, we have to specify the batch size, and then we can also give it validation data, and it will do the validation using the metrics we specified. Or, as in this example, you see it's pretty simple, we can use model.fit together with a tf.data dataset, the high-performance input pipeline. If we do it like that, we won't need a batch size, but we will need steps_per_epoch, for instance, which tells the model how many steps to iterate over the dataset. Actually, the story is a bit more complicated, but not that much. You may have other requirements, but it is really easy to debug if you encounter errors this way, and I personally recommend, whenever you can, to always use the tf.data dataset. So this is the last part, which is what I call the dark power: custom training. Custom training is, in my opinion, probably one of the nicest things about TF 2.0, and the idea is that beyond the safety of model.fit and model.evaluate lies the dark power of the gradient tape. This power awaits practitioners who embrace it and trade their sanity in exchange for perfect control over the training. What this means, jokes and theatrics aside, is that you have this object, the GradientTape, a TensorFlow object with a scope you can record in. You open it with, for instance, "with tf.GradientTape() as tape", and every operation that you do inside it will be recorded, so you can later extract gradients and apply them to variables.
In this way you can basically define your own backward pass and everything. The problem is that it is more bug-prone, but it also gives you a lot of power. If you want, for instance, to train a generative adversarial network, this is the way to do it: doing it with Keras and model.fit is a pain, you have to do a lot of magic, but with custom training it becomes simple. Of course, it's more bug-prone. And here is an example. I don't know if I can scroll it; it probably doesn't fit in the slide because there are a lot of comments, but you can find this example on the TensorFlow site. The really important thing to see is this part, where we open the gradient tape and invoke the model: all the operations done inside the tape are recorded, and they will be used to compute the gradients later on. Once we are done with it, once we have defined everything we need and called all the models and all the layers we had to call, we exit the scope, and after we exit the scope, the only thing left to do is extract the gradients and apply them, using the optimizer, to our variables. It is this call over here, if you can see it. So we extract the gradients, then we use the optimizer to apply them to the trainable weights. This is how you define your own custom training loop. This is usually used together with things like the subclassing, or Chainer, API, because usually, whenever you have a model so peculiar that you need the subclassing API, more often than not you also need a very advanced training technique, and maybe model.fit doesn't cut it, so you have to write it this way.
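The tape-then-apply pattern just described can be sketched as a minimal custom training loop; the toy regression model, optimizer choice, and random data are my own, not the slide's example:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

for step in range(5):
    with tf.GradientTape() as tape:
        # Every op inside the tape's scope is recorded for differentiation
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    # Outside the scope: extract gradients, then let the optimizer apply them
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
```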
The very last thing is exporting models, because, as I told you, I personally think the power of TensorFlow is not in the library itself, but in the ecosystem of projects surrounding it. I wanted to show you some of them too, but there won't be time, so maybe reach out to me during the conference; I can show you some very neat third-party libraries, extensions, and tools built on top of it, because there are really many of them, from differential privacy to probabilistic programming. So come and see. Now, everything you need to know about exporting models: we are running out of time, so these are links you will be able to click once I release the slides; they point to the TensorFlow documentation. The TL;DR is very simple. If you trained your model with Keras and it is not subclassed, so it is either sequential or built with the functional API, you can simply call model.save, and you have a lot of options. You can save it either as an HDF5 file, or in the, let's call it proprietary, but it's not proprietary, it's open, TensorFlow-specific SavedModel format, which I usually prefer to work with, although with third-party libraries there may sometimes be problems. So you save in one of those two formats: you can save the whole model as HDF5 or as a SavedModel, you can save only the architecture of your model, you can save only the weights as HDF5, or you can even export your weights in the SavedModel format. So you have a lot of options. For subclassed models, the story is more complicated, and I won't show it here; I will just point to the link in the documentation, because you usually need the original Python object, so it's more like restoring the model to a previous state. They are a bit harder to fully export in a standalone way, but it's doable; it's not that complicated.
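A minimal sketch of saving and restoring a non-subclassed Keras model; the tiny model and file names are my own, and I show only the HDF5 paths here (the SavedModel variant works the same way through model.save with a directory path):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

model.save("model.h5")                  # whole model in a single HDF5 file
model.save_weights("model.weights.h5")  # weights only

# Restore the full model (architecture + weights) from the HDF5 file
restored = tf.keras.models.load_model("model.h5")
print(restored(tf.ones((1, 4))).shape)
```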
It's just a little bit more involved and would require additional time that we don't have. So, to conclude: if you're using PyTorch, try TF 2.0. If you're not using PyTorch, try PyTorch too; I really recommend it, but I also highly recommend trying TF 2.0 now. Remember that TF 2.0 is not yet fully stable, in the sense that it is still in beta. It's actually a lot more stable than it was in alpha, but, you know, there may be breaking changes and things like that. That said, we use it in production every day, and it is usable. Sometimes you may bang your head against the wall, but more often than not it works really, really well, and it's a beautiful experience. If you love Keras, try TF 2.0, because it's basically Keras on steroids. And if TensorFlow 1.x scared you away because it made you feel stupid about the graph and everything, I totally understand the feeling, I had it too; come back for TensorFlow 2.0, because everything that scared you is gone. Most importantly, follow me on Twitter: I am mr_ubik, "mr underscore ubik". I tweet a lot about deep learning, TensorFlow, and Python. And with that, I have finished, so thank you for listening. If you have any questions, I don't know if we still have time. Okay. Otherwise, reach me on the Telegram group of the conference, or you can find me around the conference. Also, happy EuroPython. Thank you, Michele, very interesting talk. So we have time for one question. Hi, thank you for your presentation. My question is: we can enable eager execution on a Keras model, to be sure it will be executed eagerly. But what is the benefit if we are not using anything from TensorFlow directly, just the pure Keras API? If you use pure Keras, I don't think you need it. There are some applications, though.
If, for instance, you have a layer that is constructed in an imperative way, which is usually done with subclassing and things like that. Personally, I've never had to use it, but if you do, it's nice to know that you can force Keras to behave eagerly. Okay, thank you. Maybe one more very quick question. Anybody? Okay, so thank you again, Michele. Next we have May Jiménez, talking about TensorFlow estimators, so give her a big hand. Hi, everyone. Thank you for being here; I'm very happy. If you were in the previous talk, I'm going to be the old lady who talks about TensorFlow 1.x, but this will still be applicable to TensorFlow 2. I keep saying "Python" because it feels like talking about Python 2 when we should be moving to Python 3. So, yeah, nevertheless. I'm May Jiménez, I'm a PhD student, so we are going to talk about pain; as a PhD student, I know pretty much everything about pain and how to deal with it. Actually, I have been a student for years, so I have been dealing with pain for a long time. So: you decided to go with TensorFlow instead of using Keras or PyTorch or Torch; that means you have been through hell, and you decide, hey, I'm going to build this model. TensorFlow is an amazing library because you have tons of ways to do things, but this is also its curse: when everything is possible, the most probable outcome will be chaos. If you have 100 developers and 20 different ways to do things, you get 2,000 possible outcomes. So I did what every responsible developer does: I went to GitHub to see how the people with popular models do things. And I found that, oh, okay, some people use a train function and then have a class with all the parameters, and some people build the graph within the class.
Some people build the graph in functions, like train. Other people do completely different things, like a function where they define the layers, then loop over the layers and create the graph inside this create-net function. So how should I do it? What's the correct way? I need a correct way; there should be a correct way. I'm a little bit OCD, so I want a correct way. And if you think about it, a model is just a thing that takes data, trains itself, evaluates itself, and when the model is happy, says, hey, I'm happy, let's predict. So it's a thing that takes training data, trains for a while, then evaluates, and then, at inference time, takes a new sample and predicts an outcome. It's not that hard. There should be a structure where, no matter what your model is, when I go back to my code I can see: okay, this function or this class is the training, this function is the model, and so on. And estimators came to help. Estimators were released in 1.3 and became mature in 1.8. They are going to stay in 2.0, but things in the custom estimators are going to change a little bit. Nevertheless, since we are talking about good practices, this stays relevant, because we still want to be good developers. We are machine learning practitioners, but we are still writing code; we are not writing poems. And this is the API stack of TensorFlow. At the top of the food chain, we have the premade estimators. That's what developers still use all the time, because it's robust and very well tested, so we know it's correct. Then, if you want to go lower, you have the Estimator and the Keras model; we have been hearing a lot about Keras models, and if you want more of that, go back to the previous talk.
And then, let's say, oh, I actually want my estimator to be completely, wildly different. So I have layers, which in TensorFlow 2 live in Keras layers, but either way we have some way to define layers. And then we have the Python API, the C++ API, and then all the things that talk to the hardware. We don't want to talk to the hardware. Okay. So what is an estimator? An estimator is a complete model; it's the pink thing I showed before. Estimators give us modes for train, for predict, and for evaluate. I'm going to repeat this structure a lot, because it's exactly what we want in a machine learning system. And these are the functions we need: instantiate, train, evaluate, predict, and save the model, because, of course, you don't want to train your model every single time. I just waited one week for my model to train, and now I want to use it, so I'm going to wait another week? That doesn't make sense. First thing: let's use premade, or canned, estimators. These used to be called canned; it's like going to the shop and buying a nice tomato soup, ready to eat. Canned estimators are commonly used architectures. That means, if you want a linear classifier or a linear regressor, it doesn't make sense to write a linear regressor from scratch every single time; we all know how this thing works. And each one represents a whole model. What kinds of canned estimators do we have? We have baseline classifiers: you build your own lovely model, and you want to test it against a baseline, so you have that. Each comes in classifier and regressor versions; I'll say classifier for simplicity, but we have the baseline classifier, the linear classifier, and the deep neural network classifier, a neural network with layers. Okay. We have decided that canned estimators are our thing; we want to use them because we are not going to do anything very fancy. How are we going to do this?
First we create an input function, define the model's feature columns, instantiate the estimator, and then call the methods: train, evaluate, and so on. Every time. So let's start by creating the input data. I will talk later about datasets, but we need an efficient way to feed data to our model; TensorFlow estimators are built to run in parallel, so we want to feed them properly. Here, in two lines of code, we have our dataset ready to use. Okay, we have the data. Now we have to tell the model what kind of data this is: is this input a string? A float? We simply say: this feature column is actually a numeric column. So I'm going through all the train keys and telling my model that all these columns are numeric columns. Okay. I said where the data is, I said what kind of data it is, and now I need to instantiate the estimator. All estimators subclass the base Estimator class, and this one is the deep neural network classifier, DNNClassifier. Here it's a deep neural network with two hidden layers, each with 10 neurons, and the output is three classes, so it has three neurons at the output. With one line of code I built a deep neural network. How amazing is that? We all have to agree this is pretty good. Then we use it: for training, I just call train and pass the correct dataset. This is the iris dataset. I don't know anything about botany; if you give me a plant, I will kill it for sure. But if you give me an iris, I will be able to tell you exactly what kind of iris it is. It's a classical machine learning problem where you need to decide which type of iris a flower is. So I train my model with inputs from the training dataset, and I tell it how many steps I want to train. And now I'm going to evaluate.
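The flow described so far can be sketched end to end. This is a minimal sketch of the TF 1.x-era `tf.estimator` API, not the speaker's exact code: the feature names and the tiny synthetic stand-in for the iris data are my own.

```python
# Sketch of the canned-estimator workflow: input function, feature columns,
# DNNClassifier, then train / evaluate / predict. Synthetic iris-like data.
import numpy as np
import tensorflow as tf

def input_fn(features, labels, training=True, batch_size=16):
    # Two lines of code: a dataset from (dict-of-features, labels).
    ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    if training:
        ds = ds.shuffle(100).repeat()
    return ds.batch(batch_size)

# Stand-in data: 4 numeric measurements, 3 classes, 60 samples.
features = {k: np.random.rand(60).astype("float32")
            for k in ["sepal_length", "sepal_width",
                      "petal_length", "petal_width"]}
labels = np.random.randint(0, 3, 60)

# Tell the model what kind of data each column is: all numeric here.
feature_columns = [tf.feature_column.numeric_column(k) for k in features]

# One call builds a deep neural network: two hidden layers of 10 neurons,
# three output classes.
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[10, 10], n_classes=3)

classifier.train(input_fn=lambda: input_fn(features, labels), steps=20)
metrics = classifier.evaluate(
    input_fn=lambda: input_fn(features, labels, training=False), steps=1)
predictions = list(classifier.predict(
    input_fn=lambda: input_fn(features, labels, training=False)))
```

The estimator handles sessions and threading behind these three calls; the `steps` arguments here are arbitrary small values for illustration.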
I want to make sure the model is correctly trained, so I use another dataset, the test or development set, and see how good it is. And when I'm happy with my model, I go to inference. These are the three classes of iris we have, setosa, versicolor, and virginica, and we measure the petal and the sepal length. I don't really know what a petal or a sepal is, but I suspect most of you who do machine learning know this dataset. Then the only thing I need to do is call predict, and I get the prediction. Since I know the expected value, I can say whether it's correct or not. All of this runs in parallel for us. Estimators handle sessions, they handle threads; we don't need to worry about running queue runners and stopping and waiting for stop iterations. If you haven't done this, you don't know how painful it used to be: you need to wait until all the threads are dead, but not really, there's a thread that is still alive, so forget about it. TensorFlow estimators solve that for you, and that's amazing. However, what if you say, okay, I don't want to use a canned estimator? Maybe your problem is not a linear regression problem. Maybe you want to plug something in; you actually want to, say, connect a neuron from the input to the output because you have attention, that kind of thing. Maybe you want to be creative, and probably you do want to be creative. So what are custom estimators? Custom estimators allow us to be as flexible as we want on top of good practices. We take all the things built into the Estimator and the canned estimators and bring them here. So when I go back to my code six months later, I may not remember where the custom estimator is, where the training is, where the model is, or what kind of layers it has.
I know it will always follow the same structure, and that's good. The only thing we need to do is instantiate tf.estimator.Estimator and write a model function. That's easy. If you want to use a custom estimator, this is the recipe, and as you can see, all the steps are the same. I'm not going to talk about how to create the input function or define the feature columns, because that's exactly the same as before. So you can create your canned estimator, test your whole pipeline, and then say, actually, I'm going to do my own, and create your custom estimator. Creating a custom estimator requires you to write a callback function, the model function, which receives the parameters and your run configuration. And the run config is tricky, because I've used it a lot: you can say, I want all of the GPU, I don't want to share it. You can set that in the run config parameters. Just make sure that everyone in your lab who shares that GPU agrees, or you're going to have a difficult conversation to handle. Okay. In this callback function you basically write layers: you are writing the model, stacking layers one after the other. As I was saying, some of this comes from contrib, and never trust contrib; it's getting more stable now, but at the moment only the contrib layers and sequence parts are stable. I'm passing in the features and the parameters. The problem here is sentiment analysis given an input sentence, so my feature is a sentence, and the parameters are the vocabulary size, the embedding size, and the sentence size. Instead of writing a crazy lookup function, I use an embedding layer that is properly tested for me. Then I use a convolutional neural network; again, instead of writing a convolution myself and then figuring out whether I have memory leaks, I use layers. After that I have a max pool layer, a hidden layer, and the output.
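The custom-estimator shape just described can be sketched as follows. This is my own illustrative model function, not the speaker's sentiment-analysis one: a trivial dense model, with `n_classes` as an assumed parameter name, showing how the one callback handles all three modes via `EstimatorSpec`.

```python
# Sketch of a custom estimator: the model_fn callback receives features,
# labels, mode, and params, and returns an EstimatorSpec per mode.
import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Stack layers one after the other (v1-style layers for illustration).
    net = tf.compat.v1.layers.dense(features["x"], 10, activation=tf.nn.relu)
    logits = tf.compat.v1.layers.dense(net, params["n_classes"])

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={"logits": logits})

    loss = tf.compat.v1.losses.sparse_softmax_cross_entropy(labels, logits)

    if mode == tf.estimator.ModeKeys.EVAL:
        acc = tf.compat.v1.metrics.accuracy(labels, tf.argmax(logits, 1))
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, eval_metric_ops={"accuracy": acc})

    # TRAIN mode: pick whatever optimizer fits you.
    opt = tf.compat.v1.train.AdamOptimizer()
    train_op = opt.minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, params={"n_classes": 3})

def input_fn():
    # Tiny synthetic batch, purely for illustration.
    x = np.random.rand(32, 4).astype("float32")
    y = np.random.randint(0, 3, 32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).batch(8)

estimator.train(input_fn, steps=2)
results = estimator.evaluate(input_fn, steps=1)
```

Whatever your layers are, the mode handling is the fixed skeleton you come back to six months later.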
This spans a couple of slides, but we can agree it's pretty easy, pretty readable. We always know: okay, I have a model function, this is the callback, I will always come back here. So we have all the layers prepared, and now we say how we want to train. Here I use one particular optimizer, but you can use whatever optimizer fits you; again, the Keras optimizers are there for you, and if you're still in 1.x or want to do something fancy, write your own training. And here I'm writing directly to the summary: no more variables that you need to make global and then remember to write, you just write to the summary, which is amazing. The callback receives the mode. If the mode is train, you return an EstimatorSpec with the training op. If it's evaluate, you compute the accuracy of the model and return another EstimatorSpec with the output of the evaluation. So, in summary, estimators are a good idea because they let us organize our code well. It's easier to debug, easier to maintain, better for everyone, for your mental health. So use them. If we all agree that estimators are good, and they are (if you don't agree, we can discuss it later), the next thing to talk about is datasets. Say you have built this amazing model and you are using feed_dict to feed it data. How inefficient is that? If you have this model that runs in parallel and allows multiple runs and crazy things, don't use a dictionary to feed data. It's rude. It's not polite. So we are going to use a dataset. To use a dataset, you import the data, you manipulate the data, you create an iterator, and then you consume the data from the iterator. You don't need to have all your data ready when you start training your model; you just need an iterator. That's pretty amazing. So, how to import data?
You can import data directly from a generator, so you can delegate the creation of data to a generator. You can use tensors or tensor slices, different flavors of tensors, or TFRecords. And then CSV files; I almost didn't put that here because it's not very efficient, but nevertheless it's an option, and if you want, you can use it. Okay, so how do we create a dataset? I have this example generator function; it's a generator, and it will be generating inputs. I need to tell TensorFlow: hey, this input has these types and this shape, so that the graph can be created properly. Okay, I have my data ready to use, but I actually want to transform it. And of course, if I have this super fast thing generating data, I don't want to apply an ordinary function; I want to apply a map, which will apply a function to all the elements. I want everything functional, I want to be efficient. So I can apply a function to all the elements, I can shuffle my data, and I can repeat, so once I've seen my dataset once, I can see it again for another iteration. And I can set the size of the batch: how many examples do I need to see in order to update the parameters of my TensorFlow model? How to do it? This simple line. How amazing is that? You can concatenate, or chain, functions: I say, hey, shuffle my data, and when you finish, repeat it, and the batch size is this batch size. With this one line I create a dataset that is efficient for my TensorFlow model and can be parallelized. Amazing. Great, I have my data ready; I just need to create an iterator. We have two options here: a one-shot iterator, where I run through all the data once, or an initializable iterator. If you want to do crazier, more custom things, use the initializable one; normal users will use the one-shot iterator. So I say: hey, dataset, please give me an iterator, and it gives me an iterator. So how do we consume data?
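The pipeline built up in this section can be sketched in a few lines. The generator here is a made-up example, and I use eager iteration, which in TensorFlow 2 replaces the explicit one-shot iterator of 1.x.

```python
# A tf.data pipeline: generator source, then chained map / shuffle /
# repeat / batch, then consumption one batch at a time.
import tensorflow as tf

def gen():
    for i in range(10):
        yield i

# Tell TensorFlow the types and shapes so the graph can be built properly.
dataset = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(), dtype=tf.int64))

batch_size = 4
dataset = (dataset
           .map(lambda x: x * 2)   # apply a function to all elements
           .shuffle(10)            # shuffle the data
           .repeat()               # see the dataset again each epoch
           .batch(batch_size))     # examples per parameter update

# Consume: each step yields one batch (the one-shot-iterator equivalent).
first_batch = next(iter(dataset))
```

In 1.x the last two lines would be `iterator = dataset.make_one_shot_iterator()` followed by `iterator.get_next()`.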
Simply call get_next, and get_next returns a batch of sample data that I can feed to the TensorFlow model. Great. So we have finished this part of the journey. We have seen that estimators are a good, practical way to handle our models, and that datasets are the proper way to feed data to the model. We want all these things in place in order to build a proper model, so that when you come back to your code, you know exactly where you are. Finally, let's say you have heard about GloVe or word2vec, or VGG, or, this is a very old one, AlexNet. (I'm going to have a lot of time for questions because I was super fast.) So you have heard about all these models, or even BERT, and you say: should I write a 26-layer model with attention myself? That's going to be very, very tricky. Don't do it. Use TensorFlow Hub. TensorFlow Hub brings the ideas from continuous integration and version control to machine learning. It's actually something we should have been doing and haven't, but we are solving it. TensorFlow Hub is a place where we can find modules that are reusable, stackable, and trainable, and you can use them in your own pipelines. Let's say I use a very old one, a neural network language model. How do I do that? I say I want an embedding column, and this is the module specification: this URL is the URL of the module, and the last number is the version. So if the people who created this module update it and I want to use the next version, it's as easy as changing that number. Then I feed that to my estimator, a canned estimator; I don't even need to do anything crazy. Just two lines of code and I have state-of-the-art, tested TensorFlow modules. So we should be using it. In an ideal world, I expect that people will publish their code with their paper, and I can use it.
Because if you write science and I cannot replicate it, it's not science; it's cooking, and I can cook myself. Thank you very much. Sorry. Thanks. It was nice; now I can finish with a lot of energy. So, the bits of knowledge I want you to take away. Estimators are a good idea: please, please, please, when you come back to your code, you shouldn't spend more than five minutes figuring out where your model is, where the training is. You should know where everything is. Datasets allow for building high-performance, complex pipelines: if you build this amazing TensorFlow estimator pipeline, you don't want to fuel it with old, slow data, you want to feed it fast. And finally, you have state-of-the-art trained modules to use in TensorFlow Hub. You can also publish there, so if you have invented a really, really nice thing, please share it; the community will be very, very thankful. And now, yes, thank you very much. Thank you. So we have time for a few questions. Yeah, sorry, I was super fast, so I hope you got everything. Anybody? Don't be shy. Yeah. So thanks for a really good talk, that was very interesting, and I will go home and rewrite all of my estimators and see everything that's out there in TF Hub. I was just wondering, do you have any best practices for how to track your model training performance across models? Do you have a local database where you store how your models have been training for the different architectures, or do you have any solution for that? Yeah, that's an excellent question. There is no standard way to do it, so you need to figure it out on your own, but I can show you how I do it. I write TensorFlow summaries, and every time I run, when you create a summary, you can tell it where you want to save it.
So I save it with a timestamp, and then I say this timestamp belongs to this model; I save, say, the layers of the model, so I remember what model I was training there, together with the summary. We haven't talked about TensorBoard, but TensorBoard is an amazing tool. If you're training different architectures with different parameters, I keep all the runs in the same place, and when you run TensorBoard, it will load all your architectures. Be careful, because sometimes you let it run inside for loops, and then it's like, oh damn, I have 200 architectures and I need to figure out which one is which. But yeah, I didn't find a proper way to do it; this is my approach. Thank you for your question. Okay. Anybody else? Come on, we have five minutes. Sorry. Okay, then thank you for the amazing talk. And our next speaker will tell us today how to secure containers for running machine learning models. Please put your hands together for Thomas. Hello, everyone. It's really nice to be here today. So, this is about machine learning; let me introduce myself a little bit first. I recently graduated from my bachelor's degree in computer science at Delft University, and I'm currently doing my master's thesis in data science at Delft. I work as a machine learning engineer at ING bank. ING is a Dutch bank, but we operate globally. This is my very first time standing here giving a talk, so it's quite nerve-wracking, but it's awesome to see all of you here. So, what I'll be talking about today: I'd like to give some context on what it really means to run machine learning in production these days, some concerns about machine learning in production, and maybe some concerns about machine learning in general. Then I'd like to take you on a journey from a very simple model: we'll gradually use Docker to containerize this model and then finally make the image distroless.
So, machine learning in production. I think many of you have at some point made a machine learning model. Maybe some of you have gone the length of wrapping it in an API and exposing that API, so you've got a service running, and whenever you send a request to the API, your model makes a prediction. This is awesome. But at large organizations this becomes more of an issue: if you've got many teams and many models all running at the same time, how can you really manage this? We've got tens, maybe hundreds, of models all running at the same time, each model has its own name, and you want some uniformity here. The way we solve this is by making a kind of platform, a platform where data scientists can send their model, either some Python code or some Python code along with pickled parameters. This goes down a specialized pipeline, and out of that pipeline rolls a Docker image we can run on top of the platform, and through some service discovery we can reach the right model. Now, each of these models really should run in its own environment, and this is an excellent use case for containers: maybe some models are made with TensorFlow, some models are made with scikit-learn, so containers are an excellent solution. Though there are some concerns about machine learning, namely that machine learning models tend to handle quite sensitive data. Some of the features we use in our models might identify people, and this is very, very concerning. So we really want to be more aware: where are we actually running our models? Are the containers we're using actually safe to run? And as much as we try to make data anonymous, that is extremely difficult. On top of that, a machine learning model itself can contain sensitive information. Think of parameters; think of some word2vec model that has a dictionary mapping, maybe, a name to some feature, which is not quite desirable.
So what I really want to talk about today is how we can make sure that at least the environment in which we run these models is a bit more secure, instead of just taking a random Docker image. So, our little model. I'll be using scikit-learn, Flask, and of course Docker to take you on a journey: a journey in which we have a simple model, slowly build it further, and eventually land in a distroless Docker solution. This is our little model. It's a random forest classifier on the iris dataset. Extremely exciting. If you don't know what a random forest classifier is, don't worry; think of it as a simple machine learning model. If you don't know what the iris dataset is, just think of a simple dataset. Now, we use Flask to expose this. Obviously, in a production environment you wouldn't really want to do it like this: you would do some validation, you would perhaps come up with a schema, you would use data frames as a way to communicate, you would use some libraries. But I want to keep things simple, so we do this instead. It's simply an API exposing a /predict endpoint to which we can post an array, and with this array we make a prediction. It works; well, at least when I tried it, it worked. If I use curl, I get back the prediction of our model. So let's dockerize this. We start by picking our base image, in this case the Python 3 base image. We copy in our requirements, in this case scikit-learn and Flask, run pip install on those requirements, then copy in the files we need and do the exact same thing as before. If we run this, you'll also notice a -p flag. This is because Docker doesn't expose any ports by default; you need to explicitly tell Docker to do so. In this case, we map port 5000 inside the container to port 5000 outside the container. Here you go. I could have just shown a copy of the slides, but this is really what happened.
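A Dockerfile along the lines of the one described might look like this. The file names (`app.py`, `requirements.txt`) and the exact Flask invocation are my assumptions, not taken from the slides.

```dockerfile
# Start from the full Python 3 base image.
FROM python:3

# Copy in and install the requirements (scikit-learn, flask).
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the application and run it the same way as before.
COPY app.py .
ENV FLASK_APP=app.py
CMD ["flask", "run", "--host=0.0.0.0"]
```

Built with `docker build -t iris-api .` and run with `docker run -p 5000:5000 iris-api`, the `-p` flag doing the port mapping mentioned above.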
So now I'd like to talk a bit about how we can actually say anything about this image. How can we scan this image, analyze it for security vulnerabilities? There are many, many tools available for this. There are tools that do dynamic analysis, looking at your running containers and verifying whether anything is going wrong, and there is static analysis: before we even run the image, we can analyze the file system inside it, because in the end an image is simply a zip file. We use Clair. Clair is a way to perform static analysis, and I'd like to reiterate that this is not the only way to do it; there are many ways you can do this analysis, I just really like using Clair for this specific example. There's a nice integration called clair-scanner, which basically allows you to run Clair in the background and then run one nice little command where we simply specify which image we'd like to scan. Now, this is the result: we see vulnerabilities, sorted by severity, for the Python 3 image we used earlier plus what we put on top. This doesn't necessarily mean that the Docker container we're using is vulnerable or unsafe to use. However, as a larger organization, with compliance and everything, you don't want this; you don't want to have to explain this. So instead, what we can do is reduce our image: we can strip down the Python image we were using originally and take only the things we really need. Which leads me to some other not-so-nice things about this image: the size is quite large, 1.1 gigabytes, and someone can attach a shell to this Docker container and execute commands inside it. This is difficult to prevent, but you really never want it. So, distroless, right?
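The scan step above looks roughly like the following. The flags are from clair-scanner's documentation as I remember it, and the IP placeholder and image name are mine; check `clair-scanner --help` for your version.

```shell
# With Clair running in the background, point clair-scanner at the image
# to analyze; it diffs the image's packages against Clair's CVE database.
clair-scanner --ip <your-host-ip> -r report.json iris-api:latest
```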
Distroless images are images that try to have the most minimal set of absolutely needed applications, services, and dependencies such that your application can run. This is a quote I took from Google's distroless container repository, and there they also mention that it has no shells, no package managers, et cetera. So I'd like to go further: take the model we had earlier and now use a distroless image. In this case we use the Google-supplied distroless image. This is not necessarily the image to use; most of the time you actually want to make your very own distroless image, but for these slides I prefer to show a quick example. So here we use the Python 3 distroless image, and when we run pip install, we notice that pip is not found. Because there's no package manager, it's suddenly more difficult to get our dependencies in there. Thankfully, Docker has a very nice way to solve this issue: multi-stage builds. A multi-stage build is simply where we take one image, do all our work in there, then take another image and copy over all the files we need from the first into the new one. It looks a little something like this. We take a Python version again, in this case Python 3.5 to be more specific. We copy in the requirements again, and now we can run pip install because pip is installed in our bigger base image. Then we start again from the distroless image, but instead of running pip install, we copy the files from the other image into our new distroless image. Because these distroless images are so small, we might run into some configuration issues. Here you also see that we have to set a flag for UTF-8, otherwise Flask won't run. Not so nice.
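A multi-stage build along these lines could be sketched as follows. The exact paths, the `--target` trick for relocating the pip-installed packages, and the environment variables are my assumptions about one workable setup, not the speaker's Dockerfile.

```dockerfile
# Stage 1: a full Python image where pip is available.
FROM python:3.7 AS build
COPY requirements.txt .
# Install into a directory we can copy wholesale into the next stage.
RUN pip install --target=/app/deps -r requirements.txt
COPY app.py /app/

# Stage 2: the distroless image has no pip, so copy files over instead.
FROM gcr.io/distroless/python3
COPY --from=build /app /app
ENV PYTHONPATH=/app/deps
# The UTF-8 flag mentioned in the talk; Flask refuses to start without it.
ENV LANG=C.UTF-8
WORKDIR /app
CMD ["app.py"]
```

The distroless Python image's entrypoint is the interpreter itself, which is why `CMD` is just the script name here.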
But if we scan this now, we get this, which is what happens when you use pyplot and don't plot any data: an empty chart, because there are no vulnerabilities. That's not to say this image is super safe and you should totally use it, but it's nice, because if Clair did find one vulnerability, the noise would be gone; the noise turns into a signal, and we can use that signal to see what's wrong with the image. Furthermore, the image size has been reduced quite a bit: we have gone from 1.1 gigabytes to 250 megabytes, which is a significant reduction. Manipulating this container has also become a bit more difficult, because we no longer have all those nice helper commands inside our Docker image. As you can see, we can still attach a shell, but ls is not found. But we can do a bit better, and this is a bit experimental: we could make an even smaller image, because Python modules themselves could contain vulnerabilities. What if, instead of only reducing the set of Linux dependencies, we also reduce the Python modules themselves? For this we can use PyInstaller, for example. This is a bit experimental, like I said, because PyInstaller is a bit difficult to work with in a production environment. It also generates executable files, which might trip some security scanners: the pattern of PyInstaller-generated binaries flags some security tools, because it has been used maliciously a lot, sadly. Coming back to the code, a small change here: I didn't really like the flask run bit, so I just created a main method. It's also important that we upgrade pip and setuptools, for PyInstaller at least, otherwise we run into issues with the existing dependencies in the Python 3 container. Now, if we run this, we suddenly hit the problem that PyInstaller doesn't always spot all dependencies; here you see that Cython BLAS is not found.
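The PyInstaller step just described is roughly the following; the exact flags beyond `--onefile` and the missing-module name are illustrative (scikit-learn's `sklearn.utils._cython_blas` is a commonly reported missing import), not a transcript of the speaker's build.

```shell
# Upgrade pip and setuptools first, as the talk notes, or the existing
# dependencies in the python:3 image cause trouble for PyInstaller.
pip install --upgrade pip setuptools pyinstaller

# Bundle the app and its Python modules into a single executable.
# On the first run, PyInstaller may report modules it failed to detect,
# e.g. sklearn.utils._cython_blas.
pyinstaller --onefile app.py
```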
Now, this is a bit difficult, because we can't keep going on forever specifying all the missing dependencies. In this case, though, I did do that. We can help PyInstaller by giving it a spec file; here you can see that we help PyInstaller find some missing dependencies, and in this case we've got all of them. I went through the process of running it five times, and every time it told me something was missing, so here you can see that I specified some more files. And now we run it, and all works great, and suddenly the image has shrunk even more, to 97 megabytes. We could keep going on and on, because we could also strip down the distroless image itself a bit more, packaging only Python inside our PyInstaller bundle, but this becomes quite a complex process. For that, there's the so-called scratch image, where you completely build your own distroless image from the very start. So lastly, some Docker tips. I've been running my containers as root. This is not smart; don't run your containers as root, please don't do that. That's perhaps the most important thing, before even considering going distroless: don't ever run as root. Also, use image hashes instead of tags: use the SHA digest instead of python:3, the hash you can find in your container registry. And don't use the existing distroless images; build your own distroless images. Building your own is quite a hassle, but that's a talk on its own. Also, if you're a larger organization, you might want to sign your Docker images and validate who made them, so you can verify that these images were actually made by someone within your organization. So, to summarize: be very careful about which images you choose for your models. You don't want to use just any image for just any container; you want to be a bit more careful about this selection.
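Two of those tips can be sketched in Dockerfile form. The digest is a deliberate placeholder (take the real one from your registry; filling in a real digest here would be made up), and the user name is arbitrary.

```dockerfile
# Pin the base image by digest, not by a mutable tag like python:3.
FROM python@sha256:<digest-from-your-registry>

# Create and switch to an unprivileged user; never run as root.
RUN useradd --create-home appuser
USER appuser
```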
You might want to make your own images to ensure that this is under control. And by using distroless images, we limit the surface on which we can have vulnerabilities. So thanks so much. Here are some of the tools I used. That's it. Fantastic, thank you so much. Do we have any questions? Hi, thank you for your presentation. I wanted to ask: what if it's really hard to strip out our dependencies? Because if we include NumPy, then we have to have BLAS installed for it to work faster. Can we still do something to strip it all down? So, yeah, you could entirely build your own distroless images from the ground up, and this is quite difficult because someone has to manage it, right? If you're in a team, someone has to take the responsibility of keeping that image up to date. In the end, an image is just a massive zip file with an entire file system, so it's definitely possible to make very minimal machine learning distroless images specialized for, say, TensorFlow or NumPy. But again, it's a lot of effort. It's definitely possible, but with the side effect that you need someone to manage it. Okay, thank you. Hello, thanks for the presentation. One question: you suggested using image hashes instead of tags. Why is that? So, right now, if I take Python version 3 from Docker Hub, I wouldn't know which Docker image I'm actually using. It's for reproducibility. Someone might push a new image to that version name, so maybe it works once, but then someone pushes a new version, I run it again, and it doesn't work anymore because something changed in between. If you use the hash, you point only to the specific build of the image. Thanks for the talk, very interesting. What do you think about Alpine-based Python images? So, Alpine images are very small images, very nice.
They actually get pretty close to what the distroless GCR images offer. I also ran some scans on them; they contain very few vulnerabilities. I wanted to show some of that in the slides too. Alpine images are really nice. They are not distroless, but they are very small, so I guess if you cannot use distroless, perhaps use Alpine images, or smaller images like python:3-slim, which are just reduced images. But yeah, I do like Alpine images and I use them personally. You can definitely use them, but they're not truly distroless, in the sense that you don't completely strip out everything that isn't needed: inside Alpine images there is still a shell, there are still all these things you need for a functioning operating system. And they have vulnerabilities too, of course, like the recent one with shadow, where you could become root inside your local container. Any more questions? Okay. I wanted to ask: when you mentioned building it from scratch, how different is it to build from scratch versus from a distroless image? What more does the distroless image have? So, the distroless image has a lot of nice things, like certificates, users, privileges, a lot of very nice features, and if you had to get them yourselves, there's a whole list of things you would want to copy into your from-scratch image. But nothing stops you from making a scratch image and really carefully looking through what you need: do you need users? No? Then leave them out, and slowly build up this list. If you use a scratch image, you can still use a multi-stage build, where you build all the things you need and copy them into your scratch image. But like I said, you need someone to really maintain this; it's a large job. Thank you for the talk. One more question, if you don't mind.
So, I think we understand that it takes a bit of additional effort to slim down your image and also to secure it. But in an enterprise environment, what's your perception, what was your experience, of the right balance between taking that additional effort and stopping somewhere? Would you take it all the way to step three, or stop at step two? Where would you stop? So I think the right balance, for a larger organization, is for sure to have their own managed Docker containers, their own base images, or distroless images, whatever you want to call them, as they have the resources to do so. For compliance and all the other things that come with large organizations, you definitely want at least these images to be controlled; well, not the PyInstaller part, because that's kind of experimental. And you would use your own scratch images, because you need to know what goes into an image and be very strict about what is and isn't allowed in it. So yeah, for larger organizations, that would be my take. Okay, thank you. Do we have any more questions? Yes, we do. I'm just curious what kind of use cases you're deploying these machine learning solutions for. ING is a bank, right? Can you tell me a bit about that? I presume it's hosting some kind of API that is called. Yeah, so we have a machine learning platform, and on this platform you can say: okay, I want this model, I do this prediction with this model, and we could be running tens or a hundred models at the same time. But the use cases for these models vary extremely; for example, we could be looking at some natural language processing. For a bank, they are all over the place. I could go into very specific details, but then maybe we should talk after. Yeah, okay. Thanks. Any more questions?
Okay, then let's give another round of applause for Thomas. And our next speaker will tell us about extracting tables from PDFs. Please welcome him.
Hello, everyone, and thanks for joining. My name is Dimitri Nidenov, and I'm a freelance Python developer from Sofia, Bulgaria. Today I'm going to talk to you about extracting tabular data from PDFs, the problems I faced, and the solutions I found. So, let's start with a quick overview of what this talk will be about. First, we'll have a brief history of the PDF, the Portable Document Format, and its internal structure, specifically how tabular data is represented and why it's hard to actually extract such data. Then on to Camelot and Excalibur, the main focus of this talk. I will list the features those libraries make available and why it's so easy to use them to extract tabular data and get control over the extraction process as well. Then there is time for a quick demonstration, in which I'll show you how to use the Camelot API and how you can tweak the extraction process to suit your needs. And at the end, we'll have some Q&A and also a look at possible improvements that can be made in Camelot and Excalibur. So, let's get started with the Portable Document Format. Almost 30 years ago, if not more, John Warnock, one of the founders of Adobe Systems, started something unofficially called the Camelot project, and described its goals in a manifesto-style document, six pages long. Here you can see a few excerpts from that document.
The goal was to create a universal document format which is easy to exchange between different systems, environments and OSes, and each PDF can contain rich content, annotations, attachments, fonts, and all sorts of different things needed to represent that PDF the same way regardless of which machine or OS you're looking at it on, and, most importantly, to print it the same way the author intended. And this here is from an article from Adobe called The Evolution of the Digital Document, celebrating Acrobat's 25th anniversary. So, a few quick facts about PDF. It was created in the early 1990s; it actually predates the World Wide Web and the HTML format. It was a proprietary format initially, but later, in 2008, it was released as an open standard by the International Organization for Standardization. It's based on a subset of Adobe PostScript, which is a page description language; a subset, because PostScript itself is quite broad and is practically a programming language, although it doesn't look like one. And it was designed to be self-contained, so that each PDF contains everything needed to render it on various different systems. In order to do that, it uses font embedding, attachments, annotations, and various other things. There have been many versions released so far; since 2008, as I said, with version 1.7, it's an open standard. And it's structured as a hierarchy of objects. There is the document catalog, which contains each page, and within each page you then have different types of content, which are also hierarchically structured; those objects can be words, paragraphs, fonts, and so on. There is another view of the PDF structure, which is closer to its physical layout: it has a header, a trailer, and also cross-reference tables, which contain references to the other objects within the PDF.
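As a rough, hand-simplified sketch (real files carry exact byte offsets in the xref table, omitted here), the physical layout he describes looks something like this:

```
%PDF-1.7
1 0 obj                      % document catalog
  << /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj                      % page tree
  << /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj                      % a single page, pointing at its content stream
  << /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
...
xref                         % cross-reference table: byte offsets of all objects
trailer
  << /Root 1 0 R >>
%%EOF
```

The header is the `%PDF-1.7` line, the trailer points at the root of the object hierarchy, and the xref table is what lets a reader jump straight to any object.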
It can also contain revisions, because PDF was originally designed to be revisable, and you can save multiple revisions within the same PDF. And what else? No tables whatsoever. There is really no concept of tables in PDF. Tables are actually just absolutely positioned text boxes on the page, usually laid out in reading order, although they don't have to be. So basically, they just look like tables, but there is no information internally about whether something is a column or a row, or what relationships there are between cells. If you've ever tried to copy-paste from a PDF, you might have found that it's not so easy to do. You might be lucky sometimes, depending on how the PDF was rendered, and get one or two rows or maybe one or two columns that are easy to select, but basically you just have to select, copy, and paste one by one, probably first into Notepad, then Excel, and so on. So there has to be a better way than this, right? And indeed, there are multiple ways to do it. One of the first tools I found that works well is called Tabula. It's a venerable open-source project with quite a long history; unfortunately Java-based, but open source. There is also pdfplumber, which is Python and open source; PDFTables, which was originally free but is now proprietary; pdf-table-extract, which is unfortunately no longer maintained; and various other proprietary, free or paid online services, among which I tried one called OCR Space. And then we come to Camelot and Excalibur. I came across Camelot in search of something better which is open source, Python-based, and gives me more control over the process, because most of those tools have their drawbacks and advantages, but basically none of them works as well as I found Camelot works.
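Inside a page's content stream, a "table" really is just text painted at absolute positions. A simplified excerpt of what such a stream contains:

```
BT                      % begin text object
  /F1 10 Tf             % select font F1 at 10 points
  72 700 Td             % move to x=72, y=700 (points, origin at bottom-left)
  (Name) Tj             % paint the string
  150 0 Td              % move 150pt to the right: visually, the next "column"
  (Price) Tj
ET                      % end text object
```

Nothing here says row, column, or table; extraction tools have to infer all of that back from the coordinates.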
So, Camelot was started in 2016 at a company called SocialCops in India, by a guy called Vinayak Mehta, a great guy; I've had some chats with him. He was actually facing a problem where a lot of the open data published by the Indian government or administration was in the form of PDFs with tables in them and, you know, just take it from there. So basically, he needed something which is configurable and also developer-friendly, because he was learning how to do it on the one hand, and also using what was available, but couldn't find something that exactly fit what he wanted. So, some features of Camelot. One of the best is the excellent documentation. There are lots of examples in there, lots of different ways to overcome certain problems you might have, how to use parameters of the API to fix issues you might be facing. It's also Python-based, open source, MIT licensed, and it has two main extraction algorithms built in. One is called Lattice and the other is called Stream. Lattice is for grid-like tables where you have dividing lines between rows and columns, whereas Stream is for when you don't have those: it detects text edges based on alignment, left, right, or middle, and takes the whitespace in between into account. It works well out of the box; for most simple cases it can detect where the table is on the page without you having to do anything. And then again, it's very configurable, because you can define basically all the parameters of the extraction. You could say, for example: no, that's not a single column you've recognized here, there are actually five columns, and they're defined at those offsets.
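The Stream idea, grouping text by vertical position into rows and splitting each row at horizontal whitespace gaps, can be sketched in a few lines. This is a toy illustration of the principle only, not Camelot's actual code; the word tuples, the `gap` threshold, and the `to_table` helper are made up for the example:

```python
# Toy "stream"-style table detection: words are (x0, y, text) tuples;
# words sharing a y coordinate form a row, and columns are wherever
# the horizontal gap between neighbouring words is large.

def to_table(words, gap=30):
    # Group words into rows by their y coordinate
    rows = {}
    for x0, y, text in words:
        rows.setdefault(y, []).append((x0, text))
    table = []
    for y in sorted(rows, reverse=True):  # PDF y grows upwards, so top row first
        line = sorted(rows[y])            # left-to-right within the row
        # Split the row into cells wherever the horizontal gap is large enough
        cells, current = [], [line[0][1]]
        for (x_prev, _), (x, text) in zip(line, line[1:]):
            if x - x_prev >= gap:
                cells.append(" ".join(current))
                current = [text]
            else:
                current.append(text)
        cells.append(" ".join(current))
        table.append(cells)
    return table

words = [
    (72, 700, "Name"), (200, 700, "Price"),
    (72, 680, "Apple"), (200, 680, "1.20"),
    (72, 660, "Pear"),  (200, 660, "0.80"),
]
print(to_table(words))
# → [['Name', 'Price'], ['Apple', '1.20'], ['Pear', '0.80']]
```

Camelot's real implementation is considerably more careful (it handles left/right/middle alignment and spanning cells), but the core intuition is this grouping by coordinates.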
And you can say: strip those characters from the text, because there might be some garbage, or numeric formats with spaces and commas where you actually want to get floats out of that eventually. It also exports to various useful formats: CSV, Excel, JSON, HTML, and pandas DataFrames directly, so you can use it directly in your ETL workflow. And what I really liked about it is that it supports visual debugging and plotting using matplotlib. So you can actually see what it recognized where, and why, for example, certain things were not recognized; there are various types of plots it supports. And last but not least, it's very actively maintained and has quite a welcoming community: two people who contribute most of the code, but lots of people who are actually using it. You can judge that by seeing how many issues there are; there are lots of people trying it, finding something, and then usually finding a solution for their specific problem. So let's see how you can install it. It's actually quite easy. I'm not really using conda myself, but if you are, and probably you should be, there is a one-liner you can use to install it. If you are using pip, there are a couple of prerequisites you need to install first, which are Tk and Ghostscript. Then you can install it with pip: first pip install --upgrade pip, for various reasons, and then pip install camelot-py[cv]. That's because there are various different sub-packages, and the one you want is the one that includes OpenCV. And for Excalibur, which, I don't know if I mentioned, is the web front-end of Camelot: if you've ever used Tabula, it's kind of the equivalent of Tabula's front-end; it's a Flask-based UI, it has an API, and it uses Camelot underneath. So let's do a quick demo, hopefully it will work. I have a little notebook here. We can install it.
I already did it, actually. And then, how you use it: you just import Camelot, and then in camelot.read_pdf you specify the path to the PDF and various other parameters. By default, the flavor it uses is called lattice, the one with the grid; you can also specify stream. And this specific table looks like this, the one you saw earlier. It's a typical table, maybe more complicated than your usual table, because it has spanning cells, spanning columns, spanning rows. If you tried to do this by hand, it would be a nightmare, whereas with Camelot it just takes this one call. The thing it returns is called a table list; it's an object that is just a container of tables, and it knows how many tables it recognized. The good thing is that it also has a parsing report for each of the tables, so you can see which page the table is on, in which order it was found on the page (left to right, top to bottom), and also the accuracy of the recognition and the ratio of whitespace within the table. Then you can access each of the tables by indexing. You get a table object, which is basically a thin wrapper around a pandas DataFrame, which you can access directly by doing .df. And there it is. So this is the whole table. And as you see, I haven't specified any parameters, and yet Camelot managed to recognize where the table is and everything. Then you can export, and it supports, as I said, various formats. Usually, just from the file name, it can detect the format you want; otherwise, you can specify it with the f= argument, for example 'csv'. Yeah, live coding, never a good idea; it's not my usual keyboard. So I haven't run the whole thing, but basically this is what it outputs.
So it uses the page and the order to build the file name, because you could have multiple pages and multiple tables on one page. And the rest is plain old CSV, which you can then import and reuse however you want. It can also do JSON exports; there is the .df attribute I showed you, it's not necessary here, but, as you see, it's just plain JSON. Then again, you could of course load this back as JSON and process it, and so on. But this is kind of the best part of it, which is the plotting. I probably have to rerun that whole thing, because it's a bit messed up, but basically you can tell it to plot, and there are several different kinds of plots. One of them is the grid plot. Maybe there is no internet; sorry about that. But I can show you: the documentation is really excellent, every parameter in there is well documented, including how it affects everything. And this is, for example, Excalibur, the web front-end. Sorry, zooming issues. It looks very much like Tabula, if you haven't seen it. It's something you run locally, so there are no data privacy or GDPR issues; everything stays on your machine. You just upload a PDF, then you can specify what pages you want, and then it shows you the table, for example, like this. You can just use auto-detect, which usually works and immediately finds the tables you want. Or, if you want just a subset of that, you can resize the selection, move it around, and place it where you want. And there is this refresh step because it's architected so that you can run it on Celery as well.
So you could, you know, parallelize multiple extraction jobs; it's asynchronous by design. But I guess there are some issues with the Docker container I'm using here. Excalibur itself has some things that could be done better; it's just a bare UI that allows you to do those selections. You can also choose the flavor here, in this case stream, and you could add columns to say: okay, I just want, for example, only those two columns here. Okay, there it is. So it shows you what got extracted. You can then export it to various formats; it actually zips it up and gives you a downloadable version. And there are also the rules, which are basically JSON files containing more or less the same parameters you can pass to read_pdf. So you can define the table areas you're interested in on the page (you could have multiple table areas), you can define the columns, and you can say, for example, whether it should process the background, because there are some tables like that. Let me see if I have such an example. Yeah, for example, this is a typical gridless table, which can be processed. And let me see if I have it here. Probably, or maybe not. Anyway, it's basically trying to make the usage of Camelot a lot easier for non-technical people, though it's really easy to use otherwise. Okay, so it actually worked. This is one of the plots, the text plot: it detects all the text boxes on the page and plots them. And there is the one from before. So you could say: strip certain characters which I don't care about, and the angled brackets are gone; there was a comma here, which is gone as well. And that's basically it. So it has a lot of things you can try.
And with Excalibur especially, it's good because it's an iterative process: you try it out, see what it extracts, then go back, tweak a bit, see how it works. All those rules are then saved; you can change them and upload new ones, and then use them with Excalibur as a CLI to automate batch extraction of multiple similarly structured documents. And it shows all the jobs; each file is a job. Okay. So that was hopefully useful, and I just have one more thing, which is future improvements, and then questions. There are currently some known issues. One issue is performance when it comes to multi-page PDFs, and by multi-page I mean over 100 pages; there are issues with memory footprint sometimes, but they're being worked on. Ghostscript seems to be an issue for a lot of people, because with different OSes and different sorts of libraries it can be tricky to install, even though Ghostscript itself is a prerequisite for matplotlib anyway. There could be more tests; currently there is about 89% test coverage, though it could be improved. And, as I said, a better memory footprint. And of course, anything else you might think of as well. So that was it. I hope you found it useful. I'll be happy to answer any questions. Thank you.
Do we have any questions?
Yeah, thanks for the talk. Is there any OCR component, or does the library integrate with any OCR components?
Not yet. It's planned on the roadmap. Initially, there was a Tesseract integration, but it turned out to be problematic in terms of performance. I personally used OCRmyPDF as a step within the extraction process. It works well; it's still kind of experimental. Actually, I didn't say this, but because of that, Camelot only works on PDFs with text layers. And OCR support is planned as well. Thank you. Yeah.
I had a question about what exactly we make of the precision and accuracy values. There is an accuracy which gives a percentage; I understand that it says it's recognizing something as a table 100%. But the whitespace value, how exactly do I read that?
There is actually an issue on the repo about this, about whether it's actually useful. But what it tries to do is give you some kind of estimate of whether there was too much or too little whitespace. It depends on the table: if it's densely packed and you get a low value for that, then probably something was misrecognized. Whereas the accuracy is, if I'm not mistaken, more about correctly recognizing text boxes within the area of the table and how they overlap, and that gives you some confidence. With high accuracy values, it recognized almost all of it; with lower values, part of the table might just not have been recognized.
Yeah. I mean, I was trying to recognize tables using this, and one problem I came across was that I was using lattice all the time. And I think when we're using stream, we're supposed to give the boundary, the coordinates: we're supposed to say, okay, this is the area where it has to look for a table. Right?
Yeah. So you could actually see this: with lattice, there are three or four different types of plots you could use. One of them is a joint plot, where it plots basically every intersection of the lines, so you can see where the rows are, where the spanning columns are. And there is the line plot, which also shows you the lines it found. It's using OpenCV underneath, so it applies various filtering and thresholding and so on, and those metrics come out of there.
Okay. Fantastic. Thank you so much. Let's give Dimitar another round of applause. Thank you. Hello and welcome.
So for our next talk, please give a warm welcome to Francisco Oshman. She's going to talk to us about boosting research with machine learning.
Okay, that's getting a little bit louder now. Okay. So, like, for example, the recognition of objects in images, or the detection of events in time series. Apparently, a lot of research projects and a lot of research data sets deal with quite similar problems. And what is also quite important is that standard statistical methods fail for some of these problems. That's why researchers came to the idea of applying machine learning as a tool to tackle them. First, I would like to show you some recent applications of machine learning in research. The first one comes from CERN, the high-energy physics laboratory located in Switzerland. They produce a lot of data during collision experiments, and what they basically want to do, based on this data, is discover and characterize new particles. And since they have this huge amount of data, they need approaches other than standard methods to actually do this detection of particles. So what they did here, for example, is release a part of their data for a machine learning challenge and make it available to machine learning researchers, so they can try to find a good solution for the problem. Another example comes from the medical field; in this case it's about the prediction of epileptic seizures. Imagine a patient suffering from these epileptic seizures. The goal here is that a device is implanted in their head which can predict upcoming seizures and then also counteract them, so that patients don't suffer as much anymore. And the last example I'm showing also comes from the medical field, but in this case we're dealing with image data.
And the goal here is to do a recognition of tissue, distinguishing between healthy tissue and cancer regions within the image. Usually, medical doctors have to do that task, and the idea is that algorithms can take over this work and assist the medical doctors in the prediction or classification of these different tissues. So, as we can already see from the few examples I've shown you so far, we have two different fields of application of machine learning in research. The first one is to uncover hidden patterns in the data. For example, you have a huge data set from these collision experiments and you want to get more insight into the data. What also helps a lot here, if you're using classic machine learning, is that we have interpretable models, so you also get more information about the data set itself. The second application is the optimization of time-consuming tasks, for example this classification of tissue into cancer regions or not, so that an algorithm can take over the task instead of a medical doctor having to do it in the end. So, after showing you a few examples of current applications of machine learning, the next thing I want to do is show you the basic building blocks of machine learning pipelines, not only in research, but in general for machine learning projects. Then I will show you two specific use cases, two specific applications of machine learning in research. The first one is about the detection of arm movements based on EEG signals, and the second one is the segmentation, that is, the localization, of specific cells within an image. All the use cases I will show you are based on publicly available data sets, so this has nothing to do with the work I'm doing at ETH; it's something I did as a project in order to explore these publicly available data sets.
Okay, so now coming to the different building blocks of machine learning pipelines. Usually it starts with a data set which was recorded during an experiment, and based on the recorded data we want to make some kind of prediction. For example, coming back to the epilepsy patients: we have these time series, and we want to say whether there's an upcoming epileptic event or not. And there should be something in between which brings us from the data itself to the prediction, what I've just called here a black box; of course, this black box can be filled with more content. What happens in between is, first of all, the pre-processing of the data, which is quite important and always depends on the data you have. The second step is always the modeling, where you train a model, an algorithm, which then actually takes over the task. Luckily, Python provides a lot of different toolboxes which can be applied; I've just named a few here. Those used the most are, for example, SciPy for the processing of data, especially time series, and pandas, which is quite helpful for handling tabular data. For the modeling itself, scikit-learn is really important for classic machine learning models, and Keras is for the implementation of deep neural networks. So now we know the different building blocks we need to get from our data to our prediction, but how does that actually look when we want to implement it in Python? The implementation looks as follows. First, of course, we have to import the specific libraries we need, and also, for example, helper functions like the specific pre-processing we want to apply to our data. Then, of course, we have to load our data, and here we do a split between the data itself and the observed outcomes, that is, what the model should predict in the end.
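Put together, the skeleton she describes (import, load, pre-process, split, fit, predict, score) looks roughly like this; the synthetic data and the threshold rule that generates the outcomes are stand-ins for a real recording, not part of the talk:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "data + observed outcomes"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 200 samples, 8 channels/features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the outcome the model should predict

# Pre-processing: e.g. standardization across subjects/features
X = StandardScaler().fit_transform(X)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Modeling: fit on the training data, predict on the validation data, score it
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {accuracy:.2f}")
```

Every pipeline in the talk follows this same fit/predict/score shape; only the pre-processing and the chosen model change.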
In the second step, the pre-processing is applied to the data, meaning anything you want to do: imagine you have different subjects in a medical experiment and you want a standardization between the different subjects, so you apply some kind of normalization; it could be anything else, depending on your data. Then, and this is always quite important, you do the split of your data into a training and a validation set, because you not only want to train your algorithm, you also want to validate that you actually got a good model in the end, one that gives you a good result. The modeling part itself is then done as follows: you choose a specific model, for example, in this case, logistic regression (you could also pass specific parameters to this logistic regression), and then you basically just do a fit of the chosen model on your training data set. In the last step, you generate a prediction based on your validation set and choose some kind of score to evaluate how good your model is, for example, in this case, accuracy. So these are the different building blocks we need to come from the data to the prediction, and now I would like to show you two specific use cases. The first one is predicting arm movements based on EEG signals. Why is that important? Imagine there are people who have lost, for example, an arm and want to use an artificial arm and control its movement. What is important then is to record the brain activity connected to these arm movements and, based on the brain activity, predict and control the movement of the arm. So how do these experiments look? These experiments are done with healthy subjects, of course, because you need both the brain activity and the movement of the arm. First, these subjects get what is called an EEG cap, a cap with 32 different electrodes, which are attached to the scalp and can
measure the brain activity. At the same time, while their brain activity is measured, they're doing these arm movements: for example, grabbing something, lifting something, and releasing it again. In this case, I will show you a little bit of the data, but since we have a lot of time series, we need to do a lot of preprocessing to get more information out of our data before we can actually do the modeling itself, so I will walk you through these quite heavy preprocessing steps step by step. First, again, here is a scheme of the distribution of electrodes across the skull. This is actually a view from the top of the head, and this little triangle is the nose, and we see how the different electrodes are distributed all over the head. I will show you time recordings for a few of these channels, but not all of them, because there's just too much data to show at the same time; I will show the recordings of the four channels which are highlighted here in this plot. These time recordings look as follows: we have the recordings of these four channels as a function of time, and what is also in the data set is the different arm movements. So we see here a recording over eight seconds, we see the recordings of the four channels, and we see the different arm movements which have been done; in this case, it's six different arm movements. It's not important which arm movement it is exactly; it's just important that there was an arm movement happening, like lifting, releasing, grabbing, and so on. And as we can already see here, it's quite hard to tell from the time series alone whether there was an arm movement going on at all, or which arm movement, for example to make a distinction between different arm movements. So for that reason, several processing steps have to be done, and the first is to split the
data into different time windows, or time frames. The reason for that is that when we're talking about this activity, about these arm movements, it's not only happening at one specific point in time; there's also something before and something afterwards. So for that reason, we look at windows of, for example, one second; in this case, 500 data points are always one second, and all the further modifications are done on this specific frame. Since we're using a sliding window, we split the data into all the different frames, so the number of windows we look at, and apply the modifications to, scales with the length of the time series. And a single window looks as follows: it's just one second out of the time series, and we apply this sliding window to the whole time series. Then the first modification is to apply a low-pass filter. Why is that important? The brain operates at specific brain rhythms, and we know that above a specific frequency threshold, what the brain produces can rather be seen as noise, not really as something giving you information about the brain activity. So for that reason, we get rid of all the high-frequency parts of the signal and just stick to the lower-frequency parts. In the next modification, the power of the signal is computed, which gives you some insight into the energy of the signal, and is computed by simply squaring every data point. Then, in the last step, we just take the temporal average of each of these one-second windows, so that in the end, for every window we generated in the beginning, we get just one value per channel. And then we use this data to do the model training and fitting. So, I've shown you this quite complex preprocessing of the data, and I will continue with the modeling itself: what can a classic machine learning model look like to
actually predict the arm movement based on the data we produced in the preprocessing? What I use in this case is called a voting classifier, which is provided by scikit-learn, and the nice thing here is that we combine several weak classifiers in order to get a stronger one. Namely, in this case, I use three different classifiers: linear discriminant analysis, a random forest classifier, and logistic regression. All these classifiers are combined, and then, in the end, this combined classifier, the voting classifier, is fitted, trained, on the training data, and then a prediction is made based on the validation set (not the training set, I'm not mixing up training and validation sets). Okay, and once we've done this training of the model and then also the prediction, we of course want to know how the prediction looks, whether this classic and quite simple model gives us a good result. The results produced by this model look as follows. First, we look at the observed events, what was actually observed in the experiment, which is just the time points of the observed arm movements as a function of time: wherever there's a blue line, an arm movement actually happened, and wherever there's white space, there was no arm movement. Now I add the predicted events, what the model predicted to be an arm movement, and that looks as follows: wherever there's a dashed line on top of a blue line, we see that the model correctly predicted that there was an arm movement; wherever there's only a dashed line, the model predicted incorrectly and said there was an arm movement although there was none; and wherever there's just a solid line, the model missed an event. Of course, this is just a short time period within a longer validation set, but to get a better impression of how good the model actually is in the end, we can also look, for example, at the confusion matrix. And the information you can get out of
The information you can get out of this confusion matrix is, first, that around 70% of events were predicted correctly: we see that around 9,000 events were predicted the right way, out of around 12,000 to 13,000 events in total, so roughly 70% of events were predicted correctly. What we can also get out of this confusion matrix is that we have hardly any false alarms, because there are only around a hundred events where the model predicted an arm movement but there was actually none. So, as I've shown you for this first use case, this classic machine learning model provides, I would say, a reasonably good prediction for the quite complex task of predicting arm movements from these time series. Beyond what I've shown you here in detail, these classic machine learning models in general can also give you deeper insight into the data: for example, based on the trained model, you can say which channels are quite important for the prediction of the arm movements and which channels are not. What is also quite important is that this model has a low computational cost: the whole training ran on a single CPU and took only around 30 minutes, which is quite fast and still gives a good result compared to other methods. So that was the first use case, where I applied classic machine learning and also looked at how we can get deeper insight into the data by applying these models. Now, in the second use case, I want to focus on the automatic generation of segmentation images. For those of you who don't know what segmentation images are, I will explain in a second, but first of all I will show you the raw data, which looks as follows: these images are visualizations of brain slices, showing all the different cell types and structures we can observe in the brain.
What researchers want to know is, for example, one specific structure, for example what we can see here, which is just a specific part of the cell, and they want to have that highlighted throughout the whole stack of images. Since we are not talking about all the different structures within the image, just about one specific one, it's quite hard to do that with classic computer vision algorithms, because we just want to focus on this specific part of the cell. For that reason, in many cases this segmentation image is done by hand, which makes it quite time consuming to generate these images, and so the question is whether there's also a way to do an automatic detection of these shapes within the image. Before I show you the kind of network which can actually take over this kind of task, I will show some slides on the general implementation of neural networks in Keras, so that you get an impression of how simply neural networks can be implemented in Keras. What you see in this picture is a quite simple feed-forward neural network with an input layer, two hidden layers and an output layer, plus all the connections between those layers, since we have all-to-all connections between them. Basically, we feed an input into the model, it's processed through the whole network, and in the end a specific output is generated, for example a classification or a regression, depending on the problem and data you are using. The nice thing is that Keras allows you to implement all these different layers quite easily in Python. Basically, what you have to do to build this kind of network is: first you import the different layers you want to use, for example the input layer, or the dense layer, which is the layer that gives you these all-to-all connections; then you specify the input, and an important thing here is
that you also specify the number of neurons you want to use; then you have two hidden layers, where you also specify the number of neurons, and then the output at the end. Then you put all these different layers, which you have ordered in a sequential way, into one model, and specify the input and output which are used. As I said before, or maybe I haven't mentioned it yet: the network which can be used to generate these segmentation images is quite complex, so I will show you the general structure, but I don't want to go into detail, because it's just too much to show in this short time; I want to give you the general intuition about what the network is doing. The network looks as follows, and what we want to focus on here is that we have two different branches, a downstream branch and an upstream branch, with skip connections between them. If you want more information about this type of model, I also added the citation to the people who actually developed it, so you can read up on it if you want. What is important about these branches is that the downstream branch basically extracts the "what" information, so what is the shape of the cell we actually want to detect within an image, while the upstream branch extracts more of the "where" information, so where is that specific type of cell, or the specific region of a cell, you want to extract; and the skip connections carry this information from the downstream to the upstream branch. Now let's imagine you define this kind of model in a different file, for example, and you want to load it into your Python code and then train it. That would look as follows: you would import your model, your U-Net, load it, and then basically just do a fit of the model with your training data and then a prediction based on your test data.
So basically that's quite similar to the training of the models we've seen before with scikit-learn; it's basically the same steps you have to do. Okay, so now we've trained our U-Net, and we actually want to see what kind of prediction it gives us, what the output is in the end. At first I again show you the raw image; that is just a normal image from our test data set which we put into the model to generate the prediction. Next I show you the ground truth, so how it should actually look, what it would be if a person colored in all these different regions of the cell. And then I show you the prediction, which is, I would say, quite a good result for an algorithm detecting these different shapes and objects within an image. Of course this is just one single example, but I can tell you that the whole network in the end reached an accuracy of 98%, close to 99%, so it gives quite a good result for this kind of task, compared to how time consuming it is for a person to do it again and again. Okay, so that already brings me to my summary for the second part. What I've shown you for this second use case is that these deep learning models assist in the automation of time-consuming processes, like for example the generation of these segmentation images. What is also quite helpful with deep learning is that it can recognize patterns in complex data sets, for example the shape of the different cells or parts of the cells, and use that in the prediction. What deep learning does not offer is interpretability of the model. That means that of course we can train the model and we can look at the model weights in the end, but it's quite hard to tell which weight belongs to which prediction or to which feature, and what the information is that you can get out of your model:
in the end it's a black box, but it's working. What is also important to mention is that it's computationally quite heavy: the training for this specific case took around two hours on a single GPU, whereas on a single CPU, for comparison, it took around two to two and a half days, which is quite long. Okay, so that brings me to the wrap-up of the whole talk. What I've shown you are different applications of machine learning in research. We saw that machine learning is quite helpful and powerful in the detection of hidden patterns in research data, like for example the prediction of arm-movement events in EEG signals; on top of that, it also gives us interpretable models, which allow us further insight into the data. And then, more on the deep learning side, it gives us automation of time-consuming processes, like the generation of segmentation images. With that, first of all I would like to thank my colleagues from Scientific IT Services at ETH; this is my group, which focuses mostly on research informatics, but there are also other people doing consulting work, high-performance computing and software development for ETH researchers. Last but not least, I would like to thank you for your attention, and I will be happy to answer questions.

Thank you very much for the brilliant speech. I was interested in your opinion, because I've been reading that certain criticism has been aimed at using machine learning in scientific research due to issues with reproducibility of the models. I would like to know your opinion on that.

I think it highly depends on the use case or the specific application. For example, if we come back to the generation of segmentation images, what you could easily do is save the model itself, so save the model structure and also the model weights, and then it would be
quite easy to produce the same result again. But of course I can imagine that there are use cases where it's quite hard to get this reproducibility for the application of machine learning in research. Do you have a specific example? Okay. Anybody else?

Hi, thank you for the presentation. A short question: have you used Keras for this deep learning as well?

I used Keras, yes, for the whole thing.

And does it support GPU training too?

Say it again? GPU? Ah, yes, it does. What is quite nice about Keras is that you can run the same code on a machine with a CPU or a GPU, and it directly chooses the right computational back end, I would say. So if you run it on a GPU machine, it uses the GPU, and the training is a lot faster then.

And one more question about this image segmentation: have you used color images too, or just black and white?

In this case I just used black and white.

Okay, thank you. Anybody else?

How do you decide which method to use? For example, in the image segmentation, why did you choose a neural network, and why the U-Net?

For this specific case I used the U-Net because I was reading up a lot about it, and I would say it's the state-of-the-art method at the moment for generating these specific segmentation images. But in general, if there's a simpler algorithm or a simpler model I could use, I always start with the simpler one, and then, if it's not giving a good result, I go to the next, more complicated one.

Thank you. And thank you for the presentation. The preprocessing tasks are very domain specific, so how do you find out what you need to do to prepare the data, given that you are maybe a data scientist without the domain knowledge?

For this specific use case, I basically just read up in the textbooks on what the way to go is for processing these EEG signals. And I think if you're applying
machine learning to your own research, mostly you know what to do with your data; if you're someone coming from a different field, you have to look into what the state-of-the-art methods are for processing that kind of data.

So it's not only the machine learning knowledge, you also need to acquire the domain knowledge? Yeah, that's very important.

Okay, so that's the time, so thank you again. For the next talk we have Peter Entschev, who will be talking to us about distributed multi-GPU computing with Dask, CuPy and RAPIDS; please give a warm welcome.

Thanks. Thanks to everyone for being here, and thanks for the introduction. As he told you already, I'm Peter Entschev, I'm a software engineer at NVIDIA, and today I'm going to be talking about GPU computing with Dask, CuPy and RAPIDS. The outline of this presentation is basically the one you're seeing: I'll be talking about interoperability and flexibility of the PyData and Python ecosystem in general, about acceleration, or scaling up, with GPUs, and about distribution, or scaling out, to multiple nodes. The talk will be mostly intertwined, so there are no clear boundaries; I'll be talking about these different topics all together. Before I start introducing what we aim to achieve with RAPIDS, Dask and CuPy in this context, let's take a look at this simple example here. This is a very simple example of a typical data science pipeline: you start by loading some data or creating a data set, in this case using make_moons; then you create, for example, a DataFrame (it's not strictly necessary in this case, but it's there for example purposes); and then you do some clustering, with scikit-learn for example. But what if we want to accelerate this, and we also don't want to reinvent the wheel, so we want to keep things as simple as possible for the users? What we can do now is simply change the imports: instead of importing pandas and
scikit-learn, we import cuDF and cuML, which are part of the RAPIDS ecosystem. They provide the same API as pandas and scikit-learn provide for you; the only difference is that they run on a GPU. So what is RAPIDS? This part may sound a bit like a sales pitch, because I borrowed, not to say shamelessly copied, from other RAPIDS presentations, but RAPIDS is an open source suite for end-to-end data science pipelines. It's built on top of CUDA to leverage the full performance of GPUs; it's a unifying framework for GPU data science; and it provides pandas-like and scikit-learn-like APIs, so nobody has to learn anything new, except that you just change your imports, as I showed in the previous example. What does a regular PyData data science pipeline look like? Something like what you see here: we begin with some data preparation, say with pandas, then we do some model training, say with scikit-learn, then we do some visualization to check what we're getting, and then we iterate over it. This is the basic part that RAPIDS is tackling, and we have several libraries composing this ecosystem, such as cuDF and cuML, which I mentioned before; we also have cuGraph for graph analytics, and there are cuxfilter and Kepler.gl for visualization. They all interconnect via the Apache Arrow standard on the GPU, and they can also interconnect with deep learning frameworks, for example PyTorch, Chainer and MXNet, all through this Apache Arrow memory layout. The lesson we learned from Apache Arrow is that we don't want to always do this expensive copy and conversion of data; we want a unified memory layout so that we can get rid of all the overhead that is basically slowing down the entire pipeline. So we use Apache Arrow memory, and then we can provide zero-copy memory interoperability between the different frameworks.
This is just a fancier visualization of the pipeline I mentioned before: we start with some data, we do some data preparation, machine learning model training and data exploration, and then we get some predictions and probably deploy this result later. As I mentioned, I work at NVIDIA, but RAPIDS is not just an NVIDIA effort, it's a whole community effort, and I can cite some very important contributors here, like the scikit-learn people and the Ursa Labs people, who are also here at the conference and have been very helpful to us; Anaconda, Quansight, and other ecosystem partners such as Walmart have all been very helpful, both in development and in providing use cases we can build upon. I'll focus mostly on the machine learning part for this talk, because the time is very limited, so I cannot cover all of the DataFrame and cuGraph parts, but this is what the machine learning technology stack looks like in RAPIDS: we have CUDA at the bottom, and we use libraries already distributed with the CUDA toolkit, such as cuBLAS, cuSOLVER and cuSPARSE, to speed up the computation; then we build cuML primitives on top of them; finally we have the cuML algorithms written in C++ and CUDA, and we expose these through Cython to Python, giving us this nice scikit-learn-like API. The model is something like this: we have two ways of parallelization. The first one is model parallelism; this means we are actively rewriting this code in CUDA and C++, so we have people writing this code for cuDF, for cuML, and so on, and this is the part where the model parallelizes over the data, attempting to use GPUs to the best of their capabilities. But we can also do data parallelism, mainly for distribution: we use chunked arrays for that, and we distribute over various nodes in a cluster. For example, one of the interesting algorithms available in cuML, just as an example, is UMAP. You have probably seen
before, in the keynote, the UMAP algorithm being used for clustering of words. UMAP is basically an algorithm targeted at visualization of clusters, faster than t-SNE, but it can also be used as a regular dimensionality reduction algorithm. This is what UMAP looks like for the Fashion-MNIST dataset: on the right we see it run on the CPU, and on the left on the GPU. The clusters are very well defined in both cases, but on the CPU it takes about a hundred seconds to run, while on the GPU it takes ten and a half seconds, so just by switching to cuML you get roughly a ten times speedup. Dask: I'm sure a lot of you are familiar with Dask already, but just for the sake of completeness, Dask is a distributed compute scheduler that can scale from laptops to supercomputers. It's a great candidate to leverage distributed systems for RAPIDS, because it's already well known, a lot of people already use it, and we can just connect the CUDA back end to Dask to leverage even more performance. Because it's extremely modular, it's a great candidate for RAPIDS, and since we can have multiple workers on a single node with Dask, we can also have a one-worker-per-GPU model, which makes things much simpler to develop and to debug. But how does Dask really operate, what does it look like? Normally you would have a NumPy array, for example; you could execute some computation on that array and get some results. But maybe NumPy will be slow, because many of its algorithms are single threaded, or you cannot distribute them. So what you do is basically create a Dask array, which is one big block composed of many smaller blocks that are NumPy arrays. But what if we want to use Dask on a GPU? We can do that; we use CuPy for that. If you're familiar with NumPy, you're automatically familiar with CuPy as well, because it
has the same API; it implements the same API. What we have to do now is basically say: okay, my Dask array now uses CuPy as a back end, so all these blocks will be blocks on a GPU, and you can distribute these blocks later on with the same Dask scheduler you would already have for NumPy. This is part of the interoperability effort I mentioned before. Previously there weren't a lot of interoperability capabilities in the Python ecosystem; of course you can always copy data around, but that hurts performance badly, and we want to address it. Also, you could not, for example, use a CuPy array in Dask, because Dask was simply not written for that purpose. To address this sort of issue, NumPy introduced several protocols, and in particular here we are interested in NumPy Enhancement Proposal 18, which introduces the __array_function__ protocol. This is a function dispatch mechanism that allows you to use NumPy simply as a high-level API: we call, for example, np.sum on an array, and depending on the type of the array, the work is dispatched to the library that actually implements it, for example CuPy or Dask. What these libraries need to implement is only this __array_function__ method, which can be implemented with something like 20 lines of code, obviously for libraries that are NumPy-like and operate on arrays. Here we have a simple example of computing an SVD with Dask; this is more or less how we would do it before, and very similar to how you would do it now. You import dask.array and numpy, and you create a NumPy array of random numbers, for example; then you can chunk it with dask.array.from_array. You can obviously create this array directly with Dask, but for example purposes I'm doing it like this to be clearer: we create a NumPy array, convert it into the various blocks of a Dask array, and finally we call np.linalg.svd on dx, which is a Dask array.
This is something that wasn't possible before __array_function__: you would have to call dask.array.linalg.svd, which means everyone who wanted to support Dask needed to know about the existence of Dask; now that's not the case anymore, you just need to know the NumPy API. For this example here, it took one minute 21 seconds for the array to compute. Now, if we want to do it with CuPy, we do almost the same; we have to change two lines, which is to add an import of CuPy and to say that the array we're creating is a CuPy array. Everything else remains the same: the Dask array is now an array of several CuPy blocks, and we use np.linalg.svd on the same Dask array to compute on a GPU, and this takes 41 seconds, so roughly half the time it took before, and on a single GPU, by the way. But as with all good things in life, there are limitations to the protocol. One limitation is universal functions; fortunately these are already addressed by the __array_ufunc__ protocol. But numpy.array and numpy.asarray would require their own protocol if we want to pursue that path, because these two functions are meant to coerce an array to NumPy itself; if you instead coerce an array to CuPy, it might break compatibility with various libraries that already rely on asarray to give you specifically a NumPy array and nothing else, not a NumPy-like array. And then there is dispatch for methods of any kind, such as RandomState: in that case we cannot identify what kind of array we're operating on, because we're not passing an array to the function; we can pass a seed, but there is no array reference to base the dispatch off, and that reference is always the array in the case of __array_function__, so if we don't have an array, we just cannot do anything. There are alternatives to the __array_function__ protocol; one is uarray, which is an effort by Quansight, and it intends to address the shortcomings of NEP 18 that I mentioned earlier.
It's a generic multiple dispatch mechanism, so it looks a bit different from the NumPy __array_function__ approach. In this situation, instead of explicitly creating a NumPy array, a CuPy array, a Dask array or whatever type of array, we set the back end at the beginning: we say this block of code will use CuPy arrays, and with that everything else can remain the same. So numpy.ones will create arrays of whatever back end you're using, and all the operations will be dispatched to that library as well, in this case CuPy, for example. This is what it should look like, and it actually does look like this; this is perfectly fine code to use. In this short example, what I wanted was just to create a small array, compute a sum on it, and check that the types really match. I begin by creating this ones array, then I print the sum, so four seems to match, and the type of a and the type of the sum of a are both CuPy arrays, so this is exactly what we expected. We can also do multiple-library dispatch, so multiple back ends, say Dask and CuPy. Doing this is also very simple: we need another import to say we are working with the Dask back end, and we also have to set that we are using multiple back ends, CuPy internally and Dask at a higher level. Since Dask is a lazily evaluated library, we need to add .compute(); this is already something that breaks the NumPy API, but it's known and accepted for this kind of application. In this case we also check that the sum of the array matches four (okay), that the type of a is a Dask array (it is), and that the type of the NumPy sum of the array is a CuPy array, which it is not: it's a NumPy scalar, and the reason is that Dask needs to explicitly add support for uarray. What we would expect in this example is the red part instead of the line before the last, where
we see numpy.float64; we would expect the result to actually be a CuPy array. On the GPU side we also have other protocols, such as __cuda_array_interface__; this again addresses the problem of data copy and conversion. __cuda_array_interface__ basically provides a pointer to GPU memory, and we can pass this pointer to various libraries: Numba, CuPy and PyTorch, for example, all implement this array interface, and we can just pass the pointer around instead of copying any data. Besides __cuda_array_interface__ there's also DLPack, which is explicitly for deep learning, but you can also use it with RAPIDS, so you can pass data around with zero copies in case part of your pipeline is doing some deep learning. RAPIDS doesn't intend to address deep learning at all; it's about data science, conventional or classic machine learning, not deep learning, but we want the capability to interoperate with everyone in the Python ecosystem. There are also some challenges to this, and one of them, again, is communication: if we are copying data, that's a problem, but in some cases, when we have multiple nodes, we have to copy data around; there's no way around that. Dask by default uses TCP sockets, which are slow, so one of the alternatives is using, for example, UCX, which provides uniform access to transports: you can use TCP sockets, InfiniBand, shared memory, or NVLink, which is a proprietary NVIDIA interface that interconnects GPUs at a faster rate than PCI Express. But this is a C++ library, very targeted at the hardware, so we need Python bindings for it; there are Python bindings in the works, some of it is already done but it's not complete yet, and what it will allow is for Dask to communicate efficiently depending on the hardware available on your node or your cluster. Here we already have some preliminary benchmarks, let's say, or performance analysis: on top we have the "before" and on
the bottom we have the "after". If you are familiar with Dask dashboards, the red part is memory copying, the time Dask is actually waiting for some memory transfer to be done. In the "before" plot we see that, just after the 20-second mark, it spends basically four seconds doing copying and nothing else, whereas when we bring UCX into play, that same block becomes maybe half a second to one second, so it easily gives us a four to eight times speedup; of course this will depend on the hardware you have available. Here are some benchmarks for CuPy: these are all CuPy on a single NVIDIA Tesla V100 versus NumPy, however NumPy implements things internally, so maybe multi-threaded, maybe not, depending on the operation. We see that there are different gains depending on the nature of the operation, ranging from something like 270 times for elementwise computation, because that is very much bound by computation and not at all by data communication, down to SVD, which is more bound by communication, where we get something like 17 times, which is still a very decent speedup. If you want more details, there is a blog post I wrote about these benchmarks; it has all the details on how to reproduce the tests, if you're interested. I also have some single-GPU cuML versus scikit-learn benchmarks; as expected, we are faster than scikit-learn, because we are using GPUs, which are great for this kind of linear algebra application, and we can get up to 120 times speedup, for example for PCA; this is also a Tesla V100 versus an 80-core node. Here is also a distributed benchmark, the more interesting one in my opinion. We have four lines here; the top line is the time it takes to solve an SVD, in the case written there 611 seconds, for 10 million rows by 1,000 columns, on a CPU with 80 threads. On a single GPU it takes a bit over half
that time, and when we expand this to multiple GPUs, say eight GPUs on a DGX-1 machine, which is an NVIDIA supercomputer with, in this case, eight GPUs, it takes 51 seconds; and if we add a second node, communicating over InfiniBand for example, it takes 33 seconds. So we have very good scalability here, though not perfect. I didn't plot more for a single GPU because we ran out of memory, which is one of the problems we address with multiple GPUs; and for NumPy on a CPU it would take too long, so I gave up on that. If we scale up to 20 million rows, so we double the size of the problem, it takes about 107 seconds for a single node and 60 seconds for a dual node; that is about 80 percent scalability, which is not too bad. To wrap up: RAPIDS is used to scale up, so if you have your PyData ecosystem doing scikit-learn, pandas, NumPy, you can scale that up with RAPIDS to leverage the performance of GPUs; and if we bring Dask into the picture, we can also scale out. You can already scale out CPU processing, but we also want to scale out GPU processing, and this is how we use Dask for that purpose. This is the roadmap, going back to the beginning of RAPIDS, when it was first released in October 2018 as version 0.1; we had almost no algorithms whatsoever, but we are committed to increasing the number of algorithms available. This is the current state as of June 2019, when RAPIDS 0.8 was released, and this is where we want to be by the end of the year, RAPIDS 0.12, somewhere targeting RAPIDS 1.0. We are also focused on robust functionality, deployment and user experience, so you can use RAPIDS on many cloud platforms already; if you have a GPU at home, you can also use that for data science, but it has to be at least a Pascal GPU, if I'm not mistaken, so a GTX 10 or RTX 20 family card, for example. Everything is open source, so you can get everything on GitHub; you can install
via Anaconda, you can install via NVIDIA GPU Cloud or via Docker, and deploy anywhere you have some GPUs. I have some additional reading material here if anybody is interested; this presentation will probably be available later. These are posts from different people about the protocols I mentioned before, and also about Python performance and GPUs, from Matt Rocklin, who is the BDFL of Dask; he's also at NVIDIA now, and he has been a great asset for this distributed data science world, for RAPIDS in particular. And that's it, thank you very much.

Thank you, Peter, very interesting talk. We have time for one question.

Thank you. I wonder how good the compatibility between NumPy and CuPy is, because I expect most of the functionality is there, but probably there are corner cases; I use a lot of, for instance, structured arrays in NumPy, so I want to know if I could expect problems or not. Thanks.

Of course it doesn't implement every single bit of functionality that NumPy implements, but it has a pretty large API, and you can find the compatibility list in the documentation. Of course I don't know everything off the top of my head, but there is a very useful compatibility list of what is implemented in CuPy and what is not, so I think that's the best way to figure out whether you can use it for your application. But it implements a lot of the NumPy API. CuPy predates RAPIDS, just in case you don't know that; it's not developed by NVIDIA, it's developed by Preferred Networks in Japan, and it's a very stable library; it's been around for, I think, ten years or maybe even more, so it already has good compatibility with the NumPy API.

Thank you.