Today we have Matthias here. He is an open source enthusiast and he will give us a deep dive into TensorFlow.js and how to build neural networks in JavaScript.

Okay, so first of all, anyone not speaking German here? Okay, do you want me to hold the presentation in German or English? Yeah, okay, in English. So, my background is that I'm currently working full time on a project for defect detection in the aviation industry, but personally I like computer gaming a lot, and there's nothing more fun than writing your own computer games in the browser. I also wanted to make a little gesture recognition model to play a game. So we'll do it with JavaScript.

Okay, so first I'll give you a short introduction to neural networks. Does everyone know about neural networks in theory? Who does know about neural networks in theory? Okay, so we'll shorten this part. Then we'll talk a bit about use cases, especially in the JavaScript world. I'll introduce you to TensorFlow and TensorFlow.js. Then we'll train a little classifier and use that classifier in a browser game. And if we've got the time, I can talk a bit about recent ideas for efficient neural networks.

Okay, so the short introduction is that there's a thing called the universal approximation theorem, which states that you can approximate any continuous function on a compact subset arbitrarily well. This is really powerful, and there's a visual explanation behind it. Take, for example, the sigmoid function, which is S-shaped: if you parameterize it correctly, you can make it very steep, so that it's nearly a step function. And if you change the parameters w and b accordingly, you can get the step function to activate at any given point x, denoted as s — this is where the step activates. So if you activate a step at a certain point and subtract a second step that activates a bit later, you get a little bump out of it. And if you weight the output as well, you can adjust the bump's height. Once you've got this, you just need to sum the bumps up, and here we can see that you can model a continuous function out of such piecewise segments in this example. It works for other activation functions as well.

Okay, about convolutional neural networks. Convolutional neural networks build on the concept of filters. You might already be familiar with Lena from OpenCV, and there's an example where you apply a Sobel operator to get the edges. So filters can extract edges or analyze textures, which means analyzing frequency content — a bit like an FFT. A filter has a very limited field of view — it has local connectivity — and we convolve it over the input image. So we've got filters with weights, and we just do element-wise multiplication with the image patch, sum everything up and get our output.

Okay, so how do edges work in reality? Where you've got an edge, you've got a change in pixel intensity, say from a light to a dark color. You cannot see this very well on the left side, but if you take the first derivative, you get a hill-shaped curve that you can simply threshold: above the threshold, I've found an edge. Classically this is done using the Sobel operator, where the weights are set manually so that you get a discrete differentiation operation out of it. Then you combine one filter for the horizontal and one for the vertical edges and get an edge detector.
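Editor's sketch (not from the talk): the element-wise multiply-and-sum of a convolution just described, here with the horizontal Sobel kernel on a made-up grayscale patch.

```js
// A 3x3 Sobel kernel for vertical edges (horizontal intensity change).
const sobelX = [
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1],
];

function convolveAt(img, kernel, y, x) {
  // Element-wise multiply the 3x3 neighbourhood with the kernel and sum up.
  let sum = 0;
  for (let ky = 0; ky < 3; ky++) {
    for (let kx = 0; kx < 3; kx++) {
      sum += img[y + ky - 1][x + kx - 1] * kernel[ky][kx];
    }
  }
  return sum;
}

// A made-up image with a vertical edge: intensity jumps from 0 to 255.
const img = [
  [0, 0, 255, 255],
  [0, 0, 255, 255],
  [0, 0, 255, 255],
];
console.log(convolveAt(img, sobelX, 1, 1)); // large response (1020) on the edge
```

A full convolution just slides this local computation over every position of the image.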
Okay, but in convolutional neural networks, the weights are learned, so you don't need to set them manually. And when you stack multiple convolutions on top of each other, you can aggregate features: in the first few layers of the neural network, you learn things like edges and some basic textures; these get aggregated, and in the middle here you can already see things like eyes. On the right, you can even see a little bird in the feature maps.

Okay, so how do we learn those weights? First of all, we need to specify a loss function, which we're going to minimize. And then it's very difficult, because we've got such a high-dimensional space. A convex problem would be easy to solve, because you just descend into the optimum, but in a high-dimensional space there are a lot of local minima, and the method currently used is gradient descent, with a lot of modifications for better results. Here you see a little visualization of a convex problem, which is easy to solve because you always get to the global optimum. And on the right side, you already see two optima.

So how can you get out of a local optimum? You can get out by applying a learning rate schedule. When you're doing your gradient descent updates, you want the steps to get lower and lower to converge into a minimum. But then you raise the rate again to propel yourself out of this possibly local optimum and into another optimum, which is hopefully better. And then you want to converge again. (There's a small sketch of such a schedule below.) So how does it look? You travel around the high-dimensional space and settle into an optimum. And because there are so many of them, you can also employ averaging or ensemble techniques: when you find several local minima that are good, you store them, and later on you average the weights of the networks, and the result will be more robust.

Okay. So where do we use neural networks? In general, we can use them for nearly anything we can approximate. We could learn fraud detection, time series forecasting — maybe the weather or stock prices — or do document analysis or retrieval tasks. Retrieval means, for example, you paint a sketch of, let's say, a cat, and you get all the cat images out of your asset database, if you're a creative designer, for example. And then, of course, there's text-to-speech, speech-to-text and translation. Recently, Microsoft published a paper where they say they are on par with human performance when translating Chinese to English, for example. This is really crazy. And natural language processing has really been on the rise since the end of last year, with models like ULMFiT, ELMo and nowadays BERT.

Okay. But to be a bit more JavaScript-specific: you've got a lot of advantages when you run your model in the browser. You can, for example, save bandwidth, because you can pre-encode features, let's say from a video stream, and send a representation that does not take as much bandwidth over to the server. Or you can stay GDPR-compliant, because you don't even transmit data to a server — you evaluate everything locally. And there are, of course, alternative control methods, which we're going to use today in a computer game, like gesture recognition. And we can build game AIs in general. So our showcase will be a simple game that resembles the Raptor game from the 1990s a bit — perhaps some of you know it already. And we want to build a simple classifier; in this case, we're just going left and right.
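Editor's sketch (not from the talk) of the warm-restart learning rate schedule described above: the rate decays — here with a cosine curve — and is periodically reset to propel the optimizer out of a possibly poor local optimum. The values of `initialLr` and `period` are made-up illustration values.

```js
function lrWithWarmRestarts(step, initialLr = 0.01, period = 1000) {
  const t = (step % period) / period;                    // position within the cycle
  return 0.5 * initialLr * (1 + Math.cos(Math.PI * t)); // decay to ~0, then restart
}

for (const step of [0, 500, 999, 1000]) {
  console.log(step, lrWithWarmRestarts(step).toFixed(4));
}
// 0 -> 0.0100, 500 -> 0.0050, 999 -> ~0.0000, 1000 -> 0.0100 (restart)
```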
Okay. So we need to know a bit about TensorFlow.js. TensorFlow generally has a long history. The essence is that Google used the tool internally and then published it. It was written in Python and has got some C++ and CUDA code for hardware acceleration. Then they announced the Tensor Processing Units, and TensorFlow already came with support for them. TPUs are really specialized, because they implement some activation functions directly in hardware and get really great speedups out of that. Then they also published things like TensorFlow Lite, so you can get your models running on Android, for example, when you're writing Java. They published Colaboratory, which is a kind of Jupyter notebook — like a wiki where you have code and can execute it — and it runs in the cloud, where you can already make use of TPUs, for example. So you can get around 140 teraflops for free; as a comparison, your laptop has probably 140 or 160 gigaflops. Then TensorFlow.js came along recently — it was initially developed as deeplearn.js — and they also published visualization tools for it. Nowadays we can say that TensorFlow.js can be used in production, because they published the first stable release. So we can run it in nearly every environment.

Okay, about the concepts. A tensor is just a multi-dimensional array, nothing more. And it's immutable: if you run an operation, you get a new tensor out of it, and you still have the old tensor. So in TensorFlow.js you really need to do memory cleanup yourself, because otherwise you could end up with a lot of tensors on your GPU, or just in memory, which don't get cleaned up, and then you run out of memory. Furthermore, there are variables. Variables can be mutated and are important if you want to do things differently during training and testing, for techniques like dropout, for example. And then there's the model abstraction, which just takes inputs, transforms them somehow, and produces an output. After all that, we've got the optimizer, which does gradient descent and fits our model between the input data and the desired outcome.

TensorFlow.js really follows the Keras specification. For those of you programming a lot in Python: you may have used Keras already, because it's easier than the plain TensorFlow API and shorter to write. TensorFlow.js adopted this API specification from the beginning, so it's very easy to write code. And if you're running in the browser, you get hardware acceleration through WebGL and its shaders. When you want to write games, for example, you need those shaders to calculate things like lighting, and they already operate on large matrices and perform the same operation on a lot of data — which is great for us. TensorFlow.js abstracts all of this away. If you're running in Node.js, for example, it will use the CPU, or maybe the GPU if you've got an NVIDIA device. But when running in the browser, you get hardware acceleration no matter what kind of graphics card you have, because you're using WebGL and not vendor-specific things like CUDA.

Okay, so the basic development workflow is just building the model, fitting it between input and output data, and running your predictions. And it is really that easy. Here we see an example: on the left side the JavaScript code, and on the right side the Python Keras code.
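Editor's sketch along the lines of the slide's example (the exact layer sizes and data are assumed): fit a straight line with a single dense layer.

```js
import * as tf from '@tensorflow/tfjs';

(async () => {
  // Sequential model: layers are simply stacked on top of each other.
  const model = tf.sequential();
  model.add(tf.layers.dense({units: 1, inputShape: [1]}));

  // compile() builds the model: choose an optimizer and the loss to minimize.
  model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

  // Unlike in Python, the tensor shape has to be given explicitly.
  const xs = tf.tensor2d([-1, 0, 1, 2, 3], [5, 1]);
  const ys = tf.tensor2d([-3, -1, 1, 3, 5], [5, 1]); // y = 2x - 1

  // One epoch means the model has seen all the data once.
  await model.fit(xs, ys, {epochs: 200});

  model.predict(tf.tensor2d([5], [1, 1])).print(); // should be close to 9
})();
```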
So if you've already programmed in Python with Keras, you basically won't notice any difference. So what are we doing here? First of all, we're building our model, and we say it's a sequential one. That means we just stack layers on top of each other. If we want some more expressiveness, we should use the functional API, because with the functional API we could also take two different results and feed them into another function, which allows us to build more advanced architectures. Then we add the layers to our model — here just a simple fully connected layer. And then we need to build the model, using the compile function. Here we specify the optimizer, like stochastic gradient descent or Adam. I would suggest you use Adam, because it generally does a great job out of the box; you don't need to tune hyperparameters much. And you need to specify your loss function: if you want to do regression, you could use mean squared error, for example; if you want to do binary classification, you use cross-entropy. Then we're just creating our data. The only thing that differs from Python here is that you need to set the shape explicitly. And then we fit the model, specifying the inputs and the desired outputs, and telling the model how many epochs to use. An epoch just means seeing all the data once. Because we cannot put all the data into memory, we use little batches, and when we've seen all the batches, we say we've completed an epoch. And then we run the prediction using the predict method. So nothing too special in it.

How do we build our classifier for the game? First of all, to make training quicker, we use a pre-trained model from a domain that can be adapted to ours, and then we build our own classifier on top of it. In this case, we're taking the MobileNet architecture, because it's really parameter-efficient: it uses fewer parameters and fewer calculations than other models. As a comparison, in 2014 the VGG-16 model had around 140 million parameters, and a MobileNet may have two million parameters even for segmentation tasks, for example. So we're saving a lot of computational power here.

But the MobileNet was trained on ImageNet, which is a classification task, and we don't want to classify things like pizza and cars and other stuff — we want to do our own classification. So we truncate the model at a certain point and put our own classification head on top. For our little game, we just want to have three actions: going left, going right and doing nothing. So we could, for example, flatten this three-dimensional volume here and build some fully connected layers on top of it, and we get three values out of it. We can just softmax them, because the classes are mutually exclusive and the resulting probabilities sum to one. If you don't have mutual exclusivity, you would instead apply a sigmoid to every output and not the softmax over everything.
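Editor's sketch (not from the talk's slides) of that distinction, with made-up logits: mutually exclusive classes get a softmax, independent ones get per-output sigmoids.

```js
import * as tf from '@tensorflow/tfjs';

const logits = tf.tensor1d([2.0, 1.0, 0.1]);

// Softmax: the outputs sum to one, so they read as probabilities of
// exactly one of "left" / "right" / "do nothing".
tf.softmax(logits).print(); // approx. [0.66, 0.24, 0.10]

// Sigmoid: each output is squashed independently; use this when the
// classes are not mutually exclusive (multi-label case).
tf.sigmoid(logits).print(); // approx. [0.88, 0.73, 0.52]
```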
Okay, but maybe you can already see a little design issue here. What do you think: how many parameters do we have in the MobileNet, and how many parameters do we have, approximately, in the classifier? Are they on par, or is one bigger than the other? Yeah. Okay, a little hint: we've seen the filters, and filters use very few weights, but fully connected layers really have a weight connecting every neuron to every other neuron. So in here, we've got around 50k values coming from this 7 × 7 × 1024 volume. And when we have a layer that goes from 50k values to, like, 100 values, we've got a weight matrix of 50k × 100. This is already a lot — the classifier would take up just as much memory as the MobileNet does.

So of course we can improve this. For example, we could do global average pooling, which reduces those 7 × 7 dimensions to 1 × 1. This shrinks the model a lot, but we lose the spatial dimension, and the spatial dimension is especially important for our task, because we want to go left, right, or do nothing. Global average pooling is really the way to go if you only want to do classification, and MobileNet does this for its own classification task. But we could also decrease the depth of the network's output here, because we don't need 1024 feature maps to separate lots of different object classes — we essentially only need one, because we want to detect persons. So we can put a 1 × 1 convolution on top, reduce it to a depth of 1 and save a lot of memory. This would make the model a lot smaller: in comparison, this classification head would be around 20 megabytes, and you get it down to around 24 kilobytes.

Okay, so now we want to try a little live training to demonstrate the effectiveness of transfer learning, because normally you would think you need a lot of data to train neural networks. But this is not always the case. I want to train the actions of going left, going right, and doing nothing. I just capture two instances per class and train the network — that was really fast. And then I can classify, and we will see the class on the right side. So let's try: going left gives class zero, going right class one, and here in the middle we've got class two. That's it — it took about three seconds and we already got a usable model.

Okay, so let's look into the code, because you're all here because you want to know how to do it yourself. First of all, we need to capture the images from the webcam, and for that we use the Media Devices API: we call getUserMedia and specify some constraints. Our pre-trained model needs an input size of 224 × 224, which is why we specify that we want exactly this dimension of images in the video stream. When the video stream is loaded, we get a loadeddata event — that means the first buffer, the first image, is available for us. Then we can resolve a promise, because we want to wait until the video is initialized, and when the promise is resolved, we know we can continue. It's all done asynchronously in general.

When we capture a video frame, we can already use a TensorFlow function, tf.fromPixels. We connected our video stream to an HTML video element previously, and now we can just say: take this video element — which you can get with document.getElementById, for example — and grab a single frame out of it. Now, the pre-trained MobileNet already used some pre-processing during training: they shifted the values to between minus one and plus one; they normalized them. So we also need to normalize our data, to have the same conditions in the input data as were used for the pre-training. So we first expand the dimensions, because we need a batch axis in front — not just a single image, but a batch of size one — then convert to float, divide by 127 and subtract one.
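Editor's sketch of this capture path (function names are assumptions; `tf.browser.fromPixels` is the current name of the `tf.fromPixels` call mentioned in the talk):

```js
import * as tf from '@tensorflow/tfjs';

async function initWebcam(video) {
  // Ask for frames that already match MobileNet's 224x224 input size.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: {width: 224, height: 224},
  });
  video.srcObject = stream;
  // Resolve once the first frame is in the buffer ('loadeddata').
  await new Promise(resolve =>
    video.addEventListener('loadeddata', resolve, {once: true}));
}

function captureFrame(video) {
  return tf.tidy(() => {
    // Grab a single frame from the <video> element.
    const frame = tf.browser.fromPixels(video);
    // Add a batch axis and normalize to roughly [-1, 1], matching the
    // preprocessing MobileNet was trained with.
    return frame.expandDims(0).toFloat().div(tf.scalar(127)).sub(tf.scalar(1));
  });
}
```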
So how do we gather the data into some collection? In here, we've got a little collection capturing the three actions, and we want to add an instance to it to use for training later. When we add an instance, we first want to convert the label: the label could be zero, one or two, and we want to get three probabilities out of it. So we use the so-called one-hot encoding, which converts this class label to our desired number of outputs. And you can already notice that we're cleaning up memory here with tf.tidy: everything you allocate in a tf.tidy block will be cleaned away, except for the tensor that you return, because you want to keep working on it. For the first instance, we can just say our xs are the single image we got, and we also add our first label. When we go on, we need to add to the collection, and since tensors are immutable, we cannot mutate it. So we concatenate the new instance onto our existing tensor and keep the result. And we also want to dispose of the old tensor, which contains the smaller array, because we've just extended it. Here you also see the memory cleanup with the dispose function.

Okay, so how do we get our backbone model, our MobileNet? TensorFlow.js allows us to store models in the browser, using IndexedDB for example, which is great because you can store blobs in there. We will make use of this through a URL scheme whose prefix points at the IndexedDB location. So what do we do? First of all, we want to pick a certain location where we cut the model. In the model, the convolutional layers and activation functions have names, and we use such a named layer to cut it later. First we load the MobileNet using this function here: we try to get it locally from IndexedDB, and if that doesn't work, we load it from our server, or from Google, or from anywhere, and then save it to IndexedDB for later use. So from then on we can use it offline.

So how do we now cut the model at a certain point? We can do it using tf.model, where we say that we're using the original inputs of the MobileNet, but not its original outputs: instead we take a certain layer — this conv_pw_13_relu layer — and specify its outputs as our model's outputs. So we've essentially sliced the model. And when you have trained your own model, you will also want to predict things, because given an image you want to know whether to go left, right or do nothing. There you also load the model and just use the predict function on it.

Okay, now let's build our classification head, which is customized and built on top of the truncated model. First of all, we need to specify the backbone MobileNet, because we need to work on the results we get out of that model. So how do we build our head? There are choices. We could use the flattened model with the fully connected layers, which is very large but can be trained with very few instances. Or we could employ our little trick with a 1 × 1 convolution to reduce the 1024 feature maps to just a single one. We will employ this technique and add some more layers to our network: we flatten the data and convert it to the desired number of classes in a dense layer, to get our three probabilities out. And then we say we want to build this as a sequential model, and can then use it.
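Editor's sketch of the loading, caching and dataset-growing steps just described. The download URL and the IndexedDB cache key are illustration values; 'conv_pw_13_relu' is the MobileNet v1 layer named in the talk.

```js
import * as tf from '@tensorflow/tfjs';

const MOBILENET_URL =
  'https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json';

async function loadTruncatedMobilenet() {
  let mobilenet;
  try {
    // Try the locally cached copy first...
    mobilenet = await tf.loadLayersModel('indexeddb://mobilenet');
  } catch (e) {
    // ...otherwise download it and cache it in IndexedDB for offline use.
    mobilenet = await tf.loadLayersModel(MOBILENET_URL);
    await mobilenet.save('indexeddb://mobilenet');
  }
  // Slice the model: same inputs, outputs taken from the named inner layer.
  const layer = mobilenet.getLayer('conv_pw_13_relu');
  return tf.model({inputs: mobilenet.inputs, outputs: layer.output});
}

// Growing the training set: tensors are immutable, so we concatenate and
// dispose of the old tensors to avoid leaking (GPU) memory.
function addExample(dataset, x, label, numClasses = 3) {
  // tf.tidy cleans up the intermediate int32 tensor; the returned
  // one-hot tensor survives the block.
  const y = tf.tidy(() => tf.oneHot(tf.tensor1d([label], 'int32'), numClasses));
  if (dataset.xs == null) {
    dataset.xs = x;
    dataset.ys = y;
  } else {
    const oldXs = dataset.xs;
    const oldYs = dataset.ys;
    dataset.xs = oldXs.concat(x, 0);
    dataset.ys = oldYs.concat(y, 0);
    oldXs.dispose();
    oldYs.dispose();
  }
}
```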
When we run a prediction, we actually use the backbone model to predict on an input image, get this intermediate output, run it through our own model and get the results — our probabilities. Classification is just the same thing, but we want a class out of it: we run the prediction, convert the result to a one-dimensional vector containing three probabilities, and take the index with the highest probability, because that is our class. For training, there's also not much code involved. You just instantiate the Adam optimizer to do the gradient descent for you — you can specify a learning rate, but it also has sensible defaults — then load your model, compile it as seen previously and fit it between the inputs and labels. Which is what you saw in the little GUI.

So now, when you want to use the model, you initialize capturing from the webcam again, to use it in our simple game. We add the video, wait for it to be initialized and then load our saved model — which can be stored with model.save, by the way. We load it, truncate it again and load our classification head, and that's it. Now we've got the webcam initialized and the models loaded, and we can get a single image from the webcam, run it through the encoder, run the intermediate result through our classifier and get the data back by calling dataSync. We get our classification label back, and if it's zero we move left, if it's one we move right, and if it's the other class we just stop moving in our game. Then we also call await tf.nextFrame(), because we don't want to waste CPU cycles: when your model is really fast, you would otherwise just have a busy loop in here, which is a bad thing. tf.nextFrame just waits for the next animation frame, so it caps execution at 60 frames per second in the browser. And since we await it asynchronously, the loop does not block the browser forever. (A sketch of this loop follows below.)

Okay, so how does it look? You can see the game right now — with better graphics — and the lighting conditions in here are a bit bad, but you can see that the position tracking really works; normally it would also change weapons. In fact, this version is doing regression: it's running a PoseNet on the input images, so it's a bit more sophisticated. It tracks my eyes, knows when it cannot find the eyes, and uses that to steer the game.
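Editor's sketch of the prediction loop described above; `captureFrame` is the helper from the earlier sketch, and the game functions (`moveLeft` and friends) are assumptions.

```js
import * as tf from '@tensorflow/tfjs';

async function gameLoop(video, truncatedMobilenet, head) {
  while (true) {
    const classId = tf.tidy(() => {
      const frame = captureFrame(video);              // normalized 224x224 batch
      const embedding = truncatedMobilenet.predict(frame);
      const probs = head.predict(embedding);
      // dataSync() pulls the values off the GPU; argMax picks the class.
      return probs.argMax(1).dataSync()[0];
    });

    if (classId === 0) moveLeft();        // assumed game API
    else if (classId === 1) moveRight();
    else stopMoving();

    // Yield to the browser: waits for the next animation frame instead of
    // spinning the CPU in a tight loop, capping us at the display rate.
    await tf.nextFrame();
  }
}
```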
Okay, so let's talk a bit about efficient models, because you've seen that the game runs fluidly in the browser — around 50 frames per second — which is amazing, I think, because in the past models were quite slow, and models that do segmentation or even pose estimation were even slower. Nowadays we've got very efficient ones. The model families I would suggest using these days are DenseNets and MobileNets, because they're really optimized for fast processing. With MobileNet you already get a lot for free from Google: for example, a MobileNet with an SSDLite object detector, which gives you rectangles and object classes if you want that. If you want to do segmentation, that is, produce masks of, let's say, people, you can use MobileNet together with DeepLab, which is already trained by Google, and just use it in your application. And there's of course also PoseNet, where they provide you with an implementation that takes a bottom-up approach. There are essentially two approaches: one is to run a two-stage predictor, which gives you region proposals — here, in this rectangle, could be a person — and then does further processing on that data. Or you use a bottom-up approach, which just goes over the entire image and gives you back, for example, a heat map, and this heat map is used to predict keypoints like the elbow joints or your wrist.

Okay, another idea is bottlenecks. Bottleneck structures reduce the depth of a three-dimensional volume: the spatial input dimensions don't get reduced, but the depth gets smaller, and since that part doesn't have many parameters, you save a lot of computational power.

Another idea is shortcut connections. When your model gets very, very deep, you've got a problem with the gradient descent optimization: neural networks are basically functions stacked onto each other, and when you've got nested functions, you need to apply the Leibniz chain rule when differentiating. Take the example of a sigmoid function: it saturates on both ends, so the gradient signal gets close to zero there, and when you multiply such values along the chain, you cannot propagate the signal further back. One thing that eases this is shortcut connections, so that layers nearer the input get quicker feedback from the output. Other techniques are batch normalization, or using an activation function like the SELU, which has a little exponential component and a linear one and is parameterized so that the values are shifted into a range with close to zero mean and roughly standardized variance. There's a proof of convergence — I think it's 170 pages — in the paper from Sepp Hochreiter's group.

This is done in the ResNet, for example, by applying an identity function here and adding it to the result of a layer. The main idea is residual learning: when the layer doesn't need to learn anything anymore, its weights are essentially zero and we just take the identity shortcut; and if not, there's probably only a small difference between the input and what we want to get, so we only learn the difference — the residual — on top of the identity function.
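Editor's sketch of such a residual block with the TensorFlow.js functional API; the tensor sizes are illustration values, not from the talk.

```js
import * as tf from '@tensorflow/tfjs';

const input = tf.input({shape: [32, 32, 64]});

// Two convolutions: the "residual" branch the block has to learn.
const conv1 = tf.layers.conv2d({
  filters: 64, kernelSize: 3, padding: 'same', activation: 'relu',
}).apply(input);
const conv2 = tf.layers.conv2d({
  filters: 64, kernelSize: 3, padding: 'same',
}).apply(conv1);

// The identity shortcut: add the input back onto the branch's output,
// so gradients can flow straight through during backprop.
const added = tf.layers.add().apply([input, conv2]);
const output = tf.layers.activation({activation: 'relu'}).apply(added);

const block = tf.model({inputs: input, outputs: output});
block.summary();
```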
Another very interesting paper doesn't do this with addition, but concatenates feature maps. With this concatenation of feature maps, you get feedback from the output right back to the input, so signal propagation during backprop is really great, and the design allows us to reuse features. When you think of a neural network like a computer program, you might want to calculate something once and use it in different places later — this design enables us to do that, and so we can build smaller models. In the DenseNet, for example, you don't grow your layer widths from, like, 32 filters to 64; you just add a constant number of new filters per layer. The idea is that you can reuse existing information and only add a bit of new information to it.

Okay, and then there are non-standard convolutions. A normal convolution has a lot of parameters: when you want to get from input depth M to output depth N, you have a volume of filter size × M × N. If we do a little example and use an input depth of 16, an output depth of 32 and a filter of size 3 × 3, we already get 16 × 32 × 3 × 3 = 4608 parameters out of it. But if we separate the operation into two distinct operations — first a spatial, depthwise convolution and then a pointwise convolution — we can reduce the number of parameters and the number of calculations involved. If we do the maths here, we've got 16 × 1 × 3 × 3, because the depth of each filter is 1, which is 144 parameters in the depthwise part — that's not very much. Then we've got the pointwise convolution stacked on top of it, which adds another 16 × 32 = 512 parameters, for a total of 656 parameters. That's only about 14% of the parameter count of the standard convolution — roughly a factor of 7 improvement, and a similar factor in the number of calculations.

So, to sum up and see the improvement again: in 2014 we had models which were really, really big, and now they're getting smaller and smaller and very efficient to compute on mobile devices. So it's a great time to use computer vision on your mobile device, in your browser, wherever you want — and that's my key message here. So thank you. Do we have any questions?

Yes, please — one second. "Do you have your code on GitHub or something like that?" I can publish it to GitHub and send a link to the organizers, maybe. "Yes, you have to use pretalx for that; they can, I think, add a comment on your talk and publish the link." Any other questions? Thank you very much for your presentation and your deep dive into TensorFlow.js — and for all of you, have a nice day at GPM.