Hello, thanks everyone for joining me for this session. This session is on Apache MXNet. First of all, Apache MXNet is not an Amazon framework; it is an open source framework and an Apache project. I want to discuss the merits of this project and why you should take an interest in it. So, first of all, there are already plenty of deep learning frameworks, right? Why come up with one more? The reasons are that with each passing day we are training larger and larger models, and as you know, memory is the most costly resource. Compute is cheap these days, but memory is still very expensive. Many training jobs are now out of bounds for a single machine, even a single machine with multiple GPUs, so you need to go for distributed training, and you need something that scales seamlessly. These are the problems being solved on the server side.

There has been a little more success on the client side, and I just want to mention it for context. Again, as I said, memory is a very expensive resource, so a lot of techniques have been developed to conserve memory on the client side, now that models are being executed on your mobile phones. There is a very interesting researcher from Stanford, Song Han, who has done some very interesting work in making models compact. Some of the techniques are pruning away weak connections and using lower resolution, or rather lower precision, for the weights, and all of this has been possible without losing much accuracy. So I would say there is relatively more success on the client side, but memory issues remain very significant on the server side; by server side, I mean training.

Let us first see who the backers of this project are. You will see a lot of prominent names here, including Amazon; these are the companies and universities behind the project. This is a list of the top MXNet collaborators; since this is an open source project, we want to honor their contributions. I also want to show that this is a pretty healthy framework in terms of activity: new features, bug fixes, pull requests. As you can see, MXNet is not at the level of TensorFlow, which is much more popular, but it is coming up rapidly and is consistently at number three, so it is a promising project for sure.

Let us quickly revise a very basic fundamental. This is a very simplified depiction of how a neural network works. I use this image just to signify that there are lots of matrix calculations, and the number of matrix calculations is extremely high for multi-layer networks. A lot of values need to be preserved to calculate the values of the next layer, because the output of one layer is the input to the next layer. And typically we do batching to speed up training, and batching means many more values to store, so memory requirements simply explode.

So how does MXNet handle these challenges? MXNet provides all these cool features, which you can read for yourself, but today I want to focus on the two that are painted red. First, it supports declarative and imperative programming; I'll discuss what that means. Second, it does transparent scaling; I'll discuss what that means as well. Let us quickly take a look at the building blocks of MXNet. I can't go through all of them, but they are there for your review. The dependency engine is one of the most important things that MXNet has, and this engine supports parallelism.
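Just to make that concrete before going on, here is a minimal sketch, assuming MXNet's imperative NDArray API, of the kind of independent work the engine can schedule concurrently; the variable names and shapes are my own and anticipate the example on the next slide:

```python
import mxnet as mx

a = mx.nd.ones((1000, 1000))
b = a * 2            # depends only on a
c = a + 1            # depends only on a, so it is independent of b
d = b + c            # depends on both b and c

# Each call above returns immediately; the work is only queued.
# The dependency engine sees that b and c do not depend on each other,
# so it is free to run them in parallel and compute d once both finish.
d.wait_to_read()                 # block until the result actually exists
print(d.asnumpy()[0, 0])         # 4.0
```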
It automatically figures out where operations can be carried out in parallel, because writing parallel programs by yourself is very difficult; it finds all the opportunities for parallelizing on its own. This is an example of how parallelism is managed. I'm going to share these slides so you can go through all the layers; it's pretty self-explanatory if you just look at the names of the layers. And this is a very simple example of what I mean by auto-parallelization. There is a small program on the left: you are computing A, C, B, and D, and D is the final outcome. The engine automatically figures out that C and B can be computed in parallel and then added up to arrive at D. So it's pretty neat: it doesn't let you make the mistake of leaving things unparallelized.

The next thing is that it supports two types of API: one is symbolic and the second is imperative. To see what that means, let us take a quick look at these examples; in fact, I'll go to the next slide. I want you to focus on the left side first. It's a very simple program: some variables are being added, and, as the comment says, C cannot share memory with D because it could be used in the future. The system executing this program has no knowledge of whether C will be required later or not, so it keeps the memory allocated to C, and memory gets wasted. On the right-hand side, the same thing has been written in a symbolic, or declarative, way. Here we first define a graph, F. It has all the same operations as on the left side, but it doesn't allocate any memory, and when execution happens, when the inputs A and B are supplied to the graph F, it automatically figures out the most efficient way of allocating memory. For example, C can share memory with D: once C is computed and we move on to compute D, D can be stored in the place where C was stored, because we now know C is not required anymore. That makes memory utilization very efficient.

Here are some quick stats. As I said, it supports both symbolic and imperative programming. Roughly 90% of your running code will consist of symbolic graphs. They are very easy to write: basically you define a model, add layers, activation functions, and things like that, and it takes less than 10% of your time. But there will be many occasions when you need to do something very specific to your training program, for example processing inputs in a certain way. For that you use imperative programming, and that is where you generally end up spending most of your effort. So while it makes up around 10% of the runtime, you typically spend 90% of your time there, because adding layers to the network and launching the training is easy; it is the preprocessing and similar work that requires more effort.

There is one more thing MXNet does: it uses what are called static graphs. MXNet, TensorFlow, all these frameworks are basically graph computation engines, and they are static graph computation engines. There is a talk following mine on PyTorch, which supports dynamic graphs; it has some cool features, and the next speaker will discuss those. But there is one distinctive advantage of static graphs, and I'm going to show it in the form of a diagram. What you see here is a forward pass and a backward pass represented diagrammatically on the left-hand side, and the same diagram is now painted with colors.
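To ground the imperative-versus-symbolic discussion, here is a minimal sketch using MXNet's NDArray and Symbol APIs; the shapes and values are my own, but the pattern mirrors the slide:

```python
import mxnet as mx

# Imperative (NDArray) style: every intermediate result is materialized at once.
a = mx.nd.ones((2, 3))
b = mx.nd.ones((2, 3)) * 2
c = a + b            # memory for c is allocated right here
d = c + 1            # the engine cannot know whether c will be used again,
                     # so c's buffer has to stay alive

# Symbolic (declarative) style: build the graph first, execute it later.
A = mx.sym.Variable('A')
B = mx.sym.Variable('B')
C = A + B
D = C + 1            # nothing has run yet; D is just a graph

executor = D.bind(mx.cpu(), {'A': mx.nd.ones((2, 3)),
                             'B': mx.nd.ones((2, 3)) * 2})
out = executor.forward()   # the whole graph is known at bind time, so the
                           # planner is free to let C and D reuse one buffer
print(out[0].asnumpy())
```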
Coming back to that colored diagram: the same color means that those two operations can share memory, because the output of the earlier operation is no longer needed. For example, when this value is being calculated, this operation can reuse the memory that was allocated for that earlier operation, because its output is no longer needed by the time this one runs. So memory is optimized extremely aggressively.

The next thing MXNet does is a trade; so far you have not traded anything away, you have just saved memory from being wasted. MXNet does one more thing which is very special: it trades speed for memory. What that means is, as I said, compute cycles are cheap, and many compute cycles are wasted anyway while data is being transferred from one place to another and no computation can happen, so there are generally plenty of compute cycles available. MXNet spends extra compute cycles on certain things that let the training process use less memory. There is a lot of prose on the screen, but I'm going to show one diagram that will make this clear. What MXNet does is discard the results of cheap operations in the forward pass and recompute them in the backward pass. The forward pass happens, you reach the final stage where you calculate your loss, and now you are ready to calculate your gradients; but by the time that stage arrives, a lot of the outputs of the intermediate layers have been deleted to recover memory. They are deleted because recomputing them is not very hard: as I said, compute is cheap. If you delete the output of an operation, recomputing it is easy. You go quickly up to the stage where you calculate the loss, start calculating the gradients, and the layer outputs you lost along the way are recomputed as needed.

There is a simple formula: you keep the results of square root of N out of N layers. For example, if you have 16 layers, you preserve the outputs of only 4 layers, and you recompute the discarded outputs during the backward pass. This allows you to train a square-root-of-N-times larger model at about 75% of the speed. So in this case, with 16 layers, you can train a model that is square root of 16, that is, 4 times larger.

This is represented diagrammatically, and I will spend a little more time on this diagram to explain what I mean. This is how the normal process works: the forward pass happens, then the backward pass begins, and the backward pass requires the activation outputs of the intermediate stages. Since all of those outputs are needed, you have to preserve them in RAM, and that requires you to spend a lot of memory. The modified approach, which MXNet follows, is this: it deletes, for example, the outputs of the batch normalization layer and the ReLU layer, and goes very quickly up to the stage where it is ready to calculate the loss. Once the backward pass starts and is progressing, at this stage it needs this value, and at that stage it needs that value, so those values are recomputed just in time, one more time. They were deleted because they are not very difficult to recompute, and when they are required again they are recomputed very easily. That is how a lot of unnecessary memory waste is avoided, which allows you to train really large models. And if you look around, with each passing day the models are becoming larger and larger.
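To make the square-root-of-N bookkeeping concrete, here is a toy sketch; this is not MXNet's internal code, just my own illustration of which layer outputs get kept versus recomputed:

```python
import math

def checkpoint_plan(n_layers):
    """Toy illustration of the sqrt(N) rule: keep roughly sqrt(N) layer
    outputs as checkpoints and recompute everything in between during
    the backward pass."""
    step = int(math.sqrt(n_layers))
    kept = list(range(0, n_layers, step))                    # outputs held in memory
    recomputed = [i for i in range(n_layers) if i not in kept]
    return kept, recomputed

kept, recomputed = checkpoint_plan(16)
print("kept in memory:  ", kept)          # [0, 4, 8, 12] -> only 4 of the 16 outputs
print("recomputed later:", recomputed)    # the other 12, redone during the backward pass
```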
So it is very easy these days to have models with 60 million or 70 million parameters. The largest I have heard of is around 6 billion; of course, individual researchers cannot train models of that size, but that is the general direction we are headed in, and if we can't conserve memory aggressively, we can't reach that level. I hope this point is clear to everyone.

Now I am going to show you some stats. These are the memory requirements with MXNet's optimizations in place. The AlexNet model requires 1.8 times less memory, the Inception model requires 2.6 times less memory, and VGG requires half the memory. While that number, half, may not look so impressive, it literally saves you half the cost, because training runs for days and weeks, and GPU machines are extremely expensive even when you use them in the cloud. Even for predictions, for the same level of performance you require 3 to 4 times less memory. The baseline is without these optimizations, regardless of which framework you use: the number of parameters dictates how much memory you require. It does not depend on TensorFlow or Torch or whatever; how many layers you have, how many parameters you have, and what precision the values use is what dictates memory, and that is what dictates the baseline.

So at Amazon we love — yes? Okay, I'm sorry, I will talk to you at the end. So, as I said, parallel and distributed training in MXNet is extremely easy. There is something called drop-in parallelization: all you need to do to train in parallel is add one line of code once the network is set up. If you are training on AWS, we provide a CloudFormation template that sets the cluster up, and if you want to go from one machine to multiple machines, all you do is add one line of code.

If you can't use memory efficiently, your training doesn't scale linearly with compute. What I mean is, if memory utilization is not efficient and you add four times more machines, it doesn't mean your performance improves four times. But since MXNet handles memory so efficiently, its performance improves almost linearly with the number of machines added. In this graph, the compute time comes down by a factor of 3.7 when four machines are used in place of one. This is a neat feature: in Hadoop the processing time improves linearly with the number of machines you add to the cluster, but so far that was not happening in the GPU world, because a lot of memory and data transfers were involved; since MXNet handles memory so efficiently, performance scales nearly linearly with the number of machines you add. Again, parallel training happens under the hood; conceptually it is no different from how it happens in any other framework, but implementation-wise it is very easy: you just launch a cluster using a CloudFormation template, and if you want multiple machines, you just add one line to your program. That's all; all the parallelization is taken care of automatically.

This is the graph on linearity: the gray line indicates ideal performance, the blue line indicates MXNet's performance, which almost follows the ideal, whereas the yellow line indicates the performance of frameworks that don't optimize memory aggressively.
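Here is a minimal sketch of what that "one extra line" can look like with MXNet's Module API; the toy network, the synthetic data, and the choice of kvstore string are my own placeholders, not the exact code from the slides:

```python
import mxnet as mx
import numpy as np

# A tiny toy network and data set, just so the sketch is self-contained.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')

X = np.random.rand(1000, 100).astype('float32')
y = np.random.randint(0, 10, size=(1000,))
train_iter = mx.io.NDArrayIter(X, y, batch_size=50)

# Multiple devices on one machine: the "extra line" is essentially the context list.
devices = [mx.cpu()]                 # e.g. [mx.gpu(i) for i in range(4)] on a GPU box
mod = mx.mod.Module(symbol=net, context=devices)

# Going to multiple machines: create a distributed key-value store and pass it to fit().
kv = mx.kvstore.create('local')      # 'dist_sync' when launched across a cluster
mod.fit(train_iter, num_epoch=2, optimizer='sgd', kvstore=kv)
```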
As that graph shows, if you add a large number of machines and your framework doesn't optimize memory efficiently, there is no benefit to be derived. Another cool feature of MXNet is that it has plugins: if you are a Torch developer and you have developed Torch modules, you can use them from MXNet, and if you have Caffe operators, you can use them directly from MXNet as well. So I'm done discussing the memory optimization and efficiency part; now I'm going to mention some other cool features. As I said, if you are coming from the Torch world or the Caffe world and you have invested a lot in those frameworks, you can reuse their modules and operators, respectively, in MXNet.

There are tons of mainstream applications that have been developed in the vision, NLP, and speech areas. As I said, MXNet is a framework of choice at Amazon, and there are many other large companies using it, so there is a lot of ready-made code available for you to get started. The code is very easy to read, which is what I love about MXNet, and MXNet also has Keras support now, so if you prefer to use Keras, you will feel at home.

That said, I have a bunch of other slides and just one minute left, so let me see. Okay, cool. With MXNet you can program in Python, Scala, R, Julia, MATLAB, JavaScript, Go, and so on, so you don't have to change your programming language; this is a cool feature of MXNet. It is open source, so whether you want to run it on Amazon, on Azure, or at home doesn't matter. Next, it is extremely portable; you can deploy it anywhere. You can compile an MXNet program into a single file, so you don't have a framework and a model as two different files: a single lightweight file can be run on a mobile phone or any embedded device, and models can be so lightweight that they run inside browsers.

With that, I just want to point you to this further reading list; I will share the PPT so you can take a look at it. So, if you are going to train large models and cost is a concern, do consider MXNet: it is extremely easy to use, a no-brainer, with very easy-to-write and easy-to-read code. And with that, I finish exactly on time.