Next is our keynote that we skipped yesterday, from Siddha Ganju. We turned that into a dinner keynote. She is going to talk about 30 golden rules of deep learning performance, so basically explaining lots of ways to improve deep learning performance. Siddha works at NVIDIA, where she is an AI researcher. She has helped on a lot of important and big projects for NASA and for CERN. And I think she is going to tell us a few very interesting things, so I am looking forward to the talk. Siddha?

Okay, so before I begin, I would actually like to thank the organizers. It takes a lot of effort, and a lot of behind-the-scenes effort, to bring everyone together and have a successful conference. So before I start, I just want to recognize that effort: a huge round of applause as a thank you to the organizers.

Okay, so let's get started. You know what the internet says? Well, the internet says that if you want to be good at something, put 10,000 hours of effort into it. That's great and all, but cumulatively that's a little over a year. Let's take the example of buying a light bulb. Let's say I want to buy a bulb, so there are a few things that I need to know. For example, what do lumens mean, or what is the color temperature of a bulb? Do we want a bulb that shines with a yellowish tinge, or with a cool white light, or with a blueish tinge? And what is the effect of all these colors? Now, to become a self-taught expert, either you put a year into figuring out everything, and if that's your jam, great. But hey, you may also want to buy an air conditioner, or a slightly more technical device like a phone or a laptop. The solution at the other end of the spectrum is that you simply read a tutorial or a blog that gives you enough information to make an educated decision. So you'll figure out the bulb you need, but you won't necessarily have to sit and experiment with all the different components of a bulb, similar to what I'd imagine Nikola Tesla did. And this is the slightly smarter approach, because you stand on the shoulders of giants and you assimilate the 10,000 hours of learning that people already working in the field have gathered. And that, ladies and gentlemen, is how this talk, 30 golden rules of deep learning performance, came to be.

During today's session, we will go through some tried and tested techniques for improving the data, training, and deployment pipeline on different kinds of hardware. Now, you may be wondering, why is this worth the investment of your time? Well, the knowledge that we present is sourced from a cumulative two decades' worth of experience, and that involves being part of teams who have deployed products to a billion devices, and at times spending months tuning a single feature. And of course, we made mistakes, and we learned from our mistakes and from other people. Today's session is really a small snippet of all the learnings that are available in the book, Practical Deep Learning.

Another major point addressed in today's session is that now that AI has become relatively mainstream, there is a plethora of tools to choose from. So which tool should you choose? Which tool will allow you to stand on the shoulders of giants and get up and running in less than 30 minutes, similar to the promise of Domino's Pizza? And this highlights yet another aspect of today's presentation: how do we work smart and get production-quality results? Now, this is similar to what Guido mentioned at the end of his Q&A.
Work smart, so we get to spend our evenings and weekends with our families.

Now, I know what you're all thinking: talk is cheap, show me the code. And before I do that, before I get into the meat of the presentation, let me give one example aimed at the audience. The last chapter of the book is on autonomous driving in the virtual environment of AWS DeepRacer. And as a slight nudge to our readers, we mention that the next level is the Roborace championship: take Formula One racing, make it autonomous, and we get Roborace. Basically, an AI-enabled car at 200 miles per hour. A few months later, a couple of students actually made this happen and are going to be competing in the next iteration of the competition.

Now, before I go any further, let's make sure that everyone is on the same page. Let's quickly discuss the concept which is used in approximately 95% of all industrial AI tasks for training: transfer learning. Transfer learning is the technique of modifying a pre-trained network to work on another task. Taking an example, let's say I want to learn how to play a musical instrument, the melodica. If you're learning from scratch, this will take somewhere around three months. But what if you already know how to play the piano? Then all you need to do is fine-tune your skills from the piano to the melodica. The advantage here is that you probably already know the theory of music and have a good understanding of the keys, and you just need to practice the breathing technique while playing the keys. In terms of deep learning, we know that in a neural network, the generic knowledge layers are towards the beginning, or the input section, of the network, and the task-specific layers are towards the top, or the head, of the network. So transfer learning, in a sense, is clipping the last few task-X-specific layers, adding a new set of layers, and training them to make them task-Y-specific. If you want to try out transfer learning in less than 30 lines of code without any installation, go to the link at the bottom of the slide and run it using the Colab button. That gives you a GPU on Google Cloud for free to run the entire fine-tuning script. I'll also show a tiny sketch of what such a fine-tuning script looks like in a moment.

Now, on that note, let's look at hardware availability. As we've all painfully experienced in life, what we really want is not always what we can afford, and it's the same in machine learning. But what if there were some technique so that you could tune up the car on the right, put some nitrous oxide boosters on it, maximize its efficiency, and make it work as close as possible to the car on the left? In other words, what we really want is more bang for our buck. And that's exactly where the 30 golden rules fit in.

Now, let's look at what we want to optimize in particular. When we look at the process flow, we see that in the first iteration, the CPU is preparing a batch of data while the GPU or the TPU is idle. In the next iteration, the CPU has passed the data to the GPU, and the GPU has started training on its first batch of data, and during this time the CPU is idle. This process goes on and on. What's really happening here is that the idle cycles for both the CPU and the GPU contribute to starvation. Throughout the session, we will talk about some tips and tricks that you can use to reduce both the CPU and the GPU starvation.
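Coming back to transfer learning for a second, here is a minimal sketch of the kind of fine-tuning script the Colab notebook walks you through, assuming a MobileNetV2 backbone pre-trained on ImageNet and a hypothetical two-class dataset (train_ds and val_ds are placeholders you would build yourself):

```python
import tensorflow as tf

# Transfer-learning sketch: reuse a pre-trained backbone, clip off its
# task-specific head, and train a small new head for our own task.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the generic feature-extraction layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # new task-Y-specific head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```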
Now, let's talk about some tools that will help confirm the improvement. And of course, this calls for profilers. Profilers come in varying levels of complexity, and there are many, many profilers out there.

On the slightly more visual side, we have the TensorFlow profiler, which you can use with TensorBoard. It runs directly in your browser, it's extremely visual, and it offers millisecond-level granularity. What it's really telling us is that from 0 milliseconds to 120 milliseconds, where we see the empty timeline, that's where the GPU could be doing something but currently isn't. How do we get the TensorFlow profiler enabled? We add it as a callback. But maybe this is a slightly advanced solution, and of course it requires some degree of setup.

Sometimes all we really need is to look at some basic performance metrics, and for that we can use something relatively simple like the nvidia-smi command, which comes under the umbrella of the NVIDIA System Management Interface. Again, this gives a ton of good information, like what processes are running, how much GPU memory each of these processes is taking, and on which GPU. (I just want to check with Andre or Mark if there's anything that you want to... It's nothing, sorry about this, people leaving the talk. Okay, no worries. So, continuing.) It also gives the temperature of the GPU in Celsius, if that's interesting enough. But the most important figure here is the utilization of the GPU. So let's say I bought a GPU for 100 euros. With a utilization of 51%, 49 euros out of my investment are not being used at all, and that isn't really good. This is how we will see whether we're able to make optimum use of our GPUs. Now, if you want to run nvidia-smi over the entire duration of the pipeline, you can append it to the watch command: a command like `watch -n 0.5 nvidia-smi` will run nvidia-smi every half a second, and we can accumulate the utilization over the entire training period.

So, as a quick summary: we know what to optimize, GPU starvation, and we have a success metric that will be visible through the profilers. Now, how are we going to optimize? Well, now we have the 30 golden rules. These rules are sprinkled over the preparation, reading, and augmentation stages, as well as the training and inference parts of the development pipeline.

So let's start with data preparation first. Now, in software programming, there are a couple of anti-patterns. As an example, image datasets usually consist of thousands of tiny, tiny files, and each of these files usually measures a couple of kilobytes. Our training pipeline has to read each file individually, and if we do this thousands of times, that has significant overhead and causes a slowdown of our training process. So we want to speed that up. Now imagine how severe the problem becomes when you think about spinning hard disk drives, and it's aggravated even further when the files are stored on a remote storage server. That really is the first hurdle. The solution is to combine several of these files into a couple, or a handful, of larger files. We store them in a format known as protocol buffers, or protobuf, which is a very common serialization format that TensorFlow uses to store model weights as well as data. The sweet spot for these TFRecord files is about 100 megabytes. So there you go, that's rule number one: generate TFRecord files of about 100 megabytes each and read those in.
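To make rule number one concrete, here is a small sketch of packing many tiny image files into a handful of large TFRecord shards. The function name, the feature names, and the shard count are just illustrative choices; you would pick the number of shards so each file lands near the 100-megabyte sweet spot.

```python
import tensorflow as tf

def write_tfrecord_shards(image_paths, labels, num_shards=8, prefix="train"):
    """Pack many small image files into a few large TFRecord shards."""
    for shard in range(num_shards):
        filename = f"{prefix}-{shard:03d}.tfrecord"
        with tf.io.TFRecordWriter(filename) as writer:
            # Round-robin assignment of files to shards.
            for path, label in zip(image_paths[shard::num_shards],
                                   labels[shard::num_shards]):
                image_bytes = tf.io.read_file(path).numpy()
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[image_bytes])),
                    "label": tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[int(label)])),
                }))
                writer.write(example.SerializeToString())
```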
Moving on to rule number two. Every time we preprocess data or we train, there is actually a lot of repetition happening. So the slightly better strategy is to save compute cycles. We can do that by performing all the common preprocessing steps, like resizing, once, and then saving the results in the TFRecord format. Then, when we want to train, we read the written TFRecord files and keep iterating on the problem itself.

Now, when we get started on a task, we usually reach for the unprocessed, non-TFRecord data. The better option is to try out the high-performance data pipeline that TensorFlow offers, called TFDS, or TensorFlow Datasets. It offers a wide variety of datasets, ranging from natural language processing to vision, audio, and video. One of the more recent datasets that has been added is CheXpert, which is basically a chest X-ray dataset relevant for COVID-19-related research. So if that's your direction of research, you have a high-performance pipeline already available. Using TensorFlow Datasets is also pretty easy, because it amounts to just two lines of code for a high-performance pipeline. And one pro tip: if you're performing research in a novel direction, one idea is to publish the dataset on TensorFlow Datasets, so it's easier for others to build on your work.

OK, now our data is fully prepared. Let's look at some opportunities to maximize the throughput of the data reading pipeline. If I'm being honest, if there's just one thing that you want to take away from this presentation, it should be that TensorFlow data, or tf.data, really is the way to build an optimal, high-performance input pipeline. So just use that. As we'll see in the next couple of slides, we will have one optimization per slide, and eventually these optimizations will add up to a lot more than a 10% performance improvement. As we mentioned before, this is what the basic pipeline looks like.

Now, the very first improvement we can make is to prefetch the data. We previously touched upon the circular dependency, where the GPU waits for the CPU to generate the data, and the CPU waits for the GPU to finish processing. This kind of circular dependency causes idle time for both the CPU and the GPU, which, of course, is inefficient. The prefetching function helps us by delinking the production of the data by the CPU from the consumption of the data by the GPU. It uses a background thread that processes the data asynchronously and passes it into an intermediate buffer. The advantage here is that the CPU can carry on computing the next batch instead of waiting on the GPU, and when the GPU finishes its previous computation, there is already data available in the buffer, so it can start processing. And it's as simple as a one-line command to enable prefetching.

The next rule is to parallelize the CPU processing. It's best to parallelize according to the number of CPU cores; otherwise, the end result may be a little slower due to context switching.

Now, reading files from disk, or worse, over a network, is a huge cause of bottlenecks, and again that requires us to optimize our file reads. One solution that addresses this is to parallelize the I/O and the subsequent processing, and this is known as interleaving. For us, this means parallelizing the extraction of the stored TFRecord files.
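Putting the last few rules together, a tf.data input pipeline with interleaved file reads, parallel preprocessing, and prefetching might look roughly like this. The parsing function and the file pattern are placeholders matching the TFRecord sketch earlier; tf.data.AUTOTUNE is spelled tf.data.experimental.AUTOTUNE in older TensorFlow 2.x releases.

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder parsing step for records written as in the earlier sketch.
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, features["label"]

files = tf.data.Dataset.list_files("train-*.tfrecord")
dataset = (files
           # Interleave reads so file I/O happens in parallel.
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)
           # Parallelize CPU preprocessing across cores.
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           # Prefetch to delink CPU data production from GPU consumption.
           .prefetch(tf.data.AUTOTUNE))
```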
When we look at the usual TensorFlow training pipelines, or even training pipelines in general, we see that they're somewhat sequential: we read the data, we train, we do inference, we do testing. But ideally, we want to parallelize this as much as possible. And intuitively, when we think about shuffling the data, we perform a random shuffle on the data so that any implicit ordering in the data is not used as a training feature. When we look at the reading pipeline for tf.data, we see that it still attempts to produce the outputs in a round-robin fashion. One disadvantage of that is that if we encounter a single slow operation, say a file read that takes really, really long, it's going to hold up the rest of the pipeline. So the solution here is to use the non-deterministic setting, which tells the data reader not to read the files in order and therefore reduces the wasted cycles.

We previously mentioned preprocessing the files and storing them as TFRecords. Now we also want to cache those files. Caching makes them much quicker to read, and we can cache either in memory or on disk, depending on how big the dataset or the batch is.

All right. Now, most frameworks are community-led efforts, be it TensorFlow, PyTorch, or Keras, so there are always optimizations and developments in flight, and sometimes it's worth looking into the development branches to see if something relevant pops up. The easier way to do this is to set the experimental optimization flags in your code to true.

Well, this is a really good one. A very common issue that grapples beginners is: what are the best values for all these parameters, and how do I optimize them for the hardware? Well, all you've got to do is forget about it and run this one line, which optimizes for your particular hardware. So it's hardware-aware, in a sense: it figures out whether you're running the code on a CPU, GPU, or TPU, and it will tune the parameters accordingly. So far, if you've noticed, we've gone through optimizations one by one, and all this code ultimately will make your program wildly fast.

Now, on to the augmentations. Preprocessing and augmentations are super, super elaborate; I could write another book on just that. Transformation operations like resizing, cropping, and blurring are mostly matrix operations, but when we look at the underlying implementation, they're almost always performed on the CPU. The most common libraries, like OpenCV, Pillow, and even the built-in Keras augmentation functionality, run on the CPU by default. Now, of course, you can build OpenCV with underlying CUDA capability so it utilizes the GPU, but with the more default options, augmentation practically always runs on the CPU. And that also means we're not utilizing the underlying hardware to its potential. So the solution is to use the GPU for augmentation. TensorFlow has tf.image, which is relatively new and has some handy augmentations that can seamlessly plug into the tf.data pipeline. Of course, this is a community-led effort and still a work in progress, so there is limited functionality available; as an example, we cannot rotate images by a random degree, rotation is fixed to 90-degree steps. The good news with most community-led efforts is that most of these augmentation functionalities have an existing GitHub ticket that we can use to track how the augmentations are progressing and when they will be available upstream.
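Here is a small sketch of the kind of tf.image augmentations that can be dropped into the map step of the pipeline above. Note the limitation just mentioned: tf.image.rot90 only rotates in 90-degree steps, so the example picks a random multiple of 90 degrees rather than an arbitrary angle.

```python
import tensorflow as tf

def augment(image, label):
    # Augmentations expressed as TensorFlow ops, so they run inside the graph
    # (and on an accelerator when placed there) rather than on the CPU via
    # OpenCV or Pillow.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Rotation in tf.image is limited to multiples of 90 degrees.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)
    return image, label

# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```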
Now, when we look at the slightly more prevalent solutions, there is a library known as NVIDIA DALI that is framework-independent, so it works with a wide variety of frameworks, and it removes the need for the tf.data pipeline for augmentation, all the while attaining high throughput. Some interesting benchmarks have also been set using the NVIDIA DALI library for augmentation; as an example, training on ImageNet within 80 seconds. That's pretty good.

Now, moving on to the training portion, this is rule number 13, which is about using automatic mixed precision. When we go from a 32-bit to a 16-bit representation, we effectively double the available memory bandwidth, we can fit a model twice the size, and we can double the batch size that can be accommodated as well. In terms of inference, it's ideal to convert the trained model to something like INT16 or INT8. Now, manually writing mixed-precision code is relatively difficult, so it's easier to use libraries like NVIDIA's automatic mixed precision (AMP) library, or even just to set the TensorFlow flag for AMP to on; I'll show a small sketch of this in a moment. This makes training two to three times faster.

There is a very tiny caveat that needs to be taken care of. If you use FP16 naively during training, it can potentially lead to a significant loss in model accuracy, and the end result may not even converge to an optimal solution. This happens because FP16 has a limited range for representing numbers. As an example, let's say we have an update of 10 to the power of minus 7 or minus 8 on a weight value of 1.1. With FP32, the update would register correctly, but with FP16 the weight will still remain at 1.1, because the update is too small to register. This effect is amplified even further when we have activations from layers like rectified linear units, and it often results in FP16 overflowing to NaN or infinity. So how do automatic mixed precision libraries take care of this? They store the FP32 model as the master copy, perform the forward pass and the backward pass in FP16, so it's faster, and scale the loss so that the small gradient values don't vanish; the results are converted back to FP32 and applied to the master copy. So you're taking advantage of the higher precision of FP32 while also taking advantage of the faster processing that FP16 offers. And in some of the newer architectures, like the Volta and Turing architectures, there are Tensor Cores with selective optimization for FP16, INT8, and other lower precisions. Again, adding it to your code is as simple as a one-line flag.

OK, the next rule is to use a larger batch size. We prefer training with mini-batches of data, and we can achieve similar training accuracy by feeding in fewer, larger batches as by feeding in many smaller ones. So here we want to experiment with the batch size and watch the GPU utilization, depending on our GPU. As an example, the 2080 Ti GPU from NVIDIA has around 11 gigabytes of GPU memory, and that is enough for some of the more efficient models like MobileNet. But when we look at the GPU utilization, it's somewhere around 85%, and that is at a batch size of 800-plus. If we were to replace MobileNet with something slightly heavier, like a ResNet-50, we would see the utilization immediately go up to 99%.
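Coming back to mixed precision for a second, here is roughly what enabling it looks like with the Keras mixed-precision API in recent TensorFlow versions; this is a minimal sketch, and the model is just a stand-in. Under the mixed_float16 policy, Keras keeps FP32 master weights and applies loss scaling automatically when the model is compiled.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 where safe, while keeping float32 master weights.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    # Keep the final activations in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# compile() wraps the optimizer with loss scaling under the mixed_float16
# policy, which protects tiny gradient values from FP16 underflow.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```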
The next rule comes from an interesting case study that was published by NVIDIA a couple of years ago. They noticed that as they were trying to tune the model, when the batch size was in multiples of 8, it would actually give five times the performance. Intuitively, when we think about this, as the hardware increases in specialization, from CPUs to GPUs to Tensor Cores to TPUs, we see that there is an underlying preferred size for the numeric multiplications, and this often has some minimum requirements. So for some of the recent NVIDIA GPUs and previous generations, using multiples of 8 is a good idea, but for some of the more current GPUs, batch sizes in multiples of 64 or 256, and on TPUs multiples of 128, are often considered best.

Learning rate. This is one hyperparameter that greatly affects the speed of convergence and the accuracy. Ideally, we don't want the learning rate to be so high that the loss swings around like a pendulum, and we also don't want it so tiny that it crawls towards the global minimum. So what is the ideal way to find it? We use a library called Keras LR Finder, which is based on the paper published by Leslie N. Smith. The underlying algorithm is that we start with a really low learning rate and gradually increase it until we reach a pre-specified maximum value, and along the way we calculate the first derivative, the rate of decrease of the loss. We then find the point which has the highest rate of decrease of the loss, and this turns out to be the optimum learning rate.

Now, as TensorFlow progressed from TensorFlow 1.0 to TensorFlow 2.0, it added the flexibility of eager execution. This actually comes at a very tiny cost, usually on the order of milliseconds, so it can often be ignored. But if we have several small operations, eager execution can actually have a sizable impact. The solution, again, is to add the decorator @tf.function before the function definition, and this in itself can offer a 10x speedup.

I see the timer is going, so I'm going to start to go a little bit quicker now. The next rule is a rule of thumb, which is: overtrain, then generalize. Now, of course, in machine learning, overtraining is normally exceptionally harmful and detrimental to training. But here, what we want to do is overtrain on a small set first. For example, if we have 1,000 classes in ImageNet, we start by sampling 10 classes, then add 10 more, and train onwards from there. Another idea is progressive augmentation: we add augmentations one by one, which offers easier debugging and makes it easier to see the impact of each augmentation. And similarly with progressive resizing, where we slowly increase the size of the images in the training set.

Oh, this one is exceptionally important, I cannot stress this enough: installing an optimized stack for the hardware. One solution is to build TensorFlow from source, which of course takes a lot of time. Another option is to use the Anaconda build, and another is to utilize the NVIDIA NGC containers if you have a slightly more containerized workflow.

The next rule is something we touched on previously, which is parallelizing CPU threads. Again, this applies both across different operations and within the same operation, on an intra-op level; a tiny sketch of this follows below.
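Here is a minimal sketch of the inter-op and intra-op thread settings in TensorFlow; the core count is simply taken from the machine, and these calls need to happen early, before TensorFlow has executed any ops.

```python
import os
import tensorflow as tf

# Size the CPU thread pools to the available cores. Inter-op threads run
# independent operations in parallel; intra-op threads split a single heavy
# operation (for example a large matmul) across cores.
num_cores = os.cpu_count() or 1
tf.config.threading.set_inter_op_parallelism_threads(num_cores)
tf.config.threading.set_intra_op_parallelism_threads(num_cores)
```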
And finally, I feel like this one is a no-brainer: using better hardware. Hardware has progressed over time, and newer GPUs deliver far more FLOPS than older ones, so of course, use better hardware. And we want to scale if we have multiple GPUs or we're in the cloud. Now, the ability to scale really depends on the algorithm that we're using. If you have something that resembles a single node with multiple GPUs, try to use the mirrored strategy for distributed training, but if you have multiple nodes, prefer the multi-worker mirrored strategy. And of course, we can also use Uber's Horovod for distributing our jobs across multiple nodes.

Then we have some industrial benchmarks. For example, a distributed training run on ImageNet costs only $7, which is pretty cheap, and it also attains a pretty high accuracy; the time it takes to reach 93% accuracy is two minutes, and that's pretty good. Another industrial benchmark out there is known as MLPerf, which measures hardware performance, and again the task is to train on ImageNet. So there are a bunch of industrial benchmarks available.

OK, now moving on to inference, and I'll just need about three minutes here. The very first idea is to use models that are efficient and accurate, rather than the bigger models that are available. At the end of the day, it's a very simple question: does your 500-megabyte model spark joy? If not, discard the weights and head towards a more efficient and accurate model. The idea is to pick the most efficient and accurate model, and of course this depends on the task. For classification, you want to prefer something like a MobileNetV2 or a NASNet. For detection, something like EfficientDet. For NLP, the size of models has been increasing over the past few years, so the ideal model is a distilled network like DistilBERT.

Another idea is to quantize the model. We touched on this previously: reduce the precision, go from 32-bit to 8-bit numbers. Different frameworks like Core ML report the top-1 accuracy at different quantization levels; as we see here, going from 32-bit to 8-bit gives a 75% size reduction, while the accuracy stays almost the same. So quantize the model, and again, there are different ways to quantize the model. Another rule is pruning. Pruning helps in increasing the sparsity of the network, and this again helps in compressing the network; and this is, I believe, a one-line command for pruning. And the last ones are to use fused operations, so combine the convolution with the batch normalization layers, and finally, enabling GPU persistence. And I'm done. Just in time, yeah.

So thank you very much for the very nice keynote and all these tips that you gave, it's really impressive. Unfortunately, we don't have time for questions and answers, so I would like to ask you to go to the talk room; I posted the talk room, the channel name, into the chat, and then you can answer the questions there. So thank you very much again.