You guys can hear me, right? I'm a pretty loud speaker. Okay, good. So I'm Ben Sadeghi, a data and AI specialist here at Microsoft. I basically help customers and partners learn about the latest tools available from Microsoft, especially on the Microsoft Azure cloud, and I do quite a bit of work in the machine learning and deep learning space. Today I'd like to give you an update on the Microsoft Cognitive Toolkit, an open-source deep learning framework. Quick show of hands: who's familiar with the basics of deep learning? Yeah? Okay, good.

So let me get started. We'll do a quick introduction, then I'll show you how to build a few models, some famous ones, if you will: VGG, ResNet, Inception. We'll go through some performance benchmarks that just came out this month, pretty cool, and I'll give you the updates and some resources. Hopefully I can get through all this in the 20 to 25 minutes.

Within Microsoft itself, there's deep learning everywhere: from Skype to Cortana, the assistant, to Bing, of course, which is a heavy machine learning engine, to HoloLens and even Xbox. There's a lot of activity. Another set of tools available are the Cognitive Services. These are pre-built deep learning models exposed as RESTful APIs for vision, speech, and language, and some of them are customizable using transfer learning. Practically everything you see here is built using CNTK, the Cognitive Toolkit, and we have these in production.

To give you an example of the sort of power CNTK is driving right now: we currently hold the record for speech recognition, that is, taking audio and transcribing it to text, on the Switchboard data set, if you're familiar with it, with an error rate of 5.1 percent. Again, done with CNTK.

So what is it? It's an open-source deep learning library, available on GitHub, made public in January of 2016 under an MIT license, so it's quite flexible there. And importantly, what we use internally and what's available on GitHub are identical; we actually use a GitHub branch for production purposes, so it's available to everyone. Support is there for Linux and Windows, and there's been a lot of Docker activity recently.

In itself, it's a C++ library leveraging CUDA, with higher-level APIs in C# and Python. I would say Python is probably the most commonly used API here. It lets you run your training jobs on multiple GPUs, multiple servers, the whole lot. It gives you efficient data readers, which I'll talk about in a second, and model compression, that is, you can have binaries put together for the evaluation piece. And it supports Keras, an even higher-level API where the back-end library can be swapped out, and ONNX, an open format for exchanging models between frameworks.

Okay, a very quick introduction to neural networks. They're nothing new; they've been around since the mid-50s. They were basically modeled after the neuron, which we'll get to in a second. But as a whole, it's still machine learning, where the tasks are typically regression, classification, that sort of thing. And here are some tasks, really full-blown tasks: going from voice to text, that is transcription, speech recognition; going from video to a sequence of actions, that is, understanding what is going on in the video, step by step.
Text captioning: passing in an image and having the model generate a caption for you. And even environmental states, similar to what's being done on the video side. By the way, practically all of these are available within those REST APIs, so out of the box, you don't need to build your own CNTK models; you can just use one as a REST API.

So, neurons. Of course, you're familiar: you have all these connections, some magic happens within the cell itself, and there's some output, but they're all heavily interconnected. And we still don't know how they really work, but nevertheless, the neural network is designed after this sort of structure. Ultimately, there's a set of weights being learned. The weights are the strengths, if you will, of the connections with other neurons, and that's what ultimately gets trained. There's also some sort of activation function, here I've used a sigmoid, which wraps up the weighted sum of the inputs and pushes that out to the next neuron.

So the whole model looks something like this. At each step, each layer, you have a linear system: a set of weights, typically a matrix, multiplied by a vector of your inputs, plus some bias. Then all of that is wrapped by some activation function like sigmoid, tanh, or ReLU, which is quite common. Notice that there's no fancy math here: it's matrix multiplication and addition, and ReLU, like the sigmoid, is a very simple function. In reality, that's most of the math there, at least going forward; I'll talk about going backwards in a second. And this is where GPUs come in, because GPUs are really good at matrix operations; all of this activity is ideal for GPUs. (There's a small NumPy sketch of this forward step at the end of this section.)

Now, about parameter sharing. The whole point of deep neural networks is that what's learned from, say, one part of an image can be used for another part of the image, so these parameters don't wind up being completely independent. This is an example of taking an image and doing classification or detection of that flower. Similarly for speech: an audio file can end up as a string, that's the transcription.

So again, it's just that: your fully-connected model on top, a bit of matrix multiplication and addition, and that's it. And then you can get different flavors of it. What's up there is referred to as a fully-connected or dense layer, a dense neural network; I'll show you some code and how simple it is to get one of these up and going. Then there are other flavors. A convolutional neural network is really just a modification of it, and we'll go through a couple of those as well. There are recurrent networks, which have a feedback loop, so you're taking the previous step and incorporating it within your current state. And of course, you just stack these layers one after another, and that's your deep network.

Okay, so some basics. CNTK really gives you all the building blocks to build whatever flavor of deep learning model you like.
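To make that concrete, here is a minimal sketch of that forward step in plain NumPy, a single dense layer: multiply by a weight matrix, add a bias, apply a sigmoid. The sizes here are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical sizes: 4 input features, 3 neurons in this layer
x = np.random.randn(4)       # input vector
W = np.random.randn(3, 4)    # weight matrix: one row of weights per neuron
b = np.random.randn(3)       # bias vector

h = sigmoid(W @ x + b)       # the whole forward step: multiply, add, activate
print(h)                     # three activations, each between 0 and 1
```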
So again, some of the really common ones these days: dense or fully-connected networks; convolutional networks for image processing; recurrent neural networks for any sort of sequence data you have coming in, whether it's audio or text; and long short-term memory, LSTM, another flavor of recurrent network that has actually gained a lot of traction — most speech and translation models use LSTMs. Then you have some of the really interesting ones: generative adversarial networks, where you basically have two agents battling it out and learning from each other. You typically provide them with very little data, if any; for learning chess, say, you just give them the rules and have them play and learn. So GANs, and reinforcement learning, which we just heard about a second ago, are nearly data-less training exercises. And again, CNTK is production-ready: we use it in production, and our production code is on GitHub.

So let's do a very simple two-layer model: a dense, fully-connected, feed-forward neural network with just two layers. You have your input X coming in, a set of weights associated with your inputs, and some bias. Then, on the next layer, whatever the output of this activation is, the sigmoid or ReLU, is used as the input. That's it. And at the end there's what's referred to as a softmax, basically a mapping to probabilities over the number of classes you're looking at.

In the process, there's a cost function, and this is where optimization comes in. The cost is some sort of difference between the expected values, the labels you have, and your predictions. So the flow is to go forward through this process, apply the softmax, and make a prediction. You compare that prediction with the actual label for that specific sample, and there's an associated cost. Then it goes back through the system and fiddles around with the weights, there's a little bit of calculus there, until the cost is minimized. Once it's minimized, you're done.

And here it is in actual code; it's almost the same thing. This is a fully-connected, two-layer neural network. (A sketch along these lines follows below.) What does it look like graphically? You have your inputs coming in: X is your features, Y is your labels. Those come in, they get multiplied by the weights, a bias is added, the activation is applied, and that's your H1 output. That gets multiplied again by the next set of weights, and it all continues until you get to the softmax, your final mapper, if you will. Then your labels go all the way up and are compared with the prediction made. That's the forward process.

On the way back, though, there are a few things to do: a bit of calculus involved in optimizing these weights and biases. Calculus 101, really. So when people say, "I don't want to get into the math of deep learning," it's really basic calculus and basic algebra, and you're done; you're an expert, really.
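The slide code itself isn't reproduced in this transcript, but a two-layer fully-connected network along the lines the speaker describes looks roughly like this in CNTK's Python API. The layer sizes and the random minibatch are hypothetical; note how the softmax and the cross-entropy cost are combined into a single call, and how one `train_minibatch` call runs the whole forward and backward pass.

```python
import cntk as C
import numpy as np

# hypothetical sizes: 784 input features (say, 28x28 pixels), 10 classes
x = C.input_variable(784)
y = C.input_variable(10)

# two dense layers; the output layer is linear because the softmax
# is folded into the loss function below
model = C.layers.Sequential([
    C.layers.Dense(200, activation=C.sigmoid),
    C.layers.Dense(10, activation=None)
])
z = model(x)

loss = C.cross_entropy_with_softmax(z, y)   # softmax + cross-entropy cost
error = C.classification_error(z, y)

learner = C.sgd(z.parameters, lr=C.learning_parameter_schedule(0.1))
trainer = C.Trainer(z, (loss, error), [learner])

# one training step on a random minibatch; the backward pass and the
# weight updates happen inside train_minibatch via automatic differentiation
batch_x = np.random.rand(32, 784).astype(np.float32)
batch_y = np.eye(10, dtype=np.float32)[np.random.randint(10, size=32)]
trainer.train_minibatch({x: batch_x, y: batch_y})
```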
And even the calculus bit is done for you: there's automatic differentiation available. So you just have to build this pipe and you're done. You say go, and you don't have to worry about anything else.

Okay, and then there's parallel training. This is the ability to run your models in parallel, with each server or GPU working on its own set of parameters, and at the very end the weights are shared with each other. We even have some very efficient optimizers for this. There's one called 1-bit Stochastic Gradient Descent, which now has an MIT license; it used to be proprietary, but it just got opened up as of the 15th. It's a very efficient optimizer, and I'll come back to that.

So you can think of a workflow as having three steps. You have a data reader; you have some sort of network of your CPUs, GPUs, and all the connections in between; and at the very end you have a trainer, which runs the optimization, and out pops the model. As for the readers, there are a bunch of them, depending on the actual data you have, whether it's text, video, or audio; we have pre-built readers for those. These are very efficient, parallelized readers, so they can take a chunk of data and move it specifically to an individual server, and they do the randomization for you. They make it really simple to ingest your data and have it prepared for the training process.

Okay, so there's an even higher level. What you saw earlier with the sigmoid and all that was the very low-level way of doing it. In reality, you would just write, say, Dense — that's now my dense layer; the sigmoid of weights times inputs plus bias, all of that is wrapped in Dense — and you say how many neurons are within each layer: 400, then 200, and that's it. That's actually a pretty deep model. And for your actual optimization, again, it's just a one-liner, really: you say, okay, I want to use cross-entropy and, at the end, pop it out as a softmax.

So that's a fully-connected one. Let's do something a little more sophisticated, specifically for image processing: a convolutional neural network. You just say, I have a convolution layer, then max pooling, which is a sort of data-reduction stage, if you will, followed by another such pair, ending with a dense layer. And that's it, that's a convolutional neural network. Let's do a recurrent one, typically used for sequences, whether it's text or audio or whatnot. Here again, you just say, I want an LSTM with 300 nodes; it's very easy to pipe these layers together.

In terms of learners, all the really popular ones are available within CNTK: Stochastic Gradient Descent, Adam is another big one, and 1-bit SGD, which I should have put on this list. So, for an end-to-end scenario, it looks something like this: you have your model on top, here is your loss function, your optimization is done there, and that's it; you just come in and choose your learner, your optimizer, whether it's SGD, Adam, or whatnot. (Sketches of these layer and learner one-liners follow below.)
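Here is a hedged sketch of those layer one-liners in CNTK's layers API: a dense stack, a convolutional network with max pooling, and an LSTM recurrence. The sizes and filter counts are arbitrary choices for illustration, not the ones on the slides.

```python
import cntk as C

# dense / fully-connected stack: each line is one layer
dense_net = C.layers.Sequential([
    C.layers.Dense(400, activation=C.relu),
    C.layers.Dense(200, activation=C.relu),
    C.layers.Dense(10, activation=None)     # linear; softmax lives in the loss
])

# convolutional network: convolution + max pooling (the data-reduction
# stage), repeated, ending with a dense layer
conv_net = C.layers.Sequential([
    C.layers.Convolution2D((5, 5), 32, activation=C.relu, pad=True),
    C.layers.MaxPooling((2, 2), strides=(2, 2)),
    C.layers.Convolution2D((5, 5), 64, activation=C.relu, pad=True),
    C.layers.MaxPooling((2, 2), strides=(2, 2)),
    C.layers.Dense(10, activation=None)
])

# recurrent network: an LSTM with 300 nodes run over the input sequence,
# keeping only the final state for classification
recurrent_net = C.layers.Sequential([
    C.layers.Recurrence(C.layers.LSTM(300)),
    C.sequence.last,
    C.layers.Dense(2, activation=None)
])
```

Applying any of these to a `C.input_variable` (or a `C.sequence.input_variable` for the recurrent one) builds the full computation graph.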
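And for the parallel-training piece, CNTK wraps an ordinary learner in a data-parallel distributed learner; setting `num_quantization_bits=1` gives the 1-bit SGD behavior mentioned above. A minimal sketch, assuming a CNTK build with 1-bit SGD support and an MPI launcher; the tiny model here is just a stand-in.

```python
import cntk as C

# stand-in model and loss; in practice, whatever network you built above
x = C.input_variable(100)
y = C.input_variable(10)
z = C.layers.Dense(10)(x)
loss = C.cross_entropy_with_softmax(z, y)
error = C.classification_error(z, y)

local_learner = C.momentum_sgd(z.parameters,
                               lr=C.learning_parameter_schedule(0.01),
                               momentum=C.momentum_schedule(0.9))

# wrap the local learner; gradients are quantized down to 1 bit before
# the workers exchange them, which keeps communication cheap
distributed_learner = C.train.distributed.data_parallel_distributed_learner(
    local_learner, num_quantization_bits=1)

trainer = C.Trainer(z, (loss, error), [distributed_learner])
# ... feed minibatches as usual; typically launched as
#     mpiexec -n <num_workers> python train.py
C.train.distributed.Communicator.finalize()
```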
Okay, good. Now let me show you how some real-world models look. That was all fun and dandy, very overly simplified models, but let's talk about the real thing. Let's talk about, say, VGG, the very popular model that's been around since 2014. By the way, all the code I'm about to share with you is available within the CNTK repository, under Examples. So VGG-16 looks like this; this is the whole thing. Back in 2014, those guys spent a lot of lines of code putting that model together, and here it is in full. Pretty straightforward: once you have the architecture in mind, it just winds up being a lot of cutting and pasting, really; these lines are all similar.

Another famous architecture out there, the residual network, ResNet, is actually by Microsoft, and it's one of the top contenders for image recognition, on, say, the ImageNet dataset. This one's a little more sophisticated, because there's this extra hop: a residual that's used over and over from the previous step, or I should say, from earlier on within the architecture. So with ResNet, we're actually defining a few blocks of layers and then chaining these blocks together. This one here is ResNet-50. (There's a small sketch of such a residual block at the end of this section.)

And Inception is another famous one, by the folks at Google, sometimes called GoogLeNet. That one is, again, very straightforward. These are the building blocks of Inception; you just loop over them a few times and that's it, you have it.

Okay, so let's talk about some benchmarks. By the way, this is all public; it came out on the 14th. Either follow that link for all the results or just do a quick web search for "deep learning Rosetta Stone" and you'll get the hit. So here we go. These numbers are, for example, for image recognition: the same task of building a VGG-style, 32-bit convolutional network on a specific dataset, on two different types of single GPUs, the NVIDIA K80 and the NVIDIA P100. These are training times, so lower is better. You can see, of course, there's a big difference between the chips, between the K80 and the P100, but as a whole, CNTK is doing pretty well. In this scenario, on the P100, it's faster than TensorFlow, though MXNet and PyTorch wind up being pretty quick too. Those are pretty impressive.

Then feature extraction. This is basically the inference step: you give it an image and it encodes it into a vector of features, which you can then pass on to some standard machine learning algorithm for, say, classification. This is still scoring. Here, for scoring a thousand images on the P100, CNTK took around 1.6 seconds and TensorFlow 1.8, so CNTK wound up being the fastest on this one.

Another example, now with recurrent neural networks: text comments mapped to a sentiment, positive or negative, depending on the tone of the sentence. And here again, CNTK's model training times actually wind up beating everyone.
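Coming back to that "extra hop" in ResNet for a moment, here is a sketch of a basic residual block in the style of the CNTK examples. The helper names `conv_bn` and `resnet_basic_block` are mine, and the real ResNet-50 code in the repository is more elaborate (it also handles blocks that change the number of filters):

```python
import cntk as C

def conv_bn(inp, filter_shape, num_filters, activation=None):
    # convolution followed by batch normalization, the basic ResNet unit
    c = C.layers.Convolution2D(filter_shape, num_filters,
                               pad=True, bias=False)(inp)
    b = C.layers.BatchNormalization(map_rank=1)(c)
    return activation(b) if activation is not None else b

def resnet_basic_block(inp, num_filters):
    # two 3x3 conv+BN units, then the residual "hop": add the block's
    # input straight back in before the final ReLU
    c1 = conv_bn(inp, (3, 3), num_filters, activation=C.relu)
    c2 = conv_bn(c1, (3, 3), num_filters)
    return C.relu(c2 + inp)

# stacking blocks: calling the function repeatedly chains them together
features = C.input_variable((16, 32, 32))   # hypothetical (channels, H, W)
h = resnet_basic_block(features, 16)
h = resnet_basic_block(h, 16)
```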
So the question I often get is why. There are all these other frameworks out there, TensorFlow being a very, very popular one, and people ask, why bother? And yeah, TensorFlow is great; in fact, I personally use it. But for certain tasks, certain toolkits might actually be more optimal. One might be easier to model with and, in the end, might actually be faster. By the way, you see this bit here: Keras, which, as I mentioned, CNTK supports as a higher-level API. Terrible numbers. There's obviously some work that needs to be done on our side in optimizing it; ideally, these Keras numbers should be identical to the CNTK numbers. TensorFlow is doing a better job on Keras integration right now, so we do have some work to be done there. (There's a small example of running Keras on the CNTK back end at the end of this section.)

Okay, so some release notes, since a year ago, since 2.0. Back in May of last year, we started supporting Keras. If you're familiar with Halide, the open-source image processing framework, you can now do convolution layers using that toolkit as well. There's a Java API available for model evaluation; this is alpha, by the way, so don't go and build everything in Java, stick with Python, C++, and C# for now. There were some other releases like cuDNN 6.0 integration, and more importantly, in version 2.2, support for NVIDIA's NCCL, which is basically an MPI-like parallelized communication library.

Then 2.3. ONNX has been supported for a while, and we've actually had a lot of activity on the ONNX side. Again, as I mentioned, Keras is an even higher-level API where you can readily switch the back end between frameworks — you can jump between, say, Theano, TensorFlow, and CNTK — and ONNX lets you exchange models between them. We also started supporting NCCL 2, the next generation of that communication library. Early this year, we added support for the Volta GPUs, NVIDIA's latest and greatest chipset, and for 16-bit floating-point operations.

And just as of the 15th of this month, there are some pretty interesting updates. There's now integration with Intel's math library for even further parallelized computing. The 1-bit stochastic gradient descent learner that I mentioned has now been open-sourced under an MIT license; it used to be proprietary, actually just kept internally for our own purposes, and now it's available to everyone. And there's further distributed-training support for multiple learners, specifically for generative adversarial networks.

Some quick resources for you to catch up on. Check out CNTK.ai; there's a whole model gallery there. I took a screenshot this morning, and as of today, Flappy Bird with Keras was published — reinforcement learning using Python, for example. So you can get started very quickly. This is just a little gallery that lets you filter and search through the different models available, but ultimately, when you click on one of these, it sends you to GitHub. So that's the gallery; you can just start with your favorite programming language and go for it.

There are also a ton of tutorials on GitHub itself, within the CNTK repository. And if you want to get started for free, CNTK is available on Azure Notebooks, the Jupyter notebooks that you just log in to and start using, so there are a bunch of tutorials within the Azure Jupyter Notebooks as well. We have a full four-week MOOC on edX, a proper MOOC, with a ton of people coming in and starting the course together, and what's great is that you actually have Microsoft researchers teaching it. So give that a try.
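As a quick illustration of that back-end switching, here is a small sketch: pointing Keras at CNTK through an environment variable (you can also set `"backend": "cntk"` in `~/.keras/keras.json`), after which ordinary Keras code runs on CNTK unchanged. The model itself is just a placeholder.

```python
import os
os.environ['KERAS_BACKEND'] = 'cntk'   # must be set before importing keras

from keras.models import Sequential
from keras.layers import Dense

# the same code would run on the TensorFlow or Theano back ends;
# only the environment variable (or keras.json) changes
model = Sequential([
    Dense(400, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax')
])
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```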
Now, in terms of actually running things, let's talk about doing some serious work. You'll want to use GPUs for big tasks, because for these sorts of workloads they can be up to 30 times faster than CPUs. On Azure, we have GPU-powered virtual machines supporting the NVIDIA K80, P40, P100, and V100, the Volta chipset, as well.

Another cool thing within Azure: if you go to the Marketplace, you'll quickly find the Data Science Virtual Machine, and there's one specifically for deep learning. These data science VMs are basically VM images with all the popular data science tools, all the software, pre-installed and pre-configured. By the way, all the software is free, even some of the proprietary tools Microsoft has thrown in, but most of the software on there is open source. With the deep learning variant, you get MXNet, CNTK, TensorFlow, Keras, a ton of stuff, actually. So all the software is free, and in terms of cost, it's just the VM size that you pay for.

Then there's Azure Batch AI. So, going back: scaling up means going with bigger and bigger VMs with more and more GPUs; you can ultimately hit four GPUs in one machine. Scaling out means having multiple servers doing your model training, and for that you'd use a tool such as Azure Batch AI. This is basically a framework for distributing your training job across a set of virtual machines, all automated. It's similar to, say, Google's Cloud ML Engine, but unlike that engine, you're not limited to TensorFlow; here, again, you can use your favorite open-source deep learning framework: CNTK, TensorFlow, Chainer, Caffe, Torch. They're all ready to go, and the recipes are out there for running your jobs.

Well, I actually did pretty well on time, and that's it. Please find me on GitHub, Twitter, and LinkedIn under the Ben Sadeghi handle. Questions, please? Okay, thank you very much.

[Audience question:] I attended some presentations where they tried to push the GPU, so I don't quite get the point of introducing this math library. Is this more for in-cloud CPU use?

Correct. So, honestly, for matrix multiplication, nothing's going to beat custom hardware, ASICs if you will, or GPUs. But the Intel math library is quite generic; it has a ton of math functions parallelized. So if you want to do things that really can't be done with just simple matrix multiplication, but you still want to parallelize, that library is pretty good. And Intel themselves are pushing hard within the deep learning space; they have this one library called BigDL, which is open source and basically designed to run on CPUs. I have seen some numbers where a very finely tuned configuration does give you some really nice performance in terms of training. So yeah, it's a battle right now. Intel does see themselves as the underdog in this space, so they're making efforts to come back in. Yeah, they do need to step up their game, as the expression goes.

Anyone else? Okay, thank you very much. Thank you.