Has anyone done a presentation on R yet? Any R presentations? Really? Oh, that's terrible. Let's talk a bit about R. I talked about Python last time I was here, so I'll talk about R this time. I'm from New Zealand, living in Singapore now, but obviously I haven't got a strong Singaporean accent yet. Lah. New Zealand is of course the birthplace of the R programming language — really the birthplace of the modern R language; it was originally born as the S language in the US. So I thought I'd come and talk a bit about doing large-scale machine learning. We have a service called Azure Batch, and we have these things called low-priority VMs that are super cheap, so even if you've got a really nice workstation at home, there are some cool things you can do with the cloud in terms of large-scale training.

There are typically two scenarios when you want to do what is really high-performance compute. The first is what we'd call embarrassingly parallel: every single one of my tasks can run completely independently of the others. They can go off, they can run, and I can aggregate the results at the end. We've had some discussion of these today, right? Simulations are a great example of an embarrassingly parallel workload, so really I'm only bounded by my credit card limit in terms of how far I can scale the workload up. Transcoding, rendering. We're going to talk shortly about things like hyperparameter optimization and cross-validation. Think about doing a k-fold cross-validation on a model — everybody familiar with k-fold cross-validation? Instead of having a single hold-out set, we run the model five or ten times, holding out a different 20% each time. (Is this mic even on? It is? Ah, I thought it was just there to make me look silly.) K-fold cross-validation is embarrassingly parallel, right? There is absolutely no reason to sit there twiddling your thumbs while you run every single one of your folds on your local workstation.

That's the easy scenario. The harder scenario you could call tightly coupled: situations where all of the nodes participating in the high-performance compute problem need some intermediate exchange of information during the compute process. Did anybody come to my session here last time, where we talked about large-scale deep neural network training? That's the canonical example of having to keep something updated as you go: literally every time we run a mini-batch, we need to update the weights across the entire cluster. In that situation, a compute problem basically turns into a networking problem. Once you start to scale that large-scale deep neural network problem, you are ultimately limited by how quickly you can transfer weights or gradients between GPUs, and across the network between GPUs on multiple nodes.

I talked about the second one last time; today I'm going to talk about the first, and we're going to use R. And then often you'll get hybrid scenarios where some of the work can go on in parallel — for example, I'm doing a k-fold cross-validation to perform hyperparameter optimization on a neural net that I'm training across multiple GPUs on multiple machines.
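To make the embarrassingly parallel claim concrete, here's a minimal sketch of a parallel k-fold cross-validation in R with foreach. The dataset and model (mtcars, a linear model) are just stand-ins, and %dopar% runs on whatever parallel backend happens to be registered — local cores now, Azure Batch later:

```r
# Minimal sketch: each fold is an independent task, so any registered
# foreach backend can run them wherever it likes.
library(foreach)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign rows to folds

cv_mse <- foreach(i = 1:k, .combine = c) %dopar% {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])  # train on k-1 folds
  held <- mtcars[folds == i, ]                            # hold out fold i
  mean((predict(fit, held) - held$mpg)^2)                 # fold-level error
}
mean(cv_mse)  # aggregate the results at the end
```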
So typically the way you'd handle that is to localize the deep neural net training to a single node — four to eight GPUs — but run multiple nodes, each node running one fold of the cross-validation, and then run multiple sets of cross-validations, each cross-validation doing hyperparameter optimization. As you can imagine, you're using quite a lot of compute, so it's nice for it not to be too expensive.

We have this thing called Azure Batch. Azure Batch is basically an HPC orchestration layer that sits on top of compute in the Azure cloud — a way of managing really large-scale distributed jobs, either stateful MPI-type jobs or stateless embarrassingly parallel sorts of things. It gets used for lots of different things. We talked last time about Python and doing stateful stuff with TensorFlow and Horovod; this time we'll talk a little about using it with R. Does anybody know the R language? Yeah, see, heaps of you. Why have we not talked about R today? It's bias, right? They didn't know I was going to talk about R today — I think if I'd said it, they'd probably have told me to just stay on the plane.

Fundamentally this is about working in parallel, so some of the key things you might want to do are trying different network designs, hyperparameter tuning, cross-validation, those sorts of things. We have good support in Batch for multi-GPU setups and MPI-type scenarios, and probably the most unique thing about the machines in our cloud is that we have InfiniBand in them. As I said earlier, if you start doing really big-scale deep neural network training, fundamentally it's not about who's got the fastest GPUs; it's about who's got the fastest networking between machines. There are some benefits I'll talk about a little later around the way GPUs and InfiniBand work together. The TL;DR is: if you're going to build your own machine learning cluster, InfiniBand is great, and unfortunately, if you're trying to optimize across multiple machines, you can't get away with using GTX cards, because our friends at Nvidia cripple the GeForce cards so they don't work with GPUDirect, which is required for the high-performance path from GPU through to the InfiniBand network adapter. Anyway, I'm not going to talk about deep neural network training today.

A few concepts around Batch. Ultimately this is about how we map an algorithm out to a cluster of machines. When you work with Batch, you start up one or more pools of machines, and those pools can contain both standard and low-priority VMs. Low-priority VMs are anywhere from about a 60 to 90 percent discount off standard. To give you a gauge, our NC24s v3 machines have four Volta GPUs — four V100s — and they're about 14 bucks an hour on normal on-demand pricing, and a buck forty an hour on low priority. That's still not cheap enough to arbitrage mining Bitcoin or Ethereum, because I tried — I did the math on that — but it is a really, really cheap way of getting a lot of GPU compute power.
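As a hypothetical sketch of that hybrid layout: the outer foreach loop is what the registered backend farms out across nodes, and train_one_fold() below is a dummy stand-in for the single-node, multi-GPU neural net training that would run on each node's four to eight local GPUs:

```r
# Hypothetical sketch of the hybrid layout. The outer loop is farmed out
# across nodes by the registered backend; train_one_fold() stands in for
# single-node training on that node's local GPUs.
library(foreach)

train_one_fold <- function(lr, fold) {
  runif(1)  # stand-in: replace with real multi-GPU training on this node
}

grid <- expand.grid(lr = c(1e-2, 1e-3, 1e-4), fold = 1:5)  # hyperparams x folds

results <- foreach(i = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  p <- grid[i, ]
  data.frame(lr = p$lr, fold = p$fold, acc = train_one_fold(p$lr, p$fold))
}

results[which.max(results$acc), ]  # best hyperparameter setting across the grid
```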
Batch was built around this idea of job distribution; basically it's a queue and job distribution mechanism, and that's really important for these highly parallelizable workloads, because you literally want to treat every job as fungible — you can just spread the jobs around all over the cluster. There are some benefits with that in terms of being able to run heterogeneous clusters, which is good for these embarrassingly parallel workloads; no good if you're training deep neural nets, where you want a homogeneous cluster of same-sized machines. So basically you create jobs, and within a job there are one or more tasks. Batch will manage meting those tasks out around the cluster, and in the case of low-priority VMs — which can get killed at any time if we want the capacity back; that's the quid pro quo for giving you a 90% discount — if we shoot the VM, Batch is going to start another one and replay the tasks that weren't able to complete. So it's reasonably fire-and-forget. There are strategies for dealing with those preemptible VMs, like running multiple pools, running pools of different-sized machines, keeping good track of which machine types get shot most frequently, and maybe changing machine types to minimize the amount of downtime. The thing is, once you're into pools of hundreds or thousands of cores, you're going to get stuff shot — but stuff's going to fall over anyway, so you might as well be durable to it. You want to have some sort of stateful storage backend and push and pull state out of durable storage, because these machines are ultimately not durable (there's a small sketch of that pattern below). I won't go through this in detail, but I'll make sure the slides go up on a website somewhere afterwards — yes? Awesome, so these will go up.

Key things to call out: it's cross-platform, so obviously it can run anything you want, but my strong, strong guidance is to run Linux, and not Windows, unless you really need to run Windows workloads. The reason is that while we discount the virtual machine time, we don't discount the Windows license. So suddenly you've got this dollar-forty Volta machine that's using a whole lot of cores, and you're paying like two bucks for a Windows license and a buck forty for your GPUs. So unless you really need Windows — you know, you've got some crazy Dassault Systèmes simulation environment and it only runs on Windows — run Linux. Pick your poison: Ubuntu, Red Hat, CentOS, whatever.

The low-priority VMs are the key to this, and I even pulled up the pricing page just before, for shits and giggles — go have a look at the pricing page. These are pretty quick machines: 64 vCPUs, 64 cents an hour on low priority. With these embarrassingly parallel workloads you don't necessarily need to run the biggest machine every time, because if you're chunking the workload up into nice small discrete units, you might be better off running 32 D2s rather than one D64; it just means you've got a little more resilience if we start shooting machines. So it's a pretty sophisticated job orchestration engine that's tightly coupled to the way Azure works, which means it really understands how to optimize job distribution around the data center, and things like that. As I said, low-priority VMs are the key.
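Here's that durable-state idea as a small sketch — the mounted results path and the workload function are assumptions for illustration, not anything Batch-specific:

```r
# Sketch of a preemption-tolerant task. Results land in durable storage keyed
# by task id, so when Batch replays a task after a VM is reclaimed, finished
# work is reused instead of recomputed. "/mnt/results" is an assumed Azure
# Files / blobfuse mount; expensive_computation() is a dummy stand-in.
expensive_computation <- function(input) {
  Sys.sleep(5)  # pretend this is the real work
  sum(input)
}

run_task <- function(task_id, input) {
  out <- file.path("/mnt/results", paste0(task_id, ".rds"))
  if (file.exists(out)) return(readRDS(out))  # replayed task: pull state back
  res <- expensive_computation(input)
  saveRDS(res, out)                           # push state to durable storage
  res
}
```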
Built on top of that we have a thing called Batch Shipyard. It's on GitHub, and it's basically a dockerization framework on top of Azure Batch: as long as you can dockerize your workload, it makes it super easy to get stuff running in Batch with complex machine configurations — complex machine configurations that are still reasonably lightweight. The common default fallback is to stand up your Batch cluster using the Azure Data Science VM base image, which is basically a Windows or Linux VM loaded to the gills: latest Anaconda, latest TensorFlow, latest bloody everything. That's great, except it's kind of ginormous, which means when machines get shot it takes a long time to stand them back up again. For a really big workload you want to be aggressive about optimizing your images to get as fast a startup as possible, and Batch Shipyard lets you do that. By default we have all the underlying support, including nvidia-docker, to make it really easy to run dockerized deep learning workflows. So that's good both for MPI-type scenarios — TensorFlow, CNTK, blah blah blah — and for highly parallelizable scenarios.

The one I'm going to talk about today, seeing as nobody has talked about R yet, is a thing called doAzureParallel — again, up on GitHub with a bunch of examples. doAzureParallel is a parallel backend that uses Azure Batch as the backend execution engine. The way R works for parallelization is that you register a parallel backend, and then a number of other libraries — things like the foreach library and other parallel lapply-style tools — use that backend to distribute their work. doAzureParallel is a parallel backend that's able to distribute work out across Azure Batch. I had a demo, but I got off the plane and it wasn't working; that's okay, because I can at least show you the code. (Is that visible? Need it bigger at the back? Yep, they nod. Tools, Options, for those of you sitting there — this is R, our fantastic programming language, we have this little thing going on. Let's try that. Big enough? Okay.)

So, pretty simple. Make sure you've got devtools installed, because we pull this stuff off GitHub. We have an R library called rAzureBatch, which is a library for working with Batch clusters natively — there will be situations where you just want to work with Batch directly — and then this thing called doAzureParallel, which is just like doSNOW and the other parallel backends. Install both of those, load the libraries, and configure some credentials. The credentials are pretty simple — so simple that if you take a photograph of that slide, you'll be able to log in, use my Batch account, and stand up all sorts of crazy stuff. I'll make sure we remove this after the video. We basically configure a Batch account, with a key and a URL to access it, and a storage account, again with a key. I'm going to flush those keys as soon as we're done here, so don't get any ideas — and there's a little bit of entropy cut off the end that you'd have to work through anyway.

Then we specify our cluster configuration. In this case I've specified that I want my pool to use D2 v2 machines and low-priority machines only: three nodes of D2 v2, so my entire pool is low priority. Those D2s are going to give me a couple of CPU cores each, and they're not going to cost me a lot.
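In code, the install-and-credentials step looks roughly like this — a minimal sketch following the doAzureParallel README; the helper functions shown are the ones documented there, and the account names and keys go into the generated JSON file rather than your script:

```r
# Pull both libraries off GitHub -- this is the documented install path.
library(devtools)
install_github("Azure/rAzureBatch")
install_github("Azure/doAzureParallel")

library(doAzureParallel)

# Writes a credentials template: fill in your Batch account name/key/URL and
# your storage account name/key (and optionally a GitHub access token for
# installing packages out of private repos).
generateCredentialsConfig("credentials.json")
setCredentials("credentials.json")
```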
A D2 v3 is going to cost me about two cents an hour, and they're reasonably chunky CPU cores — Xeon E5-2673 v4s — so probably as good as what most people have on their workstation; certainly nicer CPU cores than my workstation at home. We can specify the pool to autoscale. In this case autoscale isn't going to do diddly squat, because we've said min size and max size are the same, but we could configure a min size of one and a max size of 30, and based on queue length Azure Batch will automatically stand up and tear down machines, depending on how fast the jobs are executing and how quickly we're flushing the queue. We then specify a Docker container, and it's just going to pull that container directly off Docker Hub — piece of cake: push your container to Docker Hub, run the thing, Bob's your auntie's live-in lover, off you go. Then we specify the R packages we want installed. It can install from CRAN, it can install from Bioconductor, it can install from GitHub — and actually it can also install from a private GitHub repo; you need to provide a GitHub access token in the credentials, which I haven't done, but that's there if you want to install out of private repos or private Docker registries. So: load everything and create our cluster, and it boots the cluster up. That call is synchronous, so it waits until the whole cluster is up before it starts dishing out jobs.

From there, it's literally just a case of running any old R code that uses a parallel backend. This demo uses the caret machine learning meta-framework. Anybody use caret? Read the book by Max Kuhn? Great book, great library. It's basically a meta-framework for doing things like cross-validation, data encoding, and so on, and we can use it for cross-validation and hyperparameter optimization of, say, a random forest, a GBM, and so forth. Load some data, do a split, and create a fit control. The fit control is how we tell caret what we want it to do: in this case we're telling it to do a repeated cross-validation, ten times with ten folds, and to do a random search across a hyperparameter space. allowParallel = TRUE basically tells it to use the parallel backend, and we've registered doAzureParallel as our parallel backend. So what's going to happen when I execute this on my local machine is that it pushes these tasks out to Azure and executes them on the Azure backend.

One note: if you've got a big dataset, you're probably not going to want to drive this from your local machine into the cloud. You probably want a two-step process where you've got some sort of workstation running in the cloud, and you kick the whole process off up there, so you're not shipping your data over the worldwide internet. And then literally it's just a case of running train — which is the bit that's not working for me at the moment — and train uses the parallel backend to execute your work. Once you're finished, stop the cluster; that tears everything back down and gets you back to spending zero dollars. Putting the pieces together, it looks roughly like the sketch below.
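Here's the rest of the walkthrough as a hedged sketch. The cluster JSON fields follow the doAzureParallel config format as I recall it from the repo's examples, so double-check them there; iris and a random forest are stand-ins for the demo's data and model:

```r
# End-to-end sketch, assuming credentials.json was set up as above.
library(doAzureParallel)

generateClusterConfig("cluster.json")
# Then edit cluster.json along these lines (schema per the doAzureParallel repo):
# {
#   "name": "r-demo-pool",
#   "vmSize": "Standard_D2_v2",
#   "maxTasksPerNode": 1,
#   "poolSize": {
#     "dedicatedNodes":   { "min": 0, "max": 0 },
#     "lowPriorityNodes": { "min": 3, "max": 3 },
#     "autoscaleFormula": "QUEUE"
#   },
#   "containerImage": "rocker/tidyverse:latest",
#   "rPackages": { "cran": ["caret", "randomForest"], "github": [], "bioconductor": [] }
# }

cluster <- makeCluster("cluster.json")   # synchronous: waits for the pool
registerDoAzureParallel(cluster)         # foreach now runs on Azure Batch
getDoParWorkers()

# Repeated 10-fold cross-validation with a random hyperparameter search;
# the workers need randomForest, hence its entry in rPackages above.
library(caret)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                           search = "random", allowParallel = TRUE)
model <- train(Species ~ ., data = iris, method = "rf",
               trControl = fitControl, tuneLength = 5)

stopCluster(cluster)  # tear everything down: back to spending zero dollars
```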
Thoughts? Interesting? Anybody doing this already? Anybody got a nice cluster at work? No? I mean, the nice thing about this is that you don't pay anything unless you're using it, and you can have as much as you want, as long as you've got a good credit card attached to it. So that's doAzureParallel — and again, anything that you can parallelize in R, you can spread across a cluster using doAzureParallel.

Then we have this thing called Batch AI, which is basically a deep-learning-optimized service built on top of Batch. This is what I showed last time, and I won't go into it in a whole lot of detail, but basically, if you're doing distributed deep learning, Batch AI gives you a bunch of simple Python and command-line APIs to execute your deep neural network training at scale. There's a recording on the FOSSASIA website where I talk through using a tool called Horovod to train TensorFlow models. It's basically a distributed optimizer for TensorFlow, so if you need to run big TensorFlow models, it's a good way to go.

There are lots of data storage options, and again it's going to be a case of optimizing your approach to data storage based on things like how long your tasks run: you'll need to decide whether you want to pull data local, or read it off a remote store every time. There's local disk. We have a thing called Azure Files, which is basically a file share built on top of Azure Storage. We have a thing called Azure Blobs — for those familiar with AWS, it's basically like S3 — and we have a FUSE driver for that, so if you're using Linux you can mount a blob container straight into the filesystem, which is kind of sweet. We've also got managed NFS. And for really big scenarios, you may want to take advantage of the fact that Batch Shipyard supports standing up parallel file systems — Lustre, GlusterFS, and so forth. Pretty much mount your file share and get after it.

I'm a big fan of getting my data into my fastest data store first. For example, if I'm training a deep learning model, I am not going to read mini-batches over the network — and you shouldn't either, because all you'll do is choke on file I/O and leave your GPU sitting there idle, and while I'd love to take your money, it's a bad idea. So please pull your data local. Especially if you're running a big machine with lots of RAM, it can be a good idea to pull the data down and stick it in a RAM disk, because again, you want to be as optimized as possible for getting data through to the CPU or GPU. There's a tiny sketch of that pull-it-local pattern below.
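That pull-it-local pattern, as a tiny sketch — the blobfuse mount path is an assumption, and /dev/shm is the standard Linux RAM-backed tmpfs mentioned above:

```r
# Sketch: copy training data onto fast local storage once, rather than
# re-reading it over the network every pass. "/mnt/blob" is an assumed
# blobfuse mount; /dev/shm acts as the RAM disk.
remote <- "/mnt/blob/train.csv"
local  <- file.path("/dev/shm", basename(remote))

if (!file.exists(local)) file.copy(remote, local)  # pull local once
train_df <- read.csv(local)                        # all reads now hit RAM
```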
How am I doing for time? Two minutes? Okay, let's talk quickly about hardware in the cloud. Tip number one: GPUDirect is really important. GPUDirect is the mechanism by which your GPU can talk directly to your InfiniBand adapter. If you don't have an InfiniBand adapter, well, that's kind of shitty; but if you do have one, this is really important — and unfortunately it means you have to be using the Tesla-class GPUs from Nvidia, because they disable it on the GTX cards. In terms of the GPUs running in our cloud, Volta is currently our nicest GPU, and the biggest machine we have is four V100s with a 56-gigabit InfiniBand adapter in it. I've already spoken a bit about the spare-capacity VMs, which are really the Azure Batch story.

And then some final tips and tricks. Point number one: use Linux, and that really is so you're getting full advantage of the low-priority VM prices, because we don't discount the Windows license. Really the only reason to be running Windows for your HPC workloads in Azure is if you need a Windows operating system for your third-party software vendor to support you; otherwise you're much better off running Linux. Then just make sure that you're fully utilizing all the hardware you have: optimize your task size and your job size in Batch to take full advantage, and optimize your data pipeline so you're not doing things like data prep while you've got an expensive cluster stood up. For the tightly coupled workloads, scale up first — biggest CPU, biggest GPU — then out to more GPUs in a single machine, and only then out to multiple nodes. Don't find yourself running half a dozen NC6 machines with a single GPU in each; stand up machines with more GPUs before you break out into running multiple machines. And as I said, the job and cluster management tooling makes it really easy. So if you're doing anything with R, and particularly if you're going to use something like mlr or caret or one of those meta-learning frameworks, this gives you really cheap and easy access to a compute cluster for things like grid searches. If you like to overfit to the Kaggle public leaderboard, this is a really excellent way of doing it, because you can tune the crap out of your hyperparameters and be the most overfit model on the public leaderboard. Sorry, that's an inside joke for the Kagglers in the room.

Hope that was useful. God, I'm tired now, guys. Really good to talk to you. I'm Chris from Microsoft; I run an engineering team here in Singapore, where we basically build cool shit with customers, and I am hiring deep learning and machine learning people at the moment — three open heads — so if that sounds like you, find me on LinkedIn (really easy to find) and hit me up. Cheers. Questions, queries, thoughts, comments? Everybody's blown away? Did I talk slowly enough? Yeah, this dude's got one.

[Audience] This covers the embarrassingly parallel case, but say I don't want that — say I want to do one really big regression, because I have a lot of data?

So, you want to do a really big regression because you've got lots of data. My first port of call is just to run a really, really big machine. Literally, if you want to do that, the easiest way is to run a really big machine; both ourselves and Amazon have really big machines — big machines, I'm talking a couple of terabytes of RAM — so you can load your dataset into RAM and process a couple of terabytes in memory. If that's still too hard, there are external-memory libraries for R. There are a bunch of free and open-source ones, and we have a commercial one that's part of the Microsoft R Server product — we bought the technology from a company called Revolution Analytics — which gives you an external-memory library and a bunch of machine learning libraries that take advantage of it. So basically the data stays on disk, and we stream it through the machine learning algorithm. Does that help to answer the question? Yeah? Thanks guys.
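To make the external-memory idea concrete, here's a hedged sketch using the open-source biglm package, where the data stays on disk and streams through the estimator chunk by chunk. The file name, column names, and chunk size are made up for illustration; the commercial equivalent the talk mentions would be RevoScaleR's rxLinMod over an XDF file:

```r
# Sketch of an out-of-core linear regression with biglm: read the CSV in
# chunks from a connection and fold each chunk into the running fit, so the
# full dataset never has to sit in RAM at once.
library(biglm)

con   <- file("big.csv", open = "r")
chunk <- read.csv(con, nrows = 100000)        # first chunk (includes header)
fit   <- biglm(y ~ x1 + x2, data = chunk)     # initialize the model
cols  <- names(chunk)

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 100000, col.names = cols),
    error = function(e) NULL                  # connection exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                   # stream the next chunk through
}
close(con)
summary(fit)
```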