Next talk is Fabio, talking about deep learning on Hops Hadoop. So, hello everyone, I'm Fabio. I work for a company called Logical Clocks. We are a startup in Stockholm, and we are the leading company behind the Hops Hadoop platform. Today we are going to show you how you scale deep learning workloads on our platform, basically. So, a bit of a step back: the project started a couple of years back with HopsFS. HopsFS is basically a fork of Apache HDFS. The main difference between HopsFS and Apache HDFS is that we took the metadata of the file system, which in HDFS is stored in the name node, and we put it in a distributed in-memory database. This allows us to scale both in terms of capacity, so the number of files we can store, and in terms of throughput. Just to be clear, these are not some random vendor numbers; these are published research results, with recognition from both academia and industry. On top of the file system we also have a fork of YARN, which is the Hadoop resource manager and cluster scheduler. We forked it because right now YARN doesn't support scheduling of GPUs; it's coming in one of the next releases, but we already have it in HopsYARN. On top of Hops Hadoop, so on top of HopsFS and HopsYARN, we built an entire platform, and you can find there basically all the software you would expect from a big data platform. But we also want to provide deep learning solutions, so we also have support for TensorFlow, and on top of that we have Jupyter and Zeppelin to write your code. The whole platform is organized in projects; you can think of them as data projects. Users can collaborate on multiple projects, datasets can be shared across different projects, and so on and so forth. Everything is packaged as recipes, so you can build your own applications on top of Hops Hadoop. We just released the 0.3 version, so you can go and check it out; I will have a link at the end of the slides. So, as I said, we want to provide deep learning solutions, and when it comes to deep learning, Python is basically the winner. We let people run deep learning on Hops Hadoop by using conda environments: each project is its own conda environment, so users can install in their project whatever libraries they want, and then they will be able to use those libraries from Spark and from TensorFlow; we will see later that these two are quite connected. Besides that, we developed our own Python library, which is, unsurprisingly, called the hops Python library. This library does a bunch of things; the main ones are that it allows you to do hyperparameter searching for your models and to manage the TensorFlow lifecycle. I will show an example later of how you can do that. Now, you have a platform, you have all the tools, you have maybe some good ideas, but what you are missing is data. For that we developed this tool called Dela; one of our PhD students has been working full time on getting it done. The idea of Dela is that you organize your Hops clusters in a peer-to-peer network, and they can exchange datasets with each other. So if, let's say, you have a new image recognition network that you want to try out, you can go on the network, grab ImageNet, and then you are good to go: you can start training and see how good your network is.
Or, on the other hand, if you have a cool dataset, or you did a cool experiment, you can register your dataset with the network, and then other people can start using it and run more experiments, basically. The nice thing about this is that the protocol is based on UDP, and the idea is that we don't want to impact the production traffic going on in your cluster. The tool monitors the traffic on the network: if the network is idle, it starts transferring the dataset; otherwise it just backs off and lets your production traffic go through. So, you have your platform, you have your data, and then you build your model. Your model will probably have hyperparameters, and you want to find out which combination of hyperparameters is the right one to use. What you can do is hyperparameter searching in parallel with the hops Python library. The idea is the following: you take your TensorFlow model and you put it inside a function. You define some additional values, saying, okay, I want to test these learning rates, I want to test these dropout values. This is actually optional. In this case you say, okay, let's try a grid: let's try 0.001 with 0.45, 0.005 with 0.45, and so on and so forth, in a grid manner. Then you import this TensorFlow launcher, which comes with the hops library, and you say: launch this model with this grid. What this basically does is the following: it uses Spark to go to YARN and request, in this case, six different executors, and it will run your TensorFlow model for you on the cluster, once per combination; a minimal sketch of the pattern is shown below. The nice thing is that Spark can now go to YARN and ask for GPUs, so each of your experiments can run on one or multiple GPUs. Besides that, the TensorFlow launcher will also start a TensorBoard for you, which, for those of you who don't know what TensorBoard is, is basically a nice visualization UI where you can track how your experiments are going. At the end of the execution, let's say you were tracking the accuracy of the model, you will be able to see which of these combinations of hyperparameters gives you the best accuracy, basically.
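The exact hops launcher API is not shown in the talk, so here is a minimal, generic PySpark sketch of the pattern being described: wrap the training code in a function, define a dictionary of candidate hyperparameters, and fan the grid of combinations out over Spark tasks. The helper names and values are illustrative assumptions, not the actual hops library calls.

```python
# Generic PySpark sketch of parallel grid search; the real hops launcher
# automates this fan-out (and GPU allocation) for you.
from itertools import product

from pyspark.sql import SparkSession


def train(learning_rate, dropout):
    """Stand-in for the wrapped TensorFlow training function.

    In a real setup this would build the model, train it, and return the
    metric you care about (for example validation accuracy).
    """
    # Dummy metric so the sketch runs without TensorFlow installed.
    return 1.0 - abs(learning_rate - 0.005) - abs(dropout - 0.45)


if __name__ == "__main__":
    spark = SparkSession.builder.appName("grid-search-sketch").getOrCreate()
    sc = spark.sparkContext

    # Dictionary of candidate values, as described in the talk.
    grid = {"learning_rate": [0.001, 0.005], "dropout": [0.45, 0.5]}

    # Cartesian product: one (learning_rate, dropout) combination per Spark
    # task, roughly what requesting six executors with one GPU each gives you.
    combos = list(product(grid["learning_rate"], grid["dropout"]))
    results = (
        sc.parallelize(combos, numSlices=len(combos))
          .map(lambda c: (c, train(*c)))
          .collect()
    )

    best = max(results, key=lambda kv: kv[1])
    print("best (learning_rate, dropout):", best[0], "metric:", best[1])
    spark.stop()
```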
Now you have your model, you have your data, and so on, and you really want to scale out: for instance, you want to go to distributed training and train on the full-size dataset. When it comes to distributed TensorFlow, distributed TensorFlow is actually designed to be used with a cluster manager: you have to start all the processes, and each process has to know where all the other processes are, and so on and so forth. So you either use Kubernetes or Mesos, or, in our case, as we are in the Hadoop world, we use YARN again, and we exploit basically the same mechanism we were using before with the TensorFlow launcher: we use Spark to distribute the code, allocate the resources, and start the processes for you. We use a tool open-sourced by Yahoo called TensorFlowOnSpark. The idea is that Spark is used, as I said, to request the resources from YARN, to start the processes, and also to distribute your code to all the workers. We actually forked TensorFlowOnSpark, mainly because, again, HopsYARN supports exclusive GPU allocation, so our version of TensorFlowOnSpark can request GPUs for you. And there is another improvement, minor but not really minor: the parameter servers, which are effectively Spark executors, will not get a GPU, because they don't need it, so giving them one would just be a waste of resources. And again, TensorFlowOnSpark will manage the TensorBoard for you. This is how you run everything: again, you wrap all your model code inside this training function. Inside the training function you call this TFNode helper to start the server; this gives you back an object telling you, okay, am I a worker, am I a parameter server, and then you take action accordingly. Then you call this TFCluster helper: you give it the Spark context, the training function, the number of executors, and the number of parameter servers you want; a rough sketch is shown below. If you go to this link, there is an official guide from Yahoo telling you how you can transform normal distributed TensorFlow code into TensorFlowOnSpark code.
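As a reference for that pattern, here is a rough sketch in the style of the classic TensorFlowOnSpark examples; exact signatures may differ between TensorFlowOnSpark versions, and the model code itself is omitted.

```python
# Sketch of the TensorFlowOnSpark pattern: a training function run on every
# executor, plus a TFCluster.run call on the driver that asks Spark/YARN for
# the processes.
from pyspark import SparkContext
from tensorflowonspark import TFCluster, TFNode


def main_fun(argv, ctx):
    """Runs on every Spark executor; ctx says whether we are a worker or a PS."""
    import tensorflow as tf

    cluster, server = TFNode.start_cluster_server(ctx)

    if ctx.job_name == "ps":
        # Parameter servers just serve variables (and get no GPU on Hops).
        server.join()
    else:
        # Workers build the graph and train; model code omitted in this sketch.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % ctx.task_index,
                cluster=cluster)):
            pass  # build model, loss, and train_op here


if __name__ == "__main__":
    sc = SparkContext(appName="tfos-sketch")

    num_executors = 4  # total Spark executors requested from YARN
    num_ps = 1         # how many of them act as parameter servers

    # Positional arguments: (sc, map_fun, tf_args, num_executors, num_ps,
    # tensorboard, input_mode), following the older TensorFlowOnSpark API.
    cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                            True, TFCluster.InputMode.TENSORFLOW)
    cluster.shutdown()
```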
Now, the problem is that when you scale the system out, when you are training with a lot of GPUs, the parameter server architecture doesn't really scale, especially if you have a large model like VGG-16, whose weights are, I think, around 500 megabytes, something like that: the network becomes the bottleneck, because every worker, every GPU, has to communicate with the parameter server at the end of each iteration. So you really can't scale, basically. These are experiments done by Uber. Uber wrote this tool called Horovod, and they actually built on top of work done at Baidu. The idea is that you don't have a centralized architecture; instead you organize your workers in a ring. It is, again, data parallelism, so each worker gets its own batch of the data you want to train on, and when the work is done, the gradient updates are distributed to all the workers using the ring all-reduce algorithm. Just a quick recap of how the ring all-reduce algorithm works: say you have three GPUs, and you have your gradients, and you divide them into three chunks because you have three GPUs. Then, simultaneously, GPU 1 sends to GPU 2, GPU 2 sends to GPU 3, and GPU 3 sends to GPU 1. You do that twice, and at the end of those first two rounds, each GPU holds one fully reduced chunk of the gradients, the chunk it has to transmit in the final phase. You do another two rounds, and every GPU has the complete updated gradients for the next iteration, basically. But, as I said, this is a synchronous protocol, so before the next iteration, all the transmissions have to be completed. This means you want your GPUs to be as homogeneous as possible, and your network to be as homogeneous as possible. Let's say you have ten links, nine InfiniBand and one Ethernet: then you are basically wasting the InfiniBand, because all the workers will wait until the transfer between the two workers connected through Ethernet has finished. One optimization you can have in this system is the following: when you backpropagate the errors through a neural network, the gradient updates for the last layers, the outer layers so to say, are available much earlier than the gradient updates for the first layers. So you can start transmitting the gradient updates for the last layers before you have actually computed the gradients for the first layers, and you get more performance, basically. So this is how you use it: there is full support for Horovod on Hops Hadoop. The idea is the following: you write a Jupyter notebook containing all your code. Again, you have to do some minor modifications to your code. One thing is that you have to feed your optimizer into this Horovod distributed optimizer, and you have to tell each worker its position in the ring, so it knows to which worker it sends the gradients and from which worker it will receive them; a minimal sketch follows below. You wrap everything in a notebook, so to say, and then from another notebook you call this allreduce launch function, giving it the Spark session and the notebook path. This will transform your notebook into a Python program, execute it and train, and show you the TensorFlow logging, basically.
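A minimal sketch of the two Horovod-specific changes mentioned above, in the TensorFlow 1.x / Horovod style of that era: pin each process to a GPU by its local rank, and wrap the optimizer in hvd.DistributedOptimizer. The Hops notebook launcher that actually starts the worker processes is not shown, and the model here is a placeholder.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # learn this process's rank / position in the ring

# Pin each worker to one GPU using its local rank on the machine.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build your model as usual (placeholder loss here just to keep it short).
x = tf.placeholder(tf.float32, shape=[None, 10])
loss = tf.reduce_mean(tf.square(x))

# Wrap the optimizer: gradients are averaged with ring all-reduce each step.
opt = tf.train.AdamOptimizer(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Make sure all workers start from the same initial weights.
hooks = [hvd.BroadcastGlobalVariablesHook(root_rank=0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    pass  # feed batches and run train_op here
```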
I do have a quick demo of the whole platform. Alex, how much time do I have left? Okay. Which one is it? No, no, no. Do you know which... Yeah, I know, there are like four of them. No, this one. Oh, we can open another one. Can you just put it together? Yeah. It's in full screen, right? Ah, okay. So this is our production cluster with the platform installed. What you can do, for instance, is create a new project. Let's call it... And the idea is that, well, we don't need these services: this one, this one, and this one. Can I go out of full screen? Yes. Control-minus. Yeah, okay. Awesome. So now we can, for instance, look for datasets. We have a bunch of datasets up in the cluster. For instance, we can look for... how do... okay, QuickDraw. This one is only about 200 gigabytes. We can add this dataset to our project. This is going to be fast, actually immediate, just because the data is on the same cluster, so there is no transfer to do. But if you were on a different cluster and you were getting it through the dataset service, you would be able to see the progress of the transfer, basically. So, once you have your data there, what you can do is go into the Jupyter notebooks. You say: let's do TensorFlow. You say, well, I'm going to use six Spark executors for the experiments; each Spark executor will have one GPU and a bunch of memory, and you can start. The first thing you want to do is maybe prepare the data. The data, for instance, in the QuickDraw case, is basically JSON entries. And the thing is, as you're using TensorFlow, you're using maybe expensive GPUs, so you want to minimize the time you actually spend doing preprocessing while training. So what you can do, and in this case I'm using Spark, is convert these JSONs into a dataset of TFRecords, so it's ready for actual training. Then we can go to this one, which is basically the hyperparameter searching notebook. As I said, down here you define a dictionary of possible parameters that you want to test; these parameters are then provided to your training function. Then you say, okay, these are the model parameters, and you can test different dropout values and different learning rates. We can actually run this notebook; it will take a while to start up. So this one is the TensorFlowOnSpark notebook. Here we ingest the data: you're reading the TFRecords datasets with the dataset API, which is a really cool TensorFlow feature. This is the model. And down here we have the call I was talking about before, saying, okay, tell me if I am a parameter server or not; if so, just wait, otherwise start the process and start training. Here is me not being really a pro at writing TensorFlow code, it's really ugly, but let's just skip over it. And this is how you basically run it. This helper comes from the hops library: you say, okay, this is the number of executors you ask for, and this is the number of parameter servers you ask for, and you start the training. You say, okay, this is the number of executors again, this says I want you to start TensorBoard for me, and this is the input mode; you will find more documentation online. And finally we have the Horovod notebook. Again, same call as before, ingestion of the data, and here is the thing I was telling you about: you have to specify in which position each worker is, basically saying, okay, this is my local rank, so to say. This is the notebook that actually contains all the information and the code itself, and then you have this other notebook that actually launches the whole training. And potentially we can see... yeah, you can see here, for instance, these six experiments I launched before for the hyperparameter searching: TensorBoard is up and is showing some graphs, basically. So, I have to wrap up. Yeah. So that's it from my side. As I said, we released the 0.3 version of the platform. You can go to this link and get a virtual machine, a VirtualBox image, so you can try it out. We have documentation for everything I said at this link. Please star us on GitHub if you want, and follow us on Twitter. Thank you. Thank you, Fabio. Time for questions. Thank you for the presentation. How do the tools scale with the number of GPUs? Come again? How do the tools scale with the number of GPUs? Do we get a... This one, this is the Horovod thing. Yeah, it's Horovod that manages the... Yeah, the idea, if you look back at this one: the orange is the normal parameter server architecture for TensorFlow, and the light blue and the dark blue are Horovod; the light blue is using TCP, and the darker blue is using RDMA. Okay, thank you. Any other questions? Going once. Yes. Thanks for the presentation. Would it be suitable to use this kind of approach to run on a single machine with multiple GPUs on it? Or would you advise using it more to run on multiple machines? Okay, we are actually using, for instance, one single machine with like ten GPUs in it, running Horovod. But yes, you can potentially scale to multiple machines. I think there was a paper, I think in June, from Facebook, where they got the training of ImageNet down to around, I think, less than an hour, I don't remember exactly the time, and they were using the same approach, I think over 256 workers, so 256 GPUs, spread at that point over multiple machines, basically. Okay, thanks. One more? No? Okay.
Thank you, Fabio.