Okay, I hope I won't put you to sleep. I'm going to talk about distributed machine learning, but before I start, let me introduce myself. My name is Anand, and I'm the co-founder of a startup called rorodata. We're building a data science platform, a Heroku-like platform where we abstract away the non-data-science components so that data scientists can focus on data science. I also teach advanced Python programming courses, in Bangalore and around India. In a past life I worked at a company called Strand Life Sciences, where we built machine learning algorithms from scratch. These days you can pip install scikit-learn and start doing things pretty quickly, but in those days we had to build everything from scratch, and that's how I got into machine learning. I'm also an open source contributor. After that, I spent a couple of years building a scalable platform.

Now, the motivation for this talk: in traditional machine learning, training takes a really long time, and distributing it over a cluster is very scalable and very effective. The problem is that the existing tools are not so easy to use. So we tried to figure out an easy way to do distributed machine learning.

I really love simple interfaces. Take programming: I'm a big fan of Python, where print("hello world") is all you need to write a hello-world program. Or scikit-learn: take any model, call fit and then predict, and you have your predictions. It's such a simple interface. Now, how do you deploy your machine learning models? I didn't really find anything that simple, so I built a small tool called Firefly: you take your Python model, give the firefly command the module name and function name, and it starts serving a REST API that you can use from outside. It's an open source library that we built; there's a small sketch of it below.

With that done, we started looking at distributed machine learning. We thought, let's experiment and see if it's possible to build something with a similarly simple interface. So we tried some experiments, and this talk is a summary of the state-of-the-art tools available for distributed machine learning, plus the experiments we have done.

If you look at a typical machine learning workflow, you do data preparation and feature extraction, and then you train your models. As part of training you do something called hyperparameter optimization, or grid search: you try to find the best model over a set of parameters that you specify. It goes through all the parameter combinations and gives you the model with the best performance. With a large dataset, that takes a really long time. It's like the old excuse in programming, "my code is compiling", except now it's "my model is training": people kick off grid searches and wait for them to finish. I think it's important to speed this up so that people can do something more productive instead. And it turns out you don't have to do much to parallelize grid search.
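Going back to Firefly for a moment, here is a minimal sketch of the workflow I described. The file name, function name, and pickled model are just examples, and the URL assumes Firefly's default local port.

    # model.py -- expose an ordinary Python function as a service.
    # Assumes a trained model was saved earlier to model.pkl.
    import pickle

    with open("model.pkl", "rb") as f:
        _model = pickle.load(f)

    def predict(features):
        return _model.predict([features]).tolist()

You point the firefly command at the module and function name:

    $ firefly model.predict

and then call it from anywhere as a REST API:

    import firefly
    client = firefly.Client("http://127.0.0.1:8000/")
    client.predict(features=[5.1, 3.5, 1.4, 0.2])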
There are other things that could be parallelized, but grid search seems to be the low-hanging fruit, because you're given a set of parameters and you're building one model for each combination of those parameters. So we thought, let's speed this up, see if it can run on a cluster, and see what that takes.

Before getting into how those things work, let's briefly look at the parallelization patterns that exist, to understand how these systems actually work. There are two kinds: data parallelism and task parallelism.

In data parallelism, you take your input data, slice it into multiple parts, and run the same task concurrently on multiple cores, possibly in a distributed way. So you have one set of input data and multiple processes all doing the same thing on different slices of it. A classic example is the GPU: it runs the same instruction on hundreds or thousands of cores, with each core getting different data. If you've ever done parallel programming, there are libraries called OpenMP and MPI: OpenMP does shared-memory parallelism on a single machine with multiple cores, and MPI, the Message Passing Interface, does the same thing across a cluster. If you look at Apache Spark, its ML algorithms are data parallel: they take the data, slice it, and run it on the cluster. MapReduce is the same: it takes the data, slices it, and runs it on multiple machines.

Now, what is task parallelism? In task parallelism you have different tasks that can run in parallel. They can take the same input data or different data, it doesn't matter; they can run at the same time, on the same machine or across a cluster, and each produces some output. Grid search is a classic example, because you're building a lot of different models with different parameters, and each one of them is an independent task. If you've ever used Celery in Python, that's another example: you have a queue of tasks to be executed, and workers pick tasks from the queue and execute them.

Comparing the two, data parallelism is usually very fine-grained: you have very small micro-tasks. Task parallelism is usually coarser: the tasks can take minutes or even hours.

Now that we know the different parallelism patterns, let's go back to our grid search and see how to parallelize it. Frankly, on one machine you don't have to do anything, because GridSearchCV has an option, n_jobs, that says how many parallel jobs you want to run, and it parallelizes across workers for you; the example below shows this. But the limitation is that it's limited to a single computer.
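Here is a small, self-contained scikit-learn example of that; the dataset and parameter grid are just for illustration.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # 3 x 3 = 9 parameter combinations; one model is trained and
    # cross-validated for each combination -- each an independent task.
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}

    # n_jobs=4 runs up to four of those tasks in parallel, but only
    # on the cores of this one machine.
    search = GridSearchCV(SVC(), param_grid, n_jobs=4)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)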
So if your machine has four cores, you can't go beyond four workers. If you want to go further, is there a way to use multiple computers? That's what we wanted to find out.

Now, why do you want distributed machine learning? The reason is that if you run machine learning on a cluster of computers, you get horizontal scaling. Let's understand the terms horizontal and vertical scaling. Vertical scaling means asking for a bigger machine: you scale up, but there's a limit on how big a machine can get. In horizontal scaling you say, give me more machines: not a bigger machine, but more of the same size. That's much easier to get. When your organization is growing, it's not easy to get one machine of a very large size, but it is easy to get many machines of the same size. So horizontal scaling is good, and we wanted to see if we could do that for machine learning.

There are already available solutions. Apache Spark is a very good tool for distributed ML, but the problem is that you can't use the tools you already use. You can't use scikit-learn, for example; you have to use Spark's own libraries. So if you're already experienced with scikit-learn, you can't bring that to Spark; you have to start learning everything again. There's another library called Dask, which is a really interesting library: it supports multiple backends, threaded, multiprocess, or even distributed, so it's a very flexible computing library. But even with Spark or Dask, distributed computing has a lot of challenges. It requires setting up and managing a cluster of computers, and that's not really a data scientist's job: you need a systems person to set it up, and when you're not running anything the cluster sits idle. So you end up underutilizing resources: you have a cluster set up, but you only use it for a few hours a day. So our question was: is it possible to have a simple interface that data scientists can manage on their own? We wanted to see if that's possible.

So we did some experiments. As part of our work to simplify data science and abstract away the non-data-science parts of a data scientist's workflow, we had already built a complete platform. On the platform you can say, I want to run this program, and it starts running it. You can specify what instance size you want: say, a machine with 8GB or 16GB of RAM. You can also specify which software environment you want to run in, say Python 3.6 or Python 2.7, and it picks the right environment and runs the job there; a hypothetical sketch of this follows. Behind the scenes, it picks an available machine in the cloud, or starts a new one if none is available; it starts a Docker container with the appropriate image; it exposes any ports required, for example if you want to run a notebook it exposes it and gives you a URL; and it manages shared disk space, so if you're running multiple jobs they can have shared storage between them.
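To give a feel for what submitting a job looks like, here is a hypothetical sketch of that kind of interface. The platform_client module and everything in it are made-up names for illustration, not the platform's actual API.

    # Hypothetical client API, invented for illustration only.
    from platform_client import submit_job  # hypothetical library

    job = submit_job(
        command="python train.py",
        instance_size="8GB",      # machine with 8GB of RAM
        runtime="python3.6",      # software environment to run in
    )

    # Behind the scenes, the platform would pick an available cloud
    # machine (or start a new one), launch a Docker container with the
    # right image, expose any required ports, and attach shared
    # storage for the job.
    print(job.id, job.status())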
Let's say, for example, you want to run some training on a 16-core machine: all you have to do is submit your job with that instance size, and there you go, it's running on a 16-core machine with all the cores maxed out. It's a simple abstraction: you run a job, say what instance size you want, and it starts there. We had already built this. Now the question was: given this infrastructure, can we do distributed machine learning on top of it? What does it take? Can a simple interface do this?

First, we built a low-level tool with a familiar interface. Python has a concept called a pool: you can have a pool of threads or a pool of processes. multiprocessing.Pool gives you a pool of processes, and you can map a job onto it. For example, say you want to compute the square of a hundred numbers, or a million numbers: you create a pool, call map, and it gives you the results back by running them on the pool; see the sketch after this section. What we did is create a distributed pool: it starts n jobs in the cluster, they pick up work from the pool and execute it, and the results come back. The interface stays the same.

Now that we had a basic tool for running distributed tasks, we tried the grid search integration. We wrote a small drop-in replacement for GridSearchCV: instead of using GridSearchCV from scikit-learn, you use GridSearchCV from our library, which uses the distributed pool in the backend and starts jobs on the cluster. You don't have to do any cluster setup or anything; you just use this code, you can run it from a notebook, and so on. It creates the jobs, runs them, and you get the results back.

That, I think, is the simplicity I was after. Like I said on the first slide, simple interfaces are really good, like scikit-learn's, so we tried to achieve a very simple interface, and this shows that such a simple interface is possible: no manual setup is required, because everything is automated. It works very well from a notebook, and because machines are started only for the duration of a job, you can save costs without any engineering setup. We're also looking at things like ensemble models, where multiple models are packed into one: when you're training something like that, we want to see if it's possible to train each of those models independently on a cluster, so you get even more scalability, and whether we can extend this distributed machine learning beyond scikit-learn. Our implementation still has a long way to go, but I think we've got the API right: it's a very simple interface, and you can just use it.
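Here is what that looks like with Python's built-in multiprocessing pool, followed by a sketch of the distributed version. The multiprocessing part is standard Python; DistributedPool is a made-up name standing in for the tool described here.

    from multiprocessing import Pool

    def square(n):
        return n * n

    if __name__ == "__main__":
        # Standard library: a pool of four local worker processes.
        with Pool(4) as pool:
            print(pool.map(square, range(100))[:5])  # [0, 1, 4, 9, 16]

    # The distributed version keeps the same map() interface, but each
    # worker is a job started on a cluster machine instead of a local
    # process. `DistributedPool` is hypothetical, shown for shape only:
    #
    #     with DistributedPool(n=10) as pool:
    #         results = pool.map(square, range(100))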
I'm sure the implementation we have can improve, and it will improve over time, but I think we've really got the API right. To summarize: with large datasets, distributed learning is more effective than a single computer, because it can scale out as much as you need. And if you look at the business side, abstracting away the complexity of distributed learning, datasets, workflows, infrastructure, and so on improves time to market a lot: you don't have to wait for months for the cluster and everything to work together. You have a system that abstracts away the infrastructure components and lets you do distributed machine learning with a very, very simple interface. That's all I have; I'm happy to take questions.

[Audience] Hello, this is Parithi from American Express. We have a lot of data to build models on, and this is exactly the kind of framework we're also looking at. My question is: when you're modeling on about a terabyte of data, you need a distributed machine learning framework to build the model in the first place. The problem is that the accuracy of the model has to be maintained when you implement a distributed version of machine learning. In your version, how do you make sure the Gini index or precision is always maintained even when you're using the distributed version?

[Anand] So, Parithi, here we're not really doing anything to the machine learning itself; all we're doing is streamlining and simplifying the infrastructure part. Whether you run the same algorithm on a single machine or on a cluster, we're not touching the algorithms, so you get exactly the same accuracy and performance that you'd get on a single machine.

[Moderator] There's one question. Which side? Right here at the top.

[Audience] I was a little surprised that you didn't mention, for example, TensorFlow or BigDL as choices for implementing distributed machine learning. Any specific reason why you restricted yourself to only Spark and Dask?

[Anand] As I pointed out, we're trying to abstract out the non-data-science components. We want to leave the choice of algorithm to the data scientists: we're not trying to change the way people do data science, but to see how people can run things faster. So whether to use TensorFlow is the data scientist's choice. We started with scikit-learn, and we're trying to extend this to deep learning, so we'll probably look into TensorFlow after this. The key thing is that we're not trying to decide which library is best; the data scientist decides that, and we're trying to make it faster with the tools we're building.

[Moderator] Thank you very much. I'm afraid we don't have time for any more questions; we'll have to take your questions offline. This was only a 20-minute talk, so there isn't a lot of time for questions. I'm so sorry. That was really interesting. Our next talk is a 30-minute talk, which also doesn't have a lot of time for questions, but a little bit more. What do you do when the usual open source suspects like Hadoop and Spark don't work for your project? You could go find another one, or you could build something yourself that works better.