Let's get started. Good afternoon, and thanks for coming. My name is Yang Tang; I'm a maintainer and SIG I/O lead for the TensorFlow project. My co-speaker, Yuan, is a member of Kubeflow and an approver and reviewer for Kubeflow's MPI operator. He's also a maintainer of several other open source projects. Unfortunately, Yuan is not able to attend this talk because of a schedule conflict, so I'm going to cover his part as well.

In today's talk, I'm going to discuss large-scale distributed learning on Kubernetes clusters. I'll briefly talk about TensorFlow, then about Kubernetes operators, and then about Kubeflow, which is the main focus of this talk, because the operators we'll look at, the TF operator, the PyTorch operator, and the MPI operator, all belong to the Kubeflow organization. So let's get started.

Before we jump into deep learning on Kubernetes, I want to briefly go over the background of TensorFlow. As many of you already know, TensorFlow is one of the most popular open source frameworks for machine learning. It was originally developed by Google and has been open source since 2015. Since then, TensorFlow has seen explosive growth in both the open source and machine learning communities. At the moment, the latest stable version of TensorFlow is 1.14, which was released several days ago, but TensorFlow 2.0 is coming and will be released soon.

I want to discuss some of the changes in TensorFlow 2.0, because a lot has changed, and that has quite an impact on many of the areas we'll cover in this talk. In TensorFlow 2.0, the biggest change is obviously eager execution. Back in TensorFlow 1.x, a static graph was executed: instead of running your program in an imperative way, you had to construct the graph first and wait for the whole graph to run, and only once it finished did you get the result. That is very good for production deployment, but it has the disadvantage that debugging is very hard, because you do not get results immediately. So in TensorFlow 2.0, the biggest change to the execution model is that eager execution is enabled by default: you can run TensorFlow just like a native Python program.

Another big change in TensorFlow 2.0 is that Keras is essentially the recommended high-level API, and the other model-building APIs are pretty much dropped. In TensorFlow 1.x, people criticized the fact that there were several different ways to do high-level model building: you could use tf.estimator, you could use tf.keras, you could use tf.layers, and you also had tf.slim. That's four different ways to build models. In TensorFlow 2.0, the decision was that tf.keras will be the recommended API, and going forward the other methods will be deprecated fairly soon.

If you look at TensorFlow 2.0's workflow, you'll notice that everything starts with data input and preprocessing, which is handled by tf.data. tf.data is a pipeline that takes data from other sources and imports it into TensorFlow's graph. This is one of the most important steps, and it will be one of the focuses of this talk, because data processing is very closely tied to the acceleration of deep learning. The second step, after the data has been imported into TensorFlow's graph, is model building. Here we're talking about tf.keras and tf.estimator, although tf.keras is the recommended way.
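Going back to the eager-execution change for a moment, here is a minimal sketch of the difference, assuming TensorFlow 2.x; this is illustrative, not code from the talk's slides.

```python
import tensorflow as tf

# TensorFlow 1.x style (only reachable via the compat module in 2.x):
# you would first build a static graph and then run it inside a session,
# so intermediate values were not visible until the whole graph executed.
#
#   with tf.compat.v1.Session() as sess:
#       print(sess.run(y))
#
# TensorFlow 2.x: eager execution is on by default, so ops run immediately
# and return concrete values you can print and debug like normal Python.
a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
print(tf.matmul(a, b))  # prints tf.Tensor([[11.]], shape=(1, 1), dtype=float32) right away
```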
For training, you now have eager execution, and you can also use tf.function to pre-define a small graph so that execution can be sped up. Finally, for training and inference, you can use SavedModel to save the model you've built and load it back later. So that's the TensorFlow 2.0 workflow. If you look at it, you'll notice that data input and preprocessing and model building are the steps that have to happen before you do the training. So people always ask: okay, what's so special about data input and model building, and how do you apply that when you have a big cluster?

That brings us to the topic of acceleration for deep learning. When we talk about acceleration in deep learning, it has a very specific meaning. Most importantly, acceleration for deep learning always comes back to I/O, CPU, and GPU. We all know that if you want to run a deep learning task, you're going to deal with I/O, meaning input/output, and most likely that will be network input, because data already sitting on local disk can be loaded into memory pretty easily. The second piece is the CPU. The CPU is still very much needed for deep learning, because not all operations are feasible on a GPU; some operations are just naturally not a good fit for it, so you still need the CPU for certain operations, and the CPU is also typically what interacts with the I/O. And finally you have the GPU, where the heavy lifting is done. The biggest problem is that the GPU is the most expensive piece, and if the deep learning job is not scheduled nicely, you're going to waste a lot of GPU time, and that's a big waste of your investment.

If you look at the diagram on the top of the slide, you can see what happens with naive batch processing: you take the data from the input, it's passed to the CPU for preprocessing, and then it's passed to the GPU for training and inference. As you can see, until the data reaches the GPU, your GPU sits idle and you waste a lot of resources. On the bottom is the optimized version of the utilization and scheduling. In this case, you try to build an I/O pipeline so that the data is delivered to the GPU as soon as possible, and once it's on the GPU, most of the time you can do the training and the I/O in parallel, so that you fully utilize your GPU. That's the efficiency we're talking about when we talk about acceleration.

There are several different approaches to distributed deep learning. The most common one is the so-called parameter server model, and for TensorFlow that has been in place almost since the beginning. The parameter server model essentially defines a parameter server, which could be a subset of servers dedicated to storing parameters, and worker nodes, which are GPU-heavy. The workflow in the parameter server model is that each worker node collects a subset of the data and calculates gradients. The partial gradients are passed to the parameter server; when the parameter server has collected all the gradients, it does the calculation and gets the updated weights, and the weights are fed back to the worker nodes to do the update. This process keeps repeating. The parameter server model, like I mentioned, has been in place for quite a few years, and people have invested a lot of money and a lot of resources to improve it.
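To make the I/O-pipeline optimization above concrete, here is a minimal tf.data sketch, assuming TensorFlow 2.x; the file name and parse function are placeholders of my own, not something shown in the talk.

```python
import tensorflow as tf

def parse_and_preprocess(record):
    # CPU-side work (decoding, augmentation, casting) would go here.
    return record

dataset = (
    tf.data.TFRecordDataset(["train-00000.tfrecord"])       # hypothetical input file
    .map(parse_and_preprocess,
         num_parallel_calls=tf.data.experimental.AUTOTUNE)  # preprocess on several CPU threads
    .batch(1000)
    .prefetch(tf.data.experimental.AUTOTUNE)                # prepare the next batch while the GPU trains on the current one
)

# Passing this dataset to model.fit(...) keeps the GPU busy instead of waiting on I/O.
```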
But there is some debate about whether the parameter server is the best way to handle distributed deep learning. The biggest concern, and there has been some discussion about this, is that the parameter server model is probably best suited to scenarios where all your compute power sits on the same machine, so that memory is shared by the different GPUs. But in distributed deep learning, your GPUs are actually scattered around; they could be on different hosts, and they have to talk to a parameter server. When they talk to the parameter server, they have to go through a network transfer, and a huge amount of data is moved back and forth between the parameter server and the worker nodes, which can be a really big burden on your training. But anyway, in the old days of TensorFlow, the parameter server was pretty much the only model you could run for distributed deep learning.

Recently, people have been asking whether we can use some other way to do distributed deep learning, and most people actually resort to the MPI model. MPI has been around for distributed parallel processing for a long time, and there are several ways in MPI to do the calculation. For the reduce operation, MPI has a couple of models: one is reduce, the other is allreduce. I'll show an example of reduce first.

So what is reduce? If you look at the graph, it shows exactly how it works: each node does a partial calculation. In this graph we're doing a summation: you start with the bottom nodes, do the summation at each node, and at each step you reduce once, until you reach the root node. Shown step by step, it looks like this. The final step is that you reach the root node, which holds the summation over all nodes, and that's 28. That's reduce.

MPI also has another operation, the so-called allreduce. So what is allreduce? Allreduce is essentially the same as reduce, except that once the summation has been calculated and has reached the root node, instead of just stopping there, you go back and broadcast the result to every node, so every node knows the result. Looking at it this way: you do the summation, you start with the bottom nodes, one and two, that's three, and plus five, that's eight; you do another summation, that's another node updated; and with another summation you reach the root node. Once you reach the root node, the result, 28, is broadcast to every node. So it looks like that. That's the so-called allreduce. To sum up, allreduce is reduce plus broadcast, and many people familiar with MPI believe it's a more efficient way of doing distributed calculation.

So let's go back and revisit the parameter server, because like I said, in TensorFlow the parameter server model was the standard way of doing distributed deep learning for a long time. The debate is that parallelizing on a single machine and parallelizing across a cluster might be different things. Many people believe that the parameter server model is only suitable for parallelism on a single machine, not for parallelizing tasks across a cluster, and it's somewhat controversial. The biggest burden for the parameter server is that the cross-device communication cost is very high. But, for better or worse, a huge amount of effort has been invested in it over the years.
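To make the reduce-versus-allreduce example above concrete, here is a minimal sketch using mpi4py, which is my own choice of MPI binding; the talk doesn't name one. With 8 processes each contributing its rank, the sum is 0 + 1 + ... + 7 = 28, matching the number in the example.

```python
# Run with, for example: mpirun -np 8 python allreduce_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
value = comm.Get_rank()  # each process contributes its rank: 0, 1, ..., 7

# reduce: only the root (rank 0) ends up holding 28; other ranks get None.
total_at_root = comm.reduce(value, op=MPI.SUM, root=0)

# allreduce: reduce plus broadcast, so every rank ends up holding 28.
total_everywhere = comm.allreduce(value, op=MPI.SUM)

print(f"rank {comm.Get_rank()}: reduce={total_at_root}, allreduce={total_everywhere}")
```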
So, for now, you pretty much have to take that as it is. I've spent a lot of time discussing the parameter server, mostly because if you look at the existing machine learning orchestration frameworks, they actually assume that you use a parameter server, and that's not necessarily the best approach. That makes choosing the best so-called deep learning orchestration framework kind of difficult, because whatever framework you choose, it may not exactly fit the scenario you want to achieve; in this field, a lot of things change very frequently. With many of these frameworks, you'll probably notice that they cannot even run with the new version of TensorFlow, and that's going to be a big issue.

Anyway, if we look at orchestration for deep learning, we notice several things. If we summarize the diagrams we just went through, the parameter server model and the reduce and allreduce models, they all require stateful metadata and they all require lifecycle management. That brings us to the point where we feel we can probably use Kubernetes for orchestration, and when we talk about Kubernetes, we also think it probably makes sense to have Kubernetes operators for machine learning as well, because operators are just a natural fit for managing stateful data and managing a lifecycle.

So let's move on to Kubernetes operators. In this talk, I'm going to briefly cover several operators in the Kubeflow organization: the TF operator, the PyTorch operator, and the MPI operator. The PyTorch operator obviously focuses on PyTorch, which may be in some competition with TensorFlow, but I want to cover that part as well. The TF operator is obviously tied to TensorFlow; it supports TensorFlow, and it also supports TensorFlow's tf.distribute strategies. That makes it slightly better positioned, in that tf.distribute is relatively new and actually supports different strategies and backends, like MPI and NCCL, the parameter server, and the TPU model. The TPU is the chip designed by Google, the Tensor Processing Unit. The PyTorch operator focuses on torch.distributed, and it also has support for MPI and NCCL. Horovod, developed by Uber, actually has wider support for different frameworks, but I want to point out that some of its distribution strategies may not exactly fit the newer versions of TensorFlow.

Anyway, here are the TF operator and MPI operator specs as they would be deployed with Kubeflow. As you can see, they are very similar. The differences are that, first of all, you have a different kind of job, a TFJob versus an MPIJob; the spec is either a TF replica spec or an MPI replica spec; and the command line is slightly different. Other than that, it's pretty much the same, and the container images are the same.

So let's look at a TensorFlow program; let's start with a plain TensorFlow program. How do we write one? You install TensorFlow, and you can also install TensorFlow I/O, which gives you additional tf.data Dataset support. The first line, of course, loads the data into a dataset; that's the first line, where the dataset is the MNIST dataset. The dataset then goes through some transformation, which is necessary preprocessing so that your dataset is in float32 data types.
You also batch with a size of 1000, because batching is how you get your data to fit into the GPU in one pass. If you pass the data one record at a time, it's not very efficient, because the data transfer from the CPU, or from memory, to the GPU is also very expensive.

Next is model building. As we mentioned, in TensorFlow 2.0 everything has been consolidated into Keras. So you build a model, and it's going to be a Sequential model. The first layer is a Flatten layer, and then you have a Dense layer, which is a standard layer. The next layer is a Dropout layer. Dropout is a technique to prevent overfitting. In this model you have a dropout with a rate of 20%, which essentially means that during each step of training, for every node there is a chance the node will be dropped, and that chance is 20%. This helps prevent the overfitting problem you normally encounter when you build a neural network. The last layer is a Dense layer as well. So as you can see, in TensorFlow 2.0 everything has been simplified. Once you've built the model, the next step is to compile it; that's one line. After that is model.fit, which essentially runs the training, with the dataset passed in. The last line, as you can see, is evaluate; that's actually the inference phase. You can do that, or you can optionally skip that step if you're only focusing on training. So far, this program just builds a model and runs it on plain TensorFlow.

Now let's talk about distribution, which of course is the focus of this talk. How do we do distributed training? There are several strategies. One strategy TensorFlow now has is MirroredStrategy, and a lot has actually been consolidated around it. All you need to do is define a scope with MirroredStrategy's scope(), and the model building and compilation inside that scope get distributed automatically, which seems nice, right?

Beyond MirroredStrategy, I also want to mention that in the old days, when you did distributed training, you had other tools available to help you, for example Horovod. If you combine TensorFlow with Horovod, it's pretty much similar, except that you have to pass different options and use callbacks to achieve the result. One thing I want to point out is that this example uses tf.train, and tf.train's optimizers have been deprecated in TensorFlow 2.0 as well. So unfortunately, like I said, a lot of things are changing with TensorFlow 2.0, which actually makes things a little bit messy. But that's also a good thing, because it gives you an idea of how fast this area is moving forward every day. Think about 1.0 and how recent that still is, just a few years ago, and all of a sudden we jump to 2.0 and a lot of things change: the parameter server model, tf.train being replaced and deprecated, even tf.estimator, which used to be the default way of building a model for many programs, is no longer the recommended path in 2.0.

Anyway, let's do a comparison. As you can see, TensorFlow 2.0 versus Horovod are pretty much the same: you build a similar model when you do the distribution, and because it's Python, a lot of the details have been abstracted and hidden. You can see they're trying to make your life easier. We'll also cover a little bit of PyTorch plus Horovod.
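Putting those pieces together, here is a minimal sketch of an MNIST model trained under MirroredStrategy, assuming TensorFlow 2.x. It is illustrative rather than the exact slide code: it loads MNIST through tf.keras.datasets instead of TensorFlow I/O, and the layer sizes are my own choices.

```python
import tensorflow as tf

# Load MNIST, cast to float32, and batch -- the preprocessing steps described above.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
train = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
         .map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
         .batch(1000))

# Everything created inside the strategy scope is mirrored across the local GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),   # drop 20% of units each step to curb overfitting
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(train, epochs=5)                        # training
model.evaluate(tf.data.Dataset.from_tensor_slices(
    (tf.cast(x_test, tf.float32) / 255.0, y_test)).batch(1000))   # optional evaluation/inference
```

The Horovod variant keeps the same Keras model but drops the strategy scope: you call hvd.init(), wrap the optimizer in hvd.DistributedOptimizer, and add a broadcast callback, which is the "different options and callbacks" difference mentioned above.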
One thing about Horovod is that even if it's not exactly the first choice for TensorFlow 2.0, it still has real usage with MPI and with PyTorch. You can see the two variants behave slightly differently, but they are pretty similar as well. That's one of the positive sides of a framework that can support several different open source libraries.

Let's go back to the TF operator versus the MPI operator. They are also very similar. As you can see, they abstract away all the details, so even the YAML specifications for running the Kubernetes operators are very much alike, except for the last command line, which is slightly different.

As we discussed, this field is moving forward very fast; every day a lot of things change. Actually, if you install, for example, the TensorFlow nightly builds, you'll probably notice that whatever you're running may not work anymore after a month. And within several months TensorFlow 2.0 will be released, and there will be further change. So it's really hard to conclude what the best solution to pick is at the moment, because whatever it is, you'll have to make sure your orchestration framework can support the underlying software at its latest stable version, right?

So what's the conclusion at this point? The conclusion is that if you want to focus on deep learning acceleration, first of all you have to focus on fully utilizing your GPU, because that's the essential piece; otherwise it's probably not worth spending a lot of effort on. Another piece is the ongoing debate about whether you want a framework that supports all platforms and all frameworks, or a framework dedicated to one particular thing. It's really hard to say. Like I said, initially TensorFlow only had the parameter server model for distributed training. Later on we got the mirrored strategy, and allreduce-based strategies were also introduced in tf.distribute, which actually simplified a lot of things. I think that's something you want to take into consideration when you try to come up with a solution for so-called AIOps, or machine learning on Kubernetes, because whatever you build may not be feasible after maybe just several months. I see a lot of people smiling, thinking, okay, so what exactly are you trying to deliver? Unfortunately, I think that's just the status quo of the distributed machine learning field: it's too tied to the deep learning framework, and it's probably hard to come up with a stable solution at the moment. If you lock in a solution today, I'm sorry, but in several months you'll probably realize it cannot even run some of the latest versions.

But one thing I would like to recommend, since a lot of people ask what exactly you should do: I think Kubeflow could be a framework you can try. There are several reasons. First of all, Kubeflow is very actively developed, which means the framework will try to catch up as soon as possible. Secondly, Kubeflow has a very big community; as far as I know, Kubeflow has a lot of developers actually contributing to it, and they try to make adjustments as soon as possible. So if an upstream framework like TensorFlow makes a change, Kubeflow tends to catch up very quickly. Yeah, I think that's pretty much it for today's talk.
Another thing I want to mention is that Kubeflow also has shared APIs and best practices across operators. They have a common, standardized API spec and a base job controller interface. They consolidate those common API specs and common utility functions into one repo, but they have different ways of deploying the different frameworks: they have the TF operator, the PyTorch operator, and they also have the MPI operator. That's pretty much it for today's talk. Any questions?

Regarding model serving, what's your recommended technology, Seldon or others?

Okay, that's another interesting topic. If you're talking about serving models, of course you have TF Serving. TF Serving came out of Google as well, but I think it's also falling a bit behind with 2.0.

Falling behind? It's not compatible with 2.0 models?

In 2.0, a lot of things have changed. Remember, in TensorFlow 2.0, first of all you have eager execution enabled by default. Secondly, in the past, with 1.x, a lot of the frameworks used tf.estimator as the default; now that is being phased out. Of course, TensorFlow still supports tf.estimator, but that doesn't mean it's a first-class citizen anymore, to be honest. So I can only say that a lot of things will change pretty soon for TF Serving as well.

Also, TensorFlow 2.0 has another concept, the so-called tf.function. A tf.function is essentially a small subgraph that helps you optimize. Remember, in 2.0 the static graph has been replaced by eager execution, and in eager execution every step runs imperatively, which can actually slow down training dramatically. So one way to improve that is to define a small function, which is essentially a subgraph, or a small graph, that TensorFlow can then optimize. That gives you the advantage of graph optimization while still giving you the eager execution capability that lets you debug. So with tf.function, let's say you're writing a model: you can write it in eager mode, and once things have stabilized and you decide your model is pretty much finalized, you can convert it into a tf.function to speed it up. That's another thing that's going to help you. Like I said, a lot of things are changing in 2.0, but I think those are good changes, because, as I mentioned, for model building you have Keras as the consolidated API, so you don't need to think about which of the four different ways is the best way to build a model, and eager execution really helps a lot for debugging purposes.

Any other questions? Okay, thanks everyone for coming here. I have one more day.
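For reference, here is a minimal sketch of the tf.function idea discussed in that answer, assuming TensorFlow 2.x; the function and tensor shapes are illustrative, not code from the talk.

```python
import tensorflow as tf

def dense_step(x, w):
    # Plain eager function: runs op by op, easy to inspect and debug.
    return tf.nn.relu(tf.matmul(x, w))

# Wrapping it in tf.function traces it into a small graph, so repeated calls
# can benefit from graph-level optimization while the rest of the program stays eager.
fast_step = tf.function(dense_step)

x = tf.random.normal([32, 64])
w = tf.random.normal([64, 10])
print(fast_step(x, w).shape)  # (32, 10), same result as the eager dense_step(x, w)
```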