Hi, thank you for taking the time to join my session, Productionizing ML with MLOps and Cloud AI. I'm Kaz Sato, a developer advocate on the Google Cloud team. In this session, I'd like to discuss these topics. First, I will introduce what MLOps is, and then I will discuss how you can build a scalable machine learning development platform, that is, one that can be shared by the ops team and the data scientists. Then I will discuss the gaps between a POC system and a production system, why it is important to establish data and model validation practices, and machine learning lifecycle management.

So what is MLOps? I started talking about MLOps about three years ago. The reason I started discussing this topic is that I learned so much about operating machine learning systems from a senior machine learning engineer at Google, who has been supporting one of the largest machine learning systems at Google for over 12 or 13 years. He said that launching a machine learning system is the easiest part; operating it is the hardest part, because the real problems with a machine learning system are found while you continuously operate it over the long term. So we thought the practices and learnings of DevOps could also be applied to machine learning systems. DevOps is all about unifying software development and software operation by advocating automation and monitoring at all steps of software construction. Likewise, you can introduce the same practices to machine learning systems: it is all about unifying machine learning system development and machine learning system operation by utilizing automation and monitoring at all steps of machine learning system development.

The topics I will cover in this session are based on these publications from Google. The first publication is "Machine Learning: The High-Interest Credit Card of Technical Debt". This paper was written by machine learning engineers and software engineers at Google. Another important publication is "Rules of Machine Learning: Best Practices for ML Engineering". I'd like to pick up some of the important learnings and practices from these publications.

So let's move on to the next topic: how you can build a scalable machine learning development environment, starting with an anti-pattern called "depending on the ML superhero". I have seen this anti-pattern at a couple of customers in the past. Large enterprises and large organizations usually have a small number of ML superheroes. They can be an ML researcher who reads all the machine learning papers and understands what is happening inside the machine learning models, but they can also write the Python or Java code to build the data warehouse systems, the ETL systems, or the serving infrastructure for the production machine learning system. They can also act as a product manager who defines the machine learning system requirements and specifications and discusses the benefits and downsides of those machine learning systems with stakeholders and business executives. And they do all of that from the POC phase to the production phase. These are great people; the ML superhero is awesome, but it doesn't scale, especially for large enterprises and organizations, because there are only a couple of such people who can do everything.
Instead, if you really want to drive scalable machine learning and AI adoption in an enterprise, you have to build a scalable team with multiple roles, such as data engineers, data scientists, ML engineers, and so on, so that you can split the responsibilities across those roles. For example, data engineers should be responsible for ingesting the training data, running ETL against the data warehouse, and setting up large-scale preprocessing of the training data, whereas the data scientists and ML engineers can focus on designing the optimal machine learning model that achieves the highest accuracy. The ops engineers can be responsible for deploying the machine learning model and for operating a large-scale, robust, highly available serving infrastructure with continuous training and continuous monitoring in place.

Let's think about a scenario where a data scientist visits the ops team's cubicle to ask them to build a serving system for his machine learning model. What problems would they face? One anti-pattern you want to avoid is the black box problem. This is not the black box problem of the machine learning model itself; I'm talking about a black box problem that can happen in the production machine learning system, where the researcher understands what is happening inside the machine learning model but does not understand what is happening in the production infrastructure. Likewise, the ops engineer does not understand what is happening in the machine learning model, but does understand what is happening in the infrastructure. So there is a black box: nobody understands everything, from the behavior of the machine learning model to the behavior of the serving system. You want to avoid this kind of pitfall. The ML technical debt paper says that at Google, a hybrid research approach where engineers and researchers are embedded together on the same teams has helped reduce this source of friction significantly.

Let's think about another scenario, where the ops engineer is thinking about how to build a good machine learning platform for development and deployment that can be shared by the data scientists and the ops team. The solution could be to use cloud services, and these are the reasons why we introduced the product called AI Platform as a Google Cloud service. AI Platform provides multiple components and modules that can be shared by multiple stakeholders and roles, such as data scientists, ops teams, machine learning engineers, data engineers, and so on. These components and functionalities can also be easily integrated with the other data products on Google Cloud, such as BigQuery, Dataflow, Dataproc, and so on. Another product I'd like to introduce here is TensorFlow Enterprise, which is a commercial distribution of TensorFlow that is fully optimized and supported on Google Cloud Platform.

To understand the benefits of TensorFlow Enterprise and AI Platform, let's think about a scenario where a data scientist wants one more GPU server for his training jobs. He wants a GPU server with everything installed, such as the CUDA drivers, and he wants it as soon as possible because he wants to finish his training job within a couple of days. But it's really hard for the ops team to take on that kind of responsibility.
Procuring GPU servers and providing technical support for them does not scale, because there are many challenges. For example, on-premises GPU servers are expensive, and procurement usually takes a couple of weeks. Also, each data scientist would spend their time installing the CUDA drivers, TensorFlow, PyTorch, and pretty much everything else from scratch, so those environments end up unmanaged by the ops team or other technical professionals. There can also be a problem of over-provisioning of GPU resources: maybe one data scientist needs eight GPUs right now, while another data scientist has unused GPU servers sitting forgotten because the project finished a couple of weeks ago. From the ops team's standpoint, it is really hard to provide technical support for these unmanaged servers and resources.

Instead, one solution is to use TensorFlow Enterprise on Google Cloud, especially the Deep Learning VM, or DLVM. The DLVM is a standardized VM image for data scientists. It is a virtual machine image that includes all the everyday tools used by data scientists, such as TensorFlow 1.15, 2.1, or 2.3, the CUDA drivers, NumPy, scikit-learn, Jupyter notebooks, PyTorch, and so on. So data scientists and ops teams don't have to spend their time installing that software from scratch. And because it is a VM image, it can be easily instantiated with Google Compute Engine or AI Platform Notebooks. You can create not only standard GPU instances but also preemptible GPU instances from those virtual machine images. Preemptible GPUs are low-priority GPUs with a much better cost-performance ratio. A preemptible GPU instance may be shut down a couple of times a day, but if you are using those GPUs for batch-style training jobs, that is not a big problem, and the benefit is the cost: preemptible GPUs cost about one-third as much as standard GPUs.

Let's take a look at a short demonstration video of the DLVM. To start using the DLVM, you just open the Cloud Console and search for the Deep Learning VM on the Marketplace, where you can easily find the DLVM images and create a Google Compute Engine instance from one of them. You specify the TensorFlow version and how many CPUs or GPUs you want to use, and that's it. In this case, we specify the number of CPUs and push the create button, and within a couple of minutes you get a virtual machine instance with the whole software stack a data scientist needs.

Another product I'd like to introduce here is AI Platform Training. This is a managed infrastructure for large-scale distributed training. Again, all the software stacks are already installed, so you don't have to spend much time installing software or building a large-scale cluster with Kubernetes Engine and so on. Instead, you can just press a couple of buttons to invoke a training job on AI Platform Training, and you will see a couple of GPUs, or tens of GPUs, running as a distributed training job to finish your large-scale training in a very short time. It also includes a state-of-the-art automated hyperparameter tuner that was originally used for internal machine learning engineering use cases at Google.
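To make that concrete, here is a minimal sketch of how a training job with a hyperparameter tuning spec could be submitted to AI Platform Training from Python through its v1 REST API client. The project, bucket, trainer package, and the learning_rate parameter are hypothetical placeholders, not the demo shown in the session.

```python
from googleapiclient import discovery

# Hypothetical project and bucket names; replace with your own.
PROJECT = 'projects/my-gcp-project'

job_spec = {
    'jobId': 'taxi_trainer_001',
    'trainingInput': {
        'region': 'us-central1',
        'scaleTier': 'BASIC_GPU',                      # a single GPU worker
        'packageUris': ['gs://my-bucket/packages/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',                # entry point of the training package
        'runtimeVersion': '2.3',
        'pythonVersion': '3.7',
        'hyperparameters': {                           # built-in hyperparameter tuner
            'goal': 'MAXIMIZE',
            'hyperparameterMetricTag': 'accuracy',     # metric reported by the trainer code
            'maxTrials': 20,
            'maxParallelTrials': 2,
            'params': [{
                'parameterName': 'learning_rate',      # hypothetical tunable parameter
                'type': 'DOUBLE',
                'minValue': 0.0001,
                'maxValue': 0.1,
                'scaleType': 'UNIT_LOG_SCALE',
            }],
        },
    },
}

# Submit the job; AI Platform Training provisions the GPUs, runs the trials,
# and tears everything down when the job finishes.
ml = discovery.build('ml', 'v1')
response = ml.projects().jobs().create(parent=PROJECT, body=job_spec).execute()
print(response)
```

The same job can be launched from the Cloud Console or the gcloud CLI; the point is that the data scientist never has to provision or patch the underlying GPU machines.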
So you get Google's own state-of-the-art hyperparameter tuner for finding the best hyperparameters for your machine learning models. Those were the problems you can solve on the development side. What about the problems you will find when productionizing the serving system? There are many gaps between POC and production systems. Think about the portability and scalability of machine learning systems. For example, a data scientist might ask the ops team whether they can support 10 million users with the prototype code. Usually you cannot do that, because prototype code is typically written without thinking about scalability or portability; you may have to add maybe 100,000 lines of code to it for scalability and portability. There are many gaps to fill between POC and production systems, such as dependencies on a local environment or on a particular data scientist, whereas in a production system you have to make everything portable and shareable. You also cannot rely on a single notebook instance. Instead, you always have to have multiple instances: in some cases 10 instances, or maybe hundreds of instances or containers, orchestrated with auto-scaling and load balancers, because you may be getting tens or hundreds of incoming prediction requests from other services or consumers. Also, nothing should depend on a fragile manual process; everything should be reproducible and automated as an end-to-end lifecycle. Lastly, you have to think about monitoring, because it is a production system: you need a monitoring and alerting system.

One product I'd like to introduce here is AI Platform Prediction. This is a managed service for online or batch prediction. The largest benefit of the prediction service is scalability. It provides an auto-scaling feature, so you don't have to work out how many instances you need to handle the current workload; it automatically adds more instances, or reduces the number of instances, based on the traffic. Everything can also be logged into BigQuery, the data warehouse, for continuous monitoring. It also supports the T4 GPU, which is one of the best GPUs for prediction workloads. Everything runs on GKE, Google Kubernetes Engine, but you don't have to spend the time you would spend operating GKE yourself, because it is a managed service: it runs on GKE, but all you have to do is click a few buttons on the Google Cloud Console, and that's it.

Let's take a look at one example where a data scientist and an ops engineer work together to build a scalable prediction platform. You can get started with the usual Jupyter notebook, as you can see on the screen right now. In this video, the data scientist uses the usual Jupyter notebook environment for training a scikit-learn machine learning model. Using BigQuery SQL, you can execute a query on BigQuery, download the result set, convert it to a pandas DataFrame, and build a training pipeline with scikit-learn. As usual, you call the fit function to train the model, and then try out some predictions with the trained model. This is the usual scikit-learn usage.
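As a rough sketch of that notebook workflow, the snippet below queries the public Chicago taxi dataset from BigQuery into a pandas DataFrame and fits a small scikit-learn pipeline. The query, features, and label are illustrative placeholders rather than the exact demo code.

```python
from google.cloud import bigquery
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Run a SQL query on BigQuery and pull the result set into a pandas DataFrame.
client = bigquery.Client()
df = client.query("""
    SELECT trip_miles, trip_seconds, fare, payment_type
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE trip_miles > 0 AND fare > 0
    LIMIT 100000
""").to_dataframe()

# Illustrative task: predict whether a trip was paid in cash.
X = df[['trip_miles', 'trip_seconds', 'fare']].fillna(0)
y = (df['payment_type'] == 'Cash')

# The usual scikit-learn usage: build a pipeline, fit it, then try a prediction.
model = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print(model.predict(X.head()))
```

From here, the trained model would typically be exported, for example with joblib, which is where the hand-off to the ops team begins.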
But the difference is that you can upload the trained model to Google Cloud Storage so that the ops team can share the same model and deploy it to AI Platform Prediction, and now you have full-fledged, fully scalable, production-ready infrastructure serving exactly the same scikit-learn model. As you can see, both the ops team and the data scientists can share the same notebook.

Another product I'd like to introduce here is the Deep Learning Containers of TensorFlow Enterprise. Deep Learning Containers share the same standardized software stack as the DLVM, but the difference is that this is a container image for data scientists. Again, it has TensorFlow, the CUDA drivers, NumPy, and so on pre-installed, but everything is in a container rather than a VM, so you can easily deploy these containers to a container runtime environment such as Google Kubernetes Engine, AI Platform Training, or AI Platform Prediction, and get all the benefits of the container environment, such as scheduling and lifecycle management of each instance or pod. What if your instances fail? GKE takes care of rebooting the pods or instances automatically. What if you are getting thousands or millions of requests per second? In that case, you can use the Google Kubernetes Engine load balancer, so that each serving or prediction request is dispatched to the appropriate pod for serving. Logging, monitoring, troubleshooting, and security are all problems that have to be solved for a production system, but you don't have to spend much time solving them by yourself, and you don't have to write hundreds of thousands of lines of code to solve them; everything is already encapsulated and provided by the container environment of Google Cloud.

This is an example system diagram of how you can combine those components into one solution. For example, you can use BigQuery for preprocessing or ingesting data from the data warehouse, combined maybe with Cloud Dataproc for ETL, and then you can use AI Platform Training for training the machine learning model at large scale. Once you have the trained model, you can upload it to Google Cloud Storage and use it with AI Platform Prediction for serving.

And what about performance? Maybe the data scientist tests with 10K, that is 10,000, rows, but usually a data scientist's POC code is not optimized for handling production-level training data, and the ops team has to think about that. This is where you get the benefits of TensorFlow Enterprise, because with TensorFlow Enterprise you get optimal performance from the Google Cloud Storage and BigQuery readers. For example, when you are reading training data through the Google Cloud Storage reader, you don't have to change anything in your existing TensorFlow code. As you can see on this slide, you keep using the usual tf.data Dataset API with TensorFlow. The only difference is that you specify a gs:// URI for reading data from Google Cloud Storage, so that the TensorFlow Enterprise binary can detect it and get the best possible performance when reading the training data from Google Cloud Storage into the TensorFlow training runtime.
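To illustrate how little the input code changes, here is a minimal tf.data sketch that reads TFRecord files from a hypothetical gs:// bucket; TensorFlow Enterprise applies its optimized Cloud Storage reader behind this same, unchanged API.

```python
import tensorflow as tf

# Hypothetical bucket and path; the only GCS-specific detail is the gs:// URI.
filenames = tf.io.gfile.glob('gs://my-bucket/training-data/part-*.tfrecord')

# A standard tf.data input pipeline; nothing here is specific to TensorFlow Enterprise.
dataset = (
    tf.data.TFRecordDataset(filenames,
                            num_parallel_reads=tf.data.experimental.AUTOTUNE)
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

for batch in dataset.take(1):
    print(batch.shape)  # a batch of serialized tf.Example records, ready for parsing
```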
Likewise, you can get optimal performance with the BigQuery reader, which provides easy client APIs, as you can see on this slide, so you don't have to spend much time on performance tuning yourself. This is a comparison graph of Cloud Storage reader performance between open source TensorFlow 1.14 and TensorFlow Enterprise. As you can see, there is a significant difference in I/O performance, which means you can pump much more data into the GPUs and get a higher GPU utilization rate. This is a benchmark result we got on P100 GPUs with open source TensorFlow 1.14 and TensorFlow Enterprise 1.15; we observed about a 30% difference in GPU utilization. There is also technical support: TensorFlow has been an open source tool, but with the TensorFlow Enterprise distribution we provide professional technical support, as well as enterprise-grade long-term support for each TensorFlow version, such as 1.15, 2.1, or 2.3, for up to three years.

Next, I'd like to discuss why it is important to establish data validation and model validation practices for production systems, and how you can solve these problems by using the TFX open source tools. The first anti-pattern here is called lack of data validation. With an ordinary IT system, people write millions of lines of unit testing code to make sure that the behavior of the system works as expected. But with a machine learning system, the important behavior of the system is defined by the data and by the machine learning model you have trained on that data. How can you make sure that the behavior of the machine learning system works as intended? You cannot write millions of unit tests against a machine learning model. How do you make sure of that?

Another problem you see in production systems is training-serving skew. A data scientist may spend a lot of time on the POC notebook, but when you bring the same model to the production system, in many cases there is a difference in accuracy between the training data and the serving data. This happens for many different reasons. In some cases, the distributions of the features found in the training data and in the serving data are quite different. Or there is a difference in the preprocessing, or in how the windowing is applied to time-series data; any of the slightest changes can make a difference in accuracy between training and serving.

To solve these problems, we provide a solution packaged as TensorFlow Extended, or TFX. This is an open source tool developed by the TensorFlow team, and the orange components on this slide are already published as open source, so you can download them by going to tensorflow.org/tfx. TFX is based on its predecessor, called Sibyl. Sibyl has been one of the largest machine learning production frameworks, used in the large machine learning services at Google, and the TensorFlow team is now rewriting that framework as an open source tool called TFX. Many large machine learning production systems at Google, such as the Google Play and YouTube teams' machine learning systems, are now based on TFX, so it is a proven technology. One of the components TFX provides is TensorFlow Data Validation, or TFDV. TFDV helps developers understand, validate, and monitor their machine learning training data at scale.
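As a rough sketch of the TFDV workflow shown in the demo that follows, assuming two pandas DataFrames train_df and eval_df (for example, splits of the Chicago taxi data), the snippet below computes statistics, infers a schema, validates the evaluation split against it, and declares a skew check on the payment_type feature.

```python
import tensorflow_data_validation as tfdv

# Compute and visualize descriptive statistics for the training split
# (train_df and eval_df are assumed pandas DataFrames).
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
tfdv.visualize_statistics(train_stats)

# Infer a schema from the training statistics, then check the eval split against it.
schema = tfdv.infer_schema(train_stats)
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

# Compare training and evaluation distributions side by side in the notebook.
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL', rhs_name='TRAIN')

# Declare a skew check on a categorical feature so later runs can flag drift.
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01
```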
As I mentioned, TFDV is already used in some of the largest machine learning systems at Google, so it is a proven framework. The reason the TFX team provided this functionality is that they wanted users to treat data errors with the same rigor and care with which they deal with bugs in code. A change in the distribution of a feature in the training data is just like a bug you find in code. The Google Play team actually introduced the data validation practice with TFX, and by introducing data validation they were able to increase their app install rate by 2%, on average, across all applications.

Let's take a look at how you can use TensorFlow Data Validation in a demo video. You can use the usual Jupyter notebook environment. As you can see, this demo uses a dataset read from BigQuery, the Chicago taxi dataset; you can see the timestamp of each trip, the distance, the payment type, and so on. By using TFDV, you can easily visualize the statistics of each feature, or drill down into individual features and their distributions as histograms and so on. For example, it looks like most people are paying by cash or credit card. This is the visualization you get with TFDV. Then you can easily find the differences in distribution between two datasets, such as between the training data and the validation data. If you are seeing a drop in accuracy on the validation data, one possible reason is a difference in the features or distributions between the training data and the validation data, and you can easily visualize those differences. TFDV also provides a feature for detecting drift and skew. You define which skews you want to detect, so that the tool keeps watching for differences in, for example, the distribution of each feature.

Another anti-pattern we have to solve in production systems is not knowing the freshness requirements. What are freshness requirements? They are the intervals at which you want to retrain the model, based on your business requirements. For example, if you are training a model for a news aggregation site, maybe you want to retrain the model every five or ten minutes, whereas for other use cases, such as recommendation systems for e-commerce, it can be a longer interval, and the model can be retrained every day. For yet other use cases, such as natural language processing or voice recognition, maybe you can retrain the model every month or every year. The point is that you have to understand at what interval you have to retrain your models. The Rules of Machine Learning paper says: know the freshness requirements of your system.

To help with this, we provide a tool called TensorFlow Model Analysis, or TFMA. It computes and validates the evaluation metrics of your machine learning models, and it is used after model training. A major feature of TFMA is that it can easily compare metrics between different versions of a model. For example, your model was performing very well a month ago, but over that month the accuracy has gradually dropped, and now the accuracy of the same model is somehow very low.
You can detect that kind of accuracy drop by comparing different versions of the machine learning model. And it is not only about time: you can also define other slices or segments of the dataset. One popular slice is time, such as the time of day or the day of the week, so that you can see the differences between those times. But you can also define other segments, such as the demographics of each user. For example, your machine learning model may work very well for people in the United States, but it is possible that the same model does not work well for Japanese people. Or how about ages: does it work well for children or for elderly people? Does it work well across different genders? By using TFMA, you can define different identity groups and compare the differences in the metrics.

Let's take a look at how TFMA works. Again, you can use the usual Jupyter notebook environment and start by loading the sample data, as you can see in the video. By using TFMA, you can easily visualize how the metrics change based on the slices you have defined. In this case, we visualize the differences in the metrics based on the time of day for the Chicago taxi sample data. As you can see, you can easily find how the metrics change over time, and the metric can be anything, such as ROC, accuracy, F1 score, and so on.

Lastly, I'd like to discuss how you can establish machine learning lifecycle management as a practice in production systems. One of the popular anti-patterns you find in production systems is lack of ML lifecycle management. This happened many times in the early days of data science: researchers or data scientists would create a notebook, and in some cases they would bring that same notebook into the production system as is. Maybe it works well at the time of the service launch, but what about after one year, two years, or five years; will it keep working without any problems? In some cases, the data scientist or researcher gets an emergency call from the business people: a director asks you to retrain the model ASAP, but can you remember how to rebuild the same model by digging up some old notebook on your local laptop? Actually, this is not a job for the data scientists or researchers; other roles, such as the ML ops team, should be responsible for this kind of job.

Another anti-pattern in this area is lack of continuous monitoring. If you don't take care of continuous monitoring and alerting, who is the first person to notice a drop in the accuracy of your model? The answer is the consumers or users of your machine learning models. For example, the technical support people will be getting many complaints and pieces of feedback from consumers saying that the machine learning model, for example the recommendations, is not working well. You don't want to be the last person to notice problems in production systems.

To avoid these kinds of problems, we introduced a product called Cloud AI Platform Pipelines. This is a commercial product based on the open source TFX tools: TFX provides components for building production pipeline systems, and we provide them as a commercial product that is supported and managed on Google AI Platform.
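As a sketch only, here is roughly what a TFX pipeline definition looks like in Python, with data validation, training, and model pushing wired together and compiled for Kubeflow Pipelines, which is what AI Platform Pipelines runs. The paths and module file are hypothetical, and some constructor arguments differ between TFX releases, so treat this as an outline under those assumptions.

```python
from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                            ExampleValidator, Trainer, Pusher)
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner
from tfx.proto import trainer_pb2, pusher_pb2

# Hypothetical locations for data, trainer code, and the serving directory.
DATA_ROOT = 'gs://my-bucket/data'
MODULE_FILE = 'gs://my-bucket/taxi_trainer.py'
SERVING_DIR = 'gs://my-bucket/serving_model'

example_gen = CsvExampleGen(input_base=DATA_ROOT)                         # ingest raw data
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])  # TFDV statistics
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
example_validator = ExampleValidator(                                     # data validation step
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
trainer = Trainer(module_file=MODULE_FILE,                                # user-provided training code
                  examples=example_gen.outputs['examples'],
                  schema=schema_gen.outputs['schema'],
                  train_args=trainer_pb2.TrainArgs(num_steps=10000),
                  eval_args=trainer_pb2.EvalArgs(num_steps=1000))
pusher = Pusher(model=trainer.outputs['model'],                           # export the trained model
                push_destination=pusher_pb2.PushDestination(
                    filesystem=pusher_pb2.PushDestination.Filesystem(
                        base_directory=SERVING_DIR)))

tfx_pipeline = pipeline.Pipeline(
    pipeline_name='taxi_pipeline',
    pipeline_root='gs://my-bucket/pipeline_root',
    components=[example_gen, statistics_gen, schema_gen,
                example_validator, trainer, pusher])

# Compile the pipeline into a package that can be uploaded to AI Platform Pipelines.
kubeflow_dag_runner.KubeflowDagRunner().run(tfx_pipeline)
```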
AI Platform Pipelines allows you to run your machine learning workflow as an orchestrated pipeline. It works like Lego blocks: you take some of the TFX components and connect them together, combining them to build the pipeline you want. Under the hood, everything runs on GKE, Google Kubernetes Engine, and we use the open source Kubeflow Pipelines tool, running as a managed commercial service. Combined with the metadata store, it manages all the metadata for the datasets, hyperparameters, and models of different versions. With AI Platform Pipelines, you can easily check what is happening in the production machine learning pipeline by using the user interface in the console: for example, what is happening in the data preprocessing or validation phase, what is happening in the feature engineering, how the training processes are going, and what metrics you are getting from the trained model. You can easily compare the different metrics, such as accuracy or ROC, across different timestamps or versions of the datasets and machine learning models.

Let's take a look at an actual demonstration video of AI Platform Pipelines. As you can see, you get the DAG, the directed acyclic graph, of the machine learning pipeline in the user interface. You can also access the lineage management user interface. For each artifact, such as a dataset, a machine learning model, or a set of hyperparameters, the lineage, that is, which dataset, which hyperparameters, and which models were used for training a specific model, is managed, so a model can easily be reproduced by looking at the lineage explorer in the user interface.

Those were the topics I wanted to cover. First, I introduced the concept of MLOps, which is based on DevOps practices. Then I discussed how you can build a scalable machine learning development environment that can be shared between the ops team and the data scientists. Next, I discussed the gaps between POC systems and production systems, and how you can fill those gaps by using products such as AI Platform Training, AI Platform Prediction, Kubernetes Engine, the DLVM, and the container images. I also discussed that data validation and model validation are important concepts in production machine learning systems, and that TFX is the key to solving those problems. Lastly, I discussed how you can build machine learning lifecycle management by using tools such as AI Platform Pipelines. If you are interested in these products, please go to cloud.google.com and check out each product's page. Thank you so much.