Don't tell me. Don't tell me? Don't tell me. Okay, one more. Let me check — I have some stickers here. Plenty of stickers. I'll have — okay, I'll leave some for the others. So, a question. This isn't a sticker question. Once we get going, we'll hand out the stickers. Don't tell me, anyone — yes, yes, of course, of course. So, when I speak about data engineers, I also include DataOps — that's like a sub-branch of DevOps that does data stuff. So, okay. Right. One sticker question: can anyone tell me what Kubeflow is? Okay. Yes, that's right — kind of. Sorry, I'll just press on. So my prediction is that 2019 will be the time when there's a large uptake of Kubernetes from the AI side, the machine learning side, because we're facing a lot of problems with our infrastructure and with deploying apps. So, to answer the question of what Kubeflow is: Kubeflow is actually the combination of Kubernetes and TensorFlow, right? It's basically a framework for doing machine learning in the cloud. But Kubeflow is pretty new — it's at 0.4, I think 0.4 just released. So what I'm talking about today are basically the basic components of Kubeflow. You can think of Kubeflow as a massive Helm chart with many sub-charts — that's basically what Kubeflow is. Although actually, Kubeflow is not using Helm, it's using ksonnet. Some information about me: I'm a data scientist at Honestbee, and before that I was doing computational biology in the domain of genomics. Any researchers here? Oh, okay. So this is the setup I had when I was doing my postgraduate work. We had a simple SGE cluster: there's a head node for submission of jobs, and basically you write scripts and send them into the cluster to do the work.
And it's usually a shared resource — you have queues, and basically a huge NAS for data storage. So this is very different from what I walked into at Honestbee; I don't have that kind of resource at my disposal. In fact, this is my home setup — just a simple desktop. And at work I was actually given something even worse: a very low-powered laptop to do machine learning on. I was thinking, how am I going to do this? Yeah, no, this is my home one — that's the i9. The Titan cards? Thankfully, NVIDIA actually sponsored the Titans — you can apply for them under an academic licence — and only one of them belongs to me; the other Titan belongs to another researcher, so basically we share this desktop. Yeah. So, how do I get from this back to something like this? When I came in, I sort of knew that Kubernetes was going to be the solution, because I have seniors in other companies that are using it — specifically, people from Human Longevity. I'm not sure if anybody has heard of them, but that's how they do their training too. So, this is Tommy here — he's like the evangelist for Kubernetes at Honestbee, telling everyone, okay, please use it. That's how I got to know about it. Okay. So, similarly to the picture of the SGE cluster, you have nodes. But instead of compute nodes, these nodes do different things: they more or less serve web applications and maybe run some databases. From the data scientist's point of view, though, we need to crunch numbers — we need compute. So I'm looking at this as a compute resource. Okay. I knew nothing about Kubernetes, so I basically went with Tommy to KubeCon in Shanghai. That was a pretty interesting affair, and I was pleasantly surprised that there's actually a whole track just for ML and data.
I was actually shocked that there is a whole track. You can see that for two days, it's back-to-back talks on using Kubernetes for machine learning. Most of these are basically about how to bring things from a desktop to production. And this slide keeps coming up — no matter where I went, whichever talk I attended, I saw it in more and more of the machine learning talks. It's from the paper "Hidden Technical Debt in Machine Learning Systems". Many people think that as a data scientist I just have to write training code, but really I have to deal with this whole set of things — a crazy amount of stuff — and the yellow dot is just that. This was actually an animation: it starts with that, and then all the other things come up. And this is what I want to be doing: machine learning. I want to train, I want to run, I get data and results come back. That's pretty standard. So, credits to Jörg — well, not actually Jörg, there was another speaker who did this slide, and I basically pillaged it. So we would do this, but there's also this other slide showing a huge problem at inference time: how do you deploy it? Which comes back to this image. We need a lot of DevOps help here, but we also need Kubernetes here as well, to train. So I'm sharing with you what I've learnt from KubeCon. One of the speakers actually broke that same image up into three rows: you have the data scientist, you have the data engineer, and DataOps. This is slightly different from DevOps — this person basically has to understand enough data science to bring it into production. So, according to that speaker — Jörg, basically — feature extraction, analysis and a bit of model monitoring is what data scientists should be focusing on.
And DataOps are basically helping us with process management tools, monitoring, data verification and collection. Then there are the sysadmins, who are doing overall cluster maintenance — configuration and basically resource management. And there's also this whole chunk, which is serving. So you can see how Kubernetes is everywhere in the chart, and you'll see that Kubeflow actually sits here. Besides the division of labour, there's this new term, DataOps — you can call it AIOps or something like that, whatever name you want. This person basically needs to know these two things, and if you work long enough as a data scientist you'll also pick some of them up, and whoever works with you will pick up some of the data science. So, I'll share with you: Kubeflow, as I mentioned at the very beginning, is actually a framework that tries to cover this whole picture, component by component. I would not recommend that you use Kubeflow in production yet, but we can take inspiration from its individual parts. The first component is actually JupyterHub. Who here knows what notebooks are? Okay, so everybody is familiar with notebooks. There are different types of notebooks — there are Zeppelin notebooks, there are Jupyter notebooks. Basically, I think everybody just loves Jupyter notebooks and nobody really likes Zeppelin notebooks. But the thing with Jupyter notebooks is that they're not really supported by most of the paid-for solutions; you have to run them on your own. And this is basically how we came to this. When JupyterHub was first conceived, it was supposed to run on just a single server, but now there's this Helm chart for running JupyterHub on a Kubernetes cluster. So when you install this chart, if you list the pods, you basically get a hub pod and a proxy pod.
And each user gets their own container running their own Jupyter notebook. There's something called KubeSpawner, where you can actually specify the image. Honestly, we have two teams in data science — an e-commerce team and a logistics team — so you can imagine the types of software we use are very different. We can have a base image for data science, but the logistics team will be using something entirely different from what we're using — maybe something like CPLEX, or something similar. We can store these images in a registry, pull them, and spin up the containers. The good thing here is that this lets you customise your data science image. Who plays Kaggle here? No? If you've been using Kaggle recently, you'll see their kernels are getting more interesting — you can actually load saved images as well, so I'm guessing that runs on top of something like KubeSpawner. Basically, when the pod is spun up there's a persistent volume, so that if the pod dies, at least my data will survive. And this is linked up to the base image, and then I'm happily doing all my data science stuff here and I don't have to care about the server anymore. Having a very good base image is important, because not everything can be done in the notebook. You might need to — well, not access exactly; nowadays you port-forward into the container, or just exec inside. You treat it like a base: you install all your software there. This base image will also help you get to a working state on the same day you join a company. It used to take maybe a week to set up your environment; once you have a base image, one hour is all you need. Something I haven't mentioned much is that there are GPU images.
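For reference, the spawner settings described here live in JupyterHub's configuration. Here's a minimal sketch of a `jupyterhub_config.py` fragment, assuming the `kubespawner` package is installed; the image name and registry are hypothetical:

```python
# jupyterhub_config.py -- fragment; `c` is provided by JupyterHub at load time.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Hypothetical base image pulled from a private registry.
c.KubeSpawner.image = "registry.example.com/data-science-base:latest"

# Resource constraints per user notebook.
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "8G"

# Request one GPU for the user's pod (requires the NVIDIA device plugin).
c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}

# Persistent volume so notebook data survives pod restarts.
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.storage_capacity = "10Gi"
```

When deploying via the JupyterHub Helm chart, the equivalent settings go into the chart's values instead, but the knobs are the same: image, CPU/memory limits, GPU resources, and storage.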
If you have a Mac, these images will not work. In KubeSpawner you can actually specify a GPU resource, so on the machine where I have two Titans, I can have two users spun up. That's the beauty of it. Step by step: you port-forward to the proxy, you're greeted, and you sign up, basically. Then there are the spawner options, which you can fully customise — this goes into the Helm chart when it's deployed. You can specify the image, maybe how many CPUs, the memory constraints and the GPU constraints, and then you can start your Jupyter notebook. If you want to switch, you just need to change the URL. Okay. So, we'll move on to analysis. As data scientists, we spend a lot of time here, and some time is spent here — actually, a lot of time is spent here: feature extraction. We have to do some scrubbing of the data, okay. Once you've got the feature extraction, you want to do some ETL transformation, and that takes time. And you need to repeat this — it needs to be reproducible science. So what's the best way to reproduce it? You write some sort of execution plan: you do the extraction, you do the transformation, and then you do the load. This has been around for a really long time — even on SGE clusters you have the ability to link jobs together, so once one job finishes, the next one kicks off. And then there's also something like Airflow. Who has heard of Airflow? I'll give out a sticker — okay, pass it along. So, what do we mostly use Airflow for? Hands up — quick, come on. (Audience: Airflow for automatically running the jobs.) Yeah, yeah. Awesome. Say again? Yes, scheduling the jobs. These are the big cron jobs that you need to repeat over and over and over again.
Okay, so for feature extraction, yes, you might want to do this over and over again if you're retraining tomorrow. But there's something else, which is hyperparameter search. If you've got a lot of parameters to search through, is Airflow the best tool you have? Not really, right? And only recently did Airflow get support for Kubernetes — only in the newest release, 1.10, is there the KubernetesPodOperator; before that there was only the DockerOperator. So the only option before that to run Airflow on a multi-node setup was, I guess, mostly Mesos/Marathon. No? No? Okay, sorry. So, I'm focusing here on the hyperparameter search part. You have a standard DAG: you've got feature extraction, you've got feature processing, and then you've got training. And you want to try different parameters here, so your DAG changes, and the space of combinations you'll be exploring will be very huge. You can do a random search, but it will still be huge. You train, and then you have to do the evaluation. Each step can be containerised, right? That helps with DevOps as well, because they don't want to know how your data science algorithm works. And data scientists are not the best — we're not usually trained as engineers, we're trained to get the job done, so our code base is not really the best. Once you containerise it, it's a black box — it truly becomes an ML black box, to further the stereotype, but this is how it works. Okay, I'll talk more about job submission now. On an SGE cluster you do something like this: you can write a very long batch script to do the submission, but this is as light as it gets. You specify the queue, the number of cores, you give it some variables, and then you execute the script. I didn't include here that there are positional arguments you can pass.
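Going back to the hyperparameter search point for a moment — to see why the space blows up: each parameter you grid over multiplies the number of training runs, and each run can be one container in the DAG. A minimal sketch (the parameter names and values are just illustrative):

```python
from itertools import product

def grid(**param_values):
    """Expand named parameter lists into every combination (a full grid search)."""
    names = sorted(param_values)
    for combo in product(*(param_values[n] for n in names)):
        yield dict(zip(names, combo))

combos = list(grid(
    learning_rate=[0.1, 0.01, 0.001],
    max_depth=[4, 6, 8, 10],
    subsample=[0.5, 0.8, 1.0],
))
print(len(combos))  # 3 * 4 * 3 = 36 training runs
```

Three modest lists already mean 36 runs; add a fourth parameter with five values and you're at 180 — which is why a scheduler built for a handful of recurring cron jobs is the wrong shape for this.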
Routinely I would run like 5,000 jobs when I was doing my PhD — easily. I'd usually submit on Friday and come back on Monday, a bit later. Hopefully nothing goes wrong, right? But it's pretty smooth — a very stable setup. And if anything does go bad, you can actually check on the status of the job: what happened, how did it fail, things like that. So, how am I going to replicate this in a Kubernetes environment? I was desperately searching, and then I found Kubeflow. I was very excited, but when I tried it — the founder actually said please don't use it for production yet, because he himself says it's not ready for production. But Kubeflow has this component called Kubeflow Pipelines, which was just released; before that they were actually using Argo, and in fact Kubeflow Pipelines was based on Argo — is still based on Argo. Argo is a workflow manager; I think you can use it for CI/CD as well, so there's GitOps — if you want to do GitOps, this is like the perfect thing. Like I said, the workflow engine is just one component — there are actually four different components to Argo — but I'll just be talking about workflows. Like Alex mentioned with CRDs, Argo also has a custom resource definition: a Workflow definition. So when you install Argo, you'll basically have two pods running: the UI pod — there's actually a web interface — and a workflow controller, which basically manages all the workflows you give it. Okay. Then, just like qsub, you have argo submit. Argo is actually a wrapper around kubectl — kube-control, kube-cuddle, whatever you want to call it — so it still follows kubectl conventions. So you have to define the name, right?
And basically there will be a job ID attached to that. You can try different parameters; they'll all have a different job ID, but the prefix will remain the same. And you can actually pass arguments — okay, I'll show you how to do that later. Once the script is done, you can get details about it using argo get, and you can look at the outputs of these jobs using argo logs. This is really, really useful. Okay. So, this is basically what an Argo workflow definition looks like. You can actually write templates, in case you have nested — what do you call them — nested workflows that reuse certain components. And you see here there's actually a container image, so you have to build the image. Previously you could just submit a script into your SGE cluster; now you have to put the script inside the container — the image, sorry. Then you run the image with an entrypoint, and you have arguments. The arguments here can actually be templated, so you can run these templates with different values. The values can be written into the workflow, or given to Argo at runtime — you can just do argo submit -p for parameter, and give it the argument name and the value, basically. Now, I didn't know that videos weren't allowed in the presentations, so I'm not sure how this will work, but this is basically how the DAG looks. You start with the DAG name, then the first step, and these two steps actually run in parallel — so you can actually spin up two containers in this workflow. The way you do this is with a nested structure: the outer one is sequential, so this happens first, and then these two run in parallel. You can do this for all parts of the workflow.
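To make that concrete, here is roughly the shape of such a Workflow manifest — sketched as a Python dict for illustration; the step names, parameter name and image are hypothetical, and in practice you'd write this as YAML and submit it with something like `argo submit -p lr=0.1 workflow.yaml`:

```python
import json

# Minimal Argo Workflow sketch. generateName gives every submission a unique
# job-ID suffix while the prefix stays the same.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "train-"},
    "spec": {
        "entrypoint": "main",
        # Parameter supplied at submit time via `argo submit -p lr=...`.
        "arguments": {"parameters": [{"name": "lr", "value": "0.01"}]},
        "templates": [
            {
                "name": "main",
                "steps": [
                    # Outer list is sequential; this step runs first.
                    [{"name": "extract", "template": "run-step"}],
                    # Steps in the same inner list run in parallel.
                    [
                        {"name": "train-a", "template": "run-step"},
                        {"name": "train-b", "template": "run-step"},
                    ],
                ],
            },
            {
                # Reusable template: your script baked into a container image.
                "name": "run-step",
                "container": {
                    "image": "registry.example.com/trainer:latest",
                    "command": ["python", "train.py"],
                    "args": ["--lr", "{{workflow.parameters.lr}}"],
                },
            },
        ],
    },
}

print(json.dumps(workflow, indent=2))
```

The nesting in `steps` is exactly the sequential-outside, parallel-inside structure described above: the outer list runs in order, and entries sharing an inner list are spun up as parallel containers.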
But personally, when I first transitioned from academia to industry at Honestbee, the worst component was actually serving. Like Vincent said, I needed a lot of support from the DevOps side to get my machine learning models up. And it's like a double whammy: there's lots of potential and not enough manpower, and basically I was at the mercy of the developers. Going forward, though, I think all developers, not just DevOps, should know how to write their own Helm charts. With Helm 3, if the dream comes true, you can start writing in a scripting language, and it will be easier — because lots of YAML files don't look very nice. But the scripting language will be Lua. They actually announced it at KubeCon and we were like, what? But the person who announced it said there's nothing stopping open source from building something like a Python scripting layer on top of it; Lua was just chosen because it has good integration with Go. Okay. I like this picture, because this is the typical flow of a data scientist. You have algorithm design — maybe if you're just pillaging from an existing paper, you don't really have to do this. Then you have exploration, a proof of concept, where you use some small data to validate it, and then you validate it against production. Maybe this will be a small set and this a big set. And then here it says: promise your DBA you will only write SELECT, and no INSERT or UPDATE. So you basically have to ask for a lot of permissions to do all this stuff, because you need to pull data, and not everyone is exactly willing to provide you with it. I know people who have been stuck in this cycle for a long time — the bigger the company, the longer you get stuck here. And then you have to integrate with staging, and then you have to prove that it works.
I'm not sure how you prove your machine learning algorithm works in staging; I'm still trying to figure that out. Then you have to go through QA — who does QA for data science? And then finally you go into production, and here be dragons in between. And then finally you have to, hopefully, prove that your algorithm makes money and that you belong in the company — else it's off you go. Okay. So, the infrastructure for serving these microservices will be managed by DevOps, but really we need to pick up the slack from the developers and write our own Helm charts. Whether there will be a chart repository is another question, because if you're deploying a lot of microservices — personally I don't feel the need for CI/CD, as long as the chart installs and it's working, I don't want to see it again — but people have different ideas. So there's a main back end, and a back end for data, and that back end will actually call out to the model. Okay. So as data scientists we have to do the Helm chart, and this is basically how we connect the Helm charts — we need to connect with the feature part. I basically drew it as a duo, right — it takes two to tango: maybe the data scientist is this one and the DataOps is this one, and we can swap roles. Okay. So this is what I've gone through so far. There's another component here, which is model monitoring. I'm using a lot of the traditional monitoring tools, like Datadog, and hopefully Grafana — not Grafana yet. Normally these are used to measure things like CPU usage, memory usage, connections, but I'm using them to look at more business metrics — like dashboards that are live, on streaming data. Okay. While doing this, I learned that Alpine is not my best friend — at least when things link against the C library. Yeah.
I found out that it's not good at resolving DNS. I spent a lot of time building a deployment image on it, but found it's not the best choice. Okay. So, extra slides. I'm actually working on recommendation systems, so this is a bit off-topic, but basically you have products, and then collaborative models which run as batch jobs — these will be your workflows — and then there's ranking. You basically generate a series of recommended products, and this is where the microservices have to do the processing live, based on the features that come in live. So this is how it sits in the current infrastructure. Okay. Thank you. Yeah. (Q: Do you use TensorFlow Serving as a microservice, over gRPC?) Yes, using gRPC. (Q: So how do you...) Okay, so in the Helm charts there is the liveness probe and — what's the other one — the readiness probe, right? TensorFlow Serving doesn't have a REST endpoint for that check, so you have to use the gRPC TCP port to check whether it's live. But it can't serve the model directly, because the input needs to be transformed first, so it's usually a pair: you have one Flask app and one TensorFlow Serving. The amazing thing I found was that with TensorFlow Serving I actually didn't have to scale the pods — it's really fast. Okay, so back to this chart: in Kubeflow, the solution for serving is actually one more layer sitting on top of the charts, something called Seldon. The input is kind of similar to TensorFlow Serving. So, yeah. (Q: Is TensorFlow Serving like Knative Serving — like serverless?) No, it has to be running. Right — actually GKE has something which is kind of serverless as well, where you can upload your models directly. Yeah, there's Knative. (Q: I was under that impression — I was just wondering. You said serving, I just thought it was Knative Serving.) Ah, okay — it's a different animal. Yeah.
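Coming back to the readiness probe point: since TensorFlow Serving's gRPC port exposes no HTTP health endpoint, the check can only verify that the TCP port accepts connections. Kubernetes supports this natively as a `tcpSocket` readiness probe in the pod spec; the same idea in plain Python looks like this (host and port are whatever your chart wires up):

```python
import socket

def tcp_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Readiness check that only verifies the TCP port accepts connections --
    the trick you need when a server (e.g. TensorFlow Serving's gRPC port)
    has no HTTP health endpoint to hit."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In the Helm chart itself you'd express this declaratively with a `tcpSocket` probe on the serving container rather than shipping a script, but the semantics are identical: connect succeeds means ready.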
But the thing with TensorFlow Serving is that if you have big models with a lot of weights that need to be initialised into memory, you need — I think Brem Koon suggested — something like a sidecar to hit it first to warm it up. I encountered this because I was testing it out: the first request fails because the server is still warming up, loading all the weights. So that's one problem with TensorFlow Serving. Thank you, Masli. Very good.
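As an aside, the warm-up workaround described above boils down to a retry loop: keep firing a dummy prediction until the server has finished loading its weights. A minimal sketch — `predict` here is a hypothetical callable; in practice it would wrap a gRPC call to the serving endpoint:

```python
import time

def warm_up(predict, attempts: int = 30, delay: float = 2.0):
    """Retry a dummy prediction until the model server has loaded its weights.

    `predict` is any zero-argument callable that raises while the server is
    still warming up; returns the first successful result, or raises
    RuntimeError if the server never becomes ready.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return predict()
        except Exception as err:  # connection refused, deadline exceeded, etc.
            last_err = err
            time.sleep(delay)
    raise RuntimeError("model never became ready") from last_err
```

Run as a sidecar or init step, this absorbs the failing first requests so real traffic only arrives once the weights are in memory.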