OK, we're going to get started. This is an introduction-level talk. If you want a really, really big deep dive on Kubeflow or something advanced, this isn't it, so you can leave. Also, normally this talk is me and a co-presenter, and we're telling jokes the whole time. But jokes don't really go over the language barrier well. So if you were expecting a lot of jokes, this is actually going to be real content, and I'm sorry. So you can also leave now, and I won't be offended. But if you like jokes, I recommend looking up another version of this talk, because it's really just me and Holden cracking jokes for 30, 45 minutes about stuff, with pretty much the same slides. A little about me, first of all, things that aren't on here, and also a little bit about this presentation. Normally when I give talks, maybe the day before is when I really start practicing. I go through it a lot, and I correct all the errors I find on the slides. For this, I submitted the slides a week ago, so they were all done. I came here, and I wasn't able to get into my Google account, so I couldn't edit any of the slides. So there are lots of errors and mistakes, and I'll try to point those out. I apologize about that in advance. Also, I changed some content, so the slides don't exactly match up with what I'm talking about, but that's OK, too. I also talk really fast. I am working very hard to talk as slowly as I can. But if I am talking too fast, please just raise your hands and let me know that I need to slow down, and that's OK. We'll do questions at the end as well. Also, there are some idioms. Same thing: if I say something and it sounds weird and doesn't make any sense, just raise your hand and I'll stop and explain what I mean by that. And I'll also be watching my live translators in the back. If I'm going too fast, just let me know, and I'll slow it down. So, about me: my name is Trevor Grant. I'm from Chicago. I work at IBM doing stuff.
I've worked a lot in the Apache communities; that's where a lot of my open source history has been. Kubeflow is my first adventure into the Kubernetes ecosystem. I'm putting together an IoT track at ApacheCon North America; if y'all want to come, you should. You can find me pretty much all over the internet with the handle rawkintrevo. Got a website and blogs and GitHub and you name it; if it's rawkintrevo, it's probably me. So at this point in the talk, I normally start with some jokes about how many buzzwords Kubeflow has. The point of this, though, is that there is a proliferation of frameworks, machine learning libraries, serving things. It's been a problem that people have been talking about in the AI and ML space for a while now: is there ever going to be a single tool that everyone rallies around? And I think the answer is no. And so then the question is, how do we start dealing with this rampant proliferation of frameworks? Because it seems to be growing, too. There are more and more frameworks popping up to solve more and more new problems. Another example: there are three talks on just frameworks on Kubernetes going on right now. I appreciate you sticking around and coming to mine, but that, again, underlines how much of a problem this proliferation is. Best case scenario right now, you're using some of these tools we're going to talk about, these components, and a little glue code. But maybe you are, maybe you aren't. There are more jokes here, but the gist of this is the algorithm. Data scientists like to think they're very important. We have built them up to think they're very, very important, and everything they do is so important. The reality is, the model or the algorithm that the data scientist makes is really one small and trivial part of creating productionalized machine learning and artificial intelligence. There's data prep. There is machine learning library version control.
There is model version control. There is validation of the model, and doing all of these things in real time. And while having a good model is good, a very elegant model that you can't productionalize is worth less than a simple linear regression that you can put in production, in my opinion. Oh, another little asterisk on this: I work at IBM, but all of the things I say are my opinions and do not necessarily represent the opinions of IBM. So there's that. Yeah, going back to model serving and model training: being able to serve a model and being able to scale gracefully on different types of hardware is an important thing for your machine learning and artificial intelligence pipelines. This joke still carries: with your data scientists, you can usually get a feel for how experienced they are by how loudly they're demanding that they need so many more GPUs. And cloud providers are always happy to sell you GPUs and time on their GPU machines. What is Kubernetes? Since this is a Kubernetes conference, I assume that you know, and if you don't, it's magic and containers. So the other bummer about my slides not working the way I thought they would is that there are a lot of good animations. In this animation, the cat is going, so you can imagine that. And this is our data scientist, who thinks that they're very important, that they're the critical piece of the organization. In 2019 (again: Trevor, not IBM), anyone who is self-identifying as a data scientist is not a technical user. So in addition to all these problems I was just talking about, we also need to create something that data scientists can use, which has a nice user interface and is easy to configure. Otherwise, they're just not going to use it, and they're going to write some glue code. So yeah, another fun animation that you can't see. Thank you. So what is Kubeflow? The problems we're discussing are common.
These problems are common across all industries, all sectors, lots of businesses. The solutions that people traditionally come up with are ad hoc: glue code, hacked-together scripts. Those are all idioms; by that, I mean it's individual teams creating little one-off solutions that solve this particular problem that they have on their team, or on their product, for this one case. And the problem with that is you get lots of duplicated effort. You're writing the same thing over and over, but a lot of times it's copy and paste, and then you just change a couple of things to fit the next problem. And the even bigger problem is that you're accumulating a lot of technical debt. If you have one person who's in charge of productionalizing all of these things, well, when that person leaves your company or gets hit by a bus, then no one knows how anything in production works. And so you just don't touch it, because it works. And that's fine, and someday you'll fix it. Someday you'll fix it the right way, but for now it's good. And then, how many people have ever gone into some code and found a little comment that says, "not production, do not deploy", so-and-so, and then a date from 2009? No one, really? OK, there's one, and that's a problem, though. And so we try to avoid that. So if you think Kubeflow is Kubernetes plus TensorFlow, that's OK; lots of people think that. It's not the case. Kubeflow is more appropriately thought of as a buffet. A buffet has lots of choices and options. Here are some of them; it's not all of them, and you can add your own. But in general, the pieces of this buffet you need to create a machine learning meal, for lack of a better word, are: machine learning libraries, which are version-controlled; data prep; the models; model management.
So when you have a model and then you have the next version of the model, maybe you need to roll back. And then you're going to want to have serving. So how does this all fit into the ecosystem? You might be saying right now, hey, I don't need to do all that stuff. I just do research. My management comes and says, we need to know something, and then we do research, and then we give them the answer in a report. That's great. You've got a great job; keep it. Because those jobs are going away in favor of: we want answers in real time. We don't want to have to do a post-mortem analysis on why sales were down last quarter. We want to know in the middle of the quarter that sales are down, why they're down, and how to fix them. That's where the world is going. For everyone else, that's the problem that Kubeflow is trying to help solve. And also to scale out those resources. A big problem with the data scientist's laptop is that laptops are not great for training machine learning models a lot of the time. They don't have a lot of power; that's not what they were built to do. They were built so you could code stuff in a coffee shop where everyone can see you and you look cool. But they're not great for actually training models. So you can scale out your training resources, and you can deploy to production. And so, why would you want to use this? My notes are out of order. So here's how you set up Kubeflow. OK, there are a lot of errors on this slide; go look at the docs. It's not that hard; it's even easier than this. There are no ksonnet scripts anymore, and there's another version coming out next week that's going to make it even easier, so I'm just going to leave it at that. The chef's recommended pairing, though: if you use those out-of-the-box scripts, you're going to get JupyterHub.
You're going to get TensorFlow Serving, TFJob, PyTorch, Katib (which is hyperparameter tuning), a list of things, and Pipelines, which is Argo and a little bit of magic. And you might be saying to yourself, what are those Pipelines? What does that mean? There used to be a joke about this cat, and it made sense with Pipelines, but I've forgotten it. So you're probably thinking, OK, cool. I've got some scripts that let me deploy a lot of services and components on Kubernetes, but it still needs to be easy for my non-technical data scientists to use, because they're not going to go learn Kubernetes. They're going to keep doing things their old way and writing their glue code, and we're never going to get anywhere with this. And also, that's not impressive: you wrote an install script that just installs a lot of things. Which, I guess, is what Pipelines answers: that's not all Kubeflow is. Kubeflow is a way for these non-technical data scientists to keep track, to set up experiments, to run these different experiments with different hyperparameters, configuring different, let's say, input data streams in and out, and also to have reproducible results. Are there any data scientists in here? If so, I'm sorry I'm making fun of you so badly. And no one wants to raise their hand now, of course, because I've been teasing data scientists. What I think data scientists sometimes do with their Jupyter notebooks is they will make a model, and they say, you know, I'm going to change it, but I want to keep this old version, so I'm going to copy the notebook and name the new one "-v2". And then the next one is "-v2.1", and then "-v2.1-final", and then "-v2.1-final-v2", and it goes on like that. And that's really hard when you're trying to roll back and find the old models. So you want to be able to keep track of those experiments.
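The "-v2.1-final-v2" problem above is exactly what experiment tracking replaces. As a toy illustration only (plain Python, nothing here is Kubeflow's actual API), tracking runs by their parameters and metrics instead of by filename looks something like this:

```python
import hashlib
import json

class ExperimentTracker:
    """Toy stand-in for what Pipelines gives you: record each run's
    parameters and results so runs are reproducible and findable later,
    instead of notebook-v2.1-final-v2 style filenames."""

    def __init__(self):
        self.runs = []

    def record(self, params, metrics):
        # A stable hash of the parameters identifies the run.
        run_id = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.runs.append({"id": run_id, "params": params, "metrics": metrics})
        return run_id

    def best(self, metric):
        # Roll back or compare runs by a metric instead of guessing from names.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.record({"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
tracker.record({"lr": 0.01, "depth": 5}, {"accuracy": 0.87})
print(tracker.best("accuracy")["params"])  # → {'lr': 0.01, 'depth': 5}
```

The point is not the twenty lines of code; it's that "which run was best, and what produced it" becomes a query rather than archaeology on notebook names.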
And then it's even worse if somebody else, maybe working on something similar, wants to know about it too. OK, I'm just going to zip up all these notebooks and send them over. Good luck. It's not a great system. So you want to have this easy-to-use UI where you can build experiments and run experiments, where you can take the same pipeline of data prep and data validation and all of those things that are important, and then move it from training to production, and nothing changes. And that's a really important thing, too. Because if you train a model and you prep your data one way, but then in production the data prep does something different... you wouldn't think that would happen, but changes in machine learning libraries will quietly introduce changes in the way a data prep step is done or the way a model is computed. You train with scikit-learn 0.23, but the production server is running 0.18.1, and now your model is totally wrong and no one knows why. You can find those things, but it's better if you don't have to deal with that in the first place. So, pipelines: I'm going to show a good one. This is actually a screenshot from Kubeflow, in the Pipelines UI. It's not doing a great job of showing how to set it up; it's done with YAML files, and you can do it with Python as well if you like. But it has some important parts to it, where you're validating your data. And you can, again, do that the same way in production, where you've got streaming data coming in, and make sure it looks the same. And not only that the schema is the same, but also that your sensors aren't broken. So there are two ways your data can fail, if you will. You can have a schema change: maybe something that used to be true/false is now coming in as a 1 and a 0, and it's being treated as an integer instead of a Boolean. That's a problem. But you could also have busted sensors.
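Those two failure modes, a schema change and a busted sensor, can both be caught with simple checks before data ever reaches a model. A minimal sketch in plain Python; the field names, schema, and the negative-10-to-20-degree bounds are just illustrative values, not any real system:

```python
# Two data failure modes: schema drift (a Boolean arriving as an int)
# and a busted sensor (a value outside any plausible physical range).
# Field names, types, and bounds here are made up for illustration.

SCHEMA = {"machine_id": str, "door_open": bool, "temp_c": float}
BOUNDS = {"temp_c": (-10.0, 20.0)}

def validate(record):
    errors = []
    # Schema check: every field present with the expected type.
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    # Range check: numeric values inside physically reasonable bounds.
    for field, (lo, hi) in BOUNDS.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not lo <= value <= hi:
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors  # treat anything in here like a software bug, not noise

print(validate({"machine_id": "m1", "door_open": True, "temp_c": 12.5}))  # → []
print(validate({"machine_id": "m1", "door_open": 1, "temp_c": 500.0}))
```

The second record trips both checks: the Boolean came in as an integer, and 500 degrees is outside the plausible range, so the sensor itself is suspect. Failing loudly here is what keeps a model from quietly acting on garbage.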
So if we've got a sensor that's supposed to read between negative 10 and positive 20 degrees, those are reasonable values. If that thing starts reporting something like 500, it could be that the machine is running at 500, but it also could be that your sensor is broken. And you want to know that fast. You don't want the system to just start making predictions based on the machine running at 500 degrees and activate the fire suppression system. Well, no, you need to fix the sensor, and you want to know that quickly. So another big, important part of this is treating errors in your data like bugs in software. That kind of goes down a rabbit hole; it's just something to think about and keep in mind when you are deploying machine learning jobs to production. So, you've also got your data transformation: that's the pre-processing. I've already touched on why that doesn't always work out as well in production as it does in your training, or how things can change quietly. Another problem that you have when you move from training to production is your data scientists will say, oh, we found this really good predictor variable, and it does account for 98% of the variation in the model. Well, that's because it gets calculated later; it isn't available at runtime. You want to find that out sooner rather than later. So, I think a lot of people are maybe here for Spark and Kubeflow. Holden is a PMC member on Apache Spark and really, really knows this topic well. I can give you her contact information, and I can also point you to some other videos where she goes into depth on this. I am going to go through it at a pretty high level; this is just pictures of stuff. PySpark is the interface layer, basically, here. She talks about the serialization: there's a lot of serialization going on, and she says you should use Arrow. This is another aside: don't ever trust vendor benchmarks.
Vendor benchmarks are usually garbage, because the vendor sets the test up to make whatever they want to say look great. I had some notes, but again, I couldn't get to my Google Drive, so I can't see the exact numbers. Using Arrow still does give you a really good speed-up with Spark, but not this good. This is a lie. Well, not a lie; fake news. This is how things were before Arrow, and again, it's just kind of going through some code showing the changes you have to make. This is it after Arrow. But me trying to talk through this the way she does, I can't really do that very well. So, cross-cloud. This is something I can talk about. One of the great things about Kubernetes is that we are, theoretically, moving toward a world where you can create these jobs and deploy them on various cloud providers: IBM, Amazon, Alibaba, Huawei maybe has a cloud; at any rate, various cloud providers. You don't have to get locked into one cloud provider. In theory, the same Kubernetes job should work wherever you deploy it. It's not there yet, but in theory. That's good, because that creates a commoditization of the cloud. You just go and do your training wherever it's cheapest today; you serve your model wherever it's cheapest today. That being said, how do you do that with Kubeflow? Well, the good news is, if you get Kubeflow up and running on your cloud provider (which is not exactly trivial on all cloud providers at the moment, I will be the first to admit, but we're working on it), the pipelines make it very easy to transfer your training to different jobs and create consistent results on different clouds. A pipeline is a series of steps; it's really Argo again, with a thin wrapper over it. But that's good, because you can run an example and then say, well, I wanna try it like this. But then your cloud provider is more expensive and this one's cheaper. Cool, I'm gonna shift the job.
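Deciding whether to shift a job to a cheaper provider is really a break-even calculation between the compute savings and the one-time cost of moving the data. A back-of-the-envelope sketch; every price below is a made-up placeholder, not any provider's real rate:

```python
# Back-of-the-envelope data-gravity check: moving a job is worth it only
# when the compute savings exceed the one-time egress cost of moving the
# data. All prices here are hypothetical placeholders.

def worth_moving(data_gb, egress_per_gb, gpu_hours,
                 rate_here_per_hour, rate_there_per_hour):
    transfer_cost = data_gb * egress_per_gb
    savings = gpu_hours * (rate_here_per_hour - rate_there_per_hour)
    return savings > transfer_cost

# 2 TB of data at $0.09/GB egress, 500 GPU-hours, $3/h here vs $2/h there:
print(worth_moving(2000, 0.09, 500, 3.00, 2.00))  # → True ($500 saved vs $180 transfer)

# Same prices, but only a 100 GPU-hour job: the transfer cost dominates.
print(worth_moving(2000, 0.09, 100, 3.00, 2.00))  # → False
```

This is the arithmetic behind "train near your data": small jobs rarely justify moving a large dataset, while a big enough price gap on a long training run does.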
Data gravity is a new buzzword that you can learn today. That is: where is your data stored? If your data is all in cloud provider one, then you're gonna have to upload it to another cloud provider, and that will be a cost. But if you're seeing significant variation in the price of their cloud time, it's probably worth it. Because, again, you wanna train near your data. The other way you can use this is, let's say you've trained a model on cloud provider one, but over time they think: oh, we've really, really got you locked in, we're gonna keep turning up those Kubernetes costs on you. You take the pipeline and redeploy to cluster number two, or cloud provider number two. What happens is, you are able to basically tell cloud provider number one: all right, it's been fun, but we're leaving now. And run on a cheaper cloud provider. Again, all of this is about preventing vendor lock-in. Which, again, is all about me saying I'm not representing IBM, and I hope they never watch my videos. So, yeah. There are some other videos. Oh man, I'm guessing you guys can't get to most of these links; sorry about that. I just didn't realize until just now, actually. Why shouldn't you use this? The slide says 0.5; it's 0.6 now. Lots of API-breaking changes are being introduced all the time. It's starting to settle out, but it's still a thing. I personally would not say this is production-grade software yet; I don't think many people would. But it's interesting and it's exciting, and it's something to start thinking about for your six-to-twelve-month architectural plans. Like, what are we gonna move all our data scientists onto? Compared to doing it locally, this is a lot of overhead. You've got a whole Kubernetes cluster, you've got 15 different services running. Your laptop's gonna start buckling under the weight of all the stuff you're running just to do simple machine learning. It's hard to do a hello-world example.
There are workarounds for that right now, but it's another issue. All of these components mean that three different talks on Kubeflow could present three totally different component sets and not even look like the same project, though it is. And then also, and again this is me saying this: Kubeflow is not in a foundation. I'm not saying this as a shot against Kubeflow, but you should always be careful with open source software that's owned by a company, not a foundation. I think Kubeflow has a very strong community, and they seem to be working really well with the community, and I don't think that's gonna change. But anytime you're looking at some open source software and it's owned by a company, not a foundation, be careful. That being said, there are some workshops, and a book that we're writing; you can sign up and get updates. There we go, I'm giving you some time to take pictures of that. I promise I practiced this four or five times, and I thought I had it really dialed in at about four minutes longer than it actually ran, so we're gonna have a little bit more time than expected for questions. I hope you'll ask some. I also have, I believe, a Kubeflow committer in the audience. Yeah, there he is, okay. So if I don't know an answer, I'm just gonna punt it to him; he can answer anything you want to know, or I can make up something that sounds legitimate. Book again, all right, and these are just screenshots at this point. So yeah, with that, do we have any questions? All right, there we go. The question was... okay, go ahead. Yeah, is the book to be published soon? We're hoping to get the book out sometime near January 1st. So I wouldn't say soon, soon.
Part of the problem, and the reason for that, is that Kubeflow just recently refactored away from ksonnet, which means a lot of the code changed, and we were like, let's not write anything, because we're gonna have to change everything anyway in a month. So we're basically just getting started at the moment, even though we've been hyping it for six months already. Yeah: software, forward-selling. How do you manage the datasets associated with the training jobs or the prediction? There are a few ways. You can use network file stores, what are called persistent volumes, or S3, and version your data. I don't believe there is anything (I'm kind of looking for a head nod or a shake) associated with Kubeflow at the moment for versioning your data, correct? There's no... okay, that's wrong, so there is something at the moment. So now it's only about managing... okay, then I was correct: there is not anything at the moment. Go ahead, Zain. Yeah, so right now it's only about managing the versioned model and the code and your computation jobs, not the data. Yes, that is correct. That's also very important. I would agree, that is very important. For managing your data at the moment, Trevor (not the Kubeflow project, IBM, or anybody else) would recommend data-v1-final.zip. But that's a good point, definitely. So, you mentioned data gravity and the difficulty of moving to different cloud providers. Do you have experience within Kubeflow using TensorFlow or any other tools with federated learning, that is, actually moving your models towards where the data is? So I think that would be the idea with the pipelines: that you would basically just set up your Kubeflow cluster next to the data. So if I've got my data, let's say I saved two copies of my data, or two slices of my data, one on AWS and one on IBM, just because that's who I work for and I should plug them at least once.
So the idea there would be that you'd run one job on AWS and one on IBM with Kubeflow. As far as merging things back together, like if you had two slices and you wanted to merge them, I don't have experience there. I could do a thought exercise on how I would go about it, but I'd literally be making it up on the fly. Yeah. But yes: moving the compute to the data. Because I think that can actually mitigate the problem of data distribution, and also, eventually, of keeping data confidential with whoever owns it, and through some kind of encryption (homomorphic encryption and so on) letting you get just the insights. And I know that at least the new TensorFlow has these federated learning options; I don't know about other tools. And I was wondering, you know, how difficult it would be in Kubeflow to set up these pipelines and actually define this federated learning model. Richard, how hard would it be to set up federated learning in Kubeflow? Yeah, TensorFlow apparently has it as a new feature. So this is kind of off the record, but one of GCP's major directions in the future is to work on hybrid cloud, including multi-cloud environments. There's a broad project called Anthos, and this was announced at GCP Next this year: training that's on premise and across multiple clouds. That's one of the general directions for Anthos, so the AI component for your situation will also be available. But I cannot give an estimate on that at this point. And to that point, too: we've been talking about these public clouds, but private cloud is also, I think, a reality for a lot of companies, because there is fear, for various reasons, about putting data up in a public cloud. And so you also wanna be able to, for example... thank you.
You also wanna be able to, let's say (and maybe a more realistic use case), train on your private cloud, where the data is, and then serve scalably on a public cloud. I always forget to mention that, but that is literally the use case for moving the pipeline and for cross-cloud training. I always just come up with arbitrary clouds, and that's wrong. So thank you for that as well. Yeah, so you would train on your private cloud and then serve on the public cloud. And that can be done without federated learning, because all your data is on the private cloud. Oh, wait. Oh, no, go ahead. I saw your hand up first, so that's fine. Oh, sure. Also fair. It sounds like a lot of the driving force behind Kubeflow is to enable data science and machine learning at large scale. My question actually relates to small scale and the applicability of Kubeflow there. I'm working with a research team at a university, and right now it's a machine learning project, but our pipeline is a series of Jupyter notebooks scattered across five GitHub repositories. You have to look in the 2017 repository for some data. It's really, really bad. And I see a framework like this, which is almost a full-stack data pipeline and machine learning framework, and it comes with a little bit of overhead, given that you need to have Kubernetes involved. But my question is, from your experience in the project so far, do you think that the job management and the model management and the versioning that you get from Kubeflow are applicable to smaller-scale projects, or are there simpler frameworks that I should be looking at? As someone who's talking about, and writing a book on, Kubeflow: I 1,000% think Kubeflow is the perfect solution to your problem. Do you agree? Yeah. Yeah. I don't want to oversell it; whatever, you need a server, and that can be public cloud too. Or, as a university, I'm guessing you've got old blades lying around here and there and everywhere.
You really just need somewhere that'll stay on and hold data. If you've got that, you're probably gonna be okay. But I think, yes, this is definitely something to watch and keep an eye on. We've got 30 seconds, so ask fast. Yeah, so at my company, we have many local servers in-house, and so a lot of the work is done without the cloud, or even on your local computer, because you do experiments locally. So we use a lot of conda environments; it's not Docker or real cloud containers. So I think in the future we really need something that can also work off the cloud and Kubernetes, but can also integrate and go to production very quickly. And I also use MLflow from Databricks, and they have machine learning packaging with conda environments, so it's easier to work with at small scale. I think we are out of time; I'm gonna answer that right after we shut it down. Thank you everyone again for coming to our talk. Thanks for coming to KubeCon. Thanks for being great people, and maybe clap for me. Thanks for clapping. Yeah, there we are. All right, awesome. And we'll be around after to answer questions, so please stop on by.