Hello, hello, can you hear me? Okay. Yeah, I usually don't need a mic, but I think we're recording, so I'm gonna use one. How is life? Oh wow, that is good. That is so good, because usually when I give a talk and I ask "how is life?", my second sentence is "that bad?" because nobody reacts. So wow, amazing. It's also not the first talk of the day, maybe the second, but not for all tracks, so you had time to sleep, which is good. Sometimes that happens at conferences that start at 8 a.m.: people are still recovering from the conference party, or they didn't sleep enough. But no, people seem to have a little bit of life in you. Maybe I'm gonna kill that life talking about Kubernetes, we'll see. So, super happy to be here. My name is Frédéric Harper. You can call me Fred, because if you don't speak French, Frédéric is really a pain in the ass to say, so my name is Fred. I'm a principal developer advocate at kubefirst. What we do is cloud native: we basically give you an open source, free tool to create a production-ready Kubernetes cluster with all the tools you need to manage your CD pipeline, your secrets, to manage everything you need to be production ready. But I'm not here to talk about our product today, I'm not here to talk about kubefirst. I'm here to talk to you about what the heck is Kubernetes. By the way, during the talk, feel free to share stuff on Twitter (I don't know if I still call it Twitter; there is no way I'm gonna call it X). Feel free to share things you agree with, disagree with, things I've said that are good or bad, take pictures of the slides, of my beautiful face, whatever. I'm on Twitter, it's fharper, feel free to connect. Other thing: if you have any question you don't want to ask publicly, Twitter is also a good place to ask. Obviously I'm not gonna watch my Twitter account during the talk, but afterwards I'll take the time to check if there are any questions, comments, or insults. So, what the heck is Kubernetes? Before we start on that topic, I want to tell you: when I decided to write this talk, the abstract, the title, I thought it was a brilliant idea, the talk I wish I'd had when I started my cloud native journey. And when the talk got accepted and I had to actually create it, I was like, that was a bad idea. That was really a bad idea. Actually, it was not a bad idea; it's just that there is so much to Kubernetes that I was wondering what I should talk about, what would make sense within only an hour with you folks. The thing is, this talk could be a full-day workshop, and I'm not even sure we would have covered everything that Kubernetes is. So I did a lot of thinking, talked to myself a little bit too much, and I decided that this talk is gonna be a mix: a high-level introduction to Kubernetes, the fundamentals (so, let's be honest, you won't know everything about Kubernetes at the end of this talk), but I also want to do some demos and hopefully help you get started with Kubernetes, or at least get you a little bit more excited, or a little bit past the "what the hell is that" stage, to understand the technology just a little bit more, to maybe try it at home, maybe try to move your traditional cloud infrastructure to cloud native or Kubernetes, if that makes sense for you. So: Kuber-what?
What is Kubernetes? But first, let me give you a warning. Actually, nobody's gonna listen to me for the next 30 seconds because you're all reading what's on the screen, except the people doing stuff on their computers who aren't really listening to me right now. But for everyone else: basically, what does that slide say? At the end of the day, Kubernetes won't solve your issues, or all your issues. It's not a magic solution. If your application is not good, it's not gonna work. If your infrastructure is not well designed, it's not gonna work either. So it's not a magic solution. I know it's sold like that all the time; when you hear about Kubernetes, everybody wants to move to Kubernetes, everybody wants to be cloud native. It may or may not be for you, depending on your needs, but again, it's not a magic solution, it won't solve all your needs. I just thought it was a funny comic, and it's really the reality for some of us. I'm lucky right now, my bosses are not like Dilbert's boss, but I used to have folks like that who were just throwing keywords around, as if it should solve every issue we had, every problem. With that said, how many of you felt like this when you started to read a little more about Kubernetes, like, yeah, I have no idea what it is exactly? You can raise your hand; I was like that. The rest of you are just lying, you're liars, you were probably like that too, because in the end Kubernetes isn't a lot of new things; it's a lot of best practices put together, and it brings a lot of new terms that we never heard before, which is the complicated part. It's a powerful but complex product for some complex cases. So I felt like that when I decided, okay, I need to do some cloud native stuff, I need to learn about this, and it was a little bit too much. Hopefully this talk won't be too much. I tried to cram a couple of things into the hour, but at the same time I tried to keep it light. It's the first time I'm giving this talk, so you tell me at the end whether the one-hour format was good enough for people starting out. So, what is Kubernetes? You've probably seen "k8s". It's just another way to say Kubernetes: the number eight is the number of characters between the k and the s. It's just for the cool kids; Kubernetes is so long to type, so you'll see Kubernetes or k8s. It actually seems longer to say "k8s" than "Kubernetes", but it's faster to type. So what is Kubernetes? It's an open source container orchestration system. It helps you orchestrate containers, like it's written there; this is what it does. It's really gonna improve scalability for whatever you need to deploy in the cloud. It helps with high availability (and I know I'll struggle with that one in English, high availability). It is resource efficient, depending on how you use it, depending on whether you know how to use it, just like every other technology. And there are also self-healing capabilities that I'm gonna demo for you at the end of the talk, which are really interesting.
It really makes your life easier when things go down: depending on the reason, Kubernetes is gonna be able to fire up your application again, and I'm gonna show you a little bit how that works. There's also a simple, stateless disaster recovery model, because it's easy for you to roll back to a previous version of what you deployed on your cluster; it really makes things easier. There's also a way of doing things with Kubernetes, which you don't have to use, called GitOps. I'm not gonna talk too much about GitOps, but the point is that your git repo becomes your source of truth: everything you put in your cluster should be in your git repo. It's not part of Kubernetes per se, it's just one way of running a Kubernetes cluster, and it makes the stateless disaster recovery model even easier, because git becomes your source of truth and everything is in git, so it's really easier to save the day when something goes wrong. And there's portability, and I put it in italics. I know the screen is a little bit small for a room this size and it's a little bit low, I did not anticipate that. I put portability in italics because Kubernetes per se is a portable technology, but it becomes less portable, even if all cloud providers tell you there is no vendor lock-in, when you go with the bigger clouds like AWS, Google Cloud, Azure. Depending on the technologies you use, because they offer managed Kubernetes clusters, depending on what you use within your infrastructure, it may become a little less portable. But that's not Kubernetes' fault, it's how those cloud providers implemented it. Which is usually not the case with smaller players like DigitalOcean or Civo, where the offer is really straightforward and they don't have 10, 20, 30 services around your Kubernetes cluster, so there is a little less vendor lock-in. But Kubernetes is the foundation, so it's easier for you to move from on-premises to public cloud, or from one public cloud to another, because mostly everything you do is defined in YAML files. With that said, a brief bit of history, just because I find it a little bit fascinating, and because I like to understand, when there's a technology like this that will basically take over everything in my cloud, if I decide to replace the more traditional way I host my applications with Kubernetes, whether it's a technology that makes sense for me. So just a brief history, because I liked it. In 2003, Google created Kubernetes, but it was called Borg. Someone said it, yeah, you need a sticker, you know, those stickers you give to kids when they give the right answer. That's good. So it was called Borg; I think at some point, a year or two after, it changed to Omega. Anyway, it doesn't matter, the initial name was Borg, but it was an internal project, which I assume (I don't know the whole story) was because they needed a technology to be able to scale, because, you know, it's Google. I don't know which services were available at that time, but just the biggest ones, like Google Search and YouTube, probably took a lot of cloud power just to keep running. So they developed this internally for their own needs, but in 2014, so a couple of years ago, they were like, you know what?
That's a great technology, and a couple of engineers at Google were like, yeah, we need to open source this. So they decided to open source the project and change the name to Kubernetes, and this is really when we started to hear a little bit about it. It was still not popular, that was the early beginning, but the year after, in 2015, it started to become a thing. There is the CNCF, the Cloud Native Computing Foundation, which is kind of a sub-organization of the Linux Foundation, that partnered with Google to really make Kubernetes what it is as open source. So Google basically (and I'm paraphrasing) gave the project to the CNCF, and they became the owner of Kubernetes. So it's not owned by Google per se; I assume their teams are probably still the biggest contributors to the project, but it is not a Google project anymore. So for those of you who were thinking, "hey Fred, we're at a Linux, open source, free software conference and you're talking about Google", do not be afraid: now it's under the umbrella of the CNCF. And the CNCF doesn't have only Kubernetes; there are a lot of cloud native projects under their umbrella. But that was not the only thing. In 2016 (actually, I'm missing one thing... yeah, okay, it's 2016), one technology that I'm gonna show you, which is gonna become one of your best friends, called Helm, was created, and the first version was released in 2016. They also created the first KubeCon conference, which is the holy grail of conferences when it comes to Kubernetes because it's organized by the CNCF. It is huge. I assume it was not that huge in 2016, I was not there, but now it is pretty huge. It's twice a year, once in Europe, once in North America. It's the place to be if you want to learn more about Kubernetes; there are a lot of vendors there, there's a huge expo hall. So it started in 2016, and things became a little more serious because now we had a conference. In 2017, this is where things really started to pick up: enterprises started to jump on the Kubernetes train. AWS started offering a public managed Kubernetes service; Docker went all in when it comes to Kubernetes (within Docker Desktop you can create a Kubernetes cluster, they really went all in). And a company like GitHub, so pre-Microsoft acquisition, am I right?
Yeah, pre-Microsoft acquisition: GitHub was running on Kubernetes. So just to give you an example of one company that trusted Kubernetes at the beginning-ish, it was GitHub. And we probably all use GitHub, or maybe GitLab or other tools, but I would say many people use GitHub, so it was, and probably still is, running on Kubernetes. Which brings us to today: it's kind of the de facto standard now for many medium to large-sized companies, but also for most startups. Actually, in a previous life a couple of years ago, I was working at DigitalOcean and part of my job was to work with startups, and every startup's technical founders were asking me, okay, I'm building the infrastructure for my application, I need to go cloud native. For every startup it was kind of the cool new thing to work on. As I said before, it may or may not fill your need; what I was telling those folks at the time was: maybe don't spend time and effort right now understanding Kubernetes, because it's a complex thing, while you're still building your product. You may not have users, you may not even have paying users. But if there is a way for you to architect your application so that it will be easier later to move to cloud native and Kubernetes, do that. So, just a quick bit of history. Now, if I want to create a cluster: this is the easiest part of working with Kubernetes. You tell me if it's gonna be big enough on screen. There are a couple of ways to do it. There are a couple of tools that let you create a cluster locally on your own machine, so you don't have to pay a public cloud to try it and test it; most of those tools give you the same-ish experience that you'd have in the public cloud. Some popular ones: there is kind, there is minikube, and I use k3d, for different reasons. So here's what I can do: I have k3d, a CLI installed on my machine, and I can say k3d create cluster demo-k3d... and actually it's the other way around, it's k3d cluster create. What do I call it? demo-k3d. And I don't know how to type; technology is hard. Cluster... oh, there we go. So what's gonna happen here (I have my firewall asking me once in a while): if I go into Docker, it's using Docker Desktop here, oh, it's over there, and if I look for k3d, you're gonna see it's now creating some containers for me. It's gonna run a Kubernetes version inside my Docker machine. So now, in theory (actually not in theory, in practice) I have Kubernetes running on my machine. What does that mean? Not that much. I have a cluster, but nothing is really running on it except the Kubernetes components that make the cluster a cluster. I'm gonna tell you a little more about what exactly a cluster is. So if I use kubectl (I'm gonna tell you about kubectl after), if I do get pods (and I'm gonna explain what pods are, just bear with me), just to show you that something exists: in my prompt here, and I don't know if the people in the back can see it, it's probably too small, you may see I'm connected to my cluster automatically, for different reasons, and I'm listing all the pods that are inside my cluster. Those are the things that are installed by default when I use k3d to create a cluster. So that's one way to create a cluster if you want to test things.
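For reference, here is roughly what that local-cluster part of the demo looks like as commands; a minimal sketch, and the cluster name is just the one typed on stage:

```bash
# Create a local cluster named demo-k3d (k3d runs Kubernetes inside Docker containers)
k3d cluster create demo-k3d

# List the default pods that ship with the cluster, across all namespaces
kubectl get pods -A

# And the node(s) backing the cluster
kubectl get nodes -o wide
```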
After that, you can also go to the public cloud. As I said, most public clouds have a managed Kubernetes offering, and the price differs from one cloud to the other. I like Civo for two reasons, and in full transparency, they just acquired us, so that's one reason I like Civo, but I loved them before: it's just a simple cloud where you know what you're gonna pay at the end of the day. I can go here and say create a new cluster, make this a little bigger, and I'm gonna give it a name... nope, clicked back... gonna click create cluster again: civo-demo. I can choose the number of nodes (we'll get back to this), and here I can define the resources that I'm gonna need on my nodes, which are basically the hardware, or virtual hardware, that runs my Kubernetes cluster. I'll skip all of this because it's specific to Civo, but if I hit create, in one or two minutes I'm gonna have my Kubernetes cluster running in the cloud. So at that point you're like, okay, good job Fred, you have a cluster, now what? Now, before I start showing you how to install an application, let me explain a little bit what a Kubernetes cluster is. I think it's really important to understand the fundamentals. Again, I'm gonna focus on what I think is the most important to know at the beginning, but as I was saying at the start of the talk, there is a lot more to know about Kubernetes. So, Kubernetes in a nutshell. There is the cluster, which is everything: it contains my nodes that run my application. The cluster is what I created with the k3d cluster create command, and it's what I created on Civo; it's basically my Kubernetes system running locally or in the cloud. Within the cluster I have some nodes. In every cluster there is one node called the master node, and the master node is basically a bunch of components that run Kubernetes itself. I won't go into the details of that, because there is a lot of Kubernetes-specific technology in there that helps run the cluster and my application. I debated with myself whether I wanted to talk about it; the first time I heard about it in a talk, they lost me completely, and it was not fundamental for me to understand how things work. When I'm driving my car, I don't need to know exactly how every piece in the motor (sorry, in the engine; I don't know why I said the French word with an English accent) works together. I know it's working, my car goes from point A to point B. That's a little bit the thinking I had: there are a lot of things related to Kubernetes in there; just know that you have that master node in every cluster you create. Where the fun begins (because so far it's not that fun) is when you get to the other type of node, the worker node. Think of it as the machine or the resources, either physical hardware or a virtual machine, that does the work: it's basically where I'm going to put whatever I need to run my application. Within my node I have two specific things related to Kubernetes, and those I'm going to tell you a little more about. You have the kubelet, which is basically the interface between the master node and the worker node. You don't have to worry much about it: it's there, it's working when you create your cluster, you don't have to do anything, but it's still good to know it exists. And there is a second thing called kube-proxy,
You don't have to do anything But it's still good to know that it's there and there was a second thing called the cube proxy Which going to be what the known master going to use to connect to your worker node or nodes and To be able to talk to cube lap or if you have Part of your application that you want the external users to access which is probably the goal at the end of the day They're going to go through the cube proxy, but everything is transparent. It's just there. I just wanted you to know that The exciting part at least for me. You may not be as excited as me But I am is the container runtime. It is the smallest part of your communities cluster They're called pods and this is where within your pods you're going to run container or Containers so you can run one container with your application two three four how many containers you want? And it's depending on how you want to deploy your application Let's say I want to deploy WordPress WordPress. It's a PHP application. It's used my sequel by default as a database I think you can change further databases, but like it's my sequel by default and maybe what you want to do is have a pod We're going to deploy WordPress and I'm going to show you how to install an application In a pod how to deploy an application So what do you would do in my example with WordPress? Maybe what do you want to do is to have WordPress so a web server that have PHP support a PHP extension that serve WordPress in one container And you may want to have a second container which have you're gonna have a second container is gonna run my sequel Which is the database that WordPress WordPress need so what you want to do Maybe is in your pod in one pod you install WordPress with the web server engine next apache And on the second container it is to running my sequel So you would be able to do that or you would be able to have to pod with each one container One container in pod one which run WordPress with the web server Second second pod with the nutter just one container that run my sequel Which is probably a best way to do that But you have option for different reasons depending on like what you want to do So just understand that the pod it's a smaller part of your community's cluster Container is not a community's technology per se But like this is what you're gonna use to run your application and you can have multiple within a pod And you can have multiple pod within your node worker, but on top of that There is a lot more objects that you can use so pod are one But there is a lot of things config maps where you want to have like different Configurations about your different resources in your cluster. 
But on top of that, there are a lot more objects you can use. Pods are one, but there are plenty of others: ConfigMaps, when you want different configurations for the resources in your cluster; Deployments, to deploy a new application, and I'm going to show you an example of that; Ingress controllers (there's the NGINX ingress controller, which is quite popular) that basically help you expose your application to the rest of the world. You know, my WordPress installation that I was talking about is probably meant for people who don't have access to my server; like any other application you publish on the web, you want people to be able to reach it, and the Ingress controller does that. You have Jobs, and you have a lot of objects that help you on your Kubernetes journey, and all those things live on your node, your worker node, and you can have multiple nodes too. So that is Kubernetes in a nutshell. But I told you there are those other objects that I won't spend time on today; I think since the latest version there are something like 104, 106 objects that are part of Kubernetes. So here's a list of other objects you can use. I'm gonna go fast-ish, so maybe you'll have time to read a little bit, but there are a lot of objects you can use within your node to do different things. There's Job, there's CronJob, there's Ingress, NetworkPolicy. There's the Namespace, and I'm gonna talk a little more about that one because it's a really critical component of your Kubernetes journey. And there's a lot more: Secrets, so you can store secrets within your cluster. There are a lot of objects to help you. Again, I'm not gonna cover all of them, because we'd be here until the end of the day for sure, and not all of them are exciting. Just know that the little (I don't like to say that, it's kind of like saying my talk is bad), the little you're gonna know about Kubernetes when you leave this room, there is a lot more out there to help you on your journey. But today we're starting at the very beginning. For your Kubernetes journey, you need best friends, and I'm not talking about your co-workers or the people you go have drinks with after work. Personally, I selected three technologies that are gonna be part of your journey; those are the three technologies I feel you absolutely need to be successful. The first one, whose logo (with all respect for graphic designers) doesn't really tell you what it is, is kubectl, the command line I used earlier to show you the content, or part of the content, of my cluster. The second technology is Helm, and I'm gonna explain a little bit what it is after. The third one is k9s, which is another great tool you can use to help you. So let me start with the first one, kubectl. kubectl (let me go back) is basically the default CLI to do whatever you have to do with your Kubernetes cluster. It is part of the Kubernetes project; it's under the CNCF. There's not a lot you can do without it; actually, there's probably nothing you can do within your cluster without kubectl. I say "kube-C-T-L"; some people say "kube cuddle" because it's k-u-b-e-c-t-l. I call it kube-C-T-L, and I think the other people are wrong, but that's another discussion. So kubectl, or kube cuddle, is the command line
that's gonna make your life better and worse at the same time, because there is a lot you can do on the cluster. Anyway, that's the command line tool you need for most things within your cluster. You're gonna see, maybe in the demo, and even in the docs, they suggest setting an alias from kubectl to just the letter k. So if you see me type k, it's just a habit, and it's basically kubectl. With that said, here's what I want to do now: I created a cluster, and I want to deploy an application, because my cluster is basically useless right now, it's just a Kubernetes cluster with nothing in it. So I want to deploy an application, and mostly everything you do in the Kubernetes world is YAML. How many of you know YAML? How many of you like it? A lot fewer people. I understand where you're coming from; I was a JSON person, I hated YAML with a passion when I started doing more Kubernetes stuff. As much as I liked my files well formatted and everything, it kept complaining because you were missing a space and things like that. I just hated it. It took me time, and that's everyone's story: nobody I know goes "fuck yeah, YAML, that's nice, that's the best technology ever", with all respect for the people who created it. At the beginning you hate it, but at some point it becomes the best thing in your life, because imagine trying to reproduce this with JSON (and I'm a big JSON fan): I'd have to put brackets everywhere and double quotes everywhere, and it's a little less readable. So now I'm a little more on team YAML for some of this stuff, and this is one of the reasons: it's readable, it's easier to read a YAML file. You need to get used to YAML, because that's your new life if you go the Kubernetes way. So here I'm using one of the other objects I told you about, the Deployment object. It's the definition of a deployment that I want to apply to my cluster. You always need the API version, apps/v1, and the kind; in this case it's a Deployment object, so I'm using it to deploy an application. I put some metadata: the name of my deployment is gonna be nginx; the label is gonna be nginx too, because, you know, I'm a simple person. There's a specification part where I can define the number of replicas, meaning how many pods I want to deploy in my cluster; in this case I'm gonna keep it at one for the demo. But what's interesting with pods, when I was saying you can have multiple pods: it's not just for separate applications, it's part of the scalability and the high availability of Kubernetes. If I install my server now (let's say I'm still running WordPress: I installed an nginx server and I put WordPress in it), I may need multiple pods, because what happens if so many people hit my website that the first pod doesn't have the resources to send the HTTP responses back to people? With a load balancer, Kubernetes is able to point users to different pods that still run the same application. And what happens if I have an issue on my first pod, it failed, got killed, didn't work? You have other replicas.
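Before we get to the containers part, here is the whole manifest being described, reconstructed as a sketch of what was on screen; the label and file names are just the ones I used:

```bash
# nginx.yaml: the Deployment walked through above
cat > nginx.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest   # demo only; pin a real version in practice
          ports:
            - containerPort: 80
EOF
```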
So that's super interesting If I skip to the end, this is the interesting part I don't people in the back if you can see but there's a speckception within the template where I Mention the containers what I'm gonna deploy to my application So let me let me show you may be in a better way for the people in the back Yeah, I use VR not cat on person but I can see I have yes code which Which I like to so there's a containers again name is engine X But what I'm saying here. I'm saying, you know what like the images that you use when you use Docker I'm gonna do the same thing here I'm gonna say hey what I want to deploy is the engine X image the latest version You should probably not use latest you probably should define the version for different reasons But like you can do that and I'm gonna say hey I'm gonna be able to access the engine X server on port 80 within my pod. So what I'm gonna do now I'm gonna go here. I'm gonna do cube CTL apply Dash F which is like basically a file the next thing I'm gonna share is a file That is the YAML file that I've shown you and before I do that. Let me just show you where am I on key 3d? So let me do keep pods a so pods is again the smallest in it within my cluster Dash a is because there is a thing called namespace. I'm just saying like show me all the namespaces I'm gonna tell you just a little more about that after be patient So just to show you like there is a bunch of like default stuff that were created with key 3d. No engine X there, you know I'm not a magician And I'm not faking it. So Cube CTL. I'm gonna do apply That half engine X and what's gonna happen the deployment got created. So in Terry if I do K pods dot. Hey, you're gonna see now if it's big enough Gonna say now that in the default namespace I if I have a container that is being created and my engine X Server is being deployed So the third tool I'm gonna talk to you about is gonna help you to show the status of your cluster and stuff because like always Getting like K get pods to see the status. It's like a little bit like annoying, but now you see it's running Another object I could use right now because my engine X server is not accessible for me It's not accessible outside of my nodes right now So what I want to do is not the best way to do that for something in production But just for the sake of the demo I can do some port forwarding Gonna be lazy on this one Actually, I Need the exact name here. So what's happening with the name here is that I call it Call it engine X and there is some kind of like UID that were created for me Because within a same namespace you cannot have the same name So Kubernetes created like a UID for me after the name. So here what I'm gonna do. I'm gonna use K again Cube Ctl again, and I'm gonna say I'm gonna use a common code port forwarding And I'm gonna say for the pod call engine X dash whatever whatever it's written there I'm gonna say hey Locals on the local us on my machine. Let me access the port 8080 That will point to my pod my engine X pod to the port 80 So if I go enter here, it's kind of like blocking my terminal because now it's it's in port forwarding mode and if I go in my browser and I do what close 8080. I have engine X working. Is it not the most beautiful application you've ever seen? Yeah, don't be too excited about that So that's one way of deploying the second way of deploying an application is through helm Actually, there is a lot of there are some other ways to deploy But like this is your second best friend. 
The second way of deploying an application (actually, there are other ways to deploy too, but this is your second best friend) is Helm. Helm is another CLI tool that helps you do two things, if you want. It helps you package applications, for you, your team, or other people if you publish them publicly in what we call a chart repository. It packages your application into a chart, and then you can use the command line without having to play with the YAML itself: you play with a chart. So it helps you either create that chart or install your application from a chart. And I see faces, people going "what the fuck is he talking about": give me one second. So if I go here (not my email), a good website to check most of the packages that are publicly available is artifacthub.io. I go there and I say, you know what, I want to install nginx; are there already packages for nginx, so I don't have to create my own YAML file? And usually they're better built, with the different objects you need to be successful. So there's Bitnami, which is probably the company publishing the most packages out there, and I see, hey, there's an nginx package. There's the install section here that explains how to do it. I have the Helm CLI installed on my computer; what I need to do first is tell the CLI, hey, you now have access to that chart repository. So I do helm repo add bitnami... I forgot to remove it before the show, so now it's complaining because it's already there, but let's say it was successful: it added this repo to my local environment. The next step (and again it won't do much now because I didn't remove it), the first thing you need to do after that is helm repo update, because, for whatever reason (I always thought adding a repo should update it automatically so you already have access to the content), it doesn't work like that. So you do helm repo update, and it goes through all the repos you added and updates them to the latest index. Let me kill this one. After that, what I want to check is: okay, nginx (which I never know how to spell), I can search with helm search repo: do I have nginx charts accessible to my Helm? Now I see I have three, and the one I want to install is called bitnami/nginx. This is the chart version and this is the application version; those can be different. The application version is the version of nginx itself; the chart version changes when the chart changes: they fixed a bug, they added an object like an Ingress controller to make nginx automatically accessible from the outside world, stuff like that. So the chart can be released as often as they want, which gives you that separate version number. So first, let me remove the nginx I installed manually. There is another command, kubectl delete -f (-f again for file) nginx.yaml. kubectl looks at my YAML file and goes, hey, does he have a deployment called nginx? Should I remove it? Boom, it should not be there anymore. If I do k get pods... k get pods... actually, k get pods -A: hey, you can see that nginx disappeared. Now I'm gonna be lazy again: I can do helm install. I call my installation whatever I want; again, I'm a simple person,
I call it nginx. I say this is the chart I want to install and this is the version I want to install; that's the --version flag I'm using, and if I don't, if I remember correctly, it installs the latest version. I can define the namespace where I want to install it (I'm gonna tell you about namespaces right after, bear with me), and I use the --create-namespace flag, because if I had wanted to manually install nginx like I did with the apply on the YAML file, and I'd decided to apply with -n or --namespace nginx, Kubernetes would have told me, hey, that namespace does not exist, there is no nginx namespace. Helm gives you the option to create it if it doesn't exist. So I hit enter here, and now there's a lot more going on, because that chart contains the deployment I showed you plus a lot of other things, and even a bit of explanation about how to use the chart once it's deployed. So if I go back... okay, let me do k get pods without the -A. It says "no resources found in default namespace", because right now I'm in the default namespace. A namespace is a way to group and isolate different resources together, and you're gonna see this with everything in Kubernetes. It's super annoying at the beginning, really useful once you get used to it: you either use the option to get all namespaces, or you switch namespace. If I do k get pods -A, I see all namespaces, but I can also say -n nginx and see just the pods that are in the nginx namespace. Really useful feature. Now nginx is running; let me show you that it's running, and by doing that, let me show you the third tool that should be your best friend. It's not a CNCF project, but it's still open source, it's on GitHub: it's called k9s, which at the beginning I was pronouncing "k-nine-s", until one of my co-workers was like, dude, it's just "canines", plural. I was like, yeah, that makes way more sense. So k9s, when you run it (let me maximize that a little bit; can you see in the back? Yes? Yeah, and that tool is not super good when you zoom, that one is better; it's fine if you can't read all the words), what I want to show you is that this thing means I almost never use kubectl for most things, because it's kind of a UI tool for the terminal. Here I'm showing all the pods within my namespace, and I have all the information, plus some information I didn't get from get pods; there is a way to get more details, but the default command doesn't give me everything. And I can navigate between objects: show me just the namespaces (I typed colon namespaces, and now it shows me the namespaces), and I can say, hey, I want to see only what is in the nginx namespace. But there are other objects that the chart deployed, so if I type colon services and click there, there's a load balancer that was installed for me. This is why installing a chart sometimes just makes your life easier: you don't have to write all that YAML for a standard application that, for sure, you're gonna use again.
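Put together, the Helm flow from the demo looks roughly like this; the chart version number below is only an example, check artifacthub.io for current ones:

```bash
# Add the Bitnami chart repository and refresh the local index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Find nginx charts
helm search repo nginx

# Remove the manually deployed version first
kubectl delete -f nginx.yaml

# Install the chart into its own namespace, creating the namespace if needed
# (--version pins the chart version; omit it to get the latest)
helm install nginx bitnami/nginx --version 15.0.0 \
  --namespace nginx --create-namespace

# Only the pods in that namespace
kubectl get pods -n nginx
```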
You want to deploy WordPress? There is a chart for that, instead of you trying to find the right image and writing the YAML yourself. Which is nice at the beginning, you want to do that to learn, but at some point you're just like, hey, I'm just gonna helm install my application and it's gonna be there. And that tool is great: if I go back to pods... actually to pods... oh, I'm in the nginx namespace, and I can also search. I'm gonna search for everything that is "ngin" (I don't know how to spell nginx most of the time), nginx, and now it lists everything. So that tool is really amazing. But here's the other thing: remember when I did the port-forward, it was blocking my terminal? It's a little bit annoying, and you can only do one port-forward at a time with kubectl, unless you run the port forwarding as a background task, and that's just a pain in the ass, to be honest. So here's what I'm gonna do: I'm gonna go into the services, because I have a load balancer now (actually not this one, let me go to the services), and I'm gonna go to nginx. Now you can't see it because that interface is not great when you zoom in, but there's a shortcut, Shift-F, on my load balancer, and now I can say, you know what, the load balancer will be on 8080, compared to the server running on port 80, and I say I want to port-forward. And I can keep doing other stuff after that: I can go back to namespaces and the port-forward (actually, namespaces; I don't know how to type today) is still working. So if I go here again, I have nginx working again. Another amazing application that deserves a round of applause; now that is the most beautiful application. So anyway, that tool is really amazing because it lets me do that, and here's the other thing it helps me do: if I go back here, and I want to understand what that pod is, I can press d and it describes that pod, so every piece of information about the pod is there. I can do everything in it: I press l and I get the logs from that pod. Everything is within that tool. So that's k9s; as I said, it's one of the tools I love the most. To be honest, right now I don't remember half of the kubectl commands I'd need to do the same kinds of things, because now I use k9s. It was still important to talk to you about kubectl, because that's the basics, that's what you need to be successful, but k9s helps you navigate your cluster and get the information. I can even kill a pod: I press k and it kills the pod, and the pod is already... oh, is it k? I can't see; I think it's k... no, it's Ctrl-K, something original. So I can kill a pod here (I'm gonna show you a bit more how that works), but you're gonna see the pod restart again, and that's how it works. So k9s is one of the tools I like the most. The main contributor is an Italian guy, super nice, super sweet... no, actually, that's a different project; I don't know the main maintainer of this one, but it's just a great tool, and it's open source. Again, it's a CLI: you need to get used to the CLI, to the terminal. I don't know about you,
I'm a big terminal lover; I love everything I can do on the command line. Not everybody does, and I respect that, but the thing with Kubernetes is that you don't really have a choice, because most of the tools that will help you be successful in the cloud native ecosystem run in the terminal. You have some UIs that can help you do things, there are some actions you can do within your public cloud provider, but again, most things run in the terminal. So that was the third tool that should be, or could be, your best friend if you decide to go the cloud native way. Actually, let me go back here: there's another tool, it's not in the talk, but because I use it I don't even remember how to switch context manually anymore. What's great is that usually you don't have just one cluster, you can have multiple clusters, and the way to connect most of the tools from your CLI to your cluster is a file called a kubeconfig, which holds the information about your cluster and how to connect to it. As an example: earlier I created a cluster on Civo that should be ready now, called civo-demo. There are two ways I could get the kubeconfig: there's a download button here, but since I have the Civo CLI installed, there's a command called civo (you don't have to know it, this talk is not about Civo, but just to show you): I do civo kubernetes config civo-demo, which is the name of my cluster, and save the kubeconfig, and you'll see my prompt change: now I'm connected to my civo-demo cluster, the one I created before. Let me move that up a little. If I go into k9s now (my firewall is gonna ask me, is it okay? it's okay), you'll see it's a little different, because now I'm connected to the Civo cluster I created before. If I want to switch, there's a really nice tool called kubectx. I could also have typed kubectx kubefirst-fred and it would switch the context for me to my other cluster, or there's that little UI thing where I can say, hey, I'm gonna connect to another cluster I have, called kubefirst-fred. And here, if I go into k9s (it's a cluster that I created with kubefirst), I see there's a lot more in that cluster. That was just to show you the k9s interface a bit more: now I have all those namespaces, we install ChartMuseum, Crossplane, cert-manager, Vault, Argo CD (which is a CD pipeline that can be really useful for your cloud native journey), so, just to show you the interface, but also to set up my next demo.
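For reference, the cluster-switching part of the demo, approximately; the cluster and context names are the ones used on stage, and the exact context names in your kubeconfig may differ:

```bash
# Merge the Civo cluster's kubeconfig into your local config
civo kubernetes config civo-demo --save

# See which contexts exist and which one is active
kubectl config get-contexts

# Jump between clusters with kubectx, or with plain kubectl
kubectx kubefirst-fred
kubectl config use-context civo-demo
```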
I wanted to finish with, hopefully, something fun, something I like, and tell you a little more about the self-healing part of Kubernetes. Do you know what chaos engineering is? It's creating chaos within your ecosystem: a way to test whether your system keeps working when some shit happens. I think it's Netflix who coined the term. Basically, they just killed services randomly, they removed things in the firewall, they did a lot of things you should not do on your servers, and by doing that, preventively (is that the word in English? yeah), by doing it before it happens for real, it helps you see if your system will still work, or at least handle the issues well. Instead of finding out when there are millions of people watching a show, you know beforehand: if part of the system goes down, say my API server for getting the description of a movie, or the list of movies you have access to in your country, what happens? Is everything on fire, or is there a service B that can take over and handle it? So it's about creating chaos in your ecosystem to understand where the points of failure are. A really nice thing to do, and a painful thing to do too, because it also shows where you have weaknesses in your ecosystem. With that said, I'm gonna use another open source tool called Kube Invaders, and before I do that, let me go back into k9s. Actually, let me try to zoom enough. Okay, can you see well enough in the back, like that? What you're gonna watch here is the status of the pods running in my kubefirst-fred Kubernetes cluster. I'm gonna use Kube Invaders, which I installed on that same cluster before the talk, and I love that thing. It's a game, like Space Invaders, but the invaders are the resources within (in this case) the development namespace that I selected. Now I have all those... I don't know if you can see them, it's not zooming for whatever reason, but you see those little aliens at the top, and I have my spaceship here, and I'm gonna kill part of the resources in my cluster. But watch k9s. Hopefully... no, that's not a good example, I don't know where I am right now. Let me go to (that is so big) let me go to development, and okay, now I'm seeing all the pods. Let me scroll; hopefully I'm gonna kill the ones that are in the viewport here. I'm gonna be a bad guy and kill some of the resources, and you can see them in Kube Invaders: they just died, poor little kubefirst resources. But look at k9s, right? I'm gonna try to kill a few more so you can see them: some of those resources, the pods (actually, those are all pods, I'm showing you all the pods), are being killed, they're being terminated, and Kubernetes is just taking over, saying, hey Fred, something's wrong, and it's starting the pods again. So in the end, if I go back here (I killed a lot of them, I think I was not bad), you'll see most of them should be running now. Obviously some show Completed, because they were just things that needed to do some work and aren't needed anymore, but the ones that were running are still running, and I killed them, I was really not nice with them, and they're back. It was fast because these are smaller applications; it can take a little more time. What's great is that in a case like this one: I count one, two, three, four, five, so I have a ReplicaSet with five development pods; actually four, because the other one is the API.
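As an aside, if you want to poke at the same self-healing behavior at home without Kube Invaders, here's a minimal sketch using the nginx deployment installed earlier:

```bash
# Give the deployment a few replicas so there is something to survive
kubectl scale deployment/nginx --replicas=3 -n nginx

# In another terminal, watch the pods
kubectl get pods -n nginx -w

# Delete one pod by hand (use a real pod name from the watch output);
# the ReplicaSet notices and starts a replacement within seconds
kubectl delete pod nginx-abc123 -n nginx
```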
So if I kill two or three of those, the way I created my Kubernetes cluster, Kubernetes is like, life is good, I still have pods running, and it keeps sending people to the right pods. This is an example of the self-healing and high availability of Kubernetes that I really like to demo. On that note, I hope you're not thinking, "Fred, I don't know if I understand Kubernetes"; and it's kind of okay, I made peace with the fact that not everybody at the end of this talk will be like, okay, now I know Kubernetes, because again, I showed you a small part, we only had an hour together. But it's only the beginning: hopefully you understand a little more what it is, how to deploy an application, how to get started with the basic things you need to do. And the good thing is that, as with any complicated technology, you start at the beginning, you try it a little bit, you mess up, it's fine, you create another cluster and try again. I don't know how long you've been in tech, but it's a bit more painful than just connecting with an FTP client to my server and uploading a new version, which is what we were doing a long time ago. That was fast, that was good, but if something was not working on the server, that was terrible, that was a pain to fix. Right now, if something's not working, I create a new build, I delete my deployment, I deploy it again, because everything runs within my containers; it's a lot easier. So, some resources: kubernetes.io is your main place, it's the official website, the documentation is there. I know some people think, hey, I'm gonna read the docs and I'm gonna be a happy person; no, you won't be a happy person, those docs are so huge. They're good, but they're so huge. So I suggest, if you want to know more, try it, or find some tutorials online, a Udemy course; there are a lot of good things out there. If you need help, if you're a little bit lost in your journey, there are three main Slack workspaces: the CNCF one and the Kubernetes one (yes, there are two: the CNCF Slack has all the projects, but there is a Kubernetes channel; I think it may be a bit better to go to the Kubernetes Slack itself because there are channels for specific Kubernetes topics), and the last one is the kubefirst one. Even if you don't use our product, we have a Kubernetes channel; we're just friendly people if you need help. Sometimes asking a question in a channel with a couple of thousand people, versus ours with three or four hundred members, is a little nicer; traffic in our Kubernetes channel is really low because most people are there for kubefirst questions, but anyway, we're friendly folks. The tools I shared today: kubectx, which is also an open source tool (I didn't put the link here because I didn't think I was gonna use it, just look for kubectx on GitHub), k3d, k9s, kubectl (again, the docs tell you how to install it when you read them), Kube Invaders, which I used, Helm, and I put kubefirst here, and this is my one-minute product pitch. What you've seen before is
the cluster that I created with kubefirst. The easiest part of Kubernetes is creating your cluster; it's everything after that: you need something to manage your secrets, something to manage your CD pipeline, something to manage certificates, something to manage external access, something to manage infrastructure as code if you do that. You need all those things just to be able to deploy your application. kubefirst is free and open source, and it helps you create a production-ready cluster on most public clouds, with all those tools already installed and pre-configured, and it also uses something called GitOps. Speaking of the CD pipeline: if you're still here tomorrow, I'm giving another talk in this exact room at 11:45, about Argo CD. Argo CD is, again, an open source, cloud-native CD pipeline that you'll want to use if you go on the Kubernetes journey. On top of that, it's built with GitOps in mind, which is one way to manage your Kubernetes cluster, and I'll talk a bit more about that then. So there's a talk about this tomorrow. And on that note, my name is Fred. I think I have time for maybe, I don't know, one question, but if we don't have much time for questions, I'll be right outside the room after my talk. If you have a question, you can also send me an email at fred at kubefirst dot io. Doesn't my cat look nice? It's one of my cats; she's so nice. Anyway, there's always Twitter, or connect with me on LinkedIn if we don't have time to chat, or if you try kubefirst for Kubernetes in the future. And yes, the names are so close; that's the thing you need to get used to in the Kubernetes space: every tool is k-something or kube-something, so sometimes you just lose track of the names. Anyway, I offer free coffee chats that you can schedule at fred.dev/coffee: a 30-minute video call with me, and we can talk about Kubernetes, Argo CD, cats, whatever you want, just to get to know each other. I'm a friendly person; I know I don't look like it, but I'm a friendly person. So on that note: hopefully it was not too mind-blowing, hopefully it was a little bit helpful, and thanks for your time; have a good rest of the conference, everybody. And actually, I'm gonna be respectful of the next speaker. I know there are 15 minutes, but it's always annoying when the previous speaker stays here and takes all the time, so I'm just gonna grab my stuff and go outside the room; if you have any questions, comments, or insults, I'll be there. Yeah, I think let's begin; I think it's time for this session. Hi everyone, my name is [inaudible], and today my teammate Sean and I would like to talk about cloud native data and model access management for machine learning in the AI domain. First, I'd like to give you folks a quick introduction to Sean and me. Currently I'm working as a scientist at Alluxio, and I'm also a Presto committer, contributing to the open source distributed SQL engine. Sean is currently a software engineer at Alluxio, and he is also a committer on the Fluid project. Here is the agenda of our session today; generally, we have a pretty long list of points we want to cover. First, I want to talk about machine learning in the cloud: what is the current pattern, what is the current architectural design for it?
Then Sean will discuss how to access data and models in the cloud; he will present the existing solutions and evaluate the pros and cons of each design. Then he will propose a new design with Alluxio for unified data and model access, and he will also talk about the cloud native Kubernetes operator and the CSI FUSE driver, all on Kubernetes. Then I will come back and talk about data access management for PyTorch and Ray, and see how unified data access can help improve training efficiency. Finally, we will share some use cases that we have learned from partnerships. So let's begin. The first topic we want to cover is machine learning in the cloud. Here is the general architectural design of a machine learning system in a hybrid cloud or multi-cloud setup. There are generally three parts in this diagram. On the very left, there is a training platform, and it could be in an on-prem data center or in a cloud. In the middle is the storage layer; it could be different object stores, or, say, HDFS as a kind of archival storage. Generally, during training, the training cluster reads the data from the remote storage and runs the necessary training jobs in the training cluster, and after the training job is completed, it writes the models back into the remote storage for model storage. Then, on the right side, there's a serving platform: for serving, they fetch the models from the remote storage and serve them in the serving cluster. There is a very clear separation of compute and storage tiers in this kind of design: the training cluster could be in one data center or in cloud A, and the serving cluster could be in another data center or in another cloud. So that's the design. Here we want to talk about the data and model access patterns: in the machine learning domain, what are the traffic patterns for accessing the data sets and the models? Here is a pretty large table, and we want to focus mainly on two parts of it. The first one is about model training: what is the access pattern for that? Generally, we put the workloads into three categories: computer vision jobs, natural language processing jobs, and checkpoint writes. The first two types are mostly reading the data set. For computer vision, it is mostly images or videos, so the typical pattern for this type of data set is a very large number of small files, and generally the files are read sequentially. For natural language processing, what we have observed is that there may not be that many files; there could be a medium number of files, but those files can be extremely large, and generally during training the access pattern is not a sequential read but mostly random reads, especially when reading these data sets in parallel. Generally, on the training side, our target is to maximize the throughput, maximize the read performance from the storage, and maximize the GPU usage. So what about the model access pattern, especially as used in serving?
What about the model access pattern, especially the one used in serving? In this diagram we zoom into the model access pattern a little bit. We have model deployment and model inference, all in the model serving domain. Generally this type of access is still reads, and mostly sequential reads, because we need to read a whole model into the serving cluster. But for this workload our targets are different: we want to minimize the latency to load the model, and we want high concurrency in the model loading. With that, I'll hand it over to Shawn to talk about how to access the data and models in the cloud, the existing solutions, and their trade-offs.
So let's first talk about some existing solutions for accessing data and models. For data access, we can obviously read directly from cloud storage. Second, before the training job we can copy the data set from the cloud to the local machine. The third way is to use a local cache layer, and fourth, we can use a distributed cache system. For model access, the most common way we see is to pull the models directly from cloud storage after training is done and the model has been written to the cloud.
Now let's zoom in and look at them one by one. For the case where we always read from cloud storage: this is the easiest to set up, but the performance is not ideal. For model access, because there can be multiple servers deployed for the same model, each time a server starts the model has to be pulled from cloud storage again. And for training data — because, as my colleague mentioned, the training data can be lots of small files — we sometimes see reading data take more time than the actual training. This is a screenshot from one of our experiments with PyTorch: the data loader actually takes 82 percent of all the time, and we obviously don't want the bottleneck to sit on the IO path.
Now, to make reading data faster, instead of reading from the cloud we can copy the data to local disk before training. Then we get much faster access and lower cost, because the data is already there — we don't have to read it again across different epochs. But then data management becomes a hard problem: local disk space is limited, so after a data set is used or a model is outdated we must manually delete it, otherwise there is no space for the next batch of training data. The local storage is also limited in another way: data sets keep growing, and they grow fast, so the local disk may not be able to hold all the files. So although some data can be accessed faster, this approach is quite limited.
If we use a local cache layer for data reuse — for example, S3 has S3FS with a built-in local cache, and Alluxio has a FUSE SDK — then reused data stays local. After reading once, the data is cached locally, so when we reuse it we get faster access and lower cost, and this cache layer can also help with data management (a generic sketch of the idea follows below).
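As one concrete illustration of the local-cache-layer idea, here is a small sketch using fsspec's generic `filecache` wrapper. This is a generic example, not the specific S3FS or Alluxio FUSE caches named in the talk; it assumes the `fsspec` and `s3fs` packages are installed, and the bucket name and paths are placeholders:

```python
import fsspec

# Wrap a remote filesystem (here S3) with a local file cache.
# The first read pulls the object from the cloud; later reads hit local disk.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",                        # the remote store being cached
    target_options={"anon": False},              # normal S3 credentials
    cache_storage="/tmp/training-data-cache",    # local cache directory
)

# Placeholder path -- substitute your real bucket/key.
with fs.open("my-bucket/imagenet/train/shard-0001.tar", "rb") as f:
    first_bytes = f.read(1024)

# The cache layer tracks the downloaded copies for you, so you are not
# hand-managing a directory of manually copied files.
```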
For example, the cache layer can do cache eviction once the cache is full, so there is no more manual deletion of data or constant supervision. But the same problem remains: the cache space is limited, because we are depending on the local disk.
We can also use a distributed cache system. In Alluxio 2.x we used a traditional master–worker architecture: the data is cached on worker nodes and the metadata is kept on master nodes, so whenever a client needs to read data, it first asks the master where the data is and then asks the worker for the data. We can keep both training data and models in the cache system on the worker nodes, so this is a nice unified solution for reading both training data and trained models. A cache system also gives some additional data management functionality: for example, we can preload data, and we can pin data so it won't get evicted, and more functionality like that.
But then we have the problem that the masters — although we do have high availability based on Raft — are still a single point of failure. What I mean is: if the masters are down, a client that needs to read data cannot get the metadata, i.e. where the data lives on the workers, so there is no way for the client to get the data at all. The masters are a single point of failure, and as the number of files grows quickly this problem becomes more and more severe and makes the masters the bottleneck of overall performance.
So, to sum up, there are a few challenges in accessing data in the cloud. The first is performance: pulling data from cloud storage every time hurts both training and serving. The second is cost: repeatedly requesting data from cloud storage is costly — both metadata and data APIs cost money every time we read. The third is reliability, because availability is key for every service in the cloud. And data management is another problem where manual work is unfavorable.
Now let's talk about the new design with Alluxio. Earlier we saw that the master is the bottleneck for performance — both because retrieving metadata from the masters is slow when there is a huge number of files, and because of the reliability of the masters themselves. So now we use consistent hashing to cache both data and metadata on the workers instead of keeping the metadata on the masters. With consistent hashing, the client can calculate which worker holds the data just from the file name. The workers have plenty of space for cache because we can scale them out pretty easily, so training data and models only need to be pulled once from cloud storage; whenever training or serving needs them again, it can read them directly from the workers.
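The idea that every client can compute, from the file name alone, which worker owns a file is consistent hashing. Here is a toy Python sketch of that mapping; it illustrates the concept only and is not Alluxio's actual implementation — the worker names and path are made up:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy hash ring: maps a file path to one cache worker.

    Every client computes the same mapping locally, so no central master
    has to be asked "which worker holds this file?".
    """

    def __init__(self, workers, vnodes=100):
        self._ring = []                        # list of (hash, worker) pairs
        for w in workers:
            for i in range(vnodes):            # virtual nodes smooth the load
                self._ring.append((self._hash(f"{w}#{i}"), w))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path: str) -> str:
        # First ring position at or after the path's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
print(ring.worker_for("s3://bucket/imagenet/train/img_000001.jpg"))
```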
So we spend less on cloud requests, and there is no more single point of failure: the master no longer serves any traffic, so we get better reliability, and there is no more performance bottleneck on the masters. This greatly increases read performance, and we keep the data management features.
Implementing this in Alluxio 3.x, we get higher scalability: one worker can easily support 30 to 50 million files, they scale linearly, and we can easily support 10 billion files, which is pretty common nowadays in machine learning and AI workloads. We also achieve high availability — we can guarantee four-nines (99.99%) uptime — there is no single point of failure in the system, and we get faster data loading in the preload stage. For cloud native Kubernetes scenarios we also have an operator and a CSI FUSE driver for data access management, so now let's talk about those two parts.
First, the Alluxio operator. In this diagram, at the bottom we have cloud VMs with Kubernetes installed on them, and on top of that we have the Alluxio operator, which manages the life cycle of Alluxio clusters and data sets. On top of the operator is an Alluxio cluster, and on top of the Alluxio cluster we have the training frameworks, which read data from the Alluxio cluster; the cluster really sits outside of the compute side.
The Alluxio cluster CRD — the custom resource definition — follows the Kubernetes operator pattern. On the user side, they first create the Alluxio cluster and data set CRs, which basically means they submit a configuration to the Kubernetes API server, which then forwards those configurations to the Alluxio operator. Upon receiving these CRs, the Alluxio operator will in turn modify and manage Kubernetes resources: creating an Alluxio cluster, monitoring the Alluxio cluster, creating a data set bound to Alluxio, and doing the corresponding work depending on the current state of the cluster. Then we repeat this cycle — we call it reconcile: repeatedly compare the current state of the cluster with the desired state, and if there is a mismatch, act on it. With the Alluxio operator we can achieve zero-downtime upgrades, high availability, and scaling.
So with the operator we achieve a fully managed cache. With a simple piece of Python code we can easily preload data — either training data or models — into the Alluxio cluster, and data gets evicted when the cache space is full according to the eviction policy, so there is no more manual work of deleting existing or unused data. The Alluxio workers take care of reading from and caching the cloud storage whenever there is a cache miss.
Now let's talk about the Alluxio CSI FUSE driver. First, Alluxio FUSE: Alluxio FUSE is able to expose the Alluxio file system as a local file system, so users can access cloud storage just as if it were local storage. For example, we can simply do a cat or an ls inside the directory, and we can open a file with simple Python code — we can treat the files in cloud storage just like local files. This has very low impact on the end users. Here is a screenshot: this is basically my S3 bucket, and you can see that we call ls and it just shows up in the terminal like a local file system (a tiny illustration of the same idea in Python follows below).
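To illustrate what "treat cloud files as local files" looks like from the training code's point of view, here is a tiny sketch; the mount point below is an assumed example path, not a fixed convention, and the file names are placeholders:

```python
import os

# Once a FUSE layer has mounted the remote store at a local path,
# training code can treat cloud objects as ordinary local files.
MOUNT = "/mnt/alluxio-fuse/my-bucket"   # hypothetical FUSE mount point

# Equivalent of `ls` on a cloud "directory".
print(os.listdir(os.path.join(MOUNT, "imagenet/train")))

# Plain POSIX read; the FUSE process fetches (and caches) the bytes.
with open(os.path.join(MOUNT, "imagenet/train/img_000001.jpg"), "rb") as f:
    data = f.read()
```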
Now, CSI: CSI is the Container Storage Interface for Kubernetes. Before CSI existed, every storage provider — for example AWS EBS and others — had to write their own code implementing the Kubernetes volume interface so that Kubernetes pods could use them as storage. The problem was that their code could only ship with a Kubernetes release, which made the collaboration cumbersome, because one side had to follow the timeline of the other and there was a lot of communication in between. After CSI came in, CSI is basically one type of volume: if anyone implements the CSI interface, they can just plug in as a CSI driver and become a volume for Kubernetes pods. That separates the development cycles and makes both the implementation and the release much easier for both sides. There are now more than a hundred existing CSI drivers — AWS EBS and others — and of course we have the Alluxio CSI driver.
If we combine them together for data access: Alluxio FUSE is able to turn a remote cloud data set into a local folder for training, and CSI is able to launch the Alluxio FUSE pod only when the data set is actually needed. Which means: if training hasn't started, we don't need the FUSE process to be sitting there taking resources that the cluster maintainer doesn't want to waste.
With this setup we have three layers of caching. First, with the kernel FUSE we have the kernel cache — this is the fastest. Then, with Alluxio FUSE, we have the local cache on the training node. And lastly, we have the Alluxio distributed cache that sits inside the compute cluster, serving data from cache — still much faster than reading the data from the cloud.
On the right side, this diagram shows how an application pod can access data through Alluxio FUSE. On this host machine, the application pod is basically the training pod, where the training Python — or whatever training it is — runs. It mounts a persistent volume through a persistent volume claim, so it can see the data inside that persistent volume claim. Then the Alluxio FUSE container runs the Alluxio FUSE process, which mounts the data inside Alluxio onto the host machine, so through this two-step mount the application pod is able to see the data exposed by the Alluxio FUSE process.
All right, now I'll hand it back to my colleague to talk about the data access management. — Thank you, Shawn. So, here I come back again. Shawn has talked about a unified caching layer for data and model access; how would that actually help in production use cases? Could we get some results from evaluation? That's my job here. There are some experiments we have done with this new design, and we have already gotten some significant results we want to share today. The first set of experimental results is about integrating this kind of solution with PyTorch.
Here is the general design of how we integrated Alluxio with PyTorch for training. On the very left side there is a training node — for example, a PyTorch cluster running the training job — and a cache client installed on this training node alongside PyTorch. The cache client has an affinity block location policy, so when a piece of data is needed by PyTorch it knows which target cache worker to talk to in order to get that file. We have a service registry in the middle, in the cache cluster, for service discovery: the cache client gathers cluster information from there and then talks to the specific cache worker to get the data file. If that file has been cached in the cache worker, the file is returned directly to the cache client and fed to the PyTorch training node; if it is not there, the worker gets the file from the underlying storage — a different object store or whatever storage medium — and returns it to the cache client. So that's the general design for integrating Alluxio with PyTorch, and we want to note that this is just an example with Alluxio; if we used another type of cache, the idea would be the same.
Here is some data loading performance we have collected. As we discussed, there are two categories of data sets for training — computer vision and natural language processing — so we compare those two types here. On the left side we compare computer vision training data loading IOPS. The data set is the very popular ImageNet (a subset of it), and the result is that Alluxio FUSE has the highest performance, then S3 FUSE, then AWS S3; for AWS S3 we used the Boto Python API. On the right side, we compared different types of APIs for loading a natural language processing data set — we used the Yelp review data set — and we observed that the Alluxio RESTful API and the Alluxio S3 API provide better performance than the others.
Also, when we train, we really care about the GPU usage of the training job, because GPUs are a very important resource nowadays and we want to keep their utilization high. Here is an example where we ran the PyTorch ResNet benchmark on the ImageNet data set and used S3 FUSE to fetch the data from S3. In that run, the GPU usage is only around 17 percent; most of the time in the whole training run is spent loading the data set from S3. What you can see in the middle green block is that around 82 percent of the total training time is spent in data loading. With Alluxio FUSE we can improve that dramatically: same training pipeline, just with Alluxio FUSE instead, and the proportion of time spent in data loading drops from 82 percent to only around 1 percent, while the GPU usage improves from 17 percent to 93 percent — roughly a five-times improvement. So that's an example of integrating this kind of data access solution with PyTorch.
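If you want to check how much of your own training loop is spent waiting on data — the 82 percent number mentioned above — one rough way to measure it with a plain PyTorch DataLoader looks like the sketch below. This is an illustration, not the experiment from the talk; `dataset` and `train_step` stand in for your own dataset and training step:

```python
import time
import torch
from torch.utils.data import DataLoader

def profile_loading(dataset, train_step, batch_size=64, num_workers=8):
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    load_time = compute_time = 0.0
    t0 = time.perf_counter()
    for batch in loader:
        t1 = time.perf_counter()
        load_time += t1 - t0            # time spent waiting for the next batch
        train_step(batch)               # forward/backward/optimizer step;
                                        # for GPU work, call
                                        # torch.cuda.synchronize() inside it
                                        # so this timing is meaningful
        t0 = time.perf_counter()
        compute_time += t0 - t1         # time spent on the training step
    total = load_time + compute_time
    print(f"data loading: {100 * load_time / total:.1f}% of the loop")
    print(f"compute:      {100 * compute_time / total:.1f}% of the loop")
```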
And here we want to give you folks another example: integrating it with Ray. Before we jump into the experimental results, a quick idea of what Ray is. Ray is gaining popularity in the machine learning domain nowadays. It is designed mainly for distributed training and distributed compute: it uses a distributed scheduler to dispatch training jobs to the available workers, which could be CPUs or GPUs. As users, we generally just write what looks like single-threaded code, and Ray interprets it and distributes the job across the cluster. So Ray helps us scale training jobs horizontally across multiple nodes, and it also provides a data abstraction, so we can very easily write Ray jobs for distributed machine learning training.
Here is how we integrated Alluxio into the Ray ecosystem. Ray, as a unified distributed compute layer, works with different machine learning training frameworks — PyTorch, TensorFlow, and so on — and it mainly has the Train, Serve and Tune modules. Usually the data and the models are stored in the storage layer, the very bottom layer of this diagram, and Alluxio can provide a unified high-performance data access layer to read the data, preload the data and models into Alluxio, and feed them to the training or serving frameworks above.
This is how we integrated Alluxio — the unified caching layer — into the Ray ecosystem. In Ray, data loading is handled by a module called Ray Data, and Ray Data uses PyArrow to read the data files. PyArrow in turn uses an fsspec-style interface to talk to different data sources, so what we did was create an Alluxio fsspec implementation: we implemented interfaces like reading a file, opening a file, listing a directory, and so on. This implementation talks to the Alluxio Python client, and the Alluxio Python client talks to the RESTful API servers on the worker nodes of the Alluxio cluster. We also have an etcd cluster here: the Alluxio workers register their worker information into etcd, and the Alluxio Python client talks to the etcd cluster to get the worker addresses. So when Ray wants to load a specific data file, the Python client can determine which worker to talk to in order to get that file. So this is the design.
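The integration hook described here — Ray Data reading through PyArrow, which can sit on top of an fsspec filesystem — can be sketched as a custom fsspec backend. The class below is only a skeleton showing where such an implementation plugs in; the `cachefs` protocol name and the `client` calls are hypothetical, and this is not the real alluxiofs package:

```python
import fsspec
from fsspec.spec import AbstractFileSystem

class CacheBackedFileSystem(AbstractFileSystem):
    """Hypothetical fsspec backend that answers ls/open from a cache cluster."""

    protocol = "cachefs"                 # illustrative protocol name

    def __init__(self, client, **kwargs):
        super().__init__(**kwargs)
        self.client = client             # would talk to the cache workers

    def ls(self, path, detail=True, **kwargs):
        return self.client.list_directory(path)      # hypothetical call

    def _open(self, path, mode="rb", **kwargs):
        return self.client.open_file(path, mode)     # hypothetical call

# Register the protocol so fsspec.filesystem("cachefs", ...) resolves to it;
# PyArrow (and therefore Ray Data) can then read through this filesystem.
fsspec.register_implementation("cachefs", CacheBackedFileSystem)
```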
So how about the results — could we get some improvement with this kind of design? Here is a benchmark on small files; "small files" here mainly means small images. The experiment was done on the ImageNet data set — we used the 130-gigabyte ImageNet data set — and the settings are on the right side: four Ray train workers, nine processes reading, and in general the active object store memory ranged from a few hundred megabytes up to four gigabytes. On the left side we show a diagram of the throughput, the images per second that we can load in Ray. We compare reading directly from S3 (the blue bars) with using Ray to load the data set from Alluxio (the red bars). What we observed is that when the object store memory is pretty high — say four gigabytes — reading directly from S3 and reading from Alluxio have pretty similar performance, but when memory is limited — say only a few hundred megabytes for Ray's object store — Alluxio gives higher throughput.
So that's small files, but that's only one type of data set. What about files that are pretty large and in different data formats, like the very popular column-oriented Parquet format? Here is another benchmark we did: the data set is still ImageNet, but we packed it into large Parquet files — each Parquet file is around 200 megabytes, and in total there are around 60 gigabytes of data. There were 28 Ray train workers, 28 processes reading, and what we observed from the diagram is that loading the Parquet files from Alluxio into Ray generally performs better than loading them directly from S3 — Alluxio is the red line and S3 is the blue line.
Also, if you folks remember, Shawn also talked about cost. Say we are in a multi-cloud setup: we have a training cluster in an on-premises environment, but cloud storage for the data sets and the models. Then every time, for training or for serving, we read the data sets or the models from the remote storage, and that causes an egress cost.
That means data transfer fees charged by the cloud provider. Generally, with this kind of caching layer we can also reduce the egress fees. Here there are two diagrams, mainly about the cache itself: one shows a cache miss and the other a cache hit. They show that if we can achieve a pretty high cache hit rate, we can save a large number of API calls to the cloud, because with a high cache hit percentage all the calls — the list-status calls, the get calls, all those API calls — only hit the caching layer instead of the cloud storage. The caching layer has already cached, or preloaded, the data sets and the models, so it can feed the data and models back to the training cluster or the serving cluster directly. That saves a very large number of API calls on the cloud side.
Finally, I want to talk about some use cases. This is not something brand new — we have collaborated with different companies on this kind of design and this type of evaluation — and here we want to share a very practical use case, a very practical story of how this approach fits into a production-level machine learning pipeline. Here is a diagram of the basic architectural design of a real machine learning pipeline at a company — this is their previous design, before they used the caching mechanism. They have two different model training clusters. One mainly runs PyTorch jobs, and the PyTorch jobs fetch the training data from object storage. The other training cluster runs Spark jobs, and the Spark jobs get their training data from HDFS. Both types of training jobs — PyTorch and Spark — write their models back to HDFS. That's the model training side. Then, on the right, there is the model serving side: they have a model deployment cluster that has to fetch the latest models from HDFS and serve them for continuous serving — model inference and so on. They have different applications in the company, and those applications send API calls to the model serving cluster to use the models for prediction and forecasting.
There are three issues they reported with this previous design. The first is very low GPU utilization on the PyTorch side, because they have to repeatedly read the data from object storage. The next problem is the overloaded storage on the HDFS side: so much traffic hits it that the HDFS cluster became a bit unstable during that time. And the third problem is network congestion, because the model deployment cluster very frequently, repeatedly loads the models from HDFS, which caused network congestion in their data system.
This cost some like that network congestion in there in their like this data system and Then what we have done with them is that through the collaboration we Adopted like a lasso into their production environment and it solved those problems say for the pytorch Training jobs and spark training jobs We they both now have a lasso as a unified data access layer to help to Cache the training data and feed the training data to the training jobs And also for the model serving side We also have a lasso as a model cache that it can preload the models into a lasso As a cache layer and feed the models to the the the model deployment cluster So what we can we have observed from this production environment is that we can get higher GPU usage Around like two times. There's no network congestion now for the Model serving and also we have seen like a faster Model rollout here. There's around like ten times compared with the previous design Yeah, I think that's that's all for our like session today any questions or anything that you want to ask us It's open. Yeah, sorry. I didn't quite understand how the data path works from your caching layer to To the worker products like the the pod process. Is that going over a network? Or is it going through like I like the actual IO subsystem of the system like very basically Are you copying to some sort of block device on the system when you're caching and then that's being loaded by the pod? And in which case, how do you make sure that? the data that that particular pod is training on you had like You know multi replication of the cache layer But it looked like maybe as possible for a pod over here would be trying to load a piece of data That's on the cache on this other worker. Is that correct? Let me repeat Yeah, let me repeat your question first So from my understanding your question is I say we have a large number of worker nodes and on the same time we may have a large number of training nodes and then the training data may want a Piece of data set and then we have a worker to help it to load it And then there's maybe another training training training pod that may also need this data set and acts another worker to To load it so there could be some like duplication or miss a communication between those those nodes And you want to know how we handle this kind of problem. Am I understanding it correctly? Yeah For that. Yeah, I think the second question is about like how we communicate What is the communication protocol or the channel between the the worker the the cache worker node and the training node? Okay. Yeah, so I can help you. 
I can answer the first question first, about the architecture between the training node and the cache worker node. The general idea is that even with a large number of cache workers, any single data file is stored on only one specific worker, unless the customer configures replicas for that file. Say we have ten training nodes; each training node has a Python client — and this actually answers your second question — the communication happens over HTTP: the Python client talks to the RESTful servers on the cache worker nodes. On the Python client side it uses consistent hashing: the worker nodes are placed on a hash ring, so for one data file, stored on exactly one worker, the Python client can compute which worker stores that specific file and talks directly to that worker node. That's the basic mechanism.
Is this mic on? I have two questions. One is about the performance of S3: I would have thought you would also improve on the smaller file sizes there, so why is it not improving — is that basically because Ray itself caches, because it has enough memory? That's the first one. The second is more curiosity — it's very technical, but you could have a lot of these pods creating claims: are they all creating the claims against the same volume? So you create one volume and they all have claims against that volume?
Could you repeat the first question? — The first question was: when you were comparing the performance of S3 for the two different values of memory on Ray, why doesn't it improve when you have large memory on Ray? Is that because Ray itself caches? — Oh, I got your point. I think the second one is for you, Shawn — did you get that?
Yeah, I think I can answer the first question — I think you're asking about this first case, this slide. So your question is: when memory is pretty limited, Alluxio has higher performance, but why, when we have a large amount of memory, is the performance pretty similar? From our experiments, it's because Ray also has its object store, as you mentioned — Ray's object store can be considered another type of cache, so it caches the data there. Alluxio helps most when memory is very limited, because Ray's object store mainly keeps the data set in memory, and when memory is limited the data gets spilled to disk, which hurts performance. On the Alluxio side we mainly use SSDs for the data storage, and reading from SSD can be pretty fast.
Yeah, so that's the reason.
Could you repeat the second question as well? — It's more of a curiosity, it doesn't have anything to do with performance, but you have those pods that access Alluxio and create these PVCs, the claims on the storage: are they all creating different PVCs on the same PV?
So the PV and PVC are a pair, a one-to-one pair. When Alluxio is created, we create this PV and PVC pair for it, and the application pods just need to mount that PVC into themselves. Under the hood, the PVC here is just a path on the host machine: there is a mount between the Alluxio FUSE pod and that host machine path, and the same path is also mounted into the application pod. They are the same thing, and that's how the two pods can communicate with each other.
Yeah, maybe I didn't catch it — for the SSD layer, for your storage there, are you using LRU or some other cache scheme? And have you analyzed the impact on the SSD drives — the burn rate — and what that costs? Basically, are you burning SSDs in order to do this?
Yes, I think the question is mainly about the cache eviction policies in this design, and also about why we chose SSD, right? For the first question, the cache eviction policies: yes, we have different choices here — LRU, FIFO, LFU, and so on — but what we have observed in current use cases is that the very basic mechanism, LRU, already works pretty well in production, so at this moment we haven't implemented more complicated mechanisms, because the straightforward approaches have worked well. The second question is about the SSDs: why we chose SSD over, for example, memory comes from some very practical observations of industry-level machine learning workloads. In training workloads we have different training nodes, and they usually have both CPUs and GPUs — some companies only use CPUs for training because they don't have enough resources for GPU provisioning — and no matter which setup, memory is usually a very limited, very precious resource for training, because training itself already consumes a lot of memory. So for the caching mechanism we only have a limited amount of RAM to use. SSD, on the other hand, performs much better than HDD, than the older disk designs, and at the same time each training node usually has a pretty large amount of SSD sitting unused. That's why we chose SSD for the cache storage for data and models. Any other questions? I think that's it — thank you, thank you.
Test, test. Cool, and I can just switch this on and off, just turning it on — test, test. Great, it's working. Hey everyone, thanks for joining us. Welcome to the cloud native track of SCaLE 21x. Our next talk is "Bridging open source developer platforms: Backstage meets Coder," presented by Ben Potter, head of product at coder.com, long-time SCaLE attendee and now presenter at SCaLE. Please welcome Ben.
Hey, thanks folks. So yeah, today I'm going to be talking about two different open source developer productivity platforms, Backstage and Coder. First, a quick intro.
So, I grew up in San Diego and I'm currently living in Austin, Texas. Like you said, I work on product at coder.com, and I've spent the last five years working for various dev tool startups. I really like this conference in particular — my first one was about 10 years ago. I was super nervous; I didn't exactly know what a conference was at the time, and I couldn't get Linux installed on my laptop. I figured this conference was going to be a bunch of people sitting around a table talking about their favorite Linux distributions and comparing commands — and I couldn't get Linux installed, so I was absolutely terrified. Fortunately, that's not what a conference is, so I went over to the Ubuntu booth and they helped me get Ubuntu installed on my machine. So that's my history with SCaLE.
But we're not here to talk about that, we're here to talk about developer productivity. We're going to start with an intro, a mix of high-level concepts, and then we'll get into technical solutions through the lens of a developer productivity team. I know some folks here are familiar with one or both of these tools — we'll get into that — but we're going to start with some 101-style content.
So first: what is developer productivity? I am by no means an expert on the topic. I actually think Hans Dockter is — he's written a developer productivity handbook, it's 91 pages, it's quite a page-turner — and it divides developer productivity into two different segments: developer productivity engineering and developer productivity management. Developer productivity engineering is about optimizing machine processes, such as a slow build on a laptop, while management is more about optimizing people processes, such as "Joe's not doing a great job at work today, how can we figure that out?" For this talk I'm going to focus a little more on the developer productivity engineering side.
Through my work in dev tools I've had the opportunity to talk with some pretty great engineering leaders, and this is a quote I've heard from several engineering leaders at Fortune 2000 companies, and to me it's shocking — I would take it with a grain of salt. When I try to dig in and ask where this comes from — that developers are only productive two to five percent of the time — I don't really know: it could be an angry person in the C-suite, or it could be a very methodical survey, I really don't know. But what we do know — and this is pretty well established — is that enterprise developers spend 40 to 60 percent of their work day in their editor. That's a pretty solid number. There are a lot of other things developers spend their time doing, whether it's sprint planning, writing documents, or code reviews, but this data comes from the JetBrains survey as well as the SlashData survey. Something else the JetBrains survey found is that when a developer runs into an issue — maybe they're waiting on a build, or their laptop freezes up — they often decide just to take a break.
So that's a lot of time taken out of flow. Again, take this data with a grain of salt too; I think it's really important to run your own surveys in your organization and understand where the bottlenecks are on your developer teams, and there are plenty of good templates out there.
Now, on to the types of productivity teams. The folks at DX — getdx.com, they have a developer productivity platform — do a really good job segmenting the different types of productivity enablement teams. What I want to focus on in this talk, to get even more specific, is the developer tools teams. These are typically called developer infrastructure, platform engineering, developer experience — all those kinds of names.
These are some common themes I've gathered, from analysts as well as practitioners, on best practices for building a developer tool stack — thinking about which tools you want to offer engineers that they can consume and opt in to use. The theme I see most commonly is to adopt a product management mindset: really think about who your users are, treat them as customers, and apply a lot of product management concepts — build a minimum viable product, and so on. Another recommendation I see is to build internal tooling as opposed to buying a platform. This is probably a combination of using several different vendors as part of your tooling, but you really want to own that wrapper, because you want to own the developer experience and you also want to own and understand the developer workflow. A fourth point I see as pretty important is to provide training and enablement. Team Topologies is a really interesting book on this, and it essentially explains that you need two separate efforts working in parallel: one is platform, which maintains the tools and treats them like a product, and the second is enablement, which is more of a service — going to different teams, doing workshops, explaining what the tools are. You have to have both of these functioning; it could be one team, it could be two. These are the four pillars I've seen that are really important when you're building your developer tool stack. And as a bonus: use open source software — we're at an open source conference, and it's a lot more fun when you can submit a pull request to fix something versus having to wait on a product manager.
So today I'm talking about Backstage and Coder. These are two separate categories: the internal developer platform, IDP, and cloud developer environments, CDE. It's kind of a tongue twister: you use your IDP to connect to your CDE and then you use an IDE — there are a lot of terms here.
Before we get into Backstage and internal developer platforms, I wanted to explain the services problem. Many folks here might be familiar with this, but it boils down to: many software applications start very small, they get larger and difficult to manage with a large team, and they're ultimately split from a monolith into microservices. This creates a lot of benefits — it lets teams contribute alongside each other, and it lets these services talk to each other with a common API. And then more services are created.
They grow in size, some get orphaned — they no longer have owners — and it becomes extremely difficult to manage. This visualization covers the scale and quantity of services, but it doesn't cover the complexity that comes with dependencies talking to each other, or the hierarchy — this service sits on top of that one — so a single request becomes a giant chain down through services, and it's very difficult to understand what's going on. Now imagine one of these services stops responding, or you need to upgrade a dependency and you don't really understand how it's going to impact the rest of the stack.
This was the problem Spotify was constantly running into: instead of building and testing code, teams were spending more and more time just looking for the right information to get started. So the thing they did first and foremost was build a service catalog. A service catalog was one of the first things Spotify built when they were creating Backstage in 2020. It splits these services into separate components, makes it pretty easy to track a specific item, and lets you see metadata such as the owning team, SLAs, and the hierarchy of dependencies these services rely on. It's easy to query, you can read the metadata for a specific service, see the API documentation, see CI pipelines — it's relatively easy to track what's going on in Backstage.
So this is what Backstage itself looks like. It was originally launched by Spotify in March of 2020 as an engineering blog post, and in 2022, I believe, they donated it to the Cloud Native Computing Foundation. It comes with the service catalog I talked about, it comes with a pretty cool tool called the project scaffolder, and it has a plugin ecosystem that lets you extend Backstage for your own uses.
This is the scaffolder — it's kind of like git fork on steroids: you pick a template, React for example, give it some details, and it will fork your repo and also set up an arbitrary number of steps as you go. Another important thing to mention is that Backstage is a framework, not a platform, meaning you'll need people with React experience when you're creating plugins and extensions for it. It's kind of like a create-react-app for developer portals.
What you're looking at here is called Runway; it's American Airlines' version of Backstage. They do a lot of presentations on it, it's really cool, and you'll see that it's skinned very differently from what the Spotify Backstage looks like. Again, just to emphasize that Backstage is a framework and not a platform: this is the documentation on how to add OIDC-based sign-in to the application. It's not setting some YAML — it's actually importing a React component and placing it into your application. And again, Backstage is a framework, not a platform: this is the plugin store on Backstage. Not only can you consume plugins from the community —
— but if you want to extend Backstage, the main way of doing so is through these plugins. A big thing I've seen, talking to organizations that use Backstage, is that they really promote something called inner source, which is engineers contributing up to the platform team — either adding a plugin for their specific team or making a change to an existing plugin they'd like to see.
So, on to how organizations are actually using Backstage. This comes from conversations I've had with dozens of different Backstage users at various stages. What I've seen make a good Backstage deployment is a small, focused core team managing it. That typically involves one product person — again, someone thinking: who are our customers, what's the MVP, what's on our roadmap, which rabbit holes will we save for another day — as well as an engineer with React experience and someone with infrastructure experience. Those are the three roles you need; it could be one person if that person's a jack of all trades, but I've really seen that Backstage teams are about six to ten people. The other thing that makes a good Backstage deployment is that important services are actually catalogued. It's a service catalog, but if a critical application goes down and it's not in Backstage, it's probably not a fully mature Backstage deployment at that point. Third, plugins are added based on need. This goes back to the product mindset: you really have to survey and understand your engineers — maybe if you came from engineering it's a little easier — but you want to add plugins that save time for developers. Some plugins I've seen work really well are infrastructure provisioning ones, maybe one that gives each developer a namespace, or other self-service requests — I've even seen people using Backstage to request a new laptop, for example. And the fourth — I should have added an asterisk here — is "developers use it daily"; I think it should really just be "developers use it". Again, if you create something and you don't have metrics on whether people are using it, or you don't hear about it much, it's probably not super successful. Expedia has a really cool post on the Backstage blog talking about how their engineers use it, and one of the metrics was that 4,000 developers use it for over 20 minutes a day — I thought that was pretty cool. On this fourth point as well: Backstage has a Discord, and there's a channel in it called "adoption" for organizations who struggle to get adoption of one of these projects.
On to a great Backstage deployment, though — and these are things that are significantly trickier, and significantly fewer organizations have reached this, at least from the conversations I've had. First: leadership gets it.
I think this is really important. American Airlines talks about this a lot — they actually had an organization-wide OKR around inner-source contributions and making sure that everything is catalogued in this portal — and I think having leadership buy-in makes these projects significantly more successful. Second is data-driven metrics. Third is true self-service provisioning: Backstage can catalog what already exists out there, but if you can't provide a golden path for developers — and this is a very common term in platform engineering, a golden path for when a developer needs to create a new service — then all you can really do is wrangle the various existing services and try to consolidate them, as opposed to creating a path forward for people to create new things using Backstage. And the fourth — this seems to be a very popular metric for teams managing Backstage — is contributions. This could be people contributing to Backstage plugins, or people using Backstage to discover other areas of the business they can contribute to outside of their core function.
I broke this into stages too. The first stage is install and customize, and again I think this is a good opportunity to bring in the product mindset. I think it's really easy to get stuck in this phase: getting the theming just right to align with company brand guidelines, or creating a plugin for a very specific use case — it's easy to go down rabbit holes before actually rolling it out to your users. The second is to catalog the status quo. A lot of organizations tell me how many services they have catalogued in Backstage, and I actually don't think that's a good metric; it's more important to know whether the important services are catalogued, as opposed to the percentage of total services. Someone says "we have 300 services in Backstage" — again, if a critical production-facing app isn't there, it's probably not a complete snapshot of your services. And the third is the developer's toolbox. This means relying on the flexibility of Backstage to add different plugins and tools that help developers with other parts of their workflow, whether that's CI pipelines, infrastructure provisioning, or keeping track of other tools. Again, Backstage is essentially the create-react-app for developer portals, so the platform team can add anything that needs automation for developer productivity.
This is data from an independent survey by SlashData, and it covers the average time spent by software engineers in their day. Backstage does a really great job at covering many of these: the parts of the software development life cycle around planning, making sure something is deployed and being successfully monitored, making sure deployments run, operating prod code, and having the proper security scans visualized in Backstage. However, you'll notice what's missing is where developers actually spend the majority of their time: in their editor, whether that's setting up their dev tools, writing code, or even managing code.
So that's where Coder comes in. Imagine you are a developer, you work in an organization with a bunch of services, and you want to make a contribution: how would you actually get an environment set up to do so? That's where Coder comes in. It's an open source platform for cloud developer environments, and we see people using it for several use cases. The first is faster onboarding: on average it takes about a week for developers to get their environment set up; if it's completely automated away, it can go down to something like an hour. The second is consistent environments. This is pretty important if you have a complex microservice mesh: if you're running code on your laptop, you'll have to modify or mock a lot of the data that you would typically be running against in a data center to get things working, whereas with a remote environment you can give developers a full environment that has production parity. The third is securing your source code. This matters for a lot of customers in regulated environments, where source code can't live on laptops or has to be heavily audited, and laptops get bogged down, making it very difficult for developers to get their work done. So what we really see is people downloading and using our project for some combination of two or three of these things.
Coder has two main features. The first is templates: these are essentially blueprints for workspaces. They're written in Terraform — or, this being an open source conference, OpenTofu — and you can write essentially any Terraform resource as a template. So you could give each developer a pod, or give each developer a VM, and this can run in any on-prem or cloud environment. The second is workspaces, the individual environments for each developer. This is the Coder UI on the left; the developer picks a project and sets those values, and you can use things like vim and emacs as well.
So this is what the Coder workflow actually looks like. I think the term "environment" gets conflated a bit: on the right is VS Code, and on the left is the Coder UI. You'll notice in VS Code, if you're sitting close to the screen, that it's a Linux terminal and the file system is remote to that of my laptop. Essentially I'm able to provision this remote container, click the VS Code Desktop button, get in, and it already has my repo cloned and my project set up.
Back to the good and the great: a good Coder deployment is very similar to the attributes of a good Backstage deployment. The first is a small, focused core team. You don't necessarily need people with software development experience, but you do need people with infrastructure experience: someone who knows Terraform, someone who understands cloud infrastructure, and you'll also want someone with that product management mindset — again, it could be one of those people, but someone thinking: who are our target users, how do we want to use this platform? A good Coder deployment, like a good Backstage deployment, is great for spinning up a new project, and it also has patterns in place for people who want to bring a project they're working on locally into Coder for a faster or more secure experience.
For a great Coder deployment, I think it's really important that leadership gets it — they understand the value props. It's significantly easier to justify things like this if there's buy-in from above. Another good attribute is that it's integrated with a golden path: if you're trying to promote, for example, DevOps or cloud development or an existing pattern in your organization with Backstage self-service, it's pretty great to incorporate Coder into that existing story. And the third point, which I'll talk about a little more, is that teams can bring their own dependencies — for example, "I need Python 3, Java 11, and Ruby" — into Coder, and Coder knows what to do with that and builds it; we'll talk a bit about how that's possible down the road.
To break Coder into three stages: the first stage we typically see from our users is sandbox environments. These are environments with a bunch of tools pre-baked in; someone can click a button, get a workspace, and start tinkering with Docker or React or Terraform, but it might not necessarily be ready for their actual day-to-day projects. The second stage is developer teams, and this is done in part with the sandbox environments: you deploy sandboxes, people say "hey, I'd love to use this for this project," and then you use the enablement model to get Coder configured for that. The third stage is the bring-your-own-tools approach, where engineers can go into the platform and say "I need these specific languages"; they'll be downloaded from a secure artifact store — they don't have to come from the internet — and developers can develop that way.
So, going back to this chart: we know now that Backstage makes a great internal developer platform and Coder makes a great CDE. Now let me just go ahead and give a demo.
To integrate Coder into your Backstage golden path, we have an open source repo — I'll share the link after — and, like most things in Backstage, you have to do some coding to get it imported. It's relatively easy: you add the package and add a few lines. So here I have a deployment running, and this is a relatively standard Backstage deployment; the only plugins I have are the Coder plugins, and you'll notice this looks like a pretty normal Backstage plugin. It has automatically detected, though, that this project is compatible with dev containers — I'll explain why that's important a little later — and the other thing our plugin adds is this Coder workspaces panel on the right. From here I can create a workspace. It takes me directly into the Coder platform, and it starts building a project in Coder based on the specs in my git repo. I have one running here — this is the same project — and I can click one button, VS Code Desktop, and it'll open my desktop editor and create a tunnel over SSH into the workspace. You'll notice it has automatically cloned the repo. If I go into the terminal — let me make this a little bigger — it has Python installed; if I needed Node, I believe the image has Node as well... oh, I think it has Yarn — yep, it has Yarn as well — and I can start my project.
So I'll go into the project: yarn dev, flask run — it's a Python project — and I can get in and start coding. I did not have to set up any tools on my local machine besides having VS Code, and I'm able to connect remotely into this workspace and start working on a project. Just to prove this is a remote workspace: this is the Coder extension showing me that I'm connected peer-to-peer to the workspace, and if I run a uname, I'm in a Linux environment. Another thing Coder offers is a web-based experience, so technically someone could connect through a Chromebook or an iPad — the Chromebook experience is pretty good, the iPad is all right — and through the web browser the person gets the same VS Code experience with their repo cloned and can be ready to go. If I go back into Backstage, it has detected this workspace, and if I want to create another one — perhaps a second environment, or I'm working on a new project — I can just hit this here and it spins up another one of these cloud environments.
The other plugin I wanted to show is for when you don't want to use Coder at all, but you still want this magical one-click into a project. Without something like this, it's typically a CONTRIBUTING.md or some wiki with 10 to 30 steps on how to set up a project, and it probably only works on one operating system — meaning if you moved from Intel to M1 it would be a whole new set of instructions, and if a developer is on Linux or Windows it would be a totally different set again. So there are a lot of benefits to having this instant, containerized dev environment experience, and if you don't want to use Coder, we created a plugin as well that uses the dev container spec.
To talk a little more about the dev container spec, I'll go into the repo. There's a dev container folder here, and in it there are both a devcontainer.json and a Dockerfile. The dev container spec is not something we came up with, and it's not something Backstage came up with — I believe it was created by Microsoft; they have a local extension for it, and it's also used in Codespaces. It essentially describes what tools this project needs, so that each developer doesn't have to set them up by hand: it starts from the Python base image, it installs the editor tooling, and it installs two pip packages. This is quite a simple example, but you can imagine that for more complex projects the dev container spec still scales and can install multiple languages and multiple tools — it's essentially a Dockerfile. The dev container spec also supports something called features; I don't believe I'm using any of those now, but features can be used to install sidecar processes and mount them into Coder.
Because it detected that this repo has a dev container spec, I have this "Open in VS Code local" button. What it does is take me, again, to VS Code, give me a pop-up, and use the Dev Containers VS Code extension to run the same exact environment locally. So let's close this out and do that again. Cool — we can see now it's no longer using the Coder extension, it's using the Dev Containers extension, but again, if I run uname, it's a Linux environment, I have Python installed, and I can do my development this way as well. So both of these plugins are open source.
So both of these plugins are open source. You can install them into your Backstage deployment today, and it's a great way to add these extra steps to your workflow; it really solves the problem of developer setup. Let's go back to the slides. Cool. So, yeah, with that: you can use Backstage as an internal developer platform, Coder is a cloud development environment, and there are a lot of things these tools do differently, but there are a lot of problems they solve together, mostly around onboarding, context switching, and provisioning environments. So with that, thank you. I think there's plenty of time for questions, and the QR code takes you to our Backstage plugins repo if you want to download or check out one of these plugins.

Thank you, Ben, for such a great presentation. You definitely have time for questions, so if you have one, let me know and I will bring the mic to you. No questions? Anything else? No? Cool. Thanks, everyone. The presentation will be on the SCaLE website soon, as well as the replay on their YouTube channel. Thank you. Bye.

Okay, all right, let's get started. So this session is a little bit about cloud stuff, a little bit about Java, and a little bit about serverless. You know how the cloud is, right? It's like, ooh, super cool, the cloud, you can deploy your applications on there and it solves all your problems, right? Of course not. Anyway, my name is Kevin Dubois. I'm a developer advocate at Red Hat, and I'm based in Belgium. The speaker that's going on next is also from Belgium, which is a pretty random coincidence because we're a tiny country. Anyway, I talk a lot about cloud native development and also about Java; I happen to also be a Java Champion. And if you want to follow me, I'll share those links at the end too.

So, cloud computing. It's nice because it helps you respond more quickly to demand: you can create new environments pretty easily, you don't have to worry about managing the hardware, which is convenient, you get things like high availability and disaster recovery and resilience, and it basically allows you to grow your applications in a manageable way. You don't need to create tickets and provision new machines in your data center and so on. And in theory at least, you would use only the resources that you need; you might save some money, you might save the planet, but maybe not.

But if you want to create an application platform for the cloud, you need to build more than just your application. You need some things around that application to make it manageable: developer tools, some way to automate deployment with CI/CD, monitoring, logging, and probably a container registry if you're working with containers. We have to take that into account too. Fortunately, all these cloud providers are really nice and provide a lot of these tools for us. If you go to AWS, you'll see a whole bunch of different services (this is just a subset) that let you create really cool stuff with your application: databases and messaging systems and storage and all that good stuff. Which is cool. And it's not just AWS.
Of course, it's also any other cloud provider out there: Azure, Google Cloud, and, yeah, I have to mention IBM Cloud, right? I'm from Red Hat and they're kind of our corporate overlords, so they also have an IBM Cloud. But the thing with these cloud environments is: what if you go in with one cloud provider, build some really cool stuff, and then you want to switch to another provider? It could be for many reasons. It could be regulatory changes; for example, in Europe the EU sometimes says that data actually has to stay within the EU, or here in the US maybe they'll say data cannot leave the US or it needs to stay in a particular state. There are all sorts of changes politicians come up with, for good reasons or bad reasons, I'll leave that in the middle. It could be that there are outages with the provider you're with. Maybe they change their prices. Maybe another vendor offers better resources; in this day of AI, maybe some other vendor offers much better GPUs, or maybe they're a lot cheaper. There can be many reasons. A very typical one I also see with some customers is that a new CIO comes in, and all of a sudden he has his golfing friends from another company and he says, "yep, we're going to switch from X to Y to Z", or whatever. And then there's also shadow IT: some developer teams might be a little cheeky and go build their software on a different cloud, and then you need to converge, or not.

So there can be many reasons why you might need to work with different cloud providers. You might have some workloads on AWS, some on Azure, some on-prem because maybe you don't want to run them in the cloud. And then comes the pain of lock-in: you've built everything with, for example, AWS or Azure, and of course they make it so that everything is a little bit different between the different cloud providers, making it a little bit of a pain to move, and "a little bit" might be an understatement. So you have to be careful when you go into the cloud, and really think about how you build your applications, which services you need, and which ones you could maybe replace.

So what is the solution then? Is it hybrid cloud? Is it multi-cloud? Well, it could be. You could have some team working on AWS and some team working on Azure, but you'll see that the differences between the cloud providers require expertise in each one of them, and unless you're a very big organization and you're fine with that, it's not so ideal. So it's like: I want to go to the cloud, but... Fortunately the open source communities are great, and they've come to rescue us, if you will.
So we've seen the advent of containers and Kubernetes, and then a whole massive ecosystem built around that. I don't know if you've ever gone to the CNCF landscape; there's a whole bunch of different services in there, but those also include proprietary services. So what I did is I went to the CNCF landscape and filtered by only open source community projects, and then you end up with a much smaller list. There's still a fair amount of projects, which is the good news, and that means we can leverage these open source community projects, which can also be supported by a vendor, and use them on any cloud in the same way, together with Kubernetes. I also put one particular little project in there, Skupper, which is an interesting one that allows you to interconnect applications between different providers on a layer 7 network. So it sits a little bit higher up, and you don't need to start figuring out how to build VPN connections or dedicated networks between different providers. Skupper is an interesting project, but that's a little bit of a sidebar.

So what about serverless? Maybe I should begin with explaining what serverless is. Of course, we always have to mention that serverless doesn't mean there are no servers; it means they're abstracted away from the developer. The idea with serverless is that you auto-scale based on the demand of that moment: if you have a lot of requests coming in, you create a lot of resources, and if those requests go down, you have fewer resources. And the idea is that you get billed based on the exact usage of what you're consuming.

So why is serverless interesting? This was a survey that the CNCF did recently, asking what factors are leading you to overspend, and it looks like 70% said over-provisioning is part of why they're overspending. Another one is sprawl, such as resources not deactivated after use, so they just sit there consuming energy and costing money. Then you see fluctuating consumption demands, where sometimes you get a lot of requests and sometimes not; it could be because it's Black Friday, or because somebody calls your service from a batch script or something. And then just poor planning and prediction of cloud consumption, because it is hard to plan out how much your applications are going to be used. Those are actually all good reasons to consider serverless.

Of course, I have to put in some graphs; makes me look fancier, doesn't it? You can see here an example of a traditional, non-serverless deployment, and you see they have over-provisioned for most of the time: they have a certain amount of resources, which can be containers or pods, and let's say they have 20 pods running to handle the traffic just in case there's a big spike. And then you see, yeah, there's a spike and they can still barely handle it, and then another spike and it's actually too much for what they've provisioned, and of course some users are not going to be happy because they're not going to get to your service. So that's not so good.
So even though most of the time they've over-provisioned and they're paying for what's not being used, at some point it's the opposite: they don't actually have enough resources. The difference with serverless is that your applications, your resources, your pods are going to scale exactly to the demand of that moment, up and down, and you see the third arrow there points to zero. If nobody's using your services at a particular moment, maybe at night when nobody needs them, then there are actually no resources being used. So that's the interesting part about serverless.

Of course, I'm a big fan of Java, so what about Java and serverless? Because Java traditionally is not so fast to start up, right? The idea with serverless is that we scale up and down based on demand very quickly, and if I think of my Java applications from five, ten, fifteen years ago, they did not start in milliseconds; it was probably more like minutes. And nobody's going to wait minutes for your service to come up. Now, fortunately, there have been some nice changes in the Java landscape. Here we can see an example of an application on the JVM that's supposed to scale: you see the blue line, which is basically the load going up, and a new container, a pod, gets created. The application might even start up and say that it's ready to receive requests, but the JVM is interesting because it keeps optimizing after startup. It takes a while to actually hit its maximum throughput, where it can take as many requests as it's actually able to, so there's some discrepancy there. In that case we'd have to always over-provision our Java applications, which isn't so ideal. But again, like I said, there are solutions to that.

One of them is the Quarkus project. This is a project that was started, I think, about five years ago; Quarkus just had its fifth anniversary. Quarkus, they say, is "Supersonic Subatomic Java": supersonic, so very fast, and subatomic, so smaller than an atom, which is already very small. The idea is that it's Java, but very fast and very small. The way that works with Quarkus is that, if you compare it with a regular Java application running on the JVM, those still do their packaging, and then during the runtime, during startup, the JVM does classpath loading and a bunch of optimization. But that's already when it's started, right? So imagine a container: it starts, the JVM loads everything, and only then does the application finally come up. Which makes sense in the traditional Java world, where we had big monolithic applications on dedicated servers, and the idea was not that they would start up and stop all the time.
The idea was to keep them running as long as possible. But in a container world, of course, and especially with serverless, we want things to start up fast, and we don't want the same optimization to happen every time for every container that starts up. Why don't we just do it beforehand? That's exactly what Quarkus tries to do: it does all that classpath scanning and builds its model of the world before it packages the application, so it doesn't have to repeat that over and over again at startup. That's roughly how it works; it's more sophisticated than that, but it gives you the idea.

What's also interesting with Quarkus, and this is not just Quarkus, other Java stacks can do this too, is that it enables, out of the box, native compilation of your Java application. It compiles it down to a native binary, typically a Linux binary for a Linux container, and there's no JVM in that natively compiled binary. It gets a whole bunch of optimizations thanks to the GraalVM project, and that's really interesting for serverless because we get a very fast startup time. There are some downsides to native compilation, because the JVM has some interesting optimizations that you lose by compiling natively, but maybe we'll get to that, maybe not.

Anyway, here's a little comparison: a traditional cloud native stack, which I guess in this case was a Spring Boot application, started up in 4.3 seconds, and the exact same application with Quarkus on the JVM (because you can also run Quarkus on the JVM) is already quite a bit of a difference, right? Especially if we're thinking of serverless, that's great. And then you can see, if we compile it down to a native binary: 16 milliseconds. I think I can wait for that. In terms of the footprint and the memory usage, it's also quite a bit less.

Going back to that Java warm-up time: like I said, native compilation is interesting because there's no more optimization to be done after the binary is compiled. You can see this yellow line is basically fairly constant; that's the native compilation of our Java application. But what's interesting is the red line, which happens to be the Quarkus application on the JVM: even though at the start it does not have the highest throughput, thanks to the optimizations of the JVM it ends up having better throughput. So it depends on what kind of application you're running. If you have serverless applications that just need to do something very quickly and that's it, native compilation is great; if you have longer-running processes that need to handle a lot of throughput, you're probably better off running on the JVM. So that's a little bit of a sidebar on how to optimize Java for serverless; there's actually a lot more you can do, but the talk is only so long. I'm actually writing a book about serverless Java, so I'll go into more detail there. It's not published yet, but I'll share what we're up to later on.
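For reference, getting from source to those two flavors of artifact with a standard Quarkus Maven project looks roughly like this; the flags are the documented ones, but treat the exact invocation as a sketch rather than the commands used for the benchmark above.

```bash
# JVM mode: a regular jar, with the remaining optimizations happening at startup
./mvnw package
java -jar target/quarkus-app/quarkus-run.jar

# Native mode: ahead-of-time compilation with GraalVM/Mandrel. The container-build
# flag builds the Linux binary inside a builder container, so no local GraalVM
# install is needed.
./mvnw package -Dnative -Dquarkus.native.container-build=true
./target/*-runner
```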
So anyway, we have our Java application that starts up fast; now we want to deploy it with serverless. Most people, when they think of serverless, think of functions as a service, so we can start with that.

Functions as a service: I think AWS Lambda was pretty much the first one, or at least the first one that I saw. The idea is basically that you have your code, you package it, and you give it to AWS Lambda, and it takes care of deploying it and scaling it based on usage. Same with Azure Functions, same with Google Cloud Functions and IBM Cloud Functions and all that. Which is great; it works pretty well. The only downside is how we have to create these functions. Amazon, for example, and the same with Azure and the other ones, says: here's an example of a function you can write to deploy to AWS. And you can see: import this from amazon, aws import that, you have your handleRequest method, and then there's a Context and so on. That's all very specific to AWS Lambda. So if you want to redeploy this function to Azure Functions, that's a problem: you're going to have to rewrite your code, because you have a whole bunch of dependencies in there. So that's not so great. Lock-in again, and this one, in my opinion, is worse than lock-in in terms of infrastructure: if you have code you need to rewrite and refactor, that's a lot of man-hours, and that's very expensive. So we want to try to avoid this as much as we can.

Quarkus actually has some interesting solutions for that too: you can write cloud-agnostic functions with Java and Quarkus. One of them is a project called Quarkus Funqy. You can see down there at the bottom a basic function; you just annotate it with @Funq, and that's pretty much it. The only thing you then need to do is add a Quarkus extension, which is basically a smart dependency: you add it to your pom.xml and it has all the knowledge of how to package and deploy your application. So you can add, for example, the extension for Quarkus Funqy Amazon Lambda, and by adding that extension I can build my function and deploy it to AWS Lambda. If I then want to change that, I swap quarkus-funqy-amazon-lambda for the Quarkus Funqy Azure extension, or Azure Functions HTTP, I think it's called, and that's basically all the change you need to make. So it's a much less painful process.

Now I'm going to have to be a little bit creative, because usually I mirror my screen so I can see what I'm doing while demoing, so I'll have to think for a second how to do this. Maybe I'll get this screen over there. So here is a project that I've created with a few different functions. We have the Quarkus Funqy AWS one, so we'll start with that. You can see we have a regular pom.xml, and I hope this is big enough for you in the back; I have to look over there to see what I'm doing. You can see there's not much to this pom.xml other than that it's using the Quarkus platform to add some convenient stuff, and then you can see here quarkus-funqy-amazon-lambda as a dependency. And then I have a function here in src/main, oops, no, java, there we go. I created this function for my fictional company that's building some space stuff.
Anyway, this is a function to ask for a landing request on a given planet. You pass in some landing details, and it replies with "approved for landing"; it's not a very complicated thing, but you can see this is just plain Java, and then there's this @Funq annotation, and thanks to that we can deploy it to AWS Lambda. So let's try that here for a second.

When you compile your application, what the Quarkus extension does is add some stuff in your target folder, and this is where it becomes interesting, because you can see it has a function.zip, which is exactly what AWS Lambda expects you to send (it expects you to zip up your artifact and send it), and there's this manage script, which is a little convenience script to deploy to AWS Lambda without having to know the exact syntax of their CLI. So with that we can go into target and run that manage script. Maybe I need to recompile; let me see if I'm in the right directory here. Quarkus... right, no, I'm not, that's why. cd back, and then quarkus-funqy-amazon... all right, what is it? This is so hard to do from here. What was it, aws? Okay, thank you. And then if we now run target/manage.sh we can create our function.

So I would do that, except the one that gets shipped sets the runtime to Java 11 by default, and I happen to be running Java 21. So what I've done is create a copy of this manage.sh in my main folder, so I can call it from there and use Java 21. So I can do manage.sh (sorry, this is probably a little small for you) and then create, and that's going to create my function on AWS Lambda. Then I'm up and running: I could invoke it from there, and then I'm using AWS Lambda.

Then at some point, if I want to switch to Funqy for Azure, I can use the exact same code. You'll see in src/main/java the exact same landing request function; let me scroll this over so you can see it's exactly the same. And when I compile this, we'll see that in the target folder, instead of that function.zip and the Amazon stuff, it creates an azure-functions folder with everything you need to deploy it to Azure. Then I would do roughly the same thing to deploy it to Azure, but I'm not going to spend too much time on this because otherwise we'll run out of time. But that gives you an idea, right? I have examples of these, and I'll share the GitHub so you can see exactly what the difference is between the different ways to deploy to the different functions-as-a-service offerings.

Now, the thing is, even if you create a function on these providers and you do it with something like Quarkus so you don't have the dependencies in your code, chances are you might still get some lock-in. Because if you look at Lambda, maybe you want to add some events to it, tie it in with a messaging system, back it with a database like DynamoDB, or add some storage. Well, be careful, because then we're going to have the same problem of lock-in with the provider. So you need to be careful with that too. So, functions as a service is cool: you get on-demand billing and scaling, and you don't have to worry about the infrastructure.
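I won't show the exact class from the demo repo here, but a Funqy function of that shape looks roughly like this: plain Java, a POJO in, a POJO out, and only the @Funq annotation, with the deployment target decided by the extension in pom.xml. The class and field names below are made up for illustration.

```java
import io.quarkus.funqy.Funq;

public class LandingFunction {

    // Plain data classes; nothing AWS- or Azure-specific in here.
    public static class LandingRequest {
        public String spacecraft;
        public String planet;
    }

    public static class LandingResponse {
        public String status;
        public LandingResponse(String status) { this.status = status; }
    }

    // The only Funqy-specific bit is the annotation marking this method as a function.
    // Swapping the quarkus-funqy-amazon-lambda extension for the Azure one in pom.xml
    // changes where it deploys without touching this code.
    @Funq
    public LandingResponse landingRequest(LandingRequest request) {
        return new LandingResponse("Approved for landing on " + request.planet);
    }
}
```

After a regular build, it's the extension that drops the provider-specific packaging (the function.zip and the manage script in the AWS case, the azure-functions folder in the Azure case) into target/, as described above.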
But actually serverless goes a little bit further than just functions. Serverless becomes even more interesting if we combine it with containers. Again, the cloud providers have solutions for that as well: Amazon has Fargate, Azure has Container Instances, I think, and Google has Cloud Run. You can basically create containers, push them, and they'll scale for you as well. And then there are some other projects out there, such as Knative, which we'll visit in a second.

Now, the thing is (and this is Azure, for example, because I don't want to always show AWS and say that AWS is locking you in; it's the same with all the cloud providers), here you can see, straight from the Azure website, some examples of how to use Azure Container Instances. It's like, oh yeah, this is really cool, and then you add an Azure Application Gateway, and then a private endpoint, and then you call this other thing from Azure, and blah blah blah. And then, yeah, of course, the same lock-in.

So that leads us to a project called Knative. Knative is an interesting project because it enables you to use serverless together with Kubernetes: it's basically an open source project, working with an open source project, to build serverless and event-driven applications on Kubernetes. The idea with Knative is that it tries to make containers easy. It helps you with building the containers, pushing them, and deploying them all at once; you don't have to create the YAMLs in Kubernetes yourself, it does the heavy lifting for you. Then it has autoscaling out of the box. It also has this revisions concept: every time you change a configuration or push your application again, it creates a new revision, which makes it really easy to roll back and forward between versions, kind of out of the box. You can also do traffic splitting and get things like canary deployments or A/B testing; that also comes out of the box with Knative, which I find interesting. And what's cool about Knative too is that, because you can run Kubernetes not just in the cloud but also in your private data center, it lets you enjoy these serverless capabilities of autoscaling and easy deployments in a hybrid cloud situation. And because it's containers, it supports any programming language: as long as you can run it in a container, you can use serverless with it.

So you can create functions with Knative as well, kind of the same concept as before with Lambda or Azure Functions: you write your code, and Knative can package it up and deploy it for you, and it does that in the form of a container. But to the developer, that's all abstracted away.
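To make that a bit more concrete: whichever way the container gets built, what ends up on the cluster is a Knative Service, and the autoscaling behavior and the revision traffic split just mentioned are fields on that single resource. Here is a hedged sketch, with placeholder names, image, and percentages rather than anything from the demo:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: vote-ui                                  # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale-to-zero when idle
        autoscaling.knative.dev/target: "10"     # rough target of concurrent requests per pod
    spec:
      containers:
        - image: registry.example.com/demo/vote-ui:latest   # placeholder image
  traffic:
    - revisionName: vote-ui-00001                # previous revision keeps most traffic
      percent: 90
    - latestRevision: true                       # canary the newest revision
      percent: 10
```

Knative stamps out a new revision on every config change or push, and the traffic block is what enables the canary and A/B patterns mentioned above.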
They just do a Knative func create to create a function, and then func deploy builds the application (the Java application in the case of Java; in the case of Node.js it'll run the npm build), turns it into a container, and deploys it. By default it uses Buildpacks for that, if you're familiar with that project. Or you can have your application built as a container already and deploy it as a Knative Service; you can just use containers as serverless workloads.

So we've seen the evolution from functions to serverless containers, but where serverless gets really interesting is with event-driven architectures. You have events coming in that trigger a certain scaling of your application, and then you can build really cool applications or architectures that scale up this part and that part, so you use exactly and only the resources you need. All the cloud providers have their own solution for this, of course, but you can do it with Knative as well. Knative supports the CloudEvents protocol, which I'll tell you about in a moment.

But maybe first, to give you an idea of how Knative works: on the serving side, you basically have a control plane that monitors the traffic, and if requests come in, it scales according to those requests, up and down, based on exactly the demand of that moment. And then there's also an eventing part of Knative to handle the events going from one service to another, so it has brokers and it has channels. The way it passes those events between different services is the CloudEvents specification. Has anybody heard of the CloudEvents project? I see one hand. So, CloudEvents is a pretty interesting project.
It recently graduated as a CNCF project, and the idea is to have a common event schema standard. If you're familiar with sending messages between different systems, it can be that in this case we're using Kafka with Avro, over there we're using AMQP, and over there MQTT, and the producer and the consumer have to know exactly from each other: hey, I'm sending this in this protocol, in this format, so the consumer can do something with it. The idea with CloudEvents is that you add that kind of information to your payload, in the form of the CloudEvents specification, so everybody knows exactly what to expect. It's protocol agnostic, and it's an easier way to communicate between different systems; and this is something that Knative supports.

So you can see here a usage pattern of Knative: you have an event source, and you can send that to a sink that sees an event coming in and scales my Knative Service based on that. But you can make it more interesting: you can create a channel, much like brokers, where different services can subscribe to the events being sent through that channel, so in this case you can have two different subscriptions on one channel. And you can also have a full-fledged broker, where subscriptions filter for the messages they're interested in. So there's a whole bunch of interesting scenarios you can build with Knative, again using serverless to scale your resources up and down.

So with that, I have one little demo to show, and you can participate, hopefully, because I need some help generating load to scale up my applications. I have this (yeah, I'm terrible with UI, so this is really the best I could do) where you can vote for your favorite IDE. Some people like Vim; other people... sorry, I don't have Emacs on here, by the way, but I have IntelliJ, VS Code, Eclipse, so yeah, a little bit focused on the Java developers. And then, because I'm from Red Hat, you can also vote for OpenShift Dev Spaces, which is an IDE that runs on OpenShift and that you can use in your browser; it's basically IntelliJ or VS Code in your browser, which is cool. I just added that to be funny.

Anyway, now I have to get this window over there. There we go. Okay, so here's my application running in a Kubernetes cluster; it's an OpenShift cluster in this case, and we can see we have a few different components. Actually, what I'll do is go back to this slide, because I think I have a diagram of this application. Basically I have my UI, which scales down to zero if there are no requests coming in. Once you start voting (well, actually, once you go to the URL) you'll see that it scales up, because it wakes up. And then every time you vote, it sends requests to my ingestor, which sends the requests through a Kafka topic to a consumer that updates my database, and then my UI reads those events back into the UI.
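The events flowing from the ingestor through the Kafka topic to the consumer are exactly the kind of thing the CloudEvents spec standardizes. As a rough illustration (every attribute value below is made up, not taken from the demo), a single vote carried as a CloudEvent could look like this on the wire:

```json
{
  "specversion": "1.0",
  "id": "b3c1e6d2-0001",
  "source": "/vote-ingestor",
  "type": "com.example.vote.cast",
  "datacontenttype": "application/json",
  "time": "2024-03-16T18:25:43Z",
  "data": {
    "ide": "intellij"
  }
}
```

Because the envelope (specversion, id, source, type) travels with the payload, the consumer doesn't need out-of-band agreement about which broker or protocol delivered it.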
So this is kind of an over-engineered solution for this problem, but it demonstrates how it works with serverless. I don't think this QR code is going to work because I forgot to update it. Actually, no, I'll just share the QR code. So I'll go here to my UI. Right now, you can see that if there's no blue circle around it, that means there are no pods running; we can see here there are no pods, it's autoscaled to zero. Once I go to this URL, it's going to wake up, you see, it starts a pod, and my UI is now up and running. And I really have to go look over there; there's a QR code thing, but I can't see it from the stage. Oh, I think it's... it's not there, it's in Chrome, so sorry, apologies, I was hoping I could mirror my screen. So I'm going to go to Chrome here (and you'll see a bunch of tabs, apologies for that), we'll go here again to the thing, and now we have a QR code. So go ahead and scan that, and if anybody feels up for a challenge, you can vote as many times as you want. You can write a quick little script if you want, and then we'll see the votes coming in.

Let's see what our cluster is doing. So let's go here, keep voting, and we'll see that right now we don't have a ton of votes coming in, but we do see that our pods are scaling up: more pods are being created because you're creating more load. You can configure Knative based on how many requests it should scale to how many pods; I have it set up so that concurrent requests each get handled by a separate pod, which again is a little bit over-engineered, but you can see that it scales up and down based on whether you start or stop voting. It's kind of up to you whether you want to see it scale up or down. And yeah, we can see now it's terminating some of the pods, so some people are voting a little bit less, and if you start voting again it scales back up; it just keeps going based on the demand of the moment.

Let's see where our votes are at. We have IntelliJ winning, and then VS Code; the usual suspects, right? Is that 870 votes for IntelliJ? At the last conference I was at, somebody had a script running and I think we got up to something like 200,000 votes for IntelliJ, and that was funny because you'd see like 50 pods running. And then, let's say everybody stops voting, and let's see if it scales back down to zero. If everybody stops voting, we'll see it terminate in just a moment and then go to sleep, just waiting for more requests to come in. We see that the dark blue circle means everything is terminating, and in just a moment we'll see that no more resources are being used. Excuse me. Yeah, unless Jamie wants to be smart. All right, so we see that it's going to sleep, and that's exactly how I feel; I'm a little bit jet-lagged. All right, so yeah, the last one is terminating. There we go.
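If anyone actually wants to see it climb back up, the "quick little script" mentioned a moment ago really is just a loop. A throwaway sketch, with a placeholder URL and payload standing in for whatever the demo app actually exposes:

```bash
# Throwaway load generator: cast votes in a loop so the Knative services scale up.
# The URL and JSON body are placeholders, not the demo's real endpoint.
while true; do
  curl -s -X POST "https://vote.example.com/api/vote" \
       -H "Content-Type: application/json" \
       -d '{"ide": "intellij"}' > /dev/null
done
```

Every request that arrives while the services are asleep is what triggers the wake-up and scale-out you see in the console.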
Now it's at zero, and if you start tapping again, or you hit refresh on your UI (because what's interesting is that the UI only wakes up when you go to the URL; once you're already on there, it doesn't reload), we see that it wakes up again. Anyway, that's just a quick little example of how that could work. So I'll go back to my slides; I think it's this one, yes.

And then of course we have to talk about AI; every talk has to say something about AI, right? So, serverless and AI are actually a pretty good match, because with AI you have, for example, somebody using a fancy chatbot, and every time somebody types in a question, that creates load: requests to some service that was built from your model. There are many kinds of use cases with AI, but serverless makes a lot of sense: when my model needs to be created I scale things up, and the same when it needs to be used, and so on. I added a link here to a use case from the Knative project if anybody's interested; it's an interesting read.

And then, to wrap it all up: cloud providers offer a lot of cool stuff, so I would say use it, but keep an eye on what's happening in the open source world, in the community, because if there's a solution for it in the community, you will lock yourself in less and you will have less pain later on. Oftentimes what the cloud providers offer as services might on the face of it seem like a better, easier solution, but you need to be mindful of that lock-in. And serverless, if you thought it was just functions as a service: it's a lot more. You can use it for many use cases, in event-driven architectures, or just with containers that you want to scale up and down so you only use the resources of that moment. That gives you more density on your nodes, lets you use them more effectively, and you might be able to use fewer nodes, saving you a lot of money and saving the planet just a little bit too. And then, yeah: use open source when you can, and proprietary services when you must, because it does end up being a little bit inevitable that you'll hit a scenario where you have to resort to that. But the more we use open source, and the more we show these cloud providers that we like open source, the more they're going to invest in it. We've seen that Microsoft, AWS, IBM are all doing a lot more open source lately, and that's because we are using it. So please use open source if you can. And if you feel limited by what you find in the communities: contribute. That doesn't mean you need to write the whole solution, but create issues, provide some ideas.
That's what open source communities are for. So, a quick link if you're interested: I'm part of the Red Hat Developers team, and we also write books. Red Hat is nice enough to sponsor those books, so you can download them for free, which is a nice little thing. You can see there's, for example, "Quarkus for Spring Developers", so if you liked what you heard here about Quarkus and you're a Spring developer, there's a book for you right there, and there are books on modernizing your applications and working with GitOps. And then one last thing for myself: I'm writing this book, "Serverless Java in Action". What you saw today, there's a lot more that goes into it, and that's what we're writing about. The book is not available yet, but it should go into the Manning Early Access Program in the next few weeks, so keep an eye on that. If you're interested, follow me on LinkedIn, Twitter, Mastodon, whatever, and I'll definitely be posting about it, because I'm pretty excited about that book. And with that, I thank you, and I would like to know if there are any questions.

Yes, we have a microphone; I don't know if it works.

Yeah, I wanted to know what JVM you were using when you were doing the benchmarking with some of the Quarkus stuff you showed earlier.

I don't remember, because it was a few years ago. But overall, if you compare, for example, Spring Boot or Java EE with Quarkus, you'll see the difference pretty quickly. The only thing is that you need to make sure you measure the right thing, because when Quarkus reports startup time it measures until the application is actually ready to accept requests, whereas Spring Boot will say "it took this long" and that's only the time to start the application, not to be ready to receive requests. So yeah. Anyway, thank you. Any other questions?

Yeah, I've been running the Serverless Framework, and I'm wondering why I would want to choose the tooling for Quarkus and Knative natively versus using the Serverless Framework to basically set up a YAML, where I don't have to change a single line of code except the YAML file for deployment.

Sorry, I didn't get the question. Oh: why would I use Quarkus and Knative natively versus the Serverless Framework, which gives a uniform, unified CLI and mechanism where you don't need to change your Java code because it packages everything up and adds the packages for Lambda or whatever native cloud platform you target? That's a good question, and there are basically two questions in there. On the Quarkus side: the thing with Quarkus is that you get the performance boost and some integrations with the different providers. The Serverless Framework is interesting; however, I've seen it, I don't want to say fizzle away, but it seems like the adoption has kind of stalled, so I'd be a little bit careful. I mean, I don't know if you're part of the Serverless Framework people, but it's an interesting project.
What I like about Knative is that it's backed by the CNCF, so you know the project is going somewhere, it has a lot of adoption, and you can use it with Kubernetes, so it lends itself to a lot more use cases than just the Serverless Framework. But yeah, if it fits your use case, absolutely; I just happened to talk about Knative.

Right, okay. So in other words it's still a uniform mechanism to deploy. I'm just wondering why, because I'm lazy, I'm extremely lazy, and I just want one mechanism where I don't have to change any line of code, just one piece, so I can package it up either for Google or for Azure or Lambda, whatever I want. So I'm curious. It can still use Quarkus underneath as well, and Kotlin, but the question is: should you, and what are the bonuses? The one thing I do see is that Quarkus native is really interesting; that'll run much faster, which I guess AWS Lambda doesn't provide very well at all.

Right, yeah. So for example, AWS Lambda provides SnapStart, which is an implementation of Project CRaC, checkpoint restore at... something, I can never remember the full name, but it basically makes a checkpoint of your application on the JVM after it has already started and then starts up from there, and AWS Lambda uses that implementation. Anyway, that's beside the point. I'll be honest, I'm not super familiar with the latest developments of the Serverless Framework. That they support Knative is interesting, but to me it's then an extra layer in between, because I can just use Quarkus, which has extensions for Knative, so I don't need that. Sorry, there's another question over there. But yeah, if you're interested in it, absolutely use it if it works for you.

Yeah, I have two questions. The first question is about language support: the JavaScript folks are using TypeScript, and the Java folks are using Kotlin, and Kotlin encourages using coroutines and Flow, whereas Quarkus encourages its own framework for non-blocking code. I'm curious what the direction is; is there any possibility of supporting Kotlin natively?

So, Quarkus supports Kotlin. I'm not sure about native, though; it could, I'm just not sure myself.

One more question, regarding CloudEvents. Generally I'm using CloudEvents with Kafka, and I was curious how it works with HTTP transport rather than Kafka; can we use CloudEvents in an HTTP server context as well, or not?

Yeah, I don't see why not, because it basically supports that out of the box; the spec is protocol agnostic, HTTP or whatever. Any other questions? All right, well, then thank you, and have a nice evening.

All right, I guess we'll get started; hopefully more people will trickle in. We've got Mattias, uh, Gees, I hope I didn't butcher your name, Director of Technology at Venafi. I learned something today: it's not "the benefit", it's Venafi. Mattias has been part of the Jetstack effort behind cert-manager as well, and besides that, the fun anecdote about him is that he's quite an accomplished marathoner. Sub three? What, 2:49? 2:49. So we'll talk afterwards.
Maybe I'll get some good hints. All right, it's all yours.

Thanks, thanks for the introduction. So today I'm hoping to help demystify SPIFFE a bit, and show how it can solve secret zero, or what some people have talked about and mentioned as the bottom turtle problem. A lot of people might already be using SPIFFE: it's the underlying technology powering the identities within the Istio service mesh, for example; quite recently Linkerd has announced support for it as well, and Cilium is using it for their mutual authentication and authorization system. So more and more systems are starting to adopt it. But I want to demystify a bit what SPIFFE actually is. SPIFFE is a workload identity framework, and to be able to talk about workload identities we need to go back to a higher level and look at where SPIFFE is a potential solution. You basically have two kinds of identities: machine identities and human identities. Human identities are well understood, and there are countless solutions that help you solve the management problems around them; examples are Google SSO, Entra and Active Directory, Okta, and all of those kinds of solutions. The lesser-known identities are the machine identities. Machine identities consist of two categories: device identities and workload identities. For devices, think about TPMs, which you might even have on your laptops, and other solutions. But with the sprawl of microservices we have seen an increase in workloads, and it is important that we can uniquely identify workloads to be able to govern and authorize communication. Research has shown that machine identities will outnumber user identities by 45 to 1, and currently there don't really exist many full-blown solutions to manage them. For workload identities we get a mix of secrets, API keys, tokens, certificates, JWT tokens, or cloud provider solutions that hand out workload identities, but they're not very compatible with each other, which also makes it very hard to govern and manage them.

So why should we care? This is the reason we should care, because of this man: Aristotle. An entity without an identity cannot exist, because it would be nothing. To exist is to exist as something, and that means to exist with a particular identity. So that's why we care about workload identities. The purpose of this talk is to guide you through the basics of the SPIFFE framework and showcase how it can help you solve the secret zero, or bottom turtle, problem.

But let's start at the beginning: what does SPIFFE actually mean? SPIFFE stands for Secure Production Identity Framework For Everyone.
It's really a mouthful. It's an open source project designed to establish a standard for securely identifying and authenticating software services in distributed and dynamic environments such as cloud native and microservices architectures. The project aims to provide a framework that allows different services to securely communicate and trust each other's identities without needing to rely on traditional network-based security mechanisms like firewalling, iptables rules, and those kinds of things. Joe Beda, one of the co-founders of Kubernetes, first proposed SPIFFE in 2016. At the time Joe was still at Google, and he defined the specification of the SPIFFE framework together with security experts from other organizations like Netflix and similar big organizations, with the aim of having an interoperable framework for workload identity management.

There is an actively maintained open source implementation of the SPIFFE framework called SPIRE. It is maintained by many organizations, and we can see public use cases of SPIRE from the likes of Uber, ByteDance, and many others. There are some really great blog posts out there from Uber on how they run SPIRE at scale; it's definitely worth doing a quick Google search to read them, they're very in-depth. Uber talks about how they run SPIRE at scale, and also how they use it together with, for example, Kafka for authorization to Kafka services. Very interesting blog posts.

Over the last two years we have seen SPIFFE getting a lot more traction. First and foremost, SPIFFE itself is fully open source: it's part of the Cloud Native Computing Foundation and it's a graduated project there. We've seen it being adopted and standardized on by the industry: for example, Azure has announced that they're working on a product for SPIFFE, Google is working on a product for SPIFFE, and AWS already has support for SPIFFE workloads; more on that use case later, during some of the demos that I have. We have also seen that failures to protect identity lead to supply chain attacks, with SolarWinds as a really big example, so it's becoming more and more important to be able to uniquely identify all of your workloads so that you can protect them. And regulation and zero trust mandates also play into this growing awareness around SPIFFE, because SPIFFE can help with your zero trust strategy.

So it's all fine and well that we know what SPIFFE stands for, but which of the problems and challenges that we face can SPIFFE actually help us with? The first use case, and probably the one I get most excited about, is removing the need for API keys. Imagine a world where you no longer need long-lived API keys or tokens: you don't have to deal with securely storing them, or with finding a way to retrieve them from your secret store, which on its own also requires an authentication token. One of the major hurdles of adoption for SPIFFE will be making your applications SPIFFE-aware. There are some great SDKs out there in different programming languages to help you implement that quickly. And if you can't change your applications, for example with databases, you can always opt to put a SPIFFE-aware proxy in front of them.
Really great examples of such proxies are the Envoy proxy, of course, which powers service meshes like Istio, and there is also, for example, a really good proxy for PostgreSQL that already has SPIFFE support. A second use case is to authenticate from one cloud to another cloud through the use of SPIFFE. An example of this could be writing a file to an AWS S3 bucket from a Google Cloud instance, or even from an on-premise VM. And the third use case is to give every step in a CI/CD pipeline a unique identity. That helps with auditing your software supply chain, and you would be able to use your SPIFFE identity to authenticate to your signing provider and then sign your release artifacts, or any evidence you provide as part of that release artifact, like a software bill of materials, an SBOM.

So implementing SPIFFE is a major step in your zero trust security story. By giving all of your workloads, applications, and machines a unique identity, this SPIFFE identity allows you to do explicit authorization between your workloads. By using SPIFFE for authorization between workloads and services, you prevent API keys from being shared between multiple applications without your knowledge. I've seen countless times where one API token gets shared between ten different applications, making it impossible to track where a request is coming from. Using SPIFFE identities also gives you an improved auditing story on your authorization layer, as you can explicitly log which SPIFFE identity a request is coming from.

On the previous slide I mentioned that you no longer have to think about where to store your API keys or how to retrieve them. A SPIFFE identity will in most cases be automatically provided as part of the platform your application runs on; a great example of this is Kubernetes. When your application runs on Kubernetes, the platform can automatically provide a SPIFFE identity to each workload running on it. This makes the life of application teams easier, as they can focus more on adding features and improving the reliability and security of their applications at other levels. One of the big challenges with long-lived secrets is that in most cases regular manual rotation is required to stay compliant with policies and controls. SPIFFE takes this all away, and it's even recommended to have short-lived SPIFFE identities of between one hour and 24 hours maximum, which get renewed and rotated automatically.

This is a quote from Joe Beda again, and it summarizes exactly the items we just talked about: it's very important to deliver a great developer usability experience to improve the rate of adoption. SPIFFE is also a solution that works everywhere: on premise and in the cloud, in serverless, and on mainframes. And the advantages that SPIFFE brings for observability and policy enforcement are unprecedented.

So we have now talked a bit about what SPIFFE is and the problems it can help solve. SPIFFE is an open framework, with its specification published on GitHub for everyone to see, use, and implement. We're now going to get into the details, and you're going to discover how a workload can get a SPIFFE identity and what a SPIFFE identity actually is. To demonstrate this, I have created a small architecture of two servers, each with two applications running on them. We're going to start with the current status quo that we probably all know, where we have a secrets manager running somewhere; that can be HashiCorp Vault, or one of the cloud providers' secrets managers.
The way this works is that application B, for example, retrieves a secret with its client ID and then initiates a connection to application X, sending over the client ID, which application X can then verify. With that, of course, application X only knows that the connection comes from something that has this client ID. The client ID can be shared between multiple applications without anyone knowing, so you cannot be 100% sure that this connection actually comes from application B, unless you're using IPs; but as we all know, in the cloud IPs change quite a lot. One of the other downsides is that this connection is not secure: it's not mTLS or TLS encrypted.

The SPIFFE framework consists of five distinct components, and each component has its own function; as part of this presentation we're going to go over them one by one. We're starting with the SPIFFE ID. To a human, that's comparable to our name, how we are called. The SPIFFE identity looks very similar to a website's URL, and it consists of the following parts. First, the standard scheme, which identifies that what follows is a SPIFFE identity. The second part is the trust domain; this is there to identify distinct domains of trust. One enterprise can have multiple trust domains: a common example is a trust domain for production and another for development, because you want to keep them completely separated and not necessarily trusting each other. Even in production you can have several; I've seen, for example, distinct trust domains per data center. For this example we're using venafi.com as our trust domain. The last part is the path, and that's how we identify our workloads. Each workload should have its own unique path. The path can be a unique ID that's randomly generated, or a more meaningful one, as is the case in this example: we can see our workload runs in data center one, on node 10, and it's the web server of the front end. That allows us to uniquely identify our workloads.

So we're back at our applications, and we have removed the need for a secrets manager; now application B gets a SPIFFE identity. We can see it's spiffe://venafi.com/server1/applicationB, and application B sends its identity over to application X. Application X can verify this and authorize based on that SPIFFE identity, so application X knows it's getting a connection from application B. One downside of this is that our network traffic is still not encrypted, and because it's just an identity, in essence it can still be forged. That's why we need to go to the SVID, the SPIFFE Verifiable Identity Document. It's comparable to the passport we carry around, with which we can, for example, get verified at border control. A SPIFFE Verifiable Identity Document, or SVID for short, is a cryptographically signed identity document that can be used to verify the identity of a workload within a specific trust domain. It's a fundamental component of the SPIFFE framework, which aims to provide secure and standardized identification and authentication for services in distributed and dynamic environments such as microservices architectures and cloud native applications. Currently, JWT tokens and X.509 certificates are supported as the cryptographic key material, which covers a lot of use cases.
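Putting the pieces so far side by side: the SPIFFE ID itself is just a structured URI, and it's this ID that later gets embedded in the signed SVID. Broken apart, it looks like this; the path segments mirror the data-center/node/front-end example from the slide, but the exact segment names are illustrative rather than the ones used in the talk:

```text
spiffe://venafi.com/dc1/node10/frontend/webserver
^        ^          ^
|        |          '-- workload path: unique per workload (a random UID or something meaningful)
|        '-- trust domain: e.g. separate trust domains for production and development
'-- scheme: marks this as a SPIFFE identity
```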
cover a lot of use cases. Also, all of these keys should have a short lifetime, so that in case of compromise the exposure is very minimal.

This is an example of a SPIFFE SVID as an X.509 certificate, and in the URI part you can see the SPIFFE identity. So this is the SVID, and it has the identity encoded in it. It's signed by our CA, which can then be trusted in our environment, and it kind of encapsulates our trust domain. A similar example for the JWT token: again, we have our SPIFFE ID in the JWT payload. So we again have our application on two distinct servers, but now, instead of just sending over the SPIFFE identity, we're going to send over this SPIFFE SVID as part of an X.509 certificate. Application B is again initiating the connection to application X, and this time we're sending over the public key of our X.509 certificate, which has our SPIFFE identity in it. Application X can verify this based on the common root of trust — based on the CA — and it verifies it and authorizes it. And from now on our connection is mTLS-secured as well, because with the public key you can start doing mTLS.

The biggest missing thing after all of this is: how can we get applications their SPIFFE SVIDs from our central system, and how can identities be issued in a trustworthy way? This is where the Workload API comes in — like requesting our passport at the passport office. We need a way to get the SPIFFE Verifiable Identity Document for each workload, and this is done through a standardized API: the SPIFFE Workload API, which preferably runs on a local socket, or at least only listens on localhost of a node, as it's unauthenticated. This API is the way to retrieve SPIFFE SVIDs. It is responsible for the automatic renewal and rotation of our SPIFFE SVIDs, as we want them short-lived — it is quite common for a SPIFFE SVID to have a lifetime of only 1 to 24 hours. The Workload API allows you to create integrations for your different platforms that deliver SPIFFE SVIDs directly to your applications in a well-known location. Currently that already exists for Kubernetes, but imagine a life where, wherever you run your workload, you have a SPIFFE SVID — no matter where you run, from AWS Lambda functions to on-premise VMs to even mainframes. The SPIFFE Workload API is an unauthenticated API endpoint, and hence it should only be reachable locally. It's also responsible for validating and attesting the different workloads that run on that node. This happens out of band — otherwise we wouldn't really be solving secret zero. A bit more on the attestation and verification later.

We are now back to our two servers, but this time we have added the SPIFFE agent, which runs locally on each of the servers and exposes the Workload API through a local socket on that server.
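As a minimal sketch of what that "well-known location" can look like on Kubernetes: a pod can mount the Workload API socket via the SPIFFE CSI driver. The driver name `csi.spiffe.io`, the socket path, and the image are assumptions based on the upstream spiffe-csi project, not something shown in the talk — adjust to your installation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: application-b
spec:
  containers:
    - name: app
      image: example.com/application-b:latest   # placeholder image
      env:
        # go-spiffe-style clients typically discover the socket via this variable
        - name: SPIFFE_ENDPOINT_SOCKET
          value: unix:///spiffe-workload-api/spire-agent.sock
      volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
  volumes:
    - name: spiffe-workload-api
      csi:
        driver: csi.spiffe.io   # assumed driver name from the spiffe-csi project
        readOnly: true
```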
So application X will never be able to reach the Workload API on server one; it will only be able to reach the Workload API on server two. We request a SPIFFE SVID from the Workload API for my application; it gets validated and attested in the background — again, more on that attestation and verification process later. The Workload API issues an SVID for application B, and application B can then start using it. We repeat the same process for all our other workloads so that each of them has a unique identity. And then we do exactly the same thing as before: application B initiates the connection and sends over the public key, which application X can then verify, and we still have mTLS.

So we have now solved almost all of our problems. The only problem remaining is to be able to attest and verify, so that we can issue X.509 and JWT SVIDs through our Workload API. We can get SVIDs from a central place, but we also need to make sure that those workloads are who they say they are. The verification of workloads cannot happen in band, as that would require API keys, and then we wouldn't really be solving secret zero — or the bottom turtle problem. Verification and attestation of workloads need to happen out of band and asynchronously.

Verification and attestation happen at two levels. First, we need to trust the node where the SPIFFE agent is running, and only after we trust the node can we start trusting the workloads running on that specific node. To build up that trust we need to gather evidence about the environment the nodes and workloads are running on. For nodes at a cloud provider, for example, this can be done by querying the instance metadata, which is locally available to each EC2 VM or GCP VM, retrieving facts about it; a central system can then verify this information out of band. In an on-premise system this can be done, for example, through a TPM or other verifiable environmental data. Once the node is verified, the workloads running on that node can also be verified. This is again done by verifying the environment: proof can be gathered, for example, from Windows or Unix sockets, or, when running in Kubernetes, from the Kubernetes API or through the kubelet. I'm going to talk through an example of this to make it a bit easier to grasp.
So we have now added a SPIFFE server, as well as a node attestation endpoint on that SPIFFE server, and the SPIFFE server is also the one that controls the issuing CA for each of the X.509 and JWT SVIDs that are going to get issued. On the servers themselves we also have a node attestation and a workload attestation process running that help with this. The way it works is: when a SPIFFE agent comes up, for example on server one, it gathers some proof about the node and sends that proof to our SPIFFE server. The SPIFFE server validates that information by doing its own checks — for example by querying the cloud API endpoints and validating that the node is what it says it is — and it then issues a SPIFFE identity for the server and sends it back. The reason the SPIFFE agent also gets its own identity is that from then on, the communication between the SPIFFE agent and the SPIFFE server happens over a secure connection based on that identity as well. So all communication between the agent and the server after its initial verification happens over a secure connection.

We now have an identity for our servers, so we can start verifying our workloads. We send a request for an SVID to the Workload API, and the SPIFFE agent verifies that it can issue an SVID for our application — based on Unix processes, or, when it's Kubernetes, by querying the Kubernetes API — verifying all of that information. Once it's verified, it issues an identity, and from then on it's exactly the same as before: we send over our certificate, we validate the connection, and it's still secure. So we have now actually solved the true secret zero process, and that can be used for multiple use cases.

One thing I still want to talk about a bit is SPIFFE federation. It's a more advanced topic, but I still want to briefly cover it at a very high level. SPIFFE federation is important when you have multiple SPIFFE servers or multiple trust domains. A common scenario is to have a SPIFFE trust domain per environment — for example, you will have development and production, but you might also have a shared environment that needs to be able to talk to both dev and production; think about CI/CD servers, for example. So you want to federate shared with production and shared with dev, but not dev with production, to keep distinct domains and reduce the blast radius. That's why we need to start thinking about SPIFFE federation: two workloads from different trust domains still sometimes need to be able to talk to each other. Each distinct SPIFFE server must have a trust bundle endpoint that can be queried by other SPIFFE servers. A trust bundle is comparable to a CA root chain; it publishes the public keys of that SPIFFE server. When SPIFFE servers want to federate with and trust another SPIFFE server, they need to explicitly define it in their config, and based on the refresh timings set in the trust bundle, the SPIFFE server knows when to re-query the trust bundle of the other SPIFFE server to get an updated one. Using those refresh timings, the different SPIFFE servers can also rotate their CAs in a nice way without causing downtime.
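To make the federation config a bit more concrete, here is a minimal sketch of how a federated trust domain can be declared when you run SPIRE on Kubernetes with the spire-controller-manager. The CRD, API version and field names are my assumptions based on that project, and the trust domain and endpoint URL are purely illustrative:

```yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterFederatedTrustDomain
metadata:
  name: production                                        # illustrative name
spec:
  trustDomain: prod.example.org                           # the other trust domain
  bundleEndpointURL: https://spire.prod.example.org:8443  # its trust bundle endpoint
  bundleEndpointProfile:
    type: https_spiffe                                    # authenticate the endpoint with its own SVID
    endpointSPIFFEID: spiffe://prod.example.org/spire/server
```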
The three most common use cases for federation: the first is to segment environments with different levels of trust, as I talked about with shared and production, and shared and dev, for example. A second use case for SPIFFE federation is between different companies. It is similar to the first one, as federation will happen between different SPIRE or SPIFFE deployments, but there might be differences in the implementation and administration of the SPIFFE framework. The third and final use case is to enable consumers that don't have a SPIFFE setup yet: they can fetch the trust bundle and use it to authenticate their callers, without having to commit to setting up a full-blown SPIFFE deployment. For that third setup, this is for example how you can do federation with lots of cloud providers: many cloud providers support OIDC, OpenID Connect, and SPIFFE — or the open source implementation, SPIRE — supports this, so you can federate with something that isn't a full-blown SPIFFE setup. Afterwards, the SPIFFE server is responsible for distributing the trust bundles to each workload, so that a workload can verify the SPIFFE SVIDs from another SPIFFE server. This was a very high-level overview.

And we're back to our application. This time we have two distinct SPIFFE servers, and we have also added a trust bundle endpoint. Server one will be part of the trust domain of SPIFFE server one, while server two will be part of SPIFFE server two, and you can see each of them has its own CA; they are not linked together, they have their own trust domains. By default they wouldn't be able to talk to each other, because the X.509 certificates wouldn't be able to trust each other. What happens when we define in our SPIFFE servers that we want to trust each other is that they query each other's trust bundle endpoints, retrieve those trust bundles, and then distribute them. It's a bit of a simplification, but this is roughly how it works. And then we get our SPIFFE identities again, by doing all the normal rounds we talked about earlier. And because the trust bundles have been distributed, even though they're in different trust domains, application X will again be able to verify, authenticate, and authorize this connection and set up mTLS, even though they're part of separate trust domains. So I'm going to try to do a small demo. It's going to be very basic, just showcasing some of the SPIRE stuff.
So, SPIRE is an open source project, and it's part of the SPIFFE community. What they have done is create Helm charts for installing SPIRE, and I wanted to quickly take you through some of the things. This is my Helm chart, and you can see it has quite a lot of information. It's mainly exposing some of the domains I have running — for example, I have Tornjak, which is a UI for SPIRE; it's a very basic UI, but it shows you something. For example here you can see that I also expose my OIDC endpoint and set up TLS for it. And then, as I talked about, the Workload API runs on a socket, and one of the things I do is write it to a specific socket. The reason I need to do this is to make it work with Istio, for example — Istio works together with SPIRE, and I'll show that a bit later.

So after running this, what you get is SPIRE running inside your Kubernetes cluster. This is my Kubernetes cluster, and what you can see here is my SPIRE server — currently it's not a highly available setup; for production use cases I suggest you do that, of course — and a SPIRE agent. The reason I have two SPIRE agents is that when I do kubectl get nodes, you will see I have two nodes running in my GKE cluster. I also have two CSI driver pods; these are the ones that make it easy to map the Workload API socket into my Kubernetes pods — again, more on that in a bit. And I also have the OIDC endpoint and the Tornjak one running in there.

To showcase this I have a SPIRE tools pod set up within my Kubernetes cluster that has the Workload API socket mounted in it, and I'm running the command spire-agent api fetch — so I want to get my identities. I want to write my identity to the /tmp path, and I define the socket path, which is where I can get my workload identity from. When I run this, it does the validation and the attestation in the background, it has gotten my keys, and it has written them to /tmp. So when I now look at the bundle — this is the CA bundle — you will see this is my trust domain, spire.internal.muchasg.be, and it's signed by itself; it's a self-signed one that I'm currently using to keep things very easy. One of the things I can also show you is this one: here you can see the SPIFFE ID of my workload. It's quite hard to read, but you can see, for example, that my trust domain is spire.internal.muchasg.be, and this workload runs in the namespace default with the service account default — and you saw me getting into the pod; it was running in that same namespace. One of the nice things you can also do is add extra DNS entries to it. You can see here it has the full pod name, and in a bit in the demo you'll see that it can also have service names for internal Kubernetes services — it's fairly mouldable. One of the advantages of adding this extra information is that you can then use it for applications or other things that don't support SPIFFE yet but can work with certificate authentication, like databases and stuff like that — they're supported by default.
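For reference, this is roughly the kind of thing that drives those demo SVIDs when SPIRE is deployed on Kubernetes with the spire-controller-manager: a cluster-wide template that mints a SPIFFE ID per namespace and service account and can add DNS names. The CRD and field names are my assumptions from that project, and the selector label is illustrative:

```yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: default-workloads          # illustrative name
spec:
  # One SVID per pod, keyed on namespace and service account,
  # e.g. spiffe://<trust-domain>/ns/default/sa/default as in the demo.
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  # Extra DNS SANs, useful for things that only understand certificate auth (databases, etc.).
  dnsNameTemplates:
    - "{{ .PodMeta.Name }}"
  podSelector:
    matchLabels:
      spiffe.io/spire-managed-identity: "true"   # illustrative label
```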
And for the last bit, I'm talking about Istio. Istio natively can work with SPIRE. Istio has its own workload identity generator — it's part of Citadel — but it doesn't do the same in-depth attestation and validation; it's really simple attestation and validation, and SPIRE can do much more in-depth validation. So you can replace it, and it's fairly easy to do. You can see here I need to inject a sidecar webhook, and besides that it's a very basic config: I deploy my ingress gateways, which also have those workload sockets in them. After I've deployed my bit, I can run an istioctl command to retrieve a secret and the certificate from it. This is the standard Bookinfo application that I have deployed, and you can see in there that it runs in the namespace bookinfo with the bookinfo-details service account; it's also quite clear that it has the extra DNS names I was talking about earlier — it has the services in front of it, so I have auto-populated them. This is all possible with the Kubernetes integration, and it's very easy to start playing with it.

SPIRE itself is a really great open source project, but when you start running it in production it has certain challenges. It requires a datastore: by default it uses a local file database, but for high availability it needs to be SQL-based, so you can back it with PostgreSQL or MySQL if you have high availability requirements. And it requires a bit of setup and understanding to start running it. But as I said, the cloud providers are working on support, and there are a few other startups working on it, so there is quite a lot moving in that space right now that will hopefully make this a lot easier to run. In a moment I will also talk about how to get something simpler running than SPIRE.

Also, before ending this talk, I have one more quote, from the head of security and CISO of Notable: for him, SPIFFE is a game changer, and it really helps in simplifying TLS distribution to hosts. To end with a bit of a summary: SPIFFE provides foundational identity. It's going to reduce the need for API key distribution. It gives you short-lived identities that automatically get renewed. All of the attestation and validation of your workloads happens out of band. It is hopefully going to make developers' lives easier — they don't need to go to a secrets manager anymore and populate secrets in there; it's something that should be foundational on your platforms. And SPIRE is an open source implementation of SPIFFE. I also promised that I'm going to make it a bit easier to get started with SPIFFE.
Two weeks ago I did a webinar for the CNCF where we demonstrated how you can use cert-manager. cert-manager is the de facto certificate management tool within Kubernetes, and it's very easy to set up. cert-manager has simple support for SPIFFE — it doesn't support the full spec, but it allows you to get SPIFFE certificates — and together with another open source tool called Otterize you can use it to fully set up authentication and authorization to AWS across clouds. This is the link to the webinar page, and in that webinar we walk you through how cert-manager works together with Otterize. It also has a demo that you can fully replicate. What it does is: from Google Cloud, for example, from a GKE cluster, you can talk to AWS S3 without needing to do anything — it automatically gets AWS authentication; it sets all of this up, all from within Kubernetes. So after the initial setup, together with Otterize, all of your workloads get a SPIFFE identity thanks to cert-manager, and then Otterize does the authorization bit towards AWS: you add an annotation to your workload, you declare an intent in your Kubernetes cluster, and that intent will automatically set up, for example, the AWS IAM policies for you on the AWS side. This can all happen natively within Kubernetes, and it works in a pure multicloud way. Unfortunately, this cert-manager SPIFFE implementation currently only works with AWS, as AWS is, I think, the only cloud provider at this moment that allows authentication with X.509 certificates, and cert-manager currently only does X.509 certificates. If you want to make it work with other cloud providers, you need to look more at JWT tokens, and then at this moment you need to look at SPIRE.

I want to thank everybody for coming here, and I'm open to questions. If you have any questions, I'll pass the mic over to you so we can record it.

When you had the two different servers serving trust domains and they would swap the trust bundles, how can they authenticate the other server's trust bundle if they don't have a shared cryptographic source? — So, if I go back to this slide here: what retrieving the trust bundles does is get the public key of the CA, and the public key of the CA then gets distributed. SPIFFE server two retrieves the trust bundle of SPIFFE server one and populates the public key of SPIFFE server one here, and because you then have the public key, you can do the verification with it.

So the follow-up question is: can this trust bundle distribution endpoint be man-in-the-middled if you don't use an HTTPS connection? Yes, of course. You might, for example, already have other means of public key distribution — in your images, for example, where you have a CA — then you can of course put the endpoint behind HTTPS, and it's a lot harder to man-in-the-middle that. So the trust bundle endpoint can be an HTTPS endpoint, and as long as you get the trust anchor in through some other way, this can be secured and you can prevent man-in-the-middle.

So you mentioned this makes the developer's life easier. What does this look like? I know where to get my passwords, I know how to generate one of those — how do I generate a SPIFFE
or SPIFFE ID? — That's indeed where SPIRE, for example, comes in, which is the production-ready implementation of the SPIFFE framework, or cert-manager with the cert-manager CSI SPIFFE driver, which generates those SPIFFE IDs for you. Because a big part of SPIFFE identities is also doing the verification and attestation. You would be able to mint your own X.509 certificates in a manual way — do your own verification and attestation and then just mint a certificate with the SPIFFE identity in the right place — and that would be the very rudimentary way. But in most cases you're going to go the automated way and look at tooling like SPIRE and cert-manager, as well as maybe the cloud providers that are coming, which will give you that SPIFFE identity that you can then use for authentication and authorization between different workloads. It's a great question, and I've been thinking quite a lot about when it starts paying off to have SPIFFE everywhere, because for a really long time you're going to run both solutions. One of the nice things, for example, is that HashiCorp Vault — which I know is a very popular secrets manager — supports authentication with SPIFFE. So a really great way to start, especially when I look at VMs: VMs don't have a basic identity. HashiCorp Vault with Kubernetes clusters can already be used through the service account JWTs, but VMs don't have this. Once you have SPIFFE on your VMs, for example through a system like SPIRE, you can already authenticate to HashiCorp Vault and then slowly start chipping away, getting rid of secrets and doing more SPIFFE-native things. But for quite some time you're going to run dual.

Can I use SPIFFE in a mobile app, to have the mobile app identify and authenticate with the server? — Sorry, what was the question? Can I use it in a mobile app, configured in a mobile app on a phone, to have it authenticate itself and exchange trust information with the back end, with the cloud? — Nothing stops you, because SPIFFE itself is a framework. We've been debating whether it can, for example, be used for user identities as well, where your user gets a certificate. In theory it's possible, because it's a specification. I haven't seen it done as an actual implementation, or any of the open source projects implement it, but I don't think anything stops you from doing the development and trying it out. I think it would probably be nice and pure.

As of about six months ago, HashiCorp Vault doesn't really exist anymore for a lot of us — at least not as an open source project. So are there any open source
secrets providers at all that can be used with SPIRE? — One of them, actually quite recent — and I'm not sure what the future of it is since the acquisition — is VMware Secrets Manager, which is powered by SPIFFE for authentication in the background, and it could be an alternative. I haven't really tested it, but I've seen a bit of buzz around it and I really need to try it out, as the people that created it are very active in the SPIFFE community as well, and it sounds like quite a neat solution for exactly that case where you need to run dual. I also know there are some efforts on having something like an OpenTofu for HashiCorp Vault, but I haven't really looked into that story.

But that VMware thing you mentioned — is that a replacement for SPIRE, or is it actually just a secrets provider that SPIRE can work with? — It's a secrets manager that natively does authentication along the SPIFFE lines, so it works together with SPIRE natively, for example. — Is it yet another piece in the Tanzu product portfolio? — Yeah, but the VMware Secrets Manager itself is fully open source and you can run it without it being part of Tanzu.

I forgot where you mentioned there was a database that it uses — where is that, in the agent? — It's in the SPIFFE server. The SPIFFE agents themselves are fully stateless; the SPIFFE server — or the SPIRE server, because SPIRE is the production-grade implementation — has a database. Of course, SPIFFE itself is a framework, a spec, and knowing a bit about what's being built out there, not all implementations are going to have a database on their SPIFFE server, but SPIRE definitely has one. Thank you, Matias. Thank you for the questions, they were really good.

Okay. Okay, okay, I guess we'll get started; a few more people may trickle in. We welcome Carlos Sanchez to discuss lessons learned around migrating an existing app to a multi-tenant environment. Carlos is a principal scientist at Adobe, he has been very much involved in open source for over 15 years, including his focus on the Jenkins Kubernetes plugins, and he is also a member of the Apache Software Foundation. So welcome, Carlos.

Thank you. Thank you for having me. So I'm going to talk to you about what we did — real-life lessons: what we did well, what we did wrong — migrating to multi-tenant cloud native, and hopefully you'll get some ideas, some things that will be useful for your projects. So first, thank you for being here — I know there's a Lakers game at the same time, so thank you for coming. Who here knows about Kubernetes? Okay. Who's using Kubernetes in production? Okay, so almost everybody. And because we are a small audience, just interrupt me whenever you want and ask me any questions you have. I'm going to start with a little introduction about what Adobe Experience Manager is, so you understand the challenges here. It's an existing distributed Java OSGi application, and even before we moved it to the cloud,
it was already a distributed application. That had some benefits: people could run this on-prem across multiple VMs, multiple machines, and so on, and it would scale horizontally, so that made our life easier. It uses a lot of open source components from the Apache Foundation, and it has a huge market of extension developers — people who write code to run on the AEM platform. This is going to be interesting later on. So what we did was: let's take AEM and run it on Kubernetes, because Kubernetes, Kubernetes, Kubernetes, right?

We are currently running on Azure; we have more than 45 clusters and we keep growing over time. And because we are a content management system, we run across multiple regions — wherever the customer, or the customer's customers, are, we want the content as close as possible to them. So we have the US, Europe, Australia, Singapore, Japan, India, and whatever new region comes up, we will probably take it. Another interesting fact at Adobe is that we have a dedicated team building the clusters for us — what is called today a platform team, because platform is the new trendy word. That also limits what type of things we can do, right? We don't own the clusters; we have another team that provides them. I think this is, if it's not already, going to be very typical in any big company: there's going to be a team that gives you access to the clusters, or even a cloud provider where you say, okay, I'll take the cluster, but I cannot do everything I would be able to do in a local cluster that I run myself. For good reasons and bad reasons, you're not going to be able to do everything you could otherwise do.

We have 17,000 environments. An environment is a set of deployments that we give a customer — a customer can have multiple environments — and it comprises multiple Kubernetes deployments, services, and other Kubernetes objects. That means we have more than a hundred thousand Deployment objects in our Kubernetes clusters, and it also means we have over 6,000 namespaces. So that's more or less the scale we have as AEM, just a single product at Adobe. An environment for AEM is something that the customer self-serves: they come to a UI — well, an API — and say, oh, I want a new environment.
An environment can be a dev, stage, or production environment for them, and they can have multiple dev environments, one stage, and one production — so they have at least three environments. Each environment is a bit like a Helm chart, with its own deployments, services, and so on, and these environments are separated by Kubernetes namespace for isolation. So each customer means at least three Kubernetes namespaces. And each environment is what I like to call a micro-monolith: we took the Java application that customers could run on-prem or on a VM — or that we could already run on a VM in the cloud for them — and now we run it in containers, in Kubernetes pods.

We use namespaces to provide the scoping for the multi-tenancy part. Namespaces in Kubernetes give you network isolation, quotas, and permissions. You can say, okay, I don't want a namespace to talk to another namespace; I don't want a namespace to grow beyond this amount in case you misconfigure something and the scale goes uncontrolled; and I don't want a namespace to see things in another namespace. That's all what you get for free in Kubernetes when you use namespaces.

We have multiple services and multiple teams building services, and different teams have different requirements, so we mostly let people do what they want, in more of a you-build-it-you-run-it mentality. We help them — I'm sitting more on top of the Kubernetes part, a bit more on the infrastructure layer, not so much the application side — and we tell the people building the services on top: you can do a bit of whatever you want, just take these things into account. You want to use Go? You want to use Node? You want to use Java? That's fine. The model we are following is API patterns — so services have APIs — and operator patterns, where we build operators that perform actions on the clusters, on the environments, on everything. The operator pattern in Kubernetes, if you are not familiar with it, is about managing state. You create a custom resource definition in Kubernetes, then you have an operator, which is a service that is continuously running and monitoring these custom resources and asking: what is the desired state, and are we in that desired state, or do I need to make changes? The operator keeps this reconciliation loop forever, checking, whenever one of these custom resources changes, what needs to be done. This is very useful when you have, for instance, the Helm operator: you create an object that defines "I want to install this chart with these values in this cluster and in this namespace", and the Helm operator always checks: okay, is this installed? No, I need to install it. Has this changed? Yes, I need to update it. So that's the operator pattern, and we use it for a bunch of services.
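As a concrete illustration of the per-namespace guardrails mentioned a moment ago — the things you "get for free" with namespaces — here's a minimal sketch; the names and numbers are purely illustrative, not Adobe's actual configuration:

```yaml
# Cap how much one tenant namespace can consume if something is misconfigured.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: env-quota
  namespace: customer-a-dev        # illustrative tenant namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    pods: "100"
---
# Only allow traffic from within the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: customer-a-dev
spec:
  podSelector: {}                  # applies to all pods in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}          # any pod in this namespace
```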
On the environment side, we use init containers and many sidecars for division of concerns. In Kubernetes you have the concept of init containers — containers that run before your main containers — and sidecars — containers that run alongside your main container. And there's a new feature in Kubernetes, in the latest versions, where you can have containers that start as init containers and become sidecars. That is useful for things like logging: you want the logging sidecar to start as early as possible and start shipping logs somewhere, and you want it to keep running the whole time. Before this — I think this came in the last version — you had init containers and main containers, or the main container and its sidecars, and they were separate. Now you can have one that spans from the very beginning to the very end of the life of the pod.

On the sidecar part, the division-of-concerns model we follow is: instead of adding more things to the main container and to the Java application, which is already a big micro-monolith or whatever you want to call it, we create sidecars and init containers that do specialized things. That way we can separate them; different teams manage them, they follow their own release cycle, and so on. So we have service warm-up, storage initialization, an httpd server in front of the Java application for caching and other configuration, sidecar containers that export metrics to Prometheus, Fluent Bit for logging, a sidecar that collects Java thread dumps and ships and stores them, Envoy for more advanced networking — Envoy is a proxy, and I'll talk about it on a following slide — and another one, the auto-updater.

The service warm-up, for instance, is a service that, when the pod comes up, starts hitting the most requested URLs to warm the cache before the pod receives traffic. We manage the readiness probe in Kubernetes, so when the pod comes up, before it says it's ready to accept traffic, we warm it up — we warm the cache — and then it can mark itself as ready and start getting traffic. It does this lazy caching without very expensive starts. Fluent Bit is a very typical solution: you run it as a sidecar, you have a shared volume where your main application — or all your containers — write logs, and Fluent Bit reads the logs from the file system and ships them wherever you want. You could do this from the main container, but this makes it easier to change without having to touch the application. It's separation of concerns, and we can configure it independently: if we need to upgrade Fluent Bit, we don't have to make a release of the main application; we can just do it behind the scenes and update Fluent Bit without changes for customers.
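For reference, this is roughly what the "init container that becomes a sidecar" feature looks like since Kubernetes 1.28 (native sidecars): an init container with restartPolicy: Always starts before the main container and keeps running for the life of the pod. Names and images here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aem-like-pod               # illustrative
spec:
  initContainers:
    - name: log-shipper
      image: fluent/fluent-bit:2.2 # illustrative tag
      restartPolicy: Always        # this is what turns the init container into a native sidecar
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: example.com/java-app:latest   # placeholder
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}
```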
Envoy proxy — who knows Envoy? Okay, just a few people. Envoy is a very widely used proxy on Kubernetes; it's used by a lot of service meshes behind the scenes. If you use Istio — and pretty much any of the service meshes — they use Envoy as the proxy inside. We use it because we have customers that say: I need to connect to my internal VPN to get some data, or I need to go out to the internet using a dedicated IP, because I don't want to get affected by other tenants in the clusters — I want a dedicated IP just for myself, because maybe one tenant is doing a lot of requests to a service and you get throttled. So we have this ability. For dedicated IPs and VPN connectivity, we run an Envoy sidecar in the pod; we send the traffic from the Java application to that Envoy proxy, and that proxy does an mTLS tunnel over HTTP to a VM that goes out to the internet with a dedicated IP. That's how we implemented it; now you have more options, and there are tools from cloud providers that make this easier, where you say: I want these specific pods to all go through this specific network route, and that route can have VPN and other things. The cloud providers are giving you more out-of-the-box functionality.

The auto-updater is a sidecar — an init container — that we created. Anybody heard about the Log4j CVE? Well, if you were not under a rock, you probably heard about it. Suddenly we are running these thousands of deployments and we have to figure out: how do we upgrade Log4j in all of them? And because the environments are in the control of the users — the user can say, I want to upgrade now, or I don't want to upgrade, or they're using a specific version of AEM — we have to say: okay, whatever version of AEM you're using, Log4j has to be upgraded. So what we did was add this init container that, on startup, makes changes on the file system before the main container starts. Different people have done it in different ways, but we chose this because it was a bit transparent, but not too magical. Every time your pod starts, this init container can do whatever you want: it can go into the file system, set a file, change something. That way we can control this without changing the main application, the Java application, at all — and it allows us to patch the whole cluster for life. If tomorrow we have another issue, we just change the container for the auto-updater, and we can do whatever we want. Any questions so far?

Yes — this probably isn't worthy of being recorded, but: in theory, for Log4j, are you running through every POM file in AEM and updating something? — No, we patch, we don't rebuild. We patch it in the main application container's file system. — Gotcha, just wanted clarification. — We run the same application, just different versions, so we know where to go: in the file system, Log4j is here in this path, and if the version is one affected by the CVE, we just copy the other file over.
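A minimal sketch of that kind of patch-on-startup init container — the paths, image and jar names are hypothetical, just to show the shape of the idea; it assumes the application's launcher picks up jars from the shared directory first:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: patched-app                           # illustrative
spec:
  initContainers:
    - name: auto-updater
      image: example.com/auto-updater:latest  # hypothetical patcher image
      command: ["/bin/sh", "-c"]
      args:
        # Drop the fixed library onto a shared volume before the main container starts
        # (paths and jar names are hypothetical).
        - cp /patches/log4j-core-2.17.1.jar /patched-libs/
      volumeMounts:
        - name: patched-libs
          mountPath: /patched-libs
  containers:
    - name: app
      image: example.com/java-app:latest      # placeholder
      volumeMounts:
        - name: patched-libs
          mountPath: /opt/app/patched-libs    # assumed to be first on the app's classpath
  volumes:
    - name: patched-libs
      emptyDir: {}
```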
On the operator side, as I mentioned, we use a bunch of operators. We started by creating one, and I think on the operator side, the custom resource should be what makes business sense, what makes functional sense for you. We created an AEM environment operator: you create a custom resource that defines an environment, and that operator goes and looks at it. That makes business sense: when you create an environment, you just create one object; you don't need to create 20 different objects. You create one object, and the operator looks at that object and its parameters and does the other things. From your business point of view, it's: I'm creating one environment — the unit is the environment, so I create one object. And this manages the life cycle of the environment: when I create this custom resource, I end up with an environment running; if I delete it, the environment gets deleted. It makes semantic sense. And this operator, instead of doing everything itself, also delegates to other operators. For instance, we have to launch jobs before an environment is created: this operator launches the jobs, and then it uses all these other internal operators to reconcile the status of the environment.

An open source example is the Flux Helm operator. Flux is a GitOps operator — it allows you to deploy things to the cluster, pretty much like Argo CD — but inside the Flux umbrella they also have the Helm operator, and you can use that operator separately; you don't need to adopt the whole GitOps approach just to take advantage of it. The Helm operator allows you to manage Helm charts using declarative state: you create an object, a HelmRelease; the operator watches the HelmRelease object — if it's created, it runs a helm install; if it's changed, it runs a helm upgrade; and so on. It always keeps them synchronized. So when we create an AEM environment CR, the operator behind the scenes creates HelmRelease CRs and other ones, and those trigger the other operators. This way we don't have to implement everything inside one operator; we take advantage of other open source operators that already exist and divide the functionality across multiple operators. And when the Helm operator reconciles the HelmRelease, it writes the state into the HelmRelease, and we can take that state and put it back into the environment in a way that makes sense. So it's a chain of things that happen — some in parallel, some one after another: helm installs, helm upgrades, launching jobs and so on — and then on the main environment resource you can go and see the status and know what happened to all these sub-calls and chains of workflows.
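Here's a minimal sketch of what such a HelmRelease object looks like with the Flux Helm controller; the API version is current Flux, but the chart, repository and values are illustrative rather than Adobe's actual setup:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: customer-a-dev-aem         # illustrative
  namespace: customer-a-dev
spec:
  interval: 5m                     # how often to reconcile the release
  chart:
    spec:
      chart: aem-environment       # hypothetical chart name
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository
        name: internal-charts      # hypothetical chart repository
        namespace: flux-system
  values:
    tier: dev                      # hypothetical values consumed by the chart
    replicas: 2
```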
Let's talk about Argo CD — we also use Argo CD. Again, it's the whole GitOps thing: you can use the whole set of services it provides, and it applies GitOps state into the cluster. It's very widely used at Adobe, and we contribute back to it. It has different things inside — workflows, events — so you can also use just parts of it. We use it for some namespaces, because our platform team has established Argo CD as a standard: when we create new services, we can just go to Git and say, deploy this with Argo CD — well, we don't even have to say that; we just onboard that Git repository, and it gets deployed with Argo CD. The Argo umbrella project also includes Argo Rollouts, which you can use without Argo CD at all. Argo Rollouts, I think, is very cool because it allows you to do advanced deployment techniques — progressive delivery. You can do canary, blue-green, A/B testing, whatever; there are a bunch of different things you can set up. And the very cool thing it does is allow you to do automatic rollbacks. You deploy, you configure your Argo Rollouts object, you set some metrics — that's what we do — and say a successful rollout means that less than 10% of the requests get an error. If more than 10% of the requests get an error, Argo will automatically roll back to the previous version, and you don't have to do anything: no manual checks, nothing. Obviously that requires you to have nice, good metrics and some confidence in doing this, but it's very useful. If you use a service mesh, it gives you a lot more power: you can say, I want 1% of the traffic to go to the new version. If you don't use a service mesh, you can still use it — we don't use a service mesh — you're just limited: you can only play with the number of pods. If you have 10 pods, you can add one more and have 10% of the traffic go to that pod, but you cannot do 1% or 5%. You are a bit limited, but you can still do nice things; especially the automatic rollback is very nice.

Moving on to how to scale and how to automate resources when you have big deployments and you are coming from a bit of a monolith. I mentioned that each environment — each of these 17,000-plus — is a micro-monolith, and we have multiple teams building services, so we need ways to scale that are more orthogonal to the developer teams, so we don't have to go to each developer team and say, hey, change this, change that, change this other thing. In the Kubernetes world there are two important resource concepts: requests and limits. Requests are how many resources you are guaranteed; limits are how many resources you can consume. Depending on what they're applied to, the resulting action is different. You can apply them to CPU, to memory, to ephemeral storage. When you apply them to memory, the limit is enforced: you are guaranteed the request that you asked for, but if you go over the limit, your process — the container — is going to get killed.
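As a quick illustration of requests versus limits as just described (numbers are purely illustrative; the CPU-specific caveats come next):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example           # illustrative
spec:
  containers:
    - name: app
      image: example.com/java-app:latest   # placeholder
      resources:
        requests:                  # guaranteed; used for scheduling and as a relative CPU weight
          cpu: "1"
          memory: 4Gi
          ephemeral-storage: 2Gi
        limits:                    # hard caps; exceeding memory -> OOM kill, storage -> eviction
          memory: 4Gi
          ephemeral-storage: 4Gi
          # no cpu limit here on purpose: a cpu limit means throttling (discussed below)
```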
On the ephemeral storage side, the limit is also enforced: if you use more storage, your pod is evicted. Pod eviction means the pod is removed from that node and Kubernetes schedules it on another node, so you are going to lose all the data you have in that ephemeral storage, and your pod starts fresh somewhere else.

A very interesting one is CPU. On the CPU request side, requests are used for scheduling: you say, I'm requesting one CPU for this pod or this container, so Kubernetes finds a node that has one CPU available and puts it there. But after that, the request is still used as a relative weight. It's not the number of CPUs that can be used; it's the share of CPU cycles the process can use. So if you have two containers running on a node and they each request only 0.1 CPU, they can still both use 50 percent of the node's CPU time at the same time. This is a bit tricky — it got us; we were figuring out what the hell was happening — and it tricks a lot of people.

On the limits side, this translates to cgroups quota and period. Some people say containers are just processes — containers do not contain — so they are just processes with kernel cgroups enabled. CPU limits become the cgroups quota and period. The period is, by default in the kernel, 100 milliseconds, and the limit is the number of CPU cycles that can be used within that 100 milliseconds. If your container goes over that limit, it gets throttled. Imagine you have one container with only one thread, using one core: if you request one CPU and have a limit of one CPU, you're going to be fine. You use one thread as much as you want on one CPU; it runs for the 100 milliseconds, then gets another 100 milliseconds, and another, and so on. This is challenging for Java applications and multi-threaded applications. For instance, if you request one CPU — a thousand milli-CPUs — in Kubernetes, and you have four threads, then if each thread uses all the CPU, in 25 milliseconds you are done.
You don't have any more CPU time. So hopefully this makes it a bit clearer: you have four threads, each on a different core; the period is 100 milliseconds, but after 25 milliseconds you have already consumed 100 milliseconds of CPU time across the four threads, so you get throttled for 75 milliseconds. And if you keep doing the same thing, this happens over and over: every 100 milliseconds, you only get to use 25. This is very important if you are, say, serving web requests: you get a request and suddenly the response time goes through the roof — what's happening? You look at the CPU throttling metrics that Kubernetes provides and realize your container is being throttled. That's something that tricked us and tricks a lot of people.

Yes — I can repeat the question. Okay, that's the million-dollar question: do I suggest using CPU limits in general? The answer — I don't know if I have it later, but the answer is: in production you should not use CPU limits, because if you use CPU limits you are artificially limiting the amount of CPU you use, so you are leaving CPU unused for no good reason. Because the request is a relative weight, you have a guarantee that two processes are not going to starve each other — one process will not starve the other; there is a relative weight between processes. If you set up a limit: imagine two processes, one doing nothing, the other wanting to use a lot of CPU. With a limit of one CPU, that process can only use one CPU, but you may have a 16- or 32-CPU node — you're wasting cores. It doesn't make any sense. If you just remove the limits: one process doing nothing, the other wanting a lot of CPU — it can go to 32 cores. If both want a lot of CPU and have the same request, they can each use 16 cores. So you don't have the problem of starvation between them.

One consideration — we haven't made this change yet, we have it planned, but we want to remove the limits in the production environments: imagine we have stage environments for clients and production environments. For us it's all production, but for them it's: I have my stage website and my production website, and they may run performance tests on stage. If you remove the limits and those stage pods happen to land on empty nodes with nothing else, the performance can be great — but then the production pods happen to have noisy neighbors and run on busy nodes, and the performance is not as good as in stage. So the thinking is: maybe in stage we want to keep limits, so people have an expectation of the minimum they're going to get in production. Otherwise it's: oh, my stage test was great, but production is not — and that's going to be very confusing.

Yes — I've got a follow-up on that one. Do you enforce some sort of resource quota, or do you enforce a strategy more upstream — say, using scaling groups or a Karpenter type of implementation where you've got specific node groups for specific workloads? Or do you get closer to namespace-level resource quotas or limitations?
No, we don't do separate node pools for separate workloads, with one caveat that I'm going to talk about in a bit. That's definitely a possibility you may want if it makes sense; it just complicates things a little. But there are tools like Karpenter now — Karpenter, the autoscaler. It's supported by AWS, it's open source, and it's also been adopted by Azure, and you can run your own Karpenter. It will look at different considerations and do smart autoscaling: okay, you have pods that require 64 CPUs — I'm going to start a node that has at least 64 CPUs. Oh, but now you have a mix of pods with a memory-to-CPU ratio of one to four, and others with one to 32 — okay, I'm going to start different types of nodes. So you can optimize the cost and how the workloads are scheduled in Kubernetes. That's what Karpenter does — and also on price: it will try to start the cheapest sizes you need. It doesn't make any sense, if your queue of pods waiting to be scheduled only needs five CPUs, to start a node with 120 CPUs. It does some smart things, because before that you had to manually say: I want a node pool of this size, with this type of VMs, with this memory-to-CPU ratio — and if you have many of them, you have to define all of those. — Yeah, but you can still introduce discipline when you're defining your Karpenter configuration, right? You can say: do not do this. — Yes, in Karpenter you can configure it to do different types of things.

And the caveat I mentioned: we use a different node pool with ARM CPUs. We estimate we get about 15 to 25 percent savings for the same performance, and it's very easy to switch, especially if you're running Java: you just switch the base container to another JDK built for ARM — done, nothing else to do. I mean, you just have to test that nothing breaks. Anybody using ARM here? No? Well, you should think about it, because you're going to save money, and you're going to warm the planet less. That's another benefit, so it's a win-win.

Anybody here doing Java? Yes, one, two. Okay, I'm going to skip a bit through the quiz because it's very Java-oriented, but in Java — and this is similar for other languages — you have to be aware of how the JVM decides how much CPU and how much memory it's going to use. In Java, the default heap size when you run Java in a container depends: depending on the size of the container, the JVM will take more or less memory, so it's very hard to get it right. And if you use the defaults, you are wasting money, because the default for a normal-sized container is to use just 25 percent of the container memory for the heap. That's a waste of money. This was improved in previous versions of Java, but let's skip that. The thing is, you can configure the RAM percentage you want for the heap, and you should do it, because otherwise you're running with just 25 percent of the container's memory and wasting a lot.
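A minimal sketch of how that can be set on a containerized JVM — MaxRAMPercentage and ActiveProcessorCount are standard JDK flags (the processor-count hint relates to the CPU discussion that follows); the deployment and values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app                   # illustrative
spec:
  replicas: 2
  selector:
    matchLabels: {app: java-app}
  template:
    metadata:
      labels: {app: java-app}
    spec:
      containers:
        - name: app
          image: example.com/java-app:latest   # placeholder
          env:
            - name: JAVA_TOOL_OPTIONS
              # Use most of the container's memory for the heap instead of the ~25% default,
              # and hint the JVM about how many CPUs it should assume.
              value: "-XX:MaxRAMPercentage=75.0 -XX:ActiveProcessorCount=4"
          resources:
            requests: {cpu: "4", memory: 8Gi}
            limits: {memory: 8Gi}
```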
For us, I think we are running at around 80 percent, so that's a lot of money you would otherwise be wasting. Typically you can use 75 percent in Java, unless you have things that use off-heap memory, like Elasticsearch or Spark. And it's kind of similar for other languages. In Java we also have the garbage collector, and depending on how big the container is, the JVM will pick one or another. That's also tricky: if you run some tests in a smaller container, those tests may be wrong when you run in a slightly bigger container, because Java decides that, since it has a little more CPU, it's going to switch the garbage collector implementation. You can also configure that in Java with flags — there's a garbage collector table that Microsoft released, but it depends on your use case.

The number of CPUs the JVM will see is also intriguing, because it depends on which version of Java you are running and how many CPUs you assign to the container. They changed this lately to use as many as the OS allows, because before it was calculated from cgroups CPU shares, and it was not quite right: if you set between zero and 1,023 milli-CPUs you got assigned one CPU, and if you set nothing — no limits — then it was more or less normal. There was an improvement to the JDK, and basically you can set the active processor count if you want, to hint the JVM about how many CPUs it should see.

And this relates to the requests and limits I was talking about before. Say you have a 32-CPU host — this applies not just to Java but to any process — with two JVMs, or two processes, with the same request. What is the maximum CPU they can use? If you set the limit to eight CPUs, the maximum they will use is eight each, so you're wasting 16. If you set it to 16, they can use a maximum of 16 each: if both are busy you're okay, but if one of them is not busy, you are wasting CPU. But if you don't set any limit — the question that came up before — either of the two processes can use up to the 32. One will never starve the other, but you can use the whole CPU of the node, which at the end of the day is money.

The other important bit is: how do we scale Kubernetes? How do we scale the pods, how do we scale these things, so we don't have to manually go and change things, or keep huge, humongous clusters all the time? In Kubernetes there are three types of autoscalers: the cluster autoscaler, the horizontal pod autoscaler, and the vertical pod autoscaler. The cluster autoscaler increases the number of nodes of the cluster based on CPU and memory requests — that's an important bit. And it's important to set the maximum number of nodes you want at the cluster level, because bugs can happen — and, unless you have a lot of money, you don't want to run at full capacity all the time. A typical scenario is: we get more requests, more scale-up, and then Kubernetes scales the number of nodes; and when there's no traffic, there's less CPU usage, and the number of nodes goes down.
That scale-up-and-down cycle is the typical seesaw pattern. But another interesting one happened to us here: suddenly the node count went up to 140-something, 150, and it only stopped growing because we had a limit on the number of nodes, thank god. What happened is that there was a bug in the autoscaling process; otherwise it would have just kept going up. The bug was fixed and things went back to normal. That's why it's important to have a max number of nodes — or a very big credit card to pay for your cloud spend.

The vertical pod autoscaler will increase or decrease the resources for each pod. So you can say: okay, this pod is getting a lot of requests, it would benefit from more memory, and you can define that. But until the very latest versions of Kubernetes, changing those requests meant restarting the pods — you can set it to apply automatically or only on the next start — and in-place resizing is only now an alpha (or beta, I think alpha) feature in the latest release. So that's tricky. We used it — not anymore — only for developer environments, to scale them down if unused. If you set it to automatic, you take the risk that your pods suddenly get restarted to "improve" them, and then you may get a surprise. So be very careful about what you're doing there.

The horizontal pod autoscaler adds more pods whenever you get more traffic, more CPU, more of whatever metric you decide to measure on. We scale on CPU and on HTTP requests per minute. And it's obvious when you read it now, but not always obvious to everybody: you cannot use the same metrics as for the VPA. You can have both configured at the same time, but not on the same metrics, because then you get a mix of scaling actions at the same time and it's not going to be great; it's going to be confusing.

CPU is a bit tricky — it was tricky for us — because you can have periodic tasks or startup CPU spikes; for Java I think it's very common. The pod is starting, there's a lot to do at startup, and the CPU goes crazy. If you don't configure it right, the HPA is going to say: oh, your CPU went very high, I need to start another pod. The new pod starts, its CPU also goes very high: oh, I need to start another pod. This happened to us, so you need to be careful, because you can configure a lot of things — the ramp-up time, the startup behaviour, all of that — otherwise you're just doing a denial of service on yourself. We had a bunch of these problems; we run at a scale where we can easily do a distributed denial of service on ourselves, internally. There are some horror stories there. You change one thing, but it's running across thousands of places, and suddenly — one instance is not important, but a thousand times over, it screws things up. So yeah, that spike on startup can cause a cascading effect.

So to sum it up, if you want to remember three things: Kubernetes makes it very easy to start and then optimize, so it's very easy to do a lift and shift of an existing application to Kubernetes.
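On that HPA startup-spike point, the usual place to dampen it is the `behavior` section of an `autoscaling/v2` HorizontalPodAutoscaler. This is a sketch only: the target Deployment name and every number are placeholders you would tune for your own startup profile:

```yaml
# Sketch: HPA on CPU with scale-up damping, so a startup CPU spike
# doesn't cascade into new pods that spike and trigger yet more pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-java-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-java-service          # placeholder target
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120    # wait out short startup spikes before acting
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60              # add at most two pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300
```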
Lift and shift means you just put it in a container and run it. There are some things to consider, like databases — state is the tricky part — and you can use patterns to decompose the application: sidecars, init containers, new services, operators, so you don't have to keep adding things to the monolith. And on the resource optimization part — because money seems to be important to people; if you heard Corey last night, money seems to move the world, and things are expensive in the cloud — you can tune the CPU the JVM sees, you can tune the memory, on Java the garbage collection and so on, and you can use all the autoscaling capabilities that Kubernetes provides. So, any questions?

A couple of questions. One: you talked a lot about JVM resources and managing them. Given the number of clusters you have, how much discipline did you have around the upstream base container images for these Java applications? That's one question. Second — sorry, the base container image, like the size of it.

Oh, the size of the container images themselves, right?

Right. So that's one question: did you have a strategy for using certain base container images? The second question is: you didn't talk about Kubernetes Jobs in your clusters. Did you have those scenarios as well in your ecosystem, and how did that fit in?

So, for the base images: we use some JDK images that are built internally at Adobe for the main application; for other services, whoever builds the service picks an image. For now we don't care too much about the size of the images, but there are some approaches we're looking at to improve the download time and the startup time. Obviously, caching the images at the cluster level, at the regional level, and so on. There are also some features — I know for sure Azure has this — for streaming images: if you have a very big image that really needs to be that big, you can configure it so the image is streamed, pulling the main layers you need to start up early while the rest is still being downloaded. So far we're looking at some of these methods, but we didn't give it too much thought or importance, because there were other things more in the critical path.

For the question about Jobs: yes, we run some jobs, but there's nothing special there. One of the tricky things is when you use Prometheus with jobs: because the jobs are ephemeral, you need to use the Prometheus Pushgateway to push the metrics instead of having them scraped. The same goes for the logs: make sure your container's termination grace period is a bit longer, so you have time to ship the logs — things like that. Thank you.

Yes, I'm just curious: for the amount of infrastructure and the care your Kubernetes clusters need, can you give me approximately how many people you need to maintain and operate this at this scale?

How many people? Honestly, I have no idea exactly. I mean, we have a team that builds the clusters for us and maintains the core of the cluster, so they do the upgrades, and they are doing this for the whole of Adobe, or at least for a lot of different products. I guess nowadays we'd call that a platform team, right?
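Going back to the Jobs point for a second, here's a small sketch of those two knobs: pushing metrics because Prometheus can't scrape a pod that has already exited, and leaving enough termination grace for log shipping. The Job name, image, Pushgateway address, and metric are all made up for illustration; the only assumption is that a Prometheus Pushgateway is reachable somewhere in the cluster:

```yaml
# Sketch: a short-lived Job that pushes a metric to a Prometheus Pushgateway
# and keeps a longer termination grace period so logs have time to ship.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                       # placeholder
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 60      # give the log shipper time before the pod disappears
      containers:
        - name: report
          image: curlimages/curl:latest      # illustrative image
          command: ["sh", "-c"]
          args:
            - |
              # ... the actual batch work would go here ...
              # Push a completion metric, since the pod will be gone before the next scrape:
              echo "report_last_success_timestamp_seconds $(date +%s)" \
                | curl --data-binary @- \
                  http://pushgateway.monitoring.svc:9091/metrics/job/nightly-report
```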
That team provides Kubernetes as a service internally, and it's a bunch of people who are also building new features. I mean, you have to consider that it's not just maintaining what exists; it's building new features, bringing in new things, new services, and so on. And then on top of that we run the services. The whole product is probably hundreds of people in total, but again, it's not just maintaining: it's building new features, building new stuff, providing new business value. If you run on top of a cloud provider's managed Kubernetes service — and we're only talking about purely the Kubernetes service: making sure the cluster is up, the cluster is not crashing, upgrading the clusters — you don't need a lot of people.

Yes, on the scale: we have 45 clusters ourselves, and we keep growing them. If we ran on premises there would be a lot more things to consider. We already have to go to cloud providers and say: hey, we want to run in this region, are you going to have capacity for us? Are you going to have ARM nodes? Are you going to have this, are you going to have that? Yes, we have reserved capacity too, but cloud providers sell this idea of infinite capacity, and there is something not so infinite about it. I know of a case where a cloud provider is migrating people off a region, saying: this region can no longer onboard anything, and you have to move off; we're giving you time to move off, because this region is done. Or there's the launch of a new region and they tell you: I cannot give you 100 VMs — probably because they have 100 reserved for somebody, 100 for somebody else, and they have to manage that capacity. So at some scale, that's something you also have to consider.

So it sounds like you chose namespaces to do the multi-tenancy.
I'm just wondering if any other strategies were considered.

Yeah, so namespace isolation, as I mentioned, is not total isolation. We also looked at Kata Containers. Kata Containers is an open source project where you run containers as lightweight VMs, so each pod is effectively a micro-VM with the hardware isolation that VMs have. That makes it harder for somebody to exploit their way out of the container, and so on. We have issues, for instance, if you overcommit the nodes: say you have 32 gigs of memory and you put 32 containers on it, each with a one-gig memory request, but with the limit set to 10. Now, if any of those pods go over their request — or all of them go a bit over — that node goes into a kernel out-of-memory situation, and things get weird for a while until those pods get killed and rescheduled somewhere else, while the kernel is doing its thing. So you can still have noisy-neighbor problems. The same with CPU, if you don't account for the requests and the limits correctly — CPU, memory, even disk space. (There's a small sketch of that memory anti-pattern a bit further down.)

Another thing that happened to us: there was an issue with the cleanup of old images. We were pulling a lot of different, big images, and the nodes were running out of disk space. Then they enter this loop of: I have to delete something to download something, but then that something gets evicted because there's no room for the image, and now it has to be rescheduled somewhere else. So it was a continuous loop of downloading images, filling up disks, kicking things out to other nodes that then also fill up, and because you're evicting things and downloading and scheduling new pods, you fill it up even more. So we had that issue, for instance. So on the multi-tenancy front, it's not pure multi-tenancy, right?

You were also talking quite a lot about optimization of the cluster, and I was wondering: have you looked into de-scheduling of workloads to improve the bin packing of the nodes? Currently, when Kubernetes decides where to run a pod, that pod is tied to the node, and over the lifetime of, say, deployments, your clusters can become less optimized. I know there were some projects out there that did de-scheduling optimizations, but I've never seen them go anywhere. At my previous employer, for example, we wrote our own controller to try to de-schedule workloads onto spot VMs. Are these the kinds of things you're looking into at Adobe as well?

Yeah, we use the descheduler. The descheduler is another controller, an operator, that looks at how things are placed in the cluster and kicks pods out of nodes. For example, if you want to scale down the cluster, maybe you can't, because the remaining pods are sitting on nodes that are only a little busy; the descheduler will go and say, okay, these pods that are on nodes that are not busy — out, so that node can be deleted.
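Going back to the overcommit example from the noisy-neighbor discussion a moment ago, this is roughly what the anti-pattern looks like in a pod spec. The names are placeholders; the point is only the request-to-limit ratio: the scheduler packs 32 of these onto a 32 GiB node based on the 1 GiB requests, while the 10 GiB limits let a handful of bursting pods push the node into a kernel OOM:

```yaml
# Sketch of the overcommit anti-pattern described above: request 1Gi, limit 10Gi.
# 32 of these fit a 32Gi node by request, but any burst over-subscribes memory.
apiVersion: v1
kind: Pod
metadata:
  name: overcommitted-pod                              # placeholder
spec:
  containers:
    - name: app
      image: registry.example.com/some-service:latest  # placeholder
      resources:
        requests:
          cpu: 250m
          memory: 1Gi
        limits:
          memory: 10Gi      # 10x the request: a few bursty pods can OOM the whole node
```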
Another interesting case is when you have a multi-availability-zone spread. Say you want to run three pods of this service, with one of them in each availability zone, and Kubernetes will go and schedule them — well, let's say six pods, that makes the example better. So Kubernetes will put two in each availability zone. Okay, now suddenly the HPA scales things, or a node needs to go away because your cluster is scaling down or whatever, and the two pods in one of the availability zones go away, for whatever reason; this can happen. Pure Kubernetes is not going to care about that. You say, oh, I need to scale down, I'm going to take this node and this node — and these two pods are on that node, so they're gone. You end up with two pods in one availability zone, two in another, and nothing in the third. The descheduler can help there: you can configure the descheduler to correct the spread across availability zones, so it will kick pods out of the crowded zones to make sure they end up spread out correctly, because by default Kubernetes is not going to do it. If you have the HPA and availability zones, the HPA sends the number of pods up and down, but it also doesn't care about spread after the pods are scheduled. There's nothing, really, that cares about spread after the pods are scheduled; you have to rely on the descheduler to kick things out if they're not placed correctly. (There's a rough sketch of that descheduler policy a bit further down.)

No, it doesn't clean up the pods; it's just that you end up with a spread that is not what you want. If, for whatever reason, you have six pods running across three availability zones and the HPA says, oh, you don't need six, you only need four, it may decide to kill two in the same availability zone — there's no real logic about which ones to kill. So you may end up with four pods across two availability zones, or all of them in one. Things like this happen, and you need the descheduler to go there and say: the spread is wrong, I want to kick things out. Kubernetes only looks at which availability zone to put a pod in when it schedules it. Yes, it will put it back in and correct it. With basic Kubernetes, once you schedule a pod, that pod is going to stay there until something else kills it, like the HPA, or the node going down, or anything like that.

Is a namespace across availability zones in your configuration?

A namespace is a virtual thing; you don't care whether it's across availability zones or not. A deployment or a stateful set can be across availability zones; you have to configure the pod topology spread, as it's called, in the way that you want. So you can say: I want a minimum of one pod in each availability zone, or I want no more than two of difference — the difference, the skew, is what you configure. So if you have one pod in one availability zone, don't have more than three in the other ones.
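Here's roughly what that looks like in a pod template, together with the ARM-if-available scheduling preference mentioned earlier. It assumes your nodes carry the standard `topology.kubernetes.io/zone` and `kubernetes.io/arch` labels (most managed clusters set these, but that's an assumption); the names and label values are placeholders:

```yaml
# Sketch: spread replicas across zones with at most one pod of skew,
# and prefer ARM nodes when they exist, falling back to amd64.
apiVersion: v1
kind: Pod
metadata:
  name: spread-example                     # placeholder
  labels:
    app: spread-example
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule     # or ScheduleAnyway for a soft constraint
      labelSelector:
        matchLabels:
          app: spread-example
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["arm64"]          # prefer ARM, but stay schedulable on amd64
  containers:
    - name: app
      image: registry.example.com/multiarch-service:latest   # placeholder multi-arch image
```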
It's a bit complicated, but that's what you configure with pod topology spread constraints, and that's what Kubernetes looks at when it schedules — the pod topology — to decide which nodes to use. And this is all done with labels on the nodes. I don't think Kubernetes is really aware of the concept of an availability zone; you just set labels on the nodes. For instance, for ARM we are building multi-architecture images that can run on both ARM and Intel, and in the pod spec we say: schedule this on ARM if available, and if not, schedule it on Intel — because otherwise it might not be schedulable at all. Essentially, that requires your nodes to have the labels, the selectors, everything configured correctly.

So it seems like you ran into a few issues when migrating to Kubernetes. One obscure one I heard about in another talk was that in clusters with hundreds to thousands of nodes, applying a deployment can be inconsistent due to etcd not being scalable enough. I'm wondering if you ever ran into anything like that.

We also have that issue, not because of the number of nodes but because of the number of objects. etcd has a limit of eight gigabytes; when you get close to that limit it becomes read-only, you cannot do anything, and your cluster is pretty much gone — you have to recover it from a backup. That would be the worst-case scenario: you have too many objects and you fill it up. Another thing that can happen is that you have so many objects that the Kubernetes API takes a long time to respond. The watchers in Kubernetes also consume memory: every time you mount a secret or a config map, by default that's a watcher, and those watchers consume memory on the API server. If you have a few, it's okay; if you have thousands, that's a problem, and it limits how much you can scale the cluster, because now every API request can take longer and longer, and if it goes over one minute, the API cancels the request. Once your API calls get close to a minute, you are in big trouble, because nothing will work: the autoscaler is not going to work, the scheduling is not going to work, the liveness probes are not going to work, things like that. The ingress controller is watching the API to know where to send traffic, so your external traffic is not going to work either. So you have to be careful about that too: watch the number of watchers, the amount of memory the API server is using, and the number of objects you have in the cluster.

So, first advice: don't use etcd as a database. Don't use the Kubernetes API as a database; don't use Kubernetes objects as a database. That's the first one, because it's very convenient — oh, I'm just going to store things here in this secret, in this config map. Don't do that. And then, if you are at a scale where you have thousands of secrets, thousands of config maps and other objects, try to refactor it, or limit the size of the clusters; that's the only way out. We are not growing the clusters much bigger, because we have these scaling problems.
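One more sketch, going back to the descheduler spread correction mentioned a moment ago. The strategy name comes from the descheduler project; the `v1alpha1` policy format shown here is the older one (newer descheduler releases use a profiles-based `v1alpha2` format), so treat the exact shape as an assumption and check the version you actually deploy:

```yaml
# Sketch: descheduler policy that evicts pods violating their topology spread
# constraints, so replicas get rebalanced across availability zones over time.
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
```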
So, because of those limits, we're trying to go now with more, smaller clusters.

What about using Postgres as the backend?

There's no — the only really supported storage backend for Kubernetes is etcd. And yeah, k3s is not something you would want to run in production for these kinds of loads. I mean, if your problem is "I'm storing too much data in config maps" or something like that, then yes, use a database to store that data; don't use the Kubernetes API. There's no reason you should be storing gigabytes of data on the API. There are also some small tricks: you can lower the defaults in Kubernetes. The default number of old replica sets kept in the revision history is 10; you can say, don't keep 10, just keep three. Helm is the same: the release history is 10 by default; don't keep 10, just keep three. Small tricks like that.

Okay, all right. Cool. Well, thank you, Carlos. Thank you. Now you can go enjoy the rest of your day and night.

Just ask away. How many came in total? I will find out, I'll let you know — I'll definitely let you know. For the call for papers, I can tell you, because I was part of the committee that reviewed all the submissions: we had over 400 CFPs. That's a lot. But then some dropped off because they also got into KubeCon Paris, so then they started... Oh, cool. Good luck. You're flying out of LAX? Great. I am based out of San Diego, down south.