Testing. Okay, I've got 11:30. Welcome to the opening session for Saturday on the cloud native track. This is one of two parallel tracks related to cloud native here at SCaLE this year, but we're going to add an additional factor here of edge native, and our speaker is going to explain that to us along with a number of other things. I've known Frédéric for a number of years and recently bought his book, which just came out in December, and highly recommend it. I won't make him plug it himself, but this is still a legit testimonial that this is a really good book, and he has pretty deep knowledge on the subject of edge and cloud native. He'll bring it up in a slide. ("Yeah, I have a slide on that, so you will see it in full version.") And he represents the Eclipse Foundation, including some huge number of edge native open source projects that he'll tell us about. So with that said, I'll turn it over to you.

Thank you very much, Steve, I really appreciate it, and welcome everyone to my talk. It's really great to have you here. Of course, the track lead is a really courageous person, because yes, you're here for cloud native, and yet we are discussing edge native here. As you will understand, you can only define edge native in contrast to cloud native, and this is why this is highly relevant. In any case, if you build cloud native applications, it's very probable that at some point you will need to interact with, so to speak, quote unquote, the real world, and edge native applications are those applications deployed in our daily lives, in our pockets, on our phones or whatever, that interact with cloud native ones. So of course they are part of any serious or enterprise-grade solution that you could think about. Now, before I jump into this topic, and Steve mentioned this briefly, I'm with the Eclipse Foundation.
So please raise your hand if you knew that the Eclipse Foundation is something other than the Eclipse IDE. Just one, two. Okay, so now you will have to pledge to me. Stand up, hand on heart, and pledge that it's your new mission in life to tell people that Eclipse is more than the damn IDE. Feel free to love the IDE or hate it; we still have six million users, so it's not a trivial open source project, but we have been doing much more than that for a very long time. In fact, our IoT working group at the Eclipse Foundation has been around since 2011. And yet we've done a bad job, because you don't know about it. I'm responsible at the Eclipse Foundation for everything about IoT and edge, and when I say responsible, this is about two things. First, keeping an eye on 50-plus open source projects in that space, which is quite impressive: we are probably the second most important edge native open source community that you can find around. And then part of my job is to be out and about evangelizing this tech; my book is part of that, and of course my presence at SCaLE is part of that. So, about me: I've been around in IT for quite some time now. I've worked for large organizations and some smaller ones, and you can find me on Twitter, at least if it's not down this morning, or if Elon doesn't decide to ban me on a whim or something, and you can find me on LinkedIn as well, so feel free to connect.
And yeah, I just published a book. The point of the book was not to get rich, because with the number of hours I invested in it, the hourly rate is certainly horrible; this was something done for our community. The title is very long, not my fault, the editors wanted something with Eclipse in the name. Anyway, the point is to give you a thorough understanding of the IoT and edge ecosystem at Eclipse, but not just at Eclipse: we have LF Edge projects and Apache projects mentioned in there as well. So of course, if you love this talk, please feel free to order a copy; if you hate this talk, please order 10 copies and destroy them so that nobody else will be exposed to my ramblings. We have a booth here at the conference as well, so feel free to visit us. When I say us, it won't be me, because unfortunately I have two talks today, and after that I'm headed to LAX and back home, and then after six hours at home I'm headed to Germany for another conference, so I cannot stick around this time, unfortunately. But we have community members there, and when I say community members, you see them on the picture: they are committers and project leads from some of our open source projects. One is Eclipse ioFog, which is container orchestration at the edge, and the other is Eclipse Hara, which is a client for our Eclipse hawkBit platform; hawkBit is a platform that you can use to push software updates to microcontrollers or any type of device at the edge, so it's relevant in this context of edge native this morning. They are very knowledgeable, and I think Steve can attest to that, very interesting people to talk to. So although you will miss me, they are waiting for you, so please feel free to drop by after the talk.

All right, so our agenda today is really four simple points. First, we will think about the difference between the edge and the cloud in broad terms, and after that we'll jump right into trying to define edge native
applications, and have a look at some of the possible runtimes you could use at the edge. Finally, we'll have a look at what we call EdgeOps at the Eclipse Foundation, which is, in our opinion, the right way to do edge computing.

So, edge versus cloud. Let's start by defining the cloud, and if you are following this cloud native track, you probably have an idea of this, so I will be brief there. When we try to define cloud native environments, we are talking about on-demand availability of resources, and this is about three things. First, it's a homogeneous environment. You will tell me, hey, I have all of those different instance types on my favorite cloud. Yes, that's true, but they are all similar: you can order a hundred of them, two thousand of them, start them, and they will all be similar in their properties and the way they behave. Then, of course, the cloud is a large-scale environment. Your particular application may be small scale, no problem there, but your cloud provider operates thousands, hundreds of thousands of servers, and you can tap into those resources in a nearly limitless fashion. I say nearly limitless because in a previous job, when I was at Pivotal, I had a partner who completely exhausted a specific instance type in a specific geographic zone on AWS. Congrats, Hans, if you ever see this talk; that was quite an exploit. Amazon couldn't provision any more of those instances because they were using so many of them, so we had to implement changes in our software platform at Pivotal to accommodate those kinds of weird situations. And of course the cloud is centralized. You will tell me, hey, come on, I have availability zones and I can spread my infrastructure over the whole planet if I want. Yes, but you manage this through a central console, right? You connect to whatever, and that whatever will do the provisioning for you in various locations. And the
thing is, when we do cloud native we care about geography up to a point, because we want load balancing across continents, for example, or things like that. However, you don't care whether the actual server is in Berlin or Paris or Dakar or wherever, right? You don't even care whether the actual server is on one side of the street or the other. In edge native, you care about that, and we'll see why.

So what is the edge? The edge is literally the opposite of our cloud, in the sense that the edge is about resources that are anywhere and everywhere. Your phones in your pockets probably already are, depending on what apps you have installed; you can consider them edge devices. The edge is fundamentally distributed. It operates at a small scale, in the sense that yes, I can have thousands of nodes spread over physical space, all over the US, all over the globe, but each of them has a limited amount of resources to offer to applications. And those nodes will typically be heterogeneous. By that I mean you will find Arm chipsets, RISC-V chipsets, exotic things you've never heard about, and that's the whole point: you use hardware which is tailored to your specific use case, not run-of-the-mill generic hardware that can run any type of application. Sometimes you will have AI accelerators; sometimes you will have I/O that is specific to the machines you are driving, like Modbus, CAN bus, BACnet, whatever: lots of technologies that interact with the real physical world, real machines, or the HVAC for this building. Edge is all about being as close as possible to what you are trying to do in the real world. For a more formal definition, edge computing is essentially bringing compute, networking, and storage as close as possible to the source of your data. In some cases you will maintain the elasticity and consumption-based pricing model of the cloud, if you're dealing with, let's say, large-scale edge
operators. But of course, if you design and deploy your own edge infrastructure, it is up to you to implement that level of elasticity. So this is really about putting code as close as possible to the real action, the real world. What can it do for you? Many things, but I put my favorite four on this particular slide. First, you will do edge computing because you want to reduce latency, and when I say reduce latency, this is for mission-critical types of applications. If you are running a nuclear power plant, you don't want to implement the AI in the cloud and wire it in directly, so that to decide "do I need to do something about that valve?" you wait, you go over the public internet to the cloud, you get some processing, the command goes back, and oh, the whole plant has exploded already, right? Latency in those cases is really critical. Think about connected vehicles; think about a factory where you have robots doing precision welds on car frames. You need millisecond-scale latency in some of those cases in order to do the right thing at the right time, and this is why you literally deploy compute as close as possible to the action, so that you can make local decisions that make sense in a predictable time frame. Already on the local network it's sometimes hard to control latency, so involving the internet in all of that makes things completely random. Then, of course, you will use edge computing to save on bandwidth. Video analytics is a popular use case, and typically you will do video analytics with AI directly on the cameras you are deploying, not in the cloud. Why? A single full HD feed from one camera is, depending on compression, roughly two to three gigabytes an hour worth of bandwidth if you transmit in real time to the cloud. So
multiply that by, I don't know, thousands of cameras for the whole city of Los Angeles, and that's a lot of bandwidth. AT&T will love you, or Verizon or whoever, but of course your boss will hate you. So you do local processing, because you only want to transmit that video feed in real time to the cloud when, let's say, there's an incident where you need to record it for safekeeping or regulatory purposes; the rest of the time you just process locally. And that's how you save on bandwidth. Of course, edge computing also helps you build more resilient applications. If I am building a smart convention center, with sensors in every room to control HVAC, and with AI that anticipates occupancy in the rooms, so that when there's no one we don't cool or heat a room as much, and when we anticipate people will come for a specific event, half an hour before we start cooling or heating depending on the weather, I can do that, that's edge computing, but I want to do it locally and not necessarily connected to the cloud. That way, if the network is down, you still have proper temperature in the rooms, because people will come anyway, and they don't care whether your network or the internet is down. So edge is about resiliency, and that's especially important, once again, in cases where you have real-time, mission-critical requirements. And finally there's the whole topic of data sovereignty. I work in IoT and edge, so you would think I have the smartest home on the block in my hometown. Not at all, I don't have anything, for two reasons. One, there is no commercial solution that I like, especially none that is open enough that I can check its quality under the covers. Two, I have the skills to implement my own, but I don't have the time, and then there's my wife nagging me about the budget as well. All of that to say that I have a very dumb home, because I care about my
data. I don't want my data to end up in the hands of some actors that I will not name, actors that will then bombard me with ads and ads and ads. To that effect, I want my data to be local. So if I'm implementing a smart home, I will make sure to do it in the edge fashion, so that my precious personal data is processed on premises and not sent to the cloud, and if I need to interact with the cloud, it's with the least possible amount of data about myself. That's really important. And you will tell me, okay, that's your smart home, I don't care about that. Well, you do care about it in regulated industries: defense, and health care particularly. You must make sure that patient data doesn't leave that specific hospital, or that specific state, or even the US, let's say, from a national perspective, and you even need to block access if the doctor is on vacation in Japan, for example. Things like that are easier to implement when you do edge computing, because you can guarantee that the data is in a specific physical location and that you control access to it from remote parties.

So what are edge native applications? This is really the main part of the talk. First, they have some things in common with cloud native applications: they typically rely on microservices and expose RESTful APIs as well. Not only RESTful, because at the edge there are protocols that are more appropriate for low-bandwidth environments, for battery-operated devices, and things like that, so you will favor those protocols over REST, and even over TCP/IP, because in some cases just doing that is a drain on your battery. But on the edge node itself, the point where the battery-operated devices send their data, that particular node will interact with the rest of the world through
RESTful APIs and things like that. You want the microservices that you deploy there to be loosely coupled: one, that makes them more reusable, but two, they don't have dependencies on external parties, and they can continue functioning if part of the infrastructure is down. So maybe you will accumulate records for a number of hours or something like that, and then push them once the network is back, or once the other service is back. And in some cases you will selectively drop data, because if I'm doing a smart convention center, it's not a problem if I don't measure temperature every second or every ten seconds; every minute or every five minutes is good enough to give a good experience to you in the room. All of that to say that you implement those services in a way where you assume that the rest of the infrastructure can vanish at any time. Then, cloud native applications are built by teams typically leveraging a DevOps approach, with continuous integration, continuous delivery, and all of that. Very good. However, if you do that at the edge, you're in for a world of pain, and we'll see how you should do things differently from that perspective. But before we get to that, it's important to consider the specific characteristics of edge native apps. First, an edge native application is deployed in the real world: in a factory, or over a whole city with cameras and sensors and things like that. So the last thing you want is to refresh all of that infrastructure every six months or every two years. If I build a smart convention center, I don't want to have to open those walls two years from now to change everything that's in there, right? I'm building this for the long haul, and especially in the industrial world. How often do you think that Ford, let's say, replaces their equipment? (No, Tesla doesn't count; they would change things on a whim.)
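The accumulate-then-push and selectively-drop behavior described above can be sketched roughly in Python. This is an illustration only, not code from any Eclipse project: the `StoreAndForwardSender` class, its one-reading-per-interval thinning policy, and the `send` callback contract are all invented for this example.

```python
import time
from collections import deque

class StoreAndForwardSender:
    """Buffer readings while the uplink is down; flush when it returns.

    While offline, readings are thinned to at most one per
    `offline_interval` seconds -- in the smart-convention-center example,
    losing intermediate temperature samples is acceptable."""

    def __init__(self, send, offline_interval=60.0, max_buffer=10_000):
        self._send = send                        # callable(reading) -> bool; True means delivered
        self._offline_interval = offline_interval
        self._buffer = deque(maxlen=max_buffer)  # oldest readings dropped first when full
        self._last_buffered_at = None

    def submit(self, reading, now=None):
        """Deliver a reading, or buffer/drop it if the uplink is down."""
        now = time.time() if now is None else now
        self.flush()                             # drain any backlog first, oldest first
        if not self._buffer and self._send(reading):
            return "sent"
        # Uplink is down: keep at most one reading per offline_interval.
        if (self._last_buffered_at is None
                or now - self._last_buffered_at >= self._offline_interval):
            self._buffer.append(reading)
            self._last_buffered_at = now
            return "buffered"
        return "dropped"

    def flush(self):
        """Push buffered readings out until one fails or the buffer is empty."""
        while self._buffer and self._send(self._buffer[0]):
            self._buffer.popleft()
        if not self._buffer:
            self._last_buffered_at = None
```

The key design point, per the talk, is that the service keeps working when "the rest of the infrastructure vanishes": delivery failure is a normal code path, not an exception, and ordering is preserved because new readings are never sent while older ones are still queued.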
A traditional industrial company will invest in their equipment for a very long time: every 20 years, every 30 years or more in some cases. You see pictures on the internet from time to time: hey, my favorite machine in my factory is marked "Made in West Germany". So that's the world we live in, and this means you need to design for the long haul and think about maintainability over an extended period of time. Then, edge native is about heterogeneity. As I told you, you pick the right hardware specifically for the job. You don't just deploy random white-box pieces and hope that things will work; you need industrial enclosures with protection against dust, water ingress, or vibrations, for example. So you are putting specialized hardware in special places all over the place, and once again, some of those places can be very remote. If you have an edge node on every wind turbine in the US, of course you don't want to have to climb up there just to deploy a patch. Edge is also about constraints. We have constraints in the cloud as well, in the sense that if you do everything without constraints in the cloud, you get very, very large invoices, so you need to keep an eye on things. But at the edge you have constraints because of the physical environment: electromagnetic interference, vibration, dust, but also heat, and not only external heat but the heat that the equipment itself will generate. Often you will favor passive cooling rather than active, because fans and things like that have a tendency to fail, and once again you want this to be as reliable as possible. In any case, if you don't manage this correctly at the edge, you are shortening the life expectancy of the equipment, and you want it to last for the long haul. So you need to make sure that you take all of those physical constraints and
other constraints into account, and this means that even if you have a fairly powerful CPU and some memory and all of that, you need to optimize early for size and efficiency. You don't want your edge node to overheat because you are deploying badly optimized code, and you want everything to be as economical on resources as possible, because if the node is there for 10 years: oh yes, I have extra memory, I have extra cores and things like that. Yeah, until you don't, and then you are stuck and you need to optimize at that point. So optimizing early is better from that perspective. And finally, there's the whole networking thing. Edge native applications are distributed applications, which means the network is at the core of everything, but you have to assume that the network will degrade at any time, that the network will disappear at any time. If you don't, you are building applications that will fail in the field, and that has consequences. If we lose the AI driving the HVAC system in this room, it's not that bad; if we lose the AI driving your car, that's quite bad. And then there are the cases of factories, power plants, pipelines: you can kill people if you don't design your application the right way. So you have to assume the most extreme possible consequences, the most extreme possible outages, and design accordingly. This means that edge computing sits at the border between information technology and operational technology. Operational technology is industrial equipment, literally, and typically that equipment is specially built for the use case, controls critical infrastructure, and is not updated frequently, as I said. Of course you will want to deploy software patches and things like that, but even there you need to proceed carefully, and one particular dimension is that if you stop the infrastructure, there are probably financial losses at
stake. If I stop my factory for half an afternoon, I'm losing 10 million dollars in production value, or something like that. That's why OT people are obsessed with reliability, stability, and predictability in software updates. They make changes incrementally and very carefully, because if they bork something, the phones will start ringing fast. IT, on the other hand, is pretty much an off-the-shelf thing, right? You change your laptop every three years; you don't care about it specifically; everything is in the cloud; you can replace those things very easily. And of course cloud native is about this little icon at the top of my browser: reload the page, there are new features. Cool, that's really cool, but that's very different in an industrial setting. So you see how different those environments are, and edge sits between the two of them: it's fairly close to operational technology, but at the same time you want the kind of agility and DevOps-inspired approach to software development that keeps things moving fast. Typically, edge native applications are built around three specific planes: the data plane, where your software components are deployed; the control plane, where essentially you control the infrastructure and do your real-time monitoring; and the management plane, which manages everything else and performs device configuration. Device configuration is really important, because you will have approaches like zero touch and zero trust. Zero trust means that you don't trust any device that you put into the field; everything is suspect, so to speak. And zero touch means that certificates and other credentials need to be deployed on the devices, but you shouldn't deploy those through manual intervention; you should deploy them through proper automated systems that will do that for you. That's the zero touch approach. Literally, you plug the
device into the wall, it contacts some kind of provisioning server, gets its credentials from there, and gets validated. This also includes the dimension of what we call device attestation: you want to be able, at any time, to be completely certain of the trustworthiness of the device and every bit of software on it, and that includes the OS, its kernel, the drivers, the applications that you put on it, and so on. And that involves interesting things, let's say, from a deployment perspective.

All right, so what are edge native applications? First, they are optimized for field use. Since you are running on constrained hardware, you need to optimize, as I said earlier, for size and power. Of course, if you use a proper edge computing platform, the platform can help you out with some of that, but even then you need to think carefully about your own code, because sometimes the fastest way to produce a result will consume too much power, so you will settle on the second-best alternative, which is more efficient. Efficiency is a word that you need to take into account a lot in the context of edge native apps. Then, they are resilient, because you assume that anything and everything can fail at any time, and you need to be obsessed with that, once again, because if not, you can kill people or lose production value. You want your platform to take that in charge for you, or part of it, rather than coding everything on your own: one, this is tricky code to write, and two, you want uniform behavior for the whole system, and that's why you need a proper edge computing platform. The third is that edge native apps are adapted to mobility, and that doesn't just mean that you connect over LTE or 5G or things like that. Yes, you will probably use that kind of network, or some other kind of wireless technology, to achieve your goals, but many edge native
applications are deployed on moving vehicles: locomotives, cars, buses, planes, whatever. There you need to be location aware, because sometimes if you cross from one country to another, or one state to another, maybe your data gathering requirements or the regulatory environment change, and the behavior of your app needs to change. But not only that: you also want to optimize the usage and cost of your bandwidth, and that's why you will rely on location-based routing, trying to find, in the local environment, the most cost-effective way to send your data out and receive commands, for example. Of course, edge native applications are orchestrated, but this is not just about containers. It's mostly about containers, but I've seen VMs, serverless functions, and even straight binaries on microcontrollers, and this means your orchestration is a different thing when it's not just about containers. You need platforms that can manage this whole set of options rather than just Kubernetes. That doesn't mean that Kubernetes is not part of the solution, but you need to think more broadly, about platforms that can address a wider set of use cases. And of course, everything that you deploy needs to scale up or down according to the situation, which means that if nodes in the vicinity of a specific node are missing, maybe you will start additional service instances to take over part of the workload that was being processed by the missing nodes, or maybe you will scale down to the very minimum because there is no activity at night. This is really important to optimize your usage of resources. Then there's the zero trust model that I told you about: systematic authentication and authorization, with time-bound limitations on the scope and duration of the access that you grant. And the other dimension is that your data is encrypted, not only on the
wire, so that's TLS or whatever encryption you are using, but at rest as well. You need to encrypt every drive and have a TPM on your device and things like that, and that's painful, but yes, you need to do it, because don't forget, those are devices in the real world that anyone could steal, and then steal your precious data or your customers' data. You need to be obsessed with security to do edge the right way. And finally, zero touch onboarding: as I told you, you want to eliminate manual steps in the provisioning process as much as possible. This means you need beefy platforms that will deploy credentials to new devices automatically from a central location as soon as a new device is on the network.

Now, a few questions to ask yourself if you start designing an edge native application or solution. First, how predictable should the latency be in your system? In the case of the smart convention center, it's fine if it takes five more seconds to change the temperature in this room; not a big deal. If it takes five more seconds before turning a corner in my car, that's a pretty big deal. So you have to consider the use case, and don't forget that mission-critical systems, and by mission critical I mean there are human lives at stake, have real-time requirements, which means you need very predictable latency in order to make decisions and execute on them. Then, can you afford to lose data? Once again, in the case of the smart convention center, yes: it doesn't give me anything to report the temperature every microsecond; every minute is fine, and even if I skip one reading, okay. But in other industries, especially regulated industries, healthcare, defense, things like that, you need to keep a history of every bit that was recorded, in historian databases, and you even need to manage, from a
corporate perspective, the lifecycle of those databases, to ensure that you have full traceability on everything. That's a different ballgame, right? And speaking of data, you should ask yourself how stateful your application really is, and whether your instances on each edge node are unique or not, in the sense that if one goes away, is that a big deal? Do I absolutely need to retrieve the data that is stored on it or not? Once again, smart convention center: not so much. But if we are talking about defense, for example, where you need to understand in real time whether the drone pilot was really the one that fired the shot that killed that person on the battlefield, well, you need to record everything, and you absolutely need the edge node that's coordinating those drones in the field to be extremely reliable. How constrained are your nodes and infrastructure? Typically there is little to no elasticity, but sometimes you have the luxury of an extra core or an extra 8 gigabytes of RAM or something, and this will, up to a point, influence the design of what you're doing. But I would argue that it's better to be really, really careful, especially if you think about a very long time span: you have no idea what your boss will ask you to do five years from now. And then, how far is your control plane from the edge? By far, I mean: maybe you are orchestrating everything from a field data center, so to speak, some little thing that you are leasing from a network provider, or maybe you have a deal with your wireless network operator and they offer service deployment at the tower or something like that, so that's fairly close to where the data is. But if you are putting stuff in the cloud, then your control plane is a bit farther from your edge nodes, and once again this has an influence on the overall design of the solution.

Now let's have a look at potential runtimes at the
edge, the runtimes that you would use to deploy your applications. You would think that maybe all I need is Kubernetes; Kubernetes is the answer to everything these days. But if you have ever tried to run stateful applications, databases, load balancing on Kubernetes and things like that, you know this is not a good idea. Kubernetes is the ultimate way to massively scale stateless applications. If you are trying to do stateful things, Kubernetes may be part of the solution, but if it is the core, you will hurt; at least that's my personal experience. Now, don't forget that there's a wide assortment of workloads that you will run at the edge. AI comes first in most conversations because it's very trendy and all of that. These numbers are from our last developer survey, which we did in 2022; we do one every year on IoT and edge topics at the Eclipse Foundation, and by the way, all of those surveys are published under Creative Commons, so please feel free to go download the reports and put the slides in your own decks. We do that on behalf of the open source community, and my deck, likewise, is shareable with attribution, so please feel free to do so. In the last edition of our survey, AI was the top edge computing workload that people told us they are working on, but there is also control logic, data exchange, data analytics, and many others, and you see that all of those categories in our survey grew between 2021 and 2022. So yes, AI is pretty hot, but people are increasingly using edge computing for many other things than that, and that of course influences the kind of runtime you want to deploy at the edge as well. Then we asked, and this is the proof point: are you doing containers or something else? Once again, you see that every category grew. People are, of course, mostly using or
mainly using containers, and that's fine; there's nothing wrong about containers at the edge. But especially if you have real-time requirements: already the Linux kernel is not, well, if you talk to a hardcore real-time engineer and you tell them "Linux is real time, I installed the PREEMPT_RT patches and all of that," no, that won't pass. I mean, it's getting close to hard real time, but it's not there yet. And this means, of course, that if the Linux kernel is not enough for real time, and then you add Docker or Kubernetes on top of that, you're not real time at all. So serious real time means that you typically cannot use containers, at least in the current state of the art in the market. Maybe ten years from now we'll find a solution to that, but fully predictable latencies for real-time applications inside containers, for the time being, such a thing doesn't exist. So that's why you will find virtual machine images, native binaries, scripts, whatever, at the edge as well. You need to use the right tool for the job. And of course you can run Kubernetes at the edge; full-fat Kubernetes is a solution if you can spend the money on a ton of servers, but there are specialized versions, or cut-down versions, that you can deploy at the edge. And this is not a comprehensive overview; there are many, many others. I put K3s and KubeEdge here because they are representative of two different approaches to doing that. So K3s slims down Kubernetes by putting everything in a single binary, and it replaces etcd, I wonder why, anyway, they replace etcd with an embedded SQLite database, because that one is much more reliable, especially for atomic transactions in a high-throughput kind of environment on embedded devices. So K3s is a single binary, and it supports the plain Kubernetes API, and I think the change set between K3s and mainstream Kubernetes is maybe a few thousand lines of code, so it's pretty close
to mainstream, or mainline, Kubernetes from that perspective. But there are other solutions, like MicroK8s from Canonical and Minishift from Red Hat, so you have alternatives there, and while there are differences, the philosophical approach is the same: slim down mainstream Kubernetes. And then there's the KubeEdge approach. KubeEdge is an LF Edge project at the Linux Foundation, and in their case they developed custom components, CloudCore and EdgeCore and things like that, that you see on the diagram. I'm not doing a deep dive on that, but essentially the approach is to be API compatible with Kubernetes, and then they have edge-optimized components behind the scenes that implement the actual container orchestration. So that's a different approach. Whether it's good for you or not, well, download it and test it; it's open source. But the field is much wider than that, and this is, once again, not a comprehensive list, just some of the most popular options. You will see four Eclipse projects there, and you see that some of those projects integrate with Kubernetes, some of them don't; some of them can run in standalone mode at the edge, some of them require a Kubernetes control plane somewhere. So you have to figure out what your architecture is and really pick the right solution for your project. But please have a look at those Eclipse projects, especially Eclipse ioFog, since we have our friend Kilton from Edgeworx here; that's a pretty comprehensive and compelling container orchestration platform for the edge. But we have other things at the Eclipse Foundation as well in that domain, which brings us to EdgeOps, which is our little philosophy for the edge at the Eclipse Foundation. And when I say "we," it's really about the community. As a staff member, I don't have authority over our open source projects or anything like that; I'm just a staff member serving them and being the vendor-neutral voice
of reason, maybe, in various settings. So when I say "we," this is literally our community; I participated in that, I was involved in writing some of the stuff, but ultimately I'm going where the community brings me, and not the other way around. So if you want the full history about EdgeOps, we have a white paper for that; you can scan the QR code there or access the link and you will get a copy. But essentially, EdgeOps is about making DevOps suitable for the edge, and what that means is, essentially, we keep the core principles: the short life cycle, the collaboration between developers and ops, whether they are one single team or distinct teams that work better together, CI/CD, microservices, infrastructure as code. All of that is fairly good, except maybe for CD, continuous deployment. In most use cases you don't want to do continuous deployment at the edge. Can you imagine: everyone is stuck on the interstate at 4:30 p.m. on a Friday afternoon, "let's patch all of those cars"? No, you don't patch the cars then, you don't patch the connected roadway infrastructure; you wait until night, right? So, continuous integration up to a point, depending on how mission critical your application is, and even there, in some regulated industries you need to qualify your code and things like that and pass through regulatory approval for patches, so it's a different ballgame, as I said. But anyway, the core principles of DevOps are still there; you need to tweak them, adapt them to the challenges that edge native helps you resolve, to the characteristics of edge native solutions that I covered earlier in this talk, and of course to the fact that you have a bit more diversity than in the typical cloud native environment, in the sense that, yes, containers are out there, but there are also VMs, scripts, serverless functions, and many other things, and straight binaries to microcontrollers, that you need to manage, and all of that requires a bit of a different approach. And in the end, my
message to you is that there is no single community, no single place, where you will find everything that you need to build edge native applications. You need to shop around, so to speak, and this is what is great about our open source ecosystem: if we all work together, with our strengths and weaknesses, we can in the end cover the whole edge-to-cloud continuum and the whole development-to-operations continuum. Now, on this particular diagram, which is extracted from the white paper I told you about, the logos in color are Eclipse projects; the rest are living elsewhere. Not that they are less good or anything like that; we just wanted to contrast what lives at Eclipse versus the rest of the ecosystem. And we work with LF Edge: they are a member with us, we are members with them, and things like that. Our strengths at Eclipse are really in developer-oriented platforms and protocol implementations; we are not so much into infrastructure, although ioFog, for example, is closer to infrastructure, or to operations. But pretty much we integrate with Kubernetes, we care about what's happening at LF Edge and elsewhere in the LF ecosystem, but we focus on our strengths, which is code-first, developer-first types of platforms. And so, if you are happy about this talk and want a written version, I just published, a few days ago on opensource.com, an article about that, and you have the link there, so please feel free to refer to that; you will have a longer-form version of everything, or if you hate my voice and want to read me, that's fine. And so, at this point, thank you very much for attending my talk, and we have plenty of time, I guess, for questions, so of course I'm happy to answer any question that you may have. Thank you. If anybody's got a question, raise your hand and I'll bring you the microphone so it gets on the recording. Thank you for speaking today. I haven't been following edge and IoT; I remember five years ago or so it was an
emerging field, and a number of companies were jointly trying to ensure that secure communication from the cloud out to whatever edge point was being done. My question is, hearing that 5G networks are going to have a lot of edge computing resources, has a standard been developed for mobile edge networks that's different from this, or are they using some of the open source projects that you've described today? Well, from a wireless networking perspective, there are many, many options in the market, and of course, yes, 5G is part of that, but you have things like LTE-M and NB-IoT, and then low-bandwidth solutions like LoRaWAN, for example, that don't use carrier networks or telephone networks. Well, that's still a wireless network, but with different properties: much lower bandwidth, but much higher power efficiency. So there is a wide, wide array of potential solutions there, and there are players in the infrastructure market for every one of those technologies. So the key there is really to start small and evaluate, according to your particular use case, what could be the good fit. We were having a discussion, Steve, me and other people, yesterday, and Steve was mentioning a drink manufacturer that was deploying refrigerators all over the place in various locations, and they implemented ways to determine the inventory level, so that people would come when a refrigerator is nearly empty but not come when it's still quite full. And they ended up settling on LoRa, because reporting once a day with a tiny packet about the state of that particular refrigerator was good enough for them. And in their case it was not using fixed infrastructure; they were deploying, let's say, the endpoints in moving vans, so when a van was coming close to the place, the packet was sent, and then at some point, I suppose, the van was getting close to a cell tower and transmitting there to a central location
and so on. So you have to figure out what the requirements are and what's the best technology fit, because of course the telcos have a wide array of options for you, but they are not in the business, yeah, exactly, they're not in the business of saving you money. So sometimes it's more economical to look at alternatives that will have lower bandwidth, but sufficient throughput for what you do, and that you will then deploy yourself. Or, and these are emerging, in most large cities you will have city networks for lower-bandwidth options that you can economically use for those kinds of applications. And that's one thing that we've seen in our surveys: I told you in my presentation about our developer survey, but we also publish every year a commercial adoption survey for IoT and edge, and in that we discuss use cases and things like that, and people told us in the latest edition of our commercial adoption survey that connectivity was their number one problem. Not that they lack options; there are too many options, and it's confusing. In my book I have a chapter where I discuss some of those alternatives, but there are free resources on the net where you can at least get a list of them and get some primers on them, so feel free to explore, because there's more than just going to AT&T, let's say. Yeah, I'll throw a little more color in there about that conversation he's talking about. My attitude is there's never going to be one size fits all, because there are different use cases. So this scenario of having a point-of-sale refrigerator, trying to count the number of drinks that are in there or have run out so it needs to be restocked, is a different scenario than, I don't know, security, where you need to send out a notification almost synchronously that this door opened, or that the vault door was opened by somebody, to somewhere. If you're just trying to guess the drink inventory by how many times the
refrigerator door opened, sending that once a day is fine; there are probably also not a whole lot of security concerns about that, or about sending a weather temperature reading, nobody really cares if that's encrypted, more than likely. But some other things, if there are financial repercussions involved, like driving an ATM out at an edge location, that's an entirely different thing. A thing that commonly enters in here is even legacy protocols, and there are a number of very rich open source projects that label themselves as gateways, that will translate or fully support these legacy protocols. Some of them, like Modbus, have been out there for decades, and getting them onto something more amenable to modern transport, maybe more securable, well, there must be dozens of open source projects just in that category of protocol translation and gateways alone. Yeah, you shouldn't reinvent the wheel. Even if you're inventing a new app that needs data transport, you need to look at concerns like: is it one way only? Is it information that you really don't worry about being open or not? If there's no backflow to the edge location, where it's just reporting the temperature of the meat in your barbecue grill, it's one way only. But if you control the burner in that barbecue grill, that's a different situation, where maybe a hacker could cause a fire maliciously or something, and suddenly that bidirectional nature is going to put you in a situation where there are, or maybe should be, security concerns. You're just going to have to evaluate it and categorize it, but once you do, I'm going to tell you that I think there are plenty of open source things out there to support whatever you determine your needs are. With regard to LTE, 5G, whatever, the telcos do have service, if you don't want to run your own mesh networks and things, but there are alternatives for low bandwidth there on unlicensed spectrum. I don't know how prevalent it is, but I've seen some of the telco equipment
vendors coming to conferences even touting the ability to run your own 5G networks. You have to be willing, essentially, to buy the same hardware the telcos use to run towers, and run it yourself, but I think there's open source software to enable you to do that. Yeah, private 5G in factories or warehouses is something that we are starting to see, but of course that's an expensive proposition. And there's also non-cellular 5G, which is ITU certified, that is maybe appropriate. Also, you might want to check out Adafruit.com; they have lots of projects, and they have a way of discovering free, open source connectivity solutions. One other thing I want to throw out there, while it's your talk, excuse me for butting in, but there's a lot of crowdsourced information out there that I found interesting, almost as a hobbyist. Like, I put my own weather station up, and I find that other people have done the same thing and they share it on networks; there's sharing of air traffic control readings, and there's a lot of that. Then there are shared mesh networks, like LoRaWAN and things like that, where it's up to you to put a gateway up there on LoRa, but then, assuming enough people in your urban area have also done that, you can enable the construction of, effectively, a community-driven shared network. Those things are out there and very interesting. To what extent they'll permanently be there in a solution, or whether they're appropriate for some commercial solution, is questionable, but they are out there, and the world is changing.
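To make the once-a-day, tiny-packet pattern from the refrigerator example concrete, here is a rough sketch of how compact such a report can be, using Python's standard struct module. The field layout, device ID, and temperature scaling here are my own illustrative assumptions, not anything from the talk or from a specific LoRaWAN stack:

```python
import struct
import time

def encode_fridge_report(device_id: int, stock_pct: int, temp_c: float) -> bytes:
    """Pack a once-a-day inventory report into a compact binary payload.

    Layout (big-endian): 16-bit device id, 8-bit stock percentage,
    16-bit signed temperature in tenths of a degree C, 32-bit unix timestamp.
    """
    return struct.pack(
        ">HBhI", device_id, stock_pct, round(temp_c * 10), int(time.time())
    )

payload = encode_fridge_report(device_id=17, stock_pct=20, temp_c=3.5)
print(len(payload))  # 9 bytes
```

Nine bytes once a day is the kind of traffic where a passing van or a community LoRa gateway, as in the example, is a perfectly reasonable transport, whereas a chatty bidirectional protocol would push you toward the cellular options discussed above.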
Any other questions or comments, or even people who want to respond to that question? One thing I'd like to hear from you, Frédéric, just a suggestion, but I know that at the Eclipse Foundation you run your community meetings on a regular cadence, and the CNCF has an IoT Edge working group that also runs meetings; we intentionally, I think, are on an opposite cadence, so I think both of us are every two weeks. If you just want to go in there, sometimes there are presenters talking about these projects, but often we don't have presenters and it's just birds of a feather, where you're welcome, like your question of what's out there, or "I'm trying to accomplish this," and you just pop onto a Zoom and say, does anybody know a solution for this situation I'm in? It is confined to open source, so at these meetings, both the Eclipse Foundation's and the CNCF's, we don't want vendors showing up giving price lists and order numbers, but there are plenty of other people that can advise you, for sure. But anyway, maybe, since your laptop's there, if you can open up the meeting schedule, or just tell them what it is or where to find it. Yeah, absolutely; that's a good way to follow up, because this event is one time only, maybe once a year, but these are on a regular cadence. And you don't want to turn it into a technology hype thing: at the cusp of new things, whether it's the internet or cloud native, somebody shows you this hype-cycle hockey-stick growth of "this is going to change the world," and actually there are these analyst companies now that are saying cloud maybe is leveling off, but that edge is going to be, I don't know, you've seen it, Frédéric, I think, what are they saying, six times the size, the compute that's going to be deployed at the edge, because of technology changes, some people are claiming will be six times the size of what's in public cloud so far
For the community meeting, I just remembered, the information on the website is out of date, but you have the handy-dandy button to join our Slack workspace. It's open to anyone, so feel free to join and then ping me, and I will give you the details, and I will make sure to correct that once I'm back from my trips, sorry. Yeah, ours is in the mornings, at 11 Eastern, 8 a.m. Pacific. One thing people say is that the cloud is just other people's computers; IoT might be thought of as the rain from those clouds. Also, you might find it interesting that Toyota will upgrade the onboard computers in their cars, but not if the car is on a hill. Interesting, yeah, so that's the type of location awareness and physical constraints that we need to take into account. Even there, I would be a bit uneasy knowing that, you know, it has to be flat or it will stop deploying. Yeah, okay, interesting. I mean, SF is quite steep. I'm not authoritative on this, but I've heard that some of these are only situations where you, like, drive by the dealership, so if your thing got bricked at least you'd be there, and they don't broadcast it in open spectrum; there are local broadcasts coming from the dealership or the parking lot, so you would go there and not have to book an appointment; you could just drive your car into the vicinity of the dealership, linger around, and get your update. And one thing that I don't want to see, but it seems carmakers are contemplating, at least I've seen a news tidbit about Ford contemplating a model where, if you miss a payment, the car drives back to the dealership and you are out of a car. That's probably a bit too far for my taste, but anyway, there are great possibilities for, let's say, improving our experience as a car owner or driver. Finding the right balance, I think, will be key, because I don't want, especially in Canada, to have to pay a subscription for heated seats. Exactly why I
didn't want to name them, but yeah. Fortunately I don't have that kind of money, so for now I'm safe. BMW began thinking about transportation as a service at the corporate level, as a transformational understanding, back in the mid-to-late 80s, and so that kind of 20-to-30-year rollout into what is currently their product line is a good time scale to understand. Okay, I think we've reached the time limit, but I don't know if you've got a few minutes before you leave; we can take this into the hallway and continue on if people just want to chat. [pause] Good afternoon, folks. We're going to start just a minute early, but before Josh introduces me, I want to point out Josh: he is one of the volunteers, and these folks who work this show do it for the love of the community and their love of you, so make sure you tell them thank you. All these folks walking around in the SCALE shirts, especially the folks in the hockey shirts, be sure to thank them; they spend a lot of sleepless time and don't get any thanks for it, but now everyone in here is going to go say thanks to them and they're going to feel great for the rest of the day. Thank you, Dave. And this is Dave Stokes, who will not be new to anybody who's been attending SCALE for a while, but this is a new talk from Dave, on how to move your LAMP stack to cloud native, and here's somebody who's been doing LAMP stacks for as long as there have been LAMP stacks, so he will have a lot to say about that. As always with other talks, if you have a question, and I hope you have questions at the end of the talk, wait for the microphone so we can get it on the recording and so that everyone else can hear it, and with that, welcome, Dave. Okay, for those who weren't awake during that introduction, my name is Dave Stokes. I'm @Stoker, or David Stokes at percona.com, if you need to get hold of me. Slides are available; I'll have a link at the end. So why this presentation? Many, many years ago I put the American Heart Association on the internet, and at the time
the Apache server was still fairly new. I was running their website on a Challenge S from SGI, I see a couple of folks flinch at me saying that, and this crazy guy had this thing called Personal Home Page that later became PHP, and from there I needed a database but couldn't afford Oracle, so I ended up stumbling into open source databases. So I got to know that technology pretty well, and then a couple of years ago I started running this thing called Kubernetes, and as things started going I thought it was interesting, but I didn't pay a lot of attention to it until I realized a lot of people were really heading down that road. Let me state up front: this is from the best movie ever made in San Diego, and I do love lamp. If you haven't seen Anchorman or Anchorman 2, they are probably near documentary status for the news industry, and very funny. So LAMP, Linux, Apache, MySQL, PHP, maybe Perl if you're really old, was the basis of the web for a long, long, long time. It had a lot of good technology, technology that still works very well, and unfortunately, once you learn a piece of technology, that means it is automatically outdated; you have to learn something else, whatever you have is now no longer valued, you have to tackle the next thing, and the next thing of course is Kubernetes. So, ignoring the Linux and PHP aspects right now, let's take a look at just the web server, which is illustrated here, and the database, for which we'll use the file server. So this came up, and it was wonderful; for those of you who had run Gopher sites, this was amazing, this was a lot better: you could actually have graphics, and it worked really well. And of course, with anything that works well, someone pushes the limits, and they figure, well, we need to do a little bit better, so maybe we'll do something on the database side where we'll split the reads and the writes to get better performance. And then of course someone decides that they need multiple front-end servers, and whenever you have multiple front-end servers you end up needing a load
balancer. By the way, I love these graphics; the idea of someone juggling being the load balancer just strikes me as great. And once you get this worked out, the database ends up being the part you're really worried about, so you end up clustering your database, doing things redundantly, replicating things for multiple sites or backups, and it goes from being a LAMP stack to being a horizontally scaled mess to manage, but it's still pretty much the same technology. So there are two obvious problems that people started pointing out, and by obvious I do mean expensive. Not all applications utilized all the resources; you had a lot of excess capacity, and anyone who's worked with big business for a long time knows excess capacity is something to be avoided; for some reason it drives a lot of people up the wall. So they started looking at things and said, well, we're only using a fraction of the available resources, let's reduce this down to its finest point, and this ends up being the container. And the question here is: is it more bang for more bucks? About 12 years ago you started hearing about things like Ansible, and then Docker, and these other containerized things that were great for developers, because you could have your own little environment in one little nugget of computing power. And then people started saying, well, why can't we take this and put it up there, where we can take advantage of it? So this began the move to containers. I live in a part of Texas that's by a multimodal transport center; this means trains, planes, trucks, and all that, converging at the same time, and at any time, on the road directly in front of me, going 10 miles under the speed limit, are thousands of container trucks. So containers are great: they isolate a lot, they give you what I think of as a nugget. It has everything there; it's like a studio apartment when you're in college, you have everything you need, a fridge, a sink, a toilet, a place to sleep. And they're great in that
they can work with each other over fairly well-defined channels, and they take up a lot fewer resources. So let's take a look at this: you end up using your infrastructure with the host operating system, and here we're using Docker; there are other engines we can use if we want, but Docker seems to be the prevalent one. And we can run a whole bunch of apps that can intercommunicate, or be locked away from each other, as much as possible. And the visual representation is kind of like this: those locals who've been down to the Port of San Pedro or Long Beach can tell you that they have these huge cranes moving these things on and off ships in an amazing fashion; it is the ultimate game of Tetris. So containers emerged as the way to make software portable. You had everything self-contained, and what was great about them is you could ship them off to somebody else and they would work, and you got rid of this old problem that Dilbert showed, the old "hey, it works on my machine, I don't know why it doesn't work on your machine, your server, or in production." So, containers: let's take a look at a database example. Here, on an Ubuntu box, I install curl, install Docker, and then I tell Docker to run a database, and in this case I'm going to use Percona Server. Plug for Percona: we're an open source database company, we take things like MySQL and give you enterprise features for free, so if you need connection pooling, data masking, encryption at rest, we have the stuff that Oracle or MariaDB charge you money for. And notice here that on the second line I'm passing a root password, and I'm specifying a particular version of the software to run. So it gets up and running, and if I do a docker image ls, it will tell me that it's actually running Percona Server, whoops, the version that I asked for, the image ID, when it was created, and a rough size. And if I do a docker container ps, it comes back and tells me the image that I'm running, how long it's been running, and down here, percona-server and this funny bebf3
stuff. It used to be that when you had a server, you named it: you had Star Wars names, you had Star Trek names, you had all these great names for these servers, and then things flipped and people started treating servers not like pets but like cattle, so now they have these names like this bebf. Now, once you have this out there and running, it's fairly easy to tell Docker that you want to go out there and talk to it; you tell it what you want to talk to and the command you want to run. So you type in this, you're actually inside that bebf36, and we're running bash. So I tell it to run mysql, and there I'm running Percona Server 8.0.31. So that's the basics of containers. Now, for the next step up from there, well, first you have to learn how to stop it, and by the way, here I'm doing a ps, to connect it to the ps command you do on the command line to find out what's going on, and then you get to stop it using that lovely, handy name. Okay, the next step is into the cloud, and this is an interesting quote: before we had airplanes and astronauts, we really thought that there was an actual place beyond the clouds, somewhere over the rainbow; there was an actual place we could go, above the clouds, and find it there. Probably the best thing I've ever heard Barbara Walters say, but other people will tell you the cloud is simply someone else's computer that you're renting. Okay, so several years ago people started rushing to the cloud. I was at a MySQL users conference and I heard that these folks from Amazon were selling elastic blocks, and I was thinking of something physical in a cube shape, made of rubber, and I couldn't figure out why people thought it was so interesting, but later found out why. So, going to the cloud: it has to be cheaper, right? Anyone here ever open up their Amazon bill and gone, "oh my god, I'm so fired"? You know, you no longer need a computer room, you don't need those funny guys running around at all hours doing the funny computer stuff, you don't need that
electricity bill, you don't need the extra air conditioning, you don't have to worry about ongoing capital budget expenditures to keep everything running, you don't have to worry about the hardware service contract and all that yucky stuff; go to the cloud, and that all disappears. By the way, if you need an upgrade, just pull out the credit card; it's faster to provision a server. The first time I ever had to ask for a new server, I had to write a capital budget request, submit it to my boss, who submitted it to finance, who submitted it to the provost of the school, who sent it back to the budget committee, who sent it to the capital finance committee, who approved it, and six months later I got a purchase order, and it only took three months after that to actually get the hardware up and running. So, with all the stuff in the cloud, you have better integration with a lot of modern practices; the CI/CD cycle and a lot of the agile practices worked very well with everything in the cloud, and plus, since everything was based on containers, you had these little nuggets you could pass around and work with, and you got almost infinite scaling, as long as your credit limit holds out. And of course nothing ever goes wrong with containers; yeah, this is a whoops. So along came Kubernetes, and the great thing about Kubernetes is I don't know anyone who will really tell you, "I know everything about Kubernetes I need to know, and it all works great." Some people say Kubernetes is the operating system of the cloud; other folks will say no, it's not, but there's really no definitive answer there. So for those who have not heard of Kubernetes, if this is the first time you're hearing about it: it was originally designed by Google, it's now maintained by the Cloud Native Computing Foundation, and it's suitable for managing large cloud native workloads. It has spread to wide adoption, and everything is based on containers, and rather than calling them nuggets like I do, they call them pods. The great thing about a pod is you have one or more containers
that are guaranteed to all be located together, which makes interoperability nice and easy, and you can make sure they all get unique IDs, and all the containers, once again, can reference each other, so you have your own little biosphere for your applications out there. Those of you who are old enough to remember this movie, anyone remember this actor's name here? Dullea. He made about four other movies after this and kind of gave up. The big scene in the movie is that the HAL 9000 computer ran everything on the spaceship, and he would ask it to open the pod bay doors, and of course the computer would come back with "I'm sorry, Dave, I'm afraid I can't do that." Growing up with the name Dave, this was one of the two quotes I heard all the time; the other one is from Cheech, and I will not repeat it. So you have these nodes, and the nodes can have multiple containers; they can be just about anything you want, they can be applications, they can be load balancers, they can be servers, they can be disk drives; everything is treated as one unit. Now, the pods can interact, so you can actually set them up so that, if you do have these little containerized areas, you can have them talk to each other, either on the same hardware or across the world. I'm greatly simplifying a lot of this because I want to give you the ten-thousand-foot view, but the idea is that you have all these little compute units, wherever you want, whenever you want, and Kubernetes manages all this; it oversees the operation, and if something disappears, it restarts it; it takes care of all that for you. Now, here's a MySQL example with persistent storage. I like Minikube if you're just starting off with Kubernetes; the thing I like about Minikube is that it works on a laptop, you can do all the little playing that you'd like and then get rid of it, and of course you cannot fail to love any piece of software that has so many emojis; emojis equal quality, right? So the crazy thing is that, to describe a simple MySQL database, you saw that when I set up the container I basically said, okay, I want you to
go out to Percona's website and pull down, from their repo, their version of the server. Simple, right? Well, now we have this, where we come in and we say: okay, we're gonna have a service, we're gonna call it mysql, it's gonna listen on port 3306, if something happens we're gonna recreate it, and we're gonna pass it a password, the port that we're gonna use, where we're gonna mount volumes, and some other basic information. So once you get your Kubernetes machine running — I'm using Minikube for this — you use a command called kubectl to get the pod running. In the first case I'm running a command that says: okay, this is for my persistent volume of data — I'll get into persistent volumes in just a little bit — go out there, get all the information, and set that up. And from there you look at the file I just sent you and apply that, then you wait a couple of moments, go out there, look at the service, and you find out that you have your Kubernetes running, and there, lo and behold, is your MySQL: it's listening on port 3306 and things are great. And if you actually type kubectl get pods, it gives you slightly different information: this horrible mysql-dash-number name, whether it's ready, that it's running, whether it has restarted, and how long it's been running. And then, from there, to talk to it you exec to this lovely name here, and in this case, once again, we're gonna run bash in that container — I'll show you that running now. Something I like about Minikube is that you can type minikube dashboard and it will give you visual feedback of what's going on, so you can see how it's actually operating, and it will give you status, so you find out what's up and what isn't. It's fascinating, the more complex your containers get, to watch these little things turn from red to yellow to green — and then, as you're working along, something stops working, and you look up and something's turning from yellow to red again, and you get to figure that out.
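A minimal sketch of the kind of manifest being described here — the names, image tag, and password handling are illustrative placeholders, not the speaker's exact files:

```yaml
# Illustrative sketch of the Service and Deployment described above.
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  ports:
    - port: 3306
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: percona/percona-server:8.0   # pulled from Percona's repo
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: example-password          # use a Secret in real life
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: mysql-data
```

The flow then matches what's described on stage: kubectl apply -f the file, kubectl get pods to see the mysql-dash-number pod, and kubectl exec -it into it to run bash.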
So we're now actually ready to talk to the database. We ran bash in the last slide; I get out there and I say, okay, run mysql, and there it is — in this case I'm running Oracle's MySQL — it's all there and running. Very similar to the container idea I showed you earlier: we're just running containers, only now we're doing it under Kubernetes. So the big thing is, instead of running the container directly on your laptop, you're running the container through Kubernetes — that's the big takeaway from all this. Now, this is a little more advanced example: actually running the LAMP stack. I don't have any example files here, because I've been playing with this for the past four and a half months and haven't found any good examples that I'd love to show off, but in case you want to run a LAMP stack, the idea, roughly, is: you're gonna need to tell it what you need to run WordPress (or whatever LAMP application you want to use), a persistent volume claim for the data, and then you're gonna have to do your MySQL deployment. Once you get everything up and running, it will tell you that you have WordPress running and you have MySQL set up to run for WordPress, and now you're ready to have your LAMP stack run. Now, there are thousands, if not millions, of predefined YAML files with all this information out there — like I said, I've been playing for four and a half months trying to find a set that I really like; not yet. Writing your own is a little rough; we'll talk about that a little bit later. But the idea is that your LAMP stack is not that far away — it's fairly easy to move things over. So right after launch — by the way, I need to get a picture of you all for my boss; got a new boss, and he's wondering why people come to these computer talks; okay, here we go, thank you — so right after launch you have Kubernetes running, you have the PHP service, you have WordPress, you have MySQL. Now, you'll notice that with all that, I didn't have to define what the PHP service looks like; that was defined by the Kubernetes service.
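A rough skeleton of the WordPress-on-Kubernetes idea just described — abbreviated and illustrative, not a complete working set of files:

```yaml
# Sketch only: a WordPress front end with a load balancer on port 80,
# pointing at a MySQL Service on the standard port.
apiVersion: v1
kind: Service
metadata:
  name: wordpress
spec:
  type: LoadBalancer
  ports:
    - port: 80            # WordPress listens on 80; the LB fronts it
  selector:
    app: wordpress
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
spec:
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
        - name: wordpress
          image: wordpress:latest
          env:
            - name: WORDPRESS_DB_HOST
              value: mysql:3306   # the MySQL Service from earlier
# ...plus a PersistentVolumeClaim for the data and a MySQL
# Deployment/Service along the lines of the earlier example.
```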
What's interesting is, sometimes, if you're grabbing something from column A and something from column B and something from column C on the web, trying to get your own stuff running, other folks may have already defined something that you want to do before you've gotten there, and you end up having two PHP services, or two database services, or whatever — so be careful as you pick these things. Also notice the ports over here: WordPress is listening on port 80, the load balancer is the front end for 80, and MySQL is listening on 3306, the standard MySQL port. So whatever ports you're used to can still be used in your Kubernetes deployment. And I'd like to warn you: these are not the droids you're looking for — scaling. A lot of people use Kubernetes to easily scale, and it does do that: need more resources, throw more pods at it; need fewer resources, take away a couple of the pods. And you can actually scale across data centers — but scaling across data centers is very expensive, and as with anything cloud, if you screw up, you can have something out there running for weeks on end before the bill hits, so be very, very careful. The YAML configuration files: most of the Kubernetes stuff you see out there is configured with YAML files. Now, I thought we had all agreed, about two years ago, that everything we do would be in JSON, but someone came up with the idea of YAML files, and they're easy to read, they're easy to parse — they seem awfully verbose to me, but here's one where we're doing it: we're setting up a pod, we're gonna call it static-web, in this case we're gonna run nginx, and it's gonna be on port 80. Now, the great thing about YAML files — and the only great thing about YAML files — is that they have somewhat gotten rid of the argument of tabs versus spaces. Okay, persistent volumes. Everything in a container is designed to be ephemeral; it's designed to be used and thrown away, like a Kleenex after use. That's great for everything but your database.
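The static-web pod described a moment ago is only a few lines of YAML; this mirrors the well-known example from the Kubernetes documentation:

```yaml
# A minimal pod manifest: one nginx container on port 80.
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
```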
If you bring up a new database, write your data to it, and then throw it away, you have not saved anything — you've basically wasted time and cycles. Now, for databases, what you have to do is say: I'm gonna have a persistent volume. This is a separate declaration from anything else, and you have to go out there and say where it comes from and how it runs. Now, I'm not a Kubernetes expert — I still have the training wheels on; I'm a database guy — and I've been looking at some of the ways people define persistent volumes, and some of them make me a little queasy. The other alternative is to use something like a DBaaS or some other service and point at that — let someone else run the fiddly bits. Any English people here? The English have a term, "fiddly bits": that's when you have a bunch of little things, sub-finger-size, that tend to get messy when you try to assemble them. And here's my two cents about Kubernetes — this is my opinion alone, not my boss's, not Josh's, not anyone from SCALE. The things I kind of worry about: it is too complicated. A LAMP stack is fairly easy to configure; even when you're doing crazy rewrite rules with the web server, it's still fairly simple. Kubernetes is far from simple. Too many varieties: there's still no coalescing around one or two Kubernetes approaches or monitors. If you go out there and look at operators — which is something I haven't touched on — everyone seems to have their own operator for whatever they're doing, and they all work differently; some give you things you don't want, and a lot of them don't give you what you do want. The case I'm making here: we need to modularize and homogenize — it's too late in the day for me to say that word. Hopefully, over the next couple of years, this will kind of straighten out. And another problem: one size does not fit all, or even most. If you grab someone else's configuration, it may be exactly what you want, but odds are it isn't; you're gonna have to be able to tailor it.
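The separate persistent-volume declaration just described might be sketched like this — names and sizes are illustrative, and hostPath is only suitable for single-node setups like Minikube:

```yaml
# A PersistentVolume (where the storage comes from) plus a claim on it.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data        # fine for Minikube; not for real clusters
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```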
And as my dad used to say: when the only tool you have is a hammer, you tend to whack the crap out of everything. So not everything needs to go into Kubernetes. If you have a simple application that works with the LAMP stack and you want to port it over, that's okay — but do you need to port it over? No. And with that, I'd like to invite you all to Percona Live, May 22nd to 24th at the Denver Marriott; it is the open source database conference to attend if you're into databases. And I'd like to point out that the slides are available at speakerdeck.com/stoker. And with that — what questions, please, do we have? Yes, sir — we've got a mic over here. By the way, you have the best beard of anyone here. The question: are there still concerns about long-running processes on those Kubernetes nodes — that it will time you out if, for example, you have a SQL query that's gonna take a long time? Are there timeout concerns we should know about? There are timeout concerns, just not in Kubernetes — there are timeout concerns with any database. The other concern I have is that everyone's using the root account for everything; that gives me the heebie-jeebies. It's one of those things where it's a bad practice, but it's the way people are doing stuff now. If you go to the cloud conferences, there are thousands of vendors out there who promise to have good security measures for Kubernetes; I haven't gotten far enough into the pile to tell whether they're BSing me or not. Yes, sir. The question I have is: is there an easy way to set up additional accounts, with grants and roles, using Docker and the rest, or is that frowned upon? It's not frowned upon. What's interesting is, you have ways, with the various operators and some of the Docker containers, to basically say: okay, bring up MySQL, wait 20 seconds for everything to spin up, run this backup restore to restore all your data and pull in the grants and all
that. That's why people are using persistent storage — you have that all predefined. By the way, Josh, if you have anything, kick in. So, Kubernetes has a very robust RBAC system, which ties in with seccomp and Linux, but the downside of it being extremely robust is that it's also extremely complex. And by the way, when you're setting up stuff, it's not exactly quick: you spin up something and you're waiting for the little thing to turn from yellow to green, and it might take a lot longer than you'd think humanly possible — then just as you go off for coffee, it turns green. More questions? Right behind you. In teaching yourself Kubernetes, what resources did you use — books or YouTube videos you might recommend? I see a lot out there and I'm not sure where to start, from the neophyte's point of view. I haven't found a good book yet. What I did do is a lot of web searching — you know, "running MySQL on Kubernetes" — and there's a couple hundred answers; I started plodding through some of those. There's a lot of chaff to the wheat, so you'll have to fight through it. And once I did, I said, okay, how do I get LAMP-stack technology running on Kubernetes — roughly the same number of answers; some were helpful, some were not. And unfortunately, on the web, as fast as the technology moves, something that's three or four years old is outdated and doesn't work anymore. Just wanted to point out that if you're interested in learning a lot of stuff like this, Jay LaCroix has a YouTube channel called Learn Linux TV, and if he hasn't covered it already, he probably will; he's got a great amount of stuff on there, and he's very well spoken. Can you move the mic just a little bit closer and repeat that website? It's on YouTube; it's called Learn Linux TV, and the guy that runs it is Jay
LaCroix. Okay: Learn Linux TV, Jay LaCroix. Okay, cool. More questions? Well, if there are no more questions — or you're too shy — I'll be over at the Percona booth, which is number 220. And if this is your first SCALE, you've come to an amazing event; hopefully we'll see you next year, and remember, you too should be up here on the stage, teaching others. So thank you for coming. Well, thank you, sir. Not yet... oh, there you go. Yes — okay, it's on, and welcome everyone; I started a little early. Hello and welcome, everyone, to Kubernetes Community Day LA, day two — that was words and letters. I hope you're excited for this next talk, where Mofi Rahman is going to be presenting "Running Batch Workloads on Kubernetes at Scale." He will be taking questions toward the end of the presentation if there's time, and if so, I will be running the mic around — so when it comes time for questions, just raise your hand and I will come by with the mic. Take it away, Mofi. Hello, everyone, my name is Mofi, and today — thank you for having me — I'm giving a talk about running batch workloads on Kubernetes. At scale! I should have added that; that would be funny. Is this working? Probably. My name is Mofi; I work at Google as a developer relations engineer — I actually work on Kal's team — and you can find me on the internet at moficodes. If you think I'm saying things that are wrong, please yell at me; I enjoy that, so I can get better and say the right things next time. So let's jump right in: batch workloads. What are they? Why do we need them? Why are they here? The first way to think about batch workloads is that anything you run that then finishes can be considered batch. That's a very broad definition — a lot of people might disagree; shake your heads, that is fine — but for the purpose of this talk, anything that starts, does something, and finishes, we're gonna consider batch. We're gonna see a narrower definition in a second, but start thinking in those terms.
So one of the things you're gonna hear quite a lot is: what's the difference between batch and real-time streaming — when should I do one or the other? The easiest way to think about it: real-time streaming is when you have a data pipeline and you're processing the data as the data gets produced. Most of those use cases actually used to be batch processing, but as our computing power increased, we can now do them in real time. So one prediction I can make: in the future, when we have a lot more computational power, we'll be able to do a lot of today's batch things in real time — real time just requires a lot more resources. But some of these use cases we do in batch not because we can't do them in real time, but because we have to: the data set is just too big. So who needs batch? The first group is data-intensive tasks. You're trying to create some sort of ML model — most of us know about ChatGPT; that's a large language model with billions of parameters that need to get trained. ETL pipelines — people are still doing that, I guess: you take some data, do some transformation, load that data. These are data-intensive tasks you want to do on large machines together, so you can do them quicker or cheaper. Scientific research: a lot of the cool things that happen around us are the product of people doing math and science and figuring things out, and they need to run a lot of experiments — things like gene sequencing (figuring out diseases, how did this happen), DNA sequencing (figuring out how diseases propagate, or maybe creating new medicines), fluid simulation to make better vehicles — planes, cars, trucks: you can build mathematical models of how fluid moves through space and then build the physical object to make it better. Or anything that you can parallelize — again, things like data processing: you can process
the same data as a single object, or break the data into, say, ten parts and process all of them separately, and then you can do the thing roughly ten times faster. So, the different types of batch workload that exist. Obviously you have your classic ETL pipeline — it could be anything, like analytical data from your company's transactions: you take that data, do some transformation, clean it up a little, then finally load it into some space so you can do more things with it. It could be machine learning model training — again, things like large language models, image processing; all of those are machine learning models that we train with a lot of data. HPC workloads — HPC stands for high-performance computing; a lot of scientific computation is part of HPC. You also have things like data analytics, data science, all this other fun stuff. Another fun fact: HPC was initially the term we used for both machine learning and data analytics, but as these different things grew, we realized that calling all of them HPC was doing them a disservice, so we broke them apart into their own practices. So machine learning and data analytics do things similar to what HPC does, but more specialized in their own ways. So why do we use batch, then? Number one: cost. Oftentimes, doing things in real time is more expensive, so we do them in batch — we run them on machines that aren't being used much right now, so we can do them together in non-busy times. Sometimes we use batch for speed: if you're doing one thing at a time versus ten things at a time, you can get done faster, even though it costs more money. For HPC workloads — we want to figure out, again, how you make the next vaccine or the next medicine; we want to make that calculation, and we can throw as much resource at it as we like — very high-performance computers — to calculate those results
faster. We do this for reliability: if a scientist came up to you and said, "I think this medicine is going to cure that disease," and you asked, "how do you know?" — "I ran it ten times and one time it said it's gonna work" — are you gonna take that medicine? Raise your hand, who's gonna do it? No one, right? We need to be able to run our mathematical simulations, our tests, reliably, over and over again. So we do them in batch, so that we create almost an isolation of the work we do, and we can do that work over and over again. And finally, the over-and-over-again part is repeatability: in the scientific community there is a big crisis of repeatability of the work we do. Oftentimes someone will create some cool thing and then go, "oh, it worked once on my machine, and I think it does work, but I don't know how to prove it actually does work." So let's talk about designing a batch workload — we still haven't talked about anything Kubernetes; we're gonna get to that at some point — let's just talk about designing a batch workload. It all starts with a job. This is not a Kubernetes Job; this is just a job: the thing we're gonna do that runs to completion. Then, when you have the job, you want to run it on some worker nodes, and in front of those worker nodes we put a scheduler, so that we can schedule our work onto the worker nodes. But we don't want to give jobs direct access to the scheduler, because if we have multiple people using the same space, our scheduler is gonna get overwhelmed. So what do you do? We throw a job queue in front of it: we put our job in a queue, and as resources become available, we run that job through our scheduler. And we want some monitoring on top — we want to see what's happening in both our scheduler and our worker nodes, so that we know how much resource we're using and whether we need to add more. And again, when you're doing a job, oftentimes you're dealing with data, so you need storage for the input, as well as storage for the output once the work is done.
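The design just described — jobs go into a queue, and a scheduler admits them onto worker nodes as capacity frees up — can be sketched in a few lines of plain Python. Everything here is illustrative (this is not Kubernetes, and the names are invented):

```python
# Toy sketch of the batch design above: a job queue in front of a small
# pool of "worker nodes", each running jobs to completion.
import queue
import threading

def worker(jobs, results, lock):
    while True:
        job = jobs.get()
        if job is None:                      # sentinel: no more work
            jobs.task_done()
            return
        output = job["fn"](job["input"])     # run the job to completion
        with lock:
            results.append((job["name"], output))
        jobs.task_done()

jobs = queue.Queue()
results = []
lock = threading.Lock()

# "Scheduler": enqueue jobs; workers pull them as resources become available.
for i in range(5):
    jobs.put({"name": f"job-{i}", "fn": lambda x: x * x, "input": i})

workers = [threading.Thread(target=worker, args=(jobs, results, lock))
           for _ in range(2)]                # two "worker nodes"
for w in workers:
    w.start()
for _ in workers:
    jobs.put(None)                           # one sentinel per worker
for w in workers:
    w.join()

print(sorted(results))
```

The queue is what keeps a burst of submissions from overwhelming the workers — jobs simply wait their turn, which is exactly the role the job queue plays in the diagram being described.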
You train the machine learning model, and you want to be able to write that model out to some space. And if you want to do this at scale, you need to be able to auto-scale your worker nodes to some degree: you have a lot of work coming in right now, so you scale your worker nodes up to a certain number; when you're not using them, you scale them back down. So now, the big part: why use Kubernetes? Because people have been running batch workloads since before web servers were a thing — we've used batch since the 1980s, even earlier, to calculate different things. So why use Kubernetes? Number one, resource management: Kubernetes is basically resource management with some fancy APIs on top — Jiffy's gonna probably disagree with me on that one, but it is what it is. Very nice scalability: Kubernetes has really nice APIs to scale up and down when you need to, and if you're in a cloud provider's ecosystem, you can scale your nodes up as well and scale them down when you're no longer using them. Fault tolerance: Kubernetes has built-in systems so that if a job fails, it automatically restarts it, up to some back-off period. You have monitoring and logging: this is not part of Kubernetes itself, but there's a lot of tooling in the world of monitoring and logging for Kubernetes that you can use to get information about your cluster — and if you want to learn more about the observability state of Kubernetes, there's a talk I'm giving tomorrow in room 107 at 11:30 that you should probably check out; I'll give you more of a CliffsNotes version of everything you need to know about the state of Kubernetes observability. Finally, portability — and this one is hit or miss, because for the most part, if you build an application on Kubernetes, you should be able to take it and move cloud to cloud, on-prem to cloud, or any other hybrid situation. I say "for the most part" because
there are certain things, like networking and storage, that don't necessarily translate all that well. But with a little bit of work, you can kind of get to where the same workload you've built can move between clouds, or even on-prem. And finally — and this, I think, is the big one — your company is already using Kubernetes for a bunch of other things, and now you're asking for a different platform to do batch workloads, and your boss comes to you and says: why do you want a different platform? Just use Kubernetes, because everybody's using it. And that is probably a big one, right? Anybody in the room right now whose day job is only batch workloads or machine learning workloads — just that? Okay. Anybody do any machine learning or batch workloads at all? A couple of you. So all of you have a mixed type of workload: you're using some for web servers and other things, and you also have some batch needs. In that kind of system, Kubernetes is really cool, because Kubernetes can do both pretty well. All right, so let's talk about running batch workloads on Kubernetes — not at scale yet, just on Kubernetes. The first thing you need to know: the Job. This is the Job we're talking about in Kubernetes — not the other job; the Kubernetes Job. So what is a Job? A Job is a computation that runs to completion: a group of pods running independently or collaboratively to process a task — and "task" is something you yourself define. And finally, it's often flexible on time, location, and/or types of resources, and this is where Kubernetes, being a system that has APIs to give you resources, really comes in handy: you can specify, for a particular task, that it needs SSD, or GPU, or TPU, or needs to be in a particular location — all of those things can be configured. And there are two types of jobs in Kubernetes — someone yelled out "cron jobs," so this slide is for you: you have Jobs and CronJobs. We already talked about jobs, but this is what a Job would look like; the
font is pretty small, but I'll tell you what's in there. You have a definition of a Job, and this particular Job is using a perl image and running bpi, which prints out the first 2,000 digits of pi. Not interesting, but it does something — it's a job that does something. And a CronJob is extremely similar to a Job, except for one field — I don't know if I can show you, but right here: the schedule. If you've ever seen a cron schedule on Linux, it's the exact same thing. You have five values that you set, and this is what they mean: the first says what minute it should run; the second, the hour; then day of month, month, and day of the week. I don't know why day of the week is the last field, but it is — I think what happened is they initially had four, then went, you know what, we need one more, and added it at the very end. But the idea is you set those values, and the cron scheduler starts your job at every interval matching that spec — if you leave all of them as asterisks, it runs every minute of every hour of every day of every month. Okay, there are some other things you need to know about running Jobs on Kubernetes. There's a field called spec.completions: if you set it greater than zero, Kubernetes will run your job that many times to completion. If you also set parallelism to a number greater than zero, Kubernetes will spin up that many pods at a time until the number of completions is met. Next: you can set just parallelism and no completions, and Kubernetes will spin up that many pods of the job at the same time — and you might have those pods get the work they need to do from some sort of work queue, maybe a message queue or Pub/Sub or anything that hands out tasks; the pods spin up, keep processing tasks from the queue, and when the queue is empty, all the pods exit and your job is complete. So now you all know the basic building blocks of Kubernetes batch.
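The Job on the slide is the classic pi example from the Kubernetes documentation; with the completions and parallelism fields just discussed added (the numbers here are illustrative), and a CronJob variant showing the five-value schedule, it looks roughly like this:

```yaml
# The classic pi Job, plus completions/parallelism for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 4        # run 4 pods to successful completion...
  parallelism: 2        # ...at most 2 at a time
  template:
    spec:
      containers:
        - name: pi
          image: perl:5.34.0
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
---
# The CronJob version: one extra field, the five-value cron schedule
# (minute, hour, day of month, month, day of week).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pi-nightly
spec:
  schedule: "0 2 * * *"   # 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pi
              image: perl:5.34.0
              command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: Never
```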
You can all just go off and build amazing batch platforms — thank you! Well, this might be you — and it was me: "I have no idea what I'm doing." Knowing just those two things, I had no idea how to actually build a batch platform. A few of the things that are — well, that a lot of people think are — missing from the batch API on Kubernetes right now: there is no way to set quota and budgeting to control who can use what, and up to what limit. You can set resource quotas on individual pods, but if you're sharing the same cluster among multiple people doing batch workloads, there's no way to fair-share those resources. If you have multiple tenants — which we want — we don't want every application taking up the whole cluster, and with the current batch/v1 features we can't really do that across multiple tenants. Flexible placement of jobs across different resource types based on availability: not all jobs require SSD storage, not all jobs require something like a GPU or a TPU, and right now, with the batch/v1 Job API, we don't really have a good way to say: okay, this job of mine requires a GPU, so schedule this pod onto a node with a GPU attached. You could do that with a node selector, but there's no way to flexibly place jobs — and there's also no clear way to say: this job does not require an SSD; don't schedule it on any node that has one. So holding back resources that other jobs might require — that doesn't exist in batch/v1 right now. Also, support for auto-scaled environments: if you're in a cloud that can scale your environment as needed, the batch/v1 API doesn't understand that environment all that well. So this is why Working Group Batch started working on a project called Kueue. Kueue is a Kubernetes-native job-queuing API. This is not a third-party project; this is not a company asking you for money or funding; this is part of kubernetes-sigs. It was started by Working Group Batch, but
it's a queuing mechanism that runs on top of Kubernetes and uses the native Job API, so you don't have to learn a whole different system — but it adds some knobs for you to control the things I said were missing from batch/v1. So the idea is: Kueue adds a new API and some controllers on top of your existing Kubernetes infrastructure. The Job API has a flag called suspend, which starts a job and then suspends it — it just turns it off, but the system still knows that those pods will have to run at some point — and Kueue flips that suspend flag off and on to start the pods. That's really all Kueue is doing, but it has some additional smarts to know how to schedule and make sure the pod goes to the right place. So it has quotas and policies for fair sharing among tenants, and resource fungibility: if a resource flavor is fully utilized, Kueue knows how to admit the job onto a different flavor that can satisfy its needs. For example, say you're trying to run a workload that requires 10 CPUs and 10 gigs of RAM, and your cluster doesn't currently have that flavor available — Kueue will be able to schedule your workload onto a different flavor, even if it doesn't fully match what was asked for. So one of the things we have is the ResourceFlavor: in a resource flavor you can define things like what kind of CPU, what kind of GPU, or whether you're using spot VMs. Spot VMs are VMs that are much cheaper but don't come with the guarantee of a certain number of nines of SLA: you get the VM for, say, 10% of the original price, but your cloud provider can take it away any time there's more demand elsewhere. This way you get resources a lot cheaper, and because a batch job isn't a continuous service like a web server, you don't care if the node goes down for, like, two minutes during the day. So there are a couple of new concepts that Kueue adds to your Kubernetes cluster.
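The suspend flag mentioned above is an ordinary batch/v1 Job field, nothing Kueue-specific; a suspended Job looks like this (names and image are illustrative):

```yaml
# A Job created in suspended state: its pods don't start until something
# (Kueue, or you, via kubectl) flips suspend to false.
apiVersion: batch/v1
kind: Job
metadata:
  name: deferred-work
spec:
  suspend: true          # Kueue flips this to admit the job
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo doing the work"]
      restartPolicy: Never
```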
Number one is called the ClusterQueue. In a ClusterQueue you define things like how much CPU you're giving as a limit, how much memory, what kind of GPU, storage of what kind — you can define all of those things. Then you have another thing called the queuing strategy. Right now there are only two implemented: best-effort FIFO and strict FIFO. Best-effort does first-in-first-out, but: say you have a workload that is really big, asking for hundreds of CPUs, which is not currently available on your cluster. While your cluster is figuring out how to get the added resources, maybe you have another workload that's only asking for five CPUs and five gigs of RAM — that one can kind of bypass. Imagine you're in a grocery store and you've asked the clerk for some item that's not available right now, and there's someone behind you just holding a loaf of bread they want to buy. Would you make that person wait until your item arrives, or would you let them pass and go do their thing? That's what best-effort does. Strict is the other one, where you say: you know what, no — I'm getting my item before anybody else passes me. So strict is the process where the work that came in first gets done first. You can do both of those right now in the job queuing; we're also looking at adding other queuing behaviors, but these are the most asked-for so far, so they're the ones implemented. The next concept Kueue adds is called the LocalQueue. This is a namespace-bound queue on your cluster. Think of a multi-tenant system where multiple people share the same cluster, each isolated in their own namespace; each of their workloads then targets a specific LocalQueue, which in turn targets a specific ClusterQueue.
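Pulled together, the objects just described might look like this — names and quota numbers are illustrative, and the API group/version is from recent Kueue releases, so check against the version you actually have installed:

```yaml
# Sketch of the Kueue objects described above.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queue
spec:
  queueingStrategy: BestEffortFIFO   # or StrictFIFO
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 20
            - name: memory
              nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue       # namespace-bound; tenants submit here
  namespace: team-a
spec:
  clusterQueue: team-queue
```

A Job then opts in by pointing at the LocalQueue with the kueue.x-k8s.io/queue-name label on its metadata (earlier Kueue releases used an annotation of the same name); everything else in the Job stays the same.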
So within the same cluster, you can think of having multiple ClusterQueues scheduling the workloads, and each LocalQueue then competes for the same resources, and Kueue understands how to schedule them fairly and give each of them resources. One of the problems that can happen with a Kubernetes cluster: say you have a cluster with 20 gigs of RAM and 20 CPUs, and I have a workload that requires 15 CPUs, and Kathleen has a workload that requires 15 CPUs as well, and we both run our workloads at the same time. What Kubernetes will try to do is start any pod it's possible to start — but if neither job gets to start all its pods, both jobs spin up a few pods and neither can actually complete its entire workload, so you end up in a deadlock-like situation. With Kueue, the ClusterQueue only admits, first, a job that can fully complete — that way your workloads won't get into a deadlock situation. Next — this slide is very small, but if you read it, it's actually your basic Job definition; the only thing you have to add to use Kueue is this annotation here. The annotation just points to the LocalQueue of your namespace; everything else is exactly the same as the regular Job object you'd create on Kubernetes. The Kueue controller then picks that up, and the moment you create that Job, it puts the Job into suspend mode until the LocalQueue can talk to the ClusterQueue to schedule your workload; then the suspend flag is flipped to false, and your pods can start and do the thing they need to do. All right, so that's Kueue — we're going to see an example of that in a second — but let's talk about the other pieces of our basic batch infrastructure: batch monitoring. We need to understand what's happening in our cluster — how much resource
we're using, the queue and job status, what's happening in our jobs, what's happening with the queue, and, if you're getting a lot of load, what you can do about it. So, enter Prometheus. It's no surprise to anyone: this is the way we monitor many Kubernetes things, and it's the de facto standard for most cloud native applications. The challenge with batch workloads and Prometheus is that Prometheus is a pull-based system: the collector queries the application to get metrics out of it. Batch jobs can die, or simply end, at any time. So if Prometheus is scraping every 30 seconds and your batch workload starts and ends before Prometheus pulls anything, all of a sudden Prometheus has no information about your job, because it fell inside that window where Prometheus wasn't querying your application, and you have a chance of missing important metrics. The Prometheus folks are very smart people, so they figured out how to solve this problem with something called the Pushgateway. For most long-running applications, the Prometheus collector pulls information out of the application; for short-lived jobs, in our case batch jobs, we write the code to push our metrics to the Pushgateway, which Prometheus can then query whenever it needs to. So even though it's a challenge, it's already solved; you don't have to worry about it. Batch storage: this is where it probably starts getting fairly complicated. Anybody coming from the world of HPC has a favorite storage option, Lustre, Gluster, and we have people from Ceph here, so there are a lot of ways to store things. I'm going to talk about them in principle; each category has different storage options, both internal to cloud providers and external, from other service providers. Why does batch need storage? Mainly because we cannot do all the computation in memory. It would be nice if I could have, I don't know, 100 terabytes of RAM and keep everything I needed there. On Google Cloud we do have an option to get up to 11.5 terabytes of RAM on a VM; very expensive, probably not worth it, so we don't want to do that. We want to use commodity, cheap hardware to do a lot of work by scaling horizontally, not by getting a bigger and bigger machine. Another reason you need storage: you sometimes want to write state as a checkpoint for fault tolerance. We'll talk about fault tolerance in a second, but the basic idea is that your work can fail at any time, so at certain intervals you write down the current state of your work to a file, externally, and if your pod dies for whatever reason, when the application comes back online it can just resume from the checkpoint. It's like a video game: you did a quick save before you fought the boss. You also write intermediate state as a way to pass data. When you're doing ETL work, the transformed state gets written down somewhere so that the load stage can take that data and do something with it. And finally, you write output for further use: if you're building an ML model, that model is the output you want to write to storage so that someone can take it and do something cool with it. Now, the different types of storage you can think of. Local storage: Kubernetes has a way to attach hard drives or SSDs to the node itself, and through the Kubernetes API you can access them directly. The only challenge is that it's not shared across the cluster; each node only sees its own storage. Next is file storage, anything like your NFS-type volumes. On Google Cloud we have Filestore, Amazon has EFS, Azure has its file offering, so you have different options, and on top of that, systems like Lustre are also file systems that work over the network. Then there's object storage: your classic S3 bucket or GCS bucket, where you write information as objects, files as objects. The benefit of object storage is that you can open up, I don't know, a thousand different buckets, and each of them gets its own upper limit of network bandwidth. With a file store, all the clients share the same network bandwidth of your NFS volume; with object storage, if your write limit is one gig per second and you have a thousand buckets, all of a sudden you can write on the order of a terabyte per second. Finally, block storage: this takes a file apart into individual blocks of data and stores the blocks across multiple devices. From a user's point of view, block storage just looks like file storage, but internally it's blocks, and there are interesting things people are doing there; talk to the Ceph folks and others who know this a lot better than I do. Products like Google Cloud Persistent Disk or Amazon Elastic Block Store give you block storage for your batch workloads. All right, so this was the picture we saw before of a standard batch platform: jobs, a job queue, a scheduler, shared storage for input and output, worker nodes, and a monitoring system. And this is what it looks like in the world of Kubernetes: you have the Job, same name; you have Kueue, the missing piece we're trying to add to the Kubernetes ecosystem; you have the Kubernetes scheduler doing the scheduling; you have the worker node pools that do the scaling of worker nodes,
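As a rough sketch, the Kueue objects described earlier look something like this in YAML (the API version, flavor name, quotas, and queue names here are illustrative; check the Kueue docs for your installed version):

```yaml
# ClusterQueue: cluster-wide quotas plus the queueing strategy
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}             # admit workloads from any namespace
  queueingStrategy: BestEffortFIFO  # or StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor          # assumes a ResourceFlavor with this name exists
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 200Gi
---
# LocalQueue: namespace-bound, points at the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: team-a
spec:
  clusterQueue: cluster-queue
```

A Job then opts in just by naming the LocalQueue via the `kueue.x-k8s.io/queue-name` annotation (a label in newer Kueue versions), and Kueue flips the Job's `suspend` flag once it is admitted.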
Prometheus to monitor everything; Kubernetes autoscaling to scale the node pools up and down; and storage options like GCS, Filestore, or any other kind you can think of. All right: batch job types. What different kinds of batch jobs can you run? I don't know if these are official terms; I tried to group them into categories that made sense, and I didn't find anything anywhere that said these are the definitive batch job types, so I might be missing some, but these are the categories you can think of. Ad hoc runs are jobs you start yourself, manually: okay, I want to train this ML model, I press this button and the training starts. Same idea for drug discovery, genomics, data analytics; you're kicking all of these off yourself. It also covers, say, a user calling kubectl, or the Kubernetes API from some programming language, and Kubernetes spinning up the job; straightforward. Storage and other pieces exist too; we're not talking about those right now. Then there's reacting to events: someone uploaded a file to GCS and I want to generate a thumbnail from it, or take a video and do some transcoding on it; all of these are reactions to certain events. Or a new user signed up, so we need to send them email and a lot of marketing stuff, because people seem to love that; you do that as a batch job. Or your CI/CD pipeline: CI/CD is a batch workload. No one likes to think of it in those terms, but CI/CD in reality is just a batch workload, right? It's a pipeline that does things and shuts down when it's done, which is the definition of batch. So what it looks like: you upload a file, that creates an event, Kubernetes picks the event up through some mechanism (Kubernetes doesn't have a built-in event mechanism right now, but you can use a third-party tool to catch events) and starts a job that processes the file, writes it back to storage, done. The next one is run on a schedule: your classic cron job. Report generation, analyzing periodic data; every quarter, take all the business data and generate a nice report for your manager. The Kubernetes CronJob does the thing; things happen; pretty cool. Fan-out/fan-in is where you take a workload, spread it across a bunch of workers that each do their piece, and then finally join the work into a single output. It looks like this: Kubernetes spins up a job, that job spins up a bunch of other jobs, and those jobs combine their results in a final job that produces some output. Message passing, distributed work: this is where my HPC folks will cheer and rejoice, because it's the bread and butter of a lot of HPC workloads. You have MPI, the Message Passing Interface, where multiple distributed nodes behave as if they were a single machine, because they're talking to each other over the network all the time. If you have a workload where each step depends on previous steps, or on all the other steps at the same time, you basically need to do MPI-type things. For example, a structural simulation where you want to figure out whether a building will fall over if something heavy hits this corner, or weather modeling, where you want to simulate what happens across the entire space; you need all the moving pieces communicating with each other to get the full picture. It looks like this: Kubernetes spins up a bunch of jobs, those spin up more jobs, and each job talks to the others, sending data over the network, TCP, UDP, what have you; combined, they produce the final result. All right, so let's design a batch workload together. We're going to do the classic: sorting a file of a trillion numbers. Anyone want to guess how big a file of a trillion numbers would be? No guesses? All right. First of all, why do we even need batch for this? Sorting is a simple problem we all understand to some degree: numbers from one to ten in random order, put them in the right order; we've got that part, great. Now, a trillion numbers: if you use hexadecimal encoding, the 0-to-f kind, each number is about 16 bytes, so a trillion of them is something north of 15 terabytes. I don't know about you, but my Mac only has 16 gigs of RAM, so unfortunately I can't load 15 terabytes of data to sort it. That doesn't work. If you have a machine that can, talk to me; maybe Justin's homelab has 15 terabytes of RAM, but I don't, so I have to use multiple machines to handle data that big. That's why we need batch. Possible solution number one: push all the data into some database and run a query, select star, order by ascending, and the database hands me all the data back. That'll work, technically. Another option: push the data through a message queue with subscriptions on the other end acting as buckets; the queue gets number zero, it goes to the subscription looking for that range, and I can divide the numbers into buckets, sort each bucket individually, and then the whole set is sorted. That'll work; that's fine. I could use some sort of external sort: produce multiple sorted files and then do an external merge to join them into a single file. And the last one, probably the most complex and the best solution, is some sort of MapReduce job. MapReduce as an idea is not new; it's about 20 years old at this point.
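The bucket idea behind both the message-queue and MapReduce approaches is just range partitioning: route each value to the bucket that owns its range, sort the buckets independently, and concatenating bucket 0, bucket 1, and so on already yields a globally sorted sequence. A toy in-memory sketch in Go (the real thing routes values to pods over the network; names here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// bucketFor routes a value to one of n buckets by range, so every value
// in bucket i is smaller than every value in bucket i+1.
func bucketFor(v uint16, n int) int {
	width := (1 << 16) / n // assumes n divides the value space evenly enough
	b := int(v) / width
	if b >= n {
		b = n - 1
	}
	return b
}

func main() {
	data := []uint16{40000, 12, 65535, 300, 9000, 1, 50000, 2048}
	const n = 4

	// Fan out: route each value to its bucket (a mapper or queue would do this).
	buckets := make([][]uint16, n)
	for _, v := range data {
		b := bucketFor(v, n)
		buckets[b] = append(buckets[b], v)
	}

	// Each "reducer" sorts its own bucket independently.
	var out []uint16
	for _, bkt := range buckets {
		sort.Slice(bkt, func(i, j int) bool { return bkt[i] < bkt[j] })
		out = append(out, bkt...) // plain concatenation is globally sorted
	}
	fmt.Println(out) // [1 12 300 2048 9000 40000 50000 65535]
}
```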
Google published the MapReduce paper back in 2004, and that white paper laid out the initial idea; working through this problem myself, I kind of stumbled onto the same solution while trying to figure it out. So, the database solution. First problem: you need a database that can handle that much storage. If you go to your cloud provider and say I need a database for 15 terabytes of data, they'll say, sure, I just need your entire net worth to stand up a database that big, right? But you could do it. The actual sorting is trivial; we just write select star, order by ascending, and the result comes back sorted. The slow part is putting the data into the database and getting it back out, because you're limited by the network and by how many queries your database can absorb. So the picture is: put the data in the database, get the output; very simple architecture, fantastic; the challenges are the expensive insert operations, the expense of running and operating the database, and the expensive output operation. Those are all the red parts of the diagram. Next, message queue into buckets. Combining the sorted files at the end is trivial: once each bucket is sorted, you can just append this file to that file to that file and you have one fully sorted file. You don't even have to join them: the first billion numbers happen to be in the first file, the second billion in the second, and so on; sorted really just means I can find the next number easily, which you can easily do here. But pushing 15 terabytes of data through a message queue quickly is a challenging task; it will probably take down the network, and it will be slow, because you're going over the network to send the data to the message queue, and the message queue goes over the network again to send the data to your individual workers; a fairly slow operation. The message queue version looks like this: the input goes into a message queue, which distributes it to individual workers, and they each sort their data and write to some output. That architecture works pretty well, but the queue is the obvious bottleneck, because you're pushing all your data through a single point. External sort: here you have room for a lot of parallelization, but you need to synchronize work at some point, because if you sort the file in chunks, you have to join the chunks eventually, and that join becomes your bottleneck. I ended up solving this problem using external sort; we'll see the example in a second, and the code too, if we have time. It looks like this: you have an input, you use a split operation to break it into smaller files, you hand those to different workers, each worker sorts its data, and then you merge everything into a single file. Your bottleneck lives mostly in the merge, because you're merging, say, a thousand files into one. Finally, MapReduce. It works really well for parallelizable workloads; writing it yourself is fairly complex, but plenty of tools, things like Hadoop or Ray, have MapReduce-style algorithms built in. What happens is you map a bunch of data onto a bunch of reducers; each reducer sorts its share and writes it to disk. It sounds, and looks, more complex than it actually is: you have a bunch of mappers, the data is spread across them, and each mapper sends each value to the right reducer. Each mapper knows: this number goes to reducer 7, that number goes to reducer 5. It's the same idea as the message queue, except instead of one queue you have n mappers, so you're scaling the queue out rather than having a single one. Each reducer sorts its data and writes its output; the first reducer holds the first bucket of numbers and the last reducer holds the last bucket, so the outputs, in order, are globally sorted; very much like the message queue solution. In this case the bottleneck is the middle, where everything talks over the network and every mapper has to reach every reducer; that can blow up your network a little if you're not careful. All right, before we go on, let's do some demos, because we're running out of time. The code is in a public, open source GitHub repo you can play with yourself. I have a bunch of Go applications: to sort a terabyte of data I first need to generate it, so there's one Go application that generates a bunch of data, one that joins those files into a single file, one that takes a big file and splits it into smaller files, one that sorts each of those files, and a final one that merges a bunch of sorted files into a single sorted file. Those are the five applications. And, assuming the internet is still working, hopefully, maybe, okay: I have a VM here, and a Filestore instance where all the data will go; the VM is just so I can show you the files actually exist somewhere. I'll connect to it. Right now there are no pods in the default namespace, and this cluster already has Kueue installed, with a LocalQueue and a ClusterQueue; I can kubectl get the local queue and the cluster queue, and I'm watching the ClusterQueue. Right now there are no pending workloads, because I haven't started anything, and there are no pods, because, again, nothing is running so
far. So let's look at this; cd into the volume. Is the font big enough? Should I make it bigger? A little bigger; you can always go bigger, I think. Okay, if I do ls you can see I already generated some files, so I won't have to do that again, but here's what I ran: job-generate. First, it's a Job; let me make it a little bigger so you can see. The Job is named generate, and there's the annotation targeting the LocalQueue, so I'm using Kueue here. It has 10 completions and parallelism set to 10, so all 10 pods run at the same time. There's some security stuff you have to set, and there's a volume, a PVC, that I mount so I can write my data to that space. There's a container image; it's public, and the code is public too, I'll share the link in a second. I pass an argument that says generate 10 million numbers, so each pod generates 10 million numbers, and I'm using an Indexed Job in Kubernetes, which passes a variable called JOB_COMPLETION_INDEX that runs from zero up to completions minus one, so my pods see the values zero, one, two, three, up to nine. Once that job ran, you can see here, where did it go, of course, here: I have generate-0, generate-1, up to nine. Then I ran another job, job-join, which is similar, but it's a single pod that joins every file whose name starts with generate into a file called join.txt. You can see it right here; if I do ls -lh, the final file is about 1.6 gigs. Again, for time constraints I'm doing this much smaller; the run I did at home was about 300 gigs and took a couple of hours. Okay, let's run the next step, splitting the data: I want to take this big file, join.txt, and split it into 10 small parts. So there's job-split, which is similar, but it takes the one file and splits it into 10 parts. I apply it, kubectl apply -f job-split.yaml, and the moment I do, it's going to be quick, so watch the pane at the bottom: right now there are zero admitted workloads; the moment I run this, you see the job batch/split created, a new pod starting, container creating, and the workload is immediately admitted, because Kueue has nothing queued right now, so it can take the job straight away. The job is running; it takes a few seconds; it's doing the thing; and it's finished. So we can go back here, do an ls, and you see a bunch of new files, split-0 through split-9. It's the same data we generated before, but I wanted to show that if you start with one big file, you can split it into smaller parts. Next I want to sort these individual files using job-sort; close this one, close this one. What it does is take every file matching data/split in the folder, sort that data, and write it out to data/sort_ plus the index. Let's run it: kubectl apply -f job-sort.yaml. This spins up 10 pods at the same time, because I'm running them in parallel; I have 10 unsorted files, I run the job, and you can see another workload admitted immediately, because my queue was empty. A job with n pods counts as a single workload, because they're all part of the same job. They're running now, each pod sorting about 160 megabytes of data; some of them are done; all of them are done. So I can go back to my terminal, do an ls, and you'll see a bunch of these sort files here. The final step is joining all my sorted files into a single file, and there's a final job for that, job-merge. It looks for the pattern sort: any file that starts with the word sort, take them all and join them with an external merge sort. I'll show you the code, it's kind of interesting, but let's run it first: k apply -f job-merge.yaml; k is just short for kubectl. A new pod starts, called external-sort, and right after it starts it goes to suspended, because Kueue puts it on suspend, and when the queue has room it actually runs. This pod takes about a minute, because it's merging 10 files at once. Here's how it works, real quick: this is the function mergeSortedFiles, and it takes the names of all the files. Is it too dark? It's fine? Dark mode is okay. There are 10 files, so I create 10 file readers; for each one I open the file, read its first value, and put that on a heap. The heap always gives me the smallest value; every time I pop the smallest value, if the file it came from still has data left, I read that file's next value and push it back onto the heap. So it's an external merge sort where I never load an entire file into memory. The algorithm works the same whether I have two gigs of data, 20 gigs, or 200 gigs, because it never holds more than one value per file in memory; just 10 values at any given time. And while I was showing you that, oh yeah, it just finished. Let's see if the file is actually sorted: if I go back here and do ls -lh, you'll see two files, join.txt and merge.txt, and merge.txt should be sorted
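The merge just described is a classic k-way merge with a min-heap. A compact sketch in Go, using sorted slices in place of file readers (the function and type names are illustrative, not the repo's actual code):

```go
package main

import (
	"container/heap"
	"fmt"
)

// item tracks the current head value of one sorted source.
type item struct {
	val int
	src int // which source the value came from
}

type minHeap []item

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].val < h[j].val }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// mergeSorted merges k already-sorted slices while holding only k values
// in memory at a time -- the same shape as merging k sorted files.
func mergeSorted(srcs [][]int) []int {
	pos := make([]int, len(srcs)) // next unread index per source
	h := &minHeap{}
	for i, s := range srcs {
		if len(s) > 0 {
			heap.Push(h, item{s[0], i})
			pos[i] = 1
		}
	}
	var out []int
	for h.Len() > 0 {
		it := heap.Pop(h).(item)
		out = append(out, it.val)
		if pos[it.src] < len(srcs[it.src]) { // refill from the same source
			heap.Push(h, item{srcs[it.src][pos[it.src]], it.src})
			pos[it.src]++
		}
	}
	return out
}

func main() {
	fmt.Println(mergeSorted([][]int{{1, 4, 9}, {2, 3, 10}, {5, 6, 7, 8}}))
	// [1 2 3 4 5 6 7 8 9 10]
}
```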
so let's make sure. I can do, hold on, let me grab the command, I forgot it; okay, that's the command. sort is a Linux command that exists, so I can just run it to check whether the merge file is sorted; I delete the d here, point it at the test data, and this will take a second, because it's about two gigs of data and sort has to walk through every value to see whether the file is in order. If it prints sorted, my sort algorithm actually did the thing; we'll come back and see it in a second. One thing you might be wondering: running these jobs was kind of annoying, right? I had to do one thing, wait for it to finish, then do the next. Kubernetes by default has no concept of a workflow; you can't say, when this finishes, do that. For that behavior you have one of two options, well, one of many. The first is a simple bash script: kubectl has a wait command, so you can wait for a certain condition before moving on. You could write a script that applies job-split, waits for the complete condition, then runs job-sort and waits, then runs job-merge and waits; and you keep doing that until the whole task is done. The other option, another of many, is running something like Argo, Airflow, or Tekton, which let you define pipelines. Argo and Tekton are usually thought of as CI/CD tools, but Argo also has a project called Argo Workflows where you define things like: do this, and once it's done, do that, and once that's done, do the next thing. To define an Argo workflow you use something like this, where the Workflow defines multiple steps: a split step, a sort step, and finally a merge step. I'm going to run that, but first, okay, this printed sorted, so the file was sorted; great. Let me delete some files: rm -rf anything starting with split, anything starting with merge, anything starting with sort; ls -l, and you can see only the generated data is left. Now I want to run the same pipeline as an Argo workflow and see how it goes; hopefully well. One note about Argo: when it spins up work it doesn't actually use Kubernetes Jobs, it uses bare pods, and the Argo controller takes care of everything. So you still have the problem of not being able to control queueing, because Argo handles that itself; if you have multiple Argo instances on the same cluster you can get into trouble. It's still a good mechanism, though, and we're in talks with the Argo folks about using something like Kueue as the queueing mechanism underneath instead of Argo's built-in one; in the future you might use Argo with Kueue under the hood, which would be fun times. So, argo submit on the sort workflow. This won't touch our LocalQueue and ClusterQueue at all, because Argo doesn't understand them, but it does know how to spin up pods, and each pod shows two of two containers, because one is my job and the other is the Argo sidecar doing administrative tasks. The split task ran; in the meantime I can show you the Argo dashboard. Where is the Argo dashboard?
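An Argo Workflow for the split, sort, merge pipeline might look roughly like this (the image name, entrypoint, and arguments are placeholders, not the exact manifest from the talk):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sort-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: split                # step 1: one pod splits the big file
        template: run
        arguments:
          parameters: [{name: cmd, value: split}]
    - - name: sort                 # step 2: fan out, one pod per chunk
        template: run
        arguments:
          parameters: [{name: cmd, value: sort}]
        withSequence: {count: "10"}
    - - name: merge                # step 3: fan in, single merge pod
        template: run
        arguments:
          parameters: [{name: cmd, value: merge}]
  - name: run
    inputs:
      parameters:
      - name: cmd
    container:
      image: example.com/batch-sort:latest   # placeholder image
      args: ["{{inputs.parameters.cmd}}"]
```

Steps in the same inner list run in parallel, while each outer list entry waits for the previous one, which is exactly the run-then-wait behavior the bash script emulates.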
So if I refresh this page, you'll see the sort workflow started: it did the split step, it's now generating the numbers zero through nine, and the next step will start 10 pods. The same thing I showed you job by job, Argo shows in a nice little dashboard doing the same exact thing: it runs 10 sort operations, and once they're all done it runs a merge that joins the 10 files back into a single file. Same objects, same exact thing, but instead of run a job, wait for it to finish, run a job, wait for it to finish, I let Argo handle the whole workflow as a single task. All my sorts are running; I'll let this finish. I have about 10 minutes, so I'll move on to the last bit of the talk, and we can take a couple of questions while this completes; I'll show you the result in a second. So: running batch workloads on Kubernetes at scale. We just did it, at scale, so technically we're done, but let's talk about a few more things. Design. The number one thing about running batch workloads on Kubernetes at scale: if you can design the workload to be loosely coupled, that works best. Anything that requires a lot of network communication between workers will never match the performance of HPC on bare metal with InfiniBand; no matter how good your network gets, it will always be slower than physical hardware talking directly over the wire, right?
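In miniature, the loosely coupled shape that scales best is a fan-out/fan-in: independent workers that never talk to each other and only meet at a single join point. A toy Go sketch, with in-process goroutines standing in for pods (names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares fans work out to independent workers and fans the
// results back in at one join point; workers share nothing.
func sumSquares(nums []int, workers int) int {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ { // fan out: independent workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				results <- n * n // no worker-to-worker chatter
			}
		}()
	}
	go func() { // close results once every worker is done
		wg.Wait()
		close(results)
	}()
	go func() { // feed the work queue
		for _, n := range nums {
			jobs <- n
		}
		close(jobs)
	}()

	total := 0
	for r := range results { // fan in: single join point
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4}, 3)) // 1+4+9+16 = 30
}
```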
That next point is network is not unlimited so there is a lot of workload right now uses MPI that doesn't need to just because they can use MPI they usually just use MPI parallelization is your friends if you can create parallelized workload that doesn't depend on each other which is oftentimes possible some use cases not possible but most 90% plus use cases it is possible if you can do it Kubernetes is great because you can get parallelized and scale up your cluster to absurd amounts and limited communication between workers scales best so if you don't have to talk to each other you can just do their work that's fantastic optimize so this is one of the things we usually talk about you know what don't worry too much about like byte level optimization don't worry about shaving milliseconds of your work batch is one of the only places I will say do that because if you can say I save 5% of time by writing my algorithm better that is immediately 5% cost savings because you're no longer doing that 5% extra work in Kubernetes you can do the work and once the work is done you can scale down the cluster to nothing that means you're no longer paying for that resource and finally try to run jobs in co-location if you have a choice to run the workload in the same data center that is better because even if you had to talk over network now you are just talking on the same data center instead of going across scale Kubernetes scale down when not needed that's pretty obvious in Kubernetes especially in GKE we have an option to set up that I only one minimum of one node maximum of 100 when the work there is no work happening your cluster can scale down to one like just like barely nothing there is a limit on how much you can scale up to in GKE we can go up to 15k I think open source Kubernetes generated as about 5k which is a lot most people never need more than like 100 nodes but if you are one of those very few people that needs to spin up thousands of node for example a company 
called Fijiast use GKE to build their supercomputer and then needs about 50,000 nodes so one Kubernetes cluster wasn't enough they use 12 different GKE cluster to parallelize their workload to get 50,000 plus nodes at their peak and use spot VMs because you want to save money you don't want to and batch workload doesn't really care about running on VMs that has like four nines, five nines SLA node goes away it goes away you can restart the job that's fine monitor of course monitor cluster research usage but the next point is important because oftentimes people are like I'm gonna trace every call that happens in a batch workload every time you do any extra task that you're not supposed to you're using more resources unless you absolutely need to trace your every call I would recommend for batch workload don't do that because you build that application optimize it to the best capabilities and then you look at the cluster level monitoring for the information individual traces is not gonna give you much more limits so understanding individual limits for your Kubernetes cluster so Kubernetes has soft limit for both pod service node secrets and there's an upper limit that's defined but that doesn't mean you can do upper limit of every single one of them if you do the most part in a node that means you probably won't be able to do the most secrets in the same node you kind of have like an envelopes type system where the truth is if you're running all the things at the same cluster your upper limit is actually limited by other things for example HCD is a big bottleneck for Kubernetes it has a hard limit of about six gigs of total memory and then for individual type of objects if you have thousands of pods I think HCD has 800 megabyte limit on individual object type so if you have 10,000 pods each with a lot of metadata you might overrun your HCD limit on how many new pods you can create fault tolerance assume everything will fail that's just basically the fault tolerance 
guidance: it will fail. So you have to build your application in such a way that, when it does fail, you have enough checkpointing that it can restart without redoing the entire work. As you're building a batch application you have to rethink: rather than saying "I'm gonna write the best application," say "I'm gonna write the best application that can fail and still do the job." Lastly, just one slide about GCP and GKE and some of the things we're doing that I think will make running batch workloads on Kubernetes a lot better. We have a node limit of up to 15,000, if you're one of those few people that needs it. We have managed Prometheus with zero hassle: managed Prometheus collects all the information without you having to run, scale, and care for Prometheus yourself, and Google Cloud's managed Prometheus uses the Monarch database, the same database we use for monitoring things like Gmail, YouTube, and Google Maps, which are fairly big systems. Compact placement: within the same region, you can ask for your nodes to be scheduled in the same zone if possible, and if the zone goes down, your cluster will be rescheduled within the same region on a different zone. Usually your nodes get scheduled round-robin across zones (one, two, three, one, two, three); with compact placement you are asking GKE to put your nodes close to each other, so you get the co-location benefit and the regional benefit at the same time. Time-sharing GPUs: GPUs are expensive, but if you're doing workloads that require GPUs, you can time-share them and save a lot of money. Image streaming: with batch workloads, sometimes the images are really big (I have seen image sizes up to 22 gigabytes), and with image streaming you can start an application without having the entire image downloaded. Docker images are layered, so smart people on the team somehow figured out
how to start a container image without downloading the full image; you can just start as it comes. Finally, cost allocation: if you are running a GKE cluster with multiple tenants, you can use cost allocation to do chargeback and figure out which team used how much in resources; if you have a shared instance, you can do some cool things with that. Thank you so much. I'm Mofi Kursar on the internet, and you can ask me questions, and you can ask me questions for three minutes. So, any questions? (I've got the mic and I'll bring it around; I saw him first and then I'll come back to you.) And if we can't finish within the time (we don't have a talk right after this), I'm gonna be outside or hanging out here; we can talk more about your batch workloads if you have any, I'm interested in learning how people do their things. Question: for the queues, is there a way to configure priority and preemption and that kind of thing between the queues? Answer: it's being worked on. Right now the idea is just a best-effort, first-in-first-out queue mechanism; we're working on preemption, it's on the roadmap. Next question: it's not related to the queueing, but when you mentioned monitoring, you mentioned that Prometheus endpoint. I always thought it was limited in the amount of data you could send to it, that it was only for edge cases, but I might be wrong. Can you repeat that question, sorry? You mentioned that since these jobs are ephemeral, you cannot have guarantees for scraping them, so you were sending metrics to that endpoint. Oh, you're talking about the Pushgateway. Yeah, the Pushgateway. Yeah, so the idea is that when you're running batch workloads, you don't want to send all the metrics, because again, anything you do is an API call and that slows the thing down.
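The queue mechanism being discussed here is the Kubernetes Kueue project. As a rough sketch of how its pieces fit together (assuming Kueue is already installed in the cluster; every name and quota below is a placeholder I made up), a cluster-wide queue with a resource quota is paired with a namespaced local queue that jobs submit to:

```shell
# Hypothetical Kueue setup: a ResourceFlavor, a ClusterQueue with a CPU/memory
# quota, and a LocalQueue in the team-a namespace pointing at it.
kubectl apply -f - <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch-cq
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 100
      - name: memory
        nominalQuota: 400Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: batch-cq
EOF
```

A Job then opts in by carrying the `kueue.x-k8s.io/queue-name: team-a-queue` label, and Kueue holds it until quota is available.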
For the most part, one of the key things I want to see for my batch workload is how much resources it's using currently, because in some programming languages you can check your garbage collector and other things; just send that information, because based on that you can resize your job limits later. Other than that, knowing how long the job ran: it's not really the job's responsibility to send that information back out, so I wouldn't care too much about that part. Any other questions? Giffy, you said you had questions at the end; come on, we have one minute, we can do it, let's go. (Also, the next talk in here is until 4:30, so we can sometimes go a little over, I think.) Okay, sorry, we know each other but we don't know each other well. My background was HPC workloads at the University of Michigan, so I have an opinion. Within Kueue, is work being done on any additional schedulers? Or no, I shouldn't say schedulers: queue types. So, first-in-first-out, and then you have best-effort, but those aren't necessarily what most people in scientific computing are expecting. Yeah, so you can handle that situation in two ways. One is, if the workloads are so different that they shouldn't share the same queue, you can have multiple queues running on the same cluster; you can have a cluster queue that only deals with GPU stuff. On top of that, for the same type of workload, best-effort and strict cover 80% of the cases for the most part, and the other algorithms we need to handle, we're working on those. Okay, so I'm thinking: imagine you have 100 academics, and there is a queue, or several queues, holding several months' worth of work. Are these queues looking at packing as efficiently as possible? Because the default Kubernetes scheduler is not really good at that. Yeah, so Kueue is not a replacement for a scheduler; it's a queue that utilizes the Kubernetes scheduler. Plus, on top of that, you
still have autoscaling. So if you all of a sudden find that your queue pressure has grown to a certain amount, you could ask the underlying infrastructure to scale up more. That way, if you see that your queue size has grown to 100... we have knobs; it's not implemented yet, it's being implemented, but if your queue size crosses a threshold, we could tell the underlying infrastructure to spin up more resources. How much thought (and again, I haven't dug into the Kueue or batch stuff)... oh, if you want to learn more about Kueue, that's the link I didn't give; continue asking the question. So in terms of data partitioning and keeping queue A and queue B separate, are you just using namespaces? Are you using namespaces to partition people submitting different types of workloads? An example: could I, using Kueue, put HIPAA workloads in a namespace and say these are safe? So Kueue doesn't really take care of actual isolation on the namespace itself, because anybody on that namespace who has permission to see things will be able to see any workload on that namespace. What you can do with Kueue is queue-based isolation: on the same namespace you can have five different local queues, each of them concerned with different types of resources. Okay, so yeah, with Kueue I don't think we undertake the responsibility of actual data isolation at the job level. Okay. (I did something... okay, there you go. I don't know what this is. Oh, I think it kicks over to that automatically so we can see who's coming in next.) Awesome. Any other questions? I'll be hanging out here for more questions. JJ has a question: what about FreeBSD? Well, it's free; I don't know what he wanted me to tell you. Alright, so thank you so much for having me, and thank you to the organizers for giving me the opportunity to talk to you. Appreciate it; we'll chat after everything is done. Testing... oh, I can hear me, that's good. Really, it's time for us to get started, so welcome everybody, welcome everybody
to the cloud native track, or welcome back to the cloud native track. If you've been in the track for a while, you've heard a lot of talks about running cloud native stuff on Linux, so we thought it was about time for you to hear a little bit about running cloud native on a different OS: FreeBSD. This is going to be presented by Karen Brunner, who I've known for quite a while as fuzzyKB, and she'll show us that, including some demos. So welcome, Karen. Thank you, thank you Josh. How many people here have touched FreeBSD, done anything? Oh, almost everybody. This is a lot more fun when nobody... no, not really. I usually ask how many people have a Mac, run Mac OS X, and then I tell people that you're running FreeBSD, sort of. So actually, almost everyone in Silicon Valley has touched it in some respect or another: for Mac OS X they took a lot of the FreeBSD userland and some of the kernel and built it into Darwin. So I'm going to talk about cloud native FreeBSD, and again, most of you have touched FreeBSD, but maybe for the audience I will still continue. FreeBSD is, of course, an open source Unix-like operating system. It dates back to a rewrite of the original Berkeley Software Distribution; they rewrote it to get rid of all the AT&T proprietary code because of licensing reasons, and that happened in 4.4BSD-Lite, just over 30 years ago. FreeBSD is built on that, and so are a lot of the other BSDs, like OpenBSD and NetBSD, and forks of those. So why are we talking about cloud native FreeBSD? Well, it probably helps to talk about what cloud native is, and this is one of those definitions, not as contested as DevOps, but one of those things where everybody's going to describe it differently. But basically, at a high level, it's applications which were built to run on clouds, which were
built to run on, you know, cattle instead of pets. You don't have these nice little servers that you know by name, that all have their little quirks and you can say hi to. You're cloud agnostic, you can scale, you can do a lot of things you can't really do on bare hardware. More specifically, one of the most popular patterns is to use something like containers: these little bits that you run on your servers, and you can run lots of them, and they all act like little servers. You need to orchestrate these, so very often you use something like Kubernetes. There are other iterations, some proprietary: you have AWS ECS, their Elastic Container Service, which they use internally; OpenShift is Red Hat's branded Kubernetes; you have Nomad from HashiCorp, which is similar. Anyway, we're not going to be too specific about which orchestrator we're using. So that's what cloud native FreeBSD kind of means: instead of running on Linux nodes (we're not going to talk about Windows nodes either), how would you accomplish having these little mini servers that are easy to bring up and tear down on FreeBSD? So, a brief comparison to Linux first. We are at a Linux exhibition event, so I will at least throw some Linux in there. First thing: Linux is GNU/Linux; FreeBSD is not GNU FreeBSD. FreeBSD is using the original BSD userland, so ps options, sed, all these things that you use on Linux which act in a certain way: those are actually GNU userland behaviors, and the BSD ones are very different. Especially if you have a Mac: your Mac is using predominantly the BSD userland, so ps takes different options, sed doesn't work like you expect. You can install the GNU versions easily on FreeBSD; you can install GNU sed, which I end
up having to do, and you can install GNU make, which you pretty much have to do if you want to build anything. This also gets into... sure, it's confusing to people who go back and forth and maybe are not as familiar, but if you've been running on a Mac, you're probably pretty familiar with those differences. Let's go to licensing. Linux is using the GPL, the GNU General Public License, and this is kind of interesting; it caused a lot of issues as far as embedding Linux in devices for a long time. FreeBSD uses a BSD license, which is a very open license: as long as you maintain the copyright, you can pretty much do almost anything with the source code. You can't necessarily merge it back into FreeBSD if there are proprietary portions, but as far as using it for your own purposes, it's very open. It always confused me a little why Linux became so much more popular when it had such problematic licensing in some respects, but anyway, that's one difference. They both have cute mascots; the BSD one is a little cuter, but they do both have cute mascots. Linux uses systemd to manage and orchestrate processes on a server. I'm not a big fan of systemd; I don't know who you are, so the fact that FreeBSD does not have systemd may or may not be a feature. FreeBSD is still using the old rc configuration (I don't even remember what it stands for anymore). It's a little dated, it doesn't scale very well; there are conversations about making it more manageable, similar to systemd. I'm sure some people would just want to port systemd to FreeBSD; I hope that does not happen. systemd is kind of overkill from my point of view, and FreeBSD's rc system is at this point underkill. File systems: on Linux, I think most distributions at this point default to ext4, which has been around about 15 years now, and it works okay. On FreeBSD, ZFS, which came out of Solaris, is a
first class citizen. You can run ZFS on Linux, but it is not a first class citizen there; on FreeBSD you can even run root on ZFS (it takes a little extra effort, but it works). ZFS is very reliable, scalable, and high performance, and you get features like snapshotting and copy-on-write, so you can make copies of your little file systems and grow them, which is really useful when you're running, say, a containerized setup: you have a file system, and it doesn't take up much space as long as you don't diverge too much from the original base image. I won't go into all the ZFS features (I've forgotten half of them), but it makes stuff just kind of work in a much easier way, instead of using something like LVM (I don't even know, are people still using LVM on Linux?). Anyway: Linux, kind of cool; FreeBSD, super cool. Virtualization: for hardware-level virtualization you can use KVM on Linux; the analog on FreeBSD is bhyve. They both use the virtualization hooks in your CPU, it's just a different kernel implementation. bhyve is really nice, I've used it, it's great; KVM I've used, pretty good. With these you have a fully virtualized machine, so we're getting closer to the cloud native bit. Obviously you have these servers with lots of cores now, and instead of having servers with names you want to run a lot of little discrete processes, and you probably want to keep them in their own space so they don't eat up all your system resources, so they don't write to parts of the file system you don't want them to write to, so they don't hijack your network, whatever. Linux does this mostly with cgroups, control groups, which run in the kernel. You're able to assign processes to control groups, so you can say: this control group's processes are limited to one CPU and maybe two megs of RAM before the out-of-memory
killer comes in. Namespaces are different, but they're grouped into what we think of as containers, which also use namespaces: you can namespace your network and keep it partitioned from the actual system network. FreeBSD has jails, and jails predate Linux cgroups by a number of years: I want to say jails appeared in 1999-ish, and cgroups, I think, were merged into the Linux kernel in 2007. Jails were containers before there were Linux containers, before there were Docker containers. I'll talk more about jails; they have their own section. (I can't draw; this is an AI rendering of the FreeBSD daemon in jail.) Anyway, FreeBSD jails are very special. Originally, instead of working at a process level like control groups, they work on a chroot, a change root: your root file system is no longer the system's slash, you move it somewhere else and limit the process's access to that part of the file system. So I'm going to do a very quick demo to show you what that looks like, maybe, and I'm going to do this horrible thing and you're all going to judge me, which I understand. Okay, I'm going to make a new directory; let's say I just want to run uname in here. So I will make this, and I will copy into my... oops, I did that wrong; I should not talk and type at the same time. Okay. chroot is really easy: I run it, give it my path, and give it a command to run; let's say I just want it to run uname. Okay, it gives me an error, because uname is dynamically linked and it wants a shared library. So I'll just copy that in and try again... another shared library. Okay, I could go through this forever, so I will instead do this (this is all part of the demo, I promise). I can't get a shell, because there is no /bin/sh in my chrooted file system. Okay. Well, jails are a little more complicated
than that; this is just the basics. You know, I can't really network out of a chroot, not easily, and they're not very isolated, they're not very secure, because if I copy the ps binary in (I have to mount the file systems, which I did; I have this cheat sheet of commands to cut and paste), then if I try this again, I can see every process on the server. Which works as long as I don't really care about security, or as long as I'm not trying to do anything weird to the process table, but it's probably not what we want. The concept of jails came about, in one sense, to make this much less leaky. They're still built on chroot. Oh, and I am demoing... okay, well, I haven't demoed a jail yet. You saw the problem: these are dynamically linked binaries, and this would all be much easier if they were statically linked, because I wouldn't have to copy all these libraries over. There are things like that; you can do that. But if you want to be complete, you can just install all of FreeBSD into your jail file system. I will not run this (I don't know what the network is going to do), but in theory I could just take this nice jail file system, my new jail... well, I actually already have something. By default it goes onto the network: you give it a mirror, and it will do the standard FreeBSD install from it, downloading your base system as a tar file. I've done this, so I will go to it and treat it like it's an actual full server. It's like the root file system on a server; I have other junk on the actual laptop's root, but it's just your standard file system.
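The two demo steps above (hand-copying a binary and its libraries into a chroot, then installing a full FreeBSD tree for a jail) might look roughly like the following sketch. Paths are placeholders, and this assumes root on a FreeBSD system with network access:

```shell
# Minimal chroot: copy uname plus the shared libraries that ldd reports.
mkdir -p /tmp/littleroot/bin /tmp/littleroot/lib /tmp/littleroot/libexec
cp /usr/bin/uname /tmp/littleroot/bin/
ldd /usr/bin/uname                          # shows which libraries it wants
cp /lib/libc.so.7 /tmp/littleroot/lib/
cp /libexec/ld-elf.so.1 /tmp/littleroot/libexec/
chroot /tmp/littleroot /bin/uname           # now runs; a shell would also need /bin/sh

# Full jail file system: bsdinstall's jail mode fetches the base system from
# a mirror and extracts it into a directory instead of onto a disk.
mkdir -p /usr/local/jails/demojail
bsdinstall jail /usr/local/jails/demojail
```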
So I've also done some other stuff here; I will show you. The jail support is built into the FreeBSD kernel; you just have to enable it. I have done this, and I've actually created a jail. This enables my jails; it's going to use a cloned interface, which I won't get into (there's weird networking stuff), and I'm assigning the loopback one interface to the 192.168.254.0/24 subnet. So now I have to actually configure it: I have my files, and I also have to tell the jail what it's going to look like. I give it a hostname, I give it an address, I give it the path to the file system I created, I tell it to mount devfs (we saw that earlier: if you try to look at the process table or something, you need that file system), and then persist, which just tells it I want it to keep running. You can also, instead of keeping it running, give it a single process to run, and that's really great if you're running a web server or something where you're not going to go in and run a bunch of commands; you just want the service to be running. It's more efficient, and it's also more secure. So I will now start my jail. Yes... I had a demo with network collisions before, but I have it bridged (I haven't shown the bridge; I have this nice bridge which I have other stuff on, but it routes through it), and I should have picked a shorter name, but there we go. Oh yes, I have to... how did I do that last time? I don't remember. Let's go back here... yeah, I just need to create it and then give it the address. Another difference, I don't know if you noticed: most Linux distributions moved from ifconfig to the ip suite, while FreeBSD still uses ifconfig, which confuses me less (you have not been doing this for as long). Okay, now I'm trying to start my jail again. Yay, I have a jail!
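Put together, the configuration just described might look like this sketch (the jail name, path, and address are placeholders, and the cloned-interface setup is elided):

```shell
# Enable the jail rc service at boot.
sysrc jail_enable=YES

# Describe the jail in /etc/jail.conf: hostname, address, path, devfs, persist.
cat >> /etc/jail.conf <<'EOF'
demojail {
    host.hostname = "demojail.example";
    ip4.addr = "lo1|192.168.254.10/24";
    path = "/usr/local/jails/demojail";
    mount.devfs;
    persist;
}
EOF

service jail start demojail
```

Swapping `persist` for an `exec.start` command gives the single-process style mentioned above, where the jail exists only to run one service.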
I can look at the list of jails: there's my jail, there's the address. I can now run something in my jail, do something like this. Hey, that looks just like my current system, except for the hostname. And this is very similar to containers: you're not running a separate kernel, the jail inherits the kernel of the host system. Same thing in a Linux container: your container can come from an Alpine base, an Ubuntu base, you can roll your own, but the kernel in there is not going to get run. I can also, because this is a full install... ooh, that's money. It's still a file system, and this is useful. You've seen service, which is kind of the analog to systemctl if you use Linux. Jails are first class citizens in FreeBSD; they've been around that long, and they're really, really useful. The tooling is made for this: we saw bsdinstall, which you don't necessarily need to run for a jail; it's what you run when you install directly to hardware or to a virtual machine, but I just pass the jail argument and it knows I'm installing a jail. service usually operates at the level of your laptop, VM, whatever you're running, but it takes the jail option, and then it knows this is a jail and that you want it to run a process in the jail instead. Because it's still a file system, you can just stop the jail and put stuff in there, unlike a Docker container. Well, you could still do that with Docker, it's kind of ugly, but you don't want to run apt-get inside your running container; you want to run it when you build your Docker image. This makes upgrades easy; this system is set up for jails to be first class citizens. But I can also do stuff like stop the jail, and if I wanted
to, instead of pulling in a file from externally, I could run it... but oh, I didn't... I don't remember if I did this right. Oh yeah, I got the results. You can also limit the network: this jail is sharing the host network directly, but you can use something like VNET to create a virtual interface, which secures your network a lot more; it limits the jail to its own network, and the server's network will not be visible. Okay, so that's a jail, very basic and simple. I can stop the jail, drop a file in there directly from my host into the jail file system, restart it, and it's there. Okay, so what do we mean when we say making FreeBSD cloud native? It already has these containers, these semi-virtualized processes, so why do we want to do this? Let's talk about what the standards for containers are. There's the CRI, the Container Runtime Interface, plus one interface for network and one for storage, and these are specifications for plugins: if a plugin is going to call itself a CRI-compatible runtime, it needs to behave this way, these methods need to exist. You can write these in any language, as long as they follow the specification: take certain arguments, behave in a certain way, give certain outputs. So you have the runtime, the network, and the storage. The names are fairly obvious, but what do we mean when we talk about runtimes? A runtime, when you tell it "run my container," creates the container and the running processes, and assigns whatever is needed: if I want a network on my container, it calls the Container Network Interface plugin; storage is the same way. Okay, so what's next? Well, there's the Open Container Initiative, and these are standards
about what container images should look like. You know, Docker containers: Docker is not really the actual implementation of the container runtime anymore, but anyway, your images should behave in a certain way, and it doesn't matter if they're Linux images or Windows images or FreeBSD images; they just need to have certain files laid out in certain ways and be packaged in certain ways. So basically you have your runtime, which implements the Container Runtime Interface, you have an OCI tool, and then you have the actual container. For runtimes we're talking about things like (again, Docker Engine is not quite right; it's no longer standard in Kubernetes, which actually used containerd as a shim) containerd, which is nice and which you can run by itself. I think Red Hat's OpenShift uses CRI-O, and there's something like Podman, also out of Red Hat. There are a lot of them, but containerd is the new standard for Kubernetes, and containerd has been ported to FreeBSD, which is nice: you can actually just run containerd directly on FreeBSD and it works. Then you need your OCI tool, which knows how to manipulate your actual container objects. On Linux you usually use something called runc, "run container"; for FreeBSD jails you'd call it runj, and runj is an existing project. It was created by Samuel Karp, and it's got other contributors now, but it implements the OCI specification: you pass it certain options, it behaves in a certain way, and it ends up creating your object. Obviously, instead of creating containers we're creating jails. Of course, if you were running this on Linux, you could still have containerd; it would go to runc, and you'd spew out some containers. Let me show you something first. So, I was talking about runj, and I'm going to do a lot of
copy and pasting, because you don't want to see me type. I'm going to show you how runj actually works. I hope I have containerd... okay. We're going to use ctr, which is the standard containerd command line interface. I've already done this step, but I'll show you anyway. Samuel Karp has built an OCI-compatible image for FreeBSD, and he kindly hosted it on a public AWS ECR (Elastic Container Registry); ECR doesn't care what the artifact is, so it doesn't care that it's a FreeBSD image. It's a little cut and paste... I've already pulled this image. Oh, and my network isn't working; anyway, that's fine. Oh, I see, I made a typo... copied twice... yay. Okay, I did already pull it, so there's nothing new to pull. So now, if I were on a Linux system running containerd and I wanted to run a container, I could use ctr, and it would look very much like this. I already have this running... okay. So the command line here: run; I'm telling it specifically to use the runj runtime; this is the path specification. For the storage: ZFS, which we already talked about (lovely ZFS), is going to manage our snapshots for us. I think on Linux the equivalent would be btrfs, but we can do anything with ZFS, including using it as our container snapshotter. And then this is my image, my FreeBSD 13.1 release, and I give it my name and my command. So I'm in my container. I do not have a hostname, because I did not give my container a hostname, and I can only see my own processes, because the jail isolated that for me. So here, with ctr on containerd, I ran a container; it was a jail, but as far as containerd cares, it was a container.
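For reference, the ctr invocation shown on screen is roughly the following sketch. The image reference and the runtime name are as I recall them from the runj project, so treat both as assumptions and check the runj README for the current ones:

```shell
# Pull Samuel Karp's FreeBSD OCI image from public ECR (the exact reference
# may have changed; see the runj project for the current one).
ctr image pull public.ecr.aws/samuelkarp/freebsd:13.1-RELEASE

# Run it as a jail via the runj containerd shim, with ZFS snapshots.
ctr run --rm \
  --runtime wtf.sbk.runj.v1 \
  --snapshotter zfs \
  public.ecr.aws/samuelkarp/freebsd:13.1-RELEASE demo /bin/sh
```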
Okay: one of the really nice things about FreeBSD is that it has gone out of its way (and by "it" I mean the maintainers) to play really nice, to be a nice host to technologies from other operating systems. It had pretty early NTFS support; I'm pretty sure I mounted an NTFS file system on it at some point, read-write, put files there, and didn't corrupt it. I think it had a great ext3 driver, and I think the ext4 driver is read-only because of journaling, but I haven't tried this recently (I'll keep looking at that for you). A lot of this is nice, and one of the really useful things it did was make a kernel-level Linux emulator: as long as you are running an architecture-compatible Linux binary, it will run it for you. There are some caveats, though a lot of this is much nicer than it was the last time I did it, about ten years ago. It's those dynamically linked libraries again: there are now packages that will install the dynamic libraries from a couple of different Linux distributions for you, so you don't have to hunt those down yourself. Otherwise, binaries just run: you load a kernel module, you start the service, and it's okay; if you try to exec a Linux binary and it can't find a library, you'll see that. So this makes it possible to do some very interesting things: I could run a Linux container on FreeBSD. How would I do that? Well, go back to runj (and I don't know why this works, it just works). I've already pulled the image, but I will show you this, though it's really small, I forgot. Again, using my ctr command line, and I've already pulled this image so it won't be an issue: I'm doing the same thing I did when I was pulling the FreeBSD image,
telling it to pull an image, but this time I'm telling it specifically to pull an image for a Linux platform. I didn't have to say that when I pulled the FreeBSD image, because I'm on a FreeBSD system and that matched; but now I want to pull a Linux image, so I have to specifically tell it that, and then I give it the actual image I'm pulling: just the standard Alpine Linux container. So now I can do this really nice thing: I can run a Linux container directly on FreeBSD. Now, to FreeBSD purists... well, I don't know, this is kind of fun. ("Wait, so how is that running? It doesn't have a Linux kernel to access." It's the Linux emulator. "The Linux emulator, yep.") So I'm still using the runj runtime, but when it goes to execute the actual process, the kernel says: okay, this is a Linux binary, but I know what to do with those. And boom, I actually have Linux. So that's one possible path; I'm going to bump ahead a little bit and talk about the future. Doug Rabson has also done a lot of work, and Doug Rabson and Samuel are doing this on their own time; they're not paid to do this, they're excited, they want things to work. I specifically asked Doug Rabson, and as part of his answer he said: "My long-term goal is to make it possible to add FreeBSD nodes to a Kubernetes cluster. As it turns out, the control plane works just fine on FreeBSD already," and he has a working FreeBSD-only cluster in his lab. We have Windows nodes in Kubernetes, which are full-fledged, first-class (mostly) nodes that you can spawn containers on; why not FreeBSD, which is a much closer relative to Linux? And that's the thing: we've seen that runj gets us most of the way there; we just need some of
Some of the node processes still need to be ported, whether it's the kubelet or one of the projects that aim to replace the kubelet. The kubelet is the little process that runs on your Kubernetes nodes, the worker hosts where all your service containers run; it's what talks to the Kubernetes API, sitting between the control plane and your container runtime. So it makes total sense that FreeBSD should be a first-class citizen in the cloud-native world, and that would be great. I don't know what Windows uses for its little containers, but on FreeBSD you could make jails, and yes, you'd understand that they don't behave exactly like Linux containers, but hey, you're running on FreeBSD. You get these little discrete services that run processes, and if one dies, the orchestrator starts another one; it's a lot of fun and very useful. And as we've seen, if you have Linux emulation turned on, hey, you could also schedule Linux containers on your FreeBSD system. Obviously there are caveats if you're depending on some really low-level Linux facilities, but basically you can do it. That's what excited me when I thought about this two years ago, because I've worked with FreeBSD for a long time and I miss it. I was working much more in the Kubernetes world, and I kept asking: why aren't we doing this? Why aren't these two things the Reese's Cup, my chocolate and my peanut butter together? We've seen the revolution that containerization and orchestration have made, mostly on top of Linux; it's changed the game hugely. Almost nobody knew what Kubernetes was five or ten years ago, and now everyone is running it. So I say it just makes sense, and a lot of people are
excited. A huge amount of progress has been made, and I haven't done anything except talk about it, so I think it'll happen. And there's one more thing, unless people want to stop for questions. I'll show you it anyway: a jail built from the full FreeBSD file system, so it's kind of big. Docker containers can be really big or really small, and it's the same with jails; you can have a very tiny image. Going back to the chroot example, all I had was the binaries plus the dynamic libraries I'd copied in. I've already cheated and created this file; it may give me an error, we'll see. Okay, it didn't like that, so let me go back to my original. I can use this chroot demo that I created earlier, which just has a file system with a bunch of libraries and a few binaries in it; that can be my image. And this worked: run a jail off of that. Again, the jail doesn't need the whole base system, just what I'm going to run, and it looks very similar to a Docker container. There's also work on an OCI jail spec; I believe Doug Rabson has been working on that. He and Samuel Karp kind of came at this from different directions. Doug really wants Podman to be a full-fledged runtime on FreeBSD; Podman is basically Docker without Docker, but not by Docker. He's also gotten Buildah, the Podman equivalent of docker build, working. It would be nice to have all of this be more general; again, other people came at it from the containerd direction. But you can build these images.
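The tiny-image idea above, a chroot tree holding only a shell and its dynamic libraries, started as a jail and then tarred up as an image layer, can be sketched like this. The paths and jail parameters are illustrative assumptions, and the jail-specific steps are guarded to run only on FreeBSD.

```shell
# Sketch: build a minimal jail root, run a throwaway jail off it, tar it up.
J=/tmp/tinyjail
mkdir -p "$J/bin" "$J/lib" "$J/libexec"
cp /bin/sh "$J/bin/"                      # only the binaries we intend to run

if [ "$(uname -s)" = "FreeBSD" ]; then
    # Copy the libraries sh needs, plus the FreeBSD runtime linker:
    ldd /bin/sh | awk '/=>/ { print $3 }' | xargs -J % cp % "$J/lib/"
    cp /libexec/ld-elf.so.1 "$J/libexec/"

    # Start a jail straight off the directory (command= must come last):
    jail -c path="$J" host.hostname=tiny ip4=disable \
         command=/bin/sh -c 'echo hello from the jail'
fi

# Either way, the tree can be tarred up and used as an image layer:
tar -C "$J" -cf /tmp/tinyjail.tar .
```

The same tar-the-tree approach is what makes these jail roots distributable through ordinary OCI registries.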
He has some tooling to build them, or, if nothing else, you can copy files into your chroot file system and tar it up; that's standard tooling, and people are definitely looking at it. Oh, do I have it open? No, I don't. I still need to generate a list of my references, but I will definitely post those, probably in the next few hours. So anyway: we can make really small images, and you can distribute them just as we saw with runj and Samuel Karp's image. That one was full-sized FreeBSD, but it was an artifact on AWS's Elastic Container Registry, and I could pull it just like I pulled the Docker image for Linux. So this is just a snapshot of where we are. I've also read that somebody else has a different, kind of unrelated container runtime that hooks into HashiCorp's Nomad orchestrator. I think there are a lot of these corner cases that aren't out in public, but these are two of the more Kubernetes-oriented efforts, which is the interest I came to this with. I'm a terrible prognosticator, but I think that by the end of the year, especially given how much work Doug Rabson has done, we will have a pretty much full-fledged FreeBSD Kubernetes node. It may go beyond that; I have no doubt it will end up in the main Kubernetes repo eventually, and I think a lot of people are going to be really excited by that. I am. For people who, like me, have worked with Linux, work with Linux now, and miss FreeBSD, that's great news. I think it's going to help bring FreeBSD back into a more mature part of the server ecosystem, which is where this is going. Okay, that's it.
I have a Twitter handle, but I never go there; on Mastodon I'm @buzzykb@hachyderm.io. I also have a website where I've posted some tutorials on this, along with a lot of random stuff: productionwithscissors.run. Okay, do we have more questions? [Audience:] I know this might be a little beyond the scope of the talk, but at work we were a really heavy BSD adopter, even on the desktop. We had what was essentially a Windows terminal server, but it was FreeBSD with XRDP, and it actually worked pretty well; we started early, when things were just not there yet, and it got way better. Then we moved to the point, you briefly mentioned it, of using bhyve virtualization, and it's awesome. We used it for a Linux printer driver that I couldn't get working under the Linux emulator, and it worked great. But for graphics, I was wondering: is there news on GPU acceleration, particularly being able to share a GPU with a VM? I think that would be amazing, because there are enterprise providers that say, oh, this is only supported on Linux, even though it works on everything; our accounting system is like that. [Answer:] There are a couple of pieces to that, and I only know bits. Should I repeat the question, or did the mic pick it up? Okay. I may be mixing things up, but I believe Michael Dexter, and I don't want to give him the wrong title, is one of the leads on the bhyve project. I remember asking him several months ago, I think about GPU support, and I believe he said they're working on it. If you go to bhyve.org you can get transcripts of their meetings, and watch the video if you want. As for the Linux emulator, I don't know its current status; there are parts where it falls down, so I don't know if you could just run the driver and it would
work; I have no idea what the status is. It works really well, but for some commercial applications from specialized domains, I just don't know. [Audience:] I think Microsoft Office would be awesome to have, if there were a Linux version. [Answer:] Maybe. And again, bhyve is hardware-level virtualization, the same idea as KVM, where the VMs are managed at the kernel level. I can actually show you. So, jails have a command line that's still a little cranky, and there are lots of wrappers that make them much easier to manage; the same is true for bhyve, and I'm using CBSD to manage my bhyve VMs. I actually already have an Ubuntu server running locally, so, yeah, it's a full VM. It's as if you went to AWS EC2 and said, I want a Linux VM; boom, that's what it is. Obviously it's heavier weight running a full VM. But I need to restart, because CBSD does not like the laptop going to sleep. bhyve is really good; I actually created my first Kubernetes cluster on CBSD, using Linux on bhyve. I'll show you this once I restart. Any other questions? [Audience:] Since you can run the Linux jail, could you run the Kubernetes control plane in a Linux jail? [Answer:] You're not technically running a Linux jail; you can run most Linux binaries within a jail. Maybe. There's still some virtual network work to be done to make the spec fully support networking; a lot of people are actively talking about it, but I don't think we're there yet, and I don't know how Doug Rabson handles it. Okay, so I can log in: this is a VM running locally on my laptop, it's Ubuntu, and it's completely separate, a truly virtual machine, as opposed to the semi-virtualization of containers and jails. [Audience:] I actually have a question. Yes.
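For reference, the CBSD workflow for bhyve VMs mentioned above might look something like the following. The subcommand names are best-effort recollections of CBSD's documented interface and the VM name is purely illustrative; treat all of it as an assumption, and it is guarded so it does nothing where CBSD is absent.

```shell
# Sketch: managing bhyve VMs with the CBSD wrapper (names are assumptions).
if command -v cbsd >/dev/null 2>&1; then
    cbsd bls                # list the bhyve VMs CBSD knows about
    cbsd bstart ubuntu1     # boot a guest; behaves like any cloud Linux VM
    # ... use the VM ...
    cbsd bstop ubuntu1      # shut it down again
else
    echo "skipping: cbsd not installed"
fi
```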
[Audience:] So, jails and containers are not the same thing? [Answer:] No. [Audience:] Has anybody explored building a jail-and-ZFS-native orchestration system? [Answer:] Probably; in fact, I'm sure they have, though I don't know what it is. People have definitely built pieces of this; jails have existed for so long. I think pot, together with Nomad, does something similar as far as orchestrating jails, and CBSD will do jails too. What we think of as orchestration from the Kubernetes standpoint is a little different, but it will definitely help you automate managing jails at scale. [Audience:] I just wanted to throw something in for people to think about: you can do configuration management with things like Ansible and Salt, and there's no reason you can't take what you know from Ansible and Salt, use something like HAProxy and Salt reactors, and build your own version of all of this. You can tell it: build me a jail, configure it, install these packages, do all these things. So if you want to, you can go down that route and hand-build your own thing. I understand the appeal of using the standard tooling and joining the community, but I'm just not involved in any of those communities, so for me that's the route I'm taking, and it all works fine. [Answer:] Yeah, there are so many ways to approach it, and organizations and people should do what works for them; there is no one right way. As I said, I'm coming at this from the more Kubernetes, cloud-native angle, but that's not the only way, and you can pick and choose among these pieces. You can use the runj runtime without using Kubernetes, without an orchestrator at all. I'm very pragmatic; I'm a site reliability engineer. Use what works for you and your org. Any other questions? Thank you so much for coming, and thanks for the questions. [Host:] Sound okay? Oh, yeah. Testing. Thanks, everybody, for coming.
We'll get started with this last session of the day, though not the last day, of the cloud native track at SCALE. Our speaker told me to introduce him as just Coté, because that's what everybody calls him, but he does have the name Michael, and he has asked that we hold questions to the end. When we get to that phase, I'll run a microphone out to you so we can get it on the recording that we preserve after this. So thanks, everybody. Michael, or Coté? [Coté:] Hello. Usually I say there are only two people who call me Michael, my mother and my wife, but I've realized John over there calls me Michael too, so I'm going to start name-checking him every time I get introduced, which will be great. Thanks for coming during a dinner slot. I stopped by the Sheraton lounge, and if anyone's hungry, I brought these for you. They are oats, so that may not work depending on what you eat, but I'll leave them here in case you get hungry. So, I want to go over, well, you're going to find out what I'm going to go over, but it pretty much says it right there. There's this notion of a platform, or the application thing, whatever this stack of software is that your application developers use as their primary interface with your infrastructure and your stuff. And I'm not really technical anymore; I stopped programming in 2005, and now I make slides, which I'm much better at. So I don't really want to give much of a technical overview, because I'm not capable of it. What I can talk about is the practices people follow, the way they think about it, the lessons learned, and the best practices for being successful with your platform, if you'll pardon the word, strategy, your initiative. And before we get to that, I thought it would be good to talk about mayonnaise.
But really, what I'm going to go over here is this: whenever you're talking about something infrastructure-y or platform-y, I, having been an application developer, think it's always good to remember why you care about infrastructure at all. Now, I just told you I'm biased; I used to be an application person. But typically, the reason you have a platform, a thing that runs applications, is because you have software and application developers, and the reason you have applications is that people are using them to accomplish some task. I suppose you could have an application that no one uses, which would be academically fun, but not very useful. So as you think about how you get a platform in place, what it is, and how it's successful, it's always good to keep the use case in mind, the goal, the reason; remember that guy who talked about the whys and all that? The why that it exists. And that is where our mayonnaise friends come into play. Now, this is a story from a while ago, about a company we worked with back when I was at Pivotal; VMware has since bought Pivotal, so I'll go back and forth between VMware and Pivotal, although I'm pretty well trained, so I'll probably say VMware. They're a food services company, which means they will ship or deliver food to you, and they will also run your big commercial kitchens, whether it's a campus or whatever. I don't think they would come into your home and run your kitchen, but they'll run a kitchen for you. So let's consider: they're a business, right? There are three core things businesses want to do, among other stuff: make more money, make more profit, and retain and get more customers.
And generally the way they do that is they focus, in part, on optimizing costs, doing things as affordably, or cheaply if you prefer, as possible, not wasting, and on having a good customer experience. So let's go all the way down to the kitchen. Those are the big drivers from way on high; the people with big dry-cleaning bills have these priorities. And they thought: we've got these kitchens, we want a better customer experience, and we want to optimize them. Currently what we're doing is using three-ring binders of the recipes that the chefs follow, and obviously we're doing some digital transformation, so we need to put them on an iPad. That will make things more efficient, because with paper recipes you lose them, and you've got to ship out the new recipes each week or whatever the period is. It's important that everyone syncs up on the recipes, because you want consistency of the food, and presumably downtown at headquarters they've figured out the most optimal, affordable way to make each recipe. You don't have some guy going, well, what if we add truffles to it? You're limiting the expense of the meals. So it's actually very analogous to the way a lot of IT is managed: you have centralized governance that gets sent out and enforced, and you follow what's in the run book, or the recipe book, so to speak. And so they set off the application developers and gave them the mission: you've got to digitize these three-ring binders. Now, normally what would happen is the developers would say, sounds great, and they would go do that, and they would deliver it three to eighteen months later, depending on how good they were at getting software out. But instead, and again, keep in mind the goal we're shooting for, right?
Instead, these developers, because they were good and very focused on how they could make their software better rather than just doing what they were told, which, now that I think of it, is a great way to pitch to a developer: would you like to not do what you're told? Instead, what they did, if you can imagine, is they got up at three or four in the morning for a couple of weeks and went to the kitchen to observe what people were doing. Right away you're like, well, that's crazy. But the developer teams go there and look at what's happening in the kitchen. And again, they have the goal in mind; go all the way back up to the dry-cleaning people: we want to make more money, we want to eliminate waste and save as much money as possible, and we want a good customer experience, meaning consistency in the food. With that in mind, the developers say: that's great, they're using paper recipes, but we've noticed something. All these people in the kitchen spend a lot of time running around checking what temperature the mayonnaise is and what temperature the raw chicken is. They have the little thing on their sleeve, and they stick the thermometer in, that's what it's called, a thermometer, and then they write that down on a piece of paper in a book that's up on a wall somewhere, and maybe every few months some health inspector comes in and sees that they've tracked this stuff on paper. But they're spending a ton of time doing this, the kitchen staff, so the developers thought: maybe computers could help with that.
So what we'll do, instead of working right now on putting a recipe book on an iPad or a Samsung Galaxy Tab or whatever, is digitize that process instead, because that's our theory. And this is the first lesson: you're doing this kind of iterative work, trying to discover the problem, coming up with a theory of how you would solve it, and then running the experiment, and as a developer, all you can do is write code. I suppose as a developer you could not write code, but it's a little cheeky to claim that's the best option; if you're not writing code, you're ultimately going to be out of a job, which is kind of a bummer, so you should probably come up with a reason to write code. So, to stop rambling about this exciting story and get on with the rest of it: they digitized that experience of measuring the mayonnaise, and sure enough, after a few weekly cycles of tuning and observing, they found it saved the kitchen staff a tremendous amount of time. If you've been listening to the whole DevOps and lean conversation for the past fifteen years or so, you're already thinking: ooh, removing waste, success. You've sped up the time it takes to get the food made and the measurements taken, and you've freed up time to focus on other things: better quality, following the recipes, not being so harried and running around. I'm not really sure there's much concern about kitchen-staff burnout and things like that, but it's very similar; all the things you'd expect from the DevOps world about a more humane approach, the developers were trying to give to the kitchen staff. Now, eventually they did go on to digitize the recipe book and all of that, but you can see the point: the developers could go in there, and they had the time to actually hang out and study people.
They weren't stuck spending sprint zero figuring out this Kubernetes thing; instead, they went and figured out what their customers were doing and how that was working, and then every week they could come up with a new theory, actually test it, and see if it helped solve the problem that these people, you could call them customers, had. And that is the goal of everything I'm about to talk about: getting to the point where a team of developers can go in and have that kind of experience, purely focused on the problem they're solving. And again, it links all the way back, if people here care, to the business side, who want to know: what are all you computer people doing? Are you helping us achieve our goals? And sure enough, yes, it does help them achieve the goal. And I always like to note that the ultimate poor customer experience here is precisely that the mayonnaise and the chicken aren't at the right temperature. I don't know if you've ever had food poisoning, but it's not a fun customer experience at all, really. Though I guess it's a good excuse not to go to work, which is always nice. You know, I moved to a place recently where, when it snows, they don't give the kids time off from school, which is a big bummer. Back when I was in Austin, every time it snowed it was the kids' favorite time: no school to go to, and ice as well. But they're not really into that where I live. So here, let's get down to it. If we want to enable teams like that, let's identify one simple problem. Now, I neglected to get permission from Forrester to use their actual chart, so you'll have to forgive this, because I'm trying to be a good boy about things like that.
But basically, if I round up a lot of studies and surveys over the years on how frequently development teams release their software to production: can they actually release their software every week? Can they come up with a theory and get that loop in place? It depends on which survey you look at and how you interpret it, and once I get permission I'll use the actual chart and won't have to disclaim anything, but I would estimate that somewhere between 20 and 30 percent of teams out there can release their software weekly or faster. And I think, and I'm being very liberal here, or whatever the right phrase is, that for probably 50 to 70 percent of people it's a month or more, and for maybe half of those it's a quarter if you're lucky. So think about that mayonnaise case: it's just not going to happen if it takes you a month or a quarter to release your software. So, going down another layer, from the dry-cleaning people to the developers, right underneath that is what we've got to give the developers. They can have all this other stuff, but if they can't release their software, it's not going to work. That, as you'll see, is a large part of what I think is the most important outcome of why you do a platform, and what everything will revolve around. And then, just as an introduction, this is me. I've worked in many places; I worked with Matt Ray long ago. I found this recently, maybe you remember it: in 2005, which, like I said, was the last time I did something innovative, I won the research and development team innovation award. It says FY05, so that was probably calendar year 2006.
But I found it in my back shed recently. And boy, that's a Galileo something or other. I was young enough at the time that I thought it was cool, instead of asking, could you just give me the cash? But it was nice. So I was a programmer a long time ago. I worked on cloud and software M&A at Dell for a while, and I was an analyst at RedMonk a long time ago, and at another place. Now, for about eight years, I think eight, I've been at Pivotal, which of course is now part of VMware, on the developer advocate team, even though I don't really talk to developers nor advocate for them; I mostly talk with managers and executive types about this kind of stuff. But if you want to find out more, I have a podcast that I do and things like that; you can go to cote.io, and I'd appreciate you doing that, because I have to pay something like $35 a year for that domain name instead of four or five. I don't know why, but there it is. So first, let's start with: what is a platform? I'll try to keep this short, because you can go on and on with the definitional thing, and then you're just like, okay, but now what? But we'll jump into what a platform is, and if you've ever done application development, you kind of know already; we've had lots of names for it and lots of things that have served as one. In defining the platform, the other thing I can do is tell you why I think this discussion is relevant nowadays, why you hear phrases like platform engineering. A platform, and we'll get to more details, is basically the thing you run your applications on. And the reason it's pertinent nowadays is that we kind of go in cycles with platforms.
We get really good at making the best possible platform, one that everyone wants to use, and then we just decide to ignore it, forget it, and rebuild it from scratch, and the cycle repeats over and over again. You can go back further than Heroku, but Heroku is still kind of the best-in-class platform; everyone still loves it. Of course, the fatal flaw of Heroku is that it's too expensive, and I guess at this conference especially, something being too expensive is a big problem. That's ultimately what pulled it back. In fact, now that I mention it, that's a good theory for why we keep rewriting this stuff: people don't want to pay for it, which is a problem if you like money or need it. So, around 2015, and this was back when I was at Pivotal, there were a bunch of platform-as-a-service offerings, like Cloud Foundry and other ones I've forgotten. And that was another great platform. Even nowadays, and of course I'm biased because I talk with them, you meet people who still use Cloud Foundry, and they're like, oh man, it's great, I deploy my applications, everything works really well, love that platform. But we're installing Kubernetes, so, you know, memories. And indeed, what seems to have happened somewhere in there is that our friend Kubernetes came along and decided that, yep, it's time to restart everything. So right now, I think we're finally at the point with Kubernetes, after four or so years, where we have arrived, congratulations. Now you can stand up Kubernetes, go to your developers, and say: here's your blinking cursor, ready to go. Except that doesn't really work for developers, as numerous people have pointed out.
Some developers enjoy not going to talk to the chefs and instead spending a lot of time figuring out what to do with the blinking cursor, but most of them would rather say: here's my application, let's get going, and just actually start using it. So this is why platform conversations are happening a lot now: we're right at that point. We've gotten rid of the past few cycles of platforms, those are no longer cool, no one wants them, too expensive, and now we're at the very beginning of building out what this platform is going to be on top of our new friend here, good old Kubernetes, and other stuff. So finally, what are the components of a platform? This is from a CNCF working group, a working group of a working group, really, chartered to define what's in a platform; there's a newer version of this diagram, but I like this one better, so I'm going to use it. I think it's pretty good, and I like it because it means I can show you a vendor-agnostic diagram of what a platform is. But I guarantee that if you go look at any vendor's platform, including our own at VMware, it pretty much looks like this: you've got your infrastructure under here; you've got this middle layer, which we're not supposed to call a platform-as-a-service anymore; and then on top, and this is what's new and interesting nowadays, you have very developer-centric tools and collaboration stuff. That hasn't necessarily been part of previous platform revs we've been through. As you look through it, you realize: yeah, if you're running software you need all these things, hence the thrill of getting to reinvent it all the time instead of just using what was already there. But that's great if you're into having work, because what you get to do is build this entire platform on top of that blinking cursor.
Now, as you're out there building platforms, you want to keep a focus on who they're for and what their goals are. This is back from 2018, from the ThoughtWorks people. Just as a side note, the ThoughtWorks people are great at marketing; every time they came up with something, in my VMware-slash-Pivotal position I'd think, ah, why didn't we write that two years ago? We'd probably do a pretty good job just sending ThoughtWorks material to ChatGPT, having it rewritten, and posting it as our own. But they're fantastic at documenting things, and this piece from 2018 really captures, I think, the essence of what a platform is nowadays. And remember, it's very focused on developers: how do we make application developers' daily lives easier? That's the focus of a platform. Now, I mentioned there's this kind of new thing, which is why you hear the phrase platform engineering. This rev of platforms has a much closer focus not only on being for developers but on the tools on top of the platform that developers use; basically, all that Atlassian stuff we can pile on top of there. And that's where you encounter, let's see if my laser works, this phrase: internal developer portal, or IDP. Now, I think this is one of the most boring names I've encountered in my entire career. Portal. I don't know about you, but every time I hear the word portal, I don't want to be involved. But whatever, that's what it's called. And if you look at this diagram and then go back to the previous one, it's kind of the same thing, but you can pick out the more developer-centric pieces, like your build tools.
So in this rev of doing platforms, we're further trying to pull in as many different tools as possible that are used to get the software out the door, because it has that bigger focus on helping developers out. So that's as brief as I can possibly make a definition of platform stuff. Again, hopefully you get the notion, and especially if you need some references, I didn't mention this, but these are also platforms. Well, judging by the look of some people, you probably recognize many of these things. But again, it's the stack of stuff above your infrastructure that you run your applications in, right? So that's what we're shooting for with a platform. Now, let's say you have a platform in place, right? Luckily, because we do have this eternal recurrence of platforms, we actually have lots of knowledge and lessons about how to run them. So while we might delight in rewriting all the platforms and recoding them, we don't actually have to rewrite the culture or the process or the ways of using them. We've just got a better tool that we're swapping out, and we can kind of reflect on, like, at least, I just kind of chose seven years arbitrarily. Well, actually, was it arbitrary? I forget, I guess that goes back to 20, I can't do math in public, that's very embarrassing. But back when I was making this up, it was basically about the height of the previous platform-as-a-service era, more or less before Kubernetes started coming in and sort of bringing us back down to the infrastructure layer. So that's what I'm gonna do with the rest here: you've got your platform in place, how are you gonna have it be successful, achieve the ends of having your developers be able to wake up at four a.m. and go watch chefs? I mean, if you're operations people, any excuse you get to wake developers up before 10 a.m. is probably delightful for you.
And so that's what's gonna be motivating you to build these platforms here. So the first thing is, again, take on that perspective that a platform is to support application developers, right? So the first thing that you're gonna wanna do, straight from the DevOps and lean area of thinking, is: all right, if the goal is to get application developers' stuff out the door, I don't wanna be involved in waking up that early in the morning and go watching people measure mayonnaise, they can do that. But what I can do is think about how to speed up their release cycle, deploying software out to production. Now, what we learned from the DevOps world is that organizations are made up of silos, and every single silo is delegated one part of the overall process, and each of them is, if you're lucky, locally optimized, but they don't really care about what happens in the rest of the cycle. Or to put it another way, what really matters are all these arrows between all these boxes, right? Because you can get each box, each of the different steps, to be really efficient. But unless you actually pay attention not only to the system as a whole, but to those arrows, and realize that the arrows are boxes. There's a mind-blower for you. You've gotta start doing that to think about the end-to-end process. And that's usually the first thing that I've seen platform teams focus on over the years: the first thing we gotta do is figure out what it takes to get software out the door, right? And, you know, I was just talking with someone a little while ago about this, and the issue with it is almost definitional. We've gone in and we've said the problem with getting software out the door and managing it is you have all these silos that don't talk with each other. So if I wanted to go talk with someone about that end-to-end process, no one exists that I would talk with, right?
Like, that's why you've always gotta go way up to an executive level, who's just like, ah, I don't know, whatever, these people do that. Usually they're a lot more enlightened than I portray them. But that's what you end up focusing on. Somehow the platform team has to figure out what it takes, and, you know, one of the more astute lean scholars out here can probably give me all the terminology that you would use for this stuff. I never really learned the difference between lead time and cycle time, which is why I keep trying to say, release the software, deploy it. So, you know, you gotta map this stuff out. There are two easy ways that I've seen people doing this. The first one is just, you know, go through the exercise of what it would take to deploy one line of code. Maybe, you know, we decided we shouldn't have drop shadows anymore on the buttons, so we're gonna remove that, or put that into the style sheet, if that's what the kids call it nowadays. And then we need to deploy that so we don't have drop shadows on the buttons. Just one line of code, and you go and you chart out everything that would have to happen, right? All the meetings you gotta have, all the stuff you need to do, how long it would take in waiting and actually doing, and you construct, people call this all sorts of stuff, just the end-to-end process together. Now, an especially clever way of doing this, I think it was James Watters I heard saying this, is don't even deploy one line of code, just assume you're gonna redeploy your current version, right? And that means you've gotta start all over again as if that current version is basically done, and what's it gonna take to just get it into production? All the meetings you need to have, all the testing you need to go through, all the checking of things, how long does it even take?
Which is actually kind of nice because it takes out your development time, which, again, we're targeting a week, so you just load that week back in there, I guess, in your optimistic scenario. But what's fun about this exercise, if you do it realistically, and I would recommend this if you're doing it, is to assume you're gonna be code complete the third week of November, in the States at least, it'd be different in Europe. Say you're done the Friday before everyone takes a week off. And then they're gonna be off for a week, and then you're basically gonna have maybe two weeks before everyone disappears again. And so schedule some meetings, see how that pans out, right? And essentially what you'll find by going through this sort of exercise are all the activities you need to do, and once you draw that out, it becomes sort of like your map for things that you're gonna start improving, right, as the platform team. What can we do, not only people-wise and with process, but how can we go in with the tools that we have, by changing the process around, and optimize all those different things, right? So that's kind of your first go there. And then once you have that set in place, you've got kind of a baseline that you can go back to and keep tracking whether you're improving things, right? So in the same way that the developers are focused on experimenting on a weekly basis, trying some new theory of doing something and measuring if it worked or not, this is kind of a thing you can use, among other stuff, to measure if you've actually progressed in all of your platform work.
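The one-line-of-code mapping exercise above can be sketched as a toy value stream: each step has hands-on work time and wait time before the next step starts, and the baseline is just the sum. Everything here, step names and hours alike, is invented for illustration, not taken from any real process.

```python
# Toy value stream map for the "deploy one line of code" exercise.
# Step names and durations are invented for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    work_hours: float  # time actively spent on the step
    wait_hours: float  # time spent waiting before the next step starts

steps = [
    Step("code review", 0.5, 24),
    Step("change advisory board meeting", 1, 80),
    Step("QA regression pass", 8, 16),
    Step("security sign-off", 1, 40),
    Step("production deploy window", 0.5, 8),
]

work = sum(s.work_hours for s in steps)
wait = sum(s.wait_hours for s in steps)
lead_time = work + wait

print(f"hands-on work: {work} h, waiting: {wait} h")
print(f"lead time: {lead_time} h (~{lead_time / 24:.1f} days)")
# Flow efficiency is usually tiny; the waiting (the arrows) dominates.
print(f"flow efficiency: {work / lead_time:.0%}")
```

The point of the exercise is that last number: almost all of the elapsed time usually sits in the arrows between the boxes, which is what the platform team goes after first.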
So then the second thing is, and this is again, as with all recurrences of stuff, this stuff was always there. I was there long ago when all this DevOps thing was happening, and boy, all the ITIL people would get so upset, and they would be like, clearly you didn't pay the 400 pounds sterling to buy the nine editions of service delivery in ITIL, or the nine parts of it, because we wrote about that in volume six of service delivery. We've been doing this all along, right? Now, so obviously there's been this idea of product managing, of thinking of developers as your customers, but that's an especially big emphasis in this current gyration of doing platform stuff, in platform engineering. And here's some of my friends at Mercedes-Benz. Now, this is one of the groups there, they're not running everything, I mean, it's Mercedes, so they run everything, but this group is actually running a platform as a service, a Cloud Foundry instance, and they're a great wealth of knowledge about what it takes to run a platform. And anyone who's run a platform for a couple of years will say the same thing: a huge shift you make is you're no longer doing service delivery, right? You're no longer delivering a set of infrastructure services, you're thinking about developers as your customers, and you're product managing the experience for them. Now, if you're in operations or infrastructure and you have a product manager on your team, then you can go get dinner now, because I'm not gonna tell you anything that you don't know. But if you don't have a product manager, that's kind of the next step you go through: all right, if developers are my customers, what does it mean to have a customer? And you can go through all sorts of stuff, but in the software world, and this is where having really only lived my life at vendors comes in very handy, I know what product managers do. I do a podcast with one of them.
And product managing is a very, very well-known practice. It's very well described, it's more or less easy to learn about, it's got its own weird little fun language and conferences and stuff like that. But the essence of product management, really, to boil it down, is you have a product manager, and they know who the customers are, and they also know the capabilities of the technology you have, like what's possible. And there's lots of things product managers do, but the core thing they do is they say: this week, we're gonna do this. And I know that sounds ridiculous, but I like to boil things down to be simple. And so the product manager is always thinking about, what's the next best thing we can do to help developers out? That's in contrast to: what capacity do we need to achieve the kind of performance characteristics that we have, and the cost metrics that we have, right? The product manager obviously cares about cost and performance and things like that, but their primary concern is, is this gonna help the customer out? Not so much, is this the right way of going about stuff? So having a product manager, very, very important, right? That's a key role on a lot of these platform teams. Now, if you're kind of curious how to start doing product management for developers, for a platform, right? Like, again, product managers would be like, well, of course this is what you do, but if you haven't really done product management, what you do, big shocker here, is you ask your customers what their problems are. And there's all sorts of ways you can go about doing that. I think this is a pretty good one.
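As a sketch of what you do with such a survey once responses come back, assuming Likert-scale answers (1 = strongly disagree through 5 = strongly agree) to statements like the ones he goes on to describe; the statements and scores below are made up:

```python
# Rank invented developer-survey statements by mean Likert agreement
# (1 = strongly disagree .. 5 = strongly agree). In practice the
# responses would come from your own form export, not a dict literal.
from statistics import mean

responses = {
    "It is easy to get a new service into production": [2, 1, 3, 2, 2],
    "I can get logs and metrics without filing a ticket": [3, 2, 4, 3, 3],
    "Provisioning a database takes less than a day": [1, 2, 1, 1, 2],
}

# Sort ascending by mean agreement: the weakest statements float to
# the top, and that is the product manager's "this week" backlog.
ranked = sorted(responses.items(), key=lambda kv: mean(kv[1]))

for statement, scores in ranked:
    print(f"{mean(scores):.1f}  {statement}")
```

Here the database-provisioning statement scores lowest, so in this toy example that is the pain point the platform team would pick up next.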
I helped write up this kind of survey that some of our people have used with organizations. I know you can't click on it now, but you can get the slides later, and you can basically take this survey and just stick it in a Google form or an Office 365 form, whatever. It's basically just a series of baseline questions to send out. And if you start sending this out pretty frequently at the beginning, so you can measure your success, you'll very quickly find out what developers need and what their problems are, right? And so, again, this is another way that, if you're measuring what you're doing and you're asking your customers what they need, and you ask the right questions, you'll get a lot of input for, again, what a product manager does: saying, here's what we're gonna do this week, right? Here's the thing that we're gonna pick. And because of the way the survey works, it's, I don't know what I'm about to say here, because I'm a philosophy major, so I assume it makes sense, but you're doing it on a Likert scale, so you can go in and sort from the top of the mountain to the bottom, ascending. And you can figure out the most important things going on that you would want to do. That's the other thing, if you stop doing software development, you get marginally good at Excel, which is fun. Ascending. I only know the word ascending because of Excel, because you have to know that. So you've got your, whatever you want to call it, value stream, your end-to-end process, the arrows are boxes, remember? And the developers are customers, we've got a product manager, Likert scale, whatever that means. So let's say you've built your platform. This is the first thing that people encounter, and this especially happens, I talk with people all the time who have this. They're like, we built out the Kubernetes clusters.
I don't know why I'm talking to you, VMware, who wants to sell me managed Kubernetes, we have Kubernetes, we've got five Kubernetes. And we built them for the developers because it's inevitable, it's the industry standard. Even though only five to 10% of the industry uses it, it becomes a standard at that point, which is fine. You can tell I'm a little sore about Kubernetes, but I went through the full-on grieving cycle back in the fall and I'm cool, it's fine. I just use humor to deal with things. But they stand this up, right? So you've got the blinking cursor, and then they're like, Coté, although they might call me Michael because they don't really know me. And they're not my mother, my wife, or John. But so they stand it up and they're like, why are the developers not using it? In fact, they go off and they use their own thing. We built this great Kubernetes for them. And that is often because they haven't really thought about platform marketing, advocacy, and to a certain extent, consulting. So if the idea of product management is a crazy idea for infrastructure and operations people, try marketing. I passed by someone in the expo hall, who were they talking with? I forget, it was the Grafana people or whatever. And I just overheard them saying, oh, no marketing pitch. And I was like, oh, come on, right? I mean, you walk up to a booth here, what do you expect? Whenever I hear someone saying no marketing pitch, what I hear them saying is, don't be boring, right? Which hopefully I'm not being, only a few people have left, so I guess it's succeeding. This is not a marketing pitch, except for a little part. But you gotta market your platform, right? And it's not only that you're gonna send out your weekly email that's like, think of the trees before you print this, if people still do that, that kind of thing. It's really product marketing. Again, like us vendors would do, or really anyone, right?
So what you're thinking about is how do I establish, well, first you gotta have a name. That's incredibly important, right? And one of the things that's important about a name is that you've gotta come up with your own name, right? There's a colonel in the Air Force who told me once, going down an elevator, he was like, oh yeah, you can't call it Agile, right? Because if you call it Agile, what's gonna happen is all these people who don't wanna do Agile are gonna come and tell you why you're not doing Agile. And therefore, you know, it's like the my-hair-is-a-bird, your-argument-is-invalid kind of thing, right? And so you're gonna get in all these types of arguments because you're not following Agile to the T, whether it's Scrum or XP or Crystal or, I don't even know what people do nowadays. And so instead, what you wanna do is come up with your own name for something, right? So that you define what it is. So you've got a brand, you've got a name, and you can see you're starting to set up a bunch of principles, right? And actually, these organizations end up coming up with physical, three-dimensional books that document not only the tools, but their approach, the way they think, you know, what a platform is, what the interactions with it are, right? And two of these are from Mercedes here, right? They spend a lot of time on marketing and brand and building up identity around it, right? In the same way that, again, back in the application developer world, you can encounter two different people who, from the outside, appear to be the same type of person. And one person would say, I'm a Java developer. And the other person would say, I'm a Spring developer. And these two people might as well be trying to kill each other, right?
Like, these are very different identities and sort of brands that they have. I mean, I joke, they probably like each other a lot and they'll argue about JVM optimization and other thrilling stuff. And, I don't know, they're great. But, so you're gonna spend a lot of time marketing, right? And part of that, literally, is this: you should probably plan on a quarterly basis to have an internal conference. And literally that, literally, literally, so to speak, right? Every three months, that's generally what a quarter is. You wanna plan it out, and maybe it's in various different regions that you're gonna travel around to, it's probably also online. But I've seen this over the years, that they have a conference that people come to, right? And what you do with that conference, and this is where the marketing comes in, right? Because this marketing goes into not only bringing developers to your platform, it's also key, as we'll get to eventually, to scaling it out to your hundreds, thousands of developers. Because what marketing does, and I'm probably not telling you anything new here, the best pitch comes by word of mouth, from one of your peers, someone who had your situation. So at these conferences, sure, you give a presentation about, well, here's our platform, and it does this, and it's pretty cool, and whatever. But what you start building up is other application developers who have used your platform, and they go over their experience doing it. And they're the ones who become, if you'll forgive the term, if you don't like it, I mean, I like it because they help pay my bills, your salespeople, right? They're the ones who are going out and establishing, this is a good idea, here's how to do it. And they're also a great source of feedback for your product management, which is maybe a little too deep into that. But, and then there's also, obviously, consulting that you'll end up doing.
I mean, I won't go over that too much, but what I do see in the first year or so of platform teams is they actually allocate platform engineers, or whatever you want to call them, people, to go out to these initial development teams and work with them to understand the platform and kind of train people up on it. So then the next thing that I recommend, and here, I mean, I'm obviously biased because I work at a vendor, but whatever, here I am talking. What I've noticed is that people who build a platform from the ground up often come and talk to all of us vendors like a year later and kind of want to start over with a pre-integrated, pre-assembled thing rather than building it on their own. And there are many, many options out there. There's only one good option, obviously, I'll let you guess what that option is. But there are many options for a pre-integrated, pre-built platform rather than defining it all on your own. And based on what I've seen over the past seven or eight years, it's a really good idea, whether it's for pay or a reference architecture, to really just start with a pre-integrated platform stack. And instead of saying we'll only take something off the shelf if we need it, flip it around: you should justify when you need to customize something, right? When you need to do your own customization and put your own components in. And the good news is, the way platforms are nowadays, I would never use the word easy where computers are involved, but it's more possible than it used to be to swap components out and put in what you want instead of what's in the pre-integrated thing. And the reason for that is, I mean, here's a list of issues that you can have, but it's gonna be a lot of stuff to deal with, to get a platform in place, to learn about it.
And then the part that is the most annoying is that third-week-of-November part, where you're gonna uncover all these groups you never really wanted to know about that now you have to go work with, right? Not even the networking people or the security people or the audit or the compliance people, but just all these various groups that you're gonna have to coordinate with. Again, think of this kind of paradox of fixing the silo problem. Our problem was we've got all these different groups that don't talk with each other across silos. And, as I just said, you don't talk with them either. And so now you're gonna have to go find them and deal with them, right? And also you're gonna have to learn the technology. Anyways, there's a lot of work that you're gonna need to go through as it is, rather than also figuring out and building a platform on your own. And like I said, there are many options, but there's a super fantastic one if you're really interested in it. But, you know, I also wanna prove the point about all the different options that you have, right? They all kind of implement the same sort of set of boxes, right? So you can start with that reference diagram up there and run through the, I don't know how many there are, but there's at least five, if not maybe ten, different pre-built things out there that you could use if you wanted to. So then next, there's a couple more things before we wrap up here, and by we, I mean me. And this one is something that I think is really important to go over with people, and hence, I go over it with people. And that is setting your expectations about the timeline for things, right? Now, I'll speak for us application developers. Us application developers know that things are always, well, you could say that they're late, but I prefer to think of it as, you know, you just had the wrong date picked for delivery, right?
Like, my way of thinking is I'm never late for a meeting, someone just started it too early. That's really how things work out. But similarly, if you're putting a platform in place, fixing up how you're doing software, what you're gonna find is it's gonna take a lot longer, by years, than you think it should, right? You're gonna think, ah, in order to transform our 100 teams, or if you're a J.P. Morgan Chase, we're gonna move all these 25,000 developers by the end of the year, they're gonna be using the platform. Or, what I hear very frequently nowadays, there's this big, big re-interest in moving everything to public cloud, despite the little bit of pushback about repatriation and stuff. But I talk with executives all the time, and they're like, oh yes, we have the target that by, the last one I heard was great, they said by March of 2024 we're gonna have moved 60% of our applications to the public cloud. And, you know, I did a little bit of math in my head and I just changed the topic, because we were there to learn about new customers we were gonna sell to, so I didn't want to insult them. But the timeline is gonna take a very long time to achieve these big goals that you have. So you gotta really pull that back. And just as a representation of that, I was lucky to be in some meetings back in 2015 at the Home Depot, which normally I present to people in Europe, but everyone here knows what that is. In Europe they call it a DIY store. Isn't that adorable, instead of a hardware store? But they started way back in 2015 with this goal of, we gotta do software better, right? And they went through the process that I'll go through after this, and I mean, they were successful all throughout. But finally, back in, it's actually 2022, not 2021.
But if you go look at their, I think it's Q3 or whatever, some quarterly report they had, their CEO was directly talking about something I was in the room for, he'd talked with some people about it, not just me, I was just in the room. And that is this Pro business, where the people who are contractors, again, it's great, you all know what that means, I don't have to explain it. People who are contractors, those are Pros, and they finally have gotten their software in place where they've grown the revenue for some of the Pro contractors by hundreds of thousands of dollars, right? And that's taken a long time, and it didn't take just that one application, but if you think about the scale that the Home Depot works at, they're putting a platform in place, it's gonna take years and years before you get these gargantuan results where you're transforming all of the things over, right? So you've really gotta set expectations that it's gonna be a lot longer than you think it would be, which the salespeople I work with probably love, right? It's gonna take a long time, so start small. So here's the kind of timeline that I've noticed happening. There are kind of two parts to it. One is, let's just call it the first year or so, right, the initial phase. And the reason this takes a long time is because, in true product management fashion, you kind of don't know what you're doing or what needs to be done. Now, with all these diagrams I've shown, you kind of know what you're gonna end up with, but how it works in your organization, how you build it and fit it to what your needs are, you have to discover and figure out. So what you wanna do is pick one, maybe two, application teams.
You go out and you find an application, and right around the same time you're building out the platform, you find this application, these developers, and you're like, all right, we're gonna work for like three months, and our goal is to get a release of your software out on our platform. And that's it. So whatever wacky business case you made when you closed the deal or got permission for this, like, oh, we're gonna have this return, that's great, let's put that over there. Just get the one application and focus on that. And it needs to be an application that's real, but not one that's gonna bring down the business if you mess it up. So in the example of Home Depot, one of the initial ones they did was the software that runs the custom paint desk. It's an important piece of software, but it's not gonna crater the company if it doesn't work. But you use that as a way, again, just like our mayonnaise people, of exploring and discovering and experimenting with what works. And that's also part of what's different with a platform team, and why you're product managing it: you're kind of learning and experimenting with that. Now, after three months, you get to six months and nine months, that's how math works. And then eventually you get up to a full calendar year. You're adding one more team, then three more teams, and you're always going up with more and more teams. And eventually you get to this point where you use this practice, I forget which DevOps report it was, but in the last couple of years they cited this idea of pairing and seeding, which is something that I've seen platform teams use a lot over the years: you have your initial team, that application developer team, right? And one of the things you strategically do is figure out who's someone on that team who could talk with other people that they don't know.
And who could even kind of help them and teach them to do things. And you'll hopefully have one of those types of people on that initial team. When that initial team is successful, you're gonna go out and find the second and the third application team. And you take that person, they're a seed, and you put them in the new team, right? And that way, again, think about what vendors love to do. We don't actually like to sell to you directly. Well, that's not true, we actually do like that. But we also like y'all to sell to each other, right? And so you have the developer who was previously successful with using this new platform, and you move them off into this new team, not only to train and teach people how to do things, but also to say, it's great, it works, to be a trusted person to endorse it. Now, what we advocate in the way that we do this stuff at VMware and in Tanzu and Pivotal, whatever you wanna call it, is we actually advocate pairing up. So not only pair programming, but pairing product managers and pairing all sorts of stuff, which, you know, some people like and some people don't, but I've found it's quite effective. There's all sorts of research papers that say it's good, but also you should probably floss every day, and I don't. So never mind that. Although I was told that in the Netherlands, they have these flat toothpick things that have a ridge on them. And you're supposed to use those, you go in this way, straight, and in that way, and it helps your gums not bleed, which is a very Dutch sort of thing. It's like, yeah, you just need to poke your gums, that's part of your daily life, twice a day. Actually I think it's once, but I could see them wanting to do it twice. And then your gums don't bleed. It's fantastic. What was I talking about? You pair people up, right?
And that's the magic of the pairing: because you seeded these people over, you start pairing them and they transfer that knowledge. And again, I have a liberal arts degree, so I don't understand how math works, but there's some kind of effect of pairs. I think that's like compound interest or something. But once you get enough pairs going, and they go out and pair, and they go out and pair, it's like that shampoo commercial: I told a friend, and they told two friends, and they told two friends, and so on. And next thing you know, and that's why the first year you're not gonna be doing much, but then, and I see this over and over again at large organizations, you reach this point after about 12 months where it's pretty easy to scale. I mean, not instantly, but you can see a path towards having lots and lots of teams out of your 25,000 developers or so using it. But you've gotta know going in that you're doing this, and really engineer that you're gonna be following this pairing and seeding thing. And, I'm not really talking about this here, but a lot of the stuff that I'm talking about, if you think about it, is the job of management, right? It's their responsibility to do that sort of thing. And so that's why getting the management people on board and thinking about all these things is really important, because they're often the ones who can look across silos and make decisions and make changes, rather than, I don't know, I type into a keyboard. I can't change stuff unless I can type. Which isn't really how you change organizations. I mean, I guess you eventually click on a button, but you gotta have a lot of meetings. So the next thing, you know, just to kind of wrap that up, right? To give you an idea, I mentioned the Home Depot and the anonymous restaurant, the mayonnaise people.
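The "compound interest" effect of pairing and seeding that he gestures at can be modeled naively: assume each platform-fluent developer pairs with one newcomer per quarter, so the fluent population doubles each cycle. The numbers are purely illustrative, but they show why year one looks slow and then scaling suddenly gets easy.

```python
# Naive doubling model of pairing and seeding: each quarter, every
# developer who knows the platform pairs with one who doesn't.
# Purely illustrative; real adoption is messier than clean doubling.

def quarters_to_adopt(total_devs: int, seed_devs: int) -> int:
    fluent, quarters = seed_devs, 0
    while fluent < total_devs:
        fluent = min(total_devs, fluent * 2)  # each pair trains one newcomer
        quarters += 1
    return quarters

# Starting from one 8-person team out of 25,000 developers:
print(quarters_to_adopt(25_000, 8))  # prints 12, i.e. about three years
```

After four quarters the model has only reached 128 of 25,000 developers, which matches the "first year you're not gonna be doing much" observation; the back half of the curve is where the scale shows up.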
But if you wanna see a very brief talk, it's only seven minutes long: this is one of the people I know at JPMC. JPMorgan Chase has been doing this long enough that not only do they have a platform team, they have at least three different platforms, if not more. But then beyond the platform engineering team, they have a whole team that this guy runs, the platform advocate team, which, you know, makes sense for, I'll say it again, like 25,000 developers. He's got a great talk going over lots more information, and these are just two slides about the practices they've learned doing it, right? And I wanna show this one because, like, it's great. I know this person and we work with him, and you like to see people that you know and work with be successful rather than not. That would be a terrible slide: "This person tried everything I said and it didn't work. Let's go to dinner." But it's because it actually is possible at this scale, at this sort of normal a company. Now, you could argue JPMorgan Chase is not normal at all, they are getting custom servers built and things like that, but at that scale there's a lot of normalness, right? And I see this at all sorts of organizations, particularly government organizations, who do this and very quickly become successful at it. So it's definitely possible to build a platform and put these things in place. So with that, thanks for sticking around here. No one wanted a snack, but they're still here. If you go here, I have a page that lists kind of all my relevant platform stuff, including the slides that you have here. And I do most everything through my newsletter nowadays because no one seems to read blogs. And, you know, the Twitter thing is not really panning out. And the issue with Mastodon is, like, now I gotta do this all over again, right?
Like, you know, my Twitter account's from like 2006, so I've got lots of years and stuff, and also there's no algorithm to promote you. I guess what I'm saying is, I just want people to click on my stuff. I don't really want to have a conversation, but if you like that, that's there as well. I'm available to, as my friend Andrew Schaefer likes to say, "engage with my brand." And I'll be around if there's any questions. And with that, thanks.

Chris, if you've got a question, please raise your hand and I'll run the mic out to you.

I thought I was gonna get away, but...

This was the best talk I didn't realize I was going to. So thank you.

Oh great, I'll have to parse that, as they say. But I appreciate that, it's very nice of you.

I love how many awesome reports and studies you have in your slides. I'm just curious, like, how do you find all of these?

Oh...

Like, you're mining gold here. I want to see where your gold mine is.

Oh, sure. Well, you should subscribe to my newsletter, first of all. But now, well, I was gonna say more seriously, but equally seriously: I've been an analyst over the years, so you get access to stuff. And I think there's two other things. Well, one, I spend a couple of hours each day reading stuff. I'm lucky. Well, yeah, probably at least an hour. I'm lucky in that respect. So, you know, you just remember that guy in Heat where they found out where the thing was, and they go up to this trailer around here somewhere, and he's got all these radio beacons, and he's just like, I don't know, I just hear stuff. Like, you gotta set yourself up there on the mountain to hear things. But the other thing, working at a vendor, right? So we get access to all sorts of things, because we pay the analysts and we get the reports, we commission surveys and things like that. So that gets you on the radar for this stuff.
But the other thing, speaking of the person who doesn't want a vendor pitch over in the Expo Hall: us vendors also love it if you go give us your email address, and you can download all these reports, right? So we actually license all this type of stuff across all the different vendors. I mean, obviously you should start with VMware and maybe not look anywhere else, but if you wanted more, you could go to all the various vendors, all the public clouds, and you can find many, many, many of these reports, right? And, you know, I shouldn't tell you this, but you can just not use your real email address, no one really checks. But you should definitely, if you're clicking on the links I have, use your real email address. Cool, so yeah, that stuff is out there. You just have to kind of know.

Anybody else got a question, comment, or even a story?

Michael, who are these people that have high-cost dry cleaning bills?

Oh, the executives, yes. You know, I used to work with them when I was at Dell for a couple of years. And at first they seemed crazy. People who wear the starched shirts. And there was a guy who, to use the word again, literally drove a red Carrera. Did I say that right, a red Porsche? And I was like, boom, I've arrived. Like, I've met the, er, dry cleaning executive person. He's a very, very nice person, very effective too. So yeah, I mean, you could describe them as the MBAs, the managers, the executives. But the way that I look at them, and I kind of alluded to this a little bit: if I'm talking to that kind of crowd, I would say something more like the following. This platform, this organization, that is: think of yourself, an executive, as a programmer, and this is your application, right? You're responsible for architecting it, for developing it, for testing it, for deploying it, and caring for it ongoing, right?
And so really they're the ones who, if you're into systems thinking, which, I barely know what I'm talking about there, but as I've demonstrated, that's never stopped me before. Like, you know, if you think about the system of your organization, they're the architects and the managers of it. They're also the people who make sure that you get paid once or twice a month and that the business is functioning.

So, very random: on the last slide, one of the bullets was cloud parties, and I was just curious.

Which one is it?

Number six.

Number six. What exactly is a cloud party? You know, I don't know what that is, but it sounds like fun. Yeah. I think, you know, the presentation's only like eight minutes long, so maybe he defines what that is. But yeah, I'm not sure. I mean, my guess would be, he might mean dealing with the various parties involved. Not fiestas, but the people involved who are doing cloud stuff. But I'm not really sure. Maybe he literally means, like, yeah, we're gonna go have an event and make people think we're cool. There you go. That would be great. Depending on your geographic region, that may or may not be possible.

One minute before seven, anybody got a last question, comment, observation? If not... why don't you?

So you mentioned a number of times, you kept coming back to this idea that there's a goal, and there are goals, and lots of goals, and you said the word goal more times than, like, John would have said it. Was that actually a subtle callout to Goldratt, or is that...

Oh, oh, no.

It just happened that you used that term.

No, no, but yeah, yeah. I mean, that is a... sure, yes. I mean, it wasn't a callout, but your sentiment is correct, right? Like, I mean, as I started with, right? It's always good to know why you're doing something.
Which, you know, when I was a young developer... I think Matt Ray is the only person in here who knew me when I was a young developer. I was happy to have no goals and stay at work programming all day. But now, I don't have five kids, but I have three kids, and I like to have goals so that I know when I can stop. And I think that's equally good in this kind of situation, because otherwise, you know, it's never gonna end. Which could be cool, but the dry cleaning people, they don't like that. They like goals. They gotta go pick up the dry cleaning.

Okay, well, thank you everybody for attending. If you haven't been to SCALE before, this Saturday evening does have a social event that will start next to the expo hall at 8 p.m. It's actually open now, but reserved for under-18s until then. And there'll be food, drinks, board games, and you can just chat with each other there. If you, like me, enjoyed Michael's presentation, could you join me in giving him a round of applause? And then the Cloud Native track will resume in here tomorrow, and the conference will close with the legendary keynote speaker Ken Thompson. So stick around for the whole thing. Thank you.