the session. Let me give a brief introduction about myself. Again, my name is Chamod and I work as a software engineer at an Australian company called Industries Software Group. Other than my technical work, I run a few communities in Sri Lanka, which are KCD Sri Lanka and Golang Sri Lanka, and I'm also a GitHub Campus Expert, and I'm leading Digital Sri Lanka as well. So these are the things I'm going to discuss today. The first one is what containers are, then what is GitHub Container Registry, then why GitHub Container Registry. Then I'm going to move into what Kubernetes is and give a brief introduction to Kubernetes, then the benefits of using Kubernetes, and then we move into the demo and the Q&A. So, what are containers? Containers are packages of software that contain all the necessary elements to run in any environment. It doesn't matter what the environment is; if it runs on Docker, it can run anywhere. As you can see in this meme, previously, before container technology, people had to set up everything on a machine to run a certain application. For example, let's say we are running a Java application in production. We have to install the JDK and everything else to run a Java application. But if we containerize the application, we can run it anywhere, because it's easier. So let's move to container registries. GitHub is not the only platform offering a container registry. GitHub was originally a source repository management system; now they are coming into GitOps. And there are other container registry platforms, like Docker Registry; from Google we have GCR, Google Container Registry; in Azure we have Azure Container Registry; and AWS has one too. So, what are the benefits of using GitHub Container Registry? The key benefit is that it reduces maintenance by keeping everything together, which means we can keep the source code, the builds, and the distribution in the same place. So before we talk about the need for Kubernetes, or a container orchestrator: do you know the meaning of monolithic and microservice? Can someone tell me? Yeah, that's correct. As he said, a monolith runs as a single application, but with microservices we can split the application into different parts and run each part separately. It's easier to maintain and easier to do deployments that way. So when it comes to managing containers, we cannot do it manually. Let's say our application has a thousand containers; how can we manage that? That's the main reason Kubernetes comes into the scene: with Kubernetes we can manage, actually orchestrate, many containers at the same time. As you can see from the definition, Kubernetes was originally developed by Google, but now it's maintained by the CNCF, the Cloud Native Computing Foundation. Yeah, let's move into why we should use Kubernetes. It's highly available, which simply means the application has no downtime; it's available 24/7 and accessible to every user. And it's very scalable: as resource needs change, it can easily scale so we can maximize resource usage. So when it comes to Kubernetes, there are a few things you need to know. The first one is nodes. A node is a worker machine that runs containers. Actually, we don't need a cloud environment to run Kubernetes.
We can do it on physical machines, or in hybrid or cloud environments. The second thing is pods. Pods are the smallest deployable unit in Kubernetes. The next one is deployments, which provide declarative updates for pods; when we do a Kubernetes deployment, it actually creates pods. The next one is services: we can use services to expose our application externally or internally. And if our application has some secrets or configuration, we can use ConfigMaps or Secrets. Another one is namespaces, because namespaces provide a way to divide our cluster into different parts. Yeah. So this is a Kubernetes object YAML definition. As you can see, first we have to specify the kind. This is a pod YAML definition; if it were a deployment, the kind value would be Deployment, and most of the other parts would be the same. As you can see, in the image field we can put the image name, where the image is located, and we can also specify the ports of the image. Actually, we can run more than one container in a single pod. So there are two ways to create a Kubernetes object. The first one is imperative; that's my personal preference. We need to install the Kubernetes CLI, kubectl, on our machine or whatever environment we are in; then we can run a few commands and create Kubernetes objects very easily. The other one is the declarative way: we have to remember this YAML definition, or we can go to the Kubernetes docs, copy it and edit it, and then create the Kubernetes object. So I will show a quick demo of building an image and doing a deployment in Kubernetes, as well as exposing it to the outside. Here I have a simple Go application, which I prepared earlier, that returns a response when we hit the API, and I have it locally here. I'm actually using a GitHub Action; I have pushed this image to GitHub Container Registry. So let me show the GitHub Action for that. Actually, you don't need to write much yourself when you are doing this: you can go into Actions and New workflow, and there are tons of templates available, so you can do this easily, as you can see. So now I'm going to do a tag release, and the GitHub Action will be triggered; it will create an image and push it to the GitHub Container Registry. I'm going to name it version 3.0. All right. So now, as you can see, the GitHub Action is running. Since we have only a few minutes left, I'm not waiting until this ends; let's move into the Kubernetes cluster side and create a Kubernetes cluster. I'm using Google Cloud, but you can use any cloud provider you want, either Azure or AWS. Using this setup, you can easily create a Kubernetes cluster; it's very easy to set up. I have actually created an Autopilot cluster, because then I don't need to specify the nodes: depending on my workload, Google Autopilot will take care of provisioning, creating and deleting the nodes. So here's my created cluster. When I want to access it, I need to go to Connect and copy this command. I'm going to run this command in the Google Cloud Shell, because it is very easy to use and it's running in the cloud environment, so it's very fast. So I'm simply pasting the command. Okay. I have already prepared all the imperative commands to run my deployment.
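To make that concrete, here is a rough sketch of what the pod YAML definition and its imperative equivalent might look like; the image name and port are illustrative, not the actual values from the demo:

```bash
# Declarative way: a minimal Pod manifest applied from stdin.
# For a Deployment, the kind would be Deployment and most other parts stay the same.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:                           # more than one container can run in a single pod
  - name: app
    image: ghcr.io/OWNER/demo-app:3.0   # image pulled from GitHub Container Registry (illustrative)
    ports:
    - containerPort: 8080
EOF

# Imperative way: create the same pod directly from the CLI.
kubectl run demo-pod --image=ghcr.io/OWNER/demo-app:3.0 --port=8080
```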
So to view all the namespaces, I need to run kubectl get ns. Okay, I have only default, and the other kube-* namespaces are created by the system. So let me create a new namespace and view the namespaces again. As you can see, the namespace is here. So let's create the deployment. When I run this deployment creation command, it will create a deployment under this name; here I'm specifying the image and the namespace as well. I can actually view the deployment here by running kubectl get deploy, and I have to specify the namespace as well. So yeah, as you can see, it's not ready yet, which means the deployment is still being created. When it comes to exposing this deployment to the outside, we can use the command kubectl expose deployment, and I have to specify the type as LoadBalancer. Okay. Everything is in this Git repository; I will share it later. So let's move forward to the slides in my presentation. So, where to next? If you are new to Docker, you can try Docker, GitHub Container Registry and Kubernetes. If you want to do some labs with Docker, you can go to the Play with Docker website and try it out; actually, this is where I learned about Docker most of the time. And if you want to upskill your GitHub skills, such as GitHub Actions and GitHub Container Registry, those tutorials are available at skills.github.com. And if you don't have cloud credits, you don't need to worry, because you can always use Minikube; it's a local cluster, and you can practice Kubernetes on your local computer. Other than that, you can stay in touch with my communities, Kubernetes Sri Lanka and KCD Sri Lanka, to learn more. So, do you have any questions? Feel free to ask. Actually, for free users they offer 500 MB, and if you are a Pro user they offer 2 GB. Okay. So thank you; you can follow me on my social media. Thanks, Chamod. It was really a wonderful session. I do remember earlier we used to say no life without water; if you ask a developer, he will say no life without Git. So, okay, we'll be back with Nginx or any other API gateway. Okay. Let's start with authentication and security. With an API gateway, when a user sends a request to your service, that is, to your API gateway, you can set up authentication. This can be simple stuff like a key-value pair or a JWT token, or you can use third-party platforms, something like OAuth or any other third-party tool. So it not only lets you control who accesses your APIs; you also get the user info once the user is authenticated with your API gateway, and you can make your API a bit tailor-made, so that you can give each user a tailor-made experience, roughly along the lines of the sketch below. Another one is rate limiting. Rate limiting can be against something intentional, like denial-of-service attacks, or it could be unintentional: maybe your site got really popular, your application got really popular, and there are a lot of users trying to access your application at the same time. This could happen in a multitude of ways, but you have to make your services reliable enough to work against it, right? An API gateway can help you do that: if the number of requests exceeds a certain threshold, you can choose to reject or delay the requests, so that your backend application, your actual APIs, can handle them.
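As a rough illustration of that authentication setup, here is a hedged sketch using the Apache APISIX Admin API; the consumer name, key, upstream address, and admin port are illustrative assumptions:

```bash
# Create a consumer with a key-auth credential (illustrative values).
curl -X PUT http://127.0.0.1:9180/apisix/admin/consumers \
  -H "X-API-KEY: $ADMIN_KEY" -d '
{
  "username": "alice",
  "plugins": { "key-auth": { "key": "alice-secret-key" } }
}'

# Protect a route with key-auth: requests without a valid apikey header are
# rejected, and once authenticated, the gateway knows which consumer is calling.
curl -X PUT http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H "X-API-KEY: $ADMIN_KEY" -d '
{
  "uri": "/orders",
  "plugins": { "key-auth": {} },
  "upstream": { "type": "roundrobin", "nodes": { "backend:8080": 1 } }
}'
```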
Now let's look into monitoring and observability. You are only as good as your monitoring solution, because if something happens, if there are issues, you need to be able to identify them; you need to have data on them to actually protect against them. An API gateway like Apache APISIX integrates with a lot of observability tools, like tracers, things like OpenTelemetry. APISIX integrates with OpenTelemetry and with loggers. You need logs coming in from your API gateways and from your backend applications, and you also need metrics, so it needs to be integrated with tools like Prometheus. All of this can be handled centrally within an API gateway. When you think of a system without an API gateway, you need to add all of these into each of your applications, each of your microservices; an API gateway makes it much easier to do that. Another aspect of reliability is version control and zero downtime. You have to make sure that you are able to bump versions, release new versions of your application, without your users experiencing any downtime. And you can do this with something called a canary release. This is a release strategy followed in the software engineering world. You have your v1.0 API and you are trying to switch it with a v2.0 API. Initially you have all of your traffic being routed to v1.0: the API gateway is routing all traffic to v1.0, and you have v2.0 deployed on your server. Initially you can route some traffic; 5% is a nice number. So you can route 5% of the real traffic to the new version of the application, and the API gateway lets you do that (a rough configuration sketch follows at the end of this section). In Apache APISIX, this is handled dynamically, so you don't need to restart the gateway to make this change. Now some percentage of the traffic is being routed to the v2.0 application, so you can test your application with real traffic, but only with a small percentage of it. If any issue occurs, if your application is not working as you intended or there are any bugs, you can easily identify it without affecting most of the users. And finally, when you have tested the new version of your application, you can configure the API gateway to route all traffic to the new version. And there is always an option to roll back to the older version of the application if you have an API gateway. Next up we have circuit breaking. This is similar to how circuit breakers in electrical circuits work: the main goal there is to protect the circuit from faulty components, and here it is to protect your applications from faulty services. If you have a faulty service, or a service that is experiencing downtime, you need to cut it away from your system. If you don't do that, what happens is that it can cascade into other services: if an upstream service is not working and you are still trying to send requests to it, those requests will fail, and this will cause a bottleneck in your system. So you have to make sure that you cut it off from the system. An API gateway can run health checks against these backend applications, and if they are faulty, it will cut them out of the system. The API gateway can keep polling the service, and once it is back online, the gateway can resume sending requests to it.
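For the canary flow described above, a hedged sketch of what the configuration might look like with the APISIX traffic-split plugin; the route, upstream addresses, and weights are illustrative assumptions:

```bash
# Route ~5% of traffic to v2.0 and the remaining ~95% to the route's
# default upstream (v1.0). Weights are relative, and the split can be
# updated dynamically without restarting the gateway.
curl -X PUT http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H "X-API-KEY: $ADMIN_KEY" -d '
{
  "uri": "/api/*",
  "plugins": {
    "traffic-split": {
      "rules": [{
        "weighted_upstreams": [
          { "upstream": { "type": "roundrobin", "nodes": { "app-v2:8080": 1 } }, "weight": 5 },
          { "weight": 95 }
        ]
      }]
    }
  },
  "upstream": { "type": "roundrobin", "nodes": { "app-v1:8080": 1 } }
}'
```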
And finally, redirects. As I mentioned at the beginning of the talk, reliability not only means reliability of your services, but also reliability for your users, for your client applications. Redirects are a way to do that. If you are deprecating or changing your API endpoint, a client application might need to change its endpoint as well; this is additional overhead for the clients. If you are doing this frequently, you don't want your clients constantly changing their code to work with your API. An API gateway acts as a middleman: even if the user calls the old API endpoint, the API gateway can redirect them to the new API, and it can also send a deprecation notice with a 3xx status code. This helps users transition to the new API more smoothly. And before I go, I want to talk about the project and the community. Apache APISIX is a project under the Apache Software Foundation, and it is entirely built by the community. The Apache Foundation has this motto which can be summed up as community over code, and this project is entirely community driven; we have community members from around the world. So yeah. If you are interested in contributing to APISIX, feel free to do so; contributions of any kind are welcome. You can contribute code, you can help us by testing out APISIX, you can share how you are using APISIX, you can help us write blog posts or give talks like this. So you can check that out as well if you are interested. And if you have any questions, feel free to ask me now. You can check out APISIX, and there is also an article version of this talk, so if you want to look at all these diagrams, you can check that article out as well. Yeah. Thank you. I guess I have a lot of time for questions. So we do have Christian. Yeah. So my question is basically, I'm trying to find out what exactly this is, why this is different from load balancing. A lot of the features seem very close to load balancing, except for things like the monitoring and the logging. But does it do things like integrating multiple APIs? Is that one of the features it has, or is it more like, you know, quite a more advanced version of load balancing, essentially? Yeah, it does load balancing. I guess an API gateway is a superset of what a load-balancing service does. An API gateway can do load balancing, and it can also batch requests, as you mentioned: one request to the API gateway, and the API gateway will send multiple requests to your services, collect all the information, and send it back as one package. So it does all of that, but it also gives you more features on top of that: better control of your traffic. Yeah. So. Yeah. So I have a related question. You mentioned it has tracer support. Right. And I wonder, when this API calls the downstream API, and if you also use the same API gateway for those downstream APIs, does the tracer integrate all the traces together? And yeah, can you get into the details of how the tracer works? So the API gateway acts as the entry point to your whole backend. When we talk about OpenTelemetry, the API gateway is the place where the trace ID gets generated. So everything that starts at the API gateway, it can track: if you have instrumentation in your backend applications, the API gateway can trace the whole request.
So from your API gateway to your service; and if you have databases, maybe you can also instrument those databases to send out traces. The API gateway can be the central part that collects all of these, and you can get it as a single piece of data. So, there are many API gateway services on all the platforms: if you talk about AWS, they have AWS API Gateway, and Google has one as well. So have you done any kind of evaluation of how this APISIX from Apache is different from them? What is the USP it has, the differentiator compared to the other API gateways? What are those points, those brownie points? That would be really helpful. Yeah, that's a good question. First of all, APISIX is completely open source and it is hosted by the ASF, so that's a big point. And then APISIX is built on top of Nginx, but it is a different build of Nginx that makes it much faster, much more dynamic, and much lower in overhead for your infrastructure. Being fast also makes it really scalable: as your applications grow, and I'm talking in the range of millions of requests per second, in such scenarios APISIX really shines compared to other alternatives. And when you talk about big cloud providers, you can always end up in scenarios where you are tied to the vendor, but Apache APISIX lets you get away from that, because you can easily switch to a different API gateway if you want: APISIX supports a lot of common specifications. For example, if you are using Kubernetes, there is the Kubernetes Ingress API and there is also the Kubernetes Gateway API; APISIX supports these, so even if you decide to switch API gateways, you don't need to change your configuration, and it will work seamlessly. There is also the case of multi-cloud or hybrid-cloud scenarios, where you don't necessarily want to be tied to one cloud provider, and a gateway like APISIX can help you avoid that. Are there questions? So, APISIX has a plugin architecture: there is the core of APISIX, and all of these features, things like rate limiting and observability, are implemented through plugins. Let me just show you the docs. We have multiple plugins for rate limiting: the limit-req plugin limits the rate of requests, we also have limit-conn for the number of simultaneous connections a user can have with the API gateway, and limit-count to limit based on a request count. But you can also make it more sophisticated by setting up rules for specific users: you can set up authentication, and once the user is authenticated you can give them specific limits, so a particular user can make 60 API calls in a minute or something like that, roughly like the sketch below.
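A hedged sketch of that per-consumer limit, again via the APISIX Admin API with illustrative values; the parameters shown are from the limit-count plugin, keyed by the authenticated consumer's name:

```bash
# Allow each authenticated consumer 60 requests per 60-second window;
# excess requests get a 429 response. Values are illustrative.
curl -X PUT http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H "X-API-KEY: $ADMIN_KEY" -d '
{
  "uri": "/api/*",
  "plugins": {
    "key-auth": {},
    "limit-count": {
      "count": 60,
      "time_window": 60,
      "key": "consumer_name",
      "rejected_code": 429
    }
  },
  "upstream": { "type": "roundrobin", "nodes": { "backend:8080": 1 } }
}'
```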
So APISIX really lets you do that. And if you have a more complex scenario that is not supported by APISIX's plugins, you can also create your own plugins: you can use Lua, or you can use other programming languages, because APISIX supports multiple plugin runners. You can write Java plugins, you can write Go plugins, and we are also working on a Wasm integration, so basically you can write a plugin in any language, compile it to Wasm, and run it with APISIX. Yeah, it should be possible, but not out of the box, I guess; you could write your own custom code to pull metrics from Prometheus, or set up your monitoring system, and then evaluate and apply limits as you mentioned. So the question is how you can upgrade the API gateway itself. I guess it is similar to how you upgrade other software, because you can do something like a canary release there as well: if you have multiple instances of APISIX deployed, or if you are operating at that scale, you need some sort of load balancer in front of them, and then you can probably do a canary release or something similar so that the traffic is not interrupted in between. But the process is quite seamless; we do have an LTS version, and we also have edge releases, so most people stick to one particular version and take their time to update to a newer version of APISIX. Yeah. So one question was what kinds of communication protocols we support: all the different types of API, like REST, gRPC and GraphQL. We also support multiple protocols; we recently had a user who has been using APISIX for IoT applications, and that uses the MQTT protocol, so it is pretty vast. Another use case of an API gateway is to convert between multiple protocols: you might have a gRPC application or a GraphQL application, but your clients only speak REST, so you would want to translate between those different kinds of API types, right? An API gateway can be a facilitator for that. And the other question was about service discovery. APISIX integrates with service discovery platforms as well. Let's see; so APISIX integrates with all of these platforms. If you talk about Kubernetes, APISIX integrates with Kubernetes' default service discovery. Let's see, there should be some documentation somewhere here; oh yeah, so you can see all these registries, all these different things that APISIX supports. So yeah, you can definitely integrate with that and it will do service discovery. You can also do it manually: if you have a YAML file where you are populating your services, your backend services, APISIX can watch that and do automatic service discovery. Thank you, thank you. So thank you, Navendu, for this wonderful talk. So we are going out for a break, and then the next session is about implementing an open source program to support cloud infrastructure; he's going to talk about Kubernetes clusters and other API-related stuff. So yeah, here you go. Welcome to our presentation. I am Nguyen Tan Huy; I come from the Vitegrub group and I am a cloud solution engineer at the Vitegrub group. My colleague couldn't be here. Today I will talk about the topic: implementing an open source program to support the cloud infrastructure needed by a cloud provider's Kubernetes clusters. The presentation is structured as follows: the first part is an overview and the problem, the next is the solution and its implementation, and the last one is the value. As you know, Kubernetes is a complex piece of software, and provisioning and managing
a Kubernetes cluster is a challenging job. Installing all the components of a Kubernetes cluster is time-consuming when you do it manually. Imagine the situation when you have to manage hundreds of Kubernetes clusters provisioned in a production environment; operating the clusters can be so tedious. Although there are some tools, such as kubeadm, they cannot manage the full life cycle of the cluster. And here is Cluster API: a Kubernetes project started by the Cluster Lifecycle SIG. It's a declarative API that can be used to manage the life cycle of one or more Kubernetes clusters. Next, let me describe the components of Cluster API. The way Cluster API works is separated into providers, and what providers do is make it very modular, meaning you can plug in different infrastructures and different bootstrap mechanisms. We have the core Cluster API controller, the core Cluster API manager: it includes the API life cycle logic and life cycle management, that is, creating and updating machines and all that. Then we have the bootstrap provider: how to join a node into the Kubernetes cluster. Cluster API comes with a built-in bootstrap provider based on kubeadm, a Kubernetes project. Then we have the infrastructure providers, where you have your different infrastructures, like cloud environments or anything else; for example, for cloud you have the Azure provider, the AWS provider, the bare metal provider. And last we have the control plane provider, which is responsible for managing control plane machines; managing means operating the control plane machines and making sure the Kubernetes components are ready. The core Cluster API includes built-in custom resource definitions that let you extend the Kubernetes API, and a Cluster API provider can provide its own custom definitions, such as Cluster, Machine, and MachineDeployment. So, I want to describe the overall architecture of our previous virtual cloud system before getting into the heart of the matter. The architecture consists of three main modules: the cluster management platform, called CMP; the Kubernetes provider; and OpenStack. The CMP is a management module communicating directly with the user through the UI; it receives requests and sends them to the Kubernetes provider. The Kubernetes provider is the central management module that manages the Kubernetes clusters. When receiving a request from the CMP, the Kubernetes provider will send it down to OpenStack to create the Kubernetes cluster. Based on the Cluster API Provider OpenStack solution, to create a Kubernetes cluster you need to create other resources, such as load balancers and servers, and the information about these components is stored in OpenStack. OpenStack is our private cloud, so the CMP can get the information about the resources created below in OpenStack. So when you send a request to create a Kubernetes cluster, the request is sent to the Kubernetes provider, and it is then sent on to OpenStack. And here is the network model of our architecture. The VMs of a cluster are plugged into two VPCs: one VPC is the user VPC, and the second is the provider VPC. Requests that need to go out to the internet go to a router and out to the internet; requests that require administration and interaction with OpenStack go to the provider VPC and to a load balancer. The API server of the master nodes uses a load balancer attached to a VPC and is exposed to the internet through a floating IP. So this architecture resulted in some problems, which I will describe after the following rough sketch of the core Cluster API objects we have been talking about.
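This is a minimal, hedged sketch of what the core Cluster API objects look like, assuming the upstream cluster.x-k8s.io/v1beta1 API; the names, and the exact apiVersion of the OpenStack infrastructure types, are illustrative and vary by provider release:

```bash
# A Cluster object delegating to provider-specific objects via references.
kubectl apply -f - <<'EOF'
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo
spec:
  controlPlaneRef:                 # control plane provider object
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-control-plane
  infrastructureRef:               # infrastructure provider object (here OpenStack)
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackCluster
    name: demo
EOF
```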
The first problem is the transparency of our system. When a customer engages in destructive behavior that is harmful to the system, such as releasing a resource or applying a bad change, we cannot know, and the customer can completely blame the supplier. Additionally, it takes a long time to determine which resource is having a problem when there is an issue with the client infrastructure, which impacts the customer experience, and the customer will not be satisfied. Another concern is the lack of information. When a client sends a request to create a Kubernetes cluster and the system does so successfully, all they receive is a notification that the cluster has been created successfully. The system cannot tell which resources were created along the way, and the customer will not be able to update, revise, or estimate in a timely manner, since they don't have the information about them. And from that follows the third problem: calculating the customer's usage cost. They cannot know what they have paid for. Another issue is the complex network model; this issue affects both the customer and the provider. And now I will talk about the solution. Instead of using Cluster API Provider OpenStack, we customized and designed our own plugin, our own Cluster API provider. The plugin can call the CMP to create a Kubernetes cluster, and our CMP can create resources similarly to OpenStack. Here is our new architecture. The user sends a request to create a Kubernetes cluster through the UI. The CMP receives the request and sends it down to the Kubernetes provider, and it then calls back to the CMP, through our Cluster API provider, to create the resources related to the cluster. The information about the resources used to create a cluster, such as load balancers, servers, and volumes, is stored in the CMP, and it can be shown to the user in the UI. With that, it's possible to show the customer the information about the resources created. And this is the new network model: the VMs of the user cluster sit in the user VPC, and requests going to the internet or connecting to the CMP API go through a router; we still have a load balancer in front of the API server on the master nodes. To build our custom Cluster API provider, we follow the cluster infrastructure provider contract and the machine infrastructure provider contract of Cluster API. We have defined the cluster and machine templates. In the cluster type we add fields such as the control plane endpoint, and, as the contract states, we have the required field ready to indicate that the provider-specific cluster infrastructure has been provisioned. In the machine type we likewise have the required fields, such as the provider ID and the ready status. So the provider must watch for updates to these resources and respond accordingly. We also have role-based access control. In more detail, we define four main custom definitions: the virtual cluster, virtual machine, virtual cluster template, and virtual machine template. For the virtual cluster, we have defined the components of the virtual cluster structure: VPC, subnet, region, load balancer, security group, listener, API server port, control plane, and so on. In the virtual cluster we have defined the virtual cluster spec and the virtual cluster status. For the virtual machine, we have defined the components of the virtual machine, such as server, region, image, volume, and size. And through the virtual machine template, the Cluster API provider can integrate with other cloud providers. So here is the state transition when you create a machine. When a machine is first created, the machine is in Pending status.
The machine controller sets the bootstrap data secret name from the bootstrap config; at this point, the phase is Provisioning. The infrastructure provider then starts creating the infrastructure for the machine, and when the machine infrastructure is ready, it sets its status to ready. And when the machine infrastructure status is ready, the machine controller sets the machine's phase to Provisioned. And now for the controller of our Cluster API provider. When our Cluster API provider runs, it runs two concurrent reconcile loops: one for clusters and one for machines. The cluster reconcile loop will create resources and ensure the cluster infrastructure is ready, and the machine reconcile loop will wait for the cluster infrastructure to be ready. In the cluster reconcile, we create the VPC, the subnet, the load balancer, and other objects; it ensures the cluster infrastructure is ready, setting the infrastructure status and the cluster status accordingly. Then the machine reconcile starts creating its objects: it creates the volume, the server, the load balancer member, and other objects. When these objects are created, it sets the machine infrastructure status to ready and the phase to Provisioned. So the new solution has brought many benefits to our platform. It can provide sufficient information to the user: when the user sends a request to create a cluster, all the information about the related resources can be shown in the UI, so the customer can see everything. The second is that it solves the cost problem: the customer will know what they are paying for. The new solution thus ensures transparency for the customer. And last, it improves the experience when they use our cloud. Yes, thanks for your attention. Does anyone have a question? Thank you for this wonderful talk about Kubernetes. Hello, yeah, welcome back. So the next session is going to be led by Ankit; he's going to talk about DevOps is dead. Yeah, that's right. So over to you. Thank you, Preet. You'll have to excuse my voice today; my throat picked the worst time to stop working. But I'm Ankit. I'm a lead platform engineer and consultant for ThoughtWorks. I've been in the DevOps space for most of my career, and most recently I've been leading teams building self-service cloud infrastructure platforms for our clients here, mostly in Singapore. So today's talk is about, well, the state of DevOps: where we are, what's next, and how we got here. This title is a bit click-baity and slightly controversial, intentionally. But it's also an extremely popular title that a lot of people are throwing around these days. Have you heard "is DevOps dead"? Have you seen any of these conversations before? So, okay, so nobody's heard somebody proclaiming that DevOps is dead. All right, okay. So let me introduce you to this new debate happening on the platform engineering side of things. What's happening is probably a marketing department gone crazy. The actual question is not "is DevOps dead", but "is DevOps no longer relevant, and is platform engineering a replacement for DevOps?" That's what people are actually positioning. You'll see a lot of content titled "DevOps is Dead, Long Live Platform Engineering", et cetera, et cetera. Some people are doing it to evangelize platform engineering, which is good. Some people are explicitly refuting it, which is also probably good. So have you heard of platform engineering? Okay, so one, two people.
Okay, so I'll just poll the two people. Do you believe DevOps is dead? So there's a no from over there, right? And a non-committal response from here. So, okay. Let me start by answering the question before I go into the details. Yes, DevOps is dead, but probably just as a job description and a team name. And I say this as an ex-DevOps guy. Do you have DevOps in your job titles, or colleagues who have DevOps in their job titles? Right, okay. And have you experienced organizations where they have large, long-lived DevOps teams? Have you interacted with a team whose name is the DevOps team? Yeah, I see a few nods, so I'll take that as a yes. So most probably, and most hopefully, these teams will no longer be called DevOps teams, and everybody will change their job description from DevOps engineer to platform engineer. And that's probably a good thing, if you are in my camp on what DevOps is: it was never intended to be a team or a job title anyway. But also, DevOps is not dead. DevOps remains the North Star culture of enabling teams to own features end-to-end in the entirety of the design, code, test, build, deploy lifecycle. And platform engineering builds upon DevOps and enables DevOps. So like a good politician, I've given you both answers. A good consultant, since that's my job. Before we get into what platform engineering is, I'd like to show you a visualization of data I pulled from LinkedIn about the transition of people from the DevOps job title to platform engineer. So this is some data; as you can see, the evidence is quite compelling on the number of DevOps engineers who are transitioning to platform engineers. So it is happening. But what is platform engineering, and why do we need it? Let's look at it backwards: what are the current dysfunctions and pain points in the engineering process that we are trying to solve with this relatively new thing, which is platform engineering, and why are people claiming it's a replacement for DevOps? If you look at some of the common pain points that people are trying to address, rightfully so — you may or may not relate to some of them, but I relate to all of them — on a day-to-day basis, as a software engineer trying to deliver software features, one of the biggest dysfunctions, pain points and wastes is just waiting. You've probably all had the experience of having to write up a Jira ticket, send off an email, DM a friend in another team, because you need some dependency serviced by some other team to be able to do your job. So you sat there just waiting for somebody else to do their job, because you're coupled with them and dependent on them before you can do your job. This means you may have to context-switch a lot, bring in some other work, jump between tasks while you wait for your dependencies to be fulfilled. The other thing is cognitive load. I learned this term from the book Team Topologies; it's a really good way of describing load. It's not about the volume of work; it's about the human brain's capacity to simultaneously, in parallel, hold only X amount of context and do only X amount of work. Cognitive load happens when, for example, you need some configuration change in the service mesh layer; you're a product engineer, an app developer, and the service mesh team says, hey buddy, we are too busy servicing some other requests at a higher priority.
Why don't you just make a PR to our repo? And so, to make a three-line config change, you begin a three-week journey from servers to orchestration to networking to service mesh to be able to make that change. That's completely extraneous to your actual day job, it adds to your cognitive load, and you're spending time away from what you should actually be doing, which is writing features. A very interesting one is also the heroic unicorn figures and teams. We often have teammates who have really high potential, who are really talented, who somehow know everything related to the app we are building, and also all the surrounding infrastructure layers, and how CI and CD work, and all the tools we depend on that nobody else really knows. So if anybody is stuck on anything, you always end up going to these really talented few. This is also a way that cognitive load is masked and burdened onto a select few people. This is also a massive waste, even though it can look good, because these are probably your most talented folks in the team, and most of their energy should be directed at value work. And the other one is, of course, everybody's favourite: a whole bunch of meetings. Meetings are symptoms of team interactions not being set up correctly. If you're having to go into a meeting to explain to other people what you need, and there's a whole bunch of back and forth, it simply means that in the interface between the two teams, there's something wrong. So you can end up feeling like: okay, I was hired to do feature development on a product. I'm really excited, I want to master my craft, but all I've done for the last three months is learn about all the peripheral areas and the organizational structures and everything else other than actually mastering my craft, building my career and driving business value. So you're left feeling like: I just want to do my job. If you feel like that, it's not uncommon. This is a fairly old but still relevant survey, the Stripe Developer Coefficient, from last year, or '21, I forget. But essentially it says that waiting and struggling with bad tools — which is cognitive load, waiting on requests, etc. — are the highest reported areas of waste and pain, or dysfunction, in the process. So how did we get here? I'll try to speed along, because I don't know if I'll be able to squeeze everything into the 15 minutes that remain. So how did we get here? Well, in the beginning there was a big bang, and then nothing interesting happened for a long time, except around the mid-to-late 2000s, when two really interesting things happened in our industry. The first was: okay, Dev and Ops were doing their thing. They were separate, siloed units, and Dev would build, or at least code, and hand things over to Ops. This is a story all of you are quite familiar with, so I won't harp on about it. But what happened around that timeline is that microservices also started happening, and what that meant was that the Ops team, who had that one batch script or systemd unit file to manage one monolithic application on one server, or could just rinse and repeat that across multiple servers, now had to deal with massively distributed, thousand-application systems running on complex clusters needing orchestration, etc.
So the Ops team had to scale up to respond to this extra operational overhead of managing the same systems, and they had to build out sophisticated operational tooling. Just to give you a sense of how massive this landscape of operational tooling is, this is probably 30% of a screenshot from the CNCF landscape page. All of these tools are what classically belong outside of a product engineering team, right? So who owns these tools? More often than not, it's assumed that all of these things are bucketed into that infra, Ops, platform kind of team layer. So there's increased overhead and also increased sophistication and complexity of operational concerns. So that's one. The other thing that happened, somewhere around the same timeline, was that people realized we need Dev and Ops to work together. Specifically, we need developers to own a little bit more of the operational concerns. How do we do it? Nobody knows. A few people did it, right? But quite a few people did it in very interesting anti-patterns. These are two of my favourites, and the ones I have encountered most. The guys who wrote the book Team Topologies actually documented these things in 2013 at that link over there. So the first one is: you need Ops, so you say, okay, you build it, you run it. But run what? Run everything down to the hardware layer? That's almost impossible because of the sheer amount of complexity you end up owning. So over here we have Ops embedded in the team, or part of the team, and there's no real Ops support. And the other, even more popular one, is over here, where you just call the Ops team a DevOps team and are done with it. Now we're DevOps. This is the one I've seen more often, and this is why DevOps teams exist, and hopefully DevOps teams will die soon. Not to say that we won't make the same mistake with platform engineering. So what happens in both these anti-patterns: on the left, you get extremely high cognitive load, and you need those highly talented unicorn engineers to be able to deal with anything, because you just have so much complexity you own. You need to understand all the layers beneath your application to be able to "you build it, you run it". And what happens over there is you're just stuck waiting on and dependent on these massive Ops or DevOps teams that are servicing the entire organization, and almost always a DevOps team will be a bottleneck for an organization. It's a given: at scale, if you have one team doing ticket-based work raised by others, it's going to be a bottleneck for your organization. I don't know how many of you have struggled to get your work prioritized by an Ops or infra team over other teams. Hey, I need this database, I'll take you out for a beer, can you get my work done first? Okay, so what is platform engineering and how does it help? This is Evan Bottcher's definition of what a platform is. He wrote this back in 2017; there's an article on martinfowler.com, you can check it out. We don't need to read it all; we just need to look at the highlighted parts: all interactions are self-service, and it's a compelling internal product. Self-service and product: as long as we keep those things in mind, we can solve for the previous anti-patterns and actually try to make DevOps work.
So instead of something like this, if we could have a team that exposes all of these operational concerns to product teams — hey, I need a service mesh configuration, I need a cache, I need a database — in a way that these product teams can self-service, and at the right layer of abstraction, so they're not trying to figure out how caches and RDS work and what the subnet should be, what security group, blah, blah, blah; the product teams don't need to worry about that, and it's exposed as a self-service API. If you can move to this kind of team topology, it would let both these types of squads be DevOps. What this means is: I'm now able to own my feature end-to-end, because all the layers underneath that I need to manage are exposed to me at the right level of abstraction; I don't need to worry about all the details. And the team below is also a DevOps team, because just like any other software engineering team, this team owns the APIs they offer to other internal customers, at the right level of abstraction. So I think we need an example to really understand what's going on. A common developer journey is: hey, I need data persistence, so I need a database for my application. At this point, what happens? Usually, we reach out to another team, raise a ticket or phone a friend or whatever, and say, hey, can you please give me a database? And they say, hey, I'm inundated with requests, and also VAPT is happening and an audit is going on, blah, blah, blah; you've got to wait a week or two before I can provision the database for you. So you end up waiting a while until the database is provisioned; it really depends on when that team gets to your job. The alternative is "Dev doesn't need Ops", the other anti-pattern, where the team says: okay, I can't unblock you now, you really need to just do it yourself. Go to the AWS console and click your way to a database. So you do that, except then: hey buddy, where's the encryption configuration on the database? And you say, I don't know how that works, man, can you help me out? And then you're back to square one. So the alternative is a model like this, where a platform team internally exposes a database as self-serviceable infrastructure, and the product team consumes it. The platform team takes ownership of all the layers that make it self-service. It's productizing this self-service API, saying: okay, you don't need to worry about the encryption configuration, you don't need to worry about the backup configuration, you don't need to worry about security and compliance; all of that is baked in. Just read these documents and consume this thing. And the product team says: okay, as a product team, I probably just need to figure out what version of the database I want and how big I want it, which depends on my user traffic. That's the only part I will own. Now this team can own this configuration end-to-end, because, say they need to upsize the database, they can self-service that themselves. And say there's a change in any other layer: you could have many ways for product teams to consume updates pushed by the providing team, the platform team; semantic versioning could be one of them. So you end up with this model where you have DevOps teams exposing self-service APIs to other DevOps teams — I mean teams that practice DevOps, because they can own their stuff end-to-end now, because it doesn't cost 50 years of learning to be able to own it. Okay. And that's about it.
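As a purely hypothetical sketch of what such a self-service database API could look like when built on the operator/CRD model mentioned in the Q&A below, consider something like this; the Database kind, API group, and fields are all invented for illustration:

```bash
# A product team self-services a database by applying a custom resource.
# A platform-team operator would watch these and bake in encryption,
# backups, security and compliance behind the scenes (hypothetical setup).
kubectl apply -f - <<'EOF'
apiVersion: platform.example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres   # the product team only picks engine, version and size
  version: "15"
  size: medium       # everything else is owned by the platform team
EOF

kubectl get databases              # inspect what you own
kubectl delete database orders-db  # and clean it up yourself
```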
So in conclusion, DevOps is not dead. Platform engineering builds on DevOps and enables DevOps. That's my LinkedIn, if you'd like to connect. I think we have maybe a minute for questions or something like that. But thank you; that's all from me. Yeah. Yes. But it doesn't have to be a REST API. It can be: here's a README, here's some config you can push to your GitHub Actions workflow, and this workflow will provision a templated database with every security and compliance practice baked in. So it doesn't have to be an HTTP API or something unnecessarily complex. The most prominent way I've seen folks do it, which also feels like it will probably win out, is defining operator-based models and defining things as Kubernetes CRDs. That lets you offer APIs really, really cheaply. So you define a database type in your cluster, and then the cluster makes the database real in AWS land. So you can say kubectl get cluster, kubectl get database, kubectl delete database, etc., because it's a custom resource then. So does that answer your question? Yeah. It's a good question. I think it could be either. This is a really compressed and simplified version of all of this stuff. What you probably want to do is talk to whoever your customers are and see where they want to interface with you. Do they want to interface with you at a human API level, where you copy-paste some config into an action? Do they want a CLI tool, like a Heroku kind of thing, to provision your platform infrastructure, or something like that? So part of running a platform team is also having a product mindset and doing that research on what will work for your organization. You need to find a product owner, then you need to talk to your users, then you need to prioritize the most common pain points, then you need to create a minimum viable, super lean platform and test it out in the real world with your real users, and you need to find one team that will evangelize your platform for you. So it's quite like good product engineering in that way. Anything else, folks? Yes? Well, we do it for some of our clients, which I probably can't mention, and — not to talk only about my own company, though you can find some quite good resources from ThoughtWorks — a really good source is the stuff that comes out of Spotify. They have been on their own platform engineering journey, and they have a framework, or at least a tool, for building internal developer portals out there as well. So you can check out the things they publish. I think any organization that has more than 300 to 500 engineers will have some platform engineering effort going on, or should. Do we have time? Sorry, do we have time? So it really depends, because with a platform you could be solving a lot of customer needs, the customer being the internal product team. Which need you're trying to solve will determine which tool exists to solve that need. Specifically, because this was a provisioning self-service example, you could check out Crossplane, and you could check out the Open Application Model, which is something coming out of Alibaba and Microsoft. Crossplane is specifically designed for infrastructure as code, but it makes exposing an API to your consumers a first-class citizen of the tool; for example, I would argue Terraform does not do that yet. And the Open Application Model is a very holistic but very opinionated model for how product teams or developers should interact with the platform,
what the components of the platform are, and how it should work. I think we are out of time; we can talk later, or you can just connect with me on LinkedIn or something. Thank you, Ankit, for this insight. I'm pretty sure people still have the dilemma of whether DevOps is dead or not, so maybe they can get in touch with you offline. Welcome back. So we have Paul as our next speaker. Paul has vast experience in software development and big data. Today he is going to talk about KRaft, the new thing for Apache Kafka. So over to you. Thanks for coming along, everybody. This was a talk that I volunteered at the last minute because we had a spare slot. It's based on a 40-minute talk I gave at ApacheCon last year in the US, so I hope it fits in the time that we've got. I'm the open source technology evangelist at Instaclustr. I talk about quite a lot of open source technologies, but Kafka is certainly my favourite, and it's the first one I actually learned from scratch when I started my job at Instaclustr about six years ago, so it's the one I've built up a lot of experience with over the years. And the change from ZooKeeper to the new KRaft protocol is one of the more significant changes under the bonnet for Kafka in my experience. I think it probably fits in quite well with the previous talk as well, talking about platform engineering; Kafka is certainly one of the more pervasive pub/sub middleware technologies and the best-known open source one, used by just about everybody now around the world. So "Kafka abandons ZooKeeper in favour of KRaft": that's more or less the story. Apache ZooKeeper was quite a well-known Apache open source project for distributed systems coordination, and it's used by a lot of Apache projects, including Kafka, and it's actually really good. At ApacheCon last year I gave a talk about why Apache ZooKeeper was still so good, and then the next talk I gave was about how, well, actually, you don't need it for Kafka anymore, so it was a bit confusing potentially. This story is going to be assisted by a few train pictures; this is one I took in New Orleans last year at ApacheCon. Instaclustr provides a managed platform for big data open source technologies. We've got technologies for storage, streaming, analysis, search and orchestration. Yesterday I talked about Cadence; today I'm talking about Kafka, which is our main streaming technology, and up until now it was three components — Kafka itself, Kafka Connect and ZooKeeper — which we all ran as managed services. Kafka, just briefly, is a distributed stream processing system that allows distributed producers to send messages to distributed consumers via Kafka, which is a distributed cluster with multiple nodes and multiple things going on in the nodes, primarily topics and partitions. Kafka topic partitions enable massive consumer concurrency. Essentially, you've got producers sending messages to one or more topics, you've got consumers consuming from one or more topics, and the way the workload is balanced over consumers is that you can have more than one consumer in a consumer group, and each consumer can consume from one or more partitions. So that enables high consumer concurrency. This graph shows throughput on the x-axis versus partitions on the y-axis, which goes into the millions.
So the catch with Kafka consumers is that they're single-threaded — at least the default Kafka consumer is single-threaded — which means you need to increase the number of consumers as the consumer response time, or latency, increases. For example, if you want to achieve a throughput of 10 million messages a second and your consumer has a latency of 100 milliseconds, you're going to need 1 million partitions in that topic, which is a lot of partitions. Partitions are expensive: they have replication and metadata management overheads associated with them. Without going too much into the detail, the real problem is that the Kafka cluster itself has to manage the topic partition metadata, and it also has to cope with the replication; a replication factor of 3 is quite common in production Kafka systems, which means each message sent to a topic has to be replicated across three brokers, and that introduces quite a significant overhead to the whole system. So how does Kafka work? Basically, it has a controller: the Kafka controller does the control plane stuff, all the metadata management. There's only one active controller at a time; each broker can potentially host a controller, but there's only one active across the whole cluster at a particular point in time. The Kafka controller manages the broker, topic and partition metadata; it's basically the brain of Kafka. Most people don't even know it's there. If you're using Kafka as a managed service, you don't even need to think about the controller side; that's all handled by Kafka for you, and by the managed service provider normally, but that's actually how it works. Which controller is active, and where is the metadata stored? Essentially, the answer to that, up until a few years ago, was in Apache ZooKeeper. ZooKeeper is used for the controller election and for storing the metadata. ZooKeeper has the concept of an ensemble, which is really a cluster of ZooKeepers: you typically have three, you can have more, but probably not more than seven. Again, there's only one of those which is active at a time, the leader ZooKeeper. The active controller from the Kafka cluster communicates with the ZooKeeper leader and keeps track of the election data and the metadata as well. And it's pretty slow; that's quite a big difference. ZooKeeper isn't great in terms of write scalability: metadata changes, recoveries and failover are quite slow. Reads are pretty fast, though, due to caching in the Kafka cluster itself. So it worked okay; I mean, you could get some pretty big clusters up and running, and reasonable numbers of partitions. So the new KRaft mode is something that has come along in the last couple of years: Kafka plus the Raft consensus algorithm, abbreviated to KRaft. The Kafka cluster metadata is now only stored in Kafka, so it's fast and scalable, because Kafka is fast and scalable. The Kafka cluster metadata is replicated to all the brokers, so failover is a lot faster as well. The active controller is just the quorum leader, and it uses the Raft protocol to elect the leader internally, so it no longer has any dependency on an external ZooKeeper. So we had a couple of hypotheses that we wanted to test when we started looking at the new KRaft mode in Kafka last year. We were interested in whether there'd be any impact on the actual data workload performance, and we assumed that there probably wouldn't be: we guessed that we'd probably see similar results between ZooKeeper and KRaft for the actual data load.
But for the metadata changes and recovery from failover, the metadata performance, we knew ZooKeeper was pretty slow for some of those operations and we assumed KRaft would be faster; that was the promise, anyway. We knew that using ZooKeeper there were quite low limits on how many partitions per cluster you could get out of Kafka, and we assumed that using KRaft there'd be a lot more, hopefully. And on the robustness side, we knew ZooKeeper was pretty robust; there weren't many situations where ZooKeeper caused any problems, and Kafka itself was pretty reliable using it. But this was an unknown for KRaft, because it's a new feature. It has only just recently become production ready, and while it has been available for a while as a sort of proof of concept that people could try out and test in a developer context, we weren't sure whether it was going to work particularly well in a production environment or not.

So basically I did a couple of experiments using a fairly early version that we had available in our managed service. The first experiment was whether the data workload performance would differ much. We assumed there'd be only minimal or no difference between ZooKeeper and KRaft message throughput, and this was purely looking at the producer workload at this point. The theory is that ZooKeeper and KRaft are only concerned with metadata management, not the actual data workloads; Kafka producers only need read-only access to partition metadata, so they should be just as fast. How did we do it? We set up Kafka 3.1.1 on some identical AWS nodes with a replication factor of 3. And what did we find out? Well, it confirmed our suspicions: there was actually no difference, which you can see there because you can't actually see the blue line, it's hidden behind the orange line. The bottom axis here is the number of partitions, so we managed to get up to about 10,000 partitions in the cluster; partitions are on the x-axis, throughput on the y-axis, and the throughput is basically identical.

There is a cliff, though. This is something we've seen before with Kafka: when you hit a certain number of partitions you actually get a reduction in the throughput. This isn't too bad compared to some tests we ran about two years ago, where the throughput started dropping off at about 100 partitions; you only start seeing this drop-off now at around a thousand partitions, but it's still something to watch out for. Looking at partitions versus latency in milliseconds, not surprisingly, as the throughput drops the latency also starts skyrocketing, to the point where it would become unusable, which for a low-latency system like Kafka is a real issue, so you want to avoid that if you can.

For the second experiment we had a look at partition creation performance: how long does it actually take to create a large number of partitions, how many partitions can we create, can we actually create more on a KRaft cluster compared to the older ZooKeeper cluster or not, and how long does it take to do it?
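The talk doesn't show the exact tooling, but the two creation approaches Paul describes next, creating a topic with many partitions up front versus growing an existing topic incrementally, look roughly like this with the standard Java AdminClient; the topic name, partition counts and broker address are illustrative:

```java
import org.apache.kafka.clients.admin.*;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class PartitionCreationSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Approach 1: create a topic with lots of partitions straight off
            // (RF 1, as in the experiment, to minimise replication overhead).
            admin.createTopics(List.of(new NewTopic("big-topic", 10_000, (short) 1)))
                 .all().get();

            // Approach 2: grow an existing topic's partition count incrementally,
            // the programmatic equivalent of kafka-topics.sh --alter.
            admin.createPartitions(Map.of("big-topic", NewPartitions.increaseTo(11_000)))
                 .all().get();
        }
    }
}
```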
A few simplifications for this experiment. We used an RF of 1, because otherwise, with the replication that goes on with topics and partitions, the background CPU can be quite high when you have a high number of partitions and a high replication factor; with an RF of 1 there's no replication and very little background CPU going on. We could have used a bigger cluster, but for the sake of the experiment time we just set RF to 1 for this one, and even then there was still about 50% background CPU load on the clusters with a hundred partitions and no workload running.

We tried a few different things to create a large number of partitions. The first approach was just using the Kafka tools to create a topic with lots of partitions straight off. Then we tried using the alter command to increase the number of partitions on an existing topic incrementally. Then we tried curl with our inbuilt provisioning API, because we were having problems with timeouts with the first two options. And finally we tried a fourth approach, which was just a simple script to create multiple topics, each with a fixed number of partitions. However, there was a problem with all of these: all of them eventually failed, and the only real difference was how soon the failure occurred. With some of the failures, disturbingly, the Kafka cluster was basically unusable even after restarting the Kafka process on each node, so something was a bit strange.

This graph shows the number of partitions on the x-axis versus the creation time in seconds on the y-axis. ZooKeeper basically takes linear time to create partitions: the more partitions you want to create, the longer it takes, up to a point where there's an eventual timeout which stops you creating any more. Using KRaft it's constant time, so it's a lot faster; that's the orange line on the bottom. So it's actually very easy using KRaft to create lots of partitions, but still with the eventual failure occurring, mostly due to timeouts as well, which was interesting. The other approach we tried was the incremental one, so this is the time per thousand-partition increment, which does increase with the total number of partitions in the topic. ZooKeeper is the blue line, so it is a lot slower than KRaft, but KRaft does take some time too; it's still basically a slow process to create a large number of partitions, and we were still getting eventual failure. The initial conclusions: it's certainly faster to create more partitions on KRaft compared to ZooKeeper, but we were hitting a limit of around 80,000 partitions on both ZooKeeper and KRaft clusters, at which point Kafka ended up failing. So it's actually very easy and quick to kill Kafka on KRaft: just try to create a topic with 100,000 or more partitions and something inevitably goes wrong.

Another experiment we did was on the performance of the metadata workload, for example reassigning partitions, a common Kafka operation. For example, if a server fails you can move all of the leader partitions on it to other brokers, and there's a simple command to do that; our tech ops people are doing this all the time for customers. You run it once to get a plan and then again to actually move the partitions. Moving partitions from one broker to two other brokers was the experiment we performed, with 10,000 partitions and a replication factor of two this time, and we got the answer to life, the universe and everything. Yes, it's 42, but what's the question?
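The command Paul is referring to is presumably kafka-reassign-partitions.sh, which first generates a plan and is then run again to execute it. The same operation is also exposed through the Java AdminClient, roughly as in this sketch; the topic, partition number and target brokers are made up for illustration:

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.TopicPartition;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class ReassignmentSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of "demo-topic" so its replicas live on brokers 2 and 3
            // (e.g. off a failed broker 1), with RF 2 as in the experiment.
            Map<TopicPartition, Optional<NewPartitionReassignment>> moves = Map.of(
                new TopicPartition("demo-topic", 0),
                Optional.of(new NewPartitionReassignment(List.of(2, 3))));
            admin.alterPartitionReassignments(moves).all().get();
        }
    }
}
```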
Well, okay, so the question is: how many seconds does it take to reassign 10,000 partitions using KRaft? 42 is the answer. It takes quite a significant amount of time using ZooKeeper to do the same operation, around 600 seconds, and that's quite a significant overhead when one of your nodes is unavailable on a cluster, so 42 seconds is certainly a lot better. Just noting, though, that there was actually no partition data for this experiment, so in real life it's quite likely that the time to move the data is going to dominate.

Experiment 4 is the one I was really excited about. Everyone had been saying you can get millions of partitions with KRaft. Millions of partitions is probably a silly thing to be trying to achieve, but from the point of view of science and doing an experiment, we all had that goal in mind: if we could hit a million partitions, we would have proved that KRaft was actually doing what it was advertised to do. So this was my final attempt to reach a million or more partitions on a cluster, again with RF = 1, because I didn't want to need an enormous cluster to achieve it. We had to cheat a bit: we used a manual installation of Kafka 3.2.1 on a large EC2 instance, so it's not actually a cluster at this point either, and we were still hitting limits at around 30,000 partitions.

This is the sort of error we were getting. It was a bit of an odd one, not one I'd seen before: a "Map failed" error, something to do with the Java runtime and the available memory. A slightly odd error, because we knew we had lots of spare RAM, and we hadn't encountered this sort of error before on very large clusters either. So we tried increasing the amount of RAM, and we tried things like increasing the number of file descriptors, which we knew was actually quite important. Because I was just doing this myself as an experiment, rather than relying on our managed service version of Kafka, which has all these settings carefully configured for customers, I had to recreate some of our settings. Typically we do have lots of file descriptors; 65K is the default on Linux, and for Kafka you basically need a lot more than that. But it still didn't work: we had plenty of spare RAM, yet we were getting this out-of-memory error.

Googling this type of error actually did reveal what the problem was. On Linux there's a setting which determines the maximum number of memory map areas a process, such as a JVM, can have, and again the default is only around 65,000. Because Kafka uses two map areas per partition, that actually limits you to about 32,000 partitions, which was roughly the number we were hitting the limit at. So we set this to a very large number and tried again. Did we reach a million? Well, sort of, yes. We managed to hit around 600,000 partitions just on the one Kafka broker we were experimenting with, and by inference, for a three-node cluster, we'd actually be hitting over the one million partition mark, about 1.9 million partitions we estimate. We have actually redone this experiment since (this is a bit out of date) with our new managed service with KRaft, and that's exactly the sort of number you can get with KRaft mode. So that's a lot of partitions on Kafka. Typically our production customers only have hundreds of partitions on a Kafka cluster, so you can go much bigger in theory, but I don't think there's much practical application at this point in time. Okay, what about the batch error? It's actually painfully slow to create this many partitions,
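The Linux setting being described is almost certainly vm.max_map_count, whose default of about 65k map areas, divided by the roughly two mmap'd areas Kafka uses per partition, gives the ~32,000-partition ceiling mentioned above. A sketch of checking and raising it (the new value here is an arbitrary illustrative choice):

```
# Default on many Linux systems: ~65k memory map areas per process
$ sysctl vm.max_map_count
vm.max_map_count = 65530

# Roughly two map areas per partition, so 65530 / 2 is about 32k partitions,
# which is about where the "Map failed" errors started appearing.
# Raise it to something generous, then restart Kafka:
$ sudo sysctl -w vm.max_map_count=2000000
```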
due to a batch error you hit when you're trying to create too many partitions at once. This actually turned out to be a real bug: the quorum controller has a limit to how many things it can handle, basically in the maximum batch size. The promise is that it will be fixed in 3.3, but we haven't tested that yet.

So, some takeaways about KRaft. It's fast for data workloads; there's no difference, in fact, compared to ZooKeeper mode. It's certainly faster for some of the metadata operations we've tried. You can have clusters with more partitions, potentially more than 1 million, but you still have to watch out for some of the operating system and JVM configurations to support clusters with that many partitions. And trust your managed service provider, because they've probably done the same experiments we have, though we're constantly learning. Should you use Kafka KRaft mode yet? Well, yes: 3.3.1 is production ready, and by Kafka 4, which will be next year, there will probably be no ZooKeeper mode at all anyway, so it's probably time to start exploring KRaft mode. We provide Kafka 3.3.1 in public preview, I think, at the moment, and you can test out a free trial for a couple of weeks using some quite reasonably sized Kafka clusters on various cloud providers; we support AWS, Google, Azure and a few others as well. We're providing that as a public preview rather than final general availability because we're still learning how KRaft behaves, mainly for our tech ops people, so it's giving them some experience, and it's also giving our customers experience in using Kafka with KRaft, just in a development context initially.

And that's it. That was a 40-minute talk shrunk down to 15 minutes, so there's a lot there, but hopefully it gives you a bit of an overview of the new KRaft mode, and it will be the future of Kafka; I don't think there's much choice about that. So thank you very much. Any questions? I guess there's lunch next, so ask me over lunch if there's anything you'd like to know about Kafka in general, or KRaft in particular, or anything.

Okay, thank you. So now I think we are taking a break, and in the second half we have a few other important sessions lined up, so meet you again then. Bye.