Yeah, so back in October 2012 I was working in the telecom industry, and I had gotten involved in one of the first CI/CD efforts at my workplace. My project lead came by and explained to me what the current situation was. He said: "So, Erik, we make releases roughly every six months, right? And development of those releases typically starts around a year before the actual release, but the verification department doesn't really get started until about three months before the release date." So I was thinking: surely this means that sometimes the developers have to wait a long, long time between developing a feature and getting feedback on it. Worst case, it can be several months. So if they get a bug reported to them, they have to make the mother of all context switches, which is to try to get back into what they were doing. And if you are anything like me and have a problem remembering what you did this morning, imagine trying to remember what you did seven or eight months ago. It's tough. My project lead then said: "This delay between development and verification, we need to get that down from a couple of months to a couple of hours." And I was thinking, this is a crazy big step. But it turns out, and I've been shown over and over again since then, that the step is actually not that crazy. But you do need good inputs, you do need to know where your bottlenecks are, and for that we need metrics. So, I am Erik Stanensson, and I'm Andrea Frittoli. Today we welcome you to "Building DevOps metrics for your choice of CD tools through CDEvents". We are going to talk about CDEvents and introduce them, we are going to talk about DevOps metrics, and we're going to discuss how the two technologies actually fit together. Thank you, Andrea.
So I'm going to start by talking a little bit about the CDEvents project. What we're trying to achieve on a conceptual level is to define a common language for communication in CI/CD solutions and surrounding areas. What we are concretely doing to try to achieve this is two things: we are developing a specification for events in CD, and we are building a number of software development kits, or SDKs, for producing, sending and receiving these events. Let's dig a little bit into the specification first. In the specification we define a number of events. They can be related to source code, they can be related to pipelines, they can be related to deployments or other areas that we want to cover. For each event we define what the mandatory and optional parameters are that the sender might want to communicate and that the receivers might want to know about. Also, since we build on top of CloudEvents, which is a CNCF project, we provide some guidelines for how to use the fields that CloudEvents provides when you're sending these events. Based on the specification we can then produce a number of SDKs to help people send and receive events in their own languages and platforms. If anyone attended the excellent talk by Eddie Zaneski yesterday about generating clients from schemas: that is also something that we want to do. Andrea has actually been working on the schema quite a lot lately, and we have a goal to be able to do that, but we're not quite there yet. With these SDKs, things like producing a new "pipeline run queued" event can be done without having to manually put together a JSON object; you can use the SDK to create your message and send it. Then, with these SDKs and application code to produce events, we can work on integrations and proof of concepts to try out the things that we are defining, and see what attributes are missing and what information is needed by
an observer or a receiver of this message. And it sort of feeds back into a development cycle. Actually, I probably shouldn't call it a cycle; it's more like a reverse tornado, just trying to pick up threads of information, nuggets of ideas and useful stuff, and create some order out of it. That's what we're doing. And when I say "we", it's not just the CDEvents team, which is responsible for maintaining the specification. We actually get a lot of help from people involved in other CI/CD projects. At least one of the SDKs that we have, and quite a lot of the proof of concepts and integrations, have been built by members of other communities that we have collaborated with. But let's talk a little bit about what we want to achieve with CDEvents. Why are we doing this? We have covered this in more detail in previous talks, so for the purpose of this session there are two goals that I want to focus on. One is interoperability: making things work together by providing a common language. Switching out a tool should not be hell; it should be quite easy, because the tools are all speaking the same language. The second one, which is more relevant for this talk, is observability: providing directives on what to send and when to send it, to tell the surrounding world what you are doing and what you are achieving. And through this observability comes a great opportunity for building metrics. We have some metrics that we're going to talk about today. Andrea, do you want to talk about those?
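As a rough illustration of the structure Erik described, here is a minimal sketch of what a CDEvents-style "pipeline run queued" message could look like if assembled by hand (the thing the SDKs spare you from doing). The field names, type string and version numbers are assumptions for illustration only; the actual spec and SDKs are the authority.

```python
import json
import uuid
from datetime import datetime, timezone

def make_pipelinerun_queued_event(pipeline_name, source_uri):
    """Sketch of a CDEvents-style message: a CloudEvents-inspired
    context plus a subject describing the pipeline run.
    Field names and version strings are illustrative assumptions."""
    return {
        "context": {
            "version": "0.3.0",                # assumed spec version
            "id": str(uuid.uuid4()),           # unique event id
            "source": source_uri,              # who produced the event
            "type": "dev.cdevents.pipelinerun.queued.0.1.1",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "subject": {
            "id": pipeline_name,               # id + source identify the subject
            "source": source_uri,
            "type": "pipelineRun",
            "content": {"pipelineName": pipeline_name},
        },
    }

event = make_pipelinerun_queued_event("build-and-test", "/tekton/ci")
print(json.dumps(event, indent=2))
```

An SDK would expose this as typed constructors rather than raw dictionaries, so producers cannot omit mandatory fields or mistype the event type string.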
Yes, thanks Erik for the great introduction to CDEvents. Today we are going to focus on a special kind of metrics, which we call DevOps metrics. While DevOps practices are becoming ubiquitous, different organizations are at different points in their journey to DevOps, and continuing on this journey requires a continuous investment of resources. So these organizations need to understand if their investment is paying off: if they're getting a return, if they're investing resources in the right areas, if they're focusing their teams on the right tasks. Apart from this, the recent surge in software supply chain attacks is not making things easier, because it puts extra pressure on the teams to start shifting security left, implementing pipelines in a secure way, making sure that the build systems are approved and secure, and so forth. I wanted to mention the State of DevOps Report, published by Puppet. Since 2014 they have been looking at how organizations have been implementing DevOps practices. What they do is divide the organizations into three groups, low, medium and high, depending on their level of success in this implementation. Something they have seen over the years, almost 10 years now, is that there is a strong correlation between this level of success and the numbers they get on a specific set of key metrics that they have identified. To calculate these DevOps metrics you need data, and as you can imagine, collecting this data doesn't scale very well with an increasing number of repositories, teams and applications. If you have one repository, you may go and collect this data manually, but if you have a larger organization,
it doesn't work that well. So what kind of data are we talking about? The four metrics that the State of DevOps Report refers to are the DORA metrics. You may have heard about them; they've been developed by the DORA group, the DevOps Research and Assessment group. These metrics are: deployment frequency, so how often do we deploy into production? Lead time for changes, so how long does it take for a change in the code to actually get into production? Change failure rate: how often does a change in production cause an issue or a failure in our application or service? And finally, time to restore service: how long does it take after a failure to get back to the service working nicely? Something that is apparent looking at this list already is that no single tool is going to be able to provide the data for all the metrics. If you look at the CD landscape, it's relatively fragmented; we have tools which focus vertically on areas of your entire pipeline, so you will need a combination of tools to collect all this data. And here is where we think CDEvents can help, because on the producing side,
when you're creating the events, it provides kind of a minimum data model for the tools, so that what they produce is interoperable with the rest of the world. And on the receiving side, it makes ingesting the data, consuming it, storing it and processing it easier. What we believe is that with CDEvents we can foster an ecosystem of tools that can process this data. And as more tools become compatible with CDEvents, it will also lower the barrier for introducing new tools or switching out tools. Let's say you have your setup, with your toolchain and your workflows running, and you're collecting metrics, and you decide to introduce a new tool or switch out one of the tools. As long as the new tool produces CDEvents, it should seamlessly integrate with the existing metric collection that you have in place. So let's look a bit at how we produce the metrics that Andrea talked about. The first one we're going to look at is probably the most straightforward one, and it's deployment frequency: essentially just counting how often and how many times we deploy. Let's pretend that we have a setup like this. We have an environment, a production environment, running some form of service. We're going to use a potato for this example, because it's fun. We have a new version of this image coming in, and we want to deploy it. We will say that we both want to upgrade the existing environment and deploy a new environment and install the service there, through some orchestrator. It could be Tekton controlling kubectl, it could be Argo CD, Spinnaker, Keptn, something else that helps us do this orchestration. What we provide in CDEvents are two related events: one called "service upgraded", for when you upgraded an existing environment, and one called "service deployed", for when you deploy a new environment. And just by counting those messages, an
observer has enough information to be able to count deployment frequency. So that was not very difficult; let's move on to something slightly more involved. The next one is lead time for changes: how long does it take us to go from an approved change to this change actually being built, packaged, uploaded and deployed? Let's go through that journey. We start with the change, and we have the same observer as we had last time. The first thing we're going to do is make this change available in our repository, so we involve the SCM. We create a pull request, we get it approved, and when it's merged, a "change merged" event is sent, telling the surrounding world some information: what is the repository where this change has happened, what is the SHA of the commit, things like that, that can help uniquely identify this change down the line. Next, we're probably not going to just run the source code; we need to build it. As part of the build, or as the last step of the build, we have some sort of artifact created. It could be a JAR file, it could be a container, or an image rather. When that is available, we send an "artifact packaged" event, and this connects back to the source: we know the change that we built this from, and it also identifies, in this case, the image. But just having an image on the build server doesn't really help anyone, so the next step is to connect to a registry. We upload the image, and then the registry can send an "artifact published" event. This announces to the rest of the world that this artifact is now ready to be used, which means that it can be picked up by an orchestrator, which can deploy it, as we saw in the previous picture, maybe upgrade something, and then send a "service upgraded" event. With these we again have enough information to calculate lead time for changes. But Andrea, I think we've had just about enough slides for some time. Are we brave enough to demo this live?
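Before the demo, the two computations just described can be made concrete with a small sketch of how an observer might derive deployment frequency and lead time for changes from a stored stream of events. The flat event shape and the type names here are simplified assumptions for illustration; real CDEvents carry a full context and subject.

```python
from datetime import datetime, timedelta

# A simplified event stream, as an observer might have stored it.
events = [
    {"type": "change.merged", "change_id": "abc123",
     "time": datetime(2023, 5, 1, 9, 0)},
    {"type": "artifact.packaged", "artifact": "potato:0.1.1",
     "change_id": "abc123", "time": datetime(2023, 5, 1, 9, 20)},
    {"type": "artifact.published", "artifact": "potato:0.1.1",
     "time": datetime(2023, 5, 1, 9, 25)},
    {"type": "service.upgraded", "artifact": "potato:0.1.1",
     "time": datetime(2023, 5, 1, 10, 0)},
]

# Deployment frequency: count "service deployed"/"service upgraded" events.
deployments = [e for e in events
               if e["type"] in ("service.deployed", "service.upgraded")]
print("deployments:", len(deployments))

def lead_time(events, change_id):
    """Lead time for one change: merge time to deployment time of the
    artifact built from that change, linked via change id and artifact id."""
    merged = next(e for e in events
                  if e["type"] == "change.merged"
                  and e["change_id"] == change_id)
    artifact = next(e["artifact"] for e in events
                    if e["type"] == "artifact.packaged"
                    and e["change_id"] == change_id)
    deployed = next(e for e in events
                    if e["type"] == "service.upgraded"
                    and e["artifact"] == artifact)
    return deployed["time"] - merged["time"]

print("lead time:", lead_time(events, "abc123"))  # 1:00:00
```

The chain of identifiers (change SHA into "artifact packaged", artifact id into "service upgraded") is exactly what lets the observer join events coming from entirely different tools.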
Yeah, let's give it a try. Okay, so first I'll quickly introduce the demo setup that we're using for these two metrics. Can I use the pointer? No... maybe. Okay, so here in the top corner we have the sources of events. We're using Gitea as an SCM, where we do our code changes, pull requests and so forth. We have a container registry where we are storing our container images, and we have a Kubernetes cluster running on IBM Cloud, where actually the entire thing is running. And we have a nice trick, through Knative, to take events from the API server in Kubernetes and convert them into CloudEvents. So we take all these sources of data and feed them into a project called Tekton Triggers, which allows us to react to HTTP requests with a JSON payload, and through the Go SDK, plus some other tooling that we created here, which we call the "cd eventer", we basically generate our CDEvents. Here in the middle, this is a CloudEvents broker, powered by Knative Eventing. Basically every message, every CDEvent that is produced, is deposited here on the broker; it's like a channel. The nice feature of this is that you can then define triggers, and triggers are like subscriptions: you can decide how many clients and which client will receive which messages, and you can add filter logic and so forth. For this demo specifically, we have a trigger here that takes all the CDEvents and sends them to another instance of Tekton, which basically allows us to visualize them, so that for the purpose of the demo we can see those events. They are actually being stored in etcd, but you could have another database where you store all your events, a document database or something like that, or other tools that you can use to visualize events. So let's see if we can show the demo. I need to mirror the display first. Hopefully this is readable. Okay, so Erik presented two metrics.
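The conversion step Andrea mentions, turning an incoming webhook payload into a CDEvent, can be sketched roughly like this. The payload fields and the event type string are hypothetical, modeled on typical SCM merge webhooks; they are not the demo's actual code or Gitea's exact payload format.

```python
import uuid
from datetime import datetime, timezone

def merge_webhook_to_cdevent(payload, scm_url):
    """Map a hypothetical Gitea-style pull-request-merged webhook
    payload onto a simplified CDEvents-style "change merged" message."""
    repo = payload["repository"]["full_name"]   # e.g. "owner/repo"
    sha = payload["merge_commit_sha"]           # the merged change's SHA
    return {
        "context": {
            "id": str(uuid.uuid4()),
            "source": f"{scm_url}/{repo}",
            "type": "dev.cdevents.change.merged.0.1.0",  # assumed type string
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "subject": {
            # id + source together uniquely identify the change
            "id": sha,
            "source": f"{scm_url}/{repo}",
            "type": "change",
            "content": {"repository": {"id": repo}},
        },
        "customData": payload,  # keep the original webhook for reference
    }

hook = {"repository": {"full_name": "team/potato"},
        "merge_commit_sha": "abc123"}
event = merge_webhook_to_cdevent(hook, "https://gitea.example.com")
print(event["subject"]["id"])  # abc123
```

Carrying the original payload along as custom data, as the demo does, means nothing from the source system is lost even though consumers only need the standardized subject.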
The first one is the deployment frequency. So the first thing that we are going to do is deploy an application. We are using the potato application; this is the container image that we're using, and I'm just going to create a deployment from that image. Right, so we can have a look. Isn't he fantastically cute? And here it is, in version 0.1.0. The other thing that we can do is open the Tekton dashboard. Here in the dashboard, let me make it a bit bigger, you can see a "service upgraded", or actually a "service deployed" event here, that was sent just a few seconds ago. And if we look into the event: okay, so we have the context with the type. In the custom data we put all the data that was in the original event, so this is coming from the Kubernetes API. And finally, if I scroll to the bottom, this is the core data model that we are providing. So this is a subject, and the subject is a service. The source: this is the IP address of our Kubernetes API service. The ID is the ID of the subject; the subject is the application that we deployed, so in this case it's a deployment, and this is the whole path to the deployment. The key is that the combination of ID and source must be unique, so it must uniquely identify the resource that we are talking about. Additionally, we get extra context, like the artifact ID, so this is the image that we just deployed, with the SHA, and the environment where it was deployed. For the demo we are using the namespace, so this is the Kubernetes namespace where we deployed the application. All right, and you can see, I tested this demo a few times, and the nice thing is I get a number of "service deployed" and "service upgraded" events. They've been stored in etcd and I can see them here, and I could run some queries and get some data about how often am I deploying and how often am I upgrading the service.
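A sketch of the kind of query just described: counting, from the stored events, how often each service is deployed or upgraded. Because tapping the Kubernetes API can emit repeated status-update events for one rollout, this sketch also collapses exact repeats with a simple same-artifact filter. The flat event shape is an illustrative assumption, not the stored format from the demo.

```python
from collections import Counter

# Illustrative stored events; real ones carry a full CDEvents context.
stored = [
    {"type": "service.deployed", "subject": "prod/potato",
     "artifact": "potato:0.1.0"},
    {"type": "service.upgraded", "subject": "prod/potato",
     "artifact": "potato:0.1.1"},
    # duplicate status update for the same rollout:
    {"type": "service.upgraded", "subject": "prod/potato",
     "artifact": "potato:0.1.1"},
]

def count_rollouts(events):
    """Count deploy/upgrade events, collapsing repeated events for the
    same (subject, artifact, type) combination into a single rollout."""
    seen = set()
    counts = Counter()
    for e in events:
        if e["type"] not in ("service.deployed", "service.upgraded"):
            continue
        key = (e["subject"], e["artifact"], e["type"])
        if key not in seen:
            seen.add(key)
            counts[e["type"]] += 1
    return counts

print(count_rollouts(stored))
```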
It's a bit more verbose than you might expect, because in this case I deployed the application and I also got some "service upgraded" events. This is because we are basically tapping directly from the Kubernetes API server, and every time there is a small change in the deployment definition, the status updates and it sends an event, so we get that event as well. But let's continue. From here, the other metric that we wanted to talk about is the lead time for changes: how much time it takes for a change to get into production. First we have to actually create a change, so I'm creating a branch and opening the Dockerfile, and we're going to run a different version of the potato application, so I'm changing to version 0.1.1. Okay, I'm pushing the branch. Right, we get a handy link here, and we open up the Gitea interface. We can see the change that was made here; looks good, so let's create a pull request. Right, so Gitea is configured with webhooks that are pointing to our environment, to Tekton Triggers. So now I merge this change, and let's switch back to the Tekton dashboard. Yeah, you can see it's running now; we just got a "change merged" event. Okay, and similarly to before, we have the type in the context. In the custom data we have the original event from Gitea, with all the details from the pull request, the owner and the repo, etc. And here we get the slice of the data model that we define a standard on, on the CDEvents side. We have the ID, which is the Git SHA of the change that was just merged, and the source, which is the URL to our Gitea instance, slash the owner and repo, so that, again, the combination of these two uniquely identifies this subject. And we get some extra content, like the repository ID. So, next step: we have a change that we have made, and we want to run a build with the new version. This is going to take a second, but not too much.
Hopefully. The build script is going to create a new version of the container image, and it's going to send an "artifact packaged" event about it. Then it's also going to push this new container image to the container registry. The container registry has a webhook configured; it's a nice feature I found in the Azure Container Registry, actually, you can configure webhooks. So it comes down to Tekton Triggers again, and we convert it into CDEvents. Okay, so we have the new container image here, and we can move on to the new image. All right, let's go back to the Tekton dashboard. And we have here two new events, "artifact packaged" and "artifact published", as we expected. In the context we have the event type, and our subject is an artifact. The source is the script that I just executed. The ID here is the artifact ID; the format is not yet in the spec, but we are considering using purl, the package URL, as a format. In this case it's a Docker image, but it could be another type of artifact, it could be a JAR file or anything else, and we want to make sure that we have a consistent format for addressing artifacts. What is interesting here is that for the "artifact packaged" event we get a link to the last change: this is the Git SHA and the repo of the last change that is included in this artifact. We're simplifying the model a bit here; we are considering an artifact that comes from a single repository, and in the future we plan to extend it to multiple repositories. So this allows us to link the change event with the "artifact packaged" event. The other event is "artifact published". This comes from the container registry, and the container registry does not necessarily know about the repository this comes from; that's why we have two events to create the link. So we have the event type; in the custom data we get, again, what was coming from the container registry originally; and here we have the purl, which is consistent with what we've seen
before, and the source, which is this instance of the Azure Container Registry. Right, so we have got the build done. The next thing that we want to do is actually update our deployment, so we can use kubectl set image on the deployment we created before, and set the image to the new image we just built. All right, and if we do that and go back to our friend here: there you go, we got the 0.1.1 version, with arms raised and shoes. We have a "service upgraded" event. This is similar to the "service deployed": we got the type, we got the Kubernetes API server data, and we got here the artifact ID and the environment, production. So these events, put together, give us enough information to calculate how long it took for the original change we made to get deployed into production. And we can do this for every change that we make in our system, or in our multiple repositories, and calculate this metric consistently. Right, that's all for the demo, so let me switch back to the slides. Okay, so with this we've seen the first two DORA metrics, and we've seen them in practice. There are two other metrics we wanted to discuss, that we are working on. The first one is the change failure rate: how often a change that we are implementing is actually causing an issue in production. This is interesting from an event point of view, so let's dig into it. For this we could consider the metric as a ratio between the number of deployments that we are doing and the number of issues, or incidents, that we are seeing in production. For the number of deployments, we already have this information. For the number of incidents, we don't. I mean, we could look into the versions that we are deploying, make some assumptions, and say: okay, actually we are rolling back to a previous version, so that means that there was an issue. But this does not give us a complete picture, because counting incidents actually requires more information, as incidents may come from different
kinds of causes. It could be that there is an issue in the application that I'm deploying; let's say I'm putting out a new potato server with the wrong color of the shoes, no one likes the application, so someone files a bug, and that's an issue. It could be that there is a configuration error: we are working, for the demo, in a Kubernetes type of environment, and maybe there is an issue in the manifest, they put the wrong SHA, something doesn't work. Or it could be an environment error, so something external to the application, maybe a dependency we depend on that is not working anymore. Or it could be something else. And there's also the question of who can discover the incident, who can produce the information about the incident happening. Again, it can be an orchestrator, like Kubernetes or whatever system we use. It can be a monitoring system that is watching the application: something is wrong, maybe the performance is degraded. It can be the application itself. Or it could be an end user, or a DevOps team, which is looking at the application and says: okay, well, something is wrong here. So how can we deal with all these possible sources and types of incidents? We want to try and solve this with an incident event. This is not part of the CDEvents spec yet, but we want to extend the spec to include a new bucket of events. We group our events in different areas, and this would be a new bucket of events that would allow us to model this kind of incident. And so you would have, like we described, incidents from the operator, from a monitor, from the application itself, or from external users, all going to an observer. We would collect these incidents, and we will need to make sure that we have enough context in these incidents to be able to correlate them with each other, so that we can know how many incidents we actually have, and produce the change failure rate out of that. And for the next metric, similarly, time to restore service, we need to
know about restoring the service, so when things come back to working fine. And again, what can restore the service? We could change the version: roll back the image to the previous version, or install a new version. You could have a configuration change. Scaling up the service, maybe it's under pressure, or even adding new hardware. Or maybe it's something completely external that is solving the incident. Oops, sorry, too quick. And so the key thing is that we need to have enough information to correlate this back to the sources of the incident. So we'll need to make sure that in the data model we have things like the environment ID, and the artifact that is deployed, with the version, so we have enough information to connect these events to the previous events, and that will make it possible to calculate the metric. And yeah, we think that a lot of the value from CDEvents is that we will have a common way to address this data model, these fields that indicate what the environment ID is, what an artifact ID is, so that you can reason across events that may come from many different sources. And again, with events in place, if we look at this on a similar diagram as before, we can see resolution-type events coming from the different sources; we collect those as an observer, and finally we can calculate the time to restore. Something we want to mention as well: there is a project called Keptn within the CNCF, and they use CloudEvents today, and they have some modeling around this type of events. So we are discussing with the project to see if there is a fraction of the events that they define today that we could use as a standard for CDEvents as well. So we look forward, next time, to being able to demo these as well. Thank you Andrea. It's been a little bit sweaty this afternoon, I'm not sure if you noticed, but I think network stability took an early lunch today. But it came back, so it's really nice that it worked. But we're
nearing the end of our talk, so we would like to leave you with a few messages to take away. First: metrics are tricky. There are some really important metrics that we want to be able to capture, but there are also a whole lot of different sources of data for these metrics, and it's our firm belief that having a common language really helps here. Rather than having to understand 10 or 50 or 100 different types of events or hooks or anything like that, if we can get it into one language, one set of events, then observing this is, or can be, a lot easier. And that is what we would like to achieve with CDEvents. But we are not going to be able to do that alone. In the demo, Andrea showed translators between existing webhooks, or event-type things, and CDEvents. We have had support from some communities to get various solutions to natively produce CDEvents. Those are the things that are going to be needed for CDEvents to have any level of success, and for people to be able to benefit from the observability and interoperability that it provides. So what me and Andrea would really like is for anyone who is interested in the work that we're doing, who wants to read more or wants to get involved, to go to cdevents.dev. There you can find the spec, you can find the SDKs, you can find our communication channels, you can find GitHub and all that stuff. So yeah, we are looking forward to getting in touch with you. With that, we would like to thank you all for attending the talk. I've been Erik Stanensson, this is Andrea Frittoli, and we would like to open up the floor for questions. If anyone has a question, I have a mic here. And thanks everyone for being with us on a Friday afternoon, after a long conference week, when we're all kind of exhausted, so I appreciate it. Yes. Yes.