Hello and welcome, everyone. Thank you so much for joining us today for our webinar. We're going to be exploring a very interesting and pertinent topic: boosting engineering efficiency with OpenTelemetry, Keptn and Tyk. Specifically, we'll look at how insights can drive efficiency in organizations, and we'll explore different sides and aspects of that. I am Buddha, product evangelist and developer advocate here at Tyk, and I'll be your host, co-presenter and facilitator for today's session. I'll take you through the entire journey of this conversation, picking up your questions as we go along. Hopefully we'll have a really good time together.

Joining me on this journey today are two amazing panelists, starting with Andreas, or Andy. He is a DevOps activist at Dynatrace as well as a developer advocate for Keptn. Super cool titles, all of them. Hello and welcome, Andy; it's really good to have you with us.

Thank you for having me. And yes, please keep calling me Andy; that's what I always say. Friends call me Andy; if I ever offend somebody, feel free to call me Andreas. Otherwise, Andy is fine.

Andy it is, then; I think we'll keep it that way. Thank you so much for joining us, Andy. Also joining us is our very own Sonya. She is the group product manager here at Tyk, but more importantly, she is also the subject matter expert and the driving force behind Tyk's OpenTelemetry work. Thank you so much for joining us, and we're really looking forward to hearing more about all things OpenTelemetry. Welcome, Sonya.

Thank you for having me. Advocating for OpenTelemetry and getting insights from it is a topic that I'm really passionate about.
I'm looking forward to it as much as everybody else here. A few housekeeping things: if you have any questions at any point, feel free to post them in the Q&A section below. We'll be taking questions towards the end of the session, but if something specific pops up during the discussion, I'll keep an eye out for it. We're also live streaming on YouTube right now, so I'm watching the comments there too. If you have any feedback or anything that comes to mind, feel free to add it there as well.

So what do we have in store for you today? Beyond the introductions, we'll be looking at GitOps. It's a very pertinent topic when we talk about efficiency, specifically from a DevOps perspective; GitOps comes up pretty frequently, and we want to dig a little deeper into the what, how and why of it. We'll then look at the DORA metrics, why they matter, and how you measure DevOps efficiency through them. We'll introduce the concepts of OpenTelemetry and the benefits around it, followed by observability of deployments using Keptn. This is where Andy will come in with a demo showing how you can get better insights when you're doing deployments and make them more efficient. We'll follow that up with Sonya, who will tell us all about API observability with Tyk and demonstrate how you can troubleshoot your APIs and get better insights for them with Tyk and Dynatrace. So, all very exciting things to come. And last but not least, we have the Q&A segment, where we'll have a bit of a discussion and take questions from the audience, and perhaps a few from me as well, because I'm a curious mind and I'm learning as I go along. With that being said, let's go over to our topic for today.
So, where do we stand in today's world? We've had a pandemic, we have a war, and we have economic uncertainty to the point of a panicked market, plus regulatory failures in certain cases. All of this has led to an environment where things are not as predictable anymore. It's quite uncertain; it has forced organizations to take drastic steps, including letting people go. It has all been a little messy, to say the least, and we don't know exactly how things will progress in the near future: things may stay this way for a while, or hopefully get better sooner rather than later. All of that is to say the conversation has now shifted to: what can we do with what we've got today? How do we do more with what we have? And that conversation immediately turns to the idea of efficiency: how do we get more efficient, how do we do things better and more effectively? Different teams within an organization have different perspectives on this. What efficiency means to an engineering or DevOps team is quite different from what it means to a marketing, sales or commercial team, and every one of those perspectives contributes to the success and survival of a business in these uncertain times. Today we're going to focus specifically on engineering efficiency from the perspective of DevOps. Once you start thinking about getting more efficient, you start putting plans together, strategizing, and executing as quickly as possible. All of that is great, but while you're strategizing and executing a plan, one question remains.
How do you figure out whether it's working? Ultimately, the whole conversation around efficiency and effectiveness comes down to getting better metrics and understanding what is happening. That is where today's discussion becomes very important: we want to know what is going right and what is going wrong. How do you make better decisions for your product, for the tools you're using, and for your business as a whole? All of that is encapsulated in the ideas of observability and telemetry, and by extension OpenTelemetry, which we'll explore later.

But before we go there, since we're looking at this from the DevOps perspective, let's look at the DevOps lifecycle, so to speak. It's an infinite loop, or a feedback loop; an iterative loop would be a better way of putting it. It starts with planning, coding, building, testing and releasing on the development side, then moves to the operational side with deployment, operations and monitoring, and then it feeds back into planning and the cycle continues. As I said, it's a very iterative process. What we want to explore with the idea of efficiency is how to make this cycle go faster and make these iterations smoother and more efficient, for lack of a better word. Whatever percentage gains you can get here contribute to your overall efficiency strategy: maximizing your return on investment, doing more with less. The entire conversation hinges on that. And when we talk about this, the idea of automation comes in, and in the DevOps world automation is driven by GitOps.
You've probably heard of GitOps in the context of CI/CD or infrastructure as code, but really GitOps is a set of practices for managing infrastructure and application configuration in a declarative, version-controlled way. Ultimately it's an enabler for infrastructure automation, and we'll specifically look at API infrastructure automation because that's where today's conversation is heading, but GitOps can be applied to your entire application stack. It is a set of principles, not a set of tools: you can use tools to enable your GitOps journey, but GitOps itself is a framework and guiding principle more than anything else. It tells you how to approach things so that they are more efficient and automated. I like the quote from someone who described GitOps as just "infrastructure as code done right", and we'll look at what "done right" means. With GitOps, there are four primary principles you need to be aware of. The first: your system needs to be defined declaratively. That means you describe what the desired state of your system is going to be, instead of defining a set of instructions. You look at what the end result, the outcome, the end state is going to be, as opposed to how you get there step by step. That's the first principle.
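To make the declarative principle concrete, here is a minimal sketch of a Kubernetes manifest. The service name, image and replica count are illustrative, not from the webinar; the point is that the file states the desired end state and leaves the "how" to the platform:

```yaml
# Declarative: we state *what* should run, not the steps to get there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3                      # desired state: three pods, always
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example/checkout:v1.2.0   # pin the exact version in Git
```

Committing a change to this file (say, bumping the image tag) is the deployment; no imperative script lists the steps to get there.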
The second principle is that the system state needs to be versioned and immutable. This is handled through versioning your entire code base in Git, which is where the infrastructure-as-code conversation comes in. Everything is versioned in Git, which makes it much easier to roll things back, because your state is maintained as different versions. It's immutable, which means that for audits you can go back in history and see how things evolved to get to where you are now. Rolling back and recovering systems becomes really easy and simple with this. The third principle is that the system state is pulled automatically, meaning your approved changes are automatically applied to the system. That comes from a host of automated tests and checks that you have in place, and it can be driven by policies that let those changes take effect. Once again, you're trying to minimize the manual input and human effort required to make these deployments happen. And finally there's continuous reconciliation, where software agents ensure the correctness of what you're doing and alert you if there is any divergence. Those are the principles of GitOps; we've gone through them quite quickly, but that's the gist of what GitOps is built on. So why is this important? The principles are great, but what's the actual benefit of applying them? Moving on to the benefits: I've distilled five key things, and different people might have different ways of looking at it.
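The continuous-reconciliation principle can be sketched in a few lines of Python. This is a toy model, not how Argo CD or Flux are implemented: a reconciler compares the desired state (from Git) with the actual state (from the cluster) and derives the actions needed to converge them.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to converge actual state to desired state."""
    actions = []
    # Anything in Git that is missing or out of date gets (re)applied.
    for name, version in desired.items():
        if actual.get(name) != version:
            actions.append(f"apply {name}={version}")
    # Anything running that is no longer in Git gets removed.
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

# One reconciliation pass: desired state comes from Git, actual from the cluster.
desired = {"frontend": "v2", "backend": "v1"}
actual = {"frontend": "v1", "backend": "v1", "legacy": "v1"}
print(reconcile(desired, actual))  # → ['apply frontend=v2', 'delete legacy']
```

A real GitOps agent runs this loop continuously, which is also what makes drift detection and alerting on divergence possible.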
My objective here is to look at it through the lens of productivity. First, faster, more efficient, more frequent deployments: with a well-defined pipeline, all of this happens without a lot of human intervention but with the right checks and balances, so it becomes much easier to manage. Second, cost efficiency: you reduce the possibility of downtime, you manage your resources better, and with less human intervention things move faster, reducing the time and effort required. Third, reliability: because things are versioned in Git and immutable, you can roll back easily, and you're less prone to errors because the right checks and balances are already in place. Fourth, compliance and security: it makes for simplified auditing and access control, and credentials management is taken care of much more easily, especially in a cluster or orchestration environment like Kubernetes. And finally, developer experience: instead of having to deal with a whole lot of tools within these pipelines, you work with Git, something most developers are familiar with and have worked with, which makes for consistent and familiar practices and tools and, again, makes life more productive. With that said, the benefits are great, but how do you measure success? That brings us to measuring the business value of GitOps, and this is where DORA enters. And when I talk about DORA, it's not this Dora.
Not Dora the Explorer, but DORA the metrics. DORA stands for DevOps Research and Assessment, and it's essentially a way to measure DevOps efficiency. There are four key metrics to look at under DORA. Deployment frequency: how often an organization successfully releases to production. Lead time for changes: the amount of time it takes for a commit to get into production. Change failure rate: the percentage of deployments causing a failure in production. And time to restore service: how long it takes an organization to recover from a failure, which is again quite an important metric to understand. From a GitOps perspective, all of these are really important to know. The first two speak to the velocity of your progress, whereas the second two are more about the stability of your system, and that's what you're trying to measure here. Hopefully that gives you a bit of an understanding of what we're talking about: we started with the DevOps lifecycle, GitOps as the enabler of the automation journey, and the DORA metrics as the way to measure success. Now we'll bring it all back to the world of observability, because GitOps needs observability. When we spoke about GitOps, we were talking about a desired state, a state where things are automated and where we want less and less human intervention.
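The four DORA metrics above are simple to compute once you record a timestamp per commit, deployment and recovery. A small Python sketch over made-up deployment records (the data and field layout are illustrative, not a standard schema):

```python
from datetime import datetime, timedelta

# Each record: (commit_time, deploy_time, deployment_failed, restored_time)
deploys = [
    (datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 15), False, None),
    (datetime(2023, 5, 2, 10), datetime(2023, 5, 2, 12), True, datetime(2023, 5, 2, 13)),
    (datetime(2023, 5, 3, 8), datetime(2023, 5, 3, 11), False, None),
]
days_observed = 3

# 1. Deployment frequency: releases to production per day.
deployment_frequency = len(deploys) / days_observed
# 2. Lead time for changes: average commit-to-production time.
lead_time = sum((d - c for c, d, _, _ in deploys), timedelta()) / len(deploys)
# 3. Change failure rate: share of deployments causing a production failure.
change_failure_rate = sum(1 for _, _, failed, _ in deploys if failed) / len(deploys)
# 4. Time to restore service: average failure-to-recovery time.
restores = [r - d for _, d, failed, r in deploys if failed]
time_to_restore = sum(restores, timedelta()) / len(restores)

print(deployment_frequency)  # 1.0 deployment per day
print(lead_time)             # 3:40:00 average commit-to-deploy
print(change_failure_rate)   # one of three deployments failed
print(time_to_restore)       # 1:00:00
```

Note how the velocity metrics (1 and 2) and the stability metrics (3 and 4) come from the same records; that is why deployment observability, which Andy demos below, feeds both.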
But the challenge is that you also need to know when to course correct, what is going right and what is going wrong. The difference between desired state and actual state is what you need to know, and that is the question observability needs to answer. You need to understand what is going on within your system and whether there are deviations. If it's all working well, that's fantastic, double down on it; but if there are deviations, you need to make sure you're not drifting too far apart before you find the things that need to change. That brings us to OpenTelemetry, which is what we're going to discuss today. OpenTelemetry is an open source observability framework for collecting, processing and exporting telemetry data, essentially to help you gain better visibility into the performance of your distributed systems. As you can see, it is already supported by a whole lot of different tools in the industry. It's the second most active open source project in the CNCF right now after Kubernetes, which most of you are probably familiar with already. It has a really big, active community today, and it's only growing in popularity, with more and more tools adopting this open standard. So why should you consider it a no-brainer? It gives you better monitoring capabilities: better insight into service health, response times and errors. It also provides a common language: it's vendor-neutral, open source, and gives you an open standard for integrating with external tools. And with support for multiple tools, you can add them all to your stack and have specific tools look at the same data differently, giving you better insights and monitoring.
And finally, it helps you make better product decisions through better product insights and usage metrics: product issues are highlighted earlier, and you get a more data-driven approach to decision making in your organization. With that, I bring it full circle back to our DevOps lifecycle. Andy will be looking at it specifically from the deployment side of things and how that can get better, whereas Sonya will look at it more from the API lifecycle side, and how you can troubleshoot your APIs a little better and get more insights from them. The objective today is to make this cycle more efficient and help you get started with that journey, or get better at the journey you're already on. Hopefully that gave you a bit of an introduction to what we're going to discuss and some background on the different concepts we're touching on. And with that, I'm going to hand over to Andy. I've been saying Andreas here, but over to you, Andy.

Thank you so much, Buddha, and thank you for having me. Let me share my screen; I hope I can just take away your sharing rights. Here we go. Just quickly, in case you want to follow up with me later (I know information will probably be sent out anyway): I was introduced in the beginning, but in case you joined late, during the day I work for Dynatrace, and the rest of the time I work for Keptn. Well, obviously I also work on Keptn during my day: I'm a DevRel and a maintainer for the open source project.
Today, besides OpenTelemetry and Keptn, I also want to quickly highlight OpenFeature, another open source project in the CNCF space, because OpenFeature will make it easier for you to deploy new features in a more risk-aware way. So in case you're into feature flagging and you've never heard about it, OpenFeature may be something you want to look into as well. But Buddha, you started and ended with this slide before you passed over: the DevOps infinity loop. As you correctly said, the DORA metrics are really here to give us insight into how well we are doing, how well we are pushing out changes from code all the way through building, testing and releasing. On the right side, I have change failure rate and time to restore service, which are metrics that show how mature you are in operating your software, how fast you can recover from a problem. So in my presentation I want to focus specifically on observing the DORA metrics from a build-and-deploy perspective, and then Sonya will cover the operational aspects: using the monitoring data to do the troubleshooting and all that. So I want to first focus on how we get stuff into production. Now, Buddha, you also mentioned GitOps, and I love GitOps. I would say GitOps is not just infrastructure as code done right; I think it's full stack as code done right, because it obviously spans from the infrastructure to the application. I want to highlight, though, that GitOps, as great as it is, also requires a new approach to observing and measuring DevOps efficiency. Here's why, at least the way I see the world; please feel free to correct me or challenge me.
In the classical monolithic world that we used to live in, and some of us still live in, the complexity of building monolithic applications was really all on the dev side, because you had a joint repository of code where multiple teams had to figure out how all of the code works together. You had to invest a lot of time in continuous integration, figure out how all the different components actually make up an app, and do your validation and security checks. Then we had tools like Jenkins that produced a simpler kind of construct, a monolith, which in turn allowed a simpler operations side to deploy, observe and operate a well-defined app. With the emergence of GitOps we also saw the move towards breaking the monolith into smaller pieces. In the cloud native world, we definitely ended up with engineering teams developing individual services rather than building big monolithic applications. Obviously there's a debate about when it makes sense to build a monolith versus services. However, I think the move was great because we made development simpler; we were boosting the efficiency and output of individual development teams. But what this really meant is that we were shifting right: the complexity of actually running applications, applications that are now built from multiple capabilities and multiple services, moved towards operations. There's a whole app composition problem: what is really an app? Which services, in which versions, running on which infrastructure and which cloud services, really make up version one or version two of your app? The reason I bring up this slide is that I think we see a shift here, and the shift matters when you talk about measuring your efficiency from a DORA perspective.
Because if you measure the monolithic way at the top, it took a long, long time to build everything and then push it over. Now if you measure DORA in the new cloud native world only on individual services, you all of a sudden see a lot of new services being deployed all the time, but that doesn't really mean you're deploying new features or new apps to your end users. So this is the challenge with app composition: what now makes up an app? And this is the problem we try to address with the open source project Keptn: bringing visibility into the new cloud native world as you're building microservice-based applications, and also bringing application awareness into the observability we generate. So what does this really mean? What is Keptn? Keptn is an automated, app-aware observability toolkit that gives you visibility into all of your GitOps and Kubernetes deployments. What you need, or hopefully already have, is your GitOps tool of choice and your Kubernetes clusters. You pick your observability tool of choice; Sonya, I tried to do a good job and put in all of the logos that you had. If I missed any logos, don't be offended: any observability tool that can deal with OpenTelemetry, Prometheus or any of the other open standards should be up here, but space was limited, so I put up the ones I see on a day-to-day basis. From a Keptn perspective, the only thing you need to do is install Keptn; the latest iteration is the so-called Keptn Lifecycle Toolkit. You just install it on your Kubernetes cluster. There are a couple of things I will show in the demo later on, where you can instruct Keptn on what to observe, what not to observe, and what to do. Once you've done that, your developers push code changes into Git.
Then your GitOps tool of choice (I will use Argo later on as my GitOps tool of choice) will do the deployment, and Keptn will automatically create and give you insights, actually using both Prometheus and OpenTelemetry, so that you automatically get dashboards showing the key DORA metrics: how often you deploy, how long it takes to deploy, how many deployments fail. And the beauty of it is that because we base all of this on open standards, you can look at this data anywhere: in this case it's a Grafana screenshot, but you can also look at the same metrics in, for instance, Dynatrace. We're not only creating metrics, we're also creating traces using OpenTelemetry. In case you're not yet familiar with it: we're using OpenTelemetry not to trace, in the traditional sense, a transaction of your deployed application from end user to database. We're tracing the whole deployment process, from the first step your GitOps tool takes to apply changes to Kubernetes until the whole application is deployed. So we're expanding OpenTelemetry from its initial use case to a new one: monitoring and tracing a deployment end to end. I'll show how this looks in a second in the demo. How does this technically work, in case you're interested? Technically, Keptn is a Kubernetes operator that you install on your Kubernetes cluster. To give you a quick example: if you have an app made of three services (frontend, backend and storage) in different versions, then what our operator does, if instructed to do so through some annotations on your Kubernetes deployment manifests, is automatically measure the time each individual service takes to deploy.
What do we do with this? We're measuring it, which means we're creating metrics and traces for every single deployment, frontend, backend and storage, including, obviously, the time span: how long does it take? We're also introducing an application concept, because as I said earlier, we have to figure out not just that you deploy individual services, but how long it really takes to deploy an update to an application, because an update could mean updating one service or five services. So we allow you to define the application context, and then we also measure automatically, when you make an update to one or multiple components of the application, when the whole update starts, when the individual updates of the services or workloads are done, and when the whole thing is finally done. And again, we're creating metrics and traces. You see "pre" and "post" here: another really nice side benefit of what we built into Keptn is that it's not just giving you observability; we also allow you to execute tasks. So you can use Keptn to orchestrate actions pre-deployment and post-deployment. Why is this important? One of the DORA metrics is failed versus successful deployments. So we can use Keptn, after a deployment is done, to execute a task within your Kubernetes namespace and have it, for instance, figure out: is my app really up and running? You can execute some tests to validate it, or you can validate your SLOs by going back to your observability platform. We can also use pre-deployment tasks to validate, before the deployment, whether it's actually a good time to deploy the application right now, because maybe your external dependencies are not there, or maybe your environment is currently under quarantine, and we can actually stop the Kubernetes scheduler from deploying your workloads. All right, so this is what we technically do.
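The instructions Andy mentions live as annotations on the Deployment's pod template. A sketch of what they might look like with the Keptn Lifecycle Toolkit (the app and workload names, version and task name are illustrative; the exact annotation keys may vary between toolkit versions, so check the Keptn docs for your release):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    metadata:
      annotations:
        keptn.sh/app: demo-app          # groups several workloads into one application
        keptn.sh/workload: frontend     # the workload Keptn should observe and trace
        keptn.sh/version: "4.0.2"       # version reported in traces and metrics
        keptn.sh/pre-deployment-tasks: notify    # tasks to run before deploying
        keptn.sh/post-deployment-tasks: notify   # tasks to run after deploying
```

The `keptn.sh/app` annotation is what gives Keptn the application context: all workloads sharing it are rolled up into one application-level deployment trace.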
We are an operator that traces, monitors and observes the complete end-to-end deployment of your application, which can consist of one or many workloads, and we can even execute pre- and post-deployment tasks. Now, how does this look when we look at the trace we generate? When the changes are applied, by you manually or by your GitOps tool: this is a visualization of how it looks in Jaeger, and I'll show it in the demo as well, but just to make it easier to visualize. When you're deploying an app update to version 3.0.1, as you can see here, we automatically start creating a trace, and not a single-node trace: we're creating a full application deployment trace. It starts with the pre-app-deployment tasks that we can execute, any checks before the app should be deployed. Then for every workload (your frontend service, your backend service, your database service, whatever service) Keptn creates spans on that trace that show you how long the pre-checks for the workload took, how long the actual deployment of that workload took, and how long the post-deployment checks took. Remember, a post-deployment check could be verifying the deployment really succeeded by executing some tests. We do this for every single workload: the frontend service, the backend service, the database service, whatever it is. Once all of them are done, at the end we also execute the post-app-deployment tasks, and we measure those as well. So this is a trace, and we'll look into it in the demo. And obviously we don't just get traces, we also get metrics, as I showed earlier, which you can put into any type of tool you like. But I think I've talked enough and done enough slides; let me go over to my environment.
First of all, I have a Kubernetes cluster that is instrumented, and in my case the data ends up in Dynatrace. As I said, it doesn't matter if you send it to Datadog, New Relic, Honeycomb or any other tool; it's Dynatrace for me because that's what I use in my day-to-day work. And I have my app's Git repository here. As Buddha said, what's really nice: if I want to change something (it's a sample app, nothing fancy), if I want to update the version of my app, the only thing I need to do, given a new container image, is commit the change. I'm deploying version 4.0.2; let me just commit this... there, that's it. Once I've committed, I have my Git update. This is application configuration as code; it could also be infrastructure. What I'm basically instructing my tool to do (my GitOps tool of choice is Argo; you can use Flux or anything else, I don't really care): Argo is now updating and pushing my changes out. And what is happening, as you can see here: we have some annotations where I instruct the Keptn operator that this is a workload it should look into, observe and trace, and here are also the pre- and post-deployment actions that we can execute, which I'll explain in a second. Argo pushes out these changes, Keptn works in the background, and what Keptn actually does is automatically start tracing, monitoring and observing the deployments happening in my system. I can look at this data now, because it's all based on open standards: for instance, I have Grafana installed, and I can see what's currently happening in my environment.
I can also look at it in Dynatrace, because this is an open standard, so I'm using Dynatrace here to visualize the workflow as well. Earlier today, I deployed versions one, two and three, as you can see, and if I refresh, I should see version number four coming in as a trace any second... and there it is already, perfect. So every time a deployment happens, I get a trace. Let me quickly open up this trace here, versions three and four; I think version number four is probably still ongoing. If I look into this trace, I can see the deployment I did earlier. By making my Git commit, I basically said I want to change the desired state, and Argo applied the desired state to my Kubernetes cluster. What I see here is an end-to-end trace of how my deployment actually made it to Kubernetes, including my pre-deployment tasks and the actual deployment with timings, along with additional information: if I click on these individual nodes, we also have so-called attributes. When you're creating these OpenTelemetry spans, as we do with Keptn, you can add as much metadata as needed, so I know exactly what type of app and what version is actually being deployed. It's all here. And if I now go over here, you see version number four is almost done; the pre-deployment tests are coming in. A full end-to-end trace. Now, why is this important from an efficiency perspective? Because I want to know how long it takes to get a deployment done, and how long my pre- and post-deployment checks take. If I'm executing a test and the test all of a sudden takes five minutes instead of one minute, it is slowing me down from an efficiency perspective. So this is where we can use OpenTelemetry to get those insights. And as I said, you have the individual traces, which is great.
But if you want to look at it from a higher-level perspective, you can also look at this in dashboards: when did you deploy which version, how long did it take, deployments over time. And just to really show that you can do this in any type of tool: you can use your observability tool of choice, as long as that tool understands and supports these standards. But at least in this demo, if I look at my deployment YAML, you saw earlier that I have pre- and post-deployment tasks; it says "notify" here for the pre-task and the post-task. What "notify" really does is execute one of my Keptn tasks. Keptn actually allows you, as one of the options, to write a JavaScript function that it then executes, kind of like a serverless function. And this function here is sending out a Slack notification. So in a purely declarative way, declarative on the deployment, I wired up my Slack. Here's my Slack: I now see that my version four got deployed, and I got two notifications, because I said I want to call this function both pre- and post-deployment. There's more that I could demo, but considering the time, I want to bring it home and then pass it over to Sonja. I just want to end with one thing here. I showed you the whole thing on a single cluster, but you probably have multiple different environments, so you're not constrained to a single environment with what you saw here. What we really want with Keptn and with this observability (I'm just drawing this out here) is to give you end-to-end traceability from the first git commit, all the way from development into production.
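The "notify" task Andy describes could be declared roughly as below. This is a sketch of a Keptn `KeptnTaskDefinition` with an inline function; the `apiVersion` and the Slack webhook URL are assumptions for illustration, and a real setup would read the webhook from a secret rather than hard-coding it.

```yaml
# Sketch of a Keptn task that posts a Slack message. Keptn runs the
# inline JavaScript like a serverless function before/after deployment.
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnTaskDefinition
metadata:
  name: notify
spec:
  function:
    inline:
      code: |
        // Post a deployment notification to a (hypothetical) Slack webhook.
        await fetch("https://hooks.slack.com/services/XXX/YYY/ZZZ", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ text: "Deployment event for my app" }),
        });
```

Referencing `notify` from both the pre- and post-deployment annotations is what produces the two Slack messages per deployment seen in the demo.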
This is what we're aiming for, and I think this is what OpenTelemetry and open standards enable: in the end we can really get to a world where developers can more easily develop their individual services, while we also reduce complexity and give operations the insights they need, so they know what they're getting is actually there to stay and is good. I hope this was insightful, but now I think it's time to pass it over to Sonja, who will focus more on the operate-and-monitor piece of the whole thing. Thank you, Andy, that was really interesting. Every time I see this demo I think we need to contribute to it with an API gateway and an API definition that we can also push through the stages, using OpenAPI, so one more open standard in the mix. I think that would be really great, so that's something for us to work on in the future. If you can stop sharing your screen, thank you for doing that. Just a quick question before we move on, Andy: a couple of questions came through from Gabriel, thank you so much for those. He asks, first, is there a demo with Flux CD? I don't think we're doing that today, but is there a demo he might be able to reference? I don't have a demo with Flux CD, but it's the same concept. You can just install Keptn on your Kubernetes cluster; if you use Flux, you get the same thing. I'm just using Argo. And as for tracing, again, these are OpenTelemetry traces. Yes, I showed Dynatrace, but you can see here the same trace in Grafana and in Jaeger, so any tool that supports OpenTelemetry. And I really tried it; that's why I put all of the logos on there. Of course my tool of choice is Dynatrace, because that's what I use every day, but as you can see here, it works anywhere. Awesome, thank you so much, and thank you for the questions, Gabriel. Over to you, Sonja. Let me share my screen.
Now we are here: we're shipping things to production. And yeah, maybe that's not always the way it should be, but still, most of the time developers commit something, it lands in production, and what happens then? What's the next step? If we go back to our DevOps cycle, once we have deployed and we are operating, we start to monitor, because you want to know what's happening in production. Everybody knows you can test your application and your services as much as you want, but when they're in production, customers will start using them in combinations that you haven't tested, with all the services in all their versions, so you will have new learnings. And this is what we are going to talk about in my part of the presentation. I call it learning from production: once it's in production, what can you learn from how your systems and your APIs are running? And there are two kinds of learning. One is the really technical part, where we want to know: are there any errors I need to act on right now? Is anything going really, really wrong? Do I have much more traffic, is something happening, do I need to autoscale? Or can I scale down and save some money? Is there a misconfiguration or a resource issue? Do I need to act, if possible automatically, without somebody having to be woken up in the middle of the night? And then there is continuous improvement on the product side. As a product manager, I want to learn from how the users are using my application and my services: what can I improve, how can I provide better services? And maybe, what are they not using, so we can pick some things to retire and be more efficient in the services that we are offering.
And all this you can learn from production using observability data and OpenTelemetry, because it's vendor-neutral, as we were discussing: one standard, one protocol and format that you can send to many different observability vendors and open-source platforms, and you only need to do the instrumentation once. This is why we're working on having it in the Tyk Gateway, because it's super valuable. Also, when you have APIs you typically don't have a frontend; you just expose your APIs to your external users. You don't have session monitoring, you don't have a UI where you can see what users are doing and where they're clicking. So you really need those insights, and the gateway is the first thing that the traffic from your customers hits. This is where you can observe what is going wrong, what is going well, and what you can learn. Even better is when not only the gateway sends data, but all the other services too, because then you really get this nice end-to-end trace and end-to-end visibility. You're able to understand how much time is spent in the API gateway; if there's an error, whether it's a misconfiguration on the gateway or something happening much later in the upstream services; and at which point which team needs to act to solve that issue. So let's look. I have one more trace that I wanted to show. For this demo I'm using Dynatrace, because that's Andy's tool of choice and we're happy to have him, so we wanted to take this opportunity. This is an example of what an end-to-end trace looks like. You've seen it also in Andy's presentation, where it was more about the deployment; here it's really a trace of an HTTP request, and you can see what it's hitting. In this example it hits the frontend first (the frontend is hosted on the side), then the gateway, and in the Tyk Gateway there are different middleware operations, for example checking the version (you could have different versions of your APIs) and redirecting to the right version.
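Under the hood, the gateway and the upstream services end up on the same end-to-end trace because each hop propagates trace context, typically via the W3C `traceparent` HTTP header. A toy parser shows the pieces involved (the header value below is the example from the W3C spec; a real service would use an OpenTelemetry SDK rather than parsing by hand):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated
    fields: version, trace-id, parent span-id, and trace flags."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,    # shared by every span in the trace
        "parent_id": parent_id,  # span id of the immediate caller (e.g. the gateway)
        "sampled": flags == "01",
    }

# Example header as a gateway might forward it to an upstream service:
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Because every service reuses the same `trace_id` and records its caller as `parent_id`, the observability backend can stitch the gateway middleware and each upstream call into one tree.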
You could have a cache, rate limiting, and then the request is sent on to services that could be gRPC, GraphQL, REST, whatever, and then you see everything that's going on in those services and the call being successful. For this demo I'm using the OpenTelemetry demo. The OpenTelemetry community has been working on a demo application: it's a shop, you can run it in Kubernetes or in Docker, and there are different products you can click on, add to the cart, and buy. All of it is instrumented across different services, and what we have done at Tyk is change it a little bit to add our API gateway into the mix. So all the API calls go to the API gateway, which then does the forwarding to the upstream services, which use gRPC. And when we send the traces and the data to Dynatrace, we get a really nice overview of how the Tyk Gateway is doing: the typical service metrics, the response time. It's all running on my computer, so there's no network, it's pretty fast, and there's not that much traffic. You can see the failure rate, which looks pretty decent, you can see the throughput, and then you can go over to the traces. This is one view that already gives you information about the infrastructure. This is what you would use for autoscaling: are there a lot of requests coming in, do we need more gateways, do we need to scale? Are there too many errors, do we need to wake somebody up, or do we need to act automatically? But that's not all there is to API observability, because that's really just the infrastructure angle: is the gateway running fine? When you're dealing with APIs, you're also interested in learning about the usage of the different APIs at a more granular level. And you can do all that: if you have the data, you can create some nice dashboards and visualizations, kind of like a BI tool, to look at the data.
Here's a dashboard with the metrics and the traces: metrics based on the traces, based on the OpenTelemetry data that Tyk is exporting. And here I can see which are my most popular API requests. I see the product catalog service is the most used one, it gets most of the requests, and then the cart service; I can see all my services. I can see the response codes. Most often it's a successful 200 response code, but here I see errors, quite some errors, coming from the checkout service. And I already have some information that the errors are coming from the middleware part in Tyk that is responsible for rate limiting, so let's take a look. First, let's check if there's really a problem. I'm going to try it here: checkout, place order. Yeah, that doesn't work. OK, so we need to check that. I can look at the distributed traces, and yes, I see there are some traces, some transactions, that are having errors with that API, with the checkout. And this is the beauty of distributed, end-to-end traces: at just one look I can see there's an error coming from the rate-limiting part for the checkout service API. "API rate limit exceeded": OK, so either somebody is making too many requests, or I have misconfigured something. I can go to Tyk, to my API definition, and check it. Let me check. Yeah, there's my rate limit, and I have only one request per minute, so somebody really made an error there. Let me increase that. Obviously here I'm doing it in our API manager, but this is something we would do as code, as part of the API definition, something you would put through the whole GitOps process and automate; here it's just easier to show it to you like this. I have updated my API definition.
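Done as code, the fix Sonja applies in the API manager would be a small change to the rate-limit settings in the Tyk API definition. The fragment below is a trimmed, illustrative sketch (field names follow Tyk's classic API definition; the surrounding fields and the chosen values are assumptions):

```json
{
  "name": "checkout-service",
  "active": true,
  "global_rate_limit": {
    "rate": 100,
    "per": 60
  }
}
```

Here `rate` is the number of allowed requests per `per` seconds; the misconfiguration in the demo amounts to a `rate` of 1 per 60 seconds, and the fix is raising it. Committing a change like this to git and letting the GitOps pipeline roll it out keeps the configuration auditable instead of hand-edited.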
And now, if we refresh and wait a little bit, at some point we'll see that the errors are no longer coming in. Let's check if I can place my order. Yeah, I could place my order. That's really the beauty of OpenTelemetry and end-to-end tracing: you see the error directly, whether it's a misconfiguration in the gateway or something in your upstream services. Sometimes it's not an error per se; the engineers would know it was not really an error, that's how it's supposed to behave. But as a product manager I can go there and see what errors my users are having. If many users are having errors with rate limiting, with authentication, or with one particular path, then I can look at the documentation I have for my APIs and improve my sample code. So it's not always about the code or the application per se; it's also about the enablement of those APIs. That's all for my demo. First of all, awesome demo. And I want to add again, especially for people like Gabriel: I know we showed the visualization of these traces in Dynatrace, as Sonja did, but you can view this in any type of tool that understands OpenTelemetry. What I like about your demo is that you showed a very common use case in distributed systems, where you have API gateways or service meshes or whatever between services, and somebody makes a configuration mistake that stops API calls from going from A to B, and thereby stops something critical like an order transaction. Then you want to be notified as fast as possible, and you want to have the data that shows you where the problem is, whether it's a rate limit or whatever else. This was really, really nice.
Absolutely, really good demonstrations from both of you. Just on that last point: there is a true business implication to something like this. It's not just about having the best possible product experience, although obviously that is very important; that is what people pay for if they are using a paid product, and they want their users to be able to use it well. At the same time, it's also about uptime conversations and SLAs, the entire commercial aspect of supporting a particular product. The faster you can identify an issue, respond, and hopefully resolve it, the better; there are true commercial implications there, and in the support packages that go along with it. On the face of it, what we looked at is fairly technical: we are looking at GitOps, we are looking at OpenTelemetry. But there are true business implications to everything you are seeing here: the ability to look at things from an end-to-end perspective, what is going on internally, what's right, what's wrong, how you improve, how you get better, how you make better decisions. All of that has implications both from an engineering perspective and from a business perspective, so it's a pretty holistic approach to efficiency that we looked at today. And one thing to add here: I know that many organizations in the past have developed their own tools and their own standards for collecting data from different parts of the infrastructure, maybe writing it to their custom databases or wherever.
You don't want to do this anymore, because we have OpenTelemetry and open standards. If you're missing observability in a particular piece of your infrastructure or in particular processes, then don't implement something custom and proprietary; use OpenTelemetry. For pretty much every language out there you can create OpenTelemetry traces, logs, and metrics, and then you can use any observability tool that supports the standards to look at the data, analyze it, and alert on it. 100%, I think that's the beauty of it: open standards in general, and OpenTelemetry in this case specifically. The idea is that you don't have to think about hooking into specific systems that have their own language and their own requirements. You don't need to do that anymore; you can have a common language that is spoken across different systems and solutions. And, like Sonja mentioned, once you've set that up inside your system and your solution, it's applicable to any other integration in the future. Even if you don't need it today, if you want insights later on, you would still have the ability to integrate with those solutions far more easily than having to write a middleware, a plugin, or some kind of hook to connect to each system in its own language, and you avoid having to maintain five, six, or maybe ten different systems altogether. That's the beauty of it; that's hopefully where the efficiency part of things comes out a little bit better. Again from Gabriel: will you share the slides? In fact, I'll do you one better: we will be sharing the entire presentation, and this entire video recording will be made available to everyone here, so you'll be able to follow along if you want to.
From a questions perspective, there is one thing, Andy, that you touched upon that I wanted to clarify and reemphasize. A lot of times when we think about OpenTelemetry, it is looked at from a very transactional perspective, but what you showed today was very much end to end, beyond a single application. Could you clarify that a little bit, or reinforce it in the minds of everyone who's listening? Yeah, thank you so much for the question. Obviously OpenTelemetry was born out of the necessity and the need to trace transactions in business-critical apps, but we can use OpenTelemetry for any use case, whether it is the deployment use case that I've shown, from git commit all the way into production, or business process monitoring. If you have business processes that span multiple systems, where you even have waits in between (think about an order process), there are also ways in OpenTelemetry to create traces and spans that are then linked together, so you can really trace end to end from the first time you reach a customer and get their interest until you ship the product. Even that is possible. So that's why OpenTelemetry is not constrained to the classical use case of tracing business transactions in business-critical apps. Absolutely, thank you so much for that response. The next question is probably for you, Sonja. Typically, when we think about OpenTelemetry, and obviously we work on the API management side of things, there are different API styles that we have to deal with at the API gateway level. A lot of the conversation tends to go toward REST APIs today, because that is obviously the most popular style.
But equally, could something like OpenTelemetry help us get better insights with, say, a GraphQL API or perhaps a gRPC API? They work a little bit differently, they provide different insights, and they have different operational models internally. Could OpenTelemetry be extended to give us better insights into those? Yes. As you mentioned, with REST it's easy and straightforward; we know all the fields, and there is a semantic convention in OpenTelemetry that defines standard fields that everybody uses, like the HTTP response code. That's really helpful, because when you aggregate data from different sources in your tool, you can use those semantic conventions to create filters. Some things are still missing, for example for GraphQL. What is a GraphQL error? With GraphQL you could be calling two different services and getting data from only one, and the GraphQL response would still be HTTP 200. OpenTelemetry would interpret that as successful, even though you are missing part of the data. That's something we're also looking into, and I will have a talk at KubeCon with one of our colleagues where we are going to explore that topic. I think that's something we are going to push toward a new set of specifications in the coming months. Thank you so much; this has been fantastic. I think we are exactly one minute from the hour, so thank you, everyone, for joining us. Thank you so much, panelists, it has been an incredible conversation. I have thoroughly enjoyed learning all about OpenTelemetry, Keptn, Tyk, Dynatrace, and the entire workflow that goes into making the application lifecycle more efficient. It's been a pleasure. Until next time: we've got another webinar coming up next week, where we will be looking at a declarative approach to API management in Kubernetes.
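The GraphQL point is worth a concrete illustration: per the GraphQL spec, a response can carry a top-level `errors` list alongside partial `data` while the transport still returns HTTP 200, so a naive status-code check misses real failures. A minimal sketch (the response bodies are made up):

```python
import json

def graphql_call_failed(http_status: int, body: str) -> bool:
    """A GraphQL request can 'succeed' at the HTTP layer (200) while the
    response body reports errors for part of the query; such errors
    appear in a top-level "errors" list in the JSON payload."""
    if http_status >= 400:
        return True
    payload = json.loads(body)
    return bool(payload.get("errors"))

# One resolver failed, the other returned data, yet the status is 200:
partial = json.dumps({
    "data": {"cart": {"items": 2}, "recommendations": None},
    "errors": [{"message": "recommendation service unavailable"}],
})
ok = json.dumps({"data": {"cart": {"items": 2}}})

print(graphql_call_failed(200, partial))  # True, despite the 200
print(graphql_call_failed(200, ok))       # False
```

Instrumentation that only records `http.status_code` would label both calls successful; inspecting the body (or a future semantic convention for GraphQL errors) is what makes the first one visible.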
That's next week, on the 23rd of March as it stands, so do make sure you join us. We'll go a little bit deeper into some of the concepts we discussed today, so stay tuned for more. Thank you so much, everyone, it's been a pleasure. Until next time, take care and have a lovely day ahead. Bye bye.