Hello everyone, thank you for having me here. I'm Gianluca, I work as a developer, and in the past I worked as an SRE, a reliability engineer, so I'm deeply involved with cloud computing, Kubernetes and related technologies. Today I'm here to speak with you about observability, with a specific project in mind called OpenTelemetry. So this is what we're going to look at today.

Let's start with something general: the pros and cons of cloud computing, Kubernetes and microservices. What do they have in common, and what's good and bad about them? They all bring a lot of good things to us, and that's why they are so popular. Cloud computing and Kubernetes give you an API that you can use programmatically to build automation and to deploy your applications. Not only that, but you can control the life cycle of your applications as you prefer. This is very important because applications are critical: we have to look after them, take care of what they are doing, and distribute them across clusters and cloud providers for reliability. The problem, though, is the distribution itself, because distribution is crucial for success but it's also very complicated. Think about how easy it is to monitor a single process on a single laptop or a single server; it's very different when you have to replicate that across cloud providers, across regions or continents. Distribution makes things much harder, and the more distributed you are, the more complicated it gets. Cloud computing and Kubernetes make distribution much simpler and more affordable, and that increases complexity. Microservices are a way of writing applications so that they scale.
Not only the application itself, but also the teams that work on your product, because you can have smaller teams that work on a segmented piece of your application or your business; they can work by themselves, and you can scale them up and down as you wish. So it gives you much more control and scalability, but it also means that you have a lot more to manage than when you have one gigantic application. I think we can summarize the pros and cons like this: you get better scalability, you get more distribution across the network, but that has a price, and the price is all about how you manage and understand what's going on in your system. One consequence, for example, of deploying an application across cloud providers is that you get a resiliency budget, because your application runs better: if one region goes down, you have the other one. But this also means that you don't get a single way to say "this is broken" or "this application doesn't work", because it may work in one region but be down in another one. You have to figure out where it's broken, how it's broken, and if it's broken, for how many people. You don't have these questions when you have a single replica or a single data center, when the distribution is lower, because if it goes down, it's down, and that's easier to reason about, even if it's a problem, because usually you want to keep everything up and running all the time. So this is the kind of friction you face when you think about onboarding Kubernetes and cloud computing: it's all good, but it's complicated.
I also never heard a customer complain about my application using too many CPUs. I get support tickets for an e-commerce saying that they can't pay, or that a page doesn't load fast enough, but nobody ever complained about CPUs. Yet from my experience everybody cares about CPU: when I go into a company I see graphs and dashboards all about CPU, packet loss and all that stuff that customers really don't care about, because they come from a different perspective. This is part of the story I want to tell: metrics like CPU and memory are important, but the business requires other metrics, and as developers we are there to meet their expectations, so we have to figure out a way to make this work. Customers don't complain about the quality of my code either: I can write the worst code that happens to work very well and they are happy, or I can write perfect code that doesn't meet the criteria, and they are not able to use the outcome of my workflow. So this is important: they don't care about CPU, they don't care about code quality, they care about the product being up and running, reliable and useful for them. But I do care about those things as a developer, because writing good code that is reliable, easy to maintain and doesn't eat all my computing resources usually means, as a consequence, that customers are happier. Still, they don't care, so I have to find a way to express those things in a form that has a direct impact on them, and one of the best ways is to use the metrics that we think are important for the customer. If the customer doesn't care about CPU in your e-commerce, what do they care about? Maybe the number of products you can suggest to them, or the latency of opening a product detail page. All those metrics have to be aggregated with the CPU, with the memory, with the ones we already know about,
to get something useful out of them. If you look at only one signal, it usually doesn't describe the world we are in well enough.

So let's take a break, because there is a lot here already, and let me give you a brief presentation about myself. I'm Gianluca Arbezzano, I come from Italy, from Turin, in the north, and I work at Packet as a senior staff software engineer. You can find me around as GianArb: I'm on Twitter and I blog, and I use Twitter a lot, so if you have any question or you're looking to chat about cloud computing, observability or monitoring, see you there. When I'm not developing I grow vegetables in my garden, and this is the best season to see some pictures of tomatoes, potatoes and so on.

So my question for you, or let's say the question I usually ask myself when I design a system or when I write code, is: how do I tell the story of a request? Let's say I have a website, and let's stick strictly to the e-commerce example because it's easy. How do I tell the story of a specific payment for a customer? As we know, with microservices we have a lot of different pieces, and the payment request goes through all of them trying to fulfill the request. So how do we describe the whole story of one specific payment? Or let's say you are designing a new feature, like a comment box, a way to leave feedback on a product. This may involve writing a few microservices, deploying them, and interacting with other microservices that are already there, like the metadata database for example. How do you tell the story of this new feature from the outside? Say you wake up tomorrow and nothing works because something is broken: how do you look at it and try to figure out the
solution? This question is very important, and I ask it during code reviews, for example, because that's a good place where I can interact with myself, or with the developer whose code I'm reviewing, and we can try to figure out how to make that feature observable. That's where observability comes in. And what about third-party services? Maybe for the code you write it's simple, because you can write logs or send events, but how do you figure out what's going on in MySQL, if your payment at the end goes to MySQL, for example? It's also very hard to make developers agree on something: we all know that we like to complain, we are complaint-driven people, and that makes everything hard. So how do you tell the story of your request if you write microservices in different languages? In some way you have to agree on a standard in the end, and this is hard. OpenTelemetry helps you do that.

OpenTelemetry is a specification that describes how you instrument your code so you can figure out what's going on in your application from the outside, because that's what we do with monitoring: we open a dashboard, we look at the logs, but we are outside, looking for the inside state of the application. This is the definition of observability in control theory: figuring out what's going on inside a system from the outside. You try to dig into the internal state of your application from your point of view, from your desktop. OpenTelemetry gives you a specification and client libraries that you can use in your code to expose metrics in a way that is the same across many languages, because that's what we need at the end of the story: we need the full story, no matter which language you're using, which cloud provider you're using, or whatever. We just need the full story. So your application may have a language, or at least
it has a language that it speaks, and the language that the application speaks is the one you are able to teach to the application itself. Logs, for example, are a language that the application uses to tell you something, and we are the ones writing the logs. So if, after a couple of weeks, we get back to the logs and we can't figure out what they mean, it's our fault as developers, because we didn't write code that is understandable from the outside. One thing you can do to make your application speak a clearer language is to use structured logs. Think about a structured log not as a message with a timestamp, but as a timestamp with a set of key/value pairs: maybe one of them is the message, the human-readable one, but there is other information that you can print as JSON and parse afterwards. You can think of that other information as a way to print the state, the context, of your application. In the e-commerce example it would be the product ID, the customer ID, the payment service you may be using. All of that comes together in your log line: you have the message, but you also build the context around that message at that specific point in time. This is how you do structured logs: they are not something you have to string-match or index in a search engine, but something you can parse as JSON, and from there it's much easier to do aggregations and so on. Context propagation is this ability to enrich your log with information that comes from the application at the point where the log gets printed. The correlation ID is an ID that gets generated for every request, every request has its own, and it travels through all the downstream requests, so you can say "give me all the logs that have correlation ID 12345" and you get only the ones related to that
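To make this concrete, here is a minimal sketch of a structured log line in plain Node.js. The field names (`productId`, `customerId`, `correlationId`) are hypothetical examples, not a standard schema:

```javascript
// Emit a structured log line: a timestamp plus a set of key/value pairs,
// serialized as JSON, carrying the human-readable message and the context
// of the request at the point where the log is printed.
function logEvent(message, context) {
  const entry = {
    timestamp: new Date().toISOString(),
    message,
    ...context, // e.g. productId, customerId, correlationId (hypothetical names)
  };
  console.log(JSON.stringify(entry));
  return entry;
}

// A collector can later parse the line back instead of string-matching it.
const line = JSON.stringify(logEvent('payment accepted', {
  productId: 'p-42',
  customerId: 'c-7',
  correlationId: '12345',
}));
const parsed = JSON.parse(line);
// parsed.correlationId lets you filter all logs belonging to one request
```

The point is that the log is machine-parsable first and human-readable second, so filtering by correlation ID becomes a query instead of a grep.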
specific request. This correlation ID, or request ID, is something we see a lot when we do traces. I will tell you what tracing is later, but think about when you open the DevTools in Chrome and go to the Network tab: you see bars that represent the time spans of the browser downloading all the assets your page requests. All the bars have a duration, and you can figure out the entire picture, split by asset. That is a trace, and every line is a span. Other than that, your application can also expose events, and events are things like the number of logins, or the number of views per product, grouped by the product itself. You usually see events as counters, numbers that only go up, or gauges, numbers that go back and forth, and you can aggregate them and group them by a specific key.

When I think about monitoring, or infrastructure monitoring, I think about something like this: there are telemetry generators, which are our applications; it may be your own application or a third-party one like MySQL, RabbitMQ or whatever. All those metrics are sent to a collector, or the collector goes and pulls them; pull versus push is a debate in monitoring that will stay with us forever, so it's not important here. The collector usually applies back pressure and pushes the metrics to the storage, and the storage is the place where your metrics live. I put together a list for each of those categories. The telemetry generator is the easiest one, because it's everything that exposes metrics: your application in Go, JavaScript, Node.js, whatever. The collectors are agents that usually run on a server, collect all the metrics coming from your applications, and push them to a storage. There are open source collectors like Telegraf or the Prometheus exporters, OpenTelemetry has a collector itself, Jaeger too, New Relic
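The counter and gauge idea above can be sketched in a few lines of JavaScript. This is a toy illustration of the two event shapes, not any real metrics API:

```javascript
// A counter only goes up; a gauge can move in both directions.
// Values are grouped by a label key (e.g. a product id), so they can
// be aggregated later, like the "logins per page" events mentioned above.
class Counter {
  constructor() { this.values = new Map(); }
  inc(label, n = 1) {
    this.values.set(label, (this.values.get(label) || 0) + n);
  }
  get(label) { return this.values.get(label) || 0; }
}

class Gauge {
  constructor() { this.values = new Map(); }
  set(label, v) { this.values.set(label, v); }
  get(label) { return this.values.get(label) || 0; }
}

const logins = new Counter();
logins.inc('product-page');
logins.inc('product-page'); // counters only accumulate

const inflight = new Gauge();
inflight.set('payments', 3);
inflight.set('payments', 1); // gauges can go back down
```

Real metrics libraries add timestamps and ship these values to a storage backend, but the counter/gauge distinction is exactly this one.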
has an agent, Logstash is an agent, and so on. The storages are places where you can keep this information: Cassandra is one, InfluxDB, Prometheus; there are also as-a-service ones like Honeycomb or New Relic, and open source ones like Elasticsearch. These are usually called time series databases. Obviously you can store this kind of metric anywhere, but a time series database is designed for time series, so it's a little more efficient. Some of the technologies that I touch on during this talk have their logos on the slide; I just put them there because I think they will give you an idea, and maybe you will rely on them when googling around for more information.

So let's get back to what a trace is. A trace looks like this: the trace is the full picture, and on the left you see a column with a bunch of IPs, which are the services you have in your microservices environment. Every span, each bar, is a request, so you can see that a specific request goes across your entire system, how much time it spends in each service, and even how many times a service is reached to fulfill a request. For example, during one of my debugging sessions with traces I realized that I was calling the authentication service four or five times for every request, because I was checking the authentication token over and over, and that was too much. So I was able to make some optimizations, save some requests, and make my responses much faster. When you click on a span you can see the metadata attached to it: for example, here I'm saving the name of the component. In this case I was tracing AWS requests, so from the AWS client SDK I was tracing all the requests I was making to AWS. These visualizations are screenshots that come from Zipkin. Zipkin is a
popular tracer written in Java; if you are more familiar with Go you can use Jaeger as well. There are links about all of this at the end of the slides. Obviously the language they are written in doesn't matter: you can look at them as databases. Jaeger and Zipkin are like databases that you push your metrics to, so it doesn't matter where the metrics come from, and later I will show you a few examples with JavaScript, obviously.

Code instrumentation is critical for your application, because that's how a developer teaches the application the proper language that you will look at when trying to solve an issue. Usually your application has logging, metrics and tracing; it is a big part of your application, it's not just a few lines of code, there are proper libraries, and in some way you have to organize them. The idea is that you should try to encapsulate this code as far away as you can from the business logic. This is hard for logging, but for tracing and metrics it's a little easier, and you can use things like event listeners. Another important part is propagation, because, as you realized, there is a bag of information that has to travel from one service to another: how do you move the correlation ID between them? It depends on the protocol you are using; for example, over HTTP you can use the headers. One of the propagation formats is called B3, and it works this way: this is an HTTP request, and as you can see the headers contain a bunch of X-B3 entries, which carry information from the trace itself. The service that makes the request passes those headers, and the server that receives the request is able to read them and create its own span with the same trace ID and the caller's span as its parent, so you can build the hierarchy that we saw before. So this is a bit of an overview
about how tracing works in practice, but let's see how OpenTelemetry.js works. OpenTelemetry.js is a library that follows the OpenTelemetry specification and is written in JavaScript, so you can check it out. I learned a lot just by looking at the folder structure, so I decided to share it with you, because I think it's key to getting a good understanding of how the project works. As you can see, there is a folder called examples, and there are examples of how to trace HTTP requests, HTTPS requests, DNS requests, SQL requests and so on, so you get sample applications that help you figure out what's going on. There is another directory called packages, and packages gives you concrete implementations of how to trace specific components: for example, you can trace gRPC requests, HTTP requests, HTTPS requests and so on. You just have to import those packages and your application will be traced automatically. As you can see, the packages with the opentelemetry-plugin prefix are the plugins that help you instrument your application, but there are also opentelemetry-exporter packages, which contain the code that teaches OpenTelemetry where to push your metrics. There are exporters for Jaeger and Zipkin, the popular tracers I told you about, and also one for Prometheus, the time series database used to store events, because OpenTelemetry supports both traces and events. This is one of the examples I took from the examples directory, and as you can see it is a server written with the http package. For what concerns traces, it's as easy as importing a single file; in this case the example HTTP server imports it, and that file provisions a tracer configured as you wish, with the right plugins and the right exporter. So let's have a look at what that file looks like: this
is it. As you can see, we require a bunch of core libraries from OpenTelemetry: the API, the Node SDK and the tracing package. You also have to decide where to push your metrics, and in this example, based on an environment variable, we can switch between Jaeger and Zipkin, so we import both exporters and, as you can see, we inject the concrete exporter into the tracer. Here you can see that there are no plugins, and this is because the tracer itself comes with a specific set of plugins already provisioned by default, so you don't have to do anything more than that to get a trace from an HTTP request, and also to get it propagated to the next service. So how does it work? It looks too easy, and it's true: I think as JavaScript developers we are lucky, because we can wrap our functions, every function, from the outside, and this is cool because it lets you write tracing code that doesn't go into the business logic, since you can do it from the outside. shimmer is the library that OpenTelemetry uses to instrument all the code from the outside, and this is why you don't have to go into every line of your code and do the instrumentation, as you may have to in other languages like Go, because Go doesn't have this ability. Java has it, because you can instrument the JVM. For JavaScript this is very cool, and if you have to write specific instrumentation for your business code you can use shimmer as well, because it's very simple and you don't have to touch the code where your application logic lives. So this is it. I will leave you a bunch of links, because the topic is very big, and I hope these slides and this presentation help you see that, compared with other languages, it's not too hard for a JavaScript developer to pick up tracing. I think it's super important when you
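The wrap-from-the-outside trick can be shown with a tiny hand-rolled `wrap()` in the style of shimmer. This is not shimmer's actual API, just a sketch of the idea, and the `checkout` object is a hypothetical example:

```javascript
// A tiny shimmer-style wrap(): replace a method on an object with a
// version that runs tracing code around it, without editing the method.
function wrap(obj, name, wrapper) {
  const original = obj[name];
  obj[name] = wrapper(original);
  obj[name].__original = original; // keep a handle so we could unwrap later
}

// Hypothetical business object: its code knows nothing about tracing.
const checkout = {
  pay(amount) { return `paid ${amount}`; },
};

const spans = [];
wrap(checkout, 'pay', (original) => function (...args) {
  const start = Date.now();
  const result = original.apply(this, args); // call the real business logic
  spans.push({ name: 'checkout.pay', durationMs: Date.now() - start });
  return result;
});

checkout.pay(10); // traced from the outside; business logic untouched
```

This is essentially what the OpenTelemetry plugins do against `http`, `grpc` and friends: they patch the module's functions at load time, so your code keeps calling the same API while spans are recorded around every call.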
have a distributed system, or even a monolith that is very asynchronous. I use it a lot when I have queues or when I do event sourcing, because I can tell the story of all the events, of the whole process: I can follow a message through the queues and tell how long it stays there, which worker processes it, and so on.

The first link is my GitHub and my Twitter account, because you can reach out to me there, my DMs are open. The second one is OpenTelemetry, where you can learn more about the specification and see all the other supported languages, like Python, Go, PHP and so on. The third one is my blog, where I wrote about tracing and OpenTelemetry as well. Jaeger is one of the popular tracers I spoke about: it's open source and it's also sponsored by the CNCF. Honeycomb is a company that provides an observability product as a service, and they have a cool blog where they write a lot about these topics, so check them out, because they are really on top of this field. OpenTelemetry has a GitHub community, so you can go there and chat; they are super responsive, I learned a lot from them, and you can also contribute, because it's open source, so check out the repository itself. The last link is an application in Node.js that I wrote and instrumented with OpenTelemetry. It is a sample application, and it also contains applications in other languages: it's a dummy e-commerce written in five different languages, instrumented with tracing and logs, so you can have a look at how it all works in practice over there. Thank you for your time, and let me know if you have any questions.