So, when I got a talk accepted to KubeCon about the browser, I was honestly pretty surprised, and seeing a room full of folks excited to chat about the browser at KubeCon and CloudNativeCon is not something I expected. So I'm curious: who here identifies as primarily a front-end developer working on browsers? Okay, I see a few hands. Who is more like, I'm a full-stack dev, I work on everything? Okay, a lot more hands. SRE, ops folks? Okay, gotcha, amazing. Well, I'm excited to talk about the browser to y'all today.

My name is Purvi Kanal. I've worked with browsers for a large part of my career, and I'm really interested in getting front-end observability to a higher standard. Ops folks and back-end folks are used to really good tooling, and I really want to see the browser and front-end world catch up to that. I work at Honeycomb, and we're trying to solve this problem there. I'm also an approver on the OpenTelemetry JS project, with a special interest in web APIs.

So today we're going to go through the current state of real user monitoring and general front-end monitoring tools. We'll have a quick introduction to OpenTelemetry and how it can also be a browser monitoring tool. We'll talk about the automatic instrumentation that comes out of the box with OpenTelemetry and doesn't require a lot of configuration to get started. And then we'll jump into how manual instrumentation can really supercharge your journey of front-end observability.

So let's jump into it. There are a lot of tools out there today around real user monitoring and general front-end monitoring. But how did we get here? There was a time when the web wasn't that complicated. It was a time to just truly have fun, express yourself on your homepage, talk about your favorite bands, or sign up for Neopets, which is where I spent a lot of my time on the earlier Internet. It really wasn't that complicated: a server served up some HTML, and we barely used JavaScript unless you wanted to play an MP3 or something similar. Even blink and marquee were built into HTML, so we didn't really have a lot of complexity going on.

Over time, though, using the web has become no longer optional. Especially since the pandemic, more and more of our critical services are accessed online. Getting vaccine appointments, government services, and lots of other really basic needs are accessed through online portals. It is our job as developers to create reliable and performant experiences for our users. And our systems look a lot more complicated today, with multiple databases and multiple microservices that serve a browser app, a mobile app, and maybe a standalone API. We also have increasingly complex engineering organizations, with multiple teams working on multiple different services. So it's a lot more complex than the old Neopets days.

To keep track of a lot of these things on the front end, this is roughly the state of what monitoring looks like. You might have a real user monitoring tool that tells you about session information, performance information, maybe a little bit about errors, and even some analytics. Then there are dedicated analytics tools like Google Analytics that can be used by developers but are also sometimes used by other people in the org, like product folks or marketing folks, to make decisions about your product.
There's also more full-featured software out there, something like FullStory, that even captures user sessions for you to watch back and see how users are using your site. There's error tracking with tools like Sentry that aggregate errors so you can alert on them. There are APM tools. There's log searching. And all of these things are happening in different places, which can be a lot to keep track of when you want to figure out what the state of your front end is at the moment.

The main thing is that we have a lot of different tools in different places that tell us really well what is happening. But when it comes to bug reports or trying to debug something, I've been in this situation a lot as a front-end developer: a bug report comes in from our customer success organization saying, hey, this isn't working, the page is blank for a certain customer. And I can't reproduce it. So I message them back: hello, can you get the customer to open their dev console, take a screenshot of that, send me the error message, and tell me the exact state of how they got into it? It's a lot of back and forth, whereas when I've been debugging back-end things with really good distributed tracing, I can see things at a glance without necessarily having to reproduce them myself. So a lot of these tools that tell us well what is happening are often disconnected from the why. And the why is really where observability shines.

Observability is all about knowing how your system works because it's telling you how it works. And in order for something to tell you how it works, it has to be able to describe itself pretty well. So I'll go into a bit of an analogy here, bear with me. I like to run, and I wear a running watch when I'm running, and it really helps me make decisions about my running. It'll tell me what pace I'm running at, how many steps per minute, maybe even the power I'm exerting with each step, and, if I'm running up a hill, how steep that hill is. That helps me make decisions, and it also helps my partner know when to come get me if I've been gone longer than he was expecting. But this data also leaves out a lot that only I know internally. If on Monday I was running at a 5:42-per-kilometer pace and feeling pretty good, and on Wednesday I go for a run and that same pace feels horrible, my watch can't really tell me much about why that changed. But I might know. I might know that I didn't get a very good night's sleep, or that I was up really late, or that I didn't eat well, or that I was stressed out by something else going on in my life. My watch doesn't have that information; only I have it. So what I do is keep a journal where I self-report that information, because it's really hyper-specific to my system. And sending hyper-specific data about your system is really where observability shines. We need a pretty flexible tool to make that happen, and OpenTelemetry is one such tool.

So OpenTelemetry, which you'll also hear me refer to as OTel, is a vendor-neutral open-source observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs. We'll mostly be talking about traces today. The really crucial part of OpenTelemetry is that it is vendor- and tool-agnostic.
It can be used in a broad variety of ways, across many different languages, SDKs, and runtimes; it doesn't matter. You should be able to send OpenTelemetry data and have a back end like Jaeger, Prometheus, or Honeycomb aggregate that data and present it back to you. It's intentionally built to be flexible so that you can send hyper-specific data about your system. And it's really easy to set up in the browser.

I'm curious, show of hands: does anybody use OpenTelemetry to instrument other parts of their systems today? Nice, yeah, a lot of us are using OpenTelemetry. Is anyone using OpenTelemetry in the browser? Okay, I see a couple of hands.

Getting started is installing a set of packages. There is some setup code; I'm not going to go line by line through it, and it's also available on the OpenTelemetry docs site. But it's a snippet of code that you want to include at the top level of your browser application. It needs to be loaded before everything else so that it can fire off critical spans about your document load, for example. So you want it loaded first.

So let's jump into automatic instrumentation, or auto-instrumentation. These are spans that form traces that you get out of the box. If you look at the bottom here, there is a registerInstrumentations function and a getWebAutoInstrumentations meta package. The meta package instantiates some base auto-instrumentations that send basic spans about your system, and it's a great place to start looking at what's going on.

So what are some of these auto-instrumentations? First, there is the document load instrumentation. Again, all you have to do is provide the meta package and it will automatically start generating document load spans. What do these spans look like? There is an overall document load span at the very top level, and this tells you how long it took from the point that a user hits the browser to the point that the DOMContentLoaded event fired. As a child span of that, there is a document fetch span, which is about when the browser received the last byte of the response of the last thing it had to load: if it had to load a bunch of resources, it's saying, here's how long it took me to receive all of the last bytes. Document load and document fetch are slightly different because document load also includes how long it took to execute some of those scripts, not including any async or deferred scripts.

Most importantly, I think the most interesting part of this instrumentation is actually the resource fetch spans. For every single resource that your browser is fetching, it will create a resource fetch span. Every font, every CSS file, every JavaScript file, every third-party JavaScript file that you load gets a span telling you how long it took to load. As a more concrete example, we have instrumented the Honeycomb documentation website, and this is a bunch of resource fetch spans, each telling you which resource it's about. At a glance, I can see that, hey, there's this GIF loading on one of our pages that takes six seconds; I should probably do something about that. It's just nice to see, at a glance, how long all of your resources are taking.
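For reference, here is roughly what that top-level setup snippet looks like. This is a minimal sketch based on the OpenTelemetry JS docs; the collector URL is a placeholder, and the exact imports and calls vary a little between SDK versions.

```ts
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

// Export spans to an OTLP endpoint. In production this would usually be
// your own OpenTelemetry Collector rather than a hard-coded vendor URL.
const provider = new WebTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'https://collector.example.com/v1/traces' })
  )
);

// The ZoneContextManager keeps trace context intact across async boundaries.
provider.register({ contextManager: new ZoneContextManager() });

// The meta package turns on document load, user interaction, fetch,
// and XMLHttpRequest instrumentation in one call.
registerInstrumentations({
  instrumentations: [getWebAutoInstrumentations()],
});
```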
The other thing about these auto-instrumentations that I want to stress is that they can be extended with really critical data. And I don't want to be the person to tell you exactly what that data should be, because you know better than I do what your system is like. But here's an example. I was curious about which resources were blocking the render. There is a browser API that can tell us whether a resource that is loading is render-blocking or not. Render-blocking resources are static files, like fonts, CSS, and JavaScript, that block or delay the browser from rendering the page. Have you ever seen a page that kind of flashes, where it's not styled and just looks like pure HTML, and then the CSS flashes on top of it? That's because the page still rendered but the CSS was deferred. You might not want that; you might want to preload the CSS. So this API can give you some pretty interesting information about whether you should preload or defer your resources.

So I stuck that into the resource fetch spans and asked: how many of these resources are render-blocking? And I saw that there's a font and my main style sheet, which lets us get a little more information and dig into what optimizations we can make to give our page load a smoother experience. Because page load isn't just about how fast it deterministically is, but also about whether it seems fast to the user, and preloading can do a lot for that. So that's one example of how you can not only use the auto-instrumentation but enhance it with extra attributes that are important to you and your system.

The next piece of instrumentation I want to talk about is user interaction instrumentation. Again, this ships out of the box with the getWebAutoInstrumentations meta package, but by default it's not that interesting, because all it does is track clicks if you load it like this. Here's an example where I had a website where I was interested in click events and input events. The event list that you can give it accepts any browser event: if you go on the MDN docs and look at all of the browser events, you can pass in any of them. That could be mouse events, keyboard events, navigation events; it's a pretty long list.

By itself, the data it gives you is also pretty limited, so I wanted to enhance that a little. Looking at extra attributes I can set: I want to know if there's an ID on that element and what that ID is, because by default it only gives you the target path, and in a really large application a deeply nested target path can be almost meaningless. (If you have data attribute IDs that you use for testing or for observability, you can pass those in here too.) I want to know the type, I want to know what the class names are, and for inputs I'm interested in the value of that input. That could give me a lot of information if someone says, hey, the search is broken; I can go and dig into that with a little more context. If it's links, I want to know which links folks are clicking on. What that looks like: on the left here we have just the out-of-the-box instrumentation attributes, and on the right there's the enhanced information. This is, I think, a radio button, and it tells me that it's toggled on, exactly which ID it has, and the type of element.
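Both of those enhancements hang off the configuration you pass to the meta package. Here is a sketch of what they might look like together; the attribute names are made up for illustration, and renderBlockingStatus is currently only reported by Chromium-based browsers.

```ts
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

registerInstrumentations({
  instrumentations: [
    getWebAutoInstrumentations({
      // Annotate every resource fetch span with whether that resource
      // blocked rendering (helps decide what to preload or defer).
      '@opentelemetry/instrumentation-document-load': {
        applyCustomAttributesOnSpan: {
          resourceFetch: (span, resource) => {
            span.setAttribute(
              'resource.render_blocking_status',
              (resource as any).renderBlockingStatus ?? 'unknown'
            );
          },
        },
      },
      // Track more than clicks, and enrich each interaction span.
      '@opentelemetry/instrumentation-user-interaction': {
        eventNames: ['click', 'input'],
        shouldPreventSpanCreation: (eventType, element, span) => {
          span.setAttribute('target.id', element.id);
          span.setAttribute('target.class_name', element.className);
          if (element instanceof HTMLInputElement) {
            span.setAttribute('target.value', element.value);
          }
          return false; // returning false keeps the span
        },
      },
    }),
  ],
});
```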
So it's really interesting, and what it leads to is not only helpful for debugging; I can also do analytics now with user interaction instrumentation. For our docs site, our docs maintainers wanted to ask: hey, which code snippets are the most copied? And we don't actually need to go to an analytics tool to do that anymore, because we're already collecting that data with OpenTelemetry; we can just run a query asking which pages have the most click events on this specific type of button.

Another thing that is often really important to track is Core Web Vitals. Web Vitals is an initiative by Google to provide unified guidance for quality signals that are essential to delivering good user experiences on the web. It came from the idea that it's hard to describe, in a quantitative way, what a good user experience is. So they broke it down into three major categories: load time, which is where largest contentful paint (LCP) comes from; interactivity, which was first input delay (FID) and has now been replaced by interaction to next paint (INP); and visual stability, which is your cumulative layout shift (CLS) score. That last one refers to when you've been on a cooking website and you just want to get to the recipe, but then the blog loads and pushes all the content down, and then the ads load, and you're just like: I'm just trying to make these cookies, please let me look at this website. That's a really frustrating user experience, so they tried to quantify it. As for INP: FID only looked at the interactivity of the first input, which is either a click, a tap, or some sort of keyboard event. But that falls short, because we're loading lots of deferred JavaScript these days, so INP tracks interactions throughout the life cycle of the page and reports the largest delay, which makes it a much better tool.

There is no Web Vitals instrumentation available through OpenTelemetry upstream today, so we created one, which will eventually be available upstream. You can install this Honeycomb package and get that instrumentation, but I want to be clear: this is going to be available eventually in OpenTelemetry proper. It'll send spans for all your major Web Vitals and tell you what those values are and whether they're good or need improvement. But we can take it a step further.

Take cumulative layout shift as an example; this is an example of some layout instability that's frustrating. If I know that I have a poor CLS score, I don't really know where to go next. I can try to reproduce it myself, I can ask other people to reproduce it, I can play with different network speed settings and see if I can reproduce it that way, but it can be a bit of a guessing game as to what is causing that layout shift score. It doesn't have to be, though, because the underlying web-vitals package has something called attribution. For every single Web Vital, it can tell you which element on the page contributed to that Web Vitals score. Taking CLS as an example, the instrumentation not only collects the value, it also collects which element contributed to that layout shift score, or that largest contentful paint, or that interaction to next paint. So it gives us a starting point of where to look for a poor Web Vitals score so that we can start optimizing, and it starts to connect that why a little bit better.
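The attribution data itself comes from Google's web-vitals package. As a sketch of the idea, this is roughly how you could turn an attributed CLS reading into a span yourself; the span and attribute names here are made up for illustration.

```ts
import { onCLS } from 'web-vitals/attribution';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('web-vitals');

// Fires when the final CLS value for the page is known.
onCLS((metric) => {
  const span = tracer.startSpan('web-vital.CLS');
  span.setAttribute('cls.value', metric.value);
  span.setAttribute('cls.rating', metric.rating); // 'good' | 'needs-improvement' | 'poor'
  // Attribution: a CSS selector for the element behind the largest shift,
  // which gives you a concrete starting point for fixing the score.
  span.setAttribute(
    'cls.largest_shift_target',
    metric.attribution.largestShiftTarget ?? 'unknown'
  );
  span.end();
});
```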
The most important one I want to talk about today, especially relevant to so many of y'all who are already using OpenTelemetry to instrument your systems, is the network instrumentation. That is also available out of the box with the instrumentation meta package. There are two network instrumentations here: instrumentation-fetch and instrumentation-xml-http-request. Your application might be using both fetch and XMLHttpRequest (the more Ajax-style way of fetching), in which case keep both on. If you're only using one, you can disable the other instrumentation, since it won't do anything; this is just an example in case you need to disable one of them.

Out of the box, it gives you spans for every single network request that is being made: every GET, PUT, POST, DELETE, including requests to third-party sites. You can write filters to only capture things for your API, and do other fancy things like that. But it will capture every network request made by your browser, plus a bunch of metadata, which is cool.

We can take it a step further, though, and this is really where the magic happens. If you propagate a certain header, the traceparent header, then you can connect your front-end and back-end network requests together, which is really exciting. This happens automatically if your app is served on the same domain as your API: if I'm serving from localhost:8080 and my API is also served from that same domain, it just works. If your API is something like api.honeycomb.io and the UI is ui.honeycomb.io, I'll have to do a little bit of extra work to propagate it, and that's just passing a regex that matches your back-end URLs. This tells the instrumentation: hey, for this set of outgoing HTTP requests, I want you to add a traceparent header. This works because there is a concept of context propagation in OpenTelemetry, so signals can be correlated with each other, and the way that happens in this case is a traceparent header that carries the trace ID of this particular trace and the parent span ID of that HTTP request.

And the result looks something like this. This is really the exciting part: you have your HTTP request, the GET request that originated from the browser, automatically connected to your API that is already instrumented with OpenTelemetry. You can trace a request all the way from the browser down through to a database call and the rest of your distributed tracing. This to me is really, really powerful, because you never have to wonder: is it my front end, or is it something in my back end? You can answer that question really easily with OpenTelemetry. On some front-end teams, we often have to prove "mean time to innocence": front-end teams get bugs reported to them before everybody else, but the bug could really be anywhere in the system, and it's really hard to prove that it's not a front-end thing, or to tell whether your front end is slow versus your API contributing to the front end being slow. You can answer a lot of those questions at a glance, and you can go from something like a click event, someone doing something on your website, all the way through to the database call it made, with not too much effort. So that can really start to connect that why.
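Configuration-wise, the regex I mentioned goes into the fetch and XHR instrumentation options. A sketch, with a hypothetical cross-origin API domain:

```ts
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

// Outgoing requests matching these URLs get a traceparent header.
const backendUrls = [/^https:\/\/api\.example\.com\/.*/];

registerInstrumentations({
  instrumentations: [
    getWebAutoInstrumentations({
      '@opentelemetry/instrumentation-fetch': {
        propagateTraceHeaderCorsUrls: backendUrls,
      },
      '@opentelemetry/instrumentation-xml-http-request': {
        propagateTraceHeaderCorsUrls: backendUrls,
        // enabled: false, // if your app never uses XMLHttpRequest
      },
    }),
  ],
});
```

One caveat worth knowing: for cross-origin requests, the API's CORS configuration also has to allow the traceparent header (via Access-Control-Allow-Headers), or the browser will reject the request.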
But to go a little bit further, we've talked a lot about enhancing this auto instrumentation, but it's different than creating your own spans. Because you know where all of the dragons are hidden. Coming back to that running analogy, only you know what you're feeling like when you're on a run. And that's really where manual instrumentation shines. You know all of the nooks and crannies on your system, so you can instrument them yourself. You can send spans about whatever functions you want. If there is a part of your app that is really important to you, for example if you run an e-commerce website and you really want the cart checkout experience to be fully instrumented, you can do that manually and track things that are really important to you. So let's jump into a concrete example about that. Let's re-examine document load. Because how useful do we really find document load? I personally have struggled with it for many years because I can see a document load that is reported as pretty fast, but I still have users that are complaining to me that the website is slow or that they can't use it in the way that they expect. And that's because for example for the document load that comes with that auto instrumentation or comes through a lot of other tools, we've decided that the document has loaded and this means that the user can use the website when this DOM content loaded has fired. But this might fire and we still have a lot of JavaScript that's executing, people might be frustrated trying to click on buttons and the page is totally frozen and we'll never know through just document load instrumentation. And that was a large part of why web vitals were created in the first place. We're like, okay, we need to actually split up this into different categories and largest contentful paint is kind of that proxy for document load. But if we dig into how largest contentful paint actually works, it was really surprising to me that it's just reporting the time to the largest thing on your web page. So in the example on the top, the LCP fires when this picture has loaded because it is literally the thing that takes up the most amount of space on the page. And that seems fine. It seems like an okay proxy for the tech crunch example. But if we look at the Instagram example underneath that, it starts to break down because the LCP fires when the Instagram logo has loaded. And for the Instagram login or sign up page, it's telling me, hey, you're good, the page is loaded. But there's no form. I can't interact with it. So it still doesn't feel like a very good measure of what users are experiencing. And it's a bit frustrating that it's telling me that the largest paint has happened and I take that as a proxy as like the page has loaded. And ultimately none of these metrics matter if your users aren't happy. And the more important thing is for you to decide what does it mean for your page to be loaded and what does it mean for users to interact with your website. So time to first X or whatever is much more useful than LCP or document load as proxies to whether users can use your site. An example of this is for Honeycomb, graphs are kind of our bread and butter. So time to first graph for this particular application is a much better proxy to understand whether my page has loaded. But that is hyper specific to this particular app and your app will have something totally different in it. But that does mean that I need to set that up a little bit myself. 
And that is much more useful than an arbitrary document load or an arbitrary LCP, because sometimes LCP fires on the cookie banner, not even an actual important element on your page. So defining what that means for you is really important. And this is just one example of instrumenting your system yourself, because you know it best.

Context is absolutely the most important thing. You should be adding extra information to all your spans: user IDs, team IDs, whatever is important to you. You know your system the best. You can add that context through resource attributes, or with span processors if it's changing on every span. Because ultimately, vendors should not determine what you can measure. You should be using a flexible, vendor-agnostic tool so that you can describe your system in as full and rich detail as you possibly can. And the best time to instrument your code is while you're writing the feature, because we've all been there: you come back three months later and you're like, who wrote this? And git blame says: it's me, I'm the problem. So the best time to instrument your code really is while you're writing it. And that's really all I have, so thank you so much for listening, and come say hello.

Audience: (after some microphone trouble) Thanks. Nice presentation. In one of the earlier slides, you showed the bootstrapping code, and there was a URL to my server, my telemetry endpoint. Do you have any comments on how that should be secured, and if so, in which way? And on top of that, should any rate limiting be implemented there? Our team looked at this at some point, and we were worried that some of our clients might find the endpoint and bombard us with requests.

Purvi: Yeah, that's a great question. There are absolutely ways to secure it, and if you're worried about that, you absolutely should. Ideally, in production you would be using an OpenTelemetry Collector, so that you're not exposing any API keys or anything. The collector that you host can live somewhere that's authenticated, so that you actually have to authenticate against it before you can send spans to it. That prevents people from just grabbing the URL and sending as many requests as they want.

Audience: You mean the browser, the JavaScript on the page, would have to first authenticate with it, probably using user credentials or a session or something?

Purvi: Yeah, that's the idea. You can host it alongside your API, and however you authenticate with your API can authenticate you to your collector as well.

Audience: Cool, thanks.

Audience: Hi, thanks, great talk. I'm interested in what you do after you've got those nice red-amber-green health indicators: how do you aggregate those and report on them later, in particular if they've got different weightings, because some are more important than others, and how do you report that over a time period?

Purvi: Yeah, that's a great question, and it kind of depends on your back end. At Honeycomb, we make pretty big use of service level objectives.
There are benchmarks; taking Web Vitals as an example, Google has benchmarks for Web Vitals. But often folks find that when they first instrument, especially if they have a lot of work to do, everything is just red, and that's not a fun or motivating experience if you're far away from being able to change it. So we always recommend setting those benchmarks for yourself. You can work towards the Google-set ones, but tweak them to your own preferences, and use things like service level objectives to keep track of them. Then, on aggregate, report on them using your own benchmarks rather than some arbitrarily set benchmarks. This is where your back end of choice starts to come into play a little bit; I don't know that Jaeger has this kind of reporting feature, so you'd likely have to pay for a back end to do that.

Audience: Okay, thanks.

Audience: Hi. A question about clock synchronization. You mentioned distributed tracing between front end and back end. In the back end, it's easy to make sure your clocks are synchronized, but as far as I understand, the timestamp is generated in the browser when the span is created, so that would be the problem in that case. How do you handle that?

Purvi: Yeah, it's a really complex problem. From clients, whether mobile apps or web, it's hard to rely on the timestamp because of clock drift, for many different reasons. That's again something that at Honeycomb we proxy through the back end a little bit: we record the time on ingest of an event, and we also have the timestamp that's set by the client. And if it's egregiously off... we had some problems with a particular customer once where they had spans coming in from a lot of mobile clients, and a lot of those mobile clients played Candy Crush, so they would change their device clock. We had to do some interesting math to say: if the clock drift is a lot further than we think it should be, then we use the reported ingest time instead.

Audience: So that's something that would be done in the collector?

Purvi: It can be done in the collector, or it could be done at ingest time by the back end of your choosing. But yeah, a collector is a great place to solve that problem.

Audience: Thank you.

Audience: Hi. The instrumentation gets done at the JavaScript level, but there is some stuff, like redirects, that happens inside the browser and never surfaces to the JavaScript. I have also seen the PerformanceObserver, though I've never used it. Are you able to catch those transactions currently?

Purvi: Yes. A lot of the instrumentation uses PerformanceObserver to catch certain things. There is also the option to use sendBeacon instead of sending your telemetry over HTTP, which allows you to catch some of those. Let's say you navigate away and you want to catch any unload events: if you're sending through the sendBeacon API, it'll still be able to aggregate that and send that telemetry. So there's a specific setup for using the sendBeacon API to make that happen.

Audience: Okay, thank you.

Audience: Thank you very much for the nice insights.
I was asking myself: especially when you create your own spans, did you ever reach the level where it negatively affected performance, because you added too much logic on the client side, and it would have been better to just run a more complex query to get the insights you were looking for?

Purvi: Yeah, that's a really good question. I would say it depends, but I personally have not reached that limit with manual instrumentation, because, again, I tend to pick and choose the most important parts rather than blanket-instrumenting absolutely everything. The way I would think about handling a situation like that is understanding that maybe the telemetry I need is temporary. We've definitely made those tradeoffs, not so much on the front end, but I've done it on some back-end services: right now I'm having this problem, so I'm going to over-instrument a little, even at the cost of a bit of performance, knowing that I'm going to get rid of that eventually once I home in on what the problem is.