Hello? Check check. Okay in the back still? All right, let's get this thing going. There's like a partial minute left. We looking good, Kyle? Am I muted? Thanks. You see levels? Excellent. Thank you, sir. Stragglers. Now it's time. Hey everybody, good morning and thank you all for coming out. I want to thank the SCALE organizers for completely eliminating the Sunday morning hangover slot. I've had it before and it's not a lot of fun, and I'm glad to be rid of it. My name is Jeff Gehlbach and I've been doing network monitoring longer than is probably healthy. I got my start in the discipline in 2000 while working as a wide area network management engineer for the NASA Integrated Services Network. From there I went to work doing consulting and software engineering for a company called Concord Communications, which was a proprietary network management vendor. After being made redundant in a merger, when Concord got hoovered up by CA as so many have over the years, I worked for a couple of telcos where I deployed OpenNMS, got to know the system a little bit, and got involved with the fantastic user community, some of whom are still either working with us as coworkers or just as members of the broader community. I started fixing bugs in the software in my spare time, got to know the code, and hired on with the OpenNMS Group in 2007, and that's where I've been to this day, so you do the math. My roles at OpenNMS have included end user support, consulting, solutions architecture, sales engineering, and most recently product management. I'm now the product manager for the OpenNMS Horizon ecosystem, which includes the Horizon and Meridian distributions of OpenNMS plus some supporting components. This role is really good for me because I can see my work making a difference, and that is very important to my psyche. It's also good for the products because there's no longer any danger of me writing code.

Before I get underway here, I want to acknowledge a bug in my talk. I'm aware that much of the terminology I'm using today is a little bit sighted-normative, so suggestions on how to make this better are appreciated. I'd like the talk to feel more inclusive. Please see me afterward if you have ideas. All right, so here's my agenda. I'm here to sell you today on the idea that the network still matters despite the fact that it's increasingly hidden from our view. I'll get into the elements of network management, or network monitoring as it's more popularly referred to these days, and I'll touch on a few ways that OpenNMS approaches this very fun discipline. I also aim to convince you that you should care about NetFlow, which is a protocol that's been around since the mid-1990s, or at least that you should care about some of the successor and work-alike protocols that have come along since. As a terminology note, I'll be using NetFlow to refer to all protocols in this family as a group. When I'm talking about NetFlow proper, I'll suffix the term with either v5 or v9 to refer to a specific protocol version. Let me pause here for a second and ask for a show of hands. How many of you have used or still use NetFlow or a similar protocol in your day-to-day life? Pretty decent representation here. Okay, thanks. That's going to help me adjust the nerd level a little bit when I get into some of the more deep-dive stuff on how we added flow support to OpenNMS. There is a lot to cover here and I'm not going to try to put too many of you to sleep doing it.
I'll also talk a little bit about what the future may hold for flows in the OpenNMS product portfolio. So let's get on into it. All right. Now, I do mean to convince you that the network matters. However, I'm a big fan of considering all the available evidence, not just the evidence that supports my own conclusions. So I'm going to begin by acknowledging some facts about the state of the world in 2022. First, the in-house data center is in sharp decline overall. Unless you work at AWS or GCP or Azure or AliCloud, you're probably not doing new data center projects in your work. Second, who knows what the hell is up with the physical office? It's so up in the air. It's anybody's guess. Just anecdotally, at OpenNMS where I work, the US and Europe staff are still 100% home-based and have been since March of 2020. In Canada, our Ottawa office has found a hybrid protocol that works for them so far, but that could get disrupted at any point. So we really just don't know what the physical office will look like. Also, the last mile is changing. There are developments like 5G and SD-WAN that are dramatically altering that landscape. Even if you set aside the shift toward work from home, it's going to be a while before the dust settles in this arena. We don't necessarily have the comfort of assuming that there's glass or copper coming into where the users are. Also, cloud native deployment is really changing the game as far as what technologies the people coming up in the industry today are familiar with. It's really threatening to make this into the new this. These facts are all acknowledged.

However, as magical as it feels to type terraform apply and see our code become infrastructure right before our eyes in the terminal window, the evidence we have suggests that our universe is unfortunately devoid of actual magic. At some point, some device is moving ones and zeros around, and that's what makes our infrastructure go and what makes our users able to interact with the apps that we're deploying. This is why we do monitoring, right? Absent an idea of where those bits are coming and going and what they look like qualitatively as well as quantitatively, it's really hard to form a clear picture of our infrastructure's health and capacity. This brings us to the why behind monitoring. I grabbed this image from the OpenNMS Twitter feed on Valentine's Day of this year. This is our mascot, Ulf. I've got him actually right here in the plush. Ulf eats bugs for breakfast and he hearts monitoring, and this isn't really a tie-in to anything except that "heart" is in the title of this talk, so it seemed appropriate. All right. Now, I contend that network visibility is at the heart of the discipline of network monitoring, and of everything monitoring in fact, and that it will continue to be so for the foreseeable future. Visibility of the network enables us to understand what's actually going on. Application-level metrics, logging, tracing: these are all indispensable parts of the monitoring ecosystem and super important parts of the picture, but if you add network visibility, it's like turning on the lights in a darkened room where we've been fumbling around trying to figure out what all of these other inputs mean just by feel alone. So let's break down network management, which is the sort of old-school discipline that old guys like me practice and which contains monitoring as we practice it a little more narrowly, into its constituent parts.
There are several bodies over the years that have put forward different models to describe network management capabilities. Here are a couple of them; I don't know how many of you know either or both of these, but the OSI FCAPS model is probably the most widely known way to break down network management capabilities. It's the one that I learned decades ago back in 2000, and it's the one that we've historically used in our work at OpenNMS to guide our thinking as we evolve the code base that we are fortunate to be stewards of. The areas of FCAPS in which we try to play are chiefly fault management and performance management, with just a tiny slice of configuration management that we recently added. Now, folks who work on the telecoms industry side of the house might be more familiar with the eTOM FAB model, which breaks down into just three areas. It's really convenient to work with the FAB model because in that one the disciplines of fault, performance, and config management all roll up under the assurance part, the A in the middle there. But really it's the same thing, just different ways to break it down.

A little bit more about the three categories of FCAPS where OpenNMS plays, and what our corresponding capabilities look like in those areas. First area: fault management. Everybody's seen a room like this on TV; maybe you've seen it in person if you work in the industry, right? Fault management concerns itself essentially with knowing when something worth noting has happened in the network and conveying that knowledge to the right people. That's it. That's what fault management is all about. Now, sometimes that means a fancy room full of humans and workstations like this one, but that's shifting, as previously mentioned. One of the typical inputs into fault management as a discipline is what I'll call unsolicited external messages. This grouping includes SNMP traps, syslog messages, JSON-format messages from some message broker topic, and even quasi-standard mechanisms, things like the Event Integration Facility, or EIF. We support all of these mechanisms that I've just named in OpenNMS. Another important input to fault management is synthetic transactions. These are performed by software, and the point of them is to simulate a real user's interactions with the real network and the real applications that it enables. If we see something go down, robotically we say, hey, this thing went down, somebody might want to look into it. OpenNMS dedicates a whole subsystem to these inputs. We call it the poller; other systems call it something different. Finally, evaluating different kinds of time series data against thresholds is another important way that many systems increase the value of their fault management subsystem. Ours is no exception. I'll touch on this more in just a moment. I'm going to make just a very brief stop-off to talk about config management, which concerns itself with wrangling device and application configurations, keeping track of how they've changed over time, and often also with applying changes to those configurations according to some kind of code-based mechanism. Think Ansible and you're on the right track; that's the most popular config management platform that we're likely to be dealing with today. We're not trying to be Ansible. We're dipping just a toe in with device config backup capabilities that we added to the Horizon 30 release earlier this year in 2022. So we can go out and grab configs from your network devices and compare them historically.
So you can see here we've got a little bit of a diff. We're not trying to go out and make changes; we're not trying to get that fancy, but it's really nice to be able to see the history. Alright, let's move into performance management. This deals in quantifiable measurements over time. In practical terms, in today's terminology, it boils down to what we call metrics. It's just time series data; it's numeric. This time series data might be evaluated against thresholds as well, and that can generate events which feed back into the fault management subsystem. This can happen either internally or externally in a separate piece of software. In our case we do it internally, but in a distributed way in a lot of cases. Performance management these days, the getting of these time series metrics, really breaks down into two high-level approaches. There's what I'll call pull mode, which uses a protocol like SNMP. It could also use a protocol like WS-Management, or calls to a REST API, or some other mechanism where the monitoring platform reaches out periodically on a schedule, asks a question of a device or an agent, and then stores the result, all bucketized on time window boundaries. In recent years there's also been an increase in the popularity of what I'll call push mode for getting these metrics in, and that's one in which the monitoring system just sits back chilling and waits for metrics to arrive via some streaming telemetry mechanism. That mechanism could be Juniper's JTI, the Junos Telemetry Interface, or Cisco's Nexus streaming telemetry; there's a standards-based mechanism called OpenConfig, and there are also some message broker based setups that can work similarly to streaming telemetry. The performance management subsystem of OpenNMS supports both of these models, including all of the transports that I've listed here.

Now, most of the popular open source monitoring products stop there, but performance management also encompasses something that we'll call flows. Flow management is often also called traffic management, and it's a sub-discipline of performance management which answers the essential question: which hosts are talking to each other, about what, when, and at what data volumes? Now, importantly, here we're not talking about OpenFlow; that's a different thing and I'll touch on it in a couple of moments. So let's get into the family of flow management protocols that's often collectively called NetFlow. Terminology note again here: the word NetFlow may or may not be a registered trademark of Cisco; hard to tell these days, they may have abandoned it, and it's become generic in any case. So as a reminder, I'll use that word to talk about all the flow export protocols that are based on NetFlow, are successors of it, or work in a similar way. NetFlow v5 was the original NetFlow. It's proprietary to Cisco and it first came on the scene in 1996. It's not very extensible, or sophisticated for that matter, and it supports only UDP as the network transport. But it does solve some basic problems of network traffic visibility. Now, a fun fact, some of you may know this already, but the flow export part of NetFlow v5, which is basically all of NetFlow v5 that we use these days, was kind of an afterthought. Cisco was working on a separate project meant to optimize routing and switching in the Internetwork Operating System. They added flow export as a nice aside, and it took off into its own whole discipline.
The successor to NetFlow v5 is NetFlow v9, which came out in 2004. It's still proprietary to Cisco, but it is much more extensible than NetFlow v5. 2013 is when NetFlow v9 gave way to IPFIX, which is actually a standards-based protocol, not proprietary to Cisco. It's very similar to NetFlow v9 in many ways. It is a bit more extensible and has more security features, which is super important these days, and it also supports additional transports, including TCP and SCTP. There's also sFlow, which is a multi-vendor effort spearheaded by a vendor called InMon, and I'll talk more about that in a moment. There are some also-rans here too. Some Juniper equipment supports a Juniper proprietary flow export protocol called J-Flow. Huawei and HPE devices support something called NetStream, which seems very similar to NetFlow v9 but is not interoperable with it. And then finally, let's talk a little bit about SDN and SD-WAN. Some of these platforms may support either IPFIX or sFlow, or they may have their own proprietary methods of exporting flow records. I'm not going to get into those in any depth today, but we are keeping a keen eye on this space, and in principle we can add support for any emerging flow export protocols that the market demands.

So what's in a flow record? The flow record is the building block of flow management. At base, a flow record is just a summary of some network traffic, sent by a flow exporter. To define our terms here: that exporter is typically a piece of software running on a router or a switch, one that's equipped to inspect the packets that it's forwarding and switching and look at the source and destination IP addresses, source and destination ports, IP protocol, and some other attributes to arrive at a tuple that identifies an exchange of data between two nodes. As the flow tuples build up, the flow-enabled device fills its flow cache with that summarized data, and when some triggering event occurs, maybe a timer expiring after a time window in which no new data has been seen on that flow record, the device sends a packet with all of that summarized information to a collector. So now we've got exporter and collector. The collector is some piece of software that's capable of receiving, processing, and persisting the information in those flow packets. That could be OpenNMS. There are also things like cflowd running around out there, and there's of course a wealth of free and proprietary options in the market that have this kind of capability. Flow-enabled devices tend to be higher-end ones. Your average consumer device probably doesn't have support for NetFlow or NetFlow-like protocols unless you're running a custom firmware like pfSense or OPNsense, which probably a lot of you are. So again, in a nutshell, the flow-enabled device constantly inspects traffic that it sees on the network, summarizes it into flow records, and then sends those flow records in the form of flow packets to a flow collector.
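To make that exporter-and-collector mechanism a little more concrete, here's a minimal sketch in JavaScript of a flow cache keyed by the tuple I just described, with an inactivity timeout triggering export. This is purely illustrative; it's not how any router firmware or OpenNMS actually implements it, and the field names are made up.

```javascript
// Illustrative sketch of an exporter-side flow cache (not real router code).
// Each observed packet updates a cache entry keyed by the flow tuple; entries
// that have been idle longer than the timeout are "exported" to a collector.
const INACTIVE_TIMEOUT_MS = 15000;
const flowCache = new Map();

function tupleKey(pkt) {
  // srcAddr, dstAddr, srcPort, dstPort, and protocol identify the flow
  return [pkt.srcAddr, pkt.dstAddr, pkt.srcPort, pkt.dstPort, pkt.protocol].join("|");
}

function observePacket(pkt) {
  const key = tupleKey(pkt);
  const now = Date.now();
  const entry = flowCache.get(key) ?? { ...pkt, octets: 0, packets: 0, firstSeen: now };
  entry.octets += pkt.length; // accumulate byte and packet counters for the flow
  entry.packets += 1;
  entry.lastSeen = now;
  flowCache.set(key, entry);
}

function flushIdleFlows(sendToCollector) {
  const now = Date.now();
  for (const [key, entry] of flowCache) {
    if (now - entry.lastSeen > INACTIVE_TIMEOUT_MS) {
      sendToCollector(entry); // in real life: encoded as a NetFlow/IPFIX packet, usually over UDP
      flowCache.delete(key);
    }
  }
}
```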
This is a super nerdy comparison of NetFlow v5 and its successors, NetFlow v9 and IPFIX, in case you're curious about how they stack up. Grab a photo now if you want one, because I'm about to click past it. There is also sFlow, as I promised. Let's get into what sFlow does. It is an open multi-vendor effort led by a company called InMon, way back in the day. Famously, the technologists at CERN have used sFlow to gain visibility of the network that powers what has been characterized as the largest machine ever built by humans, the Large Hadron Collider. So it's got some bona fides behind it. I'll have a little bit more detail on sFlow and how it stacks up momentarily.

Alright, so let's rewind for a second, and I promise I'm not going to do this for every protocol; I'm not going to put you all to sleep. But let's look at what's inside of a NetFlow v5 export packet, since that's sort of our baseline. We can see here that it is a v5 packet. I've exploded the part that says protocol version 5 so you don't have to squint as much. And here are the various fields as dissected by Wireshark. I'm going to gloss over much of this and just zoom in on the most important fields. The first two here are the conversation partners. That's the source and destination IP addresses of the two hosts in the conversation. In NetFlow v5 it's IPv4 only; I think it's the same in v9, but IPFIX does support IPv6. There's also the SNMP ifIndex, which is just a Layer 2 network interface identifier of the input interface, and that appears alongside these two addresses. Further down are the source and destination ports, if applicable; they'll be zeroed out if the IP protocol in question doesn't have the notion of ports. The IP protocol here is identified by its number. In this case it's 6, which is the protocol number for TCP, so we know that this is a TCP flow record we're looking at. We also get a measure of the data volume that was exchanged by those conversation partners; that's in the octets field. And then from the duration field we can see how long the exchange was ongoing, measured in seconds. Together with the five items listed above, we now have the essential seven-part tuple of any flow record. This is going to be basically the same for any flow protocol. Those seven things are what you need to know. Now, at the very, very bottom you see this tiny little thing called padding, which in NetFlow v5 was marked as reserved for future use. Some implementations use these bytes to differentiate between ingress and egress traffic, so you can get a notion of flow direction. Maybe. It's kind of dicey sometimes. Later flow protocols, NetFlow v9 and later, have explicitly accounted for flow direction. Alright, so here it is again zoomed back out with all of the important fields listed. This is the time to take a photo if you want one. Alright, cool.

So as previously illustrated, NetFlow v5 is pretty simple. I'm going to talk about the differences between v5 and v9 here; again, not trying to put anybody to sleep. But v9 essentially just comes with more richness to it. NetFlow v9 introduces such things as templates and flow sets, as well as option templates, and all of these things really just go into making the protocol more efficient over the wire. It makes it significantly less chatty, because instead of repeating information you can just use a template to say: that thing I told you about in the last one, same stuff again. But the savings does come at the cost of some complexity, and as you might imagine, we have found and fixed a fair number of bugs just in our handling of templates in NetFlow v9 and later. The takeaway from this slide is really just that NetFlow v9 introduces some significant improvements, and I'm going to leave it at that.
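To make the template idea a bit more tangible, here's a heavily simplified sketch of what a collector has to do with them: cache a template definition when it arrives, then use it to decode later data records that reference the same template ID. This is conceptual JavaScript, not OpenNMS's actual parser, and it ignores field lengths, padding, option templates, sequence numbers, and per-exporter template scoping.

```javascript
// Heavily simplified sketch of collector-side template handling for NetFlow v9 / IPFIX.
// Real decoders also track field lengths, option templates, and scope templates per exporter.
const templates = new Map(); // templateId -> ordered list of field names

function handleTemplateFlowset(templateId, fieldNames) {
  // e.g. handleTemplateFlowset(256, ["srcAddr", "dstAddr", "srcPort", "dstPort", "protocol", "octets"])
  templates.set(templateId, fieldNames);
}

function handleDataFlowset(templateId, values) {
  const fields = templates.get(templateId);
  if (!fields) return null; // data arrived before its template; real collectors buffer or drop it
  // Zip the cached field names with the values in this data record.
  return Object.fromEntries(fields.map((name, i) => [name, values[i]]));
}
```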
Alright. Now, we've also talked about a couple of flow protocols here, so let's get into how these protocols compare to other traffic measurement options, including SNMP, and how sFlow fits in as well. Everybody has probably got SNMP going already; that's kind of table stakes if you're doing network monitoring. When it comes to traffic volume, SNMP interface counter data is necessary, but I argue it's not sufficient for great visibility if you really care about the character of your traffic. We get to know the total amount of traffic through an interface during a given period of time, but we get no visibility into the characteristics of that traffic with SNMP alone. The same is true of interface counter data that we might gather via any other protocol, like a REST API or streaming telemetry or any similar mechanism. Now, this is a problem that the flow export protocols can address for us. We already touched on NetFlow v5 and its successors, but there is also sFlow, which I promised I would get to in some depth. sFlow takes a different approach from NetFlow. Instead of summarizing the traffic and maintaining counters in a cache, sFlow just statistically samples packets. It might look at one in a thousand packets, just look at the header info on that one, and send that summary information on to the collector. The collector can then take a statistical approach to interpolating what the actual character of that traffic looks like, and you can get the same general type of visibility with the same general degree of reliability. But it is important, when you're dealing with sFlow, to remember that the underlying mechanisms are quite different and there are some statistics happening behind the curtain to arrive at the conclusions that you see. sFlow also has some facilities for streaming other kinds of system-level and interface-level data, and for this reason sFlow qualifies as a streaming telemetry protocol even for non-flow data. We won't frequently see it used that way, but once in a while we do bump into that.
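Since the conclusions are statistical, the collector has to scale sampled observations back up. Here's a back-of-the-envelope sketch of that scaling, assuming a fixed 1-in-1000 sampling rate for simplicity; real sFlow agents report the configured rate in each datagram.

```javascript
// Rough sketch of scaling sFlow packet samples back up to an estimate of real traffic.
// Assumes a fixed 1-in-N sampling rate; actual sFlow datagrams carry the rate with them.
const SAMPLING_RATE = 1000; // one packet inspected per 1000 forwarded

function estimateTraffic(samples) {
  // samples: array of { octets } taken from sampled packet headers
  const sampledOctets = samples.reduce((sum, s) => sum + s.octets, 0);
  return {
    estimatedPackets: samples.length * SAMPLING_RATE,
    estimatedOctets: sampledOctets * SAMPLING_RATE, // a statistical estimate, not an exact count
  };
}
```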
Finally, one last thing that bears mentioning is that NetFlow is not at all the same thing as OpenFlow. I said I would get to this, and here we are. OpenFlow has probably been talked about by a few other speakers here this weekend. It is a software-defined network switching protocol; it's the protocol that SDN switches use to talk to their controllers to make decisions about how a given flow, which means something different in the SDN world, should be disposed of. So just take care about that.

So, moving on from what a flow is, let's talk about what the OpenNMS platform provides now in the realm of flows. Plenty of products offer flow monitoring, but it's a somewhat rare capability among open source products, at least ones that try to do many things. Since we are committed to making all the capabilities of OpenNMS available under an open source license, the flow subsystem had to fit inside those lines as well. Alright, so we have added support over the past few major releases of OpenNMS Horizon for collecting, persisting, visualizing, and now also thresholding flow data. We support quite a few flow export protocols, including NetFlow v5, NetFlow v9, IPFIX, and sFlow, all the ones that I've just gone over, and I think some other ones besides. We enrich the flow data with inventory data about the nodes that are under monitoring in OpenNMS, so that you can see the flows associated with nodes that actually mean something to you, not just an IP address. So you can actually anchor the flows onto the nodes that you're looking at in your daily life. We also have a flow classifier, which enables users to write rules to identify flows based on their attributes. So if you have custom protocols in your environment that are going to be seen on the wire, you can write your own classifier rules for those and sort them into buckets. We ship a ton of canned ones, and users can write their own using a straightforward syntax if they need to. When you visualize your custom app traffic in the dashboards, you can see the flow labeled in a way that's meaningful to your organization. So instead of just seeing UDP 6789, you can see "awesome custom app that we make."
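The actual OpenNMS classification rules are managed through its UI with its own rule format, but the idea is simple enough to sketch: a rule matches flow attributes and hands back a friendly application name. The rule fields and names below are hypothetical, just to illustrate the first-match idea.

```javascript
// Illustrative sketch of flow classification: first matching rule wins.
// The rule fields and application names are hypothetical, not OpenNMS's actual rule format.
const rules = [
  { name: "awesome-custom-app", protocol: "udp", dstPort: 6789 },
  { name: "https",              protocol: "tcp", dstPort: 443 },
];

function classify(flow) {
  const match = rules.find(r =>
    r.protocol === flow.protocol && r.dstPort === flow.dstPort);
  // Fall back to the raw protocol/port label when nothing matches.
  return match ? match.name : `${flow.protocol.toUpperCase()} ${flow.dstPort}`;
}

// classify({ protocol: "udp", dstPort: 6789 }) -> "awesome-custom-app"
```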
We also provide impressive horizontal scale for flow data. We have seen action in some very large environments with sustained flow rates exceeding 300,000 flows per second. Now, in the interest of full disclosure, it is not easy to achieve this level of scale; frankly, that's true of any platform. But with proper system sizing and tuning it is repeatable, and it is being used at these scales in anger at real sites in the real world. It helps to have a team who feed and care for the more unruly stack components, a data team for the care and feeding of your Elasticsearch and Kafka, which we'll come to momentarily. We also provide enterprise reporting in OpenNMS, and that includes PDF delivery via email of any flow report. This capability is something that we built specifically for the flows project, but it works with all other kinds of reports too, including non-flow dashboards. Finally, we offer visibility in the form of top-K, or as some people call it, top-N statistics. These are broken down by interface, application, host, and conversation, and we offer filtering based on quality of service settings. So you can ask questions like: what are the five applications using the most bandwidth on this particular link that's been saturated for the past 15 minutes? And get a very quick and detailed answer on that.

So this is what our flow visualization tool actually looks like. As you probably guessed, we're using Grafana as the framework for this visualization of flows, and with the help of a data source plugin that we maintain, Grafana is able to retrieve flow records from the OpenNMS core and present that flow data. We'll see a bit about how this data looks under the hood in a moment. We provide controls in this dashboard to choose among multiple data sources, so if you happen to have multiple instances of OpenNMS in your enterprise you can switch among them. We also offer ways to choose which node and which interface the flows that you're looking at are associated with. There is a picker for the differentiated services code point value, which becomes important where quality of service is enforced and measured. This panel is showing the throughput by application in graph form in the top left, and we also have it in a tabular representation at the top right. I know it's probably pretty difficult to read from out there. But we can zoom around on the application traffic in that top left panel, and we can see it in tabular form; it's exactly the same data, just summarized. And then we can zoom in on a particular time period in that SNMP panel in the second row. That's just our traffic volume data; that's what we get via SNMP. So if you see an interesting-shaped spike in that SNMP traffic level data, you can zoom in using Grafana's zoom capabilities and it locks in the same time frame for the application classification traffic. So you can see exactly, down to the sub-second level even, what the traffic is that accounts for that spike. This lets you really get into isolating the applications and the conversations that are responsible for that traffic volume. Because the flow data is not bucketized on a one-minute or a five-minute boundary the way the SNMP-collected data is, you can actually go way deeper in the flow data than you can in the SNMP data. It's a really, really powerful way to visualize this stuff.

Alright, now I did say I'm not going to try to put you to sleep, but here's a block diagram. Alright, so let's just take this as quickly as we can. On the left is the router or the switch that is exporting the flows. That's happening via whatever protocols you're using. That's going into a Minion host. You probably don't know what a Minion is, so I'll tell you. Minion is an OpenNMS component which provides edge visibility for an OpenNMS instance. That includes flow ingest, but also all of its other inbound and outbound capabilities. You can run a whole fleet of Minions either on your own, using Ansible or whatever you prefer, or you can also get an appliance product offering from us, which helps reduce the administrative overhead involved in running that many Minion systems. Minion receives the flow packets, does parsing and enrichment, and forwards the resulting document to Kafka, so it's floating around out there in the Kafka environment. Kafka, if you don't know, is a message broker, but so much more than just a message broker. You can also use ActiveMQ, but if you're doing this at big scales, Kafka is probably going to be a better experience. Now, the enrichment currently done by Minion takes the form of reverse DNS lookups. This is just looking up the conversation partners' host names based on the IP addresses. Now, why do we do this on Minion at the edge? Well, the answer to that is that where you do the reverse DNS resolution matters, right? Out at the edge, the DNS resolver that's configured may give you a different PTR record versus what you'd get if you did the same lookup in the core. So we just roll with the assumption that you probably want those IP addresses resolved to host names as close to where the traffic happened as you can get it.
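As a concept sketch, not Minion's actual code, that edge-side reverse DNS enrichment step boils down to something like this, run wherever the edge resolver lives:

```javascript
// Concept sketch of edge-side reverse DNS enrichment (Node.js), not Minion's actual implementation.
// The point is that the lookup uses the resolver configured at the edge, so the PTR answers
// reflect the network where the traffic was actually observed.
const dns = require("node:dns").promises;

async function enrichFlow(flow) {
  const [srcNames, dstNames] = await Promise.all([
    dns.reverse(flow.srcAddr).catch(() => []), // tolerate addresses with no PTR record
    dns.reverse(flow.dstAddr).catch(() => []),
  ]);
  return {
    ...flow,
    srcHostname: srcNames[0] ?? flow.srcAddr, // fall back to the raw address
    dstHostname: dstNames[0] ?? flow.dstAddr,
  };
}
```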
Okay, so we've got this flow traffic onto Kafka. Also connected to Kafka is one or more Sentinel hosts. You probably don't know what a Sentinel is either. Sentinel is another OpenNMS component. It's a workload-scaling node that can host several different functions that would otherwise be shouldered by the core. In this case, Sentinel is doing further enrichment, including application classification and inventory matching, as well as tagging the nodes and interfaces back in the OpenNMS core to indicate that there's flow data available for those interfaces at that point in time. Sentinel forwards the further-enriched documents back onto Kafka, and then they go back for another spin. Now, there's an optional piece shown here. It's a component that we're provisionally calling Nephron. I'll get into what its job is later, but you basically just need to know that it runs inside an Apache Flink cluster. It does a bunch of streaming analytics, then persists the results into an Elasticsearch cluster. Elasticsearch is where the flow record documents actually end up coming to rest. You may have noticed that we also have an open metrics TSDB down here at the bottom right, in the form of Cortex. This component currently complements Elasticsearch rather than replacing it, and I'll get into that in a few moments.

Streaming flows in the real world, and scaling them: it gets big. This is what flow data looks like at scale in a real-world customer environment; I think these are from the end of 2019. It's probably a little hard to read, but these are just indices in an Elasticsearch cluster, and we're looking at them in Kibana. This is how we store the flow records. In this environment we had 800 different routers exporting flows across the enterprise at one of our customers, most of them with flow processing enabled only on the internet-facing interface, though a few did have it on multiple interfaces. That's about 6 million flows per interface per hour. This data is going to add up fast. Here's a list of the various Elasticsearch indices that hold these flow record documents for a short period of time. You can see that they're bucketized by day and by hour and named accordingly. This strategy is configurable, so you can make the buckets smaller or bigger as your needs dictate. All flows for a given hour are in a single index here. That shakes out to around 140 million documents in the Elasticsearch cluster, filling about 80 gigabytes for each hour. You can imagine how this adds up over the course of 3 or 6 or 12 months. There are challenges. The volume of flow records in this environment has grown significantly since we took this snapshot, and that has exposed some additional challenges too. With such large volumes of flow record documents, we began to run into performance challenges, especially when we would render the dashboards. Performance for these dashboards was fine when the time range covered only 5 or 15 minutes, but with larger time ranges, such as the last day or the past week or 30 days, it could slow down pretty badly, even though those time ranges are super helpful in getting a handle on the high-level trends of application traffic in your network. There's just so much computation going on to crunch and aggregate those flow records over such a long time frame that the Elasticsearch queries become prone to timing out at those longer time scales. To render a flow report covering a 30-day period, we end up having to do these top-N calculations over about 4 billion flow record documents. In some cases we saw that these reports covered flows involving 120,000 unique hosts and 6,000 unique applications. For example, we might ask for the top 5 conversations among those 4 billion documents, and that's quite a lot of data for us to sift through to get the answers that we're looking for in real time, or just in time. There are a number of different approaches to address this kind of challenge; that includes pulling all that data into memory and using something like Druid to chew through it for us. That's expensive in terms of memory, so we ended up discarding that approach, and instead we chose to build a streaming analytics pipeline and precompute some of those longer-time-period aggregations. That's what the Nephron project does. As that data comes through the streaming analytics pipeline, we get a big speed-up when we're presented with these very daunting long-time-scale queries. This was the genesis of the Nephron project, which you saw boiled down to one tiny little box in that block diagram earlier. Nephron's main purpose is to make these dashboards quicker to load when they cover longer time periods. That's really all its job is.
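To give a flavor of the kind of query Nephron precomputes answers for, a "top 5 applications over 30 days" request against the raw flow indices looks roughly like the terms aggregation below; Elasticsearch has to visit every matching flow document to build those buckets, which is what makes long windows so expensive. The index pattern and field names here are illustrative, not our exact schema.

```javascript
// Roughly the shape of a "top 5 applications by bytes over the last 30 days" query.
// Index pattern and field names are illustrative, not the exact OpenNMS flow schema.
// Without precomputed aggregates, Elasticsearch must bucket every matching document.
const topAppsQuery = {
  size: 0, // we only want the aggregation, not individual flow documents
  query: {
    range: { "@timestamp": { gte: "now-30d", lte: "now" } }
  },
  aggs: {
    top_applications: {
      terms: { field: "application", size: 5, order: { bytes: "desc" } },
      aggs: { bytes: { sum: { field: "octets" } } }
    }
  }
};
```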
Nephron does come at the cost of considerable added complexity, though. If anybody's ever used Flink, it is its own kind of special beast. Alright, so beyond the added complexity, streaming analytics introduces some challenges of its own at runtime. Anybody who's dealt with flows is already aware that things get weird if you don't have everything NTP'd: if all your clocks are not synchronized, the data looks weird and it's not reliable. But even when you get everything NTP'd properly, streaming processing introduces its own time lag concerns, because it's grouping every element by its time window, and while network sizes tend to vary, as far as we know the speed of light is more or less fixed. Currently in Nephron we use the Apache Beam streaming framework, which provides some nice help with these problems, but it's still very difficult to engineer around them. Now, I am way past my limit talking about these concepts, so I'm just going to move along at this point.

Alright, the future. Talking about the future is nice because nobody knows what it will look like and nobody actually expects us to get it right. We've accomplished quite a lot in OpenNMS with respect to flows, but we're not content, and we're continuing to push these capabilities forward. We already support the most popular standardized and quasi-standardized flow export protocols, but our flow architecture is designed to accommodate additional ones without the need to reinvent the wheel or make major architectural changes to the platform itself. Naturally, we would like to reduce the complexity of the solution. Finding a way to eliminate the need for an Apache Flink cluster would be a huge win. I don't know how feasible that's going to be, but we've got people working on it. Another thing that would be really nice is to make it possible to store flow record documents in Cortex or Mimir without the need for Elasticsearch. The reason we still need Elasticsearch today is that flow data has extraordinarily high cardinality. You have to multiply together the number of possible values for every dimension: source and destination IP address, source and destination port, IP protocol, all of those things. You multiply each of those things together by how many values of that thing you could possibly have to arrive at the level of cardinality you would need to handle. Cortex is just not currently able to cope with the high cardinalities that result from storing flow data in this way. One approach we're considering is to work upstream in the Cortex project, or potentially in Mimir, depending on what direction that project takes.
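That cardinality argument is just multiplication. Here's a back-of-the-envelope version using made-up but plausible counts; the exact numbers don't matter, only that the product explodes.

```javascript
// Back-of-the-envelope cardinality estimate for flow data stored as labeled time series.
// The counts below are invented for illustration; the point is that the product explodes.
const uniqueSrcHosts = 120_000;
const uniqueDstHosts = 120_000;
const uniqueSrcPorts = 10_000;  // ports actually observed, not the full 65,535
const uniqueDstPorts = 10_000;
const ipProtocols    = 3;       // e.g. TCP, UDP, ICMP

const worstCaseSeries =
  uniqueSrcHosts * uniqueDstHosts * uniqueSrcPorts * uniqueDstPorts * ipProtocols;

console.log(worstCaseSeries.toExponential(1)); // ~4.3e+18 potential label combinations
```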
Finally, while this kind of solution will never be trivially easy to deploy, we're taking sort of the same approach that everybody takes these days, which is: make a Kubernetes operator. Push a button, get flows. That work is actually happening right now. Speaking of the future, if you would like to be part of the future of OpenNMS, if you're a monitoring nerd like me with an interest in hacking on a platform that is committed to the open source way of doing things, I encourage you to check out what we're up to. We have a Discourse board for long-running discussions; you're going to find me and Dino and some of our colleagues on there. It's a great way to get help in sort of long-running, asynchronous conversations. We've also got a Mattermost that you can join; it's public and anybody can make an account, so if you want to do real-time chat, that's where you can find all of us too. Our JIRA is public, so if you find a bug or if you have a great idea for an enhancement, we'd love to hear about it. The bottom line is we would love to help turn your ideas about monitoring into open source code or docs or clever integrations.

So I'll wrap up here. In years past, when somebody has stood up here from OpenNMS, flow support did not exist in our products at all. So I feel really good about what we've accomplished so far. We've gone from no flow support at all to a quite scalable solution that solves real network traffic visibility problems for some of the world's largest organizations. We still have a lot left to do, and I hope some of you will take an interest and join in the fun, because this stuff, let's be honest, if you're the right kind of person, is just fun. We are eager to have contributions in any form from the user community. So come see me after if you think you might want to hack on this stuff with us. I would like to thank the SCALE team once again for the excellent job that they've done in organizing this event amid so much uncertainty. I want to give special thanks to all of the volunteers, but extra special thanks to the AV team. Y'all make this thing go. Thank you very much for putting on this thing, and I appreciate you. So to all of you in the audience and watching the playback, thank you for your time and attention today. I think we have a few minutes left for questions, so please lay them on me. I think Dino has an audience mic.

Have you been evaluating OpenSearch as an alternative to Elasticsearch? I personally have been; I've gotten it to work. There's a plugin that gets installed in the Elasticsearch or OpenSearch cluster to enable storing records in the format that we use; we call it the Drift plugin. Drift was just the engineering code name for the flows and streaming telemetry project. The plugin builds and installs in an OpenSearch cluster as long as you are able to line up the OpenSearch version against the Elasticsearch version. If you've ever built these plugins, you know that they're even pickier than Postgres about the version that they're built against. The good news there is that we have just finished some work in the engineering team to automate the builds of that plugin, so OpenSearch should be just about as easy to work with as Elasticsearch. It's just not as well exercised yet. But yeah, we want this to be fully open source, and we understand Elasticsearch isn't really open source anymore. So yeah, great question. Thanks for asking it. Did I address it? Great. Thanks. So I kind of have a part two to that one, because you suggested OpenSearch: have you looked into alternatives like ClickHouse or something like that? Because I've seen a number of flow analyzer products that use ClickHouse, since you don't need to have all the extensive hardware to run an ES cluster. What's ClickHouse? The database system that was done by Yandex, the Russian search company. Oh, Yandex? Yeah. I'm not familiar with it. So it's kind of a bit of a MySQL-ish system; they've just got a fast database engine that can handle those kinds of queries, where you can scan millions and millions of rows within like 250 milliseconds, something like that. That sounds impressive; I should look into that. MergeTrees, partitions, the whole bit. Okay, just like Quick House? Yeah, just ClickHouse. Oh, Click. Click. Yeah. ClickHouse. Okay, English spelling of house. Awesome, thanks, I'll take a look into that. All right.
Any other questions I can address? Okay. If not, thank you all very, very much for your time today. Oh, level one. Okay. Oh, yeah. In the back, can you hear me? Oh, hey, it's on. Hello, I am presently speaking. Hello to the folks in the back. Hi, welcome, come on in. What's that? Oh, sure. Yeah. Sweet. You want me to do something fun like this? Oh, nice. Okay. Sweet. Thank you. Well, hello, everyone. Hello to the people in the front and the people in the back. Welcome, and thank you for joining me on this adventure into latency, specifically latency in the front end and what we can control. My FOSS friends, you've nearly made it. The conference is almost over and there's one more amazing talk after this, so I want to congratulate you on making it this far and being a part of something that matters. This is a unique conference, so way to go, open source friends. Before we get going here: conferences are a rare opportunity to get to meet other like-minded people, and that's what they're all about. So I'm going to give you 30 seconds to say hello real quick to someone that you haven't met yet. Go for it. Awesome. I opened a can of worms, but we did it; we made some introductions. Originally, I was going to play "The Power of Love" by Huey Lewis and the News, but I realized that would be copyright infringement since these talks are recorded. Sorry about that; not going to happen today. I do want to just take a moment and say what a beautiful city this is. If you live here, it's awesome. Being here as an adult is kind of a new experience; most of my understanding of LA has been shaped by movies. I didn't know if I was going to show up and it was going to look like this, or if there were going to be strange things afoot at the Circle K, or if my diet would mostly consist of tacos and donuts while I was here, but all of these assumptions have been blown away. I'm just grateful to be in this incredible city with you, so thanks for having me.

So hi, my name is Ben and I'm known as obensource pretty much everywhere on the web. My heartbeat is in community projects, and this is me helping lead an internationalization summit for Node.js a couple of years ago. I've been involved in these projects over the years, and I work at Datadog to help people know how to observe the front end and get in the flow of utilizing tools to help make sense of their web app's user experience, using practices and tools like real user monitoring and globally distributed synthetic testing for front end performance. So that's a little bit about me. Let's jump in. Why is understanding latency in web apps important? Because you've got to get back to your family, right? Addressing this latency means that there's more time with your family and friends and less time waiting for your site to load. And most people feel this way about web apps too. In a recent mobile performance survey from Google, we can see that the time it takes to load a page significantly outranks any other major concern that people have when using a web app, primarily on their phone. It's also clear that the probability of someone bouncing from a web app, and likely never looking back, increases by over 100% if it takes up to six seconds to load. And another important factor is that latency matters to businesses, a lot.
In a brilliant progressive web app performance case study from Pinterest, they showed that their signups increased by 15% when their users' perceived wait time was reduced by 40%. So if measuring and mitigating web app latencies is this crucial to your app's performance and experience, where do you start? Well, considering what to measure on the front end, your initial assumptions might look something like this. The user sends a request, and there's some initial server boot-up time, and then some HTTP handshakes are made, and that sets up a secure data pipeline. Then there's the time it takes to deliver a bundle and session data and assets and more, which may take more or less time depending on how many files and assets you need to load at a given time, like lots of pictures of cats. And then there's the time spent loading scripts and rendering the DOM, like, you know, 653,000 potato, whatever. So we're loading and rendering things. And then finally everything gets painted on the screen, and therefore all the latencies that really matter for your web app's front-end performance have been accounted for, right? And now they can be measured using browser dev tools like Lighthouse in Chromium and evaluated that way, right? Well, no. It turns out that it's not quite that simple. If you're going to meaningfully measure the performance of your web app, a good place to start is to consider the thresholds where you can accurately measure latencies that directly affect your users, and those thresholds are: humans, number one; hardware, number two; and then, of course, web apps, which is what we usually think of. First, we need to take a step back from the front end and understand how latency in our own biological system works. It seems that human behavior in general is currently the most untapped variable in the web app latency equation. We do know, though, through a human benchmark study, that there's an average human response time to new stimuli of about 273 milliseconds. So when you're presented with new stimuli and you react, it usually takes about that long for the average person. And this mental chronometry helps reduce the user's perceived wait time in a single-page web app, because the user gets to start making decisions about what to do after the first few paints, even if all the content hasn't been loaded yet. For this example here on twitter.com, you can see that as soon as there are contentful paints, the user can start to reason about what they're looking at: first with a splash screen for loading, and then with some navigation, and content all starts to show up even before the tweets from the profiles they follow are loaded. And this all happens long before the full onload event occurs in the browser. So our goal today will be three seconds. Can we load something in three seconds in order to keep our users from bouncing? We can't just jump into software yet. First we still need to consider some of the latencies that are inherent to the hardware environment that our apps run in. And the most important thing to note is that there's a significant amount of latency that occurs for an average user simply by using their device's input and output. In this video from Microsoft Research, you can see that the average latency in their tests for mobile devices is about 50 milliseconds, and you can see how that breaks down and how it's not very optimal.
There's less of a latency budget than you think. With that, we'll go ahead and start tracking the latency of our web app's mobile experience, and then add 50 milliseconds to the total latency we have incurred just by simply using common hardware. Next, we'll add the 2021 average time-to-first-byte metric, which is about 2,600 milliseconds. This includes measurements like cold starts for serverless solutions and more. And now you can see that before we've even done any front-end measurements at all, we've nearly spent our latency budget and have almost hit the optimal speed index for our app. That's too bad, but we'll move on anyway. So let's talk about the front end. According to Google research, 15.3 seconds is the average time it takes to fully load a mobile page. So as we fall far beyond the ideal goal of around three seconds, we're now going to see how our app stacks up against the average global load time. And at this point, we finally get to add in the latencies that we were originally considering up front, so let's begin. We'll tack on 1,300 milliseconds to load our app's runtime data into the browser, like the JavaScript bundle, the assets, session data, and more. That will bring us up to about 3.9 seconds. Next, our browser is going to parse the incoming scripts and put any rendering on hold, since JavaScript can't do both on the main thread, it being a single-threaded runtime. This will add about 8,300 milliseconds, bringing our speed index up to 12.2 seconds. Now that the scripts are done, the browser can render the DOM and the CSS together into something that it can paint. This is going to take about 2,700 milliseconds, bringing our running total up to about 14.9 seconds. And then finally, the browser can paint some results for our users to see. This won't take much time, about 400 milliseconds. And now the total speed index has reached 15.3 seconds, which is exactly the global average for an initial load time on mobile. However, there are still a lot of other tasks that phones, and laptops too, will be processing at this moment, since they'll have multiple apps running concurrently. The system tasks that share resources with our app are also going to add some significant time to the experience here, say about 1,000 milliseconds. And there you go. We've just made it above the average range for the mobile load time of a single-page web app. It's definitely a long shot from an ideal 3 seconds. But if we apply some front-end perf best practices and monitoring across the areas that we've measured, the total speed index can get significantly closer to that goal.
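Adding it all up, the running total from that walkthrough is just a sum. The figures below are the averages quoted on the slides, not measurements of any particular app:

```javascript
// The running latency budget from the walkthrough above, in milliseconds.
// Figures are the averages quoted in the talk, not measurements of a specific app.
const budget = {
  inputOutputHardware:  50,    // average device input/output latency
  timeToFirstByte:      2600,  // 2021 average TTFB, including serverless cold starts
  bundleAndAssets:      1300,  // JS bundle, session data, assets
  scriptParseAndExec:   8300,  // main thread busy, rendering on hold
  domAndCssRender:      2700,
  paint:                400,
  concurrentSystemWork: 1000,  // other apps and system tasks sharing the device
};

const totalMs = Object.values(budget).reduce((a, b) => a + b, 0);
console.log(totalMs, "ms total, versus a ~3,000 ms goal"); // 16,350 ms, about 16.4 s
```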
It's still worth noting that there are also other things that we can consider in this performance calculation, like mental chronometry and more. I'm going to get a drink of water real quick. I feel like I've been wearing a mask this whole time, and now that I'm not, I'm like breathing in dust. So if the app's first paints include some information about loading, then the initial DOM content gives the user something to do before the app fully loads, and we can slightly decrease the overall perceived wait time here. But this can't resolve the fact that it's taking too long to load, to render, and to paint the content. So shouldn't the goal for the total acceptable latency of a mobile SPA be closer to something like 3 seconds rather than 15? Something definitely feels broken about this process. One more drink. Oh, thank you! You know, when we're doing high-level stuff, I really love to just take the reins and make it fun, especially at the end of a conference, when you've had a lot of data points and things that you've been taking in. Let's have a little bit of fun. So anyway, we're done, right? For now? Well, not yet. There are other relevant latencies that are hard to track but worth mentioning. I don't know if you've seen this kind of thing before, but this really put it into perspective for me. I was like, oh yeah, maybe there's other stuff going on and that's why the loading time is taking forever. One of the most significant ones is distance. The math works out such that every 120 miles adds around one millisecond of latency to the average packet, which is usually about 500 bytes, and it's a huge culprit in adding latency to the overall experience. Another one is regional network congestion. Not all regions have the same amount of bandwidth allocated to them, and when their traffic gets overly congested, your data will get stuck in it and just arrive too late. Yet another issue is local network congestion. If you have, say, a 100 Mbps internet connection at home and a few devices are all streaming 4K video while you're trying to load an app on your phone, you're going to have a really hard time. As you can see with your user right here, he's screaming in the basement.
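That distance rule of thumb is easy to turn into a quick estimate. The 1 ms per 120 miles figure is the talk's rule of thumb, one way, before any congestion; the example route and mileage are made up:

```javascript
// Quick estimate of distance-added latency using the talk's rule of thumb:
// roughly 1 ms of latency per 120 miles a packet has to travel.
function distanceLatencyMs(miles) {
  return miles / 120;
}

// e.g. Los Angeles to an East Coast data center, very roughly 2,300 miles one way:
console.log(distanceLatencyMs(2300).toFixed(1), "ms one way"); // about 19.2 ms
```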
So now that we have a more accurate scope regarding all of the elements that have to perform well in order to deliver a good user experience, let's put some useful front-end monitoring into action so that we can accurately measure the perf issues that we can do something about. Configuration specifics aren't going to be in the scope of this talk today. There's a little bit of code, but what I hope to leave you with is an understanding of where you can get started with securing the future of your web app's UX performance quality now, with free and open software. And regardless of which implementation you choose to pursue, you can put a solid UX monitoring solution that works for you and your front-end team into your tool belt. So with that, there are a few ways to get at this. Personally, since I work for Datadog, I typically get to utilize their UX monitoring services directly, and their tools streamline what I'm about to cover for you. And of course, that all comes at a cost. For example, all you have to do is add a JS blob into your app's index.js file, and then watch the data just start streaming in, and you can immediately generate averages to determine which vectors of your front end are healthy or not. But this is SCALE, right? We're at the SCALE conference. So here we don't really care about signing up for more proprietary services. We'd rather understand how to stay in the open and utilize tools that help us do that, while also supporting the free software ecosystem wherever we can. So with that in mind, let's talk about real user monitoring. Real user monitoring usually gets perceived as being a product, since big companies have real user monitoring products. But in reality, it's not a product; it's a practice, and it's a term that's not owned by anyone. It's a UX health and sustainability practice that's becoming more and more vital to the flow of front-end development, and it keeps your web apps useful and enjoyable from the vantage point of your end user. We're not talking about scraping user data for advertising or anything like that. We're talking about understanding every latency from every single session that comprises your actual user experience, in aggregate, and how we can dependably ingest that as time-based data points and then use them to immediately inform us about how and where our UX issues are coming from. We do this to knock them out as they arise, of course, but also so we can always and forever keep our web apps performant and valuable for everyone who's using or building them.

A good real user monitoring solution will help you understand what your Core Web Vitals are across your sessions, all your sessions. If you're unfamiliar with Core Web Vitals, they are a universally adopted set of three UX quality signals that determine essentially how usable your web app is. And here they are. The first one is the largest contentful paint, which essentially measures how quickly the app is perceived to be loaded. The second one is the first input delay, which measures the time between the user's first interaction, like a UI interaction, and the response: how long did that take to successfully occur? And the last one is the cumulative layout shift: how much content in your front end shifts around during its runtime while a user is trying to use it. You might have experienced the unfortunate situation of being accosted by ads while you're just trying to use a UI; they shift around, and you're just trying to go to a blog, but you accidentally click on an advertisement that takes you somewhere else. That is cumulative layout shift in action. It's a terrible user experience, and it's tracked in this way. We'll cover how you can measure and report the Core Web Vitals from scratch with some vanilla JS in just a little bit. But the next thing you really should have from a monitoring solution is the notion of end-to-end visibility. You should be able to see the correlation between front-end requests and the time it takes for your back-end services to run and return whatever the user requested. In short, this addresses the need for visibility into exactly which processes are affecting your real users' experience, and especially the ones that are causing them to bounce. Because if you have something that you really care about and you're serving it to a lot of people, you want them to stick around; you want them to stay interested and not miss out on what you've provided for them just because it took too long to load. So this is how you can keep that in check. For example, there may be an abnormally long delay when your Node or Python HTTP request library makes calls to your discounts service, and there's something in the service script that's causing the index page to render extremely late. Like, who the heck left the sleep in there? Or your initial load time is just horrendous on mobile because of the size of the images and assets that are being loaded. Who thought it was a good idea to put like a 6 megabyte image in there? These are the kinds of issues that real user monitoring helps pinpoint and alleviate whenever they occur. So if you're going to do this without paying for a service that automates it for you, here are some well-worn paths that you may want to consider. You need to correlate a group of services that bring together a solution that's reflective of the TICK stack.
You may or may not be familiar with the TICK stack, but it's usually the entry point into considering monitoring for the front end as well, if you're going to stand up your own solution here. It's usually broken down into four concerns. The first is to capture time series data from your web app's front end, either by continuously polling a file or by handling the incoming data stream from a real user monitoring service that's sending it out. The second is to use a time series database to store your UX monitoring data. The third is to leverage a data visualization platform where you can always view your latency averages. And the last one is to configure an alert monitoring service that lets you know, because what the heck are you going to do if you have all this nice stuff set up and you still have to check it manually? You don't want to do that, right? You want something to tell you when something's happening. So the last concern is that alerting service that lets you know when a bad UX threshold has been reached, when a front-end process has become unresponsive, when something goes out of range, things like that. I honestly wish it were called something other than the TICK stack, so that you'd think about it in terms of what you're doing rather than the names of individual tools, but I digress. What matters most is that you have a data source, like your web app, where your user data is coming from; a data store or handler, which is the resource that retains your stats so that your visualization can draw from it; a data visualization tool, obviously, where you can calculate and monitor the averages you want to keep track of; and alert thresholds set against those. Beyond this, there are many core concerns in your business that you can get ongoing numbers for through real user monitoring. You might want to set up custom monitoring events that fire in your JS, or whatever language you're using, at specific points in the user's journey to capture important moments: how many signups did you get over time? How many logins were there in the last two weeks? How many posts were published in the last month? How many transactions completed successfully versus failed? Reporting these as they happen can be crucial to your web app's ongoing sustainability, and I'll show a tiny sketch of what firing one of these events might look like in a second. But in UX land, what I've described so far should be more or less your ticket to a core understanding of how useful your web app currently is from the vantage point of your real users. So at a high level, let's break it down a little bit more, just enough that we're comfortable diving deeper later on. Okay, so maybe you didn't design your web app to run on a Commodore 64 like these users here. But there are a growing number of ways to establish a high-fidelity, time-series-oriented connection to the UX data coming from your web app, your data source. Some of these options have been standardized, which is really cool, and are built into current browsers and their APIs. Others come from the open source efforts of popular monitoring services; Datadog has an open browser SDK, for example. These can aggregate all that user session data more or less out of the box, so you can tap into that, and we'll talk more about it in a few moments.
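Before we move on to concern number two, here's that custom business-event sketch I mentioned. It's a minimal, hedged example in plain JS: the /events endpoint, the payload shape, and the trackEvent helper are all assumptions you'd adapt to whatever ingestion service you end up standing up.

```javascript
// Hypothetical helper for reporting business events (signups, logins, completed
// transactions) as timestamped data points. The '/events' endpoint and payload
// shape are placeholders; point this at whatever ingestion service you run.
function trackEvent(name, tags = {}) {
  const payload = JSON.stringify({
    event: name,
    tags,                    // e.g. { plan: 'free' } or { method: 'oauth' }
    timestamp: Date.now(),   // epoch milliseconds; your TSDB indexes on this
  });
  // sendBeacon is fire-and-forget and survives page unloads better than fetch.
  navigator.sendBeacon('/events', payload);
}

// Fire these at the moments in the user journey you care about:
trackEvent('signup_completed', { plan: 'free' });
trackEvent('transaction_failed', { reason: 'payment_declined' });
```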
Okay, let's cover concern number two: your monitoring data store, or handler. Here you'll need an API or ingestion service, which could be InfluxDB's own OSS API or Telegraf, and you'll need to configure that service so it's authorized to connect to your data source, your app, for streaming or polling. Then you configure it, Telegraf for example, to either continuously scrape a file that you're periodically writing to from your app, or to receive a stream of event data from your web app that it can process and write to your time series database, like InfluxDB or OpenTSDB, which is a great free software option. And just so you're aware, I'm not pulling these services out of the ether: these are both popular TSDB options that have been used and battle-tested by companies like IBM, Cloudflare, Cisco, and Vonage, so people depend on this stuff for big infrastructure. Now let's move to concern number three, which is data visualization. Here you're going to send your ingested data out to a high-fidelity time series visualization tool like Grafana, or Prometheus, or even InfluxDB itself, which now has a nice advanced front end called Chronograf. Using these, you can set up your own dashboards and alerts. If you go the InfluxDB route, you can also leverage Kapacitor, their alerting and data-processing service, directly through the UI, which makes it really easy to get up and running with alerts: you can create your own alert rules right there, and it's already built to run alongside your dashboards, so you can get going pretty fast.
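To make that second concern a little more concrete before we dive deeper, here's a minimal sketch of the hand-off from the app's side: shipping one measurement to an HTTP ingestion endpoint using InfluxDB line protocol. It assumes some ingestion service, say a Telegraf HTTP listener input, is sitting at the URL shown and writing into your TSDB; the URL, measurement name, and tags are all placeholders.

```javascript
// Minimal sketch: ship one UX measurement to an HTTP ingestion endpoint using
// InfluxDB line protocol. Assumes an ingestion service (e.g. a Telegraf HTTP
// listener input) is listening at INGEST_URL and writing to your time series
// database; the URL, measurement, and tag names below are placeholders.
const INGEST_URL = 'http://telegraf.example.internal:8080/telegraf';

function writeLineProtocol(measurement, tags, fields) {
  const tagStr = Object.entries(tags).map(([k, v]) => `${k}=${v}`).join(',');
  const fieldStr = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(',');
  // Line protocol: <measurement>,<tags> <fields> <timestamp>
  // Timestamps are nanoseconds, so pad the millisecond clock with six zeros.
  const line = `${measurement},${tagStr} ${fieldStr} ${Date.now()}000000`;
  return fetch(INGEST_URL, { method: 'POST', body: line, keepalive: true });
}

// Example: record one largest-contentful-paint sample for this page view.
writeLineProtocol('web_vitals', { page: 'checkout' }, { lcp_ms: 1840 });
```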
Now let's dive a little deeper on what we just flew through. I'm not going to go super deep, but that was a 30,000-foot overview, so let's go one level down. If this is how you're going to get your real user monitoring going, I hope you can at least feel assured that it can be done, and that you'll be a little more prepared to do it having thought it through with me here today. So here are some of the ways you can get those real user monitoring metrics out of your web apps, in your production pods or VMs or wherever they're hosted. You can use APIs that are in the browser natively; there are native browser APIs for this, like PerformanceObserver. With them you can continuously calculate and report your latency measurements directly where it matters most, on every runtime, and then either write them locally to a file to be scraped or send them out through a designated port that your ingestion service is listening on, so they get stored as data points in your TSDB and visualized in your dashboard service. But that brings up the question from before: can you really measure core web vitals and other important quality signals this way? Is it doable? The answer is yes, you absolutely can. So briefly, let's cover how you'd do that with these APIs. For largest contentful paint, you instantiate a new PerformanceObserver; that's the API you should really consider, and it's becoming more and more widely adopted. It hands your callback an entry list of the performance events that occurred, basically from when the page started loading through the life of the browser session, and you pull them out through a method called getEntries. That's what lets us observe and get back the time series metrics we actually need. You can observe entry types like largest-contentful-paint, which was the first core web vital we wanted to see, right? And every time you write one of those entries to your file, or send it over the port to your ingestion service, you can be sure those data points are getting to the right place every time, so that you can visualize them. That's how you use something like the PerformanceObserver API to get largest contentful paint. Or say you want to calculate time to first byte. It's not one of the three core web vitals, but there are more signals than just those three, and this one is very important in your latency equations. To get it, you can use the performance API: grab the entries of type navigation, then do the calculation by taking the time the response started, the bigger number, and subtracting the time your request first started, the smaller number, to figure out when the user actually got something back. What was the time to first byte? Then, of course, you write that duration to your file or your ingestion service and propagate it. Pretty cool. There's a whole list of supported entry types you can use to get or calculate your web app's vital scores, some of them core web vitals, some of them other signals. There's a handy API for this called supportedEntryTypes: grab PerformanceObserver.supportedEntryTypes and you can see every entry type the current browser supports. You can get event timings, first input, largest contentful paint, the navigation entries, or you can see all the long tasks that happened while scripting was going on, what was taking a long time and when; the durations come back on those entry objects. So that's one route you can take to get some UX data.
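Here's a minimal vanilla JS sketch of what I just described: observing largest-contentful-paint entries with PerformanceObserver, deriving time to first byte from the navigation timing entry, and peeking at supportedEntryTypes. The reportMetric helper and its /rum-metrics endpoint are stand-ins for however you actually write to your file or ingestion service.

```javascript
// Vanilla JS sketch of the measurements described above. `reportMetric` and the
// '/rum-metrics' endpoint are stand-ins for however you actually ship data
// (write to a scraped file, or POST/beacon to your ingestion service).
function reportMetric(name, valueMs) {
  navigator.sendBeacon('/rum-metrics', JSON.stringify({ name, value: valueMs, ts: Date.now() }));
}

// 1. Largest contentful paint: observe LCP entries as they occur. The last entry
//    reported before the page is hidden is the page's LCP value.
new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    reportMetric('largest-contentful-paint', entry.startTime);
  }
}).observe({ type: 'largest-contentful-paint', buffered: true });

// 2. Time to first byte: not a core web vital, but worth tracking. Take the
//    navigation timing entry and subtract the moment the request started from
//    the moment the first byte of the response arrived.
const [nav] = performance.getEntriesByType('navigation');
if (nav) {
  reportMetric('time-to-first-byte', nav.responseStart - nav.requestStart);
}

// Curious what else this browser can observe? Check the supported entry types:
console.log(PerformanceObserver.supportedEntryTypes);
// e.g. ['first-input', 'largest-contentful-paint', 'layout-shift', 'longtask', 'navigation', ...]
```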
Yet another route is to leverage what an open real user monitoring SDK can provide. Datadog has one called the browser SDK, it's open source, and you can grab it and go. It takes a little more configuration, because you're likely going to run into some authorization issues and things like that you'll have to dance around, but I've been assured it can be done. Both the browser SDK and the Datadog agent are open source, so if you really want to go in there, reason about it, and stand up your own pipeline, you can; the link to the open browser SDK is there on GitHub. The way you'd go about it is to install the agent, which is also open and also on GitHub, run it alongside your web app, install Datadog RUM on your front end, add the connection to the RUM service in your index.js file, and then specify a proxy URL, a supported option in the RUM SDK, pointing at wherever you want to send your data. That could be a service like Telegraf, for example, or whatever relay you like, which then forwards it to a time series database like InfluxDB. There are other properties you'll probably have to finagle a little, but the proxy URL is the main one to know about, and I'll show a tiny init sketch in a moment. Also, before you send data anywhere, you'll want to configure your ingestion service to mirror the real user monitoring data model a bit so everything plays nicely, and then you can get some really nice metrics out of it. The data model is built into the browser SDK, and you can model what you want to capture down to the level of individual events: it reports events based on action types like errors, long tasks, resources, and views, and there's data that comes along with those objects that you can use. Okay, so those are a couple of ways to consider aggregating data out of your web app: you can have some fun messing around with the larger open source solutions, or, if you just want some straight-up metrics quickly from every deployed instance, you can add those JS API calls and get them right out of the browser. That's a really good way to go. But what's a good route for the time series database itself? InfluxDB OSS is one of the most popular options here, probably because it's a single small binary that can be deployed quickly and uses minimal resources. It is worth mentioning that there's a bit of a divide between what Influx Data offers out of the box in InfluxDB OSS and what lives in their cloud services, and that's a dance you'll be doing if you go that way. If you want to get around that, you should probably look into a tool like OpenTSDB, which is a completely open time series database built on Apache HBase, a distributed NoSQL store. OpenTSDB makes it really convenient to scale if you've got a lot going on across the nodes in your Kubernetes clusters, because it runs time series daemons that can receive data from every pod and store it in an HBase instance accordingly. So that's an awesome, scalable, open way to go if you want to implement something like that. Great. Now that we know what can be done to store our time series data, what are the viable options for visualizing it and easily getting alert monitoring set up? Well, if you go the Influx route, they have a front end called Chronograf, and it's thoroughly integrated with an alerting framework called Kapacitor. Kapacitor is great: it triggers alerts and runs ETL jobs, short for extract, transform, and load, and it can detect anomalies in your data, which is a really nice feature because you don't have to build that yourself. Chronograf's alert monitoring rules align with Kapacitor tasks that fire alerts whenever the conditions you've set are met, so your front-end team knows whenever something occurs. When I was first getting into front-end observability and front-end monitoring, it was like, cool, I can set this up, I can set that up, I can see stuff. But where it all comes together is when you have something that can just tell you when there's a problem. That's what makes this stuff so valuable: you don't have to sit there thinking, okay, I've got to go check the dashboard again.
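Circling back for a second to that Datadog browser SDK route before we get into alert strategies: a minimal init might look roughly like this. The package and init call are from the open source @datadog/browser-rum SDK, but the exact option names have shifted between SDK versions (the proxy setting has appeared as both proxyUrl and proxy, for instance), and the IDs and URL below are placeholders, so treat this as a shape to adapt rather than something to copy verbatim.

```javascript
// Rough sketch using Datadog's open source browser RUM SDK (npm: @datadog/browser-rum).
// Option names vary between SDK versions (the proxy setting has been spelled both
// `proxyUrl` and `proxy`), and the IDs and URL here are placeholders.
import { datadogRum } from '@datadog/browser-rum';

datadogRum.init({
  applicationId: '<YOUR_APPLICATION_ID>',   // placeholder
  clientToken: '<YOUR_CLIENT_TOKEN>',       // placeholder
  service: 'my-web-app',
  env: 'production',
  // Relay RUM events to your own endpoint (e.g. a Telegraf HTTP listener that
  // rewrites them for your TSDB) instead of Datadog's hosted intake.
  proxyUrl: 'https://rum-relay.example.internal/intake',
  trackInteractions: true,                  // capture clicks and other user actions
});
```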
So once those alerts are set up, if one of your core web vital scores, say the largest contentful paint for your app, climbs above that 2.5 second threshold, or whatever threshold you've chosen, you just get an email or a ping or whatever, and you don't have to worry about it anymore; you don't even have to worry about thinking about it anymore. It's one more thing out of your cognitive load. This is where the magic all comes together: you can keep your experience performant and usable, and people aren't going to bounce. The rules you set here become tasks, and they generate scripts called TICKscripts, which you can also edit manually if you want. That's a really nice way to go. Now let's consider a couple of common, useful front-end alert strategies. You might want to trigger an alert when a ceiling or floor threshold is met. Say the global average of, like I said, largest contentful paint goes above 3 seconds; your bounce rate is going to skyrocket, so you set up an alert for that ceiling. Or maybe a floor is met: your sign-up rate fell below the monthly target that keeps your business afloat. You need a certain number of users signing up, and if you drop below that, you're in the danger zone, so you can set up an alert for that as well, based on the metrics you're getting about how long things take to load and whether people are bouncing. Or maybe you want to trigger a dead man's switch when a service fails. For example, if you have a microservice architecture and one of your front-end dependency services just stops running, you don't want your UI to simply crash and your users to conclude this is a terrible app. You might want that service to tie into a switch that boots up a recovery process for whichever front-end dependency died, that kind of thing. All right, I've covered one well-worn path using a decent open dashboard tool that has tie-ins for easily configuring your alert monitoring. There's another one worth mentioning, and that's Grafana. Alerting in Grafana is available in their OSS version, and it's really popular for front-end monitoring as well. Depending on how you run it, though, you can come up against ingestion limits; on their hosted offering, once you go over a certain number of sessions you hit a paywall and they'll want you to start paying. So you'll still have to assess whether that fits what you're doing, or whether you'd rather build and maintain something that keeps working for you no matter how many sessions you're getting. That's always the dance when you're standing up your own service. So hey, thank you for making it this far, I appreciate it. Even though this was very high level, I hope you've been gaining some insight into how you might think about setting up your own FOSS front-end monitoring solution. Before we land this plane and you get up and running with real user monitoring, so that you have granular control over these latencies in production, let's recall what you gain from going through the trouble of setting it all up. Number one, you get a deep analysis of real user session data.
We're not talking about using a browser developer tool like Lighthouse, where you run an audit and get a best estimate of how your app is doing based on generated figures run against current averages. It's not that. This is real. It's like having something like that, but everywhere, and that's what makes it so powerful: you know what's actually happening. So you get that deep analysis of real session data, you can assess every latency inherent to your services, isolate the bad actors, and mitigate them before your users even experience them. Number two is the ability to monitor your UX quality and performance factors, like initial load times and the core web vitals, and always have their averages on hand, so you're aware of how they're affecting your actual users' behavior, which always has an impact on your bottom line. Super important. And the third is that you retain the invaluable time- and resource-saving approach of setting alerts that fire when your UX performance averages cross a defined threshold, so you know exactly which latencies to focus on lowering on your front end to keep your users around. So before we go, let's recall everything that comes together to facilitate the performance of your app, and what calculating its performance budget might look like as you're setting this up. Don't start where my initial assumptions were; don't do that. Measure everything accurately at each impactful vector using the practice of real user monitoring, so that you get a real view into what people are actually experiencing when they use your app. And with this, you now have the power to help everyone get back a little more of their most precious resource, their time, which I'm now going to give back to you. Thank you very much. If you'd like to catch up with me and talk about anything, or just stay in touch, you can find me in these places: I'm on Twitter, my site, my blog, and my GitHub. Please feel free to reach out anytime. Thanks again, it's been wonderful to be with you here at SCALE; I'm so grateful to be at a Linux conference, this is awesome. And definitely don't miss the closing keynote, which is coming up next. Like I said, you're almost there. Vint Cerf is next, and he's going to be talking about the importance of open source to the internet's success, which is huge. He'll be sharing lessons learned and what he would approach differently if he were doing it again, and it's going to be rad. Thanks a lot, and we'll see you next time. I'll be around for a few minutes too if you have any questions. Kind of. So I think what you're saying is: is tracking latency the developer's responsibility? Can you restate that again real quick? Oh, gotcha. Yeah. I think people owning everything they possibly can within their own space, just as a general baseline approach, is extremely important, and this is one of those things. As a front-end developer, performance and latency should be a huge thing for you, so you should do everything you can to follow best practices.
Things that are becoming more commonplace now: if you're loading a million posts, you don't want to load them all at once, you want to do lazy loading and implement those kinds of practices; if you've got images that are too big, you want to make them web-ready before you ship your app. That kind of stuff is definitely on the front-end developer to track and maintain to make the experience as enjoyable as possible. There's a lot there, which I feel like I covered a little bit here but could have gone into more depth on. This really also ties into the end-to-end visibility side of things. As a front-end developer, kind of the magic of setting up a real user monitoring solution is that you can start to dive in and go, oh, it's a Python Flask HTTP call that's timing out for some reason, and that's what's making this piece of UI do nothing for three seconds, and that really sucks. It depends on what your responsibilities are as a developer: if you own the Python service, then obviously that's your responsibility too, but otherwise you reach the point where you go talk to whoever wrote it, or maybe you put in a PR. So there's a lot that's owned by the front-end developer to make the experience better, but it's also a whole-platform thing, right? Monitoring came out of the DevOps space, like you're saying, monitoring your cloud infrastructure and compute time and cold starts and all that, and now it's reaching into making sense of the front end too, in a very powerful way. It's a problem that's owned by everybody, and you get to focus on your part of it. That's kind of the way I see it. Was that helpful? Yeah. So what I've seen work as a baseline in the past is showing proof. If a front-end team adopts a real user monitoring solution and then shows that they're catching their perf issues before they even make it out to user land, that's really powerful, and that kind of story propagates to the other development groups within a company, because monitoring can be done at every level. If you get buy-in from your team and you're able to do that, showing how much it changed even the bottom line is also a really powerful way to get adoption within your company. People respond to numbers like that Pinterest example, where sign-ups went up something like 15 percent after perceived load times dropped by about 40 percent. That kind of stuff is extremely powerful, it makes the practice really valuable, and it gives you a good story to help other people adopt it. Like you're saying, you can't just tell people this is a good way to go, adopt it; if you can show it, I think that's the best way. Yeah, thank you, I appreciate those questions, that was wonderful. Any more questions? No? You're all tired and ready for the last talk? Yeah, I did. Well, thank you, I appreciate that, yeah.
I wanted to make it enjoyable, and I've been messing around with animation more over time, like the stuff at the beginning. Let me go back. This stuff, where's the one where it blows up? Yeah, this stuff is all done in p5.js, which is a really grab-and-go creative coding library with WebGL support. This is actually a bunch of rectangles spinning to create this sort of polygonal tube, and then I've just got this rasterized hand going around in a circle, so you can do that kind of stuff with p5. A while ago I started making my own little talk-slide framework on top of it to help me do that more easily. And then these other ones, this being SCALE, I wanted to see how much I could do without having to pay for things, so I did them with Procreate, which is an iPad tool. It's been really interesting because it makes animations really easy and really fun, and it's even gotten my kids started on making animations, which is pretty cool. So those are the two tools I used: p5.js for the more WebGL stuff, and Procreate for the rest. I'm glad you enjoyed it, and thanks for coming and hanging in there. You're almost there. Have a wonderful Sunday, everybody. See ya. Thank you.