Hi, thank you so much for joining us for OpenTelemetry Community Day 2020. This is my talk, Planning for Observability, where we're going to talk about some of the strategies, key decision points, and pre-planning that can help you create a more observable application. We've only got 20 minutes for this talk, so I'm going to jump right in.

My name is Noshnika; Nika is fine if you're having a little trouble pronouncing Noshnika. I am a developer advocate for New Relic. I specialize in cloud architectures, especially serverless architectures and other cloud-specific stacks, and I do other observability work as well; I've been working with New Relic and observability for a number of years. I tweet at serverless mom, so there's my serverless pedigree. I'm also on Twitch, where I mainly stream cloud engineering: building applications, occasionally making other fun stuff like coding drum machines, and other goofiness. It's a very dev-centric stream, if you want to come check it out.

Since we only have 20 minutes, I'll dive right in, and the first question is one I talk about quite often: what is observability, exactly? On the west coast of the US, observability is a really big buzzword, and a lot of people feel like it's just a marketing term, the way "action item" just sounds cooler than "task". Maybe observability just sounds cooler than logging, or cooler than metrics. Really, though, observability is a collective term that covers not a particular technology but a design goal. I see observability as the first half of MTTR, mean time to resolution: observability is how long it takes for you to understand a problem and know what you have to do next.

That definition has some odd corollaries. You can have problems that you solve very readily without great observability: if you're restarting a server every few weeks because you can see it's running out of memory, but you don't really know why, that's technically a problem with very poor observability that's relatively easy to work around. Obviously that's not how we want to live our lives. We want to look at a dashboard, a log, or a reported problem and very quickly go, ah, I know what I need to do next to solve that. When that time to understanding is very low, we have high observability; when it's taking us a long time to get to understanding, we have particularly low observability.

Here's a quote from a serverless developer, Patrick Steger, that really stuck with me. It's a bit of a wordy slide, but I'm going to read it out: "I can scale serverless as big as you like; a random 500 error that I can't diagnose, that's what gives me hives." This speaks to a theme I see as I talk to people running more advanced, more complex stacks, often with very high availability: before they face problems with the complexity of their stack, and way before they face big cost issues, observability (sometimes specifically the debugging of new stacks) comes up as a much bigger problem, way earlier in the process.
And that's kind of the nature of the beast: some of the scaling problems and config issues we dealt with previously are now much more automated, so the big question becomes, great, we can stand up these big, complex applications really quickly, but it can be hard to get insight into what's really going on.

One of the things about defining the problem this way is that other stuff turns out to have an effect on observability. The speed with which a dashboard can be read has a direct effect on observability, and that's a really key observation. The slide here shows a nice Prometheus and Grafana dashboard; I believe the one I grabbed the screenshot from is actually using OpenTelemetry instrumentation to generate its data. Take a look at it: even at this tiny size, I think you can see that a pattern is readily observable, and that's really a UI issue, a graphical issue. It's not about what data underlies that dashboard. Obviously we need the data to be reliable, but producing more data isn't necessarily helpful.

That's really key, because people often say, well, I want to add observability, or reliability, or ease of understanding, whatever term gets used. And I ask questions like: well, what if you added a bunch of logging? Say you went in and, on every other line of your code, or every single line, you added a log line recording what was going on. While that would increase the amount of information we have, it would not increase observability; in fact, it would hurt observability, because when we went to look at the logs we'd have all this junk and noise hiding whatever useful information might be in there. Maybe nothing useful is in there right now, but even so, we would have worse observability than when we started, because we'd spend so much time digging through it every time we wanted some information.

So let's think a little, using Patrick Steger's example of a random 500 error that we can't diagnose, about how we might chop up that 500 error and the pieces of information we'd need. This can be very helpful for deciding what we need to instrument or measure to figure out what's going on. None of these use the cool stuff: no distributed tracing, no logs in context, none of the other OpenTelemetry concepts. This is just some very simple information.

We want to see error rates on each component, because we want to know: hey, is the gateway throwing a bunch of errors? Is the web application layer? Are we seeing them in the database? Where are errors being reported? We'd also like to see throughput; I say "on the back end" here, but I mean on the rearmost component, the component that's furthest to the right on our little diagram. If we're having problems and we see that throughput on, say, the database has dropped off a cliff, then we start to have a pretty good theory of at least what the nature of the problem is.
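To make those first two signals a little more concrete, here's a minimal sketch using the OpenTelemetry Python metrics API. The component names, metric names, and attributes are all made up for illustration, and it assumes a MeterProvider and exporter have been configured elsewhere in the application (without one, the calls are harmless no-ops). This is one way to do it, not the only one.

```python
# Per-component throughput and error counters, recorded with OpenTelemetry metrics.
# Assumes a MeterProvider/exporter is configured elsewhere in the application.
from opentelemetry import metrics

meter = metrics.get_meter("planning-for-observability-demo")

requests = meter.create_counter(
    "app.requests", description="Throughput, counted per component"
)
errors = meter.create_counter(
    "app.errors", description="Errors, counted per component"
)

def record_request(component: str, status_code: int) -> None:
    """Record one handled request for a component, plus an error if it was a 5xx."""
    attrs = {"component": component}
    requests.add(1, attrs)
    if status_code >= 500:
        errors.add(1, {**attrs, "status_code": status_code})

# For example: the gateway and web app were fine, the database layer threw the 500.
record_request("api-gateway", 200)
record_request("web-app", 200)
record_request("database", 503)
```

With just those two counters, broken down by component, you can already answer the "where are the errors being reported, and has throughput fallen off a cliff anywhere" questions from the scenario above.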
And when the database throughput has dropped off like that, looking into the configuration of the database server is probably not going to be a waste of our time, because the source of the problem is almost certainly right upstream from there.

Then it would be really nice to look at changes to our permissions and configuration, and we're going to talk about infrastructure as code and some of the other steps that can really improve observability here, because those changes really are often the source of a problem. Again, my comfort zone is these cloud architectures, and this is a pretty frequent problem: the configuration of what's allowed to talk to what. Obviously we need those configs, but when they get messed up, suddenly our stacks are failing. The last piece is deployment history. Even before we have any kind of deep observation tools, these are all things that, if we have them, make our observability a lot higher for that random 500 error from our endpoint.

So let's talk about picking the right tools. Now, of course, it's OpenTelemetry Community Day, so we're not going to be talking about picking the right measurement tools, but about the other tools around them that can help us get information, and that can be chosen strategically or tactically for our team.

First, infrastructure as code. I mentioned earlier that we want to see a history, preferably, of what changes have been made to our configuration. One of the questions that comes up the most when we start talking about infrastructure as code is which tool to use. I'll see questions like: can we use Terraform? Can we use Stagry? It's nice to use Terraform because it's multi-cloud; maybe that's the better solution. Should we use the Serverless Framework to deploy our serverless environment? Should we use a vendor-specific tool like the Cloud Development Kit from AWS? The real theme I see is that I don't meet a lot of people regretting that they picked a particular infrastructure-as-code tool for deploying changes to their infrastructure; I see people glad that they did it at all, as opposed to not doing it.

So if you're storing a bunch of config in a readme somewhere that gets bounced around your team, and you're writing in that readme file, maybe going back and bolding things to say, hey, be sure you make this setting when you set up a new instance, that means you've drifted away from infrastructure as code, and you're probably going to face problems when you're not available, or when there are team changes and someone else sees problems with this infrastructure.

Infrastructure as code is a key piece of observability because it lets us go and get clues about what's actually going on and what's changed recently, and try to make those correlations in time. What we get from infrastructure as code is a commit history: clear communication across the team of what's changed and by whom. And this isn't about assigning blame at all; it's about knowing who to talk to, or knowing what the possible motivations for changes might have been.
And the last thing we get is well-documented requirements, and you'd be surprised how often this comes up. By documenting, for example, that the memory needs of certain instances have changed, we're also documenting that something else about our requirements may have changed, rather than going in ad hoc and bumping up memory to get stuff to actually perform. We get that documented because all of our infrastructure is configured in one place.

So let's talk about structured logging. The diagram up here is my own attempt to render some Penrose tiles, and I've actually made a couple of mistakes, but I have it in here because it's one of the most beautiful examples of structure with variation: Penrose tiles appear to be periodic but have slight changes that never repeat. It has that structure, and I like that a lot. Also, Roger Penrose won the Nobel Prize in Physics just a couple of weeks ago, so I'm throwing in a little plug. But that's a total aside.

Again, this is not a tool-specific recommendation; I'm not saying use this one logging library, and of course that's going to vary with what language you're in, for starters. But just the concept of structured logging is such a key piece of getting real information about what's actually going on inside our stack. Structured logging is something you can do completely on your own, where you say: hey, instead of a simple line of text, I'm going to log an object with a few key-value pairs in it. Maybe you add a key-value pair as needed, maybe you update the ones you already have, just because it's that much easier to parse through this history later, build interactive search tools, and maybe even make a nice visual dashboard based on what's being put in your logs.

I also really want to mention a key piece, and an OpenTelemetry-specific one: the ability to contextualize your logs. Being able to contextualize logs with metrics and with traces is a key feature, and something we pursue at New Relic as well, because along with having some kind of structure that lets you query your logs, being able to connect a trace to a log is an absolute game changer for your ability to actually get insight. These problems are usually most obvious when they happen as part of an incident or an outage, when everybody's flipping out, it's the weekend, we're trying to watch our daughter's softball game and instead we have to be supermom doing four tasks at once. Sure, in those moments we want better insight, but logs in context is also a really powerful way to start engaging a little bit of curiosity and fixing things before we have a big crisis, while everything's still going fine. We can start to see opportunities and the sources of problems earlier.
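As a rough illustration of both ideas, structured logging plus trace context, here's a minimal sketch in Python. It isn't any particular logging library's prescribed format: the field names and the log_event helper are my own, and the trace and span IDs will only be meaningful if an OpenTelemetry tracer is actually configured and a span is active when the log is written.

```python
# One JSON object per log line, enriched with the current OpenTelemetry trace context.
import json
import logging
import time

from opentelemetry import trace

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")

def log_event(message: str, **fields) -> None:
    """Emit one structured log line tied to the current trace, if any."""
    ctx = trace.get_current_span().get_span_context()
    record = {
        "timestamp": time.time(),
        "message": message,
        # All-zero IDs just mean there is no active span; with a tracer configured,
        # these tie the log line to the exact request that produced it.
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        **fields,  # the structured part: key-value pairs instead of free text
    }
    logger.info(json.dumps(record))

log_event("payment declined", order_id="A-1234", gateway="example-pay", retryable=True)
```

The exact schema matters less than agreeing on one as a team, so that later you can query on order_id or trace_id instead of grepping free text.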
Okay, let's keep going; sorry, the slides are a little out of order. Let's talk about alerting. Alerting is absolutely a key piece of your observability structure. It's the first warning we have that something's gone wrong, and sure, maybe we start looking at something not because of an alert but because of a customer complaint or an internal report, but very often the nature of our alerts is a powerful first clue about what we're looking for when we go and dive into our logs or dashboards or what have you.

One key point I want to mention is to think about low as well as high alert thresholds. One of the examples from my 500-error scenario was, hey, let's look at throughput on the database, and that's a really good use of a low threshold: we have a problem because we have no traffic going through here. My little line on this is that a 10x growth in page views is as much of a concern as page views going to zero. If we suddenly see that a section of our site, or a particular set of API routes, has stopped being used at all, it's worth throwing an alert there. Now, maybe that's predictable for various reasons, or it's expected because of low traffic, so it is important to check in and make sure your alerts make sense. But especially because OpenTelemetry opens up the route to a ton more metrics and a ton more things to alert off of, thinking about low thresholds is a really key piece. I'd recommend that if you have a metric and you're alerting off of it, you think through whether you want a low alert threshold on it as well, because there's very often a reason to do so.
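To show what pairing the two thresholds might look like, here's a minimal sketch. It isn't tied to any real alerting product; the Threshold class, the metric name, and the numbers are all made up for illustration.

```python
# A toy evaluation of one metric against both a low and a high threshold.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    metric: str
    low: float   # alert when the value falls below this (traffic falling off a cliff)
    high: float  # alert when the value rises above this (spikes, runaway growth)

def evaluate(t: Threshold, value: float) -> Optional[str]:
    """Return an alert message if the value breaches either side, else None."""
    if value < t.low:
        return f"{t.metric} unusually LOW: {value} < {t.low}"
    if value > t.high:
        return f"{t.metric} unusually HIGH: {value} > {t.high}"
    return None

# Database throughput: zero traffic is as alarming as a 10x spike.
db_throughput = Threshold(metric="db.queries_per_minute", low=50, high=20_000)

for observed in (12_000, 3, 45_000):
    alert = evaluate(db_throughput, observed)
    if alert:
        print(alert)  # in real life: notify your on-call / open an incident
```

The point is the habit, not the code: whenever you define a "too high" condition, ask whether "suspiciously low" deserves an alert on the same metric.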
So, let's talk about creating observability with the abilities of your team. This is partly a joke, but it's still applicable: we need a DevOps ninja, they need five years of serverless experience and ten years of Kubernetes experience, and they've got to move to Greenland to do the job. Okay, I would move to Greenland at the drop of a hat, that sounds cool. But beyond that, it's an ad you see all the time, asking for more years of experience in a framework than the framework has existed. And even beyond that, we all know that building these teams is really, really tough. When we talk about adopting a new framework or a new architecture, the question of who's really going to do that, and how we're going to hire those people, is tough.

So one of the takeaways I'd love you to get from this is that a great team with great observability is really something that you grow, not something that you go out and hire. Why am I talking about the people side of this at all? Remember, observability is just how fast you can understand a problem based on the information you have. So when some of your senior engineers leave for greener pastures, your observability just dropped significantly: the information you store is the same, but the human ability to look at that information and understand it just dropped. Growing a highly skilled team, and especially a team where everybody has a good knowledge base, is super key for observability.

The three pieces, and this is not my comparison, it comes from Sheen Brasov, who first proposed it and I really liked it, are these. First, you find the good seeds: people who are new and very eager to fully understand their architecture and become as full-stack as possible, or people who have a ton of experience but love applying it in new environments. Second, you give that team the soil it needs to really develop its skills in a complete way. That can be everything from online learning resources, your A Cloud Guru memberships, to making sure people are connected to events and pursuing the goals they need. The last one is sunlight: honest communication inside the team so that there's good knowledge sharing. If you can build the trust within your team to share lessons learned and share information, you can get such big things done.

And this really matters because not only is it going to grow the team you need on these new platforms, it's also going to keep the people you do not want to lose. This example started with what happens when we lose some senior engineers, and more than anything, when I talk to people with real experience about why they leave, a lot of what they say is: I wasn't growing there; my knowledge wasn't growing, I was doing stuff I'd done before. If you give them those chances to grow, that's really going to build your ability to retain good people.

So thank you once again so much for coming out. My name again is Noshnika. Hit me up on Twitter if you would like to ask any follow-up questions or give me feedback on this talk. I really appreciate your time, and thank you so much for coming out.