Let's get started. I'm Isobel Redelmeier. I'm at LightStep, and today I'm going to talk to you all about observability in Cloud Foundry. While this talk uses Cloud Foundry as its example, a lot of the content is applicable across basically any system. To that end, what I'm hoping your takeaway will be is this: whatever your goal, if you want to improve things for your end users, the people actually using whatever you're working on, great observability will help you realize it.

So I'm going to start with a quick hello, and then delve into what makes a good objective. We all have different goals, ideally set based on who's actually interacting with what we're working on; my goals and your goals are going to be different because we have different end users. Then I'll delve into how tracing can help make those goals easier to meet. After that, I'll explain what's next: both what you can do next, what you can hopefully take away from this talk, and what my colleagues in the open source tracing community (Austin here is one of my open source coworkers) are working on, so you know what to expect in the future. Lastly, I'll do some Q&A, time allowing. Given that, please hold your questions until the end unless something is really confusing, in which case speak up ASAP.

Like I said, these days I'm at LightStep. Before that, though, I was at Pivotal, so I used to work on Cloud Foundry; I was also on Labs for a while. A lot of what drew me to performance and observability was my experience on Cloud Foundry. I mostly worked on CredHub and then Perm, which was meant to be the future authorization server. CredHub was new; it was greenfield. We initially had certain performance expectations, those changed, and suddenly we realized we had much stronger performance requirements, so we had to backfill a lot of performance testing. With Perm, we knew right from the get-go that we were going to be in the runtime path of every service, and therefore had to be extremely available and very fast. So we had to do a lot of performance testing, and it was not as easy as one might hope.

In case you're not familiar with tracing, here's a spoiler of what a trace looks like. At the top, you have the request that the user is actually waiting on; below that, it's broken down by what requests are happening over the network. You can add arbitrary units of work called spans. A span per ingress and egress is the common starting point, but you can also add spans for database queries, for heavy business logic that might be slow, for caching. A lot of those do have to do with ingress and egress, but they can be different things. And while the pictures are from LightStep, the trace view itself is pretty common across all tracing systems; as I'll mention later, the differentiator between LightStep and Jaeger and Zipkin is not in this side of things.

So who's this talk for? Who are tracing and observability actually meant to help? One of the first audiences, fairly obviously, is developers. I know many of you are engineers, and you have spent a lot of time, like myself, debugging performance as well as availability.
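For the developers in the room, here's roughly what adding one of those spans looks like in practice. This is a minimal sketch assuming the OpenTracing Go API (github.com/opentracing/opentracing-go); the listApps function and its query are hypothetical, not from any real Cloud Foundry component.

```go
package main

import (
	"context"
	"database/sql"

	opentracing "github.com/opentracing/opentracing-go"
)

// listApps wraps a database query in its own span so that it shows up
// as a distinct row in the trace view, nested under whatever span is
// already in ctx (e.g. one created by ingress middleware).
func listApps(ctx context.Context, db *sql.DB) (*sql.Rows, error) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "db.query.apps")
	defer span.Finish()

	// Tags are cheap; record enough to make the span useful later.
	span.SetTag("db.statement", "SELECT guid, name FROM apps")

	return db.QueryContext(ctx, "SELECT guid, name FROM apps")
}
```

With middleware at the ingress creating the parent span, a small helper like this is all it takes for database time to show up on the trace.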
Especially if you have an existing system, it's hard to know what your availability even is if you haven't built in the right observability from the get-go.

Also, operators. These days, developers and operators may or may not be the same person, but in any case, operators are often working in an environment where they don't have perfect context into all of the code that's been written. And frankly, a lot of developers don't either. Sooner or later, we all inherit code bases that someone else worked on, and that someone else may actually be ourselves. It's pretty standard to forget code you wrote a year ago or three years ago.

It's also for Cloud Foundry maintainers. The examples here are from Cloud Foundry, so I'm really hoping a lot of what I talk about will be applicable to Cloud Foundry in particular. But for non-maintainers, the core content should still be relevant to you.

It's also meant for non-engineers. While performance and availability are first-class engineering concerns, at the end of the day they are the things that most affect end users, and that's something everyone on the engineering, product, and design sides (hopefully also marketing and sales, to some extent) should be involved in and invested in. And especially if you're in a position where you're figuring out how to allocate time across projects, tracing is really great for letting you know where to spend your effort.

So mostly, this talk is meant for you, whoever you might be, whatever brings you here. It's also meant for the people using your products. Because at the end of the day, we're not improving performance and availability in a vacuum. We're doing it because the people interacting with the code we write want consistent, successful requests. They don't want errors; we've all seen enough error codes in our lifetimes. And they also want things to be fast.

So let's start with a question: what does "broken" actually mean in a software system these days? Basically, 100% availability or light-speed performance aren't realistic. 100% availability is actually an anti-goal; other people have talked about that in greater depth than I'm going into today, but basically, if your expectations are set unrealistically, then when things fail (and it's a question of when, not if), your dependencies aren't going to be able to handle those failures correctly. A few weeks ago, for example, Facebook had a cascading error that caused, I think, around 14 hours of downtime across Facebook, Instagram, and WhatsApp. Google had a big Spanner outage a couple of weeks ago, too; maybe the same week as Facebook's. These things happen to all of us. It's a question of making sure that when things break, you know how to deal with them.

Part of figuring out what "broken" means is figuring out what "working" means. You need to know what good goals for your users are. Do they need very, very fast performance? A lot of things do. For some things, though, the delta between one minute and one minute and three seconds may not be that significant. And for something that's mostly operated by CI, for example, or where humans aren't interacting with it as closely, maybe you can get away with things being a bit slower.
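To put rough numbers on what these targets mean day to day, here's a quick sketch (plain Go; the targets are arbitrary examples) of how much downtime different availability levels actually leave you in a 30-day month:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	month := 30 * 24 * time.Hour
	for _, target := range []float64{0.99, 0.999, 0.9999, 0.99999} {
		// The downtime budget is whatever fraction of the month
		// the target does NOT promise to be up.
		budget := time.Duration(float64(month) * (1 - target))
		fmt.Printf("%6.3f%% → %v/month of downtime budget\n",
			target*100, budget.Round(time.Second))
	}
}
```

Five nines leaves you about 26 seconds a month, which is not even enough time to get paged. That's exactly why promising too many nines backfires.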
You want to set goals that you can actually achieve, and achieve consistently. Same thing with availability: don't promise too many nines, or you won't have enough time to actually handle outages.

There are some common industry terms for discussing these sorts of goals. An SLA, or service level agreement, is basically an expectation set with customers, often actually written into a contract. It's something engineers usually aren't thinking about too directly but always need to have in the back of our minds; product people generally translate SLAs into more day-to-day concerns. You might have different SLAs per customer, for example. At LightStep, we have different requirements for how much trace data we need to be able to ingest per day, depending on the customer. For Cloud Foundry, you could pretty easily imagine different requirements based on IaaS constraints, or the multi-region availability a customer needs, or how large a foundation they need, anything like that.

Then there are service level objectives, or SLOs. These are what teams generally interact with the most. You want them to be things you can measure in a very boolean way: is this passing today, or is it failing? Something like "99% of requests should succeed in less than one second," for example. That's not an uncommon shape for an SLO.

Then there are service level indicators, or SLIs. This is how you actually know what is going on. You can't tell that you're meeting your "99% of requests succeed in less than one second" objective if you aren't able to measure how many requests are succeeding or how quickly they're succeeding. Essentially, you want as many SLIs as it takes to accurately measure the SLOs you actually need, but not so many that you're hunting through the noise. They're easy enough to remove later; basically, you want to be able to add and remove them as necessary. But again, you want your SLIs shaped by your actual end user requirements.

If you want to know more about SLAs, SLOs, SLIs, error budgets, all sorts of fun things like that: Google has a now-famous book on this, the Site Reliability Engineering book. It's free online. It's a large book; Dan Luu also has some really great cliff notes on it that I personally tend to refer to more frequently, shall we say. They're really good notes. Both are great references.

So that's all fine and well. It's especially fine and well if you have everything instrumented well from the get-go. Maybe you're Google or something, and you have a perfect SRE culture where every team is very aware of observability, all of these things. But that's not the reality for most of us. Instead, we're often backfilling SLOs into our existing systems. So how do you go about doing that? How do you know which SLOs to add?

There are some challenges here. For example, operators might not have perfect context. In fact, it's generally safe to assume that they don't, and especially that whatever context they have at noon on a Wednesday, they're not going to have at 2 AM during an outage. During an incident the other day, as Austin can attest, I forgot how to reboot my laptop. It's OK to forget normal things; we all forget normal things during incidents. Also, every service is limited by its weakest dependency.
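Here's a back-of-the-envelope sketch of that point. The numbers are made up, and it assumes failures are independent and that every request touches every dependency:

```go
package main

import "fmt"

func main() {
	// Hypothetical availabilities: the service itself is five nines,
	// but it calls two less-available dependencies on every request.
	deps := []float64{
		0.99999, // my-service on its own
		0.999,   // auth server
		0.99,    // database: the weakest link
	}

	// If all calls must succeed, the availabilities multiply.
	combined := 1.0
	for _, a := range deps {
		combined *= a
	}
	fmt.Printf("combined availability: %.5f\n", combined)
	// Prints ~0.98900: barely two nines, despite the five-nines service.
}
```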
You can't have five nines in one service if it has a runtime dependency, on every request, on something that only has two nines. It just isn't mathematically feasible. Similarly, you can't be any faster than the sum of your requests; well, not necessarily the sum, since you can parallelize them, but if you're waiting on requests, you are at best as fast as the slowest request you're waiting on.

Also, your time and resources are going to be limited. They always are. We're always working under multiple constraints: we want to ship more features, we want to pay down tech debt, and we only have so much time in the day to address these things. So a lot of this is about figuring out how to make the performance improvement process as efficient as possible.

So again: which SLOs do we backfill? The first thing to remember is that the humans interacting with your systems aren't experiencing things in terms of individual services. They're not thinking about what's going on in the Diego rep or whatever; they're thinking, "I ran cf push." It doesn't matter to them how well your individual microservices are performing. What matters to them is that the thing they're interacting with at the surface layer is as fast and as available as their expectations and needs require.

As part of that, if you're backfilling, it's really helpful to start by focusing on the outer layer of the critical path. For this talk, what I went with was the cf CLI. Similarly, it's often helpful to start with a web app, or whatever your end users are interacting with. If you're backfilling things across a 200-person org, say, and you start by placing a mandate on every individual team to operate with SLOs, it's really easy for all of those teams to spend time optimizing themselves when there may be lower-hanging fruit out there, fruit that you could have handled (and saved yourself some time) if you'd started at the surface layer.

So the first step is to figure out what the outer layer is for your first use case. You may have multiple kinds of users: some are using the cf CLI, some are using the CAPI directly, for example. It depends on your use case. Then, ideally, start with something that's easy enough to trace off the bat, maybe a request you know doesn't touch many services, just to have a quick proof of concept that things are working. Then get to the meat: what's most valuable to your users? Focus on what's most valuable to your customers rather than on the endpoint someone uses once a year. You can get to that if you have enough time left at the end of the day, but by the time you have that time, you're probably going to want to focus on other work. It's best to spend the time you're putting into performance and availability on the things most critical to your users.

So what actually is observability? You're probably all familiar with logs. Who here has ever spent time grepping through logs, trying to stitch things together across services? Yep, a lot of hands, right? Logs are great at showing you a ton of fine detail. They're not always great at showing you how that detail fits into the whole. And you have so many logs; they get very spammy. It's not necessarily interesting to see every request, or the same failure returning the same stack trace over and over and over again.
Not only is it uninteresting to see the same thing so many times, it can also cause cascading problems. There was an example in a talk yesterday of how a 404-ing service caused the Gorouter to spend basically all of its time writing logs, which in turn basically brought down the Gorouter. So having that many logs can create other problems. It's also hard to keep all of them. For example, when I was on CF, we'd sometimes see an anomaly on a Saturday or something. Not something that required us to be paged, just something that, when we came in on Monday, made us go, "hey, what was that all about?" But by Monday, we wouldn't still have the logs, because there were so many logs that we had to truncate some.

Then there are metrics. Metrics are generally the next step. They can give you an indicator that something is as healthy as you expected, or that something is not. Am I seeing more errors than usual? Has my successful request count dropped? Anything like that. We've all seen dashboards, probably. There are many, many metrics out there, and they're very useful. But again, they're not great at telling you the whole story. They give you tools you can use to look further, but they don't tell you at a glance that the cause of your 500 (or other unhappy response) is that your downstream dependency's downstream dependency's downstream dependency was out.

So, tracing. Tracing is great for giving you a kind of telescopic view, where you can see the whole request from the outermost layer. It can still capture things like logs, and some metrics, so you can still dive in as necessary, but you start with the bigger picture. It's great for making you into detectives, but effective detectives; I don't know if anyone watches Brooklyn Nine-Nine, but not that kind.

Again, the pictures are from LightStep, but the actual trace view is generally not the differentiator between tracers. Jaeger and Zipkin are two big open source tracing backends: Jaeger came out of Uber and is now a CNCF project, and Zipkin actually is sponsored by Pivotal. There are also OpenTracing and OpenCensus. These operate at the instrumentation level and are what Austin (my coworker sitting up here) and I focus on in our day jobs. The idea with OpenTracing and OpenCensus is that you want everything in your application layer to be vendor-neutral, open source, and easy to add, essentially.

Effectively, this is about instrumenting your code, and generally speaking, the tracing happens by adding middleware: to your HTTP servers or clients, your gRPC, your database clients, anything like that. Generally there's middleware already available for the things you want, and if there isn't, a lot of middleware is something like 50 lines of code. (I'll show a sketch of what that can look like in a moment.)

The examples might be a bit hard to see at the back, but this is a trace of cf apps. Hopefully you can see enough from the back that the whole request took about 200 milliseconds. This is from the CLI's perspective, so it's what the end user actually experiences when they type cf apps and hit enter. The majority of that time was spent in CAPI, which took about 180 milliseconds. There's some Gorouter stuff going on too, but that added only about 20 milliseconds to this request, for example.
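Since I said middleware is often only about 50 lines: here's a hedged sketch of what ingress middleware can look like, using the OpenTracing Go API. Real integrations (for example, those under github.com/opentracing-contrib) are more thorough, but the shape is roughly this, and it really is only a few dozen lines:

```go
package middleware

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// Trace wraps an http.Handler so that every inbound request gets a span.
func Trace(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tracer := opentracing.GlobalTracer()

		// Join the caller's trace if it sent trace headers;
		// otherwise this starts a brand-new trace.
		parent, _ := tracer.Extract(
			opentracing.HTTPHeaders,
			opentracing.HTTPHeadersCarrier(r.Header))

		span := tracer.StartSpan(
			"http.server "+r.URL.Path,
			ext.RPCServerOption(parent))
		defer span.Finish()

		ext.HTTPMethod.Set(span, r.Method)
		ext.HTTPUrl.Set(span, r.URL.String())

		// Put the span in the request context so handlers can
		// create child spans (database queries, egress calls, ...).
		next.ServeHTTP(w, r.WithContext(
			opentracing.ContextWithSpan(r.Context(), span)))
	})
}
```

You'd wire it in with something like http.ListenAndServe(":8080", Trace(mux)), plus the matching header injection on the client side for your egress calls.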
So that's what a trace of a fairly straightforward request looks like. Now, this is another run of cf apps, but this time CAPI took 600 milliseconds and the Gorouter took eight and a half seconds. So something was going on here. On the right there, you can see some logs: you can add arbitrary logs, including things like stack traces, and they get attached to the individual span. You can also add tags; again, this is all on the OpenTracing and OpenCensus side. For the Gorouter, for example, it might be useful to add tags for how many active connections there are. You could also add things at boot time, like how many instances of the service you have. It really depends on your use case, but the idea is that if you start with something that gives you the big picture, you then have more time to spend incrementally adding what you now know would be valuable, right? And again, that request through the Gorouter took quite a while, though I was also spamming my CF at this point, so it's understandable that my BOSH Lite was not at its most performant.

This one failed: there's a 502 at the Gorouter. I don't have stack traces here, but they're easy to add. I'd never worked in the Gorouter, and I'm not that familiar with how proxies work, but you could absolutely add a stack trace, or more messages, whatever you want. It's easy to add at this point. But right off the bat, you get an error=true tag, and that is searchable. Again, that's on the open source side; all the different tracing backends support searching for errors. There's also a tag for the status code, for example, so afterwards you could go back and filter for all of your 502s, or all of your 5xxs, anything like that, and focus on them. (I'll show a sketch of setting these tags right after this example.)

Here's cf v3 apps. That thing at the top is basically a flame graph of what's going on. There are a lot of requests there; I had four or five apps in this particular instance, and that's a lot of requests for fetching information about four or five apps. Again, sorry for the people in the back who maybe can't see the numbers, but each individual request is pretty fast, around 70 milliseconds or so round trip. The sum added up to two seconds or so; there's some noise, but this particular run was two seconds, even though each individual request was fast. And for example, at the beginning, the CLI actually logged in twice. It already had a token, but it started by logging in twice. It turns out all the v3 commands do this. That's the sort of low-hanging fruit that starting where the end user is working can show you: improving UAA is one thing, but you can also just reduce your number of requests. You don't necessarily need to spend time improving your queries if you can eliminate them outright. Removing the two logins, for example, could probably cut about 100 milliseconds from most of the v3 commands. (I'm losing my voice.)

Here's cf push, just for fun. Most of that reverse-mountain shape, the valleys you see at the top, is because the CLI is waiting and polling CAPI. Logically, it makes sense that there isn't much going on, but at a glance, you can see what's happening there. And I've shown some logs already.
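Here's that promised sketch of how tags and logs end up on a span, again against the OpenTracing Go API; the markFailure helper and its name are hypothetical:

```go
package middleware

import (
	"context"
	"runtime/debug"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	otlog "github.com/opentracing/opentracing-go/log"
)

// markFailure tags the active span so the failure is searchable later
// and attaches a stack trace as a span log.
func markFailure(ctx context.Context, status int) {
	span := opentracing.SpanFromContext(ctx)
	if span == nil || status < 500 {
		return
	}
	ext.Error.Set(span, true)                    // searchable as error=true
	ext.HTTPStatusCode.Set(span, uint16(status)) // filter for 502s / 5xxs later
	span.LogFields(
		otlog.String("event", "error"),
		otlog.String("stack", string(debug.Stack())),
	)
}
```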
You also don't lose metrics with tracing; instead, you can backfill your metrics from your traces.

So how do you add this sort of thing? Again, it's all by adding OpenTracing or OpenCensus. We're actually merging the two; there was a big announcement about that last week. Most of that work is about having vendor-neutral APIs. Between OpenTracing and OpenCensus, that's LightStep, Google, Microsoft, Datadog, New Relic, Elastic, many, many vendors involved. We have regular meetings together; these are initiatives we're all heavily involved in, working really closely together. But mostly it's adding middleware, and to start with, adding the middleware to your ingress and egress points. Again, if you only have so much time, start where the users are seeing things, not deepest in the stack. Otherwise you might find yourself asking how to cut 10 milliseconds from an individual request that takes maybe 100 milliseconds, when it turns out you could just parallelize more of your requests from the caller, or reduce the number of requests outright.

So, what's next on the open source side? (And I'm running out of time, I believe.) Like I said, OpenTracing and OpenCensus are merging, which is pretty exciting. It means that much more of the instrumentation will be available across both. There's also a W3C header coming soon, so the headers themselves will be standardized and you'll be able to stitch different vendors' traces together. You might hypothetically combine a Jaeger trace that goes through to your IaaS, for example, and get the Google Stackdriver data in there as well. Also, trace data: that one's a big concern of mine, and it's what I'm really hoping to focus on for the next little while. The idea is to have a vendor-neutral API; sorry, we already have the API, a vendor-neutral wire protocol, so that you're not having to recompile per vendor. This is the sort of thing I think would be really useful for big open source projects in particular: things like Cloud Foundry, Kubernetes, or data systems like Redis or Kafka, or one day maybe Postgres or MySQL, though I think that's a little further off. Things like that, where you don't want to be recompiling again.

Thank you all for listening. Any questions?