Well, good afternoon, everyone. Thanks for coming. Today we'll be talking about observing failure in reactive systems. And I hope you're all hitting that kind of afternoon slump, because the perspective I really like to take when looking at reactive systems and failures is that two-in-the-morning phone call. We talked about that marathon of calls earlier. You get that phone call, something happened, you're all groggy, you're kind of sleepy. So hopefully you're in that state, and hopefully we can guide you into how to easily manage that and provide observability in your systems.

I'm Jonathan. I've been at Yopworks for a little over a year, and I lead the managed services team for the clients that we support.

I'm Clemens. I work as a solution architect at Yopworks, and I worked as team lead and later on also as architect on some of the projects that inspired this talk.

Today we'll go through some background on the use cases, then some use-case-driven observability, talking through the points where you'd like to have observability and where observability lies in the architecture of the system. And then we'll talk about the real-life impacts of a lack of observability.

Great. I would like to start off with a little bit of background. We've all heard the saying that in theory there is no difference between theory and practice, but in practice there is. And coming from this: even if you do everything right, even if you do everything just like we heard today in the previous talks from James and Jonas, you do your design and implement everything right, once you hit production, things will go wrong. That is what this talk is about, the lessons we learned from going into production, because hindsight is always 20/20.

What we'll be using as an example is a system from the logistics space, which processes around 100 million events per day. That's of course not a huge number, but it's big enough to cause trouble when things go wrong, especially if every single one of those events is meaningful from a business perspective. The system, which is still being developed, was built in an agile fashion. It was a greenfield project, and as such the requirements and priorities changed as the project progressed. We tried to go into production early. Things did go wrong, we learned, we adjusted. But quite often, features were prioritized over the operational what-ifs. And this is really what this talk is about: how do we prepare for this, how do we put things in place to realize when things go wrong, to detect it early, and to be able to pinpoint where the problem lies?

Maybe I'll go a little bit over the architecture, just to give context for the examples we will be presenting in a few minutes. The system in question is the big block here in the middle. Data overall flows from left to right. The system consumes data from upstream systems; those are mainly legacy systems that generate the data. Our system transforms, processes, and aggregates the information and then emits data out on the right to consumers. The producing systems on the left produce data in a variety of formats over various channels, be it MFT, be it MQ. In order to abstract from this, we introduced a layer of producer adapters, the second column from the left, which is responsible for standardizing the data somewhat.
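To make that a bit more concrete, here is a minimal sketch of what such a producer adapter boils down to; the topic name, field names, and record shape are made up for illustration and are not taken from the actual system:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class LegacyFileAdapter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Pretend this record was read from a legacy file drop (MFT) and
            // standardized into the shape the core expects.
            String standardized = "{\"entityRef\":\"ABC-123\",\"eventType\":\"SHIPMENT_UPDATED\"}";
            // Emit into the core over Kafka; "ingress-events" is an illustrative topic name.
            producer.send(new ProducerRecord<>("ingress-events", "ABC-123", standardized));
        }
    }
}
```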
The adapters then emit the standardized data into our core, where it first hits an enrichment layer. This is going to have some relevance later, because the data coming in does not necessarily carry a unique ID. Those IDs need to be disambiguated, and for that we had to introduce a special service to manage unique identifiers for all our entities. As the data progresses further to the right, that's indicated through these blue arrows, which represent message communication through Kafka topics, it passes through various services; we only represented a few here as examples. Then we have similar consumer adapters on the way out before we hit the final consumers of the data, which are your data warehouses and business-to-business integrations. Cassandra is our main data storage mechanism, the boxes on the bottom in yellow. We tried to represent that it plays a big role and that everything kind of relies on it. So that's the system in question. It has a lot of pieces, and it has a lot of places where things can go wrong. So Jonathan, talk a bit about how we monitor this in production, please.

Yeah, definitely. Where I want to start is with this Grafana dashboard. This is a dashboard I developed when I first started monitoring this platform and identified some pretty big gaps in our observability of the messages and where the failure points were. We've all seen Grafana dashboards; we've used them and they can be quite helpful. What's particular about this dashboard, and why it's important, is that it loads very rapidly and is very easy to understand, and we leverage that in our support efforts. It's not like some dashboards where you pull it up, walk off, get a cup of coffee, maybe go take a walk, and hopefully by the time you're back the dashboard has loaded. You want to consider the number of data points you graph within your dashboards so that they load quickly and are easily accessible to anyone who needs that information.

I also want to talk about some of the metrics we graph on this dashboard and why we find them helpful. The first two at the top are message lag per partition as well as message lag across the entire consumer group. The reason we use both is to understand where our Kafka topics stand when we have issues in those particular systems, and we compare partition lag against total lag to find out if there are any stuck partitions on any of the services.

That's a great point about meaningful metrics for the individual system. Another example that we found very helpful is around database connectivity and performance: we use outstanding database requests as a metric. Our database connectivity is effectively throttled, and you have multiple mechanisms to do that, be it a connection pool, be it permits to issue and execute database requests. The length of the queue of these outstanding database queries turned out to be a very good indicator of overall database performance. It gives you an at-a-glance indication of how healthy the database is. We found the metrics from the database provider, it's a managed service, sometimes a little less helpful, because they're not from our system's perspective. We needed metrics from the inside out, reflecting how the database affects our system, not the provider's view, which often looks quite green even when we can see from inside our system that things are not really working well.
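As a rough sketch of what such an inside-out metric can look like, here is a permit-based throttle whose queue length is exposed as a gauge. This assumes Micrometer for the metrics and simplifies away the real asynchronous driver calls, so the names and structure are illustrative:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative permit-based throttle around database calls: the number of
// callers queued behind the permits is what gets graphed as "outstanding requests".
public class ThrottledDbClient {
    private final Semaphore permits;

    public ThrottledDbClient(MeterRegistry registry, int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
        // Gauge the queue length: how many requests are waiting for a permit right now.
        Gauge.builder("db.requests.outstanding", permits, Semaphore::getQueueLength)
             .register(registry);
    }

    public <T> T execute(Supplier<T> dbCall) throws InterruptedException {
        permits.acquire();           // blocks while the database can't keep up
        try {
            return dbCall.get();     // the actual driver call goes here
        } finally {
            permits.release();
        }
    }
}
```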
The next metric I want to talk about is networking graphs. These are really straightforward and easy to understand: if you see a significant drop in network throughput, there's a problem somewhere. The idea is that you can pinpoint it, again, to the specific service or the specific endpoint that you depend on within your architecture. Moving on to some infrastructure metrics, we leverage CPU and memory graphs to identify any spikes in utilization that may be uncommon, and also some JVM graphs, which Clemens will talk to.

Yeah, just one thing, since you mentioned memory: it's a great example of how valuable it is to look for meaningful metrics. A really good one for us was the garbage collection metrics, the frequency and duration of garbage collections. They helped us more than memory consumption to decide whether memory is the thing we should be looking at right now when we have a problem in the system, or whether the root cause might lie somewhere else.

The next graph, just going down the line, is error message throughput. In our systems we leverage error topics: if data is processed that's outside the normal bounds of what we expect, we push it to an error topic. This allows us to continue processing other valuable messages and not block our systems on erroneous messages that were unexpected or could not be processed. Monitoring those topics and their throughput can point out significant errors occurring on a specific service or within a specific topic. This is particularly useful after deployments: if you make a deployment after a significant logic change or a new business proposition was implemented and you notice the error topic throughput spiking, it's pretty indicative that the logic failed or that something unexpected is occurring.

Now we wanted to jump into a few concrete examples where things went wrong in our system, often in ways we hadn't anticipated, and how we addressed them with our monitoring and observability toolkit. As a first example, we wanted to talk about Cassandra, which, as I mentioned, is the main storage mechanism in our system; all persistence is handled by Cassandra. We did have a few connectivity issues with Cassandra throughout the project. It's a managed service, so it's hosted outside of our ecosystem, and as such there are multiple components involved before a request even hits Cassandra: different networking components and firewalls in between. So there are just a lot of places where something can go wrong with Cassandra connectivity.

To go through a more concrete example, as I mentioned, the incoming messages first hit our enrichment layer, which has to talk to the unique identifier service to obtain the unique ID for the entity being referenced in that message. And the unique identifier service itself depends on Cassandra for its persistence. If timeouts happen in the connectivity to Cassandra, we go into some retry loops first. Ultimately we're probably going to see timeouts in enrichment, which is waiting for an answer; again we have more retries, and ultimately circuit breakers pop.
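The guard rails around those persistence calls look roughly like this. The sketch uses resilience4j defaults as a stand-in and a synchronous lookup for brevity; the service and configuration names are illustrative, not the actual code:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import java.util.function.Supplier;

public class UniqueIdLookup {
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("uniqueIdPersistence");
    private final Retry retry = Retry.ofDefaults("uniqueIdPersistence");

    public String resolveUniqueId(String entityRef) {
        // Stand-in for the Cassandra-backed lookup; the driver has its own
        // request timeout configured, which is what usually fails first.
        Supplier<String> lookup = () -> queryCassandraForId(entityRef);

        Supplier<String> guarded = Decorators.ofSupplier(lookup)
                .withCircuitBreaker(breaker) // every attempt is recorded by the breaker
                .withRetry(retry)            // transient timeouts are retried a few times
                .decorate();

        // Once the breaker is open, this fails fast with CallNotPermittedException
        // instead of piling more load onto a struggling database.
        return guarded.get();
    }

    private String queryCassandraForId(String entityRef) {
        throw new UnsupportedOperationException("illustrative placeholder");
    }
}
```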
And what we had happening, especially in the beginning of the project, was that the error messages we got in that case were not very descriptive. We got error messages literally saying "circuit breaker during persistence open", and at first that did not give us enough hints as to where exactly along that path the error lay. John, do you want to continue with the observability angle?

Yeah. So we put together this diagram; it's kind of a zoomed-in architecture view. What I really found insightful as we put it together is that the red magnifying glasses are where our observability actually lies. What's really interesting about this is that you can see all of these systems that interact, but we don't actually have observability into every single one. We leverage four different points on this diagram. We look at the enrichment error message throughput, as mentioned before, which is that middle magnifying glass there. We also look at message lag: messages going between enrichment and the enriched-messages topic, or going into the unique identifier service. And the last magnifying glass, which is actually really interesting and kind of important as we move through this use case, is the throughput of messages to the consumer adapter. This will become important in just a minute.

Here on this graph you can see the lag graph we mentioned in that GIF of the support dashboard earlier. On the left-hand side we kept the architecture diagram with that magnifying glass showing where the observability for this specific graph lies. All the different colored lines represent the different services for which we're monitoring message lag. You can see we had a failure right at the beginning there on the orange line, and that failure continued for about 14 hours until that line drops significantly. As it drops, you can see all of the other topics significantly grow in message throughput. The lingering lines, the yellow, green, red, and orange on the right-hand side, are the consumer adapters. The reason those don't process as quickly is that they're going back into legacy systems, which just don't have the throughput that our services do.

On this next graph, what I wanted to highlight is error message monitoring and the dashboard we put together for it, very similar to the Grafana dashboard, but here we're actually looking at error messages, or log messages, occurring within the services. As we monitor the services in their error states, we're able to quickly identify where the failures are, especially in dependent systems that we rely on. Here you can see the two magnifying glasses on enrichment as well as on the unique identifier service, and you can see the timeout exceptions from the driver writing to Cassandra, as well as the circuit breaker errors on the bottom right-hand side. This is super valuable, because when we have a failure, we're not quite certain what's going on, and a lot of things can start alarming or throwing up red herrings so that you're not quite sure where the failure might be. But if you've identified that failure before, you can go look at this dashboard, quickly pinpoint the problem, and resolve it rather quickly.
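A minimal sketch of the kind of context we try to attach to those error logs, so that a single log line already names the dependency, the operation, and the affected entity; the class, field, and message shapes here are illustrative assumptions, not the actual code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class EnrichmentErrorLogging {
    private static final Logger log = LoggerFactory.getLogger(EnrichmentErrorLogging.class);

    void onPersistenceFailure(String correlationId, String entityRef, Exception cause) {
        // Put the correlation ID into the logging context so every line written while
        // handling this message can be tied together across services.
        MDC.put("correlationId", correlationId);
        try {
            // Name the dependency, the operation, and the entity -- a bare
            // "circuit breaker during persistence open" tells you very little.
            log.error("Unique-ID lookup failed for entity {} (dependency=cassandra, op=resolveId)",
                      entityRef, cause);
        } finally {
            MDC.remove("correlationId");
        }
    }
}
```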
I also wanted to bring up a quick tangent, since we mentioned timeouts as having been an important part of the previous error scenario. Timeouts are ultimately a necessary evil in asynchronous systems. Unlike a synchronous system, where you notice when a call fails, in an asynchronous system you issue a request and, well, you don't wait, but you expect an answer to come back at some point. What if that answer never comes back? You have to have some sort of timeout mechanism. The caveat with timeouts is that when they get triggered, they tell you that something went wrong, but they don't necessarily tell you what went wrong. The cause can be multiple steps removed in some call chain; it can even just be an aggregation of multiple slower requests that causes a timeout to trigger. And just to state the obvious, one of the lessons we learned is how valuable it is to have good tracing and correlation IDs in place, but also to test for timeout situations as part of the normal test suite. Don't just test the happy path; have test cases for what happens if a service doesn't respond. Does my waiting system behave correctly? Do I get meaningful log messages and metrics out of my entire stack, so that when I observe timeouts in one service, I can quickly pinpoint the actual root cause?

After the tangent, one more example of an error scenario; it's a really interesting one, one of these expect-the-unexpected scenarios. As I mentioned, we consume data from multiple upstream systems. Some of these systems send us the data via files, and then we have a producer adapter that's supposed to read the file, emit the contained records one by one into our system via Kafka, and once it's done, move on to the next file. What happened in this scenario is that the middleware had a small misconfiguration and never deleted the file once it had been ingested. So that same file got ingested again and again and again, at the maximum speed the producer adapter could process it, which led to a system load that was completely unanticipated. Only one topic was directly affected, you see this on the bolder blue arrows, and being a reactive system, the system held up really well. Some of the other topics got starved a little bit; we had a bit of resource contention happening. I must say, luckily we did not have auto-scaling in place at that time, because we probably would have run up quite a bill with this. We had a few timeouts happening, we had a few error messages, but overall things kept running more or less smoothly.

Again, we put together this zoomed-in view of where those observability points are, just as we did for the last example. We'll go through these in some of the graphs: we have message throughput on the ingress topic, throughput on the error topic, as well as throughput into enriched messages, and then we have the log messages between enrichment and the unique identifier service.

The first slide I want to present here is the error message topic throughput. At the top you can see the magnifying glass, and you can see clearly that this service had a significant throughput in error messages, which let us quickly identify which service was encountering the failure and helped us start pinpointing where the problem lay.
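For reference, the error-topic pattern behind that graph is roughly this; a simplified sketch where the topic name and the processing step are made up, and in practice you would typically also attach the failure reason to the parked message:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ErrorTopicRouter {
    private final KafkaProducer<String, String> producer;
    private final String errorTopic;

    ErrorTopicRouter(KafkaProducer<String, String> producer, String errorTopic) {
        this.producer = producer;
        this.errorTopic = errorTopic;
    }

    void handle(ConsumerRecord<String, String> record) {
        try {
            process(record.value());   // normal business processing
        } catch (Exception e) {
            // Park the bad message on the error topic and keep consuming,
            // instead of blocking the partition behind one unprocessable record.
            producer.send(new ProducerRecord<>(errorTopic, record.key(), record.value()));
        }
    }

    private void process(String value) { /* business logic lives here */ }
}
```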
And then here we have the lag on each service. On the blue line you can see that initial spike in lag, where we had that file just bursting through our service, and then you can see the residual lag on the second service once that was handled and the services were unblocked and able to process all the other events they were expected to process, with the two magnifying glasses highlighting where that observability lies.

Here you can see the GC graph, and I really wanted to highlight this. It wasn't a metric we were really leveraging in the early days, but we started gaining a lot of valuable insight into what the services were doing. We had a particular instance where some JVMs were spending so much time in garbage collection that they weren't actually able to deliver the business value they were designed for. So we started monitoring these pretty closely, and they really helped us identify where the failures were; there's a small sketch of how to expose those metrics a little further below. And you can see here that even though only a limited set of services was directly impacted, look at how many different colors are on this graph and how many different services were affected by this failure.

The last graph here, and I forgot to go through it in the slide of the dashboard we had earlier, is our pod restarts graph. What we leverage here is the pod restarts for the services we've designed in the core. You can see there are some significant failures in some of those services; they were designed to restart when they're unable to process certain events or when significant timeouts cause the circuit breakers to trigger those restarts. Another point I want to make on this graph, and something we had forgotten to consider initially, is Kubernetes evictions. Kubernetes evictions come from node resource contention, whereas pod restarts come from service failures or logic failures internal to your service.

So to some extent it's about expecting the unexpected and how we can prepare. John, you might remember the case when the provider of one of our managed services had to restart one of their clusters, and we were not notified in time, and all of a sudden the IP addresses of all the nodes changed and we did not really have time to prepare for that.

Yeah. In the early days of supporting this service, when we had relatively recently deployed it into production, as Clemens stated, they rolled the nodes that we depended on and didn't give us the new IP addresses, and so all our services were kind of panicking. What I want to point out here are the real-life consequences of a lack of observability. I was moving out of my house at the time; I was under my desk taking it apart when I got that marathon phone call, and I'm trying to take my desk apart while also trying to manage this incident, coordinate, and deal with all of those impacts.

So that touches on the real-life consequences of a lack of observability, but there are also the business costs: the amount of resources we needed to have available for troubleshooting, and also the dependencies within reactive systems. Not every team has insight into every service, so you might be observing a failure in one service that is really being caused by another. Think about the impact of "well, it's not my problem": we investigated, we determined it's not our issue, so it gets passed on to the next team; the next team says the same thing, and so on until somebody does identify the issue.
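As an aside on the GC graphs mentioned above, getting those frequency and duration numbers out of a JVM service is cheap; here is a minimal sketch assuming Micrometer's built-in binder, with the registry choice purely for illustration:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class JvmGcMetricsSetup {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        // Registers jvm.gc.pause and related meters, i.e. the frequency and
        // duration of collections that told us more than raw memory consumption.
        new JvmGcMetrics().bindTo(registry);
    }
}
```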
Consider the time and the cost involved in that kind of back-and-forth.

Right, but it's also quite beneficial for the development and delivery teams. If you have that observability built in, and to be fair it is ultimately, to some extent, up to the delivery teams to prepare for that observability and provide those metrics, you simply get fewer support calls. You have fewer open tickets, because so many more things can be handled by the support team. And we've even seen cases where, because we were able to qualify and quantify incidents, the business was able to just say, oh, that's not that important, we can wait until Monday with this, or we can wait until the next release; we don't even have to open a ticket, because we could provide those numbers.

Also consider your reputation. If you're able to communicate the state of your system and where your failures are to your end users, your reputation is much better than if users are the ones identifying failures and calling you to tell you that things are broken. Another point I'd like to make, and one of my slogans, has been: it's never urgent until you need it. And this is production. If you don't have the right tooling and the right observability in place, it's going to be too late by the time you need it. So consider it from day one and put the right observability in place in those critical places where you think you'll need it.

So, some of the lessons that we've learned. As I mentioned, consider observability from day one; it's important, don't discredit it. Also, test your observability at all altitudes. Clemens made a good point about testing your timeouts, and we also looked through some of the valuable dashboards we put together. But consider all the altitudes of your system, from networking endpoints, to the holistic business value you're delivering to your client, all the way down to service-level metrics and the things that are important to those individual services. If you don't do this, you'll have impacts on cost, on time, and on the progress of development, because the development teams get pulled in to track down failures, as well as impacts on stakeholders and the cognitive load it places on your clients. If clients have to work out what the failure was, report to their superiors, and state where that failure was and what the business impact was, there's significant load there. Whereas if you can communicate that from the get-go, you're mitigating all of that for your clients.

That's right. So we think it's really important to have observability as one of the non-functional requirements for every story and every epic that we work on. And to some extent it is also on us, on developers, to point out the importance of that observability. It's really hard to get budget for operational what-ifs; like Jonathan said, it's never urgent until you need it. So it's important to point that out very early in the project: the earlier you build this in, the easier it is to have things such as correlation IDs and traceability working from the get-go.

So thanks for listening. We'll be over at the artwork booth, please come and say hi. And we are a platinum sponsor of this event. Any questions? Absolutely. Thank you.