Hi everybody, it's a minute early, but again, we're so excited, so we're going to get started. Thank you all for coming to our session today on debugging the routing tier. My name is Angela. I'm a software engineer at Pivotal working in the CF networking program.

I'm Nithya. I'm also a software engineer at Pivotal in the networking program.

So obviously we're talking about debugging the routing tier, but before we dive into the meat of the talk, let's start with a brief overview of what we're planning on discussing for the next 30 minutes. We're going to start with a brief overview of the routing tier: what do we mean when we say the routing tier, and what components make it up. Then we'll take a deep dive into two case studies that we, as members of the networking program, actually worked on debugging. These are real case studies from real users, where we really dug in to figure out what was going wrong when the problem manifested in the routing tier. And then we're going to finish it off with additional debugging tips. Our hope is that by looking at these case studies and talking about debugging tips, everyone in this room leaves feeling empowered, so that when you encounter an issue that manifests in the routing tier, you have the tools necessary to at least start the debugging process and hopefully figure out and fix the issue at hand. And with that, I'm going to hand it over to Nithya to talk about what exactly the routing tier is.

Cool. So what is the routing tier, and how does it fit into the rest of your Cloud Foundry? We want to give you a brief overview of what it is so that we can then look at the case studies and debugging tips. If we look at this diagram here, the gray box represents the system components of CF, like the API, Diego, UAA and the routers. The components in the gray box are what make it possible for you to cf push and run your applications. And then all of the colorful components are the components of the routing tier. In a typical CF deployment, your routing tier may consist of a load balancer, HAProxy instances, the routers and your applications.

If we look at just the left side of this diagram, we can see all the components that an HTTP request might hit before finally being routed to an HTTP application. The load balancer is the front-facing part of your CF deployment. It's the component that receives the request, and it forwards it on to HAProxy. The HAProxy instances are configured to point to the GoRouters as their backends. You tend to use HAProxy in a deployment if you need features offered by HAProxy that are not offered by your load balancer or your CF routers. For example, HAProxy allows you to define access control lists to block different kinds of requests for an extra layer of security. And then you have the GoRouter. GoRouter is a layer seven HTTP router that handles connections to the application backends. It has a series of handlers that run and process your request before it then makes a new request to the backend. It can also route you to other CF components like the API.

So now that you have an overview of the components that make up the routing tier in Cloud Foundry, we'll talk about issues that people have encountered when running CF in production. We'll go through some case studies, show how to debug them, and also give some general debugging tips for common issues. So I'll give it back to Angela and we'll kick off the case studies. Great.
Thanks, Nithya. So we're going to start with case study number one. Again, these are real problems that we had to slog through debugging, and by going through them we really want to give you a framework and tools that will hopefully help you debug in the future.

Oftentimes, as members of the networking program, we get issues from users, from customers, and the initial statement isn't necessarily the most helpful. Being told something like "the GoRouter is broken" or "it's panicking right now, and it's in prod" is definitely scary, but it doesn't give us information on what exactly we should be debugging. So the first thing that you probably want to do when debugging the routing tier is to state the problem in your own words. Maybe you're the one who encountered the issue, but formulating the problem in an actual problem statement makes it clear what exactly you're trying to debug. And if you're somebody who's receiving an issue, like we were as members of the networking program, stating the problem makes sure that you have the same understanding of what's occurring as the user who's reporting the issue.

So in this case, we formulated a problem statement. The problem was that the customer had two microservices deployed in their Cloud Foundry. One microservice was externally facing and sending CRUD requests to another microservice, which was acting as a backend. The frontend microservice would send a GET request and then immediately follow up with a PUT request. The GET request would always succeed, and the PUT would consistently error out.

So this is a wall of text, and I know sometimes it's hard for me to actually internalize what it really means, so let's visualize what the statement is saying. In this case, we're talking about the GoRouter and two microservices, a frontend and a backend. The GET request comes in, hits the GoRouter, and is forwarded to your frontend application, and the frontend wants to communicate with the backend. In this case, it was hairpinning back out through the GoRouter, and the GET was being forwarded to the backend. The GET would succeed, the information would be sent back, and the PUT request would immediately follow. And then what happened? We know that the PUT errored. Did the GoRouter fail to respond? Was the error happening at that level? Or was the GoRouter successfully connecting to the backend, only for the backend to fail out? Who really knows, right? This is compounded by the fact that, as Nithya alluded to earlier, the GoRouter, frontend and backend are only some of the components in your routing tier. When we zoom out and look at the entirety of your Cloud Foundry installation, you're probably also going to have an HAProxy and a load balancer. So instead of just having four connections to worry about, you now have a multitude more, and a lot more places where the failure could actually be occurring.

So the second thing that you wanna do, after stating the problem in your own words and understanding what issue you're currently trying to debug, is to level set on the architecture. How exactly have you set up the routing tier? In this case, the architecture was a pretty straightforward setup, with a load balancer forwarding to an HAProxy, to the GoRouters, and then your applications. It's easy at this point to say, okay, I know what the topology looks like, I know what the problem is.
Let's figure out exactly which connection is failing. But we're still in the information gathering phase, so in addition to level setting on the architecture, I would strongly encourage you at this point to also level set on the configuration, and that's what we did. When we say level set on the configuration, we mean gathering information about the load balancer, the HAProxy and the GoRouter, things like the multitude of timeouts that are set at each of these levels, because oftentimes those, in conjunction with one another, could be what's actually causing the error.

After doing all of this, the last step is to gather additional information. You know the architecture, you've been able to put all the configuration information next to each of those boxes, but there's probably more that you wanna gather. In this case, the initial problem statement we came up with was based on the application logs, where they saw an I/O error on the PUT request from the frontend application. So from there, it makes sense for us, as the people debugging this problem, to look at additional logs, and see if we can map this PUT request to any errors in the GoRouter logs or any errors in the HAProxy logs. So you wanna gather logs on basically every single component you can: load balancer, HAProxy, GoRouter and applications. Sometimes this will lead you to a root cause. In our case, there wasn't any additional information to be found in the logs, unfortunately.

And so this led us to the use of our first debugging tool, which is tcpdump. So what is tcpdump? tcpdump is a packet analyzer that you use on the command line. There is no GUI, so it's sometimes hard to actually parse as a human, but it's really useful to run on remote servers to gather information, write it out to a file, and then load that file into some other packet analyzer that does have a GUI, which makes it easy and intuitive to read through packet traces. So we ran a very straightforward tcpdump on all of the VMs that we had access to, the Diego cells where the microservices were running, the HAProxy and the GoRouter, writing it out to a pcap file and listening on the eth0 interface for network traffic.

We then took all of those files and piped them into the next tool that we used for debugging in this case, which is Wireshark. Wireshark is an open source packet analyzer, in this case with a GUI. You can use it to analyze local traffic in real time, but you can also load capture files into it and debug in the GUI, and that's what we did here. It allows filtering and inspection of packets to make it easy to follow along and hopefully figure out what's actually happening with your network traffic.
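The captures we ran were along these lines; the interface, file path and IP address here are illustrative, so adjust them for your environment:

    # on each VM (Diego cell, HAProxy, GoRouter): capture traffic on eth0 into a pcap file
    tcpdump -i eth0 -w /tmp/routing-debug.pcap

    # afterwards, open the pcap in Wireshark; tshark (Wireshark's command-line tool) can
    # also skim it, for example to find connection teardowns involving a specific host
    tshark -r /tmp/routing-debug.pcap -Y "ip.addr == 10.0.1.5 && tcp.flags.fin == 1"

The same ip.addr and tcp.flags expression also works as a display filter in the Wireshark GUI.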
So what does this actually look like? Here's an example of looking at a packet trace in Wireshark. Reading from left to right, we see the packet number, a timestamp, and then a source and destination address. We then see the type of traffic, the length of the packet, and any additional information. This was a trace that we filtered down to a connection between the Diego cell where the frontend application was running and the HAProxy. If we zoom in on this packet trace, we see that partway through this connection, there was a FIN,ACK packet sent from the HAProxy to the Diego cell.

Basically, the HAProxy was saying, I wanna close this network connection to the Diego cell. And that's interesting, because immediately after, what we see is the Diego cell sending a myriad of data to the HAProxy. So it looks like the Diego cell isn't respecting the close. What you would normally expect to happen after a FIN,ACK packet, after a close being sent in one direction, is that the server on the other end sends a FIN,ACK packet back to say, oh, I see you wanna close the connection, I'm going to close this connection too. But in this case, we didn't see that. Instead we see data, likely the PUT request being forwarded, and then a bunch of red for the reset. So we see that the issue at hand is connection five, between the frontend and the HAProxy. This is where our problem is occurring.

When you're debugging, you're usually not debugging in isolation, and you're usually not the only one debugging, either. You're going to have information coming in from multiple sources; you're not living in a bubble. And at the time that we had this realization, we got an interesting tidbit from the user who reported this issue. They noted that disabling keepalives from the GoRouter to the backends fixed the problem. And we were sort of like, what does that even mean? Disabling keepalives from the GoRouter to the backends was a change made to the red arrows here, and somehow this was fixing the problem. Like, what? How are these keepalives on connections four and seven fixing the error on connection five?

So based on this, we decided to come up with a hypothesis and a plan of attack. Our hypothesis was that some value was being propagated from the GoRouter to the backends that would make the connection between the HAProxy and the frontend application fail. That is to say, these red arrows here were somehow propagating information so that the connection over here would fail, and disabling the keepalives made it better. But hypotheses don't live alone; they come with a way to either prove or disprove them. And going back to all the information we had at hand, going back to this Wireshark output, we couldn't come up with a reason why a value being propagated would either cause this error or fix it. So we decided instead to treat this as a red herring and go back to the initial problem statement. Let's not focus on what was a workaround; let's instead focus on identifying the root cause.

So we came up with a new hypothesis. We repeated step four and came up with one that we could hopefully validate or invalidate. Hypothesis number two was that something about the HTTP client was causing the close packet to be ignored. This really narrowed in on the connection that we were seeing fail and didn't take any other components into consideration. And actually, when we came up with this hypothesis, we got another interesting tidbit of information that supported it: if we waited five seconds between sending the GET request and the PUT request, the PUT would always succeed, but when the PUT request happened immediately after, it would always fail. So that seemed pretty closely in line with our hypothesis that there was something about the HTTP client. So we actually looked at the application source code for the frontend.
It was a Spring Boot app, and it used the default HTTP client, the one that you would assume does the right thing. But we took to Slack to ask our local Spring experts, and we took to Stack Overflow. And what we found is that the default HTTP client actually doesn't respect close packets from the server; it will just ignore them. And in conjunction with that, its default keep-alive timeout is five seconds. Going back to the configuration information we gathered, we saw that the HTTP keep-alive timeout on the HAProxy was half a second. So the HAProxy, after half a second, would close the connection, but the default HTTP client wouldn't respect that, and for the next five seconds it would try to reuse that connection, which was causing the PUT to error out.

So what lessons can we actually learn from this debugging exploration? One of the main things we took away was that clients have configuration too. It's so easy to think about the routing tier as just the system components, but your application is an important part of that routing tier, with configuration of its own. So when you're gathering configuration, you should always gather information about the HTTP client as well. Additionally, we also learned that we should probably ignore the red herrings and short-term fixes. The keepalive setting from GoRouters to backends wasn't helping us with our root cause analysis; it sort of led us astray. Only after we figured out what the root cause actually was did we go back and figure out how setting this value could fix the problem in the short term, which was another interesting exploration in and of itself. But with that, I'm gonna hand it over to Nithya to talk about the second very interesting case.

Cool, cool. So let's read the problem statement for this one: the API becomes unresponsive when trying to cf login, other requests succeed, and some of the GoRouter instances have high memory usage. So first, let's visualize this. Even though we were told that the GoRouter has high memory usage, let's remember that there are several components that the cf login request needs to travel through. Here's a diagram of what the deployment architecture looks like. Requests travel into the load balancer, through one of the HAProxy instances, into one of the GoRouters, before finally making it to the API. And then the response needs to go back through all of these components to get back to the client.

So now we can start to gather information. We'll try to start broad, gather clues, and then zoom in as we eliminate possibilities. Here are a couple of places where you can start to look for information, roughly ordered from broad to more focused. As you move from top to bottom, investigating various sources of information, you may find some indicators, like specific error messages, that allow you to skip ahead to the relevant components. Right now, the only information we have relates to high memory usage and cf login requests hanging. So let's start broad and look at the VM vitals.

So here's a snippet of some VM vitals. This is all stuff that you can get from BOSH that can tell you about the health of the VMs. This snippet happens to be only from router instance zero, which is one of the VMs that was experiencing the problem from the problem statement. You can observe here that the memory usage percentage on router zero is pretty high, but the CPU usage is fairly low. Another interesting thing on this router is that the memory swap percentage is actually increasing over time.
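If you want to pull these vitals yourself, the BOSH CLI will show them per instance; something along these lines, where the deployment name is illustrative:

    # per-instance CPU, memory, swap and disk usage for the deployment
    bosh -d cf vms --vitals

Watching these over time, rather than taking a single snapshot, is what makes something like steadily growing swap usage stand out.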
So based on what we've gathered so far, we see high memory on this router instance, and we see that the memory swap percentage increases over time. One hypothesis could be that the GoRouter isn't scaled enough, and there aren't enough instances of GoRouter to handle the load. But we also learned from poking around on these VMs that even though we see these symptoms on router instance zero, this wasn't the case for all of the router instances; it's just happening on a few of them. In fact, after restarting the routers with the high memory usage, the problem would return, often on another router VM. And the routers that did experience these symptoms also still reported that they were healthy, so there's no indication to the load balancer that it should redirect traffic elsewhere.

So there are a couple of things we've seen so far that tell us that horizontally scaling the GoRouter isn't really gonna help us with this issue. One, the CPU usage is fairly low, so the router isn't overwhelmed with large amounts of work; in fact, a majority of the CPU time was spent waiting. And the load balancer health check requests themselves are just fine, which indicates that only some of the requests are bottlenecking. So adding more GoRouter instances won't help, since traffic would still be going to the problematic VMs.

So now we should try to zoom in further, look at some logs, and see if we can find anything useful. Looking at the GoRouter error logs, one error sticks out: the GoRouter is having trouble making connections to other components because it's hitting the maximum number of open files. So now we have something to start digging into. What is causing us to hit this limit? Is the limit itself too low? And which connections, both incoming and outgoing, are causing us to hit it?

Let's look at the diagram again and try to pinpoint where we think the problem is happening. This is an architecture diagram of possible request paths in this deployment. Open files on the GoRouter could be caused by any one of these three red arrows: HAProxy to GoRouter, GoRouter to app backends, or GoRouter to other system components. It turns out there could actually be one more source of open files on the GoRouter, which is route services. If your request is going to a route service, the GoRouter, as part of handling that request, will send a request to the route service and wait for that request to be processed before sending a response back to HAProxy. So the HAProxy to GoRouter arrow will stay open while processing is happening in a route service.

So we wanna try to zoom in on these red arrows where we can and gather some more information. The information that we were able to get about these connections is that the number of connections between HAProxy and the GoRouter was growing, but not the connections from the GoRouter to the backends. That's pretty helpful. It helps us narrow down to this highlighted arrow as the side that's growing.
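If you want to gather that kind of data yourself on a router VM, commands roughly like these can help; the process name pattern and the exact output columns are assumptions about your environment, so treat this as a sketch:

    # how many file descriptors the gorouter process has open, and what its limit is
    PID=$(pgrep -o gorouter)
    grep "open files" /proc/$PID/limits
    ls /proc/$PID/fd | wc -l

    # count established TCP connections grouped by local port, to see which side is growing
    ss -tn | awk 'NR>1 {print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn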
So now we can start to reason about hypotheses for why that highlighted arrow might grow. We just talked about how connections from HAProxy to the GoRouter stay open while the GoRouter is processing a request, and in the case of a route service, that connection will stay open while the GoRouter sends to the route service and the route service processes the request. So a slow or misbehaving route service could cause connections from HAProxy to the GoRouter to stay open, and that could be a reason why this arrow is growing. So misbehaving route services are our next hypothesis. We ruled this out by killing route services that were slow to respond and seeing the problem of unclosed connections persist on the GoRouter.

If you recall from before, it did seem like only some of the requests were bottlenecking, and we've ruled out route services at this point. We need more information about what requests are actually going into the GoRouter, and the GoRouter's access logs can give us that. From the access logs, we actually saw many 404s, but also that the 404 request time was much higher than that of other response codes. So our new hypothesis is that the 404s are causing the connections to stay alive. We have a lot of information at this point, but all of our observations so far seem to be symptoms of a deeper problem. So what we want to do is really dig into the 404s and see how the GoRouter itself is dealing with them.

To help us dig into what the GoRouter is doing, we can profile the code. A good tool for doing that is pprof. Pprof allows you to collect and visualize profiles like CPU profiles, heap profiles, goroutine profiles and traces. GoRouter has a pprof endpoint for collecting profile and trace data.
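Collecting a profile from that endpoint can look roughly like this; the debug port here is an assumption, since it's configurable on the GoRouter job, and you'd analyze the result wherever you have the Go toolchain:

    # grab a 30-second CPU profile from the gorouter's pprof endpoint (port is illustrative)
    curl -o cpu.pprof "http://localhost:17002/debug/pprof/profile?seconds=30"

    # then explore it, for example in pprof's web UI, which includes a flame graph view
    go tool pprof -http=:8080 cpu.pprof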
Flame graphs are a way to visualize profile data like this, data containing stack traces collected by tools like pprof. So we'll look at a flame graph of the profile data for the GoRouter while the issue was occurring. So this is that. I know you can't read the text, but we'll talk about it.

First, let's talk about how to read a flame graph. The text in all of the little boxes is function names. When using CPU profiling, the profiler interrupts the program's execution at specified intervals and logs the state of the program's call stack. The width of a box tells you how many samples during the profile were taken while that function was executing, so you can very, very roughly think of it as time spent in that function. Generally you want to look at the wide boxes, and at the last couple of calls in the stack within those wide boxes, which are towards the bottom.

So there are a lot of samples spent in this stack. An interesting part about this stack is that it calls into the logger, so there's a bottleneck in logging. We know that when profiling this code there were a lot of 404s, and that the 404s likely resulted in logging. And further down the stack, we see that during the write for the log, it's also calling a locking function. We finally found our bottleneck, which is lock contention when writing to the disk. So it turns out writing to the disk was the bottleneck, and lock contention when writing to the disk could be caused by any combination of many writes to the disk or an actually slow disk. The router attempts to log "route unknown" for every 404, which was happening very frequently when the problem manifested. So this could be what's causing the connections to persist: they're stuck waiting to write. And this was verified by isolating a GoRouter VM onto an empty hypervisor with no other VMs and seeing the problem go away, because there was no more storage contention. Yay.

So here's what we can take away from this case. There were a lot of VM metrics that we looked at at the beginning, but none of them really had to do with disk. So maybe looking at iostat or something on the problem VM earlier on could have tipped us off in the right direction sooner. But it's also useful to start broad and gather as much information as you can at each step, because then, when you look at more granular data from tools like pprof, you have some context. We knew a lot about what was going on in the system by the time we looked at the flame graph, so it was easy to tell what was going on. Cool, so now I'll give it back to Angela for some more general debugging tips.

Great, thanks, Nithya. That's so fascinating. So now we're just going to go into some brief general debugging tips. One of the first things we like to look at when we're debugging issues in the routing tier is timeout configurations. In general, you want to make sure that your timeouts are highest on your load balancer and then lower as you go further down: load balancer, HAProxy, GoRouter, application, high to low. You want to do this to make sure that you don't have a component further down in your network stack closing connections prematurely and causing problems to manifest.

The next thing we want to talk about is logging. While this may seem sort of silly, one thing we always like to highlight is that your application logs have timestamps too. So if you see an error manifest in your application log, you can take that timestamp, look at the corresponding logs in your GoRouter or HAProxy, and see what was happening at around the same time. We also want to note that we recently added support so that, for endpoint failures that show up in the access log, you can map them to a log message in your GoRouter logs by using the VCAP request ID header. So you can gather more information about what exactly is happening with the endpoint failure by correlating the two logs.

And lastly, some additional tips: when in doubt, you can always pull your load balancer out of the equation. You can go onto your network and hit the GoRouters directly. This also helps if you're worried about individual routers, by allowing you to test routers one at a time. And last but not least, we also recommend checking your certificates at each level. Certificates can be pretty hard to manage, so when in doubt, you might want to make sure that everyone's expecting the right certificates.

And with that, we have some references if you want to look further into any of the tools we talked about in both case studies. Our slides aren't up yet, but we will definitely be posting them, and you can click on the links there. We definitely ran over time as we expected. So thank you so much. We'll be outside the room if you want to talk to us further, but otherwise have a great rest of the summit.
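As a quick reference for a couple of those last tips, here is roughly what the checks can look like from the command line; the router IP, app hostname and log path are all illustrative:

    # pull the load balancer and HAProxy out of the equation: hit one GoRouter directly,
    # setting the Host header to the app's route
    curl -v -H "Host: myapp.example.com" http://10.0.16.21/

    # check which certificate a given tier presents, and whether it has expired
    openssl s_client -connect 10.0.16.21:443 -servername myapp.example.com </dev/null \
      | openssl x509 -noout -subject -dates

    # correlate an access-log entry with the GoRouter logs via the request ID
    grep "REQUEST_ID_FROM_ACCESS_LOG" /var/vcap/sys/log/gorouter/*.log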