also from Red Hat and working on Neutron. And we would like to thank you first for coming. We are going to talk about how to debug failures and how to use this little tool called os-log-merger to hopefully help you solve your problems, or at least some of them.

We all know that distributed systems have complex interactions, and in the case of OpenStack this is certainly true. This is not a picture meant to scare people; it is an actual diagram of the basic architecture of the different OpenStack components and how they talk to one another, which is clearly complex. Let's map that complexity onto something we do every day: requesting an instance. You make a request from the dashboard or through the CLI. It connects to Keystone to get authentication, then makes the request to the Nova API. The Nova API confirms with Keystone that the authentication is correct, then goes to the database to check that whatever you've requested is valid, so it can proceed. It places the request on the message broker queue, and it goes to the scheduler. The scheduler checks the database again, creates the entries it needs, and then makes another request through the queue to the compute node, which also accesses the database through the conductor. And then we start interacting with other components, like Glance, Neutron, Cinder. In those interactions we have common patterns, like "I send you a message and wait" or "I expect you to call me back". And inside each of those components we also have complex flows. So even something as common as getting an instance is quite complex; it is not a simple operation.

And this doesn't even take into account deploying things in a cluster, where your request can hit HAProxy, go to any of your servers, and then be delivered to one or more of your agents. OK, this is not that difficult to follow; you can follow it. And then the response can come back through another server, because it doesn't really matter which one. But it's getting harder to follow your request paths. And then you get a different request from the same flow, and it goes through yet another server because it's load balanced. So if you're trying to actually find which problem you have, this is getting a lot messier to follow.

As we said, OpenStack is complex. It has a large number of components; those components have complex interactions between one another, but also complex flows inside each of them. On top of that, we have a lot of configuration options that can go wrong, and a lot of deployment possibilities: different backends, different hypervisors, different network agents. It's complicated. So when you're doing root cause analysis of a problem, you know for a fact that most cases are non-trivial. You may be lucky and it's something easy to find, but usually it isn't, because otherwise you would probably have gotten it right from the start. And it requires a lot of knowledge of the internals if you're trying to crawl through the logs of the different pieces you have, the different components, the different services: you have to know where a request goes next to be able to follow it. Ideally you would have all your logs centralized, which makes this easier. But in many, many cases you don't have your logs centralized; you have them independently, per node.
Maybe you just have a proof of concept you're testing, so the logs are not centralized, but you still want to make sure everything is right. Something's wrong, you have to go and check, and you just start opening log files, trying to follow where the calls are coming from and going to. And like I said, you have to jump through a lot of different files.

So first, a quick introduction. This is the OpenStack default log format. It can be changed, but it's not very common to change it. The most important part here is the date and time, because what we are trying to do is go from the separate files where you have your logs to a single view, with an alias, that orders them correctly and also lets you know which file each entry came from, so you can go to that specific file and check anything if you need to. This makes it easier to follow the logs and to follow all the requests.

And that's where os-log-merger comes in. It was born out of the frustration of trying to debug complex scenarios with HA, active-active, and how everything interacted. Mostly it has been exercised on production cases, including cases where you only receive the log files: you get a bunch of log files from a customer, there are a lot of nodes, a lot of logs, and you start trying to crawl through them, jumping from one to another, trying to figure out where the problem is and then follow the path. This tool is a simple tool. It will not solve all your problems, but hopefully it will make it easier to follow the logs.

The functionality it currently has: it can merge OpenStack-format log files, /var/log/messages-style logs, and timestamp-delta logs like dmesg. It can get files locally or from a URL, and if you get them from a URL it will cache the data; for example, if you're trying to debug something that has failed in the CI gate, you can fetch the files that way. The idea of the caching is that maybe you first retrieve only a couple of logs because you think the problem is only in Cinder, and then you realize you need more logs, because apparently the problem is an interaction with Nova; you add those to the os-log-merger call and you don't need to retrieve all the previous files again. Alias definition, which is how we identify each of the log entries, can be done manually, by giving a specific alias; it can use the file name directly; or it can generate the names automatically as it sees fit, trying to make them consistent, small, and easy to read. It supports multi-line log entries, and it can run for speed or for a reduced memory footprint, which takes slightly longer to sort the files.

Installation: you can install it from PyPI, you can install it directly from master, or, if you're in a customer environment where you don't have virtualenv, you don't have pip, you have basically almost nothing, you can directly download the single file and just run it.

Let's see a use case. You have this directory structure: someone sent you a tarball with node 2 and node 3, and you just want to browse everything in sequential order. You just need to make this call to os-log-merger. The -a3 option means it's going to do auto-aliasing, making the aliases as short as possible. The -ml option is for the /var/log/messages files. Then we specify the timestamp-delta files, and finally the regular log files.
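Conceptually, the merging is simple: every OpenStack-format entry starts with a sortable timestamp, so the files only need to be interleaved in time order and tagged with an alias. Below is a rough Python sketch of that idea only, not the actual os-log-merger code; the file paths and aliases are made up, and multi-line entries, URLs, and caching are ignored.

```python
# Minimal sketch: interleave already-chronological OpenStack-format log files
# by their leading timestamp and tag each line with an alias.
import heapq

def entries(path, alias):
    """Yield (timestamp, alias, line) for one chronological log file."""
    with open(path) as f:
        for line in f:
            ts = line[:23]                  # "YYYY-MM-DD HH:MM:SS.mmm"
            yield ts, alias, line.rstrip('\n')

def merge(files):
    """files: iterable of (path, alias) pairs; print everything in time order."""
    streams = [entries(path, alias) for path, alias in files]
    for ts, alias, line in heapq.merge(*streams):
        print('[%s] %s' % (alias, line))    # the line itself keeps its timestamp

# Example invocation with invented paths and aliases:
merge([('node2/var/log/cinder/api.log', '2/C-API'),
       ('node3/var/log/nova/nova-compute.log', '3/N-CPU')])
```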
As you can see, that command uses a glob to include, from all the nodes, the different components we're interested in, and it produces something like this, where the timestamp is preserved. We don't alter it, so it's easier to locate the entry in the original file if needed. And then you get the alias, which has been reduced from the directory names. In this case it has kept only the number of the node, because the rest of the node path was common to all of them, and the common 'customer' part of the path has been removed as well. And we see the aliases for Cinder API, Quantum agent, Cinder scheduler, Cinder volume, et cetera. They are all in time order, so it's easier to filter what's in the file and follow a request through all your logs.

Now we can see an easy example of merging files from the gate, where we are actually specifying the aliases manually. We define a base for all the files, as in "all files are going to be under this URL"; that's the base. We define the extension, the postfix of those files. And then you define the path within that base, a colon, and the alias. This fetches those files and merges them with those specific aliases. This is where the caching is useful, if you suddenly want to add a new file because you see that you're missing some interactions.

Now we are jumping to the fun part, which is actually using this for something, and we are going to see how you can do it. I'm jumping into a little live demo. My colleagues always tell me, please don't do live demos, but I like it. I hope it's a reasonable size. This is an example of how to leverage this to go through that kind of log mess in the gate. If you are a developer, you submitted a patch and you are trying to debug a multi-node failure, you can use this. This is the review number, but you can get the log URLs from the review output. So it will run this; in this case the HTTP requests are cached locally. Now it's processing all the logs to put them together, and you get everything in a single log: the compute node messages, the Open vSwitch messages. So you can go straight to the place where you are seeing the issue, or correlate with the specific failure and maybe the request number. You could filter by the request number, for example, and see what happened across the services. I don't want to go into the details of that log, because it was a mess. But before this, I remember long sessions of having an editor with five open windows and jumping between logs. It was horrible.

So, OK, I'm going to explain how we can put this together with Ansible for troubleshooting, so we can do this in an automated way. If you have a deployment that is experiencing a failure and you know how to reproduce it, you can try it on a staging environment, or at a customer's, because you are providing support for that deployment. You can automate the whole thing, get the results, and find what's happening. You can also leverage that to make sure that a bug fix you are trying really works; you can check it automatically. So basically, in the small methodology I'm going to show here, we use Ansible to orchestrate the manipulation of the hosts: setting debug, restarting services to pick up the debug flag, and introducing probes in the system to augment the information that we have. For example, pinging from one side of the network to another or to a floating IP of an instance, getting the ARP tables on certain nodes, or more complex stuff that I am going to show later.
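To make the "probe" idea concrete, here is a minimal, hypothetical pinger sketch, not part of the tool itself; the target address and log path are invented. It pings a floating IP in a loop and writes the result with an OpenStack-style timestamp prefix, so the resulting file can later be merged with the service logs.

```python
# Hypothetical pinger probe: ping a target in a loop and log the result in an
# OpenStack-like log format so it can be merged with the other logs.
import datetime
import subprocess
import time

TARGET = '172.24.4.10'                  # e.g. a floating IP of an instance (example)
LOGFILE = '/tmp/pinger-probe.log'       # example path

def log(msg):
    ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    with open(LOGFILE, 'a') as f:
        f.write('%s INFO pinger-probe %s\n' % (ts, msg))

while True:
    # One ping with a one-second timeout; only the success/failure is logged.
    rc = subprocess.call(['ping', '-c', '1', '-W', '1', TARGET],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    log('ping %s %s' % (TARGET, 'ok' if rc == 0 else 'FAILED'))
    time.sleep(1)
```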
And finally, when we reproduce the issue, we grab the logs from the different hosts and put them together in whatever way is most reasonable for our debugging purposes. And that's it.

I will go quickly through this; I don't want to go too much into the details, and you will have the link to the presentation later. Basically, this is very simple Ansible. You use it to configure the debug flag on the nodes, in this case for Neutron and Nova, because we were looking at a Neutron issue during live migration. The next thing is a simple hack to rotate the logs, so we start fresh, and we restart the services to make sure that they pick up the debug flag and that they start with the new log file. And the same for the compute nodes: this clears the logs and restarts the specific services, the Open vSwitch agent and then nova-compute.

And then this part reproduces the issue we were trying to debug. It was a live migration from one node to another, and there was a hidden race condition between Nova and Neutron. At the time, Nova was starting the VM before the underlying network was configured. So the VM was starting and sending a network announcement, reverse ARP packets saying "I'm here", so that the network topology would learn the new location of the instance and reconfigure everything. But since the network was not properly configured yet, that notification packet went nowhere. The end result was that the VM apparently took one or two minutes to become reachable again on the network, despite being active and apparently working. Once we finish reproducing the issue, we just fetch the logs locally to our host with the synchronize action, and the final step is merging them into a single file and looking for your problem.

One of the interesting things you can do to get more meaningful information, especially in the network context, where things can be very complicated sometimes, is introducing probes: as I said, pingers, something to dump the flows of Open vSwitch, netprobe, which is a helper we have recently introduced and which I will explain, or anything else you want, as long as it uses the OpenStack log format so you can merge it together with the rest. And I will show you a very simple three-line example of how to do that with anything on your system.

So we introduced this small tool, netprobe. Again, it's a standalone, single Python file, so it doesn't have any dependencies, and you can upload it to any host; as long as you have Python and, in this case, tcpdump, it's going to work for you. It watches the network namespaces and the network interfaces on the host where you start it, and when it finds known patterns, like a router interface from Neutron, an instance tap device, or a DHCP port, it starts logging the control traffic on that interface, gathers the messages from all the tcpdumps that it starts into a single log, and writes it to a file. In the end, it will write something like this. In this case, it found an interface in a DHCP namespace and it's logging it. And then the packets will look like this; they didn't fit on the screen, so I cut them, but basically any packet will show up here in the tcpdump format, and you have the interface name here. Those are the default settings, like which kinds of interfaces it's going to look for, but you can configure them on the command line. And the default tcpdump filter: here I'm logging the most common control traffic, DHCP, ICMP or ICMPv6, and ARP, and writing everything to this location.
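The same trick works for anything that can emit lines in the OpenStack log format. As a minimal, hypothetical sketch of that "small wrapper" idea (this is not the actual netprobe code; the script name and interface are made up), a wrapper that runs an arbitrary command and prefixes each output line with an OpenStack-style timestamp could look like this:

```python
# Hypothetical wrapper ("logwrap.py" is an invented name): run any command and
# prefix every line it prints with an OpenStack-style timestamp, so the output
# can be fed to the log merger together with the service logs.
# Example: python logwrap.py tcpdump -l -n -i tap1234abcd > tcpdump-tap.log
import datetime
import subprocess
import sys

proc = subprocess.Popen(sys.argv[1:], stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, universal_newlines=True)
for line in proc.stdout:
    ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    # Mimic the "<timestamp> <level> <source>" prefix of OpenStack log lines.
    sys.stdout.write('%s INFO probe %s' % (ts, line))
    sys.stdout.flush()
```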
But netprobe has a command line, and you can modify those default parameters here, here, and here, so you can make it work for you. And this is a very simple example of how to grab something on your system that you want monitored together with all the other logs that you are getting: it's just a matter of putting your command here, or a function that loops doing whatever, and then applying this small wrapper to prepend the OpenStack log format.

So, yeah, this is an example of how to upload this netprobe to the system and start it. I make sure that I kill any old one first. Please forgive me, my Ansible abilities are very low; there's probably a much better way of doing that, but it works. It's just an example of killing any old probe that you have on the system, copying the netprobe script, and starting it, so you will have the logs available in the end. You have links here to some examples.

I'm going to show you how it looks to use an Ansible script on a deployment. It's a single-node deployment, because my laptop couldn't take much more, but let's go. Hold on, hold on. Yeah, in this one, first I clean up all the resources that I already had on the system before setting the debug flag and restarting services. It waits for the resources to be properly cleaned up. Now it's restarting the services with the debug flag so they also start writing to the new log file. And this copied and started the network probe that I was talking about. Now it's creating two networks, two routers, two VMs, and waiting for them to come up. I'm actually looking into the netprobe log to find the DHCP request, so it keeps waiting until it sees the DHCP request, and there is an ICMP workload that I set up on the VMs, so it also keeps waiting until it sees the ICMP. In the end it finishes and retrieves all the log files locally; I have them here. So I have here all the logs that this VM had.

And finally, the logs after going through os-log-merger. If I look, for example, for a BOOTP request... sorry, I'm going to chop the lines. If I look for the BOOTP request, we can see that right after nova-compute spawns and resumes the instance, the instance requests DHCP. This is the instance tap. This is the router getting the DHCP request, but it's not able to respond. This is the DHCP agent port getting the request, and this is the DHCP agent responding to the request with the instance IP address, the .3 in this case. This is the instance getting it. After a while, we see the instance looking for the router address, asking: what's the MAC address of my router? And the router responds: I'm here. That happens because the instance is trying to do a metadata request. So we can see that, after finding the router address, it does the metadata request. We see the Neutron metadata service, and that goes to Neutron, and at some point that also goes to Nova. And we could follow the whole thing; I filtered it a bit too much here, but yes, that was the instance making its metadata requests through the whole chain: it has to go through the router, to the Neutron metadata service, to Nova, and many, many things happen. So that was it.

What's missing in this project? We want to move it to StackForge, because right now we are on GitHub, and maybe get other contributors. I started it and Gorka joined me because he found it interesting. We also want to add maybe some more filtering options and make it easier to use with the gate, mainly because we are developers and we find it useful.
We also want to make a small Ansible generator that would be able to discover the hosts that you have, maybe by looking at the Neutron agents, or maybe by looking at your undercloud if you are using TripleO or something like that. And also provide more probes; this one is a network probe, maybe there could be others, we are open to ideas. And also maybe make a version for the gate, because in the gate, as developers, we have a log viewer that works in JavaScript, but it's only able to handle one log, and it could be cool to have it handle several of them. And that's it. If you have any questions, feedback, or ideas, please go to the microphones. And thank you for coming.

Is that working? Yeah. I was just wondering, you mentioned briefly centralized logging. I'm curious how this would work with doing everything centralized. I've been trying to work on a project that's very complex, with five controllers and a VIP, and everything is HA. It's very big, but it's a huge mess, so it's very, very difficult to get merged logs that are useful without having to manually dig through every single thing.

Well, the idea here is that this is for when you don't have centralized logs. If you have them centralized via ELK or Fluentd or any other mechanism, this is not really useful. Yeah, this is something we also wanted to look at, because this is a first step for when you don't have any centralized logging to work with. We're also looking at things like ELK and so on. Yeah, something we want to look at is being able to integrate with that kind of centralized log system to do queries like "I want this node, this node, this service, this service, in this time frame" and get something meaningful. Maybe having, I don't know, a separate tool based on this, or maybe a command line for that. But I think what you were asking is whether we have a solution for merging the logs in your environment dynamically, live, right?

Well, something like that would be more like the ELK stack or something. But I was thinking just being able to dig through programmatically would be useful. I mean, once you have everything merged and you're using something like the ELK stack, that's kind of out of the scope of this. Yeah, right. But something that would be able to probe through right when you have an error, right when you're having issues, would be really useful too. To jump directly to that specific part, you mean? Would it be able to track down, once you find a request ID for instance, and dig through the whole stack and find where this failed, that type of scenario?

Oh, right, yeah. I think there are some efforts working on that, on root cause analysis, trying to determine from an error why it happened. I really don't know the specifics or what stage they are at, but I know they are working on things like that: if your database fails and everything starts failing, don't report all the small errors, just report that the database is down. They are also trying to do root cause analysis, detecting the error and determining where it's coming from. But I don't know the status at this point. What is useful is the mechanism we have to pass the request from one service to another, so you can actually have a common request ID. I don't know if it's completely implemented, but there is a mechanism so that when Nova calls Cinder, for example, it passes the original request, so you can follow them more easily in the logs.
So you don't have to do it manually, like: okay, here Nova made the call; let's see which call appears in Cinder at that time. I know there are efforts; I don't know if they've completed the work, because there is still a way to go. It wasn't as easy as we all originally thought, like "you just need to add the original request": when you actually start looking into it, it's a lot more complex. Yeah, it can be a tree of calls. Yeah. We haven't looked into it in a while. But I think that with that it would probably be easier, more helpful, because you locate the original error, you know the original request, you filter by that request on all your logs, and if you use something like this to merge everything and filter by the original request, since all the requests from that tree will carry it, then you have a sequential order of events that is easier to follow, I think. I hope we have been able to answer that.

Excuse me, I have a question. Isn't there any capability of splitting the log, splitting the output? Do you mean a capability of splitting the logs? For example, per request or per period. Yeah, not yet. That's one of the things I also wanted to look at, because right now we just filter the final merged log. We usually filter it in the viewer; for example, with less, you just filter there, like Miguel is probably about to show. Generally, when I do this, I open it with less, and if I want to see the Neutron services, I can do this, and that's going to give me all the Neutron services; or if I want Nova, for example, Nova API and... maybe this is not the right file. But the reason we didn't implement filtering at the beginning is that we usually do the filtering with less, or whichever viewer you're using: you just build your filter query there. For example, if you want a specific request, you filter by the request ID, and if you also want to include a specific node, you use the pipe and add more things to filter by, and then the view shows only that. But it is something we are considering, yes, adding filters. Okay, thank you.

Okay. My question is kind of related to the previous gentleman's. We have a cloud which is really huge; we have distributed services, and tools like this really help, because you want to aggregate all the logs. But the challenge that I face is, first, you have to figure out where the request is going. If I'm making a volume create, right, it can land on any one of the Cinder controllers, basically Cinder VMs essentially, which we are running. The second thing, and this comes back to the filtering: simple filtering does not help, at least in my case. What I've seen is, if I'm just doing a simple grep to try to figure it out, say I create a volume and I find the volume failed, you try to look at the logs and search for the volume ID to filter. But when some error happens, you would see the call request, and then you miss part of the trail, because not every log line has the UUID of the volume. And most of the tracebacks that show up don't have any UUIDs; it's just a big traceback. So usually, when I grep with some simple filtering tools, I actually miss out on those tracebacks.
The only way I can do it is to go into the logs, find the UUID, and then traverse the logs from there. Yeah, usually you locate the UUID, then the request ID that goes with that UUID, and you filter by that; and then you mark the specific location where you say, oh, here there is a traceback I would otherwise miss. So you mark it, you hit "ma" for example to mark it, you undo the filter, and you look. Yeah. But it would be nice to have a tool which kind of does that for you, because it takes a long time to figure that out. Yeah, the thing is, we feel your pain, because we've done it. Yeah, so I have a similar tool which I use internally, and I would like to collaborate with you guys if you have something, so I'll follow up with you afterwards.

Sure. The thing about the more advanced filters is that it's something we want to do, but we know that as soon as we get the common, parent request ID in the code, everything will be so much easier that it's like: if they are going to have it done in three months, do we want to spend more time on this to build the filters, when we are going to get so much more from the common request ID? So, yeah. Yeah, but I suppose that maybe another tool on top of this, like a log navigator specifically built for OpenStack, could be interesting; maybe it could help you filter things in real time, because you don't want to go back to the command line and change your filter, you want to navigate, select stuff, move around. But yeah, there are already other good tools for that, though I don't know if they're good enough. Okay, well, thank you very much.