OK. Welcome, everybody, to the Cloud Hands-on Troubleshooting session. I'm Patrick from Linux Polska, where I work as a senior technology architect. I'm currently focused on developing different solutions for OpenStack, and I hold a few certificates, like Splunk Certified Architect and Red Hat Certified Architect. I'm also an instructor, and I teach Red Hat and Mirantis classes. So we're going to have a hands-on session, and we will divide it into three parts. The first one is the lab introduction. The second is actually split into two subparts: the first one is you setting up the environment on your own devices, and in the meanwhile I give you a presentation about troubleshooting. After that, we're going to do the third part together, which is hands-on. OK, let's start with the lab. There are pen drives around, a lot of them, so grab one from a person next to you and copy the content onto your device. There is a virtual machine in there, a virtual appliance to be imported, and this is actually lab number one: you set up the environment. In the meantime, while you are setting this up, I will give the introduction-to-troubleshooting presentation. We'll meet at this blue point, which is lab number two. Please don't proceed to lab number two until I complete my presentation. I'll show you how to import this machine. What I have here is the content of the USB device. So just click on the OVA file and then import it; I agree to the license and just wait a couple of minutes (a command-line alternative is sketched below). So, are there any questions so far? There are four folks with me wearing the same t-shirts. If you need some assistance, just raise your hand and they will help you; they will reach you. Is everything clear up to this point? Who has a pen drive now? Only two people. That's better. Let's look at how my import went. After the import, please start this VM, and this is lab number one completed. So please do it on your own. OK, so let's start with part number two, which is the presentation, and I will try to explain to you what troubleshooting actually is. Yes, sir? Yeah, you can try, but it will probably be swapping if you are under three gigabytes. OK. So, we all intuitively know what troubleshooting is. We all know that it is a nice thing to have, to maintain a complex system such as OpenStack. But what we usually forget is that troubleshooting is a must-have. It is a prerequisite for developing and maintaining a complex system; it's not only needed for maintaining, it is a prerequisite for developing. Troubleshooting usually consists of four phases. The first one is just the identification of what is not working, what's malfunctioning. The second phase is finding the possible causes of the problem. What's important here is that we are not trying to eliminate the problem at this point, and we do not try to fix the symptoms of the problem, just find the root cause. A good practice is to make a list, because there might be many possible causes, and assign a probability to each cause on that list. Then sort the list and start eliminating, beginning with the most probable causes; that's point number three, the elimination of causes. And then we can proceed to the confirmation, which is just a kind of test that it is working again. A good practice is to write some kind of guide, a so-called cookbook; if someone else runs into this problem, they can use the cookbook and have a ready-to-go solution.
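As an aside on lab one: here is the command-line alternative to the GUI import mentioned above. This is a minimal sketch assuming VirtualBox, with the appliance file and VM names as placeholders:

    VBoxManage import cloudbliss-lab.ova --vsys 0 --eula accept   # import the appliance, accepting its license
    VBoxManage startvm "cloudbliss-lab"                           # lab one ends with the VM started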
A good troubleshooting platform, by the way, is usually capable of optimizing all of these steps. So yeah, this is a nice feature to have. So let me explain what kinds of problems and issues we are going to address. I divide these into four classes. The first one contains the easiest problems: something is not working at all. They are easy to solve, because after changing something, it's usually working again, so it only requires an insight capability. The second class are the problems where something is actually working, but not as it is supposed to. So, I don't know, Nova is responding, but it's rejecting all of the requests. These problems are medium to solve, and they require some kind of analysis. The third class is actually an extended version of the second: problems where something is not functioning correctly, but the symptoms are not seen in the malfunctioning component, only in other components. You can think about different OpenStack components cooperating, and something is not working, like, I don't know, Keystone at the back end, and then you will see problems in the other components. If your hardware switch is not working, then you will probably see problems in Nova, for instance. These are really hard to diagnose; they usually require a capability of correlating different events coming from different components. And the last but not least class: the problems which were there and magically disappeared. We did nothing, and the problem is no longer there. These kinds of problems are impossible to find without a proper solution, because you need a capability of historical analysis. And if you have it, then these problems are medium to solve. So, it doesn't matter what your problem actually is. What is important is the fact that we have to have the knowledge, be able to gather this knowledge, and process it to shoot the problems down. So let me give you a small introduction into what knowledge actually is. It is literally any piece of information, any portion of data. It can be a log line, it can be some kind of event or API response, and even the logical architecture of your system is a kind of knowledge. So, in order to fix problems with ease, we need a platform capable of organizing this knowledge. And let me explain that the nature of the knowledge is quite nasty when it comes to complex systems like OpenStack, because the knowledge is present in many sources, it exists in many formats, it's being transported in many different ways, and sometimes it is only available on demand. Think about increasing the verbosity of the logs, for instance. And, well, this knowledge usually requires a process of transformation to be available for digging and processing. So in the next couple of slides I'm going to extend these points and explain exactly what I mean by them. Let's start with the sources and formats of the knowledge. Probably the most popular source is the logs. They exist in text or binary formats, and the logs are formatted in various ways. I had a nice example on the slide: a listing from nova.conf, from which I just took out three different format strings; there are more in this config file. So single components, such as Nova, log in very different ways in a single file (the listing looked roughly like the sketch below).
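The slide itself isn't reproduced in this transcript, but the format strings in question are of this kind; these are approximately the Icehouse-era defaults in nova.conf:

    logging_context_format_string = %(asctime)s.%(msecs)03d %(process)d %(levelname)s %(name)s [%(request_id)s %(user_identity)s] %(instance)s%(message)s
    logging_default_format_string = %(asctime)s.%(msecs)03d %(process)d %(levelname)s %(name)s [-] %(instance)s%(message)s
    logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d TRACE %(name)s %(instance)s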
Here's an example of the compute log. I have some IDs highlighted, and the corresponding ones in the resulting log are highlighted with the same colors. So what we see here is really nasty, because there are four IDs, for the request, the user, the tenant, and the instance, and as you can see in the log file, these IDs are written in three different formats: with "req-" in front, without dashes, and with dashes. This requires some kind of normalization to be processed, so we definitely need a tool capable of doing this. Another good source of knowledge is script execution. Scripts usually write to standard output or to a file, and, well, in scripts you can do literally everything, so the format of the outputs varies; it's usually up to you. And the scripts can be executed periodically, every five minutes, let's say, or only on demand, if you need to dig deeper for a while, like when increasing verbosity. We can use a script, for instance, to compare the knowledge of libvirt against the knowledge of Nova: if there is a difference in the lists of running instances, then we have a problem (a sketch of such a script follows below). Another source of knowledge is API responses. We should be capable of asking an API for some kind of knowledge, and these responses are usually in some kind of markup, like XML or JSON. As an example, Ceilometer is a good place to ask about, for instance, the disk IOPS of a virtual machine. Having the ability to speak different kinds of protocols is also an advantage; if we could use, for instance, OpenFlow to ask for a dump of the flow tables in Open vSwitch, that's really powerful, and it's becoming more and more powerful. And as you probably know, databases are designed to store knowledge, but we usually don't look that deep. Of course you can, for instance with a script, but most of the knowledge available in the database is accessible via APIs, so that is a better way to get it. And there are many, many more sources; I will stop at this point because I just wanted to give you a brief introduction into where the knowledge is. What is crucial is that the knowledge coming from different sources is not correlated. It means that these subsequent events are not linked in any way; we need to be capable of linking them, organizing them. Think about the process of provisioning an instance. If you order an instance via the Nova API, it will dispatch this request to Neutron, Cinder, Glance, and so on. This can be distributed across different servers, different machines, yet it is all correlated, actually; it all regards one request to provision an instance. So correlation is another must-have for troubleshooting. Traveling to work requires choosing a means of transportation, like a bike, a train, a car, and so on. It's the same with the transport of this knowledge. It can be transported as raw events over protocols like UDP or TCP. But if you're using some kind of solution like syslog, it has its own format, so the events are transported in syslog packets. If you're using more sophisticated solutions, like Logstash or Splunk, they use their own forwarders and their own custom protocols. So we have to be able to speak all of these to gather this knowledge in one place. Another nice-to-have feature is the capability of SNMP walks and SNMP gets; this is how we can ask for something over IPMI. And as I already mentioned, speaking REST when calling APIs is also something we need. So, when we dig for the knowledge, it can already be there. I mean, log forwarders, or things like syslog with remote logging, are sending the logs all the time, so we will receive this knowledge anyway; it's coming constantly.
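As a concrete sketch of that periodic script idea, here is roughly how a libvirt-versus-Nova consistency check could look. This is my illustration, not a script from the lab, and it assumes admin credentials are already sourced:

    #!/bin/bash
    # Compare the running instances libvirt knows about with the ones Nova knows about.
    virsh list --uuid | sed '/^$/d' | sort > /tmp/libvirt_uuids
    nova list --all-tenants --status ACTIVE \
      | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort > /tmp/nova_uuids
    diff /tmp/libvirt_uuids /tmp/nova_uuids || echo "libvirt and Nova disagree - we have a problem"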
But a nice capability is to have the knowledge on demand, and I will give you an example in a minute. So, I have a table here, where the first column describes when we dig for the knowledge. Sometimes we dig in real time. What I mean is that we want to observe what is actually being done right now, in this moment. This requires a capability of live analysis and is a prerequisite for triggering alerts. This is the proactive behavior of a troubleshooter, and being proactive is always better than being reactive. The other example is on-demand knowledge. This is what I mentioned before: on-demand is when you ask to go deeper just for a while. It is, for example, when you increase verbosity in some components. We can't keep all the components at debug level all the time, because that is a big performance degradation, but from time to time we need this. Another nice thing is that we can get knowledge from the switch: we can tell modern switches to set a trap on some port and send us a callback if this port is suffering from microbursting, for instance. That's a nice feature. And last but not least, we dig for knowledge post factum. We just want to know what had happened, so this is just historical analysis. Another step in building a good troubleshooting platform is transforming the knowledge. This process of transformation usually starts with normalizing or attaching timestamps. If there is no timestamp in the event, we just need to attach one; but if there is, we need normalization. Different services, different operating systems, different devices use different formats, and we want to use just one when we are looking for something. Another thing is extracting fields for further processing. These are usually represented as key-value pairs, and I have an example here of using a regular expression, a PCRE, to get the instance ID out of a log line: we match against this regular expression and put the result into a field, into a key called nova instance ID (a similar pattern is sketched below). And here is how it might look in a troubleshooting platform: key and value, these fields are extracted. Extraction can also be done according to markup. If you're using JSON or XML, the key-value pairs are actually already there; the only step we need is to index them, to extract them. OK. Another step of transformation is attaching the host name and source fields. We need to know from which point of our infrastructure events come. As for the source, it can be a script execution, it can be some kind of file, and so on. OK. And another step involves merging multi-line events. A good example is the stack traces usually written into the OpenStack logs. Although a stack trace is written as multiple lines, these lines don't make sense if they are not together, so this transformation includes merging related events into bigger pieces. OK. To accelerate future processing, the knowledge has to be indexed, and here we talk about two kinds of transformation modes. The first is index-time transformation, done at the moment the pieces of information arrive. This is when we attach the timestamp or normalize it; this is when we can match against some regexp and put the source and host name fields into the event. But some things can be done later, and it is actually a good practice to do some things later; you can, for example, add a new regular expression later to extract something new from the data already indexed.
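The exact pattern from the slide isn't reproduced here, but an extraction of that kind, pulling the instance UUID out of a Nova log line into a named field, might look like this; Nova writes the instance context as "[instance: <uuid>]":

    # a PCRE with a named capture group for the instance UUID
    grep -oP '\[instance: (?<nova_instance_id>[0-9a-f-]{36})\]' /var/log/nova/compute.log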
And the best practice involves using both of these modes; let's call it a hybrid mode. OK. A good troubleshooting platform needs even more features: not only gathering and processing the knowledge, as was already stated, but also things like presenting the knowledge with graphs, charts, visualizations, and so on. An important, I would say even crucial, thing is to be agnostic. It doesn't matter what kind of deployment tools you use for your OpenStack, it doesn't matter what operating system you run or what hypervisor you are using; it all needs to be gathered and understood in one place, because it is all related. And to make the knowledge available, we need to expose it via an API. And I think I already mentioned generating reports and triggering alerts. There are even more features, like high availability, scalability, elasticity, and so on. So this is it for the presentation; let's proceed to part number three, which is troubleshooting. Let me prepare myself. OK, so in a minute you're going to run a script which will break your OpenStack installations. You can use plain admin tools for troubleshooting, but CloudBliss is installed in this virtual machine as well, so you can use this platform. It's capable of insight, intelligence, analysis, and troubleshooting, so I will use it for the troubleshooting part of this presentation. CloudBliss is a troubleshooting platform which can do everything that was mentioned during this presentation, and even more. We will probably make this lab environment available in a few weeks, so if you are not here, or if you don't manage to complete the labs right now, you will have an opportunity in the future. How many of you have this VM up and running? OK, quite good. Are there any questions at this point? Does anyone need some guidance or assistance? OK, there are a few guys here with me who can help you, so just don't hesitate to raise your hand. If everything's clear, let's start with lab number two, which is logging in. We use "cloudbliss" as the password everywhere. And run the lab start script, which will break things. So in a minute your OpenStack will suffer from some problems, and we will proceed through the further labs to fix them. OK, in the meantime, run your browser and visit localhost at port 8000. This virtual appliance was configured to forward ports, so we are connecting to localhost and the connection is forwarded to the appliance. OK, and in the second tab, please open the OpenStack dashboard and log into Horizon with the demo user. Again, the password is cloudbliss. What we can see here is that we have a lot of instances in this project, in various states: some are active, some are rebooting now, and some of them are shut off. So let me give you a brief introduction into what we see here in CloudBliss. Let's start with the summary dashboard. What you see here is a demo of CloudBliss, limited to this environment. You have a one-node installation of OpenStack, which greatly simplifies things, so there's only one host. And here, in the first panel, we can see the events generated by this host. As you can see, there are over 3,000 of them, and this machine has been running for 20 minutes now. So that's a lot. Here is a table with the services and information on whether they were configured to start at boot time and whether they are active or dead right now.
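That services table corresponds to what you could check by hand on the node; a small sketch, assuming the Red Hat-style init tooling an appliance like this would have, with the service name only as an example:

    chkconfig --list openstack-nova-compute   # configured to start at boot?
    service openstack-nova-compute status     # actually running right now?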
This panel divides the messages into the different severity categories, like audit, debug, error, info, and so on. There is a count and a trend; we just started this machine, so we won't see any trend yet. This is what I like the most: a table presenting the various OpenStack components and how many logs of each severity each component produced. So, for instance, the Cinder API extensions produce a lot of audit messages. And it's the same here, but on a graph, so you can click on the graph and you will be moved to the raw events behind it. Another useful tab is OpenStack Errors. Here's the list of the components and how many errors there are in each component. As you can see, our OpenStack is probably broken; there are a lot of things which are not working. This tab is actually a limited version of the previous dashboard, because it only shows you what is not working. And here is the same thing on a bar chart, and here we can see the raw events which are being taken into account. OK, so enter the search tab and click Data Summary. What I like here is that we get a kind of global view of our infrastructure in one place. As you can see, we only have one host, but it's producing events with three different host names, so this is not really good. The second tab represents the sources of the knowledge. Most of the entries on the list are logs, so these are the log paths, like the Swift proxy log under /var/log. And here are the source types of these different logs, like the Cinder API log. The source type is actually a clue for how to transform that kind of event. OK, so this is it for lab number two, and now it's your turn. Feel free to ask, and I will give you a few minutes to do the same. And here is what I forgot about: I had a printed version of the labs, but they are also available as a PDF in your USB content. So just open this PDF; for each lab there is an introduction, the goal, and the solution described. OK, are there any questions? Did you manage to log in? We have a lot of t-shirts like this one, so please ask questions to get them. Yes? OK. You want a t-shirt? You have to ask a question. Yeah, why not? OK. So what's your question? Well, this question is quite tough, actually, because we have been developing this solution for less than half a year, and we just want to show it to you and get feedback. Most of the components are open source, but partially they are not. If you want details, I'll give them to you after the presentation. Yes? Actually, you can execute those commands, although it's not what we are actually doing, because all of the knowledge that would be presented by executing any of the Python client commands is available via the API, so we prefer to use the API. Yes, yes. So the output of the APIs is presented there, yes, as a path to the script executing the API calls. Yes? We are totally focused on OpenStack; this is our advantage. We have spent a lot of time getting things to work, and now we have a platform which is extensible. So in a few weeks you will receive this thing with preloaded dashboards, and partially, yes, partially, yes. OK, OK. Hello? Yes. I have a question before this. Where are you? Here I am. How does CloudBliss read the logs? Does it require some kind of keys to go into the OpenStack machine, read the logs, and come back to the interface? So we can actually get the logs via syslog and via forwarders; it's up to you.
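For the syslog path, a minimal forwarding rule looks something like this, assuming rsyslog on the nodes; the indexer host name is hypothetical:

    # e.g. in /etc/rsyslog.conf on each OpenStack node
    *.* @@cloudbliss.example.com:514    # @@ forwards over TCP; a single @ would use UDP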
So, first of all, do we need to install CloudBliss on a separate machine, or can it be an OpenStack machine? Yeah, it has to be present on every machine which generates logs, but only the forwarder part. CloudBliss consists of two main parts. There is the engine: what you see here is the engine, together with a kind of dashboard for the users. And each OpenStack machine has a forwarder installed, so we ask the forwarder to forward logs and to execute commands on demand. So do we need a minimum of one machine for the CloudBliss installation, or should we have more? Yeah, one machine is enough. It depends on how many logs you have and how many of them you want to index and search. All right, thanks. Yes. So we are actually version agnostic. We made a platform which is capable of organizing the knowledge; you can use it for things other than OpenStack, for instance. Question? Question here? For example, you could, I don't know, use a VMware log format and get it in there. Question here? Yes. Could you please describe the Splunk integration with this one, how Splunk is integrated with it? Yes, Splunk is in there. We use Splunk as the core engine for getting the knowledge and indexing it. OK, let me proceed to another lab; I'll answer more questions after the next lab, as we have a tight schedule and time is short. So, on to lab number three. What was the question? OK, so one more question. Thank you. I just wanted to understand what sort of information we can get from the Neutron side. Say, for example, looking inside VXLAN or GRE tunnels: rather than looking at the log, do you have a way of pulling data out of that to analyze? Yes. Actually, how deep you go is also up to you. You can forward all of that information and make it available in this platform, and this includes what Neutron has in its database; it's available via the API. But you can also dump flows from Open vSwitch, and if you're using some ML2 plugin, there are switches which are capable of sending their logs. So, to clarify that: for example, with Open vSwitch flow dumping, can you do that from your interface? Is that possible? In this demo we are only a consumer, and from the consumer side we are not able to run any commands on demand. But yes, this feature is available; it just requires a tiny trick. Thank you. OK. Yes? I have three questions, actually. Oh, OK. That's too much. No, maybe one isn't really a question: is there any way you could either hold the microphone closer, or maybe they could turn it up, because it's hard to hear you. But the main question was: how easy is it to extend this to analyze extra log files? Like, if I have a product with other log files, could they be incorporated? Yeah, so it depends on what you want, because gathering the knowledge is really simple, but if you want to process it, you need to do things like field extractions and so on. So it's up to how you want to use it, but it's possible. And is it easy to do? Yes, of course. OK. Well, one of the nice features is that if you are already using some kind of troubleshooting platform or central logging solution, you can feed CloudBliss with all the logs you already have and start with historical analysis. What we are working on now is integration with ELK, that is, Elasticsearch, Logstash, and Kibana. And I think that this is a good step for OpenStack, because there are rumors that it's being used.
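Coming back to the Open vSwitch question for a moment: the flow dump itself is standard OVS tooling, for example on a compute node (br-int is the usual Neutron integration bridge):

    ovs-ofctl dump-flows br-int    # dump the flow table of the integration bridge
    ovs-vsctl show                 # bridges, ports, and tunnels on this node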
And my other question is about heuristic analysis, or event correlation, that kind of thing. Is there something in there, beyond just looking at trends, to maybe add rules that say: if I see this kind of event in this log and this kind of event here, then it can conclude something more interesting from the combination of those? We call it a transaction. Yes, you can do things like this. You can describe your own rules to combine logs, to make them one piece of knowledge and process them. It's not, I would say, really easy; it requires you to spend 15 minutes learning how transactions work, but it's available. OK. So, as I understand it now, CloudBliss is based on Splunk. Where are you? Over here. OK. Yes, that's true. So there is Splunk, and are you looking to integrate Elasticsearch, Kibana, and Logstash as a replacement or in addition to Splunk? Well, that's a tough question. We'd love to hear your feedback, and it depends on your use case. We chose Splunk up to this point because it's really easy to extend. And speaking of Elasticsearch, we would need to go really deep into the code and change those components, because they are not as mature as Splunk. Oh, thank you. OK, so let's proceed to lab number three, unless there are more questions. OK, that's great. So the lab is quite simple. One of the cloud customers called our support services and said that his application had stopped responding. He logged into the instance and saw I/O errors. Having nothing better to do, he rebooted the instance, and it is not working anymore. So the instance is down right now, and this is all we know. So actually there are two problems. One problem was already described; the other problem is that our support services didn't write down the name of this guy, or anything like an instance name and so on. So we need to find this information. And because we have nothing better to do, we can start with viewing the summary. A good point is to look at the services which are enabled at boot time but are not working right now. And there is one like this: it's tgtd, the iSCSI target daemon. It's enabled, but it's inactive right now. So let's make tgtd our number one suspect. OK, now let's take a look into the OpenStack Errors tab, and into the Recent Errors. As you can see, there is a quite nasty log containing a stack trace from Nova, and this log ends with the information that libvirt got an error: it cannot attach a volume. And here, in the path of this volume, you can see that it's transported via the iSCSI protocol. So it makes sense: tgtd is not running, and there's a problem with iSCSI. There are more logs. This lengthy one comes from the oslo.messaging component. And what we can do right now is use our own knowledge, not only the knowledge which is in there. We know that Nova is responsible for running the instances, so I'm not really interested in what oslo.messaging is saying; I'm more interested in what Nova knows right now. So there should be a Nova entry here. Yes, there is: the Nova compute manager produced one error. If you click on it, you will be moved to the search bar, and you will get an error like this. It says that this request, with this project and user, cannot boot this instance. If you click on this arrow, you'll see the fields extracted at search time. And here is the list of the fields.
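The extracted fields at that point amount to something like this; the field names here are illustrative rather than the exact ones on screen:

    host             = <the compute host>
    nova_request_id  = req-<uuid>
    nova_project_id  = <project uuid>
    nova_user_id     = <user uuid>
    nova_instance_id = <instance uuid>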
So, up to this point we know what host is affected, what instance is affected, and what project is affected. That's a lot. And it doesn't matter if you have a thousand instances or a thousand compute nodes; it's usually as easy as you see here. So let me copy this instance ID, because we will need it later. And the thing I really like is that if you put just an instance ID in here, all of the events will be limited to the ones regarding this instance. So, as you can see, something regarding this instance happened on two hosts, and it comes from seven different sources of knowledge: from Neutron, Nova, Horizon, and some custom-made scripts of CloudBliss. OK. So we know that the root cause of the problem is probably tgtd, which is responsible for the iSCSI transport, and it logs into /var/log/messages. So why don't we look into /var/log/messages? You could put source=/var/log/messages in here, but I know that this kind of log is forwarded by syslog, so I'm looking for the syslog events. And yeah, it's definitely not working; something is failing. So I'm now pretty sure that the tgtd daemon is not running, and this is the root cause. And I know which instance is affected: the user said that he couldn't reboot an instance, and we saw this in the logs. OK, so let me log into the virtual machine. SSH is forwarded on port 8022, and of course we use cloudbliss as the password. OK, so why don't we start the tgtd daemon? OK, up and running. So let's refresh what's in syslog, looking just at the last 15 minutes. And as you can see from the first lines of logs in syslog, tgtd is up and the iSCSI transport is operational now. So that's good. Let's move back to the console and issue some OpenStack commands. First of all, source the admin credentials, which are available in the keystonerc_admin file. Let's see what Nova knows about the instance with this ID. OK, so there's a long output; it's long because it contains a stack trace. We can now see that the instance is in an error state. So we've solved the root cause, but we need to bring this instance back to life, and this is the stack trace. So let's try starting this instance. And, well, we can't. Why can't we? Because the instance is in an error state. So the easy solution is to change the state of this instance manually, and there is a special Nova command for this, called reset-state. By default it resets the state to error, so I put "active" in manually here. OK, the state is reset, and we can start the instance. I will do this with the virsh command; this is how we tell libvirt to start an instance. And by the way, we don't have to use the instance name: we can use its UUID, which is a nice feature of virsh. Well, this is it for lab number three. I've just shown you how to fix things, and now it's your turn. So good luck, and we'll start the next lab in 10 minutes. Feel free to ask. Yes? This nova reset-state --active: what does it actually do? Does it update the database? Yes, yes. It only connects to the Nova database and changes the state from error to active, and that's all. The instance is not running; you have to start it manually. It appears in the Horizon dashboard, as the status says. Really? Yes. And it throws the same error, so I wonder if the reset-state can fix the same issue? Oh, maybe. Yes? OK. Maybe. Thanks for the comment. You can search through the dashboard logs and dashboard events as well, so maybe there's some valuable information in there.
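To recap lab three as one command sequence; a sketch, where the service and file names assume an RDO-style install like this appliance, and the UUID is a placeholder:

    service tgtd start                         # bring the iSCSI target daemon back up
    source keystonerc_admin                    # load the admin credentials
    nova show <instance-uuid>                  # the instance is stuck in the ERROR state
    nova reset-state --active <instance-uuid>  # flip the state in the Nova database only
    virsh start <instance-uuid>                # actually start the domain via libvirt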
Why did you use virsh, and not Nova, to start it? To simplify things for now. Because Nova actually thinks that this instance is running, so it doesn't allow us to start it again. We told Nova that this instance is active, so we can't use a Nova command to activate it one more time. Yeah, this is a dirty hack, but if you manage OpenStack, this is what you do every day. The question was what version of OpenStack we are using here, and it is Icehouse. Yes? Which Nova? OK, I'll take a look. The lab start command. OK, so yes, your OpenStack is working; you need to break it first. Well, for instance, there is no way to do that. OK, why don't you grab a USB stick, import this machine again, and start over? No, no. OpenStack is really heavy, and CloudBliss is also quite heavy. Did you manage to run this on a tablet? This appliance? Good tablet. You need to pass all tenants: dash dash all dash tenants. Yes, because you're listing what's in the admin project, and what's broken is not in the admin project. I just want to add this. Yes? OK, let me type it in here. So there is something like --all, yes, yes, yes. So that's better. OK, here's a special case: you have your cloud, with some VM running with some experimental plug-in, and you use your cloud to reach that particular VM. Do you have a possibility to collect the path of the packets to that particular VM? I mean, to check all the interfaces which are between you and that particular VM. OK, so it is possible. But the question is, do you really need this? Because it will produce tons of logs. But you can, and you can correlate every single step in the path. It would be a little bit difficult, and even more annoying, to go through all those logs. Yeah, so I wouldn't do this in production; I'm quite sure it would drag the performance toward zero. No, it's more like a platform to organize the knowledge; you can put literally everything in there. So it's quite easy. Yeah, because you had the... Well, with scripts distributed over all of this? Yeah, so what you're asking for is quite easy, really. Because if you attach, I don't know, a checksum, you can log the checksum of the packet at every single point, and then you see the path in there. So it is possible. It's not really easy, but after an hour or two you will see it. Do you have a t-shirt? I'll bring you one. So virsh list will show this name as instance-something, not the Nova name? Yeah? You can use the UUID. So actually, it is being grabbed here with this ID, along with the name of the instance. But in the virsh list, the name is instance-something. Okay, so this is why we really need a troubleshooting platform: libvirt has its own database, and Nova has its own, and the only thing which links the two databases is this UUID. Okay, so in the database of the... Got the name. You've already finished? That's great.
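To make that name/UUID link concrete: libvirt's names follow the instance-XXXXXXXX pattern, and the UUID is the common key. The domain name below is just an example of that pattern, not one from this lab:

    virsh list --all                    # libvirt shows names like instance-0000001a
    virsh domuuid instance-0000001a     # prints the UUID, which is what Nova stores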
Okay, now it's time to proceed to lab number four. The scenario is like this: we already fixed the problem, but there could be some quality-of-service or service-level agreements in your company, and you need to write a report. You need to determine, for instance, the downtime, and you need to include details like what project, what user, and what instance were affected. We still don't know who filed this issue; it was someone. So let me show you how we can find this information. We still have the ID of the instance. I'm going to put it here and look for the events from the last, let's say, four hours. There was a "reboot" word in that problematic line. So, yes, this is what I was looking for. If you click here, you'll see the fields extracted from this request, and they contain the project ID and the user ID. So the only thing we need to do now is to somehow translate these IDs into names. But we are pretty sure right now that the project with this ID was affected, the instance with this ID was affected, and this user was the one who rebooted the instance and filed the issue. So let me click on this and just look for all events containing this field, nova project ID. In this project there are probably many, many instances. Let me find this under "more fields": nova instance ID. Okay, if you select this field, it will be ready up there for use. Okay, so there's only one instance in this project which generated events during the past four hours. That's quite surprising, but it's possible. And if you want to get some additional information from the database, we can use a special kind of macros in this platform. One of them is the "os name project" macro, for naming the project. And as a result, we receive an additional field here with the name of the project. So the project affected is just "demo". You can map any IDs; we're grabbing live information from the database via API calls and correlating it with the event lines. Okay, and we could use other macros to map, for instance, tenant names or instance names. This additional field is not only present on this interesting-fields list; it's also attached to every single line of every event. So if you click on the arrow of an event, you will see an additional field called name. The affected instance was named "police", and we've already seen this in the logs. So the real question is: which user filed the issue? Let me look for his ID, this project, and "reboot". We'll try to find the name of this guy with the "os name user" macro, and we'll see an additional name field here. So the user who clicked reboot and produced this line of log is named demo. Up to this point, we know what project, what instance, and what user were affected. But the real question in this lab is: what other users were affected? They didn't notice, but they were affected by this problem; they had access to this instance. So let me delete the "reboot" keyword, and that way we'll see everything regarding this project, for every user in this project. So the name field should contain more names, like Derek, Patrick, and so on. It was demo who actually filed the issue, but any of these users might have filed it; all of them were affected. A good step from this point would be to generate some kind of report, which is outside the scope of this lab, but it's possible and really easy. So this is it for my walkthrough of lab number four. Now it's your turn.
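If you would rather do that ID-to-name translation by hand, the Icehouse-era keystone client can perform the same lookup; the IDs below are placeholders:

    keystone tenant-get <project-id>    # prints the tenant's name (here it was "demo")
    keystone user-get <user-id>         # prints the user's name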
Do we have any questions? Do all of you have t-shirts? Yes? I can't hear you. Yes, Splunk is in the back end, yes. Yeah, yeah. So we built our solution on top of Splunk, actually. How does it work with licensing? So this is the guy responsible for that. OK. So I'll speak at last. OK, so the question. Yeah, the question. Oh, I'm just curious: if you guys are using Splunk on the back end side of it, how do you deal with their licensing model? OK, so the idea is that what we built is actually analytics on top of Splunk. The reason we went with Splunk for this exercise is the time it took us to come up with a working prototype and be able to test different ideas, since there are no tools on the market that we could mimic, because the idea of troubleshooting itself is, in our opinion, somewhat unexplored. So for this exercise we use the trial version, which eventually turns into a free version, because Splunk is available with such an option. If someone would like to apply those ideas, those extractions, the analytics that we came up with, they should acquire the license that Splunk requires them to have in order to analyze the amount of data that they have. But for us, it's more of an exercise in creating a tool than in just applying Splunk to the equation. We're also actively looking into building a similar solution based on, for example, Elasticsearch and Logstash. Those platforms are not similar in features, and their capabilities vary quite a bit, especially in the field of analysis and visualization. You can use Kibana to visualize, but in Splunk you can leverage some pretty cool tricks when it comes to building prototypes on real environments. So what we'd like to do in the future is go with a kind of dual model: build an Elasticsearch and Logstash based tool, which will obviously be open source, and provide additional capabilities utilizing Splunk as an option for customers. So the reason we are showing it to you is actually to get some feedback. I mean: is such a tool interesting to OpenStack deployers? Should we develop it further, and in which direction? What kind of features should it have? Should it utilize Splunk more, or Elasticsearch more, and so on and so on? No, so I think this tool is great. And I have my own reasons, because last fall in Hong Kong the eBay chief architect gave a talk about their architecture and all the problems they had to debug, and everything. And I actually had to go down to Kuala Lumpur, a few days after Flight 370 went missing, to do a class down there on OpenStack. Well, my development lab went down right before that, and we did like two or three days of Googling before going down there, right? And none of the answers were correct, and we didn't get it up in time. And that's when I had that epiphany from Hong Kong last fall with eBay: this tool probably would have helped a lot, right? Because we could have dived right down into it. So I think this is really good. I just need to... One question I have: the trunk is updated nightly, right? And there are actually thousands of lines of code updated weekly, right? How would you keep up with all the changes? I think that's probably my biggest question. Well, yeah, it's actually something that you run into during development. On several occasions we had a need to actually patch something in OpenStack in order to have consistent logging. So, well, we discovered some ways to parametrize OpenStack to the point that it actually started logging in a way that let us leverage the platform to the full extent. But the way we see it, the tool itself can analyze any text, whether it's Logstash, for example, or Splunk; and for this particular case it utilizes regular expressions, so it's agnostic in this way. But on the other hand, it requires maintenance, because if someone changes the way the platform logs, then you have to adjust and retest your regular expressions, for example. If it's more markup based, then it probably won't be so much of an issue.
But it will require maintenance. So, we expect that if it's supposed to be a kind of out-of-the-box tool, it should be tested against the major versions, and we don't see a way to escape that, actually, at least not yet. What kind of hardware do you use for all your testing? I mean, is it just white boxes, or what are you using for your testing? Let me interrupt you: there is lab number five still to be done, and it's quite easy and quick, so just do it on your own, and I will give a t-shirt to the first few people completing this lab. So just start it; I wish you luck. So, yes, my presentation contains my contact details; it will be available. He wants large, and he wants medium. It's small. Okay, I thought you already had one. Okay, so I'll bring it to you. Did you manage to do the lab? Yeah, it was very simple. So, what was the name of the process? What? What is the name of this process, the problematic system process? Oh, well, as far as lab two or three... Okay, okay. Thank you, sir. The presentation is available online, and my demonstration is simple. Yeah, so we will most probably... we have some small license issues right now, but in a couple of weeks it should be available on our website. I can give you my card and make it available to you if you want. Well, yeah. Yeah? The integration. That's actually... it looks like a good collection. Yeah. So, that's it for the session.