All right, we're at the top of the hour, so good afternoon, everyone. Welcome. I hope you're enjoying it; it's been a really awesome summit. My name is Diego Cassari, I'm a principal corporate systems engineer with Dell EMC, and I'm joined today by David Sanchez, a senior software engineer.

So to start our conversation: how many of you here know what an RCA is? Please raise your hand. OK, so knowing is one thing. How many of you have actually done an RCA? OK. Is it a fun experience? Some of us think yes. OK, good.

To talk a little bit more about what root cause analysis is, let's make an analogy from the real world. In this picture we have a real problem: the house is on fire. We can see the firefighters, we can see damage to the house, we can see the different crews working with their protective gear and the proper equipment. Now, if you think about it, what we have here translates very well to our everyday practice. The different teams work on site. As we see here, they need to be ready, they need to be fast, and they need a way to deliver the same kind of skill set across the board. And it's not only the team on site doing this; remember that there's an aftermath as well. We have folks coming in and analyzing the scene, we have people inspecting the damage, and then at the end, really pinpointing the root cause of the problem. So it's very similar to what we have in our practice.

There are three core challenges here. Think about the firefighters again. If this is your house burning, you want them to be there fast, really fast. You want them, first of all, to be well trained, and to have a procedure for fighting the fire. It's the same for us in our practice: the business expects us to have the proper training and the proper methodology, but also to be there on time, to be there fast, to respond to the events. Those challenges can certainly be tackled in different ways. Today we would like to walk you through four areas that we believe address those three challenges. Number one, systems complexity and an understanding of how systems work. Number two, having the proper power tools. Drawing an analogy back to the firefighters: how many of us here are actual firefighters? None? OK, so this applies to all of us. What if there was a fire and they gave us the fire truck with all the equipment? We can probably write code, but we don't really know how to extinguish a fire, even with the right tooling in our hands. So having the right tool set, plus some other things, will help us. Third, we have some use cases as well; we're going to go over those and show, with the new tools, exactly what we gain from them. And then finally, proactive maintenance. With that, we believe we have a solid foundation that will drive us and help us get to a better RCA.

So just to start, let's talk a little bit about systems complexity. When you first join OpenStack and you're delighted with all the things you can do, all the great technology, the first thing we usually see is the picture in the admin guide: the three boxes. You have compute, networking, and storage, and you have a UI. But in reality the systems are more complex, and we need to look at our systems the way they really are.
If you have a system whose components intercommunicate and depend on each other at various levels, a fire can break out in one place or another, or yet another. And it's pretty difficult to say, at least at the beginning, that it was a problem in that particular component. It is not an easy task. It's also difficult to say, well, I have a problem in Nova, or in RabbitMQ, or in another messaging bus, and that is spilling over to another side of my cloud. So having a systemic understanding, recognizing that we are not dealing with small individual boxes anymore, is really key. We have to change the way we see our practice. We're not dealing with individual boxes, and quite often we're not even dealing with individual clouds; you might have interconnected clouds too, some on-premise, some off-premise. So think systemically.

To help with that, we have a new set of tools. As much as I love grep, you can't just grep everywhere. Well, you can, but what happens when you scale from a single node to 1,000? You can certainly do some stitching; you can put some automation in place, like Ansible, and grep everywhere. But then you have other problems, such as correlating that data and maintaining it in a way that other people can see it. For that, you have new tools, for instance the ELK stack, which is going to be the focus of the presentation today.

Coupled with that, remember the fire truck analogy: we all have this great new tool, but we don't really know how to use it. Brendan Gregg wrote a book called Systems Performance. It's an excellent source; if you haven't read it, I highly recommend you get a copy. And it's a way to make things standard within a group, so that we're all talking about the same thing. Brendan describes the USE methodology. USE stands for utilization, saturation, and errors, which means you go to each component of the stack and ask: how is it utilized? Is it saturating somehow? Is it misbehaving or throwing errors somehow? So you look at each component individually, but at the end you can step back and see the big picture (there's a small sketch of that idea just below). Another approach is to divide and conquer between different teams, but collaborating. Remember, right now we are not only the devs, or the Unix admins, or the network operators, or the storage admins; we're pretty much dealing with everything across the board, because a problem on one side can spill over to our side. So as a team, we need to make sure we all understand what we are doing.

Now, going back to this example: you see we have the three fires, and finding exactly which machine and which log to look at, and also what caused it, is finding a needle in a haystack. You have a tremendous amount of information, but it's not easy to drill down in time to get to it. If time weren't a constraint, sure, you could just spend a long time doing it, but that's unrealistic: we usually have pressure from the business or from customers to get to the problem as soon as we can. The final idea behind the tooling we're talking about today is what this picture illustrates: a single pane of glass that allows you and your team to drill down on a time span and on a specific problem, and remove all of the noise. I studied telecommunications engineering, and in telecom you have a signal-to-noise ratio; a lot of radio devices will filter out and remove the noise, because otherwise you don't get the message.
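To make the USE checklist a little more concrete, here is a minimal sketch of what a per-host snapshot might look like. This is purely illustrative and not something from the talk: it assumes the psutil Python library is available, and it deliberately leaves the E (errors) part to log analysis, which is exactly where the ELK stack comes in.

```python
import os
import psutil  # assumption: psutil is installed; the talk does not name a specific collector


def use_snapshot():
    """Rough USE-style snapshot of one host: Utilization and Saturation per resource.

    Errors are left out on purpose -- in practice they come from logs
    (dmesg, syslog, OpenStack service logs), which is what ELK helps with.
    """
    cpu_util = psutil.cpu_percent(interval=1)   # utilization: % of time CPUs were busy
    load1, _, _ = os.getloadavg()               # saturation proxy: 1-minute run-queue length
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_usage("/")

    return {
        "cpu": {
            "utilization_pct": cpu_util,
            "saturation_load_per_core": load1 / (psutil.cpu_count() or 1),
        },
        "memory": {
            "utilization_pct": mem.percent,
            "saturation_swap_used_pct": swap.percent,  # heavy swap use hints at memory saturation
        },
        "disk_root": {
            "utilization_pct": disk.percent,
        },
    }


if __name__ == "__main__":
    print(use_snapshot())
```

The particular metrics don't matter much; the point is that every component gets the same three questions asked of it, so people on different teams end up speaking the same language.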
Back to that signal-to-noise idea: it's the same thing as when we're driving and we sometimes have to turn off the radio so we can figure out where we're going or catch the address. That's what we need here. And from a single pane of glass like that, you can not only filter the information, you can also customize it for your other colleagues or crews. We'll see that in a second.

With that, let's talk a little bit about power tools and what I mean by that. There are various ways to chop down a tree. You can go the old-school way, with just a saw, and try to chop it. That takes time, and likely a lot of coordination, and depending on the tree, you might not be able to do it at all. But with a power tool, you can be very precise, and the cut is easy. Now translate that idea to what we're doing here. There are some core capabilities you want when performing an RCA. Number one, you want a way to audit the system; generally, you want to see how the system is behaving and how that behavior deviates from a baseline. Number two, you want to be able to efficiently do a health check of the critical components; you want to make sure you're looking at the right place and not analyzing outliers. Number three, you want to work on the real cause of the problem. Of course we will fight the fires; you need to put the fires out. But we can't spend all of our resources just fighting fires. We need to go back and see where this thing started, because if you don't break that cycle, you end up right back in it.

Some key takeaways from what we're trying to show today: number one, make the troubleshooting steps a standard. You want a standard so that when newcomers join the team, or when you have to interface with different parts of the organization, it's easy for them to follow and easy for them to comprehend; you don't have a really steep learning curve. You want to be as proactive as you can. Again, we will fight the fires, but we also don't want to be dousing our houses in gasoline; that doesn't make sense. And make sure you shorten the initial investigation. Many times we know what to do and how to do it, but it takes a humongous amount of effort for our crews to get together and start, because of various complexities.

The power tools we're mentioning here today are Elasticsearch, Logstash, and Kibana; the three of them together form the ELK stack. To give you the ELK stack in a nutshell: every component in OpenStack generates logs. That's simply a fact. Nova generates its logs, Neutron generates its logs, and so forth. The more logs you have, the more difficult it is to search through them. What ELK provides is a way to aggregate, index, and display that information to the end user. Very simply, you move the data along this chain, and at the end you have something that is user-friendly and can be customized. Going back to the chain of functions we had before: you have to move the logs with something such as Log Courier, which ships them; Logstash aggregates them; Elasticsearch does the indexing; and Kibana provides the display.

Let's talk a little bit about collection. Log Courier is, in our case, the collection example for the ELK stack. You have the logs and you just move them to the Logstash server. In our implementation, we also do some normalization at that point.
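As a rough idea of what that normalization step means, here is a minimal sketch that turns a raw OpenStack-style log line into a structured record before shipping. The log layout, field names, and regular expression are assumptions for illustration, not the actual Log Courier or Logstash configuration used in the product (in practice this would typically be a Logstash grok filter rather than Python).

```python
import json
import re

# Assumed OpenStack log layout: "<timestamp> <pid> <LEVEL> <module> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<severity>[A-Z]+) (?P<module>\S+) (?P<message>.*)$"
)


def normalize(raw_line, hostname, part_type):
    """Turn one raw log line into a structured, case-normalized event."""
    match = LOG_LINE.match(raw_line)
    if match is None:
        return None  # unrecognized line; ship it raw or drop it, per policy
    event = match.groupdict()
    event["severity"] = event["severity"].lower()  # normalize case so filters stay consistent
    event["device"] = hostname.lower()             # hypothetical field names for host and role
    event["part_type"] = part_type
    return event


if __name__ == "__main__":
    line = ("2016-04-25 14:02:11.123 9876 ERROR glance.api.v2.images "
            "[req-abc] Image 2f5b01aa could not be found")
    print(json.dumps(normalize(line, "Cloud-Control-01", "glance controller"), indent=2))
```

The important part is that every event arrives downstream with the same fields, in the same case, so that one search works across Nova, Glance, Cinder, and the rest.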
And that's going to become clearer as we go. Next, you have an aggregation layer. What Logstash does there is centralize all of the data coming from the collectors. You do another round of normalization, basically across the various schemas and formats you receive, and there's also extraction of log events. Again, this is key, because you want to do correlation: now you're looking at a Nova log, but you're also looking at a Cinder log, and how do they relate to each other? Logstash passes this along in real time, so it doesn't really store anything; it digests the data and sends it to the next level, Elasticsearch. Elasticsearch, based on its indexes, lets you do real-time analytics and real-time search. In the implementation we currently have, Elasticsearch is scalable and deployed in a highly available way; you don't want to move all your data there, have a single point of failure, and lose all your visibility. So that's a key aspect of Elasticsearch. And then lastly, you have Kibana providing a flexible way to visualize the data and a flexible way to do analytics, also in real time, which is important too. The interface Kibana provides is very user-friendly; you can add and remove different views to suit different groups. The final result is the information delivered to the user, or set of users, in a very user-friendly fashion. For that, I would like to invite David for a quick tour of Kibana.

Thank you, Dio. Can you hear me fine? OK, so first of all, the presentation. I'm going to start with a quick tour of Kibana. This picture shows you the default dashboard of Kibana. As you can see, there are three sections. At the top there's a timeline with all the events that happened, in this case, during the last week. In the middle of the picture we can see different charts based on the different components, the severity, the platform nodes or the Cloud Control node, and so on. And in the third section there are three charts where we can see the different components that are raising events on each platform node. Kibana also shows all the events using color codes, so we can see at a glance how our system is working.

If we click on the Discover tab, we can apply filters to search through our logs. For instance, we can filter on severity error to narrow down our events, keeping only the events that contain the error string in the severity field. We can also modify the time period to increase or decrease the window of time we're looking at. Kibana also allows us to add or remove different views: if we hover the mouse over each field, we can add or remove filters to increase or decrease the number of fields you see in your view. Kibana also allows us to customize our views. This is really useful because the tool can be used by different tenants, and we can tailor the views to our requirements. For instance, an operator may prefer to see a shorter time frame than, say, a manager who wants a report of the events. We can create dashboards, save them, and load any of our previously saved dashboards.

OK, now we are going to talk about a use case to show how we can work with Kibana to find out what's happening, or what's wrong.
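A quick aside before the use case: the severity filter just described corresponds, roughly, to a filtered query against Elasticsearch, which is what Kibana builds for you behind the Discover view. Here is a minimal sketch using the elasticsearch Python client; the endpoint, index pattern, and field names are illustrative assumptions, not details of the actual deployment.

```python
from elasticsearch import Elasticsearch  # assumption: the elasticsearch client library is installed

# Hypothetical endpoint and index pattern -- adjust for your own deployment.
es = Elasticsearch(["http://elk-host:9200"])

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"severity": "error"}},                 # same idea as Kibana's severity filter
                {"range": {"@timestamp": {"gte": "now-7d/d"}}},  # same idea as the time picker (last week)
            ]
        }
    }
}

response = es.search(index="logstash-*", body=query, size=10)
print("total hits:", response["hits"]["total"])
for hit in response["hits"]["hits"]:
    source = hit["_source"]
    print(source.get("@timestamp"), source.get("device"), source.get("message"))
```

Narrowing down in Kibana, as in the use case that follows, is essentially stacking more of these filters onto the same query.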
Imagine, for instance, that somebody phones you and says they have a problem with the glance controller, and we have no more information than that. With this tool, we can apply filters to narrow that down. In this case we have applied two filters: we are filtering by severity, keeping only the error events, and we are also filtering on the part type glance controller. Now we only see the events that match both filters. We can see that during the last week we had more than 1 million hits. If we zoom in, we can see that this event started on a specific day and time. If we continue zooming in, our search keeps narrowing down, and we can see the exact time the event started. Now we can click on one of the events at the bottom of the picture, and we see the detailed information for that specific event. And now we can see that the problem is that a user, or a script, or something, is trying to use an image that doesn't exist. With this information we can apply more filters, because Kibana allows us to use operators like AND and OR to refine our filters. And if we select all the messages that contain that nonexistent image, the severity error, and the glance controller, we can see that this error started on a specific date and finished four days later. So we can tell that the image has since been uploaded, or that the script is no longer running. OK, now you can continue.

Thank you, David. So with Kibana we were able to go from a generic conversation about "I have a problem" to: what is your problem, where did it happen, and is it still happening? And you can see that in the graph David showed.

Now, let's move to the next level. Up until now we've been showing historical analysis, before and after, but that's still reactive. Doing proactive maintenance is the next level. One of our products, the VxRack Neutrino, has that proactive component baked into it, and if you haven't had a chance, you can actually see it live here; we have a live rack at our booth. But the key point is this: even though I have my OpenStack and it sits on nice hardware, we still have to monitor it, and we still have to take preventive measures. As an overview, Neutrino has a control plane with monitoring and reporting that allows us to do proactive maintenance. We have the ELK stack with Kibana, which we've just seen, and alongside that we have the hardware with OpenStack and some other services, such as platform as a service. Our approach is based on alerting and on looking at all the layers of the stack. So you start from the hardware: do I have an alert on the operating system? Do I have an alert on the physical hardware components? And you go all the way up the stack, which includes all of the networking, both virtual, so Neutron and friends, and physical, as well as the storage. But we also look inside the tenants: does a VM or a tenant have a problem? So we look all the way through the stack. With that set of tools we can also do forecasting. So we're changing the game: we're not just looking back at the past, we're trying to make educated decisions about the future. And there's an extra component, when our customers allow it: enterprise support.
What that means is that Neutrino is capable of dialing home to Dell EMC to say there's a problem with a hard drive, or a problem with another core component. It can also open a ticket, so we will often see the issue before the customer does. That's the next level of becoming more proactive.

Remember we talked about this Kibana view. For you to get here, you have to normalize what you're ingesting into the system. That means basic things: are all of my host names lowercase or uppercase? What kinds of IP addresses will I have? Do I have a specific field for a specific component? So we've built a dictionary of information, with metadata, that flows into the system. And that allows us to do precise filtering of the components inside Neutrino, such as severity error here, but you can also drill down into more specifics. Let's take an example: with what we have, we could go to a specific host or a specific tenant, because we have the metadata coming into the system. If you only have a pile of raw logs, that's still a bit complicated, but once we normalize, it gets much easier. A few of the filtering fields we have available are shown here, such as the severity and the device. We make sure that all of the devices follow a naming convention. That doesn't mean your tenants can't rename their VMs; they're free to do that as they wish. But when I do a search, I know the value is going to show up in a field called device. Same thing for the other fields, such as the part type. So we now have a clear way to do one search on one field, and say, I want to see all of the nova-compute services across the board. And lastly, you can also look inside the log contents. Take the earlier example of the nonexistent image: yes, you can filter by error, and yes, you can filter by part type, but what exactly happened? You can actually search for that in the message itself. Then you have a very clear understanding of how we got there.

Now, going further into maintenance: in Neutrino, with monitoring and reporting, what we have is a bird's-eye view of the system, and you can tell what is happening at all times. This view here tells me that the system is fully operational and I don't have an issue. On the top part of the screen you have other information, such as cloud compute and the other subsystems that make up the cloud. And then you can also do forecasting. By forecasting, I can understand that as I'm onboarding new customers, I'm going to run out of space or compute resources in six months, or in three months, and so forth. So again, being proactive, instead of onboarding everybody and then finding out later that you can't really handle all of it, that you don't have the capacity; you can see that beforehand. With that, in certain cases, you can also reclaim space. Maybe a project asked for a certain amount of CPU and storage and isn't using it anymore; can I reclaim that and reuse it? So you can engage different teams in that conversation as well.

And lastly, we have alerting. Alerting is really important, and you do not want to find out only when the business is down; you want to be alerted right away, on premise, when something happens. With the alerting capabilities we are looking, again, all the way through the stack. So if Neutron stops working for some reason, you will get an alert, and you can forward that to your email.
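Purely as a toy illustration of that "threshold plus email" idea, here is a minimal sketch. This is not Neutrino's actual alerting mechanism, which is part of the monitoring and reporting suite; the threshold, addresses, and SMTP host below are made up for the example.

```python
import smtplib
from email.message import EmailMessage

ERROR_THRESHOLD = 50  # hypothetical: errors per five-minute window that should wake someone up


def maybe_alert(component, error_count, smtp_host="smtp.example.com"):
    """Send a simple email alert when a component crosses its error threshold."""
    if error_count < ERROR_THRESHOLD:
        return False
    msg = EmailMessage()
    msg["Subject"] = "[ALERT] {}: {} errors in the last 5 minutes".format(component, error_count)
    msg["From"] = "cloud-monitoring@example.com"
    msg["To"] = "ops-team@example.com"
    msg.set_content(
        "{} logged {} errors; investigate before the problem spills over "
        "to other parts of the cloud.".format(component, error_count)
    )
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
    return True
```

The real system obviously watches much more than one counter, across hardware, operating system, and OpenStack layers, but the shape is the same: a measured value, a threshold you can customize, and a notification path.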
And then, for customers that go to the next level with enterprise support, depending on what you have, this will open a ticket with Dell EMC, and it will be properly escalated. Again, many times we know something has happened before the customer does.

So, going back to what we laid out today as a foundation: we talked about systems complexity, we talked about some of the power tools, we went through a use case, and we just talked about proactive maintenance. Now, to put it all together, let's go back to the core ideas. Remember, we want to be able to do a general audit of what we have, and a lot of times we're not able to; time is usually the enemy there. You want to be good and efficient at knowing this is behaving well, and that is not. Get to the core of the problem instead of just fighting fires. Standardize your troubleshooting, with something such as the ELK stack. Be as proactive as you can, and shift our teams away from pure firefighting; nobody wants to just fight fires, working weekends and overnights. And also, be ready to shorten the initial investigation.

I'll open it up for some questions and answers now; we have time. Do we have any questions?

Oh, good question. I'm going to repeat the question: with the alerting, how do we set the thresholds, and how do we know if it's working or not? By default we have a set of alerts and a set of thresholds for alerting, but you can change that; it's something you can customize. Any other question? Yes. So the question is, what are we using for the alerting mechanism? The alerting mechanism is part of the monitoring and reporting suite from EMC, so it's part of our in-house software. Yes, sir? Sorry; so the monitoring and reporting suite, no, but the other components are all open source. OK, thank you. [Inaudible question.] Yeah, one more question, of course. So the question is whether the ELK stack, as it is right now, can be used for third-party applications. The way to see it: the ELK components are all open source, and you can definitely roll out your own. So if you have your own OpenStack, you could definitely do it. We didn't go through the details of how to install it and all of that, but you would have to install the pieces, configure them, couple them together, and normalize the data. So the answer is yes. The one we have is running inside of Neutrino, and yes, it's customized in the sense that we normalize everything. Yes. Yes, sir.

That's an excellent question. The question is, when we're talking about forecasting, are we just going off alerts, or do we have analytics? We do have analytics. If anybody's interested, I'll be happy to do a demo on the rack and show more; there's a lot more than what we have here. In a nutshell, take a tenant, for example: we look at what that tenant has been using over the past one month, three months, six months, and it sees a trend. After that it applies formulas to that data to make predictions. So we're really talking about analytics going back to the raw data to do some prevention. It's not part of that specific graph I showed, but an early warning would come up in the system. Absolutely. Thank you. So every log, every OpenStack log, will go up, but also every log from the operating system will go up as well.
So, for your example, if you have an MCE log, a machine check exception, that's going to go up as well. So if you have another problem, say your BIOS needs to be updated or something like that, you will receive that information. Yes. OK.
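To close with a small illustration of the trend-based forecasting described in that last answer: the sketch below fits a straight line to recent usage samples and extrapolates when capacity would run out. It is a deliberately naive stand-in for Neutrino's actual analytics, with made-up numbers, just to show the shape of the idea.

```python
import numpy as np  # assumption: numpy is available

# Hypothetical monthly storage usage for one tenant, in TB, over the past six months.
months = np.arange(6)
used_tb = np.array([10.0, 11.5, 13.2, 14.8, 16.1, 18.0])
capacity_tb = 40.0

# Fit a linear trend (least squares) and extrapolate forward.
slope, intercept = np.polyfit(months, used_tb, 1)

if slope <= 0:
    print("Usage is flat or shrinking; no capacity exhaustion forecast.")
else:
    months_to_full = (capacity_tb - used_tb[-1]) / slope
    print("Growing about {:.1f} TB/month; at this rate the {} TB pool "
          "fills up in roughly {:.0f} months.".format(slope, capacity_tb, months_to_full))
```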