Good evening, everyone. Thanks for joining us today for this session. Today we are going to talk about customer happiness as a metric, or what kind of insight you need to judge how happy your customers are. I am Ankita, and I'm a software engineer at Symantec. This is my colleague, Jasmeet. He's with our SDN team.

We all know that monitoring is important for stable operations. As a cloud operator, you might focus on availability and uptime. Traditionally, we monitor processes, CPU, disk, network utilization, et cetera. These tools do tell us what's broken and enable proactive repairs. But do they have any coverage for customer interaction? Let's see. In the previous slide, we saw that Zabbix was green. But does this correlate with the customer experience? Here, a customer is trying to launch a VM, but this is what he gets. He's having issues with slow-loading pages. Finally, when the request went through, he got an error: "No valid host was found." This is just one of the errors customers commonly see. So is your customer happy? Maybe, maybe not.

Why is there a gap between what we, the cloud operators, see and what the customers see? This lack of insight into the end-user experience leads to frustration and inefficiency. Traditionally, we monitor server health, service availability, cluster health, et cetera. But what about user experience? When a user requests something from Nova and gets a response back, you, as a cloud operator, would want to know whether he got what he requested. What about async pipeline failures? When a user boots a VM, did the VM end up in an error state or an active state? How much time did it take for the VM to become reachable? We need to analyze all these data points before we can safely come up with a number and say that our cloud is healthy.

Therefore, we decided to implement additional methodologies to better understand the end-user experience. FCI is one such metric. FCI stands for failed customer interaction, which is the number of failures divided by the total number of requests. What does a failure mean to you? For us, it's the 500 errors, because they show that the service was available but still not reliable.

How do we get this data? One such way is the audit middleware. The audit middleware is an optional WSGI middleware filter provided as part of the keystonemiddleware library. It provides the ability to audit API requests for each component of OpenStack. It's based on the pyCADF module, which uses the WSGI environment variables to build a CADF event. CADF stands for Cloud Auditing Data Federation; it is a standard way of defining audit events. This is what we get from a CADF event: it answers seven Ws. What, when, who, on what, where, from where, to where.

This is an example of a CADF event. The who here is python-neutronclient. The what is the action, which is read/list. The outcome was success. In the request path, you can see it shows floating IPs. So a user requested a read operation on floating IPs (a rough JSON sketch of such an event follows below).

This is how it works. When a Nova API request comes through, it first passes through the auth filter and gets authenticated. Then it goes to the Keystone context filter, where security data gets added. After that, it goes through the audit filter, where the audit middleware uses that security data to build the CADF event. After this, it goes through the API router.

This is how we enable the audit middleware. In each service where you want to enable it, you need to declare the filter in the api-paste.ini file. After that, you need to point it to the correct audit map file.
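To make that more concrete, here is a rough, hand-written sketch of what such a CADF event can look like on the wire. This is not the event from the slide; the field names follow the CADF/pyCADF schema as we understand it, and the values are made up to match the floating-IP example:

    {
        "typeURI": "http://schemas.dmtf.org/cloud/audit/1.0/event",
        "eventType": "activity",
        "eventTime": "2016-10-25T18:27:14.000000+0000",
        "action": "read/list",
        "outcome": "success",
        "initiator": {
            "typeURI": "service/security/account/user",
            "name": "demo",
            "host": {"agent": "python-neutronclient", "address": "10.0.0.5"}
        },
        "target": {"typeURI": "service/network"},
        "observer": {"typeURI": "service/network"},
        "requestPath": "/v2.0/floatingips.json"
    }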
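The api-paste.ini declaration just mentioned looks roughly like this for Nova; the map file path and the exact pipeline line will vary by deployment and release:

    [filter:audit]
    paste.filter_factory = keystonemiddleware.audit:filter_factory
    audit_map_file = /etc/nova/api_audit_map.conf

    # then add "audit" to the pipeline, for example:
    [composite:openstack_compute_api_v21]
    use = call:nova.api.auth:pipeline_factory_v21
    keystone = cors compute_req_id faultwrap sizelimit authtoken keystonecontext audit osapi_compute_app_v21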
After the declaration, you should place the audit filter after the Keystone context filter, because we need the security data. The map file can be found at the GitHub link mentioned on the slide.

Now we have the CADF events in RabbitMQ. Since Ceilometer is primarily built to consume events from RabbitMQ, we decided to use this service. We built a new listener plugin, which listens to these events, filters and processes them, and then emits metrics and logs. This is how you get the CADF events via Ceilometer: you configure Ceilometer to listen on whatever queue you are sending the CADF events to. This is an example of our endpoint. We filter the audit request and audit response events, then extract the publisher ID, the event type, and the HTTP code, because we want to differentiate between different kinds of API requests and between requests and responses. After that, we generate the metrics from all three fields and send them to the StatsD client (a simplified sketch of this follows below). This plugin can be extended to capture additional events. The GitHub link for our customized Ceilometer is mentioned on the slide.

Now, Jasmeet will continue with how we can use application logs to get the FCI metrics.

So application logs are another way to capture this really useful information that we have. They're often ignored: logs get rotated and the data is simply missed, and the amount of information in there that we could be using is astonishing. When it comes to system behavior, solid logging combined with proactive monitoring is critically important. We may look at logs when something doesn't work right or doesn't look right, but otherwise they're ignored and rotated out. Many tools and services exist to make use of this data. Datadog is one such service, offered as software as a service. But for this talk, I'll be focusing on Logstash, StatsD, et cetera: tools available in the open source community to parse these logs and get this useful information out.

This is one example of the kind of log lines that can be found in any number of log files. Most RESTful services have a similar structure, so they're easily parsed. As I said, this is an example from Horizon. You can clearly see where the request came from, at what time, the response code returned, the bytes, the duration. All of this is very useful information that you really want to consume. It can help you figure out whether customers are actually getting the service as it's advertised, or whether every time they want to use something they run into issues, maybe getting 500s, as Ankita mentioned, which really ups the frustration level.

Logstash is great for parsing logs. I don't want to get into too much detail here, but many filters exist to parse existing log files. For example, the Apache common log format is already predefined and you can parse it directly. This is an example of the Grok filter that's used to disassemble the log line into individual fields that can then be emitted to StatsD for aggregation and collection of your stats (a rough configuration example also follows below). In this one you can see the response code that's returned, the verb, the bytes returned, the duration, et cetera. And to put it together, the pipeline looks something like this.
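The listener plugin itself is at the GitHub link, but as a simplified sketch of the idea, the processing looks roughly like this in Python. The event type names, the metric naming scheme, and the StatsD host and port here are illustrative assumptions, not copied from our plugin:

    import statsd

    statsd_client = statsd.StatsClient('localhost', 8125)  # placeholder host/port

    def process_audit_event(event_type, publisher_id, payload):
        """Turn one CADF notification into a StatsD counter.

        Simplified sketch of what our Ceilometer listener plugin does;
        the real plugin lives in our customized Ceilometer repo.
        """
        # Only look at the audit request/response notifications.
        if event_type not in ('audit.http.request', 'audit.http.response'):
            return

        # For responses, the CADF reason carries the HTTP status code.
        http_code = payload.get('reason', {}).get('reasonCode', 'none')

        # e.g. fci.nova-api.audit.http.response.500
        metric = 'fci.{}.{}.{}'.format(publisher_id, event_type, http_code)
        statsd_client.incr(metric)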
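The Grok filter from the slide isn't reproduced here, but for an Apache-style access log like Horizon's, a Logstash configuration along these lines would do the job; the namespace and metric names are illustrative:

    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
    }

    output {
      statsd {
        host      => "localhost"
        namespace => "horizon"
        # e.g. horizon.GET.200, horizon.POST.500, ...
        increment => ["%{verb}.%{response}"]
      }
    }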
We have many different sources: log files, and the Ceilometer endpoint that she mentioned, that are emitting stats or at least writing log files that can be parsed with Logstash. Logstash takes those, uses the Grok filters to break them apart into individual fields that are emitted to StatsD for aggregation, and then you can ship them off to your DB for storing and querying later. We have Grafana for granular views of these stats, and Dashing to bubble up the higher-level view, which I'll get into in a little bit.

This is one example of stats assembled completely from log entries. If you're not parsing and using this information, it's simply gone, and there's a lot of useful business data in there. This is an example from one of our recent upgrades. I'll quickly go over what you're seeing on the screen. The top left is just the number of successes, hovering somewhere over 50k requests per hour; I believe this is Neutron. The far-right graph is the 500s. Just looking at that, you can see something happened at the point where the errors started kicking up. The middle is the FCI that Ankita mentioned: the number of failures divided by your total requests, to figure out what percentage of your requests are failing. The scale here is adjusted, I believe from 90% to 100%, to highlight the failures; with such a high volume of requests coming in, a low number of failures would otherwise go ignored. And this here is the response bytes and the response time, the duration. At the bottom, we have the HTTP verbs broken down.

Just by looking at this and drilling down, you can see the gap here in the middle is basically the maintenance outage that we took to perform this upgrade. But something happened that kicked the time it takes to process these requests into the tens of seconds. This is useful information that you can correlate. You can look at other events, or maybe go back to the logs and see what exactly happened at that time. In our case, it was a DB node getting into an out-of-memory loop and crashing, causing Cassandra to do its repairs and come back up.

So let me refer back to this slide real quick. The FCI we mentioned is, I think, a great metric to judge how your customers are doing, what their frustration level is, if you will. This one metric alone can give you insight into how your services are doing. We have, what is this, close to 10 or 12 graphs for one service. We have maybe 15 or 20 different services. That's close to 200 different graphs. It's next to impossible for somebody to keep an eye on all of them. So what do you do? You can bubble up that FCI per service, and we use Dashing to get a high-level view of how things are going. This is the inverse of the FCI, because we want to see whether everything's healthy; you don't really want to see a zero, as that often indicates something else, right? So what is success? It's the inverse of the failures: all the requests that didn't fail are successes. We calculate it per hour. This is an example from the past hour of the number of requests that were successes versus failures. This is great for getting a bird's-eye view of how that one service is doing. These numbers can be bubbled up again to capture perhaps one metric that gives you an idea of whether your cloud is healthy, functioning, working, et cetera. Now, this is all great, but oftentimes you'll end up with this: Designate showing "not a number," right there.
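As a back-of-the-envelope sketch of the arithmetic behind those dashboard tiles (this is illustrative, not our dashboard code):

    def success_rate(total_requests, failures):
        """Hourly success percentage: the inverse of FCI.

        FCI = failures / total_requests, so success = 1 - FCI.
        Returns None when there were no requests at all, which is
        exactly the Designate "not a number" case just mentioned.
        """
        if total_requests == 0:
            return None  # no data: can't tell healthy from broken
        return 100.0 * (total_requests - failures) / total_requests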
The percentage can't be calculated; it's a divide by zero. Why? Let's drill down and take a look at those awesome graphs we just talked about. There's nothing there, right? Because Designate is a very lightly used service, we might not have had any requests in that hour. Now we have blank graphs, and we don't really know whether it's because the service is down, or the users are having some other issue accessing it, or even whether the service is available to the user at all.

That's where synthetic transactions come in. This is synthetic load generation: we proactively exercise our APIs to make sure things are healthy, so problems are highlighted and fixed much faster. We can have integrations with chat ops, and we can send out alerts if something doesn't look right, so our guys can react and proactively fix these things before the customer runs into an issue, has to file a ticket, and waits for our support team to get back, acknowledge it, and say, yes, it's a problem, now we're going to go fix it. So synthetic transactions, coupled with metrics, stats, and the FCI, give you a really good idea of where your customers are in the sense of how you're doing with the service and whether it's healthy for them (a rough sketch of such a probe follows below).

So the moral of the story here is: don't ignore your logs; turn them into metrics. There's a lot of useful information that's usually discarded, and I think it's critical for a business to know how you're doing, how the service is being consumed by users, and what their impression of the service is. Synthetic transactions play a large part there as well, as we talked about. And then you really need to define what failure means. It depends on the service you have, right? And on how the logs are parsed, so you can have the filters set up properly to capture these failures. Latency thresholds, perhaps: for some services, anything that takes over one second is a failure. For another service, it might just be 500 error codes that are indicative of a failure. Sometimes you want to look at 400s or 300s, depending on what the architecture is. You need to calculate your FCI so you know what your customers are getting, and defining the failures is very important.

The keynotes highlighted the people aspect very nicely; I was very happy to see these things mentioned, because they play a large part. Jonathan Bryce mentioned the three stages, right? How do you get your legacy apps into the cloud? First, you need to get them into the cloud. Then you need to optimize, and finally become cloud-native. Well, it's people that need to deal with your cloud, and they're still legacy-oriented; their thinking hasn't shifted. They're traditional Java programmers who aren't familiar or confident with how to use these things. So in that first phase, you want to make sure you don't upset your customers, because they can simply turn around and go somewhere else. And Boris mentioned the same thing: success with OpenStack is one part technology and nine parts people. Customers are people too, right? You can't be successful if nobody's using your service, and it's that culture that you want to build. So it's important to understand what your customers are going through when they're using your service, and I think all of that information is there. It's in the logs. We need to use it.

And where to next? Once you have the metrics, taking action on them becomes a lot easier. You can develop automation. You can have chat ops integration.
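To give a feel for what a synthetic transaction can look like, here is a minimal probe sketch. The endpoint, token handling, StatsD host, and metric names are placeholders, and a real VM probe would cover the boot, ping, SSH, and post-boot stages as well:

    import time

    import requests
    import statsd

    statsd_client = statsd.StatsClient('localhost', 8125)  # placeholder host/port

    def probe_list_servers(endpoint, token):
        """One synthetic transaction: list servers, record outcome and duration."""
        start = time.time()
        try:
            resp = requests.get(endpoint + '/servers',
                                headers={'X-Auth-Token': token},
                                timeout=10)
            outcome = 'success' if resp.status_code < 500 else 'failure'
        except requests.RequestException:
            outcome = 'failure'
        statsd_client.incr('synthetic.nova.list_servers.' + outcome)
        statsd_client.timing('synthetic.nova.list_servers.duration',
                             (time.time() - start) * 1000)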
As I mentioned earlier, you can have L1 and L2 metrics that get bubbled up: L2 might be service-oriented, the per-service FCI, and L1 could perhaps be your entire region, indicating how the region itself is doing. If people are getting a lot of failures, there's no way you can assume that things are healthy. This is an indicator that can tell you very early whether things are looking good or bad. And finally, we definitely need to close the customer feedback loop. We can't just assume people are happy; we need to correlate. We've seen the flow of incoming tickets rise sharply when a lot of failures are being reported and highlighted in the FCI, so these things are sort of validated, but we still need to close the customer feedback loop and confirm that we're actually making the right decisions and deductions from the data that's available.

If you have any questions, we can take them now. We'll also be at booth D8, and I'd love to hear from you and talk to you in person.

Yeah, I was wondering if you could talk a little bit about how this kind of monitoring might affect the performance of your cloud. I've tossed around the idea of that kind of synthetic use of the cloud to generate these metrics before, and a lot of people on my team are concerned that adding any load at all to the system is just an inherently bad thing. So have you tested how this affects performance?

So scalability is definitely something that needs to be addressed. You're almost in a loop where you're constantly checking: are you okay? Are you okay? Are you okay? And how frequently do you do that? I think it's a valid question. I can speak from experience, because we were doing that with one of our SDN services, and it's almost like quantum physics, right? The more you observe something by repeatedly testing it, the more of an effect you leave; you change the system just by exercising it. So it's a balancing act, depending on what service we're talking about. For example, if it's VMs, you want to go through the synthetic transaction of spinning up VMs, and there are a lot of stages there. You need to know what exactly failed and where. Was the first ping to the VM slow? Were you able to SSH? Did the VM even come up? And if it did, did the post-boot setup happen? So it's very elaborate. You need to look at each service and see how expensive it is to run these transactions. There's not really one answer; it does have an effect, but I think the benefit outweighs the cost of not having it. Thanks.

All right, well, thank you.