Hello, guys. Let's get started with the presentation about Stackmon. I am Artem, project team lead of SDK and CLI, and that means I am fighting for the users. Today we are going to talk about observing OpenStack-based clouds. We started this whole approach of monitoring OpenStack from the user perspective quite a few years ago, five years ago already, and actually it was started even earlier. But let's have a look.

A small agenda: why, and why not something that already exists out there? What are we doing? And then, how are we doing that? Basically just three small points.

So, why and what, and why not? What we wanted initially is to monitor the cloud from the end-user perspective. The question here is not to know what is happening on the back-end side, that Nova is up and running and that there is disk space left. We really wanted to know what happens when an end user starts using our cloud, and how it feels for them. We wanted metrics of genuinely different physics: latencies, rates, occurrences, meaning failure rates, success rates and so on. We explicitly wanted events as well as plain metrics, because when certain things start failing, we would most likely want logs for that, to be able to see where exactly it failed, when it started failing and so on. We clearly wanted easy extensibility, because there are frameworks and tools for testing clouds out there already, but most of them make it pretty much impossible for a regular operator or a regular user to understand what is being done, let alone try to reproduce it. We wanted a status page for the public cloud with SLA calculation; we are officially, legally obliged to have something like that, and for it we need to convert the raw metrics coming out of the cloud into something usable for SLA calculation. And we clearly wanted not to invent something new, but to reuse what is already there and just fit the right pieces together.

So, what are we doing now? For the regular use case, look at the bottom part of the diagram. The user is somewhere out in the cloud of the internet, let's say, and is using OpenStack. I more or less just drew the OpenStack APIs in front of the services, which is pretty much the case anyway; maybe you would have an API gateway or something like that in addition. Stackmon behaves pretty much exactly like a user: it takes exactly the same path and lands in OpenStack. It has no connection to the back end, meaning all the information you have in the back end is currently not available in Stackmon. It could be fetched in different ways, but this was not in scope, and for us it was practically impossible for legal reasons, because connecting to the back end is not trivial. So that's pretty much it.

Now let's have a look at how we are doing that. The very basic idea was to take an Ansible playbook as a testing scenario, because Ansible is easy: everybody can understand it, every single operator can read it, and even better, everyone can reproduce it on their own laptop. The API metrics are emitted by the OpenStack SDK under the hood.
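To give an idea of what that looks like, here is a minimal sketch of the relevant clouds.yaml piece, assuming a local statsd daemon; the host, port and cloud entry are placeholders for this example, not our real values.

```yaml
# clouds.yaml -- minimal sketch; the statsd host/port and the cloud
# entry are placeholders, not production values.
metrics:
  statsd:
    host: 127.0.0.1   # statsd daemon that forwards the metrics on to Graphite
    port: 8125
clouds:
  monitored-cloud:
    auth:
      auth_url: https://identity.example.com/v3
      username: apimon
      project_name: apimon-tests
```

With something like that in place, every API call made through the SDK, and therefore through the Ansible modules, produces a timing and result metric without the playbooks having to do anything themselves.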
This is something not everybody knows, but Ansible relies on the OpenStack SDK, and the OpenStack SDK is capable of emitting metrics for every single API call it makes.

Additional metrics: we are monitoring APIs, which means we know a user can come and provision a new server, but what happens with the resources that have already existed there for a while? We currently call them static resources. Say we have a few servers behind a load balancer, and we just want to ensure that those servers are permanently reached by the load balancer, and not that the load balancer itself dies and we don't even notice.

How are we doing that? Metrics go out of the OpenStack SDK, pass through statsd and land in Graphite; depending on which area of the world you come from, you may name these components differently. Then there is the metric processor, an additional component that translates the different physics of the metrics into what we call flags and semaphores, an abstraction that helps us calculate the SLAs later. On top of that we have a status dashboard that visualizes all the statuses. And last but not least we have Grafana, which visualizes more or less the raw metrics, trends and so on, something a regular operator would then consume.

High-level design, really very high level, because once you start looking deeper the diagram becomes unreadable. On the left side we have the testing framework; a bit later I will describe its individual elements in more detail, but for now: there is a testing framework with a few ways to trigger the tests. There is a back end, which currently contains alerting, statsd, Graphite, the metric processor, databases, log storage and so on. And on the representation layer we have the Grafana dashboards and the status dashboard.

Testing options, meaning all those elements in the left box. We have ApiMon; that is more or less a historical name, because when we started we were thinking only about testing the APIs. Then we started testing more and more things, but the name stuck forever. Then we have EpMon, which basically stands for endpoint monitoring: it doesn't always make sense for us to try to create a server; we would like to have simpler tests, like whether the endpoint is there and always returning the results we expect. So this gives us the possibility to run simpler tests, not the complex ones. And in addition we can run even more complex tests, like those static resources; for that we basically just have a container out there permanently sending requests and so on.

Let's have a look at the ApiMon side. This is the Ansible playbook, readable by pretty much anybody, and what we do here is test the image service the way a regular user would. We list all the images without any filters; we get information about a single image. The third step, downloading a CirrOS image, is not really in the scope of the OpenStack cloud itself, but it is something a user would do in order, in the next step, to be able to upload that image into the cloud. And the final step is deleting the image, which users would theoretically also do sometimes.
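To give you an idea, here is a sketch of what such a scenario can look like, written against the openstack.cloud collection; the image name, file path and download URL are invented for this example rather than taken from the real ApiMon playbooks.

```yaml
- name: Image service scenario, as a regular user would run it
  hosts: localhost
  gather_facts: false
  tasks:
    - name: List all images without any filters
      openstack.cloud.image_info:
      register: all_images

    - name: Get information about a single image
      openstack.cloud.image_info:
        image: cirros-test
      register: one_image

    - name: Download a small CirrOS test image to local disk
      ansible.builtin.get_url:
        url: https://download.cirros-cloud.net/0.6.2/cirros-0.6.2-x86_64-disk.img
        dest: /tmp/cirros-test.img

    - name: Upload the downloaded image into the cloud
      openstack.cloud.image:
        name: stackmon-test-image
        filename: /tmp/cirros-test.img
        state: present

    - name: Delete the image again
      openstack.cloud.image:
        name: stackmon-test-image
        state: absent
```

Every API call these modules make under the hood is emitted by the SDK as a separate metric, so even this short playbook yields latency and result data points for the list, get, upload and delete calls.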
This is of course the simplest case, and we have many more of those, pretty much for every service: creating networks and routers, provisioning servers, volumes, backups and so on. This one is just for simplicity.

Endpoint monitoring. As I said, this is the possibility to simply test that the APIs are working and returning positive response values. We don't care what exactly is in the response body; we just want to know that the call worked and that we received back something like a 200, and those are GET calls only. So here, for the compute endpoint, we list URLs under that endpoint; we send requests to those URLs and just gather all the response codes, and the latencies of course.

Under the hood: the OpenStack SDK. As I already said, the OpenStack SDK used by Ansible emits the metrics to statsd. It can currently also write to InfluxDB. It can also work together with Prometheus, but that is a slightly tricky case and therefore doesn't really fit for us.

And for the complex cases we have the possibility to capture additional, more complex metrics. Since some things in OpenStack are not synchronous but asynchronous, we need to be able to track how much time it really takes to provision a server until we are able to log into it, or, once we start creating a volume, how much time it takes until the volume really becomes available. As you see at the very bottom: tags. This is more or less a hack on top of Ansible; we haven't found a better way, but it is one possibility, and this way you can also combine multiple Ansible invocations into a single metric. So in addition to the regular raw API metrics there are extra metrics that are combinations of multiple Ansible steps.

Ooh, time is running. The generic Stackmon plugin: this is just the load balancer example. We have three VMs located in different availability zones, we have a load balancer, and our test load, which is exactly the container I meant earlier, is permanently, say every two seconds, sending a request. We try to catch metrics from those VMs, ensuring that all three of them receive requests from the load balancer, and we also log the latencies and the distribution of the requests, to guarantee that none of the VMs has dropped out.

Data flow. We monitor our cloud from multiple places in the world, because it makes no sense to test from a single place; we also need to test connectivity to our data centers. That is why you see zone Germany at the top and zone Netherlands underneath; they run those test loads. Each of them has a statsd and Graphite pair, combined for high availability. The two inner boxes are the clouds we are testing; in our case two regions of the cloud, region Germany and region US, with the full complexity of all the services out there. The test loads from the zones go to the regions depending on the configuration, where we describe which test runs against which region from which zone. The metrics go to the Graphites.

From the Graphites they land in the metric processor and also in Grafana, where the raw metrics are visualized; out of the metric processor the data goes to the status dashboard, which already visualizes for the end user whether the status of a service is okay or not okay.

Now let's have a quick look at the metric processor. The very basic question: when is the service degraded, when are we experiencing issues, and how do we define that? Is it when the latency of GET requests is above a certain threshold? Is it when POST requests or DELETE requests start to fail? Is it when the API is not reachable at all? Or is it when provisioned resources cannot be reached anymore? Maybe the error rate is too high? The next question would be: what if things are working from one zone but not from another? Does that mean something is wrong with the cloud, or did some tractor just cut the cable to the data center again?

So what we can do is work from the desired end result back to the start, and begin with the semaphore. An outage is, for example, when conditions X and Y or Z hold; a major incident is when A and B; a minor incident is when A or B, and so on. Pretty simple: we can define the state of a semaphore using binary logic. The flags are then exactly what defines those conditions, really as booleans: a flag is raised or it is not. If the average latency is more than one second, one flag goes up; if the average success rate drops below 50%, that is another condition and another flag is raised, and so on.
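To make that concrete, here is roughly how such rules could be written down as configuration. This is a hypothetical sketch mirroring the examples I just gave, not the real Stackmon metric-processor schema; all the metric names, thresholds and expressions are invented.

```yaml
# Hypothetical flag/semaphore definitions -- illustrative only,
# not the actual Stackmon metric-processor configuration format.
flags:
  high_latency:          # raised when the average GET latency exceeds 1 second
    metric: compute.api.GET.latency.avg
    condition: "> 1000"  # milliseconds
  low_success_rate:      # raised when the average success rate drops below 50%
    metric: compute.api.success_rate.avg
    condition: "< 0.5"
  api_unreachable:       # raised when the endpoint stops answering at all
    metric: compute.api.availability
    condition: "== 0"

semaphores:
  compute:
    outage: api_unreachable and low_success_rate or high_latency   # X and Y or Z
    major:  high_latency and low_success_rate                      # A and B
    minor:  high_latency or low_success_rate                       # A or B
```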
So the metric processor does exactly all of those conversions: it evaluates the raw metrics and generates semaphores out of them, and the state of those semaphores then lands in the status dashboard.

This is, very briefly, how we implemented that. It is based on Bootstrap, nothing complex, but you can see a box with incidents, the regions, and all the services grouped into service categories. Upstream it is great as it is; for our cloud we just customize it, providing our own CSS and basically trying to get the colors right.

That's it, basically. The links are here on the screen. Everything currently lives on GitHub in the organization called Stackmon. There are also some public docs, but if you have questions, just ask right now or catch us later. Any questions for the moment? Okay, then. Can you please go to the mic? You're too deep in the room there.

Can you go into that a little bit more? Like in the example where you had the tag, how was that generating a metric?

There is a callback plugin for Ansible. As part of our test runner there is an additional plugin that catches all the Ansible statistics and evaluates whether a task had specific tags attached to it; if so, it combines them, and the callback plugin itself emits the metrics.

So, there are no more questions from the audience; then I have a question to the audience. How do you deal with the situations Artem just described, and with complex scenarios? Are you satisfied with your existing solutions for, say, endpoint monitoring or something like that? Or are you observing or monitoring your clouds at all? Maybe that's the case too. Okay guys, then thanks a lot.