Welcome to our presentation "Yet another OpenStack monitoring solution: why ApiMon is different." My name is Nils Magnus. I'm a cloud architect at the Open Telekom Cloud, and I also work as a community outreach manager. With me is Artem, a colleague of mine and the principal designer and chief architect of the ApiMon solution. Together we are going to give you a brief work-in-progress overview of this project that we are running.

Well, if we are talking about monitoring, it makes sense to have an idea of what we are talking about, especially in the Open Telekom Cloud. We have a rather extensive setup with several thousand compute servers, a lot of RAM, storage, networking, and so on. To still keep an overview of what is going on, we have to automate things, and we have to monitor things to understand what is happening.

Monitoring itself can be approached from different perspectives. For example: is the service available at all? Are specific SLAs, as described in the service description, met? Are we running into capacity issues any time soon? Another question, and this is something I would like to highlight in our presentation, is compliance with standards: being API-compatible with a description like the OpenStack API. It is very hard to answer all those questions with a single solution, and that's why we first assessed what kinds of tools were already available on the market or as open source projects. We checked classic tools like Zabbix, Icinga, and Nagios, as well as more modern approaches like Prometheus. But we lacked an option to do actual API monitoring, especially for the OpenStack space. When we started this project about three years ago, OpenStack Rally, which focuses more or less exactly on this topic, was still in its infancy, and that's why we did not consider it at the time. Maybe it is a good idea to take a second look at it, so if anyone has feedback or experience with it, please raise your hand after our presentation; we'd love to hear about it.

We also have several teams doing QA work, and we touched base with them to learn from their experiences, but there was no single solution that met all our requirements. So we decided to come up with a whole new environment, and ApiMon was born. This is a work-in-progress report, and if you'd like to find out a little bit more about the project, or even contribute to it, here is the public link for that.

About three years ago, as I said, we started with 2,600 lines of bash script that stored data in InfluxDB and visualized it with Grafana. But that was not very scalable or extendable, and it had a lot of scheduling issues, which is one of the major problem areas in this whole domain. Think of instances or resources that you created but that for some reason did not get cleaned up afterwards, deadlocks, and so on. This is a major field of consideration. So our question was: since we had experienced those issues, should we extend the scripts or rewrite them? We came to the quick conclusion that we needed to rewrite, and to set up a proper project for it: assessing requirements, doing architecture and implementation, and iterating in an agile way of development.
So first we started with some requirements engineering, and we discovered early on that we want to be able to describe use cases, which we call scenarios here. Those scenarios should be code that is environment-independent and can run on different instances of the monitoring system: inside the cloud itself, which on its own is probably not such a good idea, or outside the cloud as well, so that you are able to compare results produced in one environment with the other. Another important requirement is visualization for different stakeholders: management, architects like ourselves, service-specific engineering teams (squads for compute, storage, network, and so on), and the colleagues from operations, so that they can all access the same pool of data.

As I already mentioned, monitoring a cloud is problematic if the monitoring system sits inside the cloud it monitors: if the cloud is down, you will probably never know. So we needed the option of scaling it out to different environments and running it there, which is what we implemented, not only for this reason but also to address HA requirements: one part of the cloud or of the monitoring system can be down while the remaining part is still running and working, so that we keep getting insights.

As a solution, we decided to build everything around Ansible playbooks. Ansible playbooks come in very convenient because they run virtually everywhere and Ansible is widely understood: we have lots of DevOps engineers and developers available for it, or at least it is easier to find a good Ansible developer than developers who can work with a completely new description language. Another advantage is that the scenarios we describe in those playbooks can be used as blueprints for solution patterns as well. That was one of the central ideas of our solution design. The next thing we thought about was how to connect those Ansible playbooks with the actual cloud, and the OpenStack SDK came in handy, with two advantages: first, it already implements those connections; and second, it comes with extra metrics that are collected for free. The Ansible modules facilitate their calls through the SDK, so you can just fetch those metrics.

Let's have a look at the ApiMon basic architecture, which applies to any version after the initial proof of concept. There is an executor that schedules Ansible playbooks, which in turn internally invoke Ansible modules, and those modules are connected with the OpenStack SDK; those are the first four steps. Obviously the SDK calls the cloud services themselves and gets some replies, and then the collecting magic starts: the OpenStack SDK collects those specific metrics and sends them to Telegraf, which converts them and pushes the data into a time series database, where it is stored. When you request the data from a visualization tool like Grafana, you see the classic graphs displaying the metrics. (A small sketch of this pipeline idea follows below.)
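To make the pipeline concrete, here is a minimal sketch of the idea, not ApiMon's actual code: time one SDK call and ship a single measurement to Telegraf as InfluxDB line protocol. The cloud name "otc" in clouds.yaml and a Telegraf socket_listener on UDP port 8094 are assumptions for illustration.

```python
# Sketch: measure one OpenStack SDK call and push the result to Telegraf
# as InfluxDB line protocol over UDP (assumes a socket_listener input).
import socket
import time

import openstack

conn = openstack.connect(cloud="otc")  # assumed clouds.yaml entry

start = time.monotonic()
try:
    servers = list(conn.compute.servers())  # the API call we measure
    result, count = "ok", len(servers)
except Exception:
    result, count = "failed", 0
duration_ms = int((time.monotonic() - start) * 1000)

# One record per run: measurement name, tags, then fields.
line = (
    f"api_call,call=compute.list_servers,result={result} "
    f"duration_ms={duration_ms}i,servers={count}i"
)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(line.encode(), ("localhost", 8094))  # assumed Telegraf port
```

In the real setup the SDK emits such metrics by itself for every API call it makes; this sketch only illustrates the kind of data that flows through Telegraf into the time series database.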
Did I miss anything, Artem? No, it's pretty much all clear, so let's continue with a deeper look at the first proof of concept, the first version with its somewhat naive approach of simply running Ansible playbooks and relying on the metrics exposed by Ansible and the SDK.

What we did, really to get going very fast, was to create a simple Docker Compose file with a simple bash script that looped over all existing playbooks and executed them in a loop. The data was then sent to Telegraf, Telegraf was connected with Prometheus, and Grafana was watching all of it. Pretty much the pipeline as just described. But of course we immediately found the issues with this approach.

First of all, there are jobs, or scenarios, that run fast and scenarios that run slow. If one scenario takes 20 minutes and, on the other hand, you would like another scenario to be executed every second or every five seconds, you have no choice but to implement proper prioritization.

Then we faced a really pretty critical question: are we gathering metrics, or are we storing events? Prometheus and similar systems more or less gather metrics, like "what is your average temperature now?" or "what was your average temperature yesterday?" That cannot help you identify issues like "at this particular point in time, your scenario for doing this or that actually failed": due to aggregation, you simply lose this data. So the concept of metrics simply does not fit here; we are really talking about storing event data. (The sketch below illustrates the difference.)

Next, Docker Compose was never meant to do more than get us running quickly; high availability with plain Docker Compose is not really an easy option. So it was a proof of concept that let us start very fast, nothing more.

And then there is what the SDK gives us: metrics about each invoked API call, how long the call took and what it returned. But does it help us to know merely that this create-server API call took 0.6 seconds? Actually, no. We are more interested in the time from the moment we send the request to create the server until the moment the server is really running and we are able to log in.

So we refined our architecture a bit and addressed the high-availability concern by having two separate installations of ApiMon: one environment is an installation of ApiMon inside our cloud, the other is an installation outside the cloud. They execute the same tests, and they send their metrics to InfluxDB now, not Prometheus, since we are really talking about storing event data. Grafana, of course, is installed in high-availability mode, with a clustered database and all those nitty-gritty HA features.

So we came to version 2, which was focused on addressing the problems of our first PoC and became the first minimal viable product. We improved the scheduler dramatically: we implemented it not as a bash process but in Python. This process watches the Git project where all the scenarios are defined, so you can update a scenario at any time and a new version of it is executed immediately; remember that the playbooks run in a loop.
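To make the metrics-versus-events point from above concrete, here is a small illustrative sketch, not ApiMon code: the same four scenario runs viewed as an aggregated metric and as stored events. The aggregate looks healthy while the single failed run disappears.

```python
# Sketch: aggregation (metrics) hides the individual failure that an
# event store keeps addressable.
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunEvent:
    timestamp: float   # when the scenario run happened
    duration_s: float  # how long the run took
    succeeded: bool    # did the run pass?

runs = [
    RunEvent(1000.0, 4.1, True),
    RunEvent(1060.0, 4.3, True),
    RunEvent(1120.0, 4.0, False),  # the one failure we care about
    RunEvent(1180.0, 4.2, True),
]

# Metric view: one number per window. The failure at t=1120 melts into
# a 75% success rate and a normal-looking average duration.
print(mean(r.duration_s for r in runs))            # 4.15
print(sum(r.succeeded for r in runs) / len(runs))  # 0.75

# Event view: every run is kept, so the failing run stays addressable
# and can be linked to its log file, e.g. in Swift.
print([r for r in runs if not r.succeeded])
```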
Those Ansible playbooks are executed as multiple sub-processes so that we achieve parallelization. The logs of each Ansible playbook execution are captured and pushed to Swift, so that at every point in time we are able to see: OK, at that point in time this individual run failed. We store its ID, we store its reference, and we are always able to nail it down. We added a callback plugin for Ansible that gathers additional metrics about the execution, which is exactly what helps us answer not only "how much time did the API call take?" but "how much time does it take from sending the first request to provision a server, through a couple of additional steps in Ansible, until we are really able to log in to the server?" (a sketch of such a plugin follows below). We also reworked pretty much all the graphs in Grafana, and we deployed everything in the highly available way we already described: load balancer, TLS, and so on.

And then we came to the next step: what about alarming? We were now gathering data and executing scenarios, but we were not alerting ourselves when something went wrong, and we were not watching the status of ApiMon itself. If the monitoring itself is down, how would you figure that out? So we introduced Alerta. Some of the components were equipped with heartbeats; they send heartbeats to Alerta, and we rely on Alerta to watch that the heartbeats from those components keep coming. Grafana now also evaluates the metrics; when it figures out that something is wrong with the data, it creates an alert and pushes it to Alerta as well. Alerta is then responsible for the deduplication of alerts: for example, we execute the same scenario from inside the cloud and from outside the cloud, and if there is a technical problem inside, we want to be alerted once, not once per source. That's why the deduplication is in place. Alerta then alerts us or the operations people in our communication channel, in our case Zulip. To add to that, we also aggregate external data, like the collected log files in our Swift buckets, so that the people who actually watch the messengers have direct access to the full data: we put a link to the log file of the run right into the message.

Now to the issues and observations with the 2.x version. We use containers; initially we said we would containerize everything, but as a next step we concluded that Docker Compose was not helping us address all our needs. So we started using Podman because of some of its features, for example rootless containers, and immediately started facing bugs with Podman: crashes, internal problems with overlayfs. Sometimes ports are simply not bound: you start your application, but it does not start listening, or the port is not exposed properly, and you have to fight with restarting the process over and over again, or moving it, and so on. So we came to the next version, with extra challenges and solutions from that point.
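To give an idea of the callback plugin mechanism mentioned above, here is a minimal sketch under assumed names, not ApiMon's actual plugin: Ansible's callback API lets you time the whole playbook run and each task, which is how you measure "request sent until server usable" rather than single API calls.

```python
# callback_plugins/scenario_timing.py -- minimal timing callback sketch
# (hypothetical name; not ApiMon's actual plugin).
import time

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = "aggregate"
    CALLBACK_NAME = "scenario_timing"

    def __init__(self):
        super().__init__()
        self._playbook_start = None
        self._task_start = None
        self._task_name = None
        self.task_durations = {}

    def v2_playbook_on_start(self, playbook):
        # Wall clock for the whole scenario, not a single API call.
        self._playbook_start = time.monotonic()

    def v2_playbook_on_task_start(self, task, is_conditional):
        self._task_start = time.monotonic()
        self._task_name = task.get_name()

    def v2_runner_on_ok(self, result):
        if self._task_start is not None:
            self.task_durations[self._task_name] = (
                time.monotonic() - self._task_start
            )

    def v2_playbook_on_stats(self, stats):
        total = time.monotonic() - self._playbook_start
        # A real plugin would push this to the metrics pipeline
        # instead of just displaying it.
        self._display.display(
            f"scenario took {total:.1f}s, per task: {self.task_durations}"
        )
```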
Pretty much inspired by the Zuul design, we again redesigned our scheduler and split it into a separate scheduler and multiple executors, so that with the growing number of different playbooks we run, we are really able to scale further, and not a single host is responsible for running hundreds of playbooks in parallel, which it would of course not even be able to do. The communication between the scheduler and the executors, which will already feel familiar to everyone who knows Zuul, also goes through Gearman. So this component was redesigned heavily, very much inspired by the Zuul design.

Then we faced another challenge: management wanted graphs with integrated details, some aggregation here and there. We found that the combination of Grafana and InfluxDB does not really give you all the possibilities to visualize exactly those requirements with our chosen data design, so we also restructured how we store the data. We completed the work of having all components of the system send heartbeats, and some components even expose their own metrics in addition to that.

Since we are more or less a downstream of OpenStack clouds, we provide additional services on top of vanilla OpenStack. What does that mean? It means that the OpenStack SDK and the Ansible modules are not really capable of supporting those additional services. So, at least for the moment, until we address this issue, we added so-called endpoint monitoring: for the services exposed in the service catalog, we just send pretty dumb GET requests to the endpoint, or even to some sub-URLs of the endpoint, simply trying to verify that it responds to us (see the sketch below). We also started doing an additional type of monitoring, long-term resource monitoring: it is not enough to simply create a resource and delete it; you also really need to watch whether the load balancer provisioned one month ago is still up and running and capable of doing what it should do. So we started doing that as well.

We also started facing troubles with InfluxDB in its open source version, since it does not come with a high-availability solution, and from time to time we were producing up to 100 gigabytes of metrics. Synchronization was a real challenge. We have not solved it as of now, but that was the moment we really started seeing the problems.

Ansible on the one side kind of forced us, and on the other side really helped us, with the introduction of Ansible collections to make this step. We started consuming the OpenStack collection, which was not really ready at that time, just starting to build itself, let's say. And we created our own Open Telekom Cloud collection to bring in modules for the services that are not part of vanilla OpenStack. We also extended support for monitoring different environments, not only production but also pre-production, and we have hybrid installations, so all those monitoring results end up in a single place.
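Coming back to the endpoint monitoring just mentioned: here is a minimal sketch of that idea, not the actual ApiMon implementation. The probed service types and the "anything below a 5xx counts as alive" rule are illustrative assumptions; the endpoint URLs are resolved from the service catalog via the SDK's keystoneauth session.

```python
# Sketch: dumb GET probes against catalog endpoints (assumed "otc" cloud).
import time

import openstack

conn = openstack.connect(cloud="otc")

for service_type in ("compute", "image", "network", "object-store"):
    url = conn.session.get_endpoint(service_type=service_type)
    if not url:
        continue  # service not in the catalog of this cloud
    start = time.monotonic()
    try:
        # raise_exc=False: we want the status code, not an exception.
        resp = conn.session.get(url, raise_exc=False)
        alive, status = resp.status_code < 500, resp.status_code
    except Exception:
        alive, status = False, None
    elapsed_ms = int((time.monotonic() - start) * 1000)
    print(f"{service_type}: {url} -> {status}, alive={alive}, {elapsed_ms}ms")
```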
We even started comparing metrics among themselves, because it is not simply enough to ask how much time it takes to provision a volume. We compare the difference in provisioning volumes of the same size in different availability zones, or how much time it takes to reach a server if you initiate the request from inside the cloud versus the same request from outside the cloud, which gives you really interesting details on networking problems that might not even be related to your cloud itself, but to the cloud's internet connections; those should be monitored as well. We also integrated our internal issue tracker, so that every change done on the backend of the platform, like a config change that has an impact on performance or whatever, can easily be correlated on the graph by the Ops people.

Versions 3 through 3.5 are pretty much our current state. We added support for project cleanup: as already mentioned, from time to time your scenario tries to delete a resource, but due to problems on the backend the API call simply gets stuck, or it tells you "everything okay, I have deleted it" while in reality the resource stays. So from time to time you would go over quota, and we now try to enforce project cleanup. We actually developed the project cleanup properly inside the OpenStack SDK, and here we enforce it to be executed periodically (a sketch follows below). Debug logs: well, Ansible is cool, it gives you output, but in some cases it really hides the proper response of the API, so we need to get at the deeper logs. So far they are pretty hard to get, but we are working on that. We also added support for additional types of projects to monitor, not only Ansible; Ansible was already capable of running everything, but we added special support for the other types. ApiMon is by now really the main API monitoring platform in our clouds, and therefore we were even forced to deploy a staging environment of it, just to be able to pre-check all the changes we are going to deploy. We also deployed EFK for additional handling of logs, and we are working on involving Zuul more, to do continuous delivery for ApiMon itself.

Yeah, well, a couple of screenshots, just to show you how the dashboards look. You know how it is with all those graphs: they pretty much always look the same, funny numbers, funny graphs, funny blocks, lines and so on. Here we show a pretty simple scenario, just to be able to really explain to you how it works in reality. The scenario does a simple thing: it lists all available images in the cloud, it tries to find a single individual image, it creates an image in the cloud, and afterwards, of course, it deletes the image it created. That is pretty much the structure of a scenario; it can be as complex as you want, or as simple as you want. We have several other scenarios in our GitHub repository, which is here on the screen as well, so if you are looking for something like that, feel free to check it out and download or clone the repository.

Now, future plans, without going into details. Most likely we will move from InfluxDB to Graphite. We definitely need to start measuring the HTTP logs of the API calls. And most likely, though we are not yet confident, we will move ApiMon onto OpenShift, or at least offer that type of installation.
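Coming back to the project cleanup mentioned above: the OpenStack SDK exposes it as an operation on the connection. A minimal sketch of how it can be driven, assuming a recent openstacksdk and a clouds.yaml entry named "otc" (keyword names may differ slightly between SDK versions):

```python
# Sketch: periodic project cleanup via the OpenStack SDK.
import queue

import openstack

conn = openstack.connect(cloud="otc")

status = queue.Queue()
# Dry run first: report what would be deleted without touching anything.
conn.project_cleanup(dry_run=True, status_queue=status)
while not status.empty():
    print("would delete:", status.get())

# The enforcing run, e.g. triggered from a periodic job, would then be:
# conn.project_cleanup(dry_run=False, status_queue=status)
```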
Well, we definitely see that scheduling is becoming pretty complex, so that is an area we are watching, and we continue to watch what is actually happening in Zuul: it has pretty much the same problem, the scheduler is simply not a scalable component in itself. So what can we do there? And we keep seeing issues with, and keep working on improving, how to monitor the monitoring. As you know, there are always issues, and you need to ensure that the monitoring system itself is running.

As a summary of what we are doing, very briefly: we run test scenarios. Those can be Ansible playbooks, Python scripts, really literally anything you can run on your Linux machine, or even not Linux, since we run containers, so whatever. We collect metrics from different monitoring locations: inside the cloud, outside the cloud, production, pre-production, hybrids, whatever. We visualize the results on dashboards. We capture alerts from the different components of our own system, try to deduplicate them, and notify our operations department so that they really react to the issues and figure them out. As a result, it helps us quickly detect all the issues we have on the platform: outages, networking problems, performance changes, whatsoever. And a pretty interesting phrase from one of our ApiMon project members, who preferred not to be named: "Great overview comes with great complexity, but it is worth it." It was worth the investment. It was fun, and it was worth it.

Yeah, thanks, thanks, Artem. Before we conclude this presentation and start with the Q&A session, I'd like to point you to our OpenStack scavenger hunt, which we are running at the moment on the occasion of 10 years of OpenStack. It is still open for all of October 2020. You have to answer ten not-so-difficult questions and find the right web pages for them, and then you can, if you want, win a photo drone or some cool nerd gadgets. Feel free to participate. What's left for us is to thank you for attending the session. We are either available for questions right now, or send us a message. See you soon. See you.