So hello everyone, thank you all for coming. My name is Martin, I'm a product manager at Fujitsu. With me today is Roland; he's an architect at HP and the technical lead of the Monasca project. And next to Roland is Vitek; he's a software engineer at Fujitsu and he will give us a demo later.

I'll briefly introduce the topic before going into the technical details. For those of you who are operating an OpenStack cloud and who are not yet concerned about logs, I prepared the list that you can see on the left side. It shows the different components of OpenStack, the services and their implementations, plus some supporting services like MongoDB and so forth. On the right side of the list you can see the number of log files that each of these services writes its application information to. This list is not complete, but it already adds up to 48 log files on just a single node. If we go to a production OpenStack cloud, where we deal with multiple nodes, up to several hundred or even a thousand nodes, the total number of log files, or locations to which logs are written, easily adds up to a three- or four-digit number or even higher. So it's needless to mention that you cannot cope with this using standard Linux tools like grep or sed; you need a more sophisticated solution, a centralized log management solution.

Okay, so that is about logging in OpenStack in general, which concerns the operator. But I'm not only going to talk about this; I'll also talk about logging as an OpenStack service. So what does that mean? I think it's pretty obvious that logging is an issue in OpenStack. It's a basic need, and most OpenStack vendors have reacted to that need by providing vendor-specific solutions. Vendor-specific doesn't necessarily mean proprietary; most of those solutions are actually based on Elasticsearch. But every vendor has its own flavor of how to do it. What we want to achieve with this project is to consolidate all these vendor-specific things into one standardized OpenStack project, an OpenStack service.

If we look into the pretty recent history of OpenStack, we can see that this just happened with Ironic. Bare-metal deployment was a basic need for everyone who seriously operates an OpenStack cloud, and every vendor provided a bare-metal deployment technique as kind of an add-on to the community OpenStack version. Foreman is one of them, for example, but other vendors have others; most are based on tools like Puppet and Chef and so forth. All these vendor-specific things have now been consolidated into one official OpenStack project, called Ironic, and this is exactly what we want to do with logging. Logging is a basic need for everyone who seriously operates an OpenStack cloud, so we want an official OpenStack project for that. And it's not just about that.
It's also about logging as a service. The idea here is to take the same functionality that the operator has for managing his logs and provide it to the OpenStack tenants. Because once you run several VMs, you have syslogs in those VMs, and at some point you'll need a similar solution there too. So the idea is to provide the tenants with the same functionality that the operator has, so they can just consume it in a software-as-a-service cloud model.

Okay, the next slide. Now I'll talk a little bit about the technologies that we use here. One is ELK, or the ELK stack as it's called; the other one is Monasca.

On the left side I'll just briefly mention what the ELK stack is about. I suppose most of the people here have already heard about ELK, so this is not going to be yet another presentation on all the wonderful features of Elasticsearch and Kibana. But for those who are not so familiar with these technologies: it's a combination of three base technologies. One of them is Elasticsearch. This is where all the data, all the log events, are stored. It has a search engine on top, so you can run pretty advanced search queries against this database.

The second technology in this stack is Logstash. Logstash is responsible for collecting, parsing, and transforming logs. It's a small agent that sits on a machine and grabs all the logs that are produced on that machine, and it also has the capability to transform these logs into something more normalized. Different applications typically have different styles of writing logs, different flavors, and Logstash is there to homogenize this a little bit into something more common.

The third one is Kibana, a pretty powerful dashboard. First of all, it provides a search query interface, so you can type Google-like searches: okay, give me all log entries that contain the keyword "error" and that appear on a specific host, or something like this (a sketch of such a query follows below). In addition, you can build pretty advanced graphs, so you can spot trends and anomalies that appear in your logs and so forth. It's pretty powerful.

The combination of these three technologies is called the ELK stack, and it's pretty common. If you do a Google search for a centralized log management solution, ELK is probably one of the first things you'll stumble upon, and it's so powerful that it can even compete with proprietary and expensive solutions such as Splunk.

On the right side there is Monasca, the other technology. It's an OpenStack project; it's not yet official, but it's in the process of incubation, so we'll hopefully see it in the big tent of OpenStack during the next months. It's a monitoring-as-a-service solution. It's very performant, scalable, and fault-tolerant; it has basically everything that you need to run it as a cloud service. What is also being done, basically driven by HP, is the integration of a complex event processing engine to do correlation between different events, and we also want to leverage this for our logs. We'll explain a bit more about this later on.
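As an editorial illustration of the kind of "keyword error on a specific host" query described above, here is a minimal sketch issued directly against Elasticsearch's `_search` API with Python's `requests`. The index pattern and field names are assumptions; they depend on how Logstash is configured to ship the logs.

```python
import json
import requests

# Hypothetical index pattern and field names; adjust to your Logstash setup.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "error"}},       # keyword to search for
                {"term": {"hostname": "compute-01"}},  # restrict to one host
            ]
        }
    },
    "size": 10,
}

resp = requests.post(
    "http://localhost:9200/logstash-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("message"))
```

Kibana essentially builds queries like this under the hood; the dashboard is a friendly front end to the same API.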
So the main players behind Monasca, you can see them on the right side: besides Fujitsu and HP, there is also Time Warner Cable, who use Monasca in their production OpenStack system for monitoring, plus Cisco, Rackspace, and Cray.

Okay, so we are integrating these two technologies. Why should we do that? Why don't we start a separate project on logging? First of all, and this is one of the less tangible things, these are two related topics. Both metrics, which Monasca handles at the moment, and logs, which ELK is good at, give you a health status of your system. Metrics tell you: is my machine still running, is my CPU load too high? And logs: if your application starts to write funny things into the application log, you should know about that too. That's what the symbols you see here are meant to show. Basically, you want to know whether everything is okay or not, and if it's not okay, you want to find the root cause as quickly as possible, so you can react and avoid any damage to your service.

Apart from this, we also want a functional extension of ELK. ELK is very powerful by itself, but we want to extend it, first of all with multi-tenancy. Native ELK cannot really do multi-tenancy, so we are programming an API for it. We also want to achieve higher performance and scalability, because that is very important in OpenStack.

One of the future things we're working on at the moment is the possibility to define alarms on logs. You can say something like: if a critical error is written to any of my application logs, send me an email or make my smartphone ring. Or: if the total number of warnings per minute exceeds a certain threshold, also send me an alarm. And there are more advanced things, like correlation between metrics and logs. Typically, if your application log starts to fill with errors and the root cause is in the infrastructure, say your CPU load is too high, which causes your application to log timeout errors, you can aggregate these alarms, so instead of multiple alarms you get just a single one. (A sketch of what defining such a log alarm could look like follows below.)
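As a rough illustration of where this is heading, defining such a threshold alarm would be one call to Monasca's alarm-definitions API. This is a sketch under assumptions: the metric name `log.error.count` is hypothetical (log-derived metrics are future work, as described above), and the endpoint and port follow Monasca's documented v2.0 API but should be checked against your deployment.

```python
import json
import requests

MONASCA_API = "http://monasca:8070/v2.0"  # hypothetical endpoint/port
TOKEN = "..."                             # a valid Keystone token

# Alarm if the (hypothetical) per-minute error-log count for this host
# averages above five.
alarm_definition = {
    "name": "too many error logs on compute-01",
    "expression": "avg(log.error.count{hostname=compute-01}) > 5",
    "severity": "HIGH",
}

resp = requests.post(
    MONASCA_API + "/alarm-definitions",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
    data=json.dumps(alarm_definition),
)
print(resp.status_code, resp.json().get("id"))
```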
Okay, so I spoke a little bit about what we want to do and why we want to do it. Now we'll go a bit into the technical details, and Roland will start with this and tell us a bit more about the Monasca project and the metrics processing there.

Okay, how many people here know about the Monasca project, just by show of hands? Okay, quite a few, but I'll do a very, very brief overview of it. There's going to be another session immediately after this one that goes into more detail, if you'd like to know more: less detail on the logging side, more detail just about Monasca in general.

So today Monasca is very focused on monitoring as a service, and primarily on metrics. We have a first-class RESTful API for monitoring. It supports authentication via Keystone, and everything within Monasca is scoped to a tenant, so multi-tenancy. It's highly performant, scalable, and fault-tolerant, and it's a microservices, message-bus-based architecture, which provides flexibility, extensibility, and load balancing.

Microservices are a fairly new concept; it's come up in the last couple of years. So what do we mean by microservices? We have small, relatively autonomous components that can all be deployed separately. Communication occurs over a network; normally people think of that as HTTP, but in our case it's usually over the message bus, which is based on Apache Kafka. These components can be deployed independently; in fact, you can run Monasca with or without any of the components in the architecture diagram, or add your own components.

Some of the technologies that we use within Monasca: it's built on Apache Kafka as our message queuing technology. Within OpenStack, RabbitMQ is very popular, but we have not chosen RabbitMQ for our service; we're using Kafka. Kafka was developed by LinkedIn, and they use it within their company for their messaging. Another technology we use is Apache Storm, which is used in our threshold engine. Apache Storm was developed at Twitter; it's a computational engine where you describe a graph, or topology, and within that topology you have elements called bolts that do calculations. We also support a database called InfluxDB, which is a time-series database. And then we have some other technologies we're using, the latest in real-time streaming and big-data infrastructure.

The API today is really focused on metric storage and retrieval, thresholding or alarms, and notifications. We have some things in progress, obviously the logging that you're hearing about today, but also the real-time streaming complex event processing engine. That's been in progress for about a year now, so hopefully we'll wrap it up in the next few months.

This is the overall architecture for Monasca today. There's a lot of similarity to the way we're approaching events and logging. Within this architecture, up in the upper right corner is an agent. The agent is typically deployed on the system that you're monitoring, but if you're doing what we would call active checks, like remote checks of systems or HTTP endpoints, you would have the agent deployed on another system or on multiple systems. The agent monitors the system: typically we're collecting system metrics like CPU, network, memory, and disk space, and it can monitor services like MySQL, Apache, or the OpenStack services. It supports statsd, which is built into it, and a few other things.

The agent sends metrics to the API. In the diagram there are three boxes here; this is all highly available and fault-tolerant, so the API would be deployed across multiple nodes. Typically in our deployments we think about doing things in terms of three nodes. The metrics come in there and are published to the message queue, which is Apache Kafka. That gives us the ability to durably store those messages, or metrics in this case: Kafka is a fully durable message queue, so all the metrics are stored to disk immediately, and it's also clustered, so Kafka runs across three nodes.

Now, we have three components down there in this diagram; sometimes I show more than three. The persister consumes metrics off of the message queue and does batch writes into whatever database is configured, shown in the lower-right box as the metrics and alarms database.
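To make that agent-to-API step concrete, posting a measurement is a single authenticated POST of a JSON body with a name, dimensions, a millisecond timestamp, and a value. This is a sketch based on Monasca's documented v2.0 metrics API; the endpoint and port are assumptions, and in practice the monasca-agent batches measurements and handles Keystone authentication for you.

```python
import json
import time
import requests

MONASCA_API = "http://monasca:8070/v2.0"  # hypothetical endpoint/port
TOKEN = "..."                             # a valid Keystone token

# One measurement: 85% user CPU on compute-01, timestamped in milliseconds.
metric = {
    "name": "cpu.user_perc",
    "dimensions": {"hostname": "compute-01"},
    "timestamp": int(time.time() * 1000),
    "value": 85.0,
}

resp = requests.post(
    MONASCA_API + "/metrics",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
    data=json.dumps(metric),
)
print(resp.status_code)  # the API replies 204 on success
```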
That database is typically InfluxDB; we also support Vertica, which is a proprietary database, and we are working on supporting Cassandra. The threshold engine then consumes the metrics from the message queue and evaluates whether they exceed certain thresholds, or whatever alarm calculations you want done. If a threshold is exceeded, it publishes another message back to Kafka, an alarm state transition event, which is consumed by the notification engine. That decides whether it should send an email, a PagerDuty alert, or a webhook. The alarm state transitions are also consumed by the persister and stored in the database, so we have a history of all of our events.

In the lower right, MySQL is our configuration database. That's where we store all the alarms and notifications that we want the system to actually calculate. We have a panel that's integrated with Horizon, and we also support a time-series visualization dashboard called Grafana. Today we support Grafana 1.0, and as of yesterday Time Warner Cable has posted code to support Grafana 2.0. It's not quite done yet, but it's up there and ready to be used and tested; probably in a couple of weeks it'll be complete. And we also have a Python client, just like every OpenStack project has a Python client. So anyway, that's basically the architecture.

Okay, complex event processing. Am I going too long? All right, good; sometimes I can talk forever, and there's a lot to go through here. Our complex event processing is work in progress; we're primarily working on this with Rackspace. Some of you might have heard of the StackTach project; there's StackTach v3, and those are libraries that we're using within our event processing engine. Similar to the metrics pipeline, there's an events pipeline, with an API that we can store events in and query events from. We can specify transforms on those events. Our use cases typically focus on OpenStack notifications: we take a notification and transform it in some way, reduce it (notifications can be very large), normalize the data, maybe convert timestamps, things like that. Then we take those events and create streams out of them, or filter those events and group them by some criteria like a tenant ID or an instance ID, and when certain conditions occur, we do some calculations.

One of our canonical use cases is VM creation. When a VM is created, it goes through a number of events; there are about 24 events over the life cycle of just creating a VM. What we typically want to do is calculate the elapsed time from the start event to the end event, when the VM is actually active, and from that we create a metric. Then we publish that metric back into Kafka, and we can operate on it: we can alarm on it, visualize it, store it away, and ultimately create alarms for when, for example, your VM creation exceeds some average time. If VMs were taking two minutes and all of a sudden it's three minutes, that's something you might want to know. (A small sketch of this calculation follows below.)
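Here is a minimal, self-contained sketch of that elapsed-time calculation: scan the notification stream for the create start/end events of one instance and emit a metric-shaped dict. The event type names follow Nova's compute.instance.create notifications; the metric name and event layout are illustrative, and the real engine expresses this as stream definitions and fire conditions rather than a hand-rolled loop.

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%S.%f"

def vm_create_time_metric(events):
    """events: dicts with 'event_type', 'timestamp', and 'instance_id'."""
    start = end = None
    for e in events:
        if e["event_type"] == "compute.instance.create.start":
            start = e
        elif e["event_type"] == "compute.instance.create.end":
            end = e
    if not (start and end):
        return None  # stream incomplete; keep waiting for more events
    elapsed = (datetime.strptime(end["timestamp"], ISO)
               - datetime.strptime(start["timestamp"], ISO)).total_seconds()
    return {
        "name": "vm.create_time_sec",  # hypothetical metric name
        "dimensions": {"instance_id": end["instance_id"]},
        "value": elapsed,
    }

# Example: a reduced two-event stream for one instance.
events = [
    {"event_type": "compute.instance.create.start",
     "timestamp": "2015-05-20T10:00:00.000000", "instance_id": "abc"},
    {"event_type": "compute.instance.create.end",
     "timestamp": "2015-05-20T10:02:30.000000", "instance_id": "abc"},
]
print(vm_create_time_metric(events))  # value: 150.0 seconds
```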
Okay, so this is all in progress. This is the event processing architecture, and the reason I'm showing it is that it looks similar to the metrics architecture, and that's why we think doing this with logging is a good fit too. We have similar technologies, similar architectural and design patterns; the components operate in a similar way, and that's part of the story of why we think this makes sense. All right, I'm going to turn it back over to Martin now.

Yeah, my turn again. Now I'll talk a bit about the technical details of the logging side. This is just a brief wrap-up of what I talked about before. Importantly, it's built on Elasticsearch, Logstash, and Kibana, also known as the ELK stack. It's basically the most popular open source technology for this, so we take it, basically because people like it, and there is a huge ecosystem around it. We add some things to it, for example an API, and a Logstash output plugin that communicates with that API, to facilitate multi-tenancy and some degree of security, so it can run as a service. And by integrating this into Monasca, we also leverage the proven technologies that we have in Monasca, such as Kafka, which Roland just mentioned, and some common design patterns that we'll see on an extra slide later. So the added value over a pure ELK stack is essentially logging as a service, greater scalability and performance, and alarms on logs. That was the brief wrap-up.

This is the architecture we are targeting, and it's where we almost are today. In the top right corner you can see Logstash, the log collector. This is the agent that sits on the machine and collects everything. We wrote an output plugin for it (Logstash has a pretty powerful plugin architecture) that posts the logs to the log API, which is a REST API. From there the logs go on to Kafka. We also have a log transformer that can normalize the logs, transform them into whatever format you like, and put them back onto Kafka. Then there is a persister that takes these logs from Kafka and stores them in Elasticsearch. And on the top left you can see Kibana, the dashboard I mentioned before, which is there for accessing and analyzing these logs.

There is also an additional arrow that you can see here. This is where we are not today, but what we would like to do: provide the possibility to query Elasticsearch directly from the outside world. Kibana is good for human interaction, it's a pretty powerful tool, but if you want to run this as a cloud service and do automated analysis of your logs from outside, you'll also need the ability to query these logs programmatically. So it may be that in the future this part of the REST API is drawn to the left here, so that everything Kibana queries from Elasticsearch goes over the API. But this is still in discussion, so we don't really know yet whether we're going to go that way.
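To make the collector-to-API step at the start of this pipeline concrete, here is a rough sketch of the kind of call the Logstash output plugin makes: an authenticated POST of a log entry plus dimensions. The endpoint path, port, and payload shape are assumptions modeled on the project's log API proposal, not a finalized interface.

```python
import json
import requests

LOG_API = "http://monasca-log-api:5607/v2.0"  # hypothetical endpoint/port
TOKEN = "..."                                 # a valid Keystone token

# A single log entry; the dimensions carry the metadata (hostname, service,
# and so on) that you later filter on in Kibana.
entry = {
    "message": "ERROR neutron ... Connection to the server timed out",
    "dimensions": {"hostname": "network-01", "service": "neutron"},
}

resp = requests.post(
    LOG_API + "/log/single",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
    data=json.dumps(entry),
)
print(resp.status_code)
```

Multi-tenancy then falls out of the token: the API scopes every entry to the tenant the token belongs to before it ever reaches Kafka.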
Another possibility may be that we design our own query language. Loggly, I don't know if you know it, is a public cloud logging service that is based on Elasticsearch as well. They designed their own query language, which is probably kind of similar to Elasticsearch's but provides only a subset of what Elasticsearch can actually do, basically for security reasons. Elasticsearch has not really been designed for multi-tenancy, so if you run a query like "give me all the events, every two seconds," Elasticsearch will basically freeze and not do anything anymore, because it's totally overloaded. We would like to avoid that, so we want to restrict a little bit the access that the outside world has to Elasticsearch.

Okay, this is a sequence diagram. On the top you can see a sequence that is basically what I just explained, relatively straightforward: the log collector, which is based on Logstash with the Monasca output plugin, publishes the logs to the log API; those go to Kafka; the log transformer turns them into something normalized, which goes to Kafka again; and from there they are taken by the persister and written into Elasticsearch.

In the red box on the bottom you can see where we want to go in order to provide alarms on logs. We basically leverage the existing technologies that we have here, so there's not really any additional implementation necessary; that's our plan, at least. We use the log transformer, which uses Logstash, to make an event out of these log entries, and we can filter for specific entries. So we can say: publish an event on Kafka if the log entry contains the keyword "error," for example. An event in this context means an event that the complex event processing engine Roland mentioned before, which is StackTach, can understand; producing that is the job of the log transformer. The event engine is then there to turn multiple events into a metric. It can count events, so you can build a metric like "number of log entries containing the keyword error, per minute or per hour," and it publishes that back to the Kafka queue. Then there is the threshold engine, which is part of the existing Monasca, with which you can define an alarm on this. You can say: if this number exceeds five error logs per minute, send me an alarm. If the threshold is exceeded, the threshold engine creates an alarm that goes via Kafka into the notification engine, which you can see here on the right. In the notification engine you can create notifications that are emails, we also have an integration with PagerDuty, and if you want to trigger some automated process, you also have webhooks. (A small sketch of the counting step in the middle follows below.)
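Here is a minimal sketch of that counting step as a plain Kafka consumer: read transformed log entries, count the ones containing "error" per host over a one-minute window, and publish a metric back to Kafka. The topic names and the metric name are hypothetical, it uses the kafka-python library, and the real pipeline runs this logic inside the StackTach-based event engine rather than a standalone script.

```python
import json
import time
from collections import defaultdict

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("transformed-log",           # hypothetical topic
                         bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

counts = defaultdict(int)
window_start = time.time()

for msg in consumer:
    entry = json.loads(msg.value.decode("utf-8"))
    if "error" in entry.get("message", "").lower():
        host = entry.get("dimensions", {}).get("hostname", "unknown")
        counts[host] += 1

    # Flush one metric per host roughly once a minute. (Piggybacking the
    # flush on message arrival keeps the sketch short; a real consumer
    # would use a timer.)
    if time.time() - window_start >= 60:
        for host, n in counts.items():
            metric = {"name": "log.error.count",  # hypothetical metric name
                      "dimensions": {"hostname": host},
                      "timestamp": int(time.time() * 1000),
                      "value": n}
            producer.send("metrics", json.dumps(metric).encode("utf-8"))
        counts.clear()
        window_start = time.time()
```

From there, the existing threshold engine can alarm on `log.error.count` exactly as it would on any other metric.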
Okay, so this is the last slide about the architecture; it's kind of a summary, and it shows how these three big functionalities that we have in Monasca are combined. What is important is the central role of the message queue; we call it a message-bus-driven architecture. The message bus is the component that basically connects everything, which makes sense because, as I just explained for the alarms-on-logs use case, we take components from all three of these big functionalities and combine them to create new use cases and extend the overall functionality. It should also show the architectural design pattern that we use for all three components: you have the collection of data here at the top, it goes into an API, the API publishes everything onto the bus, and below you have several consumers of this queue that process your data, analyze your data, or do whatever you want with it.

And it should show the extensibility of this architecture. I'm pretty sure some of you may think of use cases that we haven't even thought of, and if you want to do some more advanced stuff, it's pretty easy in this architecture to create a new consumer that sits down here; you just hook it into the message queue. We are creating a system that collects all the data that is relevant to your OpenStack cloud, and even more than that, to your entire data center; the data collection at the top is not limited to OpenStack. So you can create whatever consumer you like to draw whatever conclusions you're interested in from all this data.

Right, so this was about the architecture. Now comes Vitek's part, and he will give us a live demo.

Thank you, Martin. Hello everyone. So yeah, I will show you a live demo of the current state of our log management solution; hopefully it will work. Let's log in to Horizon. Here is my Monasca overview panel. What we see here: we are monitoring several services and metrics from two different hosts, everything is green, so everything is working fine. What I would like to do now is simulate a hypothetical disaster scenario. Can you see that? So I'll just stop the neutron-server and append a line to the neutron-server log file. There we go.

Where I would like to start is our log collector, which is Logstash. I will show the configuration, agent.conf. As input we define the locations of the log files, and we attach a type to each of the logs so we can find them again more easily in the Kibana dashboard. Then we have a filter section where we define the match patterns for multi-line entries; we collected several types of match patterns for the most common log formats (a rough sketch of that multi-line merging idea follows after this walkthrough). And here at the end is our contribution: the configuration of the Logstash output plugin. It authenticates with Keystone and sends the log entries to the Monasca log API. Additionally, we provide metadata here in the form of dimensions. I just put in the hostname and the geographic coordinates of Tokyo, but you could really put anything here, any information that the operator finds interesting and important.
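As an aside, here is what that multi-line handling boils down to, sketched in Python rather than Logstash's own filter syntax: a line that starts with a timestamp opens a new log entry, and anything else (a Python traceback, say) is treated as a continuation of the previous one. The timestamp pattern shown matches the usual OpenStack log format; other services need their own patterns, which is why the config collects several.

```python
import re

# OpenStack-style entries begin with e.g. "2015-05-20 13:37:00.123 ..."
NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def merge_multiline(lines):
    """Yield complete log entries, joining continuation lines."""
    entry = []
    for line in lines:
        if NEW_ENTRY.match(line) and entry:
            yield "\n".join(entry)
            entry = []
        entry.append(line.rstrip("\n"))
    if entry:
        yield "\n".join(entry)

# Usage:
#   with open("/var/log/neutron/server.log") as f:
#       for e in merge_multiline(f):
#           ...
```

Without a step like this, every traceback line would arrive in Elasticsearch as its own entry, which makes searching miserable.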
So let's come back to our monitoring UI. We can see we have alarms in the networking service; let's take a look. Yes, we have a process check alarm that fired, and an HTTP status alarm. We can see the dimensions for these alarms, which give us some additional information. As Martin said before, we plan to integrate event processing for log management so that we can also create alarms on logs, but that's in the future.

So this is just the metrics side. If we want to investigate the problem, we want to find the logs related to it in one place, and find them quickly; that's why we integrated the log management dashboard into the monitoring UI. At this point we verify the Keystone credentials and the Keystone role, so that we have fine-grained control over which users can view the alarms, which users can define alarms, and which users can view the logs.

So we come to the Kibana dashboard. We won't go into too much detail about Kibana, but what I would like to show is that you can easily find all the information we supplied in the log agent: the hostname, the message itself, the application type, and the log level, which is extracted automatically from the message. You can find all of it easily in Kibana. Kibana is a powerful tool for visualization and filtering of your Elasticsearch data, and you can create fancy dashboards; it's highly configurable, so every operator can create his own dashboards. Here on the right we can see the locations of my log agents; on the left, in this donut chart, we see the distribution of log entries between the servers and between the application types. Here is the histogram with the log level information, and at last the log entries themselves with their metadata.

So let's look for our error. Let me see, which service was it? Neutron, right. It's not in my top five, so I will have to type it; I typed it once already, surprise. Yeah, so now I can see only the logs for neutron, and I can see there were some logs and an error at the end. I can filter down to this time range. Yes, here is the error. Here it is. Thank you.

Okay. Thank you all for your attention. If you want to get in touch with us, you can drop us an email, or, what is even better, we have weekly meetings every Wednesday. If you want to get the code and some more information, you can visit our wiki page. And as Roland said before, there is a talk about Monasca that covers more on metrics and complex event processing in the room nearby, and we also have a team meeting this afternoon. I think, well, one minute left; I think we have time for just one question.

Yeah, so that was lucky. Sorry, I've got two questions really, and I'll make them super quick. One is: how do you handle the chicken-and-egg problem of having your monitoring solution running on the platform that you want to monitor? That would be the first question. And the other one, regarding the CEP engine: are you using an existing open-source technology, such as Esper for example, or are you planning to?
Take the mic. Okay, so for everyone: for the CEP engine, it's StackTach. I'm not so sure about your chicken-and-egg question, honestly speaking. "If I'm monitoring my infrastructure on the same set of hosts and services that may go down, then I may lose all the information about the monitoring infrastructure. So you could set up a parallel monitoring infrastructure, but then that needs to be monitored as well. How do you handle this?" No, so typically Monasca would be deployed on additional infrastructure that runs next to your OpenStack system. And it's also fault-tolerant, so if something goes wrong, if one of the servers that Monasca runs on fails, the application can basically handle this. We also have self-monitoring capabilities, so Monasca can take care of itself to some degree. And if you want to be really highly fault-tolerant and highly available, I think it probably makes sense to have two availability zones, where each monitoring installation monitors the other one. Okay, so I'm sorry, time is over. Thank you all for your attention.