So, hello everyone, welcome. This is Roland Hochmuth, the project technical lead of Monasca, and I'm Witek; I'm working as a software engineer at Fujitsu, and we will be talking today about log management with Monasca. So, how many of you have been to the Monasca bootcamp today? Yeah, a couple. And how many of you know Monasca already? Okay, I see there are some people who don't know about Monasca, so I will start with the introduction, our motivation and our goals: what Monasca and logging with Monasca are about. Then I will go through the architecture overview, then give an update about the new features and new development in the project, and at the end I will present a short demo of the solution.

So, the introduction. When you look at the logs in an OpenStack deployment, you have several components, and each of them has several log files. When you sum them up on a single node you get something like 50 or even more, depending on the deployment. And when you take into account that in a production environment you can have hundreds of servers, it becomes clear that you would not want to look for every single log file somewhere in the system. You need a centralized log management solution. This need was of course recognized by many vendors and there are many solutions for it, but there is no standardized OpenStack solution, and that is our motivation: to replace, or at least offer an alternative to, vendor-specific solutions for centralized logging. Offer logging as a service; offer a single endpoint for operators, and also for users, to post and collect the logs through a single RESTful API. Such a solution brings many advantages. You get isolation from the underlying technology and transport layers, so you don't have to care if the technology changes over time; you have your constant API. You get authentication and multi-tenancy: you can authenticate with Keystone, you get role-based access control to the logs, and you can also validate the input. So you could say "I will not accept logs bigger than this" or "the logs have to come with specific metadata"; it helps to organize that.

Which technologies are we using? We're not reinventing the wheel; we're using proven technologies. When you search for a centralized logging solution, the first hit you get will probably be the ELK stack, which stands for Elasticsearch, Logstash and Kibana. Elasticsearch is a search and analysis engine for real-time data. Logstash we use for collecting, parsing and transforming the logs; it is a flexible tool thanks to the many plugins which come built in with Logstash, and you can also develop your own for your customized needs. And Kibana is the modern visualization tool which we use as the graphical dashboard. The combination of these three is a really competitive solution which, as I said, is widely used as state-of-the-art technology. And we combine it with Monasca, which is a highly performant, scalable and fault-tolerant monitoring-as-a-service solution. It is built on a microservices architecture, and the central point of it is the message queue, Kafka, a modern high-throughput distributed messaging system. Why are we bringing monitoring and logging together? Well, they are related topics; they both indicate the health of the system. So when you collect the metrics you want to know the health, and if you want to go deeper you want to look at the logs.
If you want to analyze the root cause, you can start with the metrics and then go deeper and search for the log entries. Combining monitoring with logging brings functional and non-functional extensions to standard ELK: you get multi-tenancy, you get the centralized endpoint for operators and users, and it is the basis for good performance and scalability. By combining metrics and logs you can create alarms on logs: you can define criteria, search for specific behavior, and then define alarms for it. Through the correlation of metrics and logs you can define scenarios where both of these criteria are evaluated, and the operator is notified when the alarm is triggered.

Monasca became an official big tent project in November last year and it has won a lot of attention from the community. Apart from HP and Fujitsu there is Time Warner Cable, already for a couple of years; they are monitoring the private cloud they operate. There is Cisco, who has been focused on the Ceilosca project, which collects the telemetry data from Ceilometer and puts it into Monasca. At this summit Fabio Giannetti will have a talk about integration with Congress, which is the policy-as-a-service project; it will be on Thursday, right? Time Warner Cable is also presenting on Wednesday, and there are Cray, Broadcom, NEC, EasyStack and SAP, and hopefully in Barcelona there will be more on the list. So let's come to the second part, the architecture, which Roland will take over.

Apparently I don't know how to operate a computer. Okay, so I'll talk about the metrics capabilities that are part of Monasca. Monasca has been out there as a project really focused on monitoring as a service, and obviously today we're talking about logging as a service. The Monasca project is designed around a first-class RESTful API for monitoring. By first class I mean that the primary way to interact with the system is via an HTTP endpoint. That's how all monitoring data is sent into the system; that's how all monitoring data is queried out of the system, in the form of measurements or statistics. It's also the way the system is controlled: you can create alarms in the system, actually alarm definitions; you can look at your alarms; and you can create notification methods that are associated with metrics and then send out email, create PagerDuty incidents or invoke webhooks. The service doesn't have too many dependencies on OpenStack, but Keystone is used for authentication, and all the data that we store in the system is scoped to a tenant; that's key for multi-tenancy. It's highly performant, scalable and fault tolerant. It's based on a microservices, message-bus architecture, which provides a lot of flexibility, extensibility, load balancing, et cetera. I have an architecture slide; we're going to go into a little more detail about that. It's built on a number of big-data-type technologies. Monitoring has evolved a lot over the past 20 years, but within the last few years specifically it has become much more of a big data and analytics problem, as people try to do more than just look at their data for operational purposes only. People are doing analytics on it, using it in their monitoring and billing systems, going back and doing root cause analysis and applying machine learning algorithms to it. So there are a lot of big data technologies used in Monasca.
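To make the first-class API idea concrete, here is a minimal sketch of pushing a single metric over HTTP. The endpoint path, port, header and field names are assumptions for illustration, not quotes from the talk.

```python
# Minimal sketch: pushing one metric to the Monasca API with plain HTTP.
# Endpoint path (/v2.0/metrics), port, header and field names are assumptions
# based on how the talk describes the API, not taken verbatim from it.
import time
import requests

MONASCA_API = "http://monasca-api.example.com:8070"   # hypothetical endpoint
KEYSTONE_TOKEN = "gAAAA..."                            # obtained from Keystone beforehand

metric = {
    "name": "cpu.utilization_perc",
    "dimensions": {"hostname": "compute-01", "service": "compute"},
    "timestamp": int(time.time() * 1000),   # milliseconds since the epoch
    "value": 87.5,
}

resp = requests.post(
    f"{MONASCA_API}/v2.0/metrics",
    headers={"X-Auth-Token": KEYSTONE_TOKEN, "Content-Type": "application/json"},
    json=metric,
)
resp.raise_for_status()   # success is an empty (No Content) response
```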
Apache Kafka is used for our message queue. If people aren't familiar with Kafka, it's a technology that came out of LinkedIn, and within the past year or two it has really been adopted by the big data community as the message queuing technology. There are a few others, but that one has really emerged as the dominant one right now, I would say. And this week the first Kafka Summit is actually taking place, in San Francisco I believe. Apache Storm is used in our threshold engine; the threshold engine is used to evaluate alarms, and that's a technology that came out of Twitter. Apache Storm is basically a real-time streaming computation engine: you define a graph, which in their terminology consists of bolts and spouts, and in this topology you can do different computations, stream results to other nodes in the graph, aggregate them, et cetera. We use it for our threshold engine today. We support several databases for storing metrics and our alarm state history; today we support InfluxDB and Vertica, and we've been looking at some other databases, potentially Cassandra and Elasticsearch. That's really for storing the real-time streaming data. You can use MySQL or Postgres within the solution as well, and that's for storing your config data. The system allows you to store metrics and retrieve them, set alarms and thresholds on them and then send notifications. That basically allows you to have a metrics processing system and, at the same time, do status and health alerting like you would traditionally do in a tool like Nagios. And then we have some other things that are in progress, like real-time stream processing.

Which one is it? That one? I'm not going to touch this. Okay. So this is really the architecture slide for the metrics components. Up in the upper right we have our Python monitoring agent. This is an optional component. The monitoring agent is deployed on the systems that you're monitoring, most of the time. It can also do what we call active checks, borrowing the terminology from Nagios: it can do HTTP endpoint checks or system up/down checks, like host status checks. It sends lots of system metrics, like CPU, memory and networking metrics. It has a number of plugins built into it for getting metrics about services like MySQL or RabbitMQ or, in our case, Kafka, and we're in the process of adding support for the OpenStack services as plugins. It has a built-in StatsD daemon, so if you want to instrument your application StatsD-style, you can do that in combination with the agent; we have an extension to that which allows you to use the dimensions that are built into Monasca. Okay, so that's enough about the agent.

Within the system, the agent publishes or posts metrics to the API, so we are what would be called a push model in a monitoring system. That data is ingested by the API. Not shown in the diagram is the authentication that we do against Keystone; we don't authenticate with Keystone on every packet, we cache auth tokens for a configurable period, five to ten minutes usually, because you don't want to overload Keystone with lots of auth requests. Then, assuming the data passes, it's published to Kafka; that middle box there is the Kafka message queue. It's published along with the tenant ID and a little bit of other meta information. The layers below, the persister, the threshold engine and the notification engine, are all independent microservices that get deployed completely independently of one another.
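Since each of those layers is essentially an independent consumer on the Kafka bus, the pattern can be sketched roughly like this; the topic name, message shape and storage call are assumptions for illustration, not the actual persister code.

```python
# Illustrative-only sketch of the microservice pattern described above: a
# persister-style worker that consumes metric envelopes from Kafka and writes
# them to a time-series store. Topic name, message layout and the storage call
# are assumptions, not the real Monasca persister.
import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "metrics",                          # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="persister",               # consumer groups allow horizontal scaling
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

def store(envelope):
    """Placeholder for the write into InfluxDB, Vertica, etc."""
    print("would persist:", envelope.get("metric", {}).get("name"))

for message in consumer:
    # Each message carries the metric plus the tenant ID added by the API.
    store(message.value)
```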
The persister consumes metrics from the Kafka message queue and stores them in our metrics and alarms database. The threshold engine consumes those same metrics; it has an internal state of all the alarms that have been defined in the system, and it does threshold calculations on them. If a metric has exceeded its threshold, that will trigger an alarm state transition event. So the threshold engine consumes metrics and publishes alarm state transition events back to the message queue. The notification engine then consumes those alarm state transition events, determines whether a notification is associated with them, like an email, a PagerDuty incident or a webhook in our case, and sends it off if something needs to be sent. It will also publish those messages back to a retry topic in Kafka if sending fails. There's a config database in the lower left, that's MySQL or Postgres, which stores all the configuration information for the system: what alarm definitions have been defined, what alarms have been created and what notification methods have been created. The reason for two databases is that MySQL and Postgres are really good at transactional-type updates; you have maybe a certain number of alarms and notification methods that go through a traditional CRUD life cycle, they get created, read, updated and deleted, so it's not a huge amount of data but it's updated often. Whereas on the right, the reason for those other databases is to store lots and lots of streaming data that's being ingested into the system, and to be able to sort and group that data efficiently in the database so that when you query it later on, it is efficient. In the upper left, Horizon is sitting up there; we have a monitoring panel that we've added to Horizon, and there's also a Python client for using it on the command line or within Python programs as a library that can be imported. Oops, I did it. Thank you. All right.

So that was the metrics part, and we are adding the logging part based on the same architecture patterns. As I said, we are building on Elasticsearch as the database for real-time data, on Logstash and Kibana, and we leverage all the proven technologies of Monasca. What does it bring? It brings logging as a service, it offers better scalability and performance compared to the standard ELK stack, and by integrating with Monasca we can offer alarms on logs. At the current stage it's just alarming on errors or warnings in the log messages. The architecture looks like this; as you can see, the design is very similar to the metrics part. Up here we have the agents; we use two technologies, Logstash or Beaver, and they are responsible for collecting the log messages from the files, authenticating with Keystone and sending the data to the Log API. The Log API authenticates the request, appends the project ID to the log message, validates the input and the metadata, and then publishes the information to the Kafka message queue, which is shared with the metrics part. Then we have these components at the bottom, which each do their specific job. We have the log transformer, which is responsible for normalizing and parsing the data from the log messages; it can, for example, parse the log level here, and you can make sure that your log messages are treated the same way for all agents.
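As a rough illustration of that normalization step, a transformer-style parser could pull the log level out of an OpenStack-style log line like this; the line format and field names are assumptions for illustration, not the actual transformer code.

```python
# Rough illustration of the transformer's normalization step: pull a log level
# out of a raw OpenStack-style log line so every agent's messages end up in a
# common shape. Line format and field names are assumed for this sketch.
import re

# e.g. "2016-04-25 09:15:02.123 4711 WARNING nova.compute.manager [...] message"
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")

def normalize(log_entry: dict) -> dict:
    """Add a 'level' field parsed from the raw message text."""
    match = LEVEL_RE.search(log_entry.get("message", ""))
    log_entry["level"] = match.group(1) if match else "UNKNOWN"
    return log_entry

print(normalize({"message": "2016-04-25 09:15:02.123 4711 ERROR nova.compute.manager boom"}))
# -> {'message': '...', 'level': 'ERROR'}
```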
Then we have the new component, Log Metrics, which filters the entries and generates new metrics; I will tell you more about it in a moment. And then we have the log persister, which stores the log messages in Elasticsearch. We can query the logs with the UI, with Kibana; we also have an authentication plugin here which controls which users and which roles can access the logs. As planned development we also want to add this path here, which means the Log API should also be able to query the logs directly from the database. And that's the complete combined architecture. Just as long as I don't have to touch the mouse again. Let's see.

So, the combined architecture. There's a lot of similarity here, which is in part why we're doing this: we've got similar technologies and similar architectural patterns, for example similar uses of Kafka. In the center is pretty much the same diagram that I showed earlier with metrics. On the left is logging, and on the right we've got events, which is still in progress; if you've seen me talk in the past you might have seen me talk a little bit about the events solution, and that'll be coming. The similarity between all of this is one of the reasons why we're combining these projects, and we're going to talk about a use case later where we take you through the sequence of how, when a log message gets generated, we generate metrics from it. The idea isn't to operate these two projects as mostly independent things that happen to share a few components; you can compose them together. So when you consume a log message and do some parsing on it, you can for example detect that there's an error in your log file, and that's useful information, but ultimately what you'd like to do is alarm on that. So we can generate metrics from our logging system, then alarm and threshold on them later on, and then send notifications. That's kind of the grand vision for this that we're trying to get to in the future. I'm not going to touch the mouse.
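In code terms, the log-to-metric step just described might look roughly like this; the metric names (log.warning, log.error) and field names are assumptions for illustration rather than the actual Log Metrics component.

```python
# Sketch of the log-metrics idea: watch the stream of parsed log entries and
# emit a metric whenever an error or warning shows up, so the threshold engine
# can alarm on it later. Metric and field names are assumed for illustration.
import time

LEVELS_OF_INTEREST = {"WARNING": "log.warning", "ERROR": "log.error"}

def log_entry_to_metric(entry: dict):
    """Return a metric dict for interesting entries, or None otherwise."""
    metric_name = LEVELS_OF_INTEREST.get(entry.get("level", ""))
    if metric_name is None:
        return None
    return {
        "name": metric_name,
        "dimensions": entry.get("dimensions", {}),   # e.g. service, path
        "timestamp": int(time.time() * 1000),
        "value": 1.0,                                 # one occurrence
    }

print(log_entry_to_metric({"level": "ERROR",
                           "dimensions": {"service": "identity"}}))
```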
The next section is what's new, the update on the project. Let's start with the central component, the Log API. We have completely rewritten the API, so we now have version 3. We added batch support for the logs, so you can collect several logs and send them in one request. We additionally attach metadata to the logs in the form of dimensions, which are just key-value pairs giving additional information about the logs; and since we are now integrating the logs and the metrics, we keep them the same as in the metrics part, so you can easily correlate the logs with the metrics from specific services. We have dropped plain text support; we support only JSON now in version 3. And it's a Python-only implementation; in the previous versions we had Java as well, and we have deprecated that.

That's the short specification of the API: we have a single endpoint, v3.0/logs; the required headers are the authentication token and the content type; and the standard response when everything goes OK is 204 No Content. The payload can look like this: we basically always have two parts, dimensions and an array of logs. In that case we call these dimensions global dimensions, because they are common to all the logs sent in this request. But sometimes you would like to monitor more than just one log file and send a request with more than one message, so you can also put dimensions on every single log object, like in this example of local dimensions. And the third possible case is mixed dimensions, where we have a combination of both: we move the part common to all the messages into the global dimensions, and then we have local dimensions here. It can happen that there is a conflict between the keys in the global and local dimensions; in that case the local dimensions are more specific and they update the resulting dictionary.

So we have the new API, and we also need new agents. As I said before, we have Logstash; we have written an output plugin for Logstash for authenticating with Keystone and communicating with the API. And we have Beaver. The capabilities of the agents: of course they support the new API; they support configurable batching, so you can set up how big the batch size of a single request should be and what the maximum time for collecting the logs should be; and they authenticate with Keystone and cache the tokens for performance reasons. This is an example of the configuration of the Logstash agent. We can see here the sections with local dimensions, which are specific to these log files, and here is the configuration of the output plugin of the agent. What we can see here is the URL of the Log API, the version of the API, the Keystone URL, the part with the credentials, and then the new part, the global dimensions, which are valid for the whole agent, for every request from the agent. And then the configuration of batching: the maximum number of logs, the maximum elapsed time for collecting the messages, and the maximum batch size in kilobytes.
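Putting the API description and the batching configuration together, one of those batched requests could look roughly like this; the URL, port and token are placeholders, and the field names follow the payload structure described above.

```python
# A sketch of the kind of batched request the agent configuration produces:
# one POST to /v3.0/logs with global dimensions shared by all entries and
# local dimensions per log. URL and token are placeholders.
import requests

LOG_API = "http://monasca-log-api.example.com:5607"   # hypothetical endpoint
KEYSTONE_TOKEN = "gAAAA..."

payload = {
    # global dimensions: valid for every log in this request
    "dimensions": {"hostname": "compute-01", "cluster": "demo"},
    "logs": [
        {   # local dimensions extend/override the global ones for this entry
            "dimensions": {"service": "compute",
                           "path": "/var/log/nova/nova-compute.log"},
            "message": "2016-04-25 09:15:02.123 4711 WARNING nova.compute.manager ...",
        },
        {
            "dimensions": {"service": "networking",
                           "path": "/var/log/neutron/server.log"},
            "message": "2016-04-25 09:15:03.456 4712 ERROR neutron.agent ...",
        },
    ],
}

resp = requests.post(
    f"{LOG_API}/v3.0/logs",
    headers={"X-Auth-Token": KEYSTONE_TOKEN, "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()   # 204 No Content when everything is accepted
```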
Now, Roland, you have performed some analysis on performance.

So, performance is central to the solution and we have done a lot of analysis on it; there is a benchmark at that location on GitHub. The performance we were able to attain, measured on a first-generation MacBook Retina with four workers running under gunicorn, is 18,000 log messages per second. That's assuming 100 log messages per HTTP request, with each message a thousand bytes. And I should note, this is what we measure within our Helion distribution, for example; if you're using the info log level, you would probably see around 500 log messages per second for a 100-node compute deployment. 500 is far less than 18,000, so we think that we're well within tolerances there for operational monitoring. Of course, the goal in the future is to expand logging as a service out to make it available for tenants, similar to something like Loggly, but our first step is to use this for operational-type monitoring. And just a note on that configuration: we do need the Keystone middleware deployed to do auth token caching. You can't do this sort of thing and have every single auth request go to Keystone; the round-trip latency would be too high. So that's just an additional part of that performance analysis.

Are you hearing me? Yeah. So, that's the new component, Log Metrics, which is responsible, as I said before, for filtering the log messages and generating metrics for the metrics part. Let's take a look once again at the workflow, how it works. We collect the logs with the agent and authenticate to the API; the API checks the token, validates the input, appends the project ID and pushes the log messages to the Kafka queue. Then we go to the transformer to normalize the messages and add additional fields common to all the logs from our complete system, and back to Kafka. Then Log Metrics filters all the error and warning messages and generates new metrics, which are sent to Kafka and can be evaluated by the threshold engine: it checks them against the alarm definitions and checks whether the threshold has been reached, and in that case it triggers an alarm, which then goes to the notification engine, and an email, webhook or PagerDuty notification is sent. So that's the feature which combines these two worlds and brings the new value.

Some other changes and updates in the project: we have developed a DevStack plugin for the Monasca Log API, so we can now easily install it in a DevStack environment; we have updated Kibana to version 4.4 and implemented a Kibana plugin for authentication; we have updated Elasticsearch and Logstash to version 2; and we are still working on the change in Monasca Threshold to support sporadic metrics. The issue here is that the metrics generated by Log Metrics don't come at regular periods; they only come when a warning or error occurs. In that respect the metric is different from the usual metrics coming from the Monasca agent, and Monasca Threshold cannot cope with that yet. And that's the last update: she didn't quite make it to the release, and I think that's the only reason why she is not called Mitaka. Anyway, congratulations Tomasz, and best wishes.

Now the demo. How much time do I have? Not much, really. The last time I was doing this I mistyped the password, so now I've saved it. Let's go to the Monasca dashboard, to the monitoring panel overview. So we have some alarms here; that's the overview of the system. I can see that most of the services are running fine, apart from the identity service, where I have a warning. Let's take a short look at the dashboard first. What I'm showing here are the new metrics for warnings; we can see there are some warnings for the identity service, the system and networking. I haven't produced any errors; perhaps I will. I still have some time for questions, so I just want to show this quickly.
Yeah, so this is the alarm which was triggered for the metric named log.warning. We get the dimensions, so we can see which log file triggered the alarm and which service. I wanted to show something more, but we don't have time. We can go to Kibana: we have the logs, and we can show some dashboards here. We have a donut chart with the several services and log levels, a histogram of the logs, and the log messages themselves. We can quickly search for the log level warning; yeah, so I just filtered all the entries for warnings. We can see these are only warnings, and they are for the identity service, compute and block storage. Kibana is really powerful, so we can search and do fancy things with it, but I don't have time to show you. What I wanted to show is that you can create these alarms here, see the metrics here, and see here if there are any errors. Okay, thank you, that's all, I'm out of time. Any questions? We might have time for one.

Question: You said that Monasca supports multi-tenancy, and I had the impression that OpenStack users can use it to store their logs and metrics in Monasca. At the same time you showed that you store the OpenStack logs themselves in Monasca, regardless of tenancy. Is it a feature that the admin tenant has access to the OpenStack logs, or how do these two things correlate: tenancy is supported, and at the same time you store the OpenStack logs?

Yeah, so what we would typically do, what we're doing right now in our distribution, is we'll deploy the logging service, and then, call it the admin account, there are agents running on all of our physical infrastructure; in our case that would be Beaver, or Logstash is supported too. That agent reads the log files out of /var/log, whatever it is, /var/log/nova, /var/log/neutron, et cetera, to get all the log messages for all the services, and then that agent sends those to the Log API, and they go into the system below. That's, say, an endpoint, and that endpoint can be registered as both a private and a public endpoint. So if it was public, tenants could do the same thing; they could send their log messages into the system. Currently we don't have a query API, so that's not going to be very useful to them, but that's something we will add in the future. So today we're moving in that direction: all those log messages will be there for the operator, and then they can use the combination of Elasticsearch and Kibana to actually visualize the logs and generate reports, et cetera. Does that answer your question?

Yes, it seems so. So it looks like you cover two cases: you store the logs of the stack itself, and you provide tenants the means to store their logs as well.

Exactly, yeah, that is the case, and that's true for the Monasca metrics system as well; only in the case of the metrics, that part of the project is further along. So that endpoint is usually available publicly, and tenants can deploy the agent within their VMs and send metrics, which they can then query and get back out. If they want to visualize them, they would have to also deploy something like Grafana locally to do that visualization; actually, I'll take that back, that is supported, you can use Horizon today and visualize the pertinent metrics. So logging is going to be similar: we'll have agents running, the physical infrastructure will be monitored, and then tenants will be able to also use that solution for themselves.
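As a side note, "query and get those metrics out" is just a GET against the metrics API; a minimal sketch, with assumed path and parameter names rather than anything quoted from the talk:

```python
# Sketch of querying recent measurements of one metric over HTTP, scoped to
# the caller's tenant by the Keystone token. Path, parameter and response
# field names are assumptions for illustration.
import requests

MONASCA_API = "http://monasca-api.example.com:8070"
KEYSTONE_TOKEN = "gAAAA..."

resp = requests.get(
    f"{MONASCA_API}/v2.0/metrics/measurements",
    headers={"X-Auth-Token": KEYSTONE_TOKEN},
    params={
        "name": "cpu.utilization_perc",
        "dimensions": "hostname:compute-01",
        "start_time": "2016-04-25T00:00:00Z",
    },
)
resp.raise_for_status()
for series in resp.json().get("elements", []):
    print(series["name"], len(series.get("measurements", [])), "measurements")
```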
Now, the other thing we do with metrics is that the Monasca agent will send metrics into the tenant's account; you might want to know the CPU utilization of your own VM that you've created. So we do what's called a cross-tenant post in that case: if the client that's sending the data has the delegate role, then it can publish metrics to another tenant, and that tenant can look at those metrics. So it doesn't always have to be the case that the tenant itself publishes the metrics into the system. That might make sense in the logging case too, but I don't think tenants really want to know about Nova log files or Neutron log files. So when we come up with examples or use cases where that makes sense, then we will have that capability too. Thank you.

Question: What notification capabilities do you get on log patterns of interest?

The same notifications you have with standard Monasca, so email notification, webhook and PagerDuty.

Question: What information would I get back in the email? If I pick up the log message that disk21 has failed in my disk array, am I going to get that in my email notification?

So, yeah, it depends on how you configure your system, really. In the standard installation you will just get the dimensions, so you get the dimensions of the metrics, which at the moment are just path, component, service and log level. And then, as we develop the Logstash components, there is something in each metric: a metric in Monasca consists of a name, the dimensions, which are basically similar to what we are doing for logging, a value and a timestamp, and then there is also something called value meta. The value meta looks like a dictionary, but it is not involved in the identity of the metric; it is additional information about the metric. Typically, for metrics processing, what you would store in there is, if you were monitoring an HTTP server, the status code and the text message, like "internal server error" if it's a 500 status code. So we can put additional information into the metric; we have not done that yet, but when we generate the metric we can stuff that data in, and in the future that will contain additional information for logging purposes. Good question; we have not gone down that path too far yet.

Question: Where exactly does this notification logic take place? While you are storing the data into Storm, or once you collect the data?

I will answer that. Storm evaluates whether the alarm is triggered, and then it publishes back to Kafka what we call an alarm state transition event. The notification engine, which is another component in another process, consumes that and then determines whether that state transition is associated with any notification methods; based on the alarm definition ID that is part of that message, it takes a look in the MySQL or Postgres database and says, should I send a notification, and which one should I send?

Question: Is this exposed to Heat, and if so, what is the viability of replacing Ceilometer alarms with Monasca alarms?

Okay, so this is also an area of continual development, but the Heat templates are there to support this, and we support webhooks in Monasca, so we can send a webhook to Heat and then the necessary templates can be run to bring up more VMs or destroy VMs, depending on what the case is. The continual work that's going on is that today that notification is a one-shot notification.
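A rough sketch of the alarm-definition-plus-webhook wiring that this Heat flow relies on; the paths, field names and expression syntax are assumptions for illustration, not quotes from the talk.

```python
# Sketch: create a webhook notification method, then an alarm definition whose
# alarm_actions point at it, so an alarm transition can signal something like
# Heat. Endpoint paths, field names and the expression syntax are assumed.
import requests

MONASCA_API = "http://monasca-api.example.com:8070"
KEYSTONE_TOKEN = "gAAAA..."
HEADERS = {"X-Auth-Token": KEYSTONE_TOKEN, "Content-Type": "application/json"}

# 1. A webhook notification method that Heat (or anything else) can be reached through.
notification = requests.post(
    f"{MONASCA_API}/v2.0/notification-methods",
    headers=HEADERS,
    json={"name": "scale-out-hook",
          "type": "WEBHOOK",
          "address": "http://heat.example.com/signal/scale-out"},   # hypothetical
).json()

# 2. An alarm definition that fires when average CPU goes above 80%.
requests.post(
    f"{MONASCA_API}/v2.0/alarm-definitions",
    headers=HEADERS,
    json={"name": "high-cpu",
          "expression": "avg(cpu.utilization_perc) > 80",
          "match_by": ["hostname"],
          "alarm_actions": [notification["id"]]},
).raise_for_status()
```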
So we're adding support for what's called periodic notifications. In that case, if you configure the notification as a periodic notification, it will resend the notification periodically, say every minute. The reason that's required is this: your CPU utilization goes above 80% and you spin up another VM; a minute or two later the VM is running, but your CPU utilization is still above 80%. Heat doesn't have the capability to detect that itself and go re-evaluate the state, so it has to be told that the CPU, and therefore the alarm, is still in the triggered state. So we'll send that notification, and then Heat will go ahead and apply whatever logic it wants to: if it's been more than five minutes and it hasn't gone below 80%, it can start up another VM. So you can use Monasca and Heat in place of Ceilometer, and it's a big area of development for us right now. Any more questions? All right, well, thank you everyone for attending the session. Thank you very much.