Hi, my name is Emma Foley, and I'm a software engineer at Intel Shannon. Today we're going to present Gnocchi and collectd and show how they can be used for faster fault detection and maintenance.

Hi, everyone. I'm Julien. I've worked at Red Hat for a few years now, and I'm part of the telemetry team there, and also in OpenStack. I've been doing telemetry for five years now, which is a pretty long time. I'm going to talk to you today about Gnocchi first and introduce it a bit. Then we'll see what collectd is, how you can plug these two pieces of software together to achieve amazing things, and how we can improve all of this stack.

So first I'd like to talk a bit about Gnocchi. What is it, and what is its purpose? It's an OpenStack project that was created three years ago now, so I guess most of you have heard at least a bit about it. It left OpenStack officially a couple of months ago, but that doesn't make it useless for OpenStack; it's just that it's now independent and developed outside of OpenStack, officially. It's the main back end for OpenStack Ceilometer. Ceilometer used to store its data in its own database, and it now leverages Gnocchi instead.

So Gnocchi is a time series database as a service. We use the term "as a service" here because it's one of those rare databases that offers an API that you can use to query it. It started in the OpenStack ecosystem, so it made sense to have an API, and it should be a REST API. There are not a lot of time series databases these days exposing a nice, clear HTTP API in the spirit of what OpenStack has provided so far for compute or other services.

I imagine most of you know what a time series is. It's basically a set of points, each with a timestamp and a value. What Gnocchi does is store that kind of data at a very large scale.
Most of the open source time series databases that we tried to use three years ago did not really fit our different use cases for Ceilometer and OpenStack, so we had to start a new project. At first it wasn't a real time series database; it was more a front end to other time series databases. But it turned out that most of them were not working the way we wanted them to work, so we ended up writing our own time series database with OpenStack and the OpenStack services in mind. For example, you can use other OpenStack services such as Keystone for authentication or Swift for storage, which is pretty handy.

One of the things that makes Gnocchi a bit different from other time series databases is that it does pre-computed aggregation. When you send metrics to Gnocchi, it computes things like the average, the minimum, the maximum and the number of points you sent, in advance, per period, and stores them for a defined amount of time. That's a bit different from what you will see in most time series databases, which store every point with every degree of precision. Here we try to aggregate in advance. So when you request something like "give me the measures of this metric for the last hour with a five-minute interval", it's already computed and it's pretty fast. There's nothing to be computed on demand; everything is done beforehand.

Gnocchi has an architecture that is a bit different from most of the other OpenStack projects you might encounter. It has three different storage pieces. The last one, at the bottom, is the indexer. The indexer does not store any metrics; it stores the list of resources in your OpenStack cloud, or any other kind of resources. It's not really tied to OpenStack. It could be, I don't know, switches, routers, networks, whatever you want.
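The pre-computed aggregation described above can be illustrated with a minimal Python sketch. This is not Gnocchi's implementation; the function name `aggregate`, the 300-second granularity and the sample readings are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def aggregate(measures, granularity):
    """Group (timestamp, value) measures into fixed-width periods and
    pre-compute mean, min, max and count for each period."""
    buckets = defaultdict(list)
    for timestamp, value in measures:
        # Align each timestamp to the start of the period it falls in.
        period_start = timestamp - (timestamp % granularity)
        buckets[period_start].append(value)
    return {
        start: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical readings: (seconds since an arbitrary origin, value).
measures = [(0, 10.0), (60, 20.0), (299, 30.0), (300, 40.0), (420, 60.0)]

# Five-minute (300 s) periods, computed once when the measures arrive.
aggregates = aggregate(measures, granularity=300)
```

A later query for a five-minute interval then only reads `aggregates` rather than scanning the raw points, which is the trade-off the talk describes: more work at write time, very little at read time.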
It's pretty agnostic in this regard, and it stores that in a database. Right now there are only two drivers, MySQL and PostgreSQL, but it's pretty easy to add any other database back end if you prefer another technology.

The two other pieces of storage, which are usually the same but can be used in different ways, are the measure storage and the metric storage. The measure storage is the incoming storage: when you send new measures to Gnocchi, it stores them there. It could be Swift, Ceph, Redis, whatever you prefer, or even a file in the file system. All of the new measures that you send are going to be computed and aggregated into a time series, and that's done by the metricd workers, shown in green there. A metricd worker computes the average, the minimum and the maximum of all the points that you send, and archives them into the metric storage. This is usually a largely scalable system with a lot of space, such as a Ceph cluster, a Swift installation, a big filer, or whatever you have available for that.

The way to interact with Gnocchi is through an API. As I said, the API is stateless since it's a REST API, so it's pretty easy to scale it out: you can have any number of workers. The same goes for the metricd workers. The more metrics you send to Gnocchi, the more measures you have and the more workload you send to it, the more it's going to need to scale. The metricd workers are also stateless, so you can spawn as many as you need to have your metrics computed in real time, or with some delay if you have cheap hardware. Everything is coordinated through a coordinator, which is usually a Redis server or an etcd server, or whatever you have to handle that; there are a few drivers available.

In this schema, where we're going to plug collectd is exactly at the same place as the other services, which is through the API. So collectd is going to talk to Gnocchi through its API.
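As a rough illustration of what "talking to Gnocchi through its API" can look like, the sketch below builds (but does not send) an HTTP request that pushes new measures for one metric. The `/v1/metric/<id>/measures` path and the timestamp/value JSON body are assumed from Gnocchi's REST API documentation and should be checked against the deployed version; the host name, port and metric ID are placeholders, and authentication headers are left out.

```python
import json
from urllib.request import Request


def build_measures_request(base_url, metric_id, measures):
    """Build a POST request carrying new measures for a single metric.

    `measures` is a list of (ISO 8601 timestamp, value) pairs.
    """
    body = json.dumps(
        [{"timestamp": ts, "value": value} for ts, value in measures]
    ).encode("utf-8")
    return Request(
        url=f"{base_url}/v1/metric/{metric_id}/measures",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and metric ID, for illustration only.
request = build_measures_request(
    "http://gnocchi.example.com:8041",
    "5a301761-aaaa-46e2-8900-8b4f6fe6675a",
    [("2017-11-06T10:00:00", 12.5), ("2017-11-06T10:00:10", 13.0)],
)
```

Because the request carries everything the server needs, any API worker can handle it, which is what makes scaling the API out straightforward. Sending it would be a matter of passing `request` to `urllib.request.urlopen` with suitable credentials.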
The API can be used to read the data, but also to send new metrics. This is exactly where Ceilometer, for example, in a classic OpenStack deployment, is going to put its data too.

A quick point about Gnocchi versus Ceilometer, because I know there are a lot of people still confused about the two projects. It's more about the history, in the end. Ceilometer is older; like I said, it's five years old. Gnocchi is only three years old. The story of the creation of Gnocchi comes from the fact that Ceilometer used to leverage databases such as MongoDB by default, which were not used in a way that was very scalable. There was a lot of data being stored, and no aggregation was done in advance in the case of Ceilometer. So as soon as you retrieved data from Ceilometer, it would take ages to compute an average, because you had millions of samples in a single MongoDB collection, or in a single SQL table, for example. It was pretty hard to keep a lot of data for a long time. On large deployments, past a week or so, you could have millions of points in the Ceilometer database, and it would just explode: any query would take something like 20 minutes to reply, which is not very usable. So this part of Ceilometer, the old API and storage database, was deprecated last cycle and should go away some time in the future. Gnocchi is now the new API to use if you want to retrieve data from your OpenStack cloud or feed more into it. So I'll let Emma talk to you about collectd.

Yeah, so I'm going to talk about what collectd is, how it's used in OpenStack, and how you can take advantage of this in your own deployments. First of all, collectd is a system statistics collection tool. It's quite a mature project.
It's over 10 years old. It has a modular, plugin-based architecture, which means that you can enable or disable plugins independently of each other, so you can monitor just what you want to monitor. It's designed to be performant, but also to have a low footprint on your system.

As I said, it's got a plugin-based architecture, and these plugins fall into two main categories: read plugins and write plugins. The read plugins collect system statistics; the write plugins send them off in whatever format you want to consume. To a certain extent collectd also supports thresholding and notifications, so you can get events as well as metrics. And collectd is widely available in most Linux distributions as well.

So what kind of information is available from collectd? Besides saying "a lot of information", these are some that might be particularly interesting. In total collectd has over 90 plugins; this is just a selection. Each plugin can provide multiple metrics, and provide these metrics for multiple resources as well. For example, taking the CPU plugin, you get eight different metrics about your CPU utilization for each individual CPU, and it's similar across other plugins as well.

So how do collectd and Gnocchi go together? collectd generates a lot of metrics, Gnocchi accepts metrics, and they're connected by the collectd write plugins.

How is collectd used in OpenStack? I'm going to go through a couple of different projects that currently use collectd metrics and how they use them to enhance their own use cases.
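As a rough illustration of the read plugin and write plugin split described earlier, here is a toy Python model. It is not collectd's real API: the `Collector` class, its method names and the fixed CPU readings are all hypothetical, chosen only to show how independently enabled read plugins feed every registered write plugin.

```python
class Collector:
    """Toy model of a plugin-based collector (not collectd's real API)."""

    def __init__(self):
        self.read_plugins = {}   # name -> callable returning {metric: value}
        self.write_plugins = {}  # name -> callable accepting one sample dict

    def register_read(self, name, callback):
        self.read_plugins[name] = callback

    def register_write(self, name, callback):
        self.write_plugins[name] = callback

    def run_once(self):
        """Poll every enabled read plugin and fan samples out to all writers."""
        for plugin, read in self.read_plugins.items():
            for metric, value in read().items():
                sample = {"plugin": plugin, "metric": metric, "value": value}
                for write in self.write_plugins.values():
                    write(sample)


def fake_cpu_read():
    # Hypothetical fixed readings standing in for real CPU statistics.
    return {"user": 12.0, "idle": 85.0}


received = []  # stands in for a back end such as Gnocchi

collector = Collector()
collector.register_read("cpu", fake_cpu_read)
collector.register_write("memory_sink", received.append)
collector.run_once()
```

In this model, enabling or disabling a plugin is just adding or removing one registry entry, and swapping the write callback for one that posts to an HTTP API would be the analogue of a write plugin that forwards metrics to Gnocchi.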
The first one up is OPNFV Doctor. If you haven't heard about it, that's a fault maintenance and management project within OPNFV. They detect faults in your system and fail over so that you have continuous service. What they've used from collectd is actually stuff from the OVS and DPDK plugins, so they're monitoring the link state on their actual physical platform. For example, when the link goes down, you can tell Nova to stop scheduling to a host, but you can also do something like mark the control plane as down as well, which is one of their newer features. You can also then fail over to whatever backup you were using, so that your customers or your service stay running. They actually do a really good demo quite often to showcase all the new features, where you pull the plug on your board but the video streaming from it keeps going, and usually you don't notice any changes.

Then there's the OpenStack Watcher project, which is infrastructure optimization. It analyzes workloads, determines their efficiency, and moves them to more efficient locations. They actually have a demo in the Intel booth in the marketplace, if you want to go and see that. They're using RDT, which is Intel Resource Director Technology, to detect when cache utilization is interfering with the workload. So if you have one application and another application starts using all the system resources, they detect that and fail over in that case.

And there's the OpenStack Vitrage project, which actually consumes events. They are also doing a noisy neighbor use case, but they detect the fault and then they detect what's actually causing the fault.
They do root cause analysis. So instead of waiting for everything to fail, the existence of one alarm can be deduced from the existence of a previous alarm, and therefore you can actually solve the problem before it becomes an issue.

More generically in OpenStack, you can use Aodh, which is OpenStack alarming. You can consume events from collectd and raise alarms in Aodh, which is similar to what Vitrage does or what Doctor has done as well. There are two main ways to create events in collectd. You can do them natively in plugins: instead of creating a read plugin, you can create a notification plugin. This can be used to monitor the state of a system. For example, there's a DPDK events plugin available that monitors the link status of your DPDK interfaces, so when they go down you get a notification directly out of collectd, which can be consumed by Aodh directly. Or, within collectd, you can use the threshold plugin, which allows you to set thresholds and raise alarms based on arbitrary collectd metrics. The limitation behind this one is that you can only do thresholding, so if you want to use any of the other kinds of alarms that are available in Aodh, you can just consume the Gnocchi metrics directly.

Okay, so, new features. One thing I forgot to mention is the OPNFV Barometer project, which concerns itself with metrics and events surrounding capacity planning, trending and the operational status of the NFVI. They've created a lot of plugins within collectd to monitor the actual physical infrastructure that you're running your workloads and hosting your cloud on. They're responsible for a bunch of these plugins, which they plan to package in the OPNFV release in a few months. Those include the RDT plugin that I mentioned and improvements to the libvirt plugin.
The libvirt plugin in collectd lets you get statistics about your hypervisors. They're also developing an SNMP write plugin so that you can export metrics from collectd to legacy fault management and monitoring systems. They're also planning to package the Aodh and Gnocchi plugins for collectd, so you can forward metrics to Gnocchi and raise alarms in Aodh.

One very interesting thing that they're doing as well is supporting dynamic reconfiguration of collectd. At the moment, if you want to change your configuration in collectd, that is, enable or disable plugins or change the interval, you have to restart the service, which means you're actually losing data. The alternative is that you enable everything at the start and guess what you may need to monitor in the future, but that's not very practical either, because you also have to store this data. If you're able to dynamically reconfigure collectd, so you update the configuration and push it, you don't have to collect more data than you need, but you can also reconfigure it when the time comes. And of course with Gnocchi there are upcoming stability and performance improvements, because it's designed to be scalable and performant and to handle a lot of data as well.

So what can you do? This is your call to action if you want to try this out.
You can install collectd from many of the package managers. You can download the collectd Ceilometer plugin, which actually gives you the option to build collectd from source as well, and test it out to see the latest and greatest plugins. And of course, try out Gnocchi as well if you're not already doing it; if you were previously using something like Ceilometer, you can plug Gnocchi in there as well. I just want to let you know that the examples here do not represent a monitoring solution as such; I wanted to present you with the tools that you could use to build up your own system. And it doesn't have to be for monitoring: it could be billing, rating, prediction, auto scaling, anything, and you can use metrics that weren't previously available.

There's a lot in there, so I'm just going to summarize. Gnocchi can handle lots and lots of metrics. collectd generates lots of metrics. There are already a bunch of projects in OpenStack making use of the metrics from collectd. And there are collectd Gnocchi and collectd Aodh plugins, so that you can get in on the action and incorporate these additional platform-level statistics into your own environments. Does anybody have any questions?

Hi. First of all, collectd is awesome; I've been using it for years. I'm currently throwing my collectd statistics at a Graphite server. What are the principal advantages that I would get from switching from that to Gnocchi, let's assume with a Ceph back end?

It's basically the availability, and the possibility to store a very large amount of data. Usually, if you use a Graphite server, you don't have any load balancing or high availability available, because Graphite stores its data in files.
So you have to handle files, and you start running on something like, I don't know, a big NetApp server or whatever you've got, which can be pretty expensive. Whereas if you use Gnocchi, you can just use a Ceph cluster and spawn any number of API and metricd workers as you please and as you have capacity, and that's it. It's designed in a way that you don't have any single point of failure in your metric system.

A follow-on question, if there isn't another one: could I make Gnocchi available to my tenants as well?

Yes, it's designed this way. Like I said, the default is to use a basic HTTP mechanism for authentication, so it's more or less single user. But there's a Keystone auth mode, which filters the metrics and the resources in Gnocchi by project, user or whatever you want. You can use oslo.policy to define RBAC rules. So you can use collectd to put the metrics into Gnocchi and set the right permissions to expose them to your tenants behind OpenStack.

Thank you.

Another question? Go ahead, please. Thanks.

Is it possible to run collectd inside the VM, instead of on the infrastructure, to collect more data related to the VM?

It is. collectd has a network plugin which allows you to forward metrics to different collectd servers, so you can run it inside your VM and forward the metrics to your host, and they can be aggregated there.

Is there any collaboration between the Gnocchi telemetry project and the Monasca project?

We wish, but there's none so far. I'm not an expert on Monasca, but it could be that they could use Gnocchi to store their metrics. I'm not aware of any efforts in that direction. If anyone reached out to us for that, we would be happy to help, but there's none so far.

Thank you. No more questions? Thank you.