I'm Emma Foley, this is Kristof Kepke, and we're going to talk to you today about how to tell what's really going on in your NFV infrastructure, why you need to do it, and what you can do about it if you can't see what's going on. I'm going to do an introduction and talk briefly about barometers and the Barometer project. Kristof is going to talk about collectd, then I'll come back to Barometer and how the two projects relate to each other. After that Kristof will talk about potential use cases, and I'll finish with plans and upcoming features before opening the floor to questions.

So why do I need to know what's going on in my infrastructure? Well, as telcos and enterprises move towards a cloud-based IT infrastructure, they start moving their workloads from fixed-function network appliances to commodity hardware in order to reduce costs. But as they move to general-purpose hardware, they become more and more reliant on the data center, and more vulnerable to the costs associated with data center downtime. Those costs are not just financial, although even a minute of data center downtime is very, very costly; the cost also comes in terms of additional complexity and reduced service availability. As they move from fixed-function network appliances to an NFV environment, the tooling required to actually maintain, host and orchestrate everything becomes more and more complex. At the same time, the requirements that customers have for service assurance, QoS and availability remain constant; they still need to be met or exceeded. This requires more and more complex tooling and more metrics to be available in the environment. So there is additional complexity in deploying, additional work in maintaining the required level of performance, and even more complexity in monitoring what you have going on. And it is vital to monitor these systems, because many different things can affect performance, and as complexity grows, many different things can actually cause downtime. You move from only having to monitor the platform itself to also having to monitor the applications running on top of it, because you don't want something like OVS, DPDK, OpenStack or Kubernetes to go down; that would be disastrous.

This is where Barometer comes in. First off, a barometer, as in the scientific instrument, is a device for measuring atmospheric pressure. It is usually used for short-term weather forecasts, and another use that many people aren't aware of is that it can be used to measure altitude, or height above sea level. When scientists were designing the barometer, they probably didn't expect that to be one of its uses, and in the same way, when the Barometer project was created, a lot of use cases have since emerged that we did not foresee at the time. Barometer itself is part of OPNFV, and I'll explain briefly what that is, because Barometer's relationship to OPNFV dictates the activities that the project actually undertakes. OPNFV is the Open Platform for Network Functions Virtualization. It's a Linux Foundation networking project, and it tries to ease the adoption of NFV. It does this by developing more NFV-friendly features in upstream projects and then providing tooling to deploy, test and integrate those same features. And that is what Barometer does.
Barometer is concerned with collecting metrics that help you monitor the NFV infrastructure and exposing those metrics to higher-level fault management systems that can introspect, analyze and automate the management and fault detection in your data center. So, like I said, Barometer does testing, integration, deployment and upstream development around metrics collection. And that's what Kristof is going to talk to you about: the upstream project that Barometer actually contributes to. I will try very briefly to explain what that project is, and Kristof will give you some more useful information.

So yes, collectd is a pretty major piece of software. It is kind of a veteran, very widely deployed across the industry, and it has been around for about 16 years. During those years collectd has continuously evolved and adapted to industry needs. It is written in C, especially the core daemon, it doesn't have any hard dependencies, and it is built with a small footprint in mind. It's open source, mostly MIT-licensed, though some older plugins are still GPL. It is platform-independent and runs on most of the operating systems that are out there. It gives you the ability to collect multiple metrics and events: included in the collectd repository there are over 140 plugins of various types. Some of them read telemetry from various places, such as applications, the platform, the hardware and many others. collectd is also able to write this telemetry northbound in multiple ways, either to a file such as CSV or to time-series databases like InfluxDB and others. There are also binding plugins in case the plugins that are there are not enough for you: you can write Python scripts or Java applications and feed them into the collectd daemon, which dispatches that data for further integration with your analytics stack (a small sketch of such a Python plugin follows below). There are also modules for logging, handling notifications, aggregation, thresholding and filtering metrics. An interesting plugin is the network plugin, because it is able to read and write data over the network with a collectd-specific protocol. So collectd can be treated as a client that produces data, but also as a server that receives it, does something with it, for example aggregation, and forwards it somewhere further.

So we know that collectd provides the ability to collect metrics, but which ones are actually interesting for you, and why would you choose collectd in the first place? There are standards organizations and bodies like ETSI and CNTT that are working on documenting specifications listing the set of metrics and capabilities that you are particularly interested in for an NFV architecture. Today we are focused mostly on the NFVI, the platform telemetry and part of the traffic telemetry, but it is also possible to scrape application telemetry directly from the VNFs with some of the plugins, for example DPDK telemetry, and to push all this data to telemetry databases for an analytics engine, closing the loop by providing feedback back to the MANO systems so they can make decisions about corrective actions.
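To give a rough idea of what those binding plugins look like, here is a minimal sketch of a Python read plugin. It is not one of the plugins shipped with collectd; the plugin name, the metric it reports and the file it reads are made up purely for illustration.

    # example_load.py - a made-up read plugin using collectd's Python binding.
    # It would be loaded from collectd.conf via the python plugin, e.g.:
    #   <Plugin python>
    #     ModulePath "/path/to/plugins"
    #     Import "example_load"
    #   </Plugin>
    import collectd

    def read(data=None):
        # Read the 1-minute load average and hand it to the collectd daemon,
        # which forwards it to whichever write plugins are configured.
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
        vl = collectd.Values(plugin="example_load",
                             type="gauge",
                             type_instance="load1")
        vl.dispatch(values=[load1])

    collectd.register_read(read)

The daemon calls the registered read callback on every interval, and whatever the callback dispatches is routed to the configured write plugins like any other metric.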
So what else is available in collectd to monitor the NFVI? There are plugins like mcelog, pcie_errors or logparser that can provide you with specific counters about, for example, memory errors that are happening on your DIMMs, through Intel Run Sure technology or RAS features, which are basically features built for reliability, availability and serviceability and which help your platform keep serving you even when failures occur. Intel Resource Director Technology lets you monitor, per process ID or per core, your utilization of the last-level cache or your memory bandwidth. The virt plugin gives you insight into the libvirt domains, so the compute, storage and networking inside the VMs. There are integrations for OVS and DPDK that let you see what is happening on your network, with packet-processing counters including the errors and drop rates occurring there. There are Python-based plugins that let you write this telemetry to OpenStack for consumption. You can also push the data to Kafka, to AMQP, to Prometheus or, for example, to the VNF Event Stream (VES), which is a project in ONAP. You can monitor the health of your storage, power consumption, or other things closer to the platform. If you are, for example, selling cloud resources to someone, you may be interested in out-of-band telemetry, which is provided via Redfish or IPMI, and there are also PMU counters that may interest you, which monitor low-level counters in the processor and can be useful in some cases, such as branch misses, mispredictions or cache misses.

So now let's get back for a moment to Barometer. Okay, you may have noticed that I like asking questions. So how does Barometer relate to collectd? Well, collectd helps us collect metrics, and that's the core of what we want to do, because no matter what you're going to do with those metrics, no matter how you want to manage your NFV environment, you still need those metrics to be available and easy to access, in whatever format you want, in whatever higher-level management or automation system you use. So collectd helps us collect the metrics, and if this project didn't exist we'd basically have a lot more work to do in Barometer. So it's only fair that we try to give back to the collectd community, and we do this not only by upstreaming our own features but also by helping the collectd community in general onboard new contributors and review pull requests. Barometer itself also provides a load of testing and deployment tooling, which feeds back into the upstream collectd CI, provides validation information to developers on their pull requests, and helps at release time to actually validate the collectd releases and make sure everything is working.

So if I want to play around with Barometer or collectd and take advantage of all these new NFV features, what can I do? Barometer takes care of some deployment tooling as well, which makes it easier to install and integrate collectd into whatever system you have. You could also install collectd from a package manager and configure it yourself, but this gets a little bit tedious after one or two servers. So what we've done in Barometer is containerize collectd and write a bunch of Ansible playbooks that automatically configure all the plugins we think are relevant, and you can also put in your own. This one-click installer will let you install collectd as is, or install it alongside InfluxDB and Grafana, or alongside Prometheus.
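To give a flavor of the kind of configuration those playbooks end up writing for the plugins Kristof just described, a hand-written collectd.conf fragment might look roughly like this. It is only a sketch: the exact option names can vary between collectd versions and depend on how the plugins were built.

    # Enable a few of the NFVI read plugins mentioned above.
    LoadPlugin mcelog
    LoadPlugin intel_rdt
    LoadPlugin virt
    LoadPlugin ovs_stats

    # Collect compute, network and storage stats for each running libvirt domain.
    <Plugin virt>
      Connection "qemu:///system"
    </Plugin>

    # Read interface and drop statistics from the local Open vSwitch database.
    <Plugin ovs_stats>
      Address "127.0.0.1"
      Socket "/var/run/openvswitch/db.sock"
    </Plugin>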
These are a few examples of how the metrics can actually be consumed, and I will talk about some of the pros and cons of these reference deployments. First up is InfluxDB and Grafana. This is a very simple architecture: you dispatch the metrics from collectd via its network plugin and they are sent to the time-series database, InfluxDB. From there you can grab the metrics for whatever offline analysis you want to do, hook them into any existing tooling you have that talks to Influx, or create your own tooling around it and pull those metrics. Or, very simply, out of the box you get some nice graphs in Grafana, and if you're running Grafana 4.0 or above you get some basic alerting as well.

Prometheus is very popular, especially when you're talking about Kubernetes and cloud-native infrastructures, but there is a slight problem when you try to deploy Prometheus with collectd, in that collectd has a push model for metrics and Prometheus has a pull model. So collectd, as it doesn't have any built-in storage, has to put those metrics somewhere until Prometheus pulls them. There are two plugins that do this: there is the write_prometheus plugin and there is the collectd exporter. Both of them work in the same way, in that they create a small web server which hosts the metrics until Prometheus comes along and scrapes that endpoint. As the infrastructure scales, this becomes a little bit problematic, because Prometheus is scraping a whole bunch of remote endpoints and that takes a non-zero amount of time. Eventually, in the time it takes Prometheus to scrape the metrics from all the hosts, collectd will have created more metrics, and those will overwrite the existing ones, so on larger infrastructure you end up with missing data. Another issue is that the timestamp recorded by Prometheus is the actual scrape time, and this may not be the same as the collection time for the metric. In a small deployment you can probably live with that, because it's a small variation, but as you scale up the differences become more and more pronounced, and this actually limits the rate at which you can collect metrics, so there are trade-offs that have to be made. So the issues here are that the metric collection time is not preserved, and the latency involved means you have to trade off as you scale up. Normally when this happens you would just deploy more instances of your application, but Prometheus explicitly operates in a single-server mode, so there is always only one Prometheus instance. If you want high availability you deploy two or more Prometheus instances, but you can't share the data between them; each Prometheus instance will be scraping all the endpoints, so that doesn't really solve the problem of latency.
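For illustration, the collectd side of those two reference deployments might be configured roughly as follows. This is a sketch rather than what the Barometer playbooks literally produce, and the host name and ports are placeholders.

    # Option 1: push metrics to InfluxDB using the collectd network protocol
    # (the collectd listener has to be enabled on the InfluxDB side).
    LoadPlugin network
    <Plugin network>
      Server "influxdb.example.com" "25826"
    </Plugin>

    # Option 2: expose a local endpoint that Prometheus can scrape instead.
    LoadPlugin write_prometheus
    <Plugin write_prometheus>
      Port "9103"
    </Plugin>

On the Prometheus side you would then list each host's address and port in a scrape job, which is exactly the scraping step whose latency becomes a problem at scale.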
So that's where the Service Telemetry Framework comes in. This is pretty new, in that up until Wednesday it was called SAF, but we had to change the name. In this case you still have collectd running, and you still have the same plugins configured, but instead of exposing a local or remote scrape endpoint for Prometheus, all the metrics are dispatched over an AMQP bus and received on the other side by the STF application, which is hosted at the moment on OpenShift. The metrics are then pulled off the AMQP bus by an application called Smart Gateway, which exposes them on a local scrape endpoint for Prometheus, and Smart Gateway also takes care of the issue of the write time versus the scrape time. So the metrics are available in Prometheus the same way as they were before, and this also takes events into account; those are available through Elasticsearch. This looks complicated, but it actually ends up not being very complicated, because all the orchestration for it is taken care of by the service assurance orchestrator, which actually deploys all of these components for you. So with that we have the metrics available, and you can make them available to whatever other system you want, with maybe a little bit of effort, maybe a lot of effort, but there are a lot of reference implementations and a lot of choices. I like to overload the STF acronym here: you can use these metrics to stop your system from being Stressed To Failure, or you can use them to See The Future, or Kristof can tell you some ways that we're actually using them.

Okay, so let's start with the first one, which was actually used pretty recently, in November. During KubeCon in San Diego they deployed a full open-source 5G network and made a call from one city to another, and as part of that virtual central office Barometer was used for the monitoring. Here we can see a Grafana dashboard that shows us some statistics from the system, so pretty cool, big stuff. Now let's get to something simpler. This is actually just a proof of concept. Here, on one server, we have two vBNG instances running in hot-standby mode: one is actively processing the traffic and the other one is waiting in standby. They are deployed on two separate NUMA nodes, so they have different memory regions. The resiliency part is this: if at some point the memory of the active vBNG instance starts getting corrupted, it degrades over time. The telemetry is scraped from the mcelog plugin and dispatched to Prometheus. Usually memory starts by generating a small number of corrected memory errors, but as they appear more and more often, it becomes more probable that we will hit an uncorrected memory error that could crash our platform. So before that actually happens, we can detect the increasing rate of corrected memory errors and do something about it. In this proof of concept we simply trigger a remediation action that moves the traffic from one vBNG instance to the other, to simulate a high-availability failover. That was one of the first proofs of concept to show that, based on monitoring the platform telemetry, it is possible to prevent an outage or to shorten any service interruption as much as possible.
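To make the reasoning concrete, the remediation in that proof of concept could be driven by a Prometheus alerting rule along these lines. This is only a sketch: the metric name collectd_mcelog_errors_total, the "corrected" label and the threshold are hypothetical stand-ins for whatever the mcelog plugin actually exposes in a given deployment.

    groups:
    - name: memory-ras
      rules:
      - alert: CorrectedMemoryErrorsRising
        # Fire when the rate of corrected errors stays above an example
        # threshold for five minutes; the remediation hook then fails traffic
        # over to the standby vBNG instance.
        expr: rate(collectd_mcelog_errors_total{type="corrected"}[10m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Corrected memory errors rising on {{ $labels.instance }}"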
But there is more than just memory getting corrupted. You can also watch, for example, the temperature headroom to prevent any CPU throttling, or you can watch the last-level cache occupancy or memory bandwidth to prevent any noisy neighbor impacting or affecting your workloads. You can combine all of those metrics into higher-level indicators about the health of your platform or a compute node, which leads us to the second proof of concept, which did exactly that. We have two compute nodes managed by Kubernetes. We are scraping the RDT, PMU, IPMI and related platform technology metrics, pushing them to a Kafka stack for streaming analytics, which calculates a host health indicator and provides this information to Prometheus. Now let's take a look at the new component we see there, the Telemetry Aware Scheduler. This is an extension to the default Kubernetes scheduler that makes it aware of telemetry, to help it with scheduling decisions. You can feed it policies which monitor particular metrics, and you can say: if the platform, for example, is healthy, you can deploy something new there; if there are some minor issues, or the resources are getting saturated, then keep what is already there but do not schedule anything new; and if there are any critical issues, evacuate everything and reschedule it on healthier nodes. So by monitoring these metrics you can perform those actions and do some service healing and improve platform resiliency.
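As a rough sketch of what such a policy could look like for the Telemetry Aware Scheduler, assuming the strategy and rule field names from the upstream project's published examples (the metric name health_indicator and the thresholds are made up here, so check the project documentation for the exact schema):

    apiVersion: telemetry.intel.com/v1alpha1
    kind: TASPolicy
    metadata:
      name: health-policy
      namespace: default
    spec:
      strategies:
        # Do not place new pods on a node whose health indicator is degraded.
        dontschedule:
          rules:
          - metricname: health_indicator
            operator: GreaterThan
            target: 50
        # Evacuate pods from a node whose health indicator turns critical.
        deschedule:
          rules:
          - metricname: health_indicator
            operator: GreaterThan
            target: 90
        # Otherwise prefer the node with the lowest (healthiest) value.
        scheduleonmetric:
          rules:
          - metricname: health_indicator
            operator: LessThan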
Then I'll just quickly tell you briefly about the power-saving demo. At the bottom of all of the slides there is a link where you can find more detailed information about these demos. Here we have a Kubernetes cluster that was running vCMTS pods using poll-mode drivers, so they were eating 100% of the CPU all of the time. The platform telemetry was pushed to InfluxDB and then monitored by an analytics engine that had previously been trained to find the correlation between platform telemetry, CPU core frequencies, performance and packet drop rates. Let's see the results. The red line here is an actual traffic pattern from one of the operators, a 24-hour period from peak to peak, and the blue line shows us the power consumption. At the top we are using the default Linux performance power governor, so it keeps all of the cores on turbo. In the middle we see the ondemand governor, but due to the 100% CPU utilization it also keeps the power consumption very high. At the bottom we can see the possible energy saving from the lower core frequencies being managed by this analytics engine. And as we don't have much time, I will just skip ahead. In summary, there are many positive changes you can make by monitoring the platform telemetry, through service healing, energy optimization and the quality of your service. With Intel Threat Detection Technology, which is based on PMU metrics for example, there is also the possibility of detecting whether someone is trying to attack your platform. There are also OPNFV projects that utilize Barometer and collectd, for example Bottlenecks and Yardstick, which use them in their testing phases. And as the use cases keep growing, the software still needs to adapt and evolve to meet them, which leads us to the plans for collectd and Barometer.

We don't have much time left, so I'll race through this. Up next for Barometer: in the next six months we hope to contribute to the collectd 5.11 release, particularly a new DPDK telemetry plugin, which will use the new telemetry API in DPDK and supersede the existing DPDK stats plugins that are available. We want to get the capabilities plugin merged, which provides some static system information. A Redfish plugin and an mdevents plugin are also in flight, as well as a bunch of bug fixes. We are hoping to do more work on our collectd CI to actually run more validation tests and help verify collectd patches and releases in an automated fashion. There are always documentation updates, and there are a bunch of metrics requests and collaboration requests from the MANO API working group and from CNTT, the Common NFVI Telco Taskforce. They want to provide a bunch of new reference implementations and unify efforts across a number of different projects inside and outside the Linux Foundation.

If you want to get in touch, we have a weekly Barometer meeting on Tuesdays at 5pm UTC and bi-weekly collectd meetings on Mondays at 3pm. Information about both of these is available in the mailing list archives, and you can get in touch by contacting the relevant mailing lists for both projects. If you want to try out some of what you've seen, the best place to go is GitHub for the Barometer and collectd source code, as well as the Service Telemetry Framework. To get started with documentation on collectd, their wiki is pretty comprehensive and lays out the configuration instructions for each individual plugin. If you want to dive into plugin development in collectd, we put together a plugin development guide in Barometer which is focused on getting your first plugin up and running. And if you want to get involved by contributing features or sharing your own requests or requirements, that information is on the OPNFV wiki on Barometer's page. All of these links will be up on the schedule later, so you can find them there. If you want to contribute to collectd there are a bunch of different things you can do, from starting with simple testing, to contributing changes, both features and bug fixes, to providing code reviews; there is more information at that link. And if you just want more information, you are welcome to comment on pull requests asking for clarification. If you want to get involved, there is actually a collectd meet-up happening later this month in Munich. This is the second one, and we are going to be discussing things like features, testing strategies, upstream processes, release processes, and architecture and requirements. We will be discussing the next major release, which will represent a lot of effort to make collectd more cloudy: things like an API for submitting and querying metrics and for dynamic reconfiguration of collectd, because at the moment it is pretty static in its configuration, and also features that will bring it a little bit closer in functionality to the other collectors that are available. Information on the schedule is on the Etherpad, and meet-up information is on the mailing list. Before I finish, I would like to note that it was not just us doing this work. Usually when you have someone up presenting, it is easy to forget that a lot of people contribute to the projects as well, so I would like to thank the people who helped with the various demos, development and requirements, and with driving the projects. Does anybody have any questions?

I had cause to look at collectd recently, and every time you add a new data source or plugin it seems to somewhat limit its scalability; you are kind of bound to waiting for a new collectd release to add a new plugin to get that new piece of functionality. Are there any plans to make that more dynamic?
That's part of the discussion for 6.0, as it might require a major re-architecture of the collectd internals, so there are plans to make it more dynamically reloadable as part of that work. Did that answer your question? Yes. The question actually was that there is an issue with collectd where, if you want to change the configuration, you have to restart it, and whether there are any plans to change that. So yes, the answer is yes.

Next question: you're presenting an approach which stacks certain layers on top of collectd, so I guess it's about benchmarks, or how it scales, or what's your opinion on how to size a system for operating this? So the question was that collectd produces a lot of metrics and a lot of disk space, how does this actually scale, and, given that what we've presented here adds additional layers of complexity, do we have any benchmarks or guides for scaling? We occasionally run benchmarks. In terms of storage, mostly the metrics are just batched to remote locations, and things like Prometheus are not designed for long-term storage of the metrics, so typically they will be aggregated and archived to reduce the amount of metrics you have to actually store. Unfortunately the time's up; if anyone has any more questions, feel free to come up afterwards. Thank you.