All right, cool. Can you guys hear me? Yeah? Cool. Thanks for coming out today. We're going to talk about everyone's favorite project, Ceilometer. My name is Gordon Chung; I'm an engineer at Red Hat and the new PTL of Ceilometer. I'll let Prad introduce our topic for today.

Hello everyone, thanks for coming. Today we're going to talk about deploying and scaling out Ceilometer. How many in this room, by a show of hands, are operators who have tried to deploy Ceilometer? Awesome, perfect — I think you guys will fit right in here.

Let's get through some of the obligatory slides. I'm going to read the mission statement; meanwhile you can look at the picture: "To reliably collect measurements of the utilization of the physical and virtual resources comprising deployed clouds, persist this data for subsequent retrieval and analysis, and trigger actions when defined criteria are met." Well, that's a mouthful. What does it really mean?

Here's a quick intro to what Ceilometer is and what we want it to do. The idea is that Ceilometer collects data about the physical and virtual resources running in your cloud. Once you've captured that data, if it's exactly what you need to persist and use, great; if not, you can use transformers to convert it into something measurable. Next, we publish the data to various targets — the multiple publishers Ceilometer supports, like the HTTP publisher, the recently added Kafka publisher, something as simple as a file, or the notifier publisher. You have multiple options here. And now that the data is published for external consumption, whatever is consuming it, we also want to persist this data, right?
We want to be able to access this data in the future, for querying or for building on, so we persist it to storage. Then we provide a REST API for you to access it for further analysis, or for building pretty graphs — whatever you want to do.

With that said, let's look at the architecture historically. I'm not going to go very deep into this; it's Icehouse-based, and it's where Ceilometer's reputation came from — what people thought of it. The architecture is very simple. We have OpenStack services publishing data to the notification bus. We have polling agents polling the service APIs, getting the data that way. We have notification agents grabbing data off the notification bus. The collector takes the data and persists it in the database. We have an API through which we expose the data, and then you query that data and do some cool stuff with alarms, via the alarm evaluator and notifier. That's the basic workflow in Icehouse, and for the most part even now.

With that architecture there are a lot of limitations. A couple of things to highlight: horizontally scaling the whole application as-is really isn't possible in Icehouse. For example, the polling agents: you can scale them in an active-passive manner using Pacemaker — okay, great, perfect. Notification agents you could deploy in HA
provided you didn't have any transformers associated with them. The collectors used the RPC publisher, which comes with overhead; we fixed that in later versions with something much more efficient. And the agents polling the services — great, but you have one single agent hitting your APIs, so there's a heavy API load. Above all, we had one single database with all the data dumped into it. Mongo was the default; of course we had SQL support, but it was a real joke — it didn't scale well, and queries took forever.

So with this, we asked the cloud admins — we spoke to a bunch of people: hey, what do you think of Ceilometer? And voilà, this was the perception, and it's really not the perception we want. It says cloud admins are really scared of Ceilometer. The key word here is "perception", because it's how it's been perceived; it might or might not be true in every case, but that's how it was perceived.

With that, let's get into some of the complaints cloud admins had. Of course it's obvious they're scared to deploy this in their data centers, but everyone loves to complain, and that's really how we get things fixed. So let's look at some of the complaints people have.

Number one: the API response is too slow. "When I run some queries, they just take forever to return — what the hell is Ceilometer doing?"
"When Ceilometer dies, Glance dies." I don't know if this is really Ceilometer's fault. This is probably a bottleneck in RabbitMQ, or whatever messaging queue you're using, getting flooded because you don't have a limit on the number of messages, and that takes the services down. You can't really blame Ceilometer for that.

"Ceilometer has memory leaks." Okay, all right — we can't argue; there might be some.

"Ceilometer doesn't scale." Okay, sure. With the architecture we just spoke about, it's not surprising that it doesn't scale.

"HAProxy is messing up Mongo replica sets." Again, this is not necessarily Ceilometer's fault; it's how you as a cloud admin have architected your deployment. You should figure out whether you actually need Mongo with replica sets within that architecture — maybe you should just move it out.

And finally, "Ceilometer is not production ready." Really, we heard this over and over — people moving away from Ceilometer, either by word of mouth or because they tried a few things and decided it just isn't production ready.

And from here on, there's nothing to do but go up, right?
That's just how the perception is. Now I'll hand it over to Gordon to talk about how Ceilometer has evolved from here.

So that's where we started, or where we came from, and I'm going to highlight some of the changes we made in the past few cycles. This is a similar architecture to what you saw originally, but with some subtle changes. The one thing I want to point out is that Ceilometer itself is composed of several discrete services: polling, notification handling, storage, and alarming. You can run all of them as a complete solution, or, if you choose, run each one individually. They're also designed to scale horizontally, so if you have a large load you can just add more agents to cope with it.

Specifically, in Juno we started to focus on some of the core issues around storage and the durability of our services. One thing we did was split the alarms database off on its own. We also updated the SQL database backend to simplify the model. Originally the model had a lot of weird relationships because we needed to handle both the v1 and v2 models of our API; in Juno we dropped the v1 API and were able to simplify the model quite a bit. We ended up storing a lot less data, and it improved performance dramatically.

Another thing we did in Juno was add oslo.messaging support. What it does now is publish to a topic rather than using RPC, which has a certain overhead. This also allows a lot more flexibility: because we're publishing to a topic, the collector consumes messages off the queue by default, but you could theoretically use any consumer to take the data from the queue.

The last thing we did was add coordinated HA for our
polling agents, using a tool called Tooz, and I'll go into that. Tooz is how we handle coordination between agents. It's a group-membership tool that some of the folks at eNovance created a while back, and it supports various backends like Redis, memcached, and ZooKeeper. The basic premise is that an agent knows about all the resources it can poll. When it starts up, it registers with Tooz and finds out all the other agents that are live and active. From that, it uses a hash ring to bucketize the resources, so it knows which resources it needs to poll. The benefit is that each agent is ignorant of everything else: it's lightweight, knows exactly what it needs to do, and doesn't care what other agents are doing. It's similar to what Ironic does in its conductor, I think. Every time you start, add, or drop an agent, it registers with Tooz, which broadcasts the new members of the group, and they redistribute the resources accordingly.

In Kilo we went further with the same idea of supporting discrete services. In the pipeline we added support for Kafka and HTTP publishing, so if you don't want to use Ceilometer for storage, you can push the data to another queue or to an HTTP target and consume it that way. We also added the same coordination technique to the notification agents, so you can have multiple notification agents — and if there are transformers, that's fine. We also added better event support: Ceilometer captures both metering/measurement data and events that happen within the OpenStack system. We originally had that in Icehouse, but it was completed in the Kilo cycle. It functions similarly to how we capture meters: you can push events through pipelines and publish them to multiple
topics or destinations. We also split off the events DB, so now there are three databases: one for meters, one for events, and one for alarms — though they can also be a single database if you want. For events, we added support for Elasticsearch, which is really good for full-text querying; I won't go too much into that — I have a session later that talks about it if you're interested. Lastly, we added jitter support: a random delay across all the polling agents so that they don't all hit the service APIs at the same time.

All right, so far we've talked about the problems Ceilometer has, the complaints a lot of people have, and how it has evolved. But regardless of how Ceilometer has evolved, unless you as a cloud admin or operator follow certain best practices and learn the software you're deploying, you're always going to have issues — regardless of what application you're deploying. You can complain all you want, but if you're not following certain practices, you're obviously going to have some issues. We saw a bunch of complaints, but as I was mentioning, not everything was Ceilometer's fault; there were other bottlenecks bringing Ceilometer down, or its performance was impacted by other applications running behind it. It's always better to understand, as a whole, how you're deploying the application and what it involves, and then deploy it based on your needs.

With that, let's go over some of the best practices. This is just a subset.
You should read about the other options and see what fits your needs.

To begin with: Ceilometer gives you a world of data. A cloud is a very complex application, and you get tons of data generated by the many projects Ceilometer watches. Having said that, if you choose to collect everything because that's what your deployment needs, sure — no arguments there. But in probably 90% of cases, there are certain metrics that matter to you most and others you really don't care about. We have a file called pipeline.yaml where you can prune which metrics to collect and which to keep away. That way, number one, your data is pristine for your needs; your queries aren't going to take long, because you're not collecting data you don't care about; and your database sizes stay manageable — at the very least, you don't have needless data to worry about. Things like that.
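To make the pruning concrete, here is a sketch of what a pruned pipeline.yaml might look like, roughly in the Kilo-era schema. The meter names and values are illustrative; the exact layout varies between releases, so check the docs for your version:

```yaml
---
sources:
    - name: cpu_source
      interval: 600          # poll these meters every 10 minutes
      meters:
          - "cpu"
          - "cpu_util"       # list only the meters you actually need
      sinks:
          - cpu_sink
sinks:
    - name: cpu_sink
      transformers:          # none needed in this sketch
      publishers:
          - notifier://      # publish over oslo.messaging
```

Anything not matched by a source's meter list is simply never collected, which keeps both the bus traffic and the database small.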
Pruning is one of the best practices. Next: polling intervals. There's a default polling interval, but you can tweak it to your needs. Maybe you don't want to poll every 20 minutes, or whatever it is; maybe you only want to poll twice a day. Go tweak that — that way you're not putting unnecessary load on other service APIs.

Then there's adding jitter to polling, something we added in Kilo. Imagine a scenario where you have, I don't know, a thousand compute agents running, and all thousand try to talk to the Nova API at exactly the same time. That's a huge load on your Nova API. Instead, you can enable jittering, which adds a random delay to when each agent talks to the APIs, so you're distributing the load and moving it around, and the Nova APIs aren't hit by all the agents at once. This is something you can easily turn on in the config file.

Scaling out: add agents as the load increases. This is something you as a cloud admin know best — how your infrastructure is scaling and how much load is coming at what point in time. Use your best judgment and add more agents based on your needs. That way you address the problem ahead of time, before your application goes down under heavy load while a few agents can't handle it.

And, as mentioned earlier: instead of the RPC publisher, use the notifier publisher. It's much more scalable.
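The "just add agents" story rests on the Tooz hash-ring bucketing described earlier. Here is a toy sketch of the idea — an illustrative reimplementation, not Tooz's actual API — showing how each agent independently computes which resources are its own, and how adding an agent only moves part of the key space:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every agent computes the same ring independently.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring in the spirit of Tooz partitioning:
    an agent only needs the member list to know which resources it owns."""

    def __init__(self, members, replicas=100):
        # Place each member at many points on the ring to spread load evenly.
        self._ring = sorted(
            (_hash(f"{m}-{i}"), m) for m in members for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    def owner(self, resource_id: str) -> str:
        # A resource belongs to the first member clockwise from its hash.
        idx = bisect.bisect(self._keys, _hash(resource_id)) % len(self._keys)
        return self._ring[idx][1]

resources = [f"instance-{n}" for n in range(1000)]
ring = HashRing(["agent-1", "agent-2"])
mine = [r for r in resources if ring.owner(r) == "agent-1"]

# Adding an agent re-buckets only a fraction of the key space;
# agent-1 keeps most of its previous assignments.
bigger = HashRing(["agent-1", "agent-2", "agent-3"])
still_mine = [r for r in mine if bigger.owner(r) == "agent-1"]
print(len(mine), len(still_mine))
```

Because only a fraction of assignments move when membership changes, you can add polling agents under load without a full reshuffle of who polls what.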
The notifier publisher uses oslo.messaging; RPC comes with overhead, and now that we have notifier support in Juno, you can avoid it.

All right, some of the datastore practices. We've heard a lot of complaints that Mongo doesn't scale, or that the backend takes too much space. There are things you can do. For example, the queries: make sure your queries are well-formed; don't do open-ended queries. It's obviously going to take a lot more time if you query tons and tons of data by a single point — use well-formed queries instead.

Run Ceilometer behind mod_wsgi. The advantage is that it gives you a lot more knobs to tweak. For example, in your Apache configuration you can set the number of threads and processes Apache should spawn, so you distribute the load and your API scales much better.

Set the TTL. This is a best practice for any application you're running, not just Ceilometer. If you have a database — and especially in Ceilometer's case, where we capture so much data — it always makes sense to figure out what data you really need and how much of it to keep. If you have years' worth of data, do you really need it? Maybe you only care about last week's or last month's data. Figure out what's best for your infrastructure and your needs.

Similarly with Mongo: don't run Mongo on the same node as the API — that's just basic common sense, right?
Put it on a separate node. That way, number one, the load and the space constraints are on a separate node, and you don't have to worry about "my API is not scaling", or "I see a bunch of errors in the logs — oh, it ran out of space because the API was on the exact same box as Mongo". Move it to another node and you've isolated the problem: if you have issues in the future, at least you know it's Mongo and not necessarily Ceilometer.

Also, when you're running Mongo, enable sharding and replica sets. Don't just run Mongo as-is; sharding helps it scale much better.

The bottom line with all of this: as an operator, as a cloud admin, try to learn the application — what it takes to deploy Ceilometer and which best practices to follow. Every application and every project out there has some issue or another; it's just a question of how big a problem it is. If you think Mongo is the problem, maybe investigate a different backend for your needs. But again, it's not necessarily Ceilometer's fault.
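For the mod_wsgi point above, the knobs live in the Apache virtual host. This is a sketch only — the port, paths, and WSGI script location are assumptions you'd adjust for your distro, and the process/thread counts should be tuned to your hardware:

```apache
Listen 8777
<VirtualHost *:8777>
    # Spawn a pool of daemon processes/threads for the Ceilometer API
    WSGIDaemonProcess ceilometer-api processes=4 threads=10 user=ceilometer
    WSGIProcessGroup ceilometer-api
    WSGIScriptAlias / /var/www/cgi-bin/ceilometer/app
    <Directory /var/www/cgi-bin/ceilometer>
        Require all granted
    </Directory>
</VirtualHost>
```

The TTL mentioned above is set in ceilometer.conf; in the Juno/Kilo releases the option is `metering_time_to_live` under `[database]` (seconds of samples to keep; a negative value keeps everything).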
We're not saying Ceilometer is perfect — Ceilometer has issues, absolutely, but so do other projects. You need to understand what your needs are and follow some of the best practices; that way you'll have a better experience with it.

Cool. So, as I mentioned before, Ceilometer is composed of multiple discrete services, so you can deploy it in many different ways to suit your needs. I'll run through a few deployments that I've seen or that are possible with Ceilometer.

This is the lambda design. The basic premise of the lambda design is that there's a fast path and a slow path. Our pipeline supports publishing to multiple targets, so in this scenario you take a single data point and publish it to two different places. You'd have a short-term database with a shorter time-to-live setting, so data expires quicker, and an archive database that keeps more data. The short-term path is for time-sensitive queries — for example, if you use alarms, you'd probably evaluate them against the short-term database so queries come back quicker. The archive path is for queries where you're okay waiting a few minutes or hours.

The next one is data segregation.
This one is similar to the lambda design. The use case: you have two different kinds of data points. One of them — say, CPU utilization — goes to the public data store, and more sensitive data, like audit data, goes to the audit database, and you have two different APIs controlling access to these databases to meet the required access control.

Ceilometer also lets you write JSON files rather than only writing into databases. There are big data tools out there that can consume JSON files, like Apache Spark, so you could write directly to a JSON file, load that into Apache Spark, and do your data processing that way.

Fraud detection: this is what I did at a previous company. We had a proprietary learning system, so we didn't actually use Ceilometer's storage or alarms; we just sent all the data via HTTP to our own system and built our rules and alarms there.

Custom consumers: in Kilo we also added Kafka support, and one of the consumers of Kafka is Apache Storm, which is similar to Apache Spark. This is also an option if you want to go down that route and you're familiar with these tools. I should add a disclaimer that none of these tools are actually part of Ceilometer; you can use them as extensions, but Ceilometer itself doesn't include them.

Debugging: in addition to metrics, we also collect event data — essentially notifications. Every service emits notifications about the state of its resources; Ceilometer consumes them and builds metrics from them, but also builds events. Using that, you can send the data into Elasticsearch and use Kibana, or Ceilometer's API, to query it, do some deep diving, and see what data is available.

And the last scenario is noisy services.
There are certain services out there that emit significantly more notifications than others. One of the things Ceilometer can do is listen to multiple notification buses — it can consume from multiple places. So if one service is particularly noisy, you can give it its own dedicated bus so it doesn't overflow the main one, and keep the quieter services on their own bus.

So where do we go from here? I'll highlight some of the goals we have for Liberty. You might have heard this term "Gnocchi" floating around. It's a pasta; it's also a storage mechanism that some of the Ceilometer developers have been building for the last few cycles, and we'll talk more about it next. It's something we're looking to make a viable backend in the Liberty cycle.

In Kilo we worked on events, and we hope to extend that functionality in the upcoming cycle — alarming on events, and maybe adding the ability to build samples from events. We also have a design session around declarative data collection. Right now, to collect a metric we actually write code, which is very restrictive: if something is added mid-cycle or beyond, it's hard to constantly bring it into master. What we want to do is declare our metrics in a file and load that in, without having to write code; that should improve the flexibility of our data collection.

And lastly — for me, when I ran for PTL, the main goal was to minimize bloat. We will add new features to Ceilometer, but whatever we do, we want to make sure Ceilometer remains as lightweight as possible.

Gnocchi.
Hungry yet? So what is Gnocchi? As Gordon mentioned, this is a project that Julien started to address some of the performance problems Ceilometer has. I'm just going to give you a very high-level overview to trigger your taste buds; you can read more about it — there are sessions Eoghan and Julien presented at the Paris summit that go into detail on what Gnocchi is.

So what is Gnocchi? Gnocchi is resource metering as a service. In Ceilometer, as we discussed, all the data is treated as one kind of data point; we don't really differentiate between what is a meter and what is the metadata associated with it — it's all one giant blob that goes into the database. What Gnocchi does is differentiate the data into two categories. One is metrics: single points in time, lightweight time-series data where we just care about the value. The other is the resource that the measurement maps to. Think of a resource as an instance, a volume, or an image; a measurement is the CPU utilization of the instance, the size of the image, or the number of disks in a volume. Gnocchi recognizes that these are two separate problems and uses the right tools to handle each appropriately.

So we have the concepts of a storage and an indexer. The storage is responsible for the time-series data; by default it uses a format called Carbonara — again, yummy.
The canonical implementation is Swift-plus-Pandas-based, and there are also patches to support InfluxDB, OpenTSDB, and similar databases for this lightweight data. The indexer stores all the resource information and maps each resource to its metrics; the measurements live in your storage, and you add measurements to a metric. This way you handle each type of data with the appropriate tools, instead of handling everything with one giant tool — Mongo, MySQL, or whatever — as in Ceilometer.

Gnocchi also comes with a few other nice features. We do eager pre-aggregation: in Ceilometer, when you make a query, all the aggregation happens on demand, right there behind the scenes, whereas Gnocchi pre-aggregates eagerly. We also have support for cross-metric aggregation — this is mostly for things like auto-scaling with Heat, where you have multiple instances and want to take measurements across multiple resources. And we have retention policies that you can define on a per-metric basis. Gnocchi recognizes that you don't want a single retention policy for your entire data set: you might want to retain CPU utilization metrics one way and other metrics another way, so you can define, per metric, how long to keep each. These are some of the advantages of Gnocchi — it addresses the problem with the right tools, the right way.
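The storage-footprint difference this buys can be seen in miniature with a quick sketch. The sample fields below are illustrative, not Ceilometer's exact schema:

```python
import json

# A Ceilometer-style sample: every data point repeats the meter name,
# resource identifiers, and metadata (field names are illustrative).
sample = {
    "counter_name": "cpu_util",
    "counter_volume": 42.0,
    "timestamp": "2015-05-20T10:00:00",
    "resource_id": "instance-uuid",
    "project_id": "project-uuid",
    "resource_metadata": {"flavor": "m1.small", "host": "compute-01"},
}

# A Gnocchi-style measure: the resource/metric mapping lives once in the
# indexer, so each stored point is just (timestamp, value).
measure = ["2015-05-20T10:00:00", 42.0]

sample_size = len(json.dumps(sample))
measure_size = len(json.dumps(measure))
print(sample_size, measure_size)  # per-point footprint, serialized as JSON
```

Multiply that per-point difference by millions of samples and the storage and query-time savings become obvious.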
And it looks pretty promising. This slide is a good example of a Gnocchi versus Ceilometer data point: look at the amount of data Ceilometer stores per data point, versus Gnocchi, which stores just the timestamp and the value. Gnocchi has a much smaller footprint and is going to be much more performant — it's pretty obvious from this little picture why, so we don't need to go much deeper into it.

We also have some benchmarks. These were collected using 10 clients posting in parallel, a hundred requests of a hundred measurements each. These are the POST measurements and these are the GET measurements, so you can see the performance. This run is Swift-based; Julien also has a blog post focusing on the Ceph implementation, so you can get an idea of canonical Swift versus Ceph as the backend — the performance is pretty much on par for both if you go and look at it.

With that, I'll pass it on to Gordon to talk about where exactly Gnocchi fits into our future architecture. Cool.
So this is how Gnocchi fits in. The basic idea is that your alarm evaluator will now query the Gnocchi API, which should significantly speed up your statistics queries because the data is already pre-aggregated. And this is how you'd deploy it.

Tomorrow we have an operators session — I think a lot of Ceilometer developers will be there — where we can chat about what needs fixing. We also have a few design tracks on Wednesday and Thursday. The main topics are event alarms and Ceilometer componentization, but there are a few other ones we'll be working on; if you're free, you're welcome to join. And I'll do a self-plug: on Thursday I also have a talk where we show how we use Ceilometer events to debug an environment.

And, as always, there's IRC if you want to ping us with questions, and the mailing list — just tag it with Ceilometer and someone will answer.

This slide highlights some of the resources for the stuff we've been talking about; you can jump into these links to learn more.

And that concludes our talk. There's beer, I think.