Good afternoon, everyone; good morning to some of you. Today we're going to talk a little bit about how we're going to enforce application service level agreements with Congress and Monasca. I'm Ken Owens, CTO of Cisco's cloud services team, and I have Fabio with me, my chief architect. I'm going to talk a little bit about the vision, then turn it over to Fabio to go deeper into how we implement this, or how we plan to implement it, and to talk about the current state of what we have and the next steps.

From a vision standpoint, it's a pretty simple vision: developers don't want to think about infrastructure, they just want infrastructure to work, as you all know. There are two pieces to that. The first piece is: how do you make it easy for a developer to tell you what their intention is, what they would like to see in terms of performance, or what constraints the application developer knows he has to live within for his application to be successful in the business and in the deployments he wants to deploy into, or that the business may want to deploy him into. Then you have the enforcement of that intent. Between those two you have a third area: how do you interpret the intent and turn it into some sort of specification that can then be adhered to when you deploy the application.

The vision we've been looking at so far is a very simple spider diagram, with maximums and minimums, where you allow the developer to work within those constraints. In general, those come from an administrator or the business owner. At the bottom of the picture
there's an estimated price per month, and the idea is that the business sets up the minimums and maximums: a developer can't make the application less secure than the business has said it can be, and a developer can't make it more expensive than the business is willing to pay for the application without getting additional approvals, for instance. For each of those dots you have a high, a medium, and a low, so we're trying to keep the intent as simple as possible. We're not asking a developer to tell us exactly how much CPU he needs because, as you know, that depends on how much usage the application is getting in production: it might need a lot of CPU because a lot of people are hitting it, or it might need almost none because it's only hit every once in a while. So instead of asking for a specific value, we just ask: is your application going to be CPU-intensive? If yes, say high, and we'll make sure we give it the most capability around that dimension.
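The high/medium/low intent model could be sketched in code along these lines. This is a hypothetical illustration only, not the actual Cisco implementation; the dimension names, business bounds, and price weights are all invented:

```python
# Hypothetical sketch of "intent within business constraints".
# Dimension names, bounds, and price weights are invented for illustration.

LEVELS = {"low": 0, "medium": 1, "high": 2}

# Business-defined bounds per dimension: the developer may not go below
# the minimum (e.g. security) or above the maximum (e.g. cost drivers).
BUSINESS_CONSTRAINTS = {
    "cpu":        ("low", "high"),
    "ram":        ("low", "medium"),
    "security":   ("medium", "high"),   # can't be less secure than "medium"
    "elasticity": ("low", "high"),
}

PRICE_PER_LEVEL = {"cpu": 10, "ram": 8, "security": 5, "elasticity": 12}

def validate_intent(intent, budget_per_month):
    """Check a developer intent against business min/max and budget."""
    price = 0
    for dim, level in intent.items():
        lo, hi = BUSINESS_CONSTRAINTS[dim]
        if not (LEVELS[lo] <= LEVELS[level] <= LEVELS[hi]):
            raise ValueError(f"{dim}={level} outside business bounds [{lo}, {hi}]")
        price += PRICE_PER_LEVEL[dim] * LEVELS[level]
    if price > budget_per_month:
        raise ValueError(f"estimated price {price} exceeds budget {budget_per_month}")
    return price

estimate = validate_intent(
    {"cpu": "high", "ram": "medium", "security": "high", "elasticity": "low"},
    budget_per_month=50,
)
print(estimate)   # 38: within the business envelope and under budget
```

The point of the sketch is the shape of the check: the developer supplies only coarse levels, and the validator rejects anything outside the business-defined envelope or over the monthly budget.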
If you need more elasticity or less elasticity, that's possible too: you can move the high and the low around within the constraints that the business has defined. This is an example of what it would look like if you'd gone in and said: here's, in general, where I'd like my CPU, my RAM, my network bandwidth, my throughput, my elasticity; you can come in and play around with that.

Then what's needed is monitoring. It's one thing to say, on a nice picture, "I want this set of high and low parameters"; the big question is how we then monitor that holistically, not just from the application standpoint but going all the way into the infrastructure as well. All of these applications are running on physical infrastructure, and as containers and other things come into play, you have to look holistically across the entire stack: the physical layer, whatever cloud software or virtualization layer you may have, whatever containers you run, the orchestration of those containers, the availability of the services those containers are using. All of these fairly complex pieces have to be monitored based on what the developers say they care about.

The last slide before we get into the implementation details looks at this in terms of alarms. You have a threshold at the low end where you say: at this point, trigger an alarm; and the action from that alarm is taken once it triggers. So you can give somebody a hint up front:
"Hey, your network latency is already close to the boundary of what you said you could handle." Then, once you cross that boundary, we would move those containers or servers to another location where the latency wouldn't be as high and rebalance that for you; but you've given them an idea that it's coming. With that, I'll turn it over to Fabio.

[Fabio] Let's look at the different capabilities we can implement using a combination of Monasca and Congress. The first aspect is an ops/NOC SLA. I have an example here that I took from the Congress documentation. It's a policy that is fairly complicated to read, but what it's really doing is looking at the utilization of a server or a VM and saying: if the utilization for the last X (in this case two months) is less than a certain amount, it means you're not really using this stuff, so I'm going to send you an email, because you're going on the wall of shame. You requested all these resources and you haven't really used them. In this case they're leveraging Ceilometer statistics; they could leverage Monasca statistics too, but that's not the point. The point is that we want to change the model in which this data is moved around.

The way it works today, you have a list of VMs in the system. Congress, through the Ceilometer data source connector, talks to the Ceilometer API at a given interval: it polls it and says, "can you tell me the CPU utilization statistics for these VMs?"
Ceilometer returns the values, and those populate an in-memory table in Congress. Now you see that one of them, instance number three, actually has that low value, and at that point Congress goes and talks to Nova and Keystone to retrieve the rest of the information. In reality, in this type of case, the polling happens in parallel: it keeps asking Nova and Keystone for this data, because the data source evaluates the policy continuously at a given interval. As you can see, there's a lot of data that needs to be moved; if you think of 10,000 VMs or so, the volume of data you move around is significant. Then what happens is that it finds that Ken's VMs are actually fairly busy, but mine is not, so it's going to flag me and say: hey, you haven't used that VM a lot. The policy engine uses that and says: you are now infringing the policy, you're out of compliance.

What we really want to do is move away from this model, which works fine but we think has some scalability challenges in large environments, especially when you also want to leverage it for containers, because containers usually come in a factor of 10 or 100 more than the VMs you run. We want to move to an approach using alarms.
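The polling-model evaluation just described (join the utilization statistics with Nova and Keystone data, then flag owners of underused VMs) can be pictured as a toy Python simulation. All table contents and the threshold are invented for illustration, and the real policy is written in Congress's Datalog, not Python:

```python
# Toy simulation of the underused-VM ("wall of shame") policy:
# join CPU statistics with Nova servers and Keystone users to find
# owners of VMs whose average CPU over the window is below a threshold.
# All table contents below are invented for illustration.

cpu_stats = {            # vm_id -> avg CPU % over the last two months
    "vm-1": 72.0,
    "vm-2": 64.5,
    "vm-3": 2.1,         # the underused one
}
nova_servers = {         # vm_id -> owning user
    "vm-1": "ken", "vm-2": "ken", "vm-3": "fabio",
}
keystone_users = {       # user -> email
    "ken": "ken@example.com", "fabio": "fabio@example.com",
}

THRESHOLD = 5.0

def wall_of_shame():
    """Return (vm, owner_email) pairs that violate the utilization policy."""
    return sorted(
        (vm, keystone_users[nova_servers[vm]])
        for vm, avg in cpu_stats.items()
        if avg < THRESHOLD
    )

print(wall_of_shame())   # [('vm-3', 'fabio@example.com')]
```

In the polling model, every one of these tables is refreshed from its service at each interval and the join is recomputed, which is where the data-volume concern with 10,000 VMs comes from.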
So you see that I have the exact same policy than before the only thing I changed is that now I have a new type of data Source, which is called monaska alarms and that at the source Wants to do the same thing wants to find the statistics of my CPU usage But that can be transformed into a monaska alarm So that monaska alarm as an expression which goes and look for the average CPU usage For a given VM and then if that's less than five is gonna go and fire an alarm So how this works First we need to define a Integration point a notification integration point with monaska monaska ready supports Nagios and email and web books So we probably need to go and create a dedicated web books for Congress in that way We can pass the information or the majority of information about the alarm back to Congress because But in mind that alarms can be a portion of the policy evaluation So Congress may have to do all sort of other stuff check all sort of other things in other systems they are not monaska nova Neutron you name it and so this alarm could be just a signal there that it could be a potential infringement then Congress will do the analysis of all the other components and then determine if it really is a situation to be worried about or not so then the monaska data source can create a Alarm notification and what monaska will do will store it into a setting database that usually prevented by a my sequel and In parallel to that The monaska agents that you have deployed will collect the the real data the streaming of monitoring data and the beauty of this is that One a monitoring agent eats the monaska api Basically what monaska api does stores into the Kafka bus and at that point you have several components that act in parallel And that's a very unique feature that monaska has because he basically evaluate Alarms in stream rather than as a post Fact right. It doesn't pull it from the database. 
So you see that when the Data arrives into the cluster there are two components with the threshold engines and and the persister for the persister does takes the data and Batches it up and store into the database for other usage where the threshold engine evaluate those alarms that been posted and see if any of the Matrix there are coming are actually changing the state in those alarms If they happen basically the threshold agent fire the alarm but firing the alarm is again simply Storing a notification of the alarm or an alarm indication in another queue another topic in the same queue And so there is another component which is the notification agent which it can be configured as I said before with webbook which will take that Message that error alarm and it can use the webbook interface to basically push it to Congress Congress already supports a methodology of having a RPC and we are looking at they are adding a new method which is called push notifications Which is way more aligned. So when that will be available, we probably switch to that. But even at the current state with the With the RPC, I can call that monascal arm data source and the monascal arm data source will basically When you eat the congress API will do an RPC call to my Monaskal data source which will populate the table with the value now You see that you will only populate tables with the values that are relevant It doesn't know it will not even create a table We will not even add a row to that table if any alarm will be fired so the other interesting aspect at this stage is that if We can somehow make alarms conditional to the evaluation of the entire policy Let's say my alarm is the entry point So I will wait for the alarm to happen if the alarm does happen Then I can start to think and look of all the other stuff I need to gather to understand if the policy is good or bad and So at that point this could be the entry point to say oh, I got the alarm. 
So let me check the Nova API to see which VM this belongs to; then I can find the owner, and through Keystone I can also find the email, and send a mail. So potentially (and I will need to work with the Congress guys in this space) we create a conditional aspect where the other parts of the policy are evaluated only when you receive an alarm. That would dramatically decrease the amount of data you have to gather at any given time. And from there it's the same thing: I get the table that says the policy is infringed.

Now, as Ken mentioned before, that was an example for operators. We're also looking at how we can empower developers to simply define an SLA that says: I have some needs, and I want the infrastructure to respect my needs, without having to go into the guts of the entire infrastructure. So let's do something simple like this: my application is a very important application, maybe a production application my business relies on, so what I really want to say is something simple like: alarm me if the host has some issues. I don't have to tell you what the issues are; you, ops, define those, or they could be defined or expanded later on. Those could be simple things: the host isn't alive, I cannot ping it, I cannot SSH into it anymore; I'm starting to lose packets, the connection is shaky, so it's probably not a very good situation; or the disk space is getting small, the disk will potentially fill up, and my application will suffer from it. As a result, from an application standpoint, I will probably want to live-migrate these VMs elsewhere. It's similar with containers: in a container environment, you would probably go and reschedule the containers elsewhere, taking them off that VM or host
and moving them elsewhere.

The very interesting thing here is that we already have a bunch of metrics in Monasca that we can leverage to implement this; all of the ones you see here are already available. So my fairly simple concept of "host issues" can be translated into a set of alarms, like the host-status alarm: if a ping or SSH check fails, a metric with value one is generated. The other ones are fairly obvious: percentage of disk used, network packet drops per second, and so on. So I can create four alarms, and those four can be independent alarms if I want, or they can all be sub-alarms of one bigger alarm; once one of them fires, I receive the alarm. Just for the sake of it, I mapped those out, and you can see that even the alarms are not that complicated to write. The idea is that in the future we want a concept in Congress that will allow us to take policies, together with the list of metrics and the values to be used, and have Congress simply instruct, excuse me, Monasca to generate those alarms on Congress's behalf. Same thing with the network-related one, the packet drops and all of that.

So, where are we with our work right now? This is the overall architecture that I described before. Monasca is available and supports the alarms.
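The set of host-issue alarms just described could be collapsed into one composite Monasca alarm whose sub-expressions are OR-ed together, roughly like this. This is a sketch: the metric names approximate the Monasca agent's host-alive, disk, and network plugins but should be checked against your deployment, and the thresholds are invented:

```python
# Sketch of building a composite "host issues" alarm expression for
# Monasca by OR-ing sub-expressions. Metric names approximate the
# Monasca agent's plugins and thresholds are invented; verify both
# against your deployment before use.

def sub_expr(metric, op, value, period=60):
    return f"max({metric}, {period}) {op} {value}"

def host_issues_expression(hostname):
    dims = f"{{hostname={hostname}}}"
    subs = [
        sub_expr(f"host_alive_status{dims}", ">", 0),          # ping/ssh failed
        sub_expr(f"disk.space_used_perc{dims}", ">", 90),      # disk nearly full
        sub_expr(f"net.out_packets_dropped_sec{dims}", ">", 100),
    ]
    return " or ".join(subs)

expr = host_issues_expression("compute-01")
print(expr)
```

Whether you define these as one composite alarm or as independent alarms, the developer's intent stays the same ("alarm me if the host has issues"); only the granularity of the notifications changes.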
He supports The alarms we have identified a fairly simple way to integrate between the two through using the web box and RPC and So really we are going to develop a Monasca alarm data store Which will be specialized in receiving alarms and handling alarms for policies And then the policy engine will work as is today because basically what it will do It will wait for tables to be filled with values and it will be the responsibility of the Monasca alarm data source to fill those Tables with the values when the alarm has been received So what the current status as I said We've done we developed a simple data source to validate that we can indeed talk with Monasca and that's Has been done But this is this is the old camp of polling we use statistics, which is you know gives a certain performance boost And then you know that was an exercise for us to really find out what the integration points How the data source is written and how it works and all of that was a more of a learning for experience And so we found a solution that we believe works And so what we need to do is We need to create that Monasca alarm data source, which instead of polling is gonna wait and process alarms and then We probably need to a stand a notification in Monasca because the current notification are fairly skinny. They don't send too much data back To the requester. I think we need more data. So then Really Congress has enough information to make an intelligent decision about okay receive this alarm What is the impact what does the other components that are important or make sense and then Lastly, I think this is more of a research more of an interesting Development is how do we Enounce the policy using the policy language. I think the policy language is very expressive So how do we make that policy language behave with two aspects one is the ability of giving hints of what the matrix and Freshholds should be and on the other end. 
which I think is the interesting part, is: can we make this conditional on an alarm, so that parts of the policy are only evaluated when there is an alarm as an entry point, and we reduce the amount of data Congress has to store at any given time? And that's it; I think we have plenty of time for questions.

Q: If the host is not being used for some amount of time, you send out a notification. How are you going to get alarms if, say, your policy is that your host cannot go below 50% CPU usage for six months? How are you going to handle that with a Monasca alarm?

A: Right. Basically, it would be Congress that stores that long time period, because Monasca will alarm me when I go under the threshold, and then it won't fire again. The way the alarm works is that when you go below the level, you get the alarm, and until you go back up it won't alarm again; the alarm stays in that state. So it's Congress's responsibility to keep a timestamp of when the last alarm happened, and to say: yes, more than two months have passed, and I didn't receive another alarm saying you're now using more than 50%.

Q: OK, so what data store are you going to use to keep track of all those alarms for that long a period? What kind of database?

A: It's the database Congress has, which currently is in memory. That could be a challenge, because if you lose the instance, if you lose the service, during the two months, you're back to square one. But I guess it would be possible to persist that; it's not a very big amount of data.
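The bookkeeping this answer describes (Congress remembering when each alarm entered ALARM state, clearing it on recovery, and checking the window later) is indeed small. A hypothetical sketch, not something Congress ships today:

```python
# Hypothetical sketch of the timestamp bookkeeping discussed above:
# record when an alarm entered ALARM state, clear it when the alarm
# returns to OK, and periodically flag alarms that have stayed raised
# longer than the policy window. Not an actual Congress feature.

TWO_MONTHS = 60 * 24 * 3600          # policy window in seconds

alarm_since = {}                     # alarm_id -> timestamp of first ALARM

def on_state_change(alarm_id, state, now):
    if state == "ALARM":
        alarm_since.setdefault(alarm_id, now)   # keep the first timestamp
    elif state == "OK":
        alarm_since.pop(alarm_id, None)         # usage recovered: reset

def violations(now, window=TWO_MONTHS):
    """Alarms that have been raised continuously for the whole window."""
    return sorted(a for a, t in alarm_since.items() if now - t >= window)

on_state_change("vm-3-low-cpu", "ALARM", now=0)
on_state_change("vm-7-low-cpu", "ALARM", now=0)
on_state_change("vm-7-low-cpu", "OK", now=1000)     # vm-7 recovered
print(violations(now=TWO_MONTHS))    # ['vm-3-low-cpu']
```

The state is just one timestamp per raised alarm, which is why persisting it (to survive a Congress restart during the window) would be cheap.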
It's just an entry with a timestamp; after a certain amount of time, you periodically go and fetch that data and check whether that policy is still valid or not.

Q: Yeah, OK. So are you going to rely on periodic alarms coming in, like every day you get an alarm that says "below 50%"?

A: Right. We talked in the session about Monasca adding some sort of periodic or repeating alarm, and when that is enabled it will actually make this kind of example much easier, because at that point I don't even need to remember: with the periodic alarm, I have the first alarm and then the timestamp of each period, so when I receive an alarm after two months I can immediately identify: oh, this has been alarming periodically for more than two months, so I'm in policy violation.

Q: Thank you.

A: You're welcome.

Q: Two questions on the long-term storage for some of this: any idea on how you'd go about purging the data?

A: On which side, sorry, Monasca or Congress?

Q: In Congress.

A: Currently, I don't think Congress has a way of purging the data. It's an in-memory database, and it's refreshed at a given interval. What it does do is a delta, though: in a traditional polling mechanism, if I poll, say, Keystone for the list of users, the data source builds the list the first time, and when it polls again it only adds the difference from the last time. But it will need to keep the full list if I need the list of all the users in order to email them; there is nothing today
that I can do to avoid that, unless the team disagrees with me on that.

Q: A Congress question: let's say a violation of policy occurs and there's a migration that needs to happen. Is that happening within Congress, or is there something polling Congress?

A: There are actually two ways. You could have an external agent that polls Congress and says: I'm interested in this particular policy, maybe because you are the policy owner, or you have the grants to see that particular policy; and then, when the policy is in alarm, you take your own action. Say you have Mistral and you are polling because you created that kind of policy; then you can say, if the policy is infringed, I have a workflow that I want to run, which will do the migration or whatever the Mistral workflow wants to do. Alternatively, in the policy itself there is an execution model where you can say "run something", and the execution command in the policy allows you to run anything that can be run through a CLI for any of the services. So potentially, in your policy itself, you could have an execution like "nova live-migrate whatever", and have Congress actually call Nova and perform the execution on behalf of the tenant.

Q: This to me kind of sounds like incident management. Is there any concept in the future of eventually giving you problem notification, or a way of escalating this up to a problem-management aspect?

A: What do you mean by problem management?

Q: This pattern seems like incident notification: there's an event. But if it's an ongoing problem, take your example: your problem-management team can look at that data periodically and say, OK, we've seen this happen for the last six months, every week, on Wednesday at 2 p.m. What's going on Wednesday at 2 p.m.?
A: Yeah, they would be one of the external consumers of that policy, because you can monitor the policy, and if the policy shows this pattern, then the NOC can say: OK, I'm going to go ask the ops people why we always get into this situation.

[Ken] What's interesting, and it's kind of like the point you made, is that you may not want an automatic action to happen. You may want to say: here's a recommendation for an action; send this off to someone who can decide whether it should actually happen; and when they hit OK, that fires off something, either a workflow or some other execution command. So I think both use cases are important to keep in mind.

Q: Somewhat related to that last question, I think. Inside ECOMP (you've probably seen the white paper from AT&T about this thing we're going to take into open source), one of the first things we're going to release is what we call our VNF event stream. So, first: I think you have Nagios integration today, right, in Monasca? Or is it just within OpenStack?

A: I think it's Vitrage, maybe, that has the idea of a Nagios plugin. Monasca supports PagerDuty; I don't think it supports Nagios.

Q: OK. So, say you have this notion of exhaust coming out of the VM, which includes syslogs, all the stuff you typically get from SNMP, things that are application-specific, right?
And you stream that up to a collector which throws it all into a data lake and does the sorts of things you're talking about: intelligence, time trending, prediction, and so on. One of the functions that thing can do, and it's one of the pieces of the data collection and analytics engine component of ECOMP, is actually integrate with Monasca, for example to issue alarms. So that's the first question: if I have that, a set of information that's workload-internal or host-internal rather than OpenStack-infrastructure-internal, how would I integrate it with Monasca to invoke these same sorts of things?

A: There are two ways of doing it, a lightweight one and a heavyweight one. The heavyweight one is that you take the metrics themselves, because Monasca is metric-agnostic: really, there is a metric name, a value, a timestamp, and a bunch of dimensions, so Monasca doesn't have to have metrics that are OpenStack-only; it can consume any metrics, it's a general-purpose metrics engine. So one way is that you push the metrics that are relevant for the type of alarms you want and forward them back into Monasca, similar to that Monasca agent.
So similar to that Monasca If I go back So that Monasca agent you're there you can be a new agent that will push new type of metrics into it And then there is a alarm that is based on that the second way of doing which is I think is more lightweight It's better because you don't duplicate the data is that you do you get your own metrics You do your own analytics and as a result you generate a new metric Which is a composite metric or is Distilled metric out of your analytics right and that is still metric will be pushed into Monasca And then you have an alarm the only act on the distilled metrics that you have right and if it goes above a certain value Or if it does exist like the one that was like I cannot I cannot I cannot SSH 0 1 right So that become more of a flag if that metric exists with value 1 then fire the alarm And that's the the the creation of that is something that's documented in Monasca's Yeah, Monasca as the way where you what Monasca is an API and as an agent so you can just eat the API we've With your metric and it will be automatically stored and because it's treated as anything else if you have designed an alarm that uses that metric It will fire Okay Yes But but but but to answer to this gentleman here is there The aggregator will use the metrics that you have pushed as part of the external system So if you don't want to replicate or store those metric twice in the external system You do the analysis of the aggregation you fed into the other on only one metric, which will be then allowed So it depends how much data replication you want Yeah, especially if the data the way I would do it if the data is then Relevant to be seen within the open-stock environment to correlate the metrics in views and then I will push it into Monasca because then you augment what Monasca can see from the open stock to the underlying for structure or the Networking for structure around it or whatever AT&T is capturing around the open-stock environment, right? 
But if it's only relevant for one thing, for example scaling, auto-scaling, then I would just push the relevant metric to the Monasca API, batching the data, though.

Q: Very quickly, the other approach that we've been considering: of course, we can create a data source driver for Congress, and we can create tables, policy tables that abstract the data we're seeing inside our kind of over-the-top analytics.

A: Yeah, but if you're in that camp, you are in the polling camp, right?

Q: Well, again, if there's a push driver, we can push data into that table. That push driver was one of the things we proposed, initiated through OPNFV, to be implemented in Congress. And the very last thing, really, sorry, very briefly: you mentioned a policy expression for intent, a policy-expression-to-Congress-Datalog parser, translator, or whatever. Did I get that right, that you want to allow developers to express these policies without necessarily having to know the Datalog language?

A: Yes, that's the vision: to have a very simple UI.

Q: So it would be very important, I think, to make sure that that policy expression is aligned with what's going to come down in, say, a VNF package as part of a TOSCA blueprint, so that we make sure we do it one time.

A: Yep, write it once, minimize the number of parsers, et cetera, so we can reuse them.

Q: [partly inaudible] What data is lacking? Why do we need a new webhook?

A: For the alarm: I think, if I'm correct, Monasca just sends the ID, the state transition change, and maybe the value. What we probably want to do is also push some of the dimensions that are in the alarm,
the dimensions that are in the metric that generated the alarm, because if I already get, for instance, a host ID or a tenant ID or all that kind of stuff, that data is very useful to feed up to the policy engine: this is the tenant ID, so I don't need to go talk to anybody else to find the tenant ID, and so on.

Q: That's changed recently; I think most of that is there now. But what I'm wondering is whether we've talked about maybe adding a custom field to that.

A: At this stage, I don't think we will need that. It could be needed if we somehow have to change the course of action, to perform a different action based on the values of the alarm. For instance, if you look at the vision aspect, there are soft limits and hard limits, so I could have some metadata telling me what to do depending on whether the soft limit or the hard limit was hit, and which different things I need to kick off or not. For this first implementation or integration stage, I think we can do without it, but it could be an interesting thing, and it could definitely be pushed through. So if the alarm settings had a concept of metadata that I can attach, then you just give me the metadata back; that would be fairly easy.

Q: First of all, great presentation. A couple of things I was thinking: how are you planning to give the integrations that you have done out to the community? And what are the plans for the proposals that you have; is it going to be done as part of the open-source projects?

A: Yes, yes. The idea is to do those changes upstream. We may end up having a new notification mechanism, which is a Congress-specific one; I need to see if it's needed, and we'll do it if it is. If it's not needed,
it's just a matter of creating the right webhook URI, which would be even easier. Then, in Congress is where the majority of the work will go, because we need to implement that particular data source that deals with the Monasca alarms. And then I'm going to work with the community to try to see if we can express the policy in such a way that alarms are the entry point for the validation of the policy: there are all these other aspects I need to validate, but my alarm is the entry point, and only then do I start talking to the other services to gather their information, instead of doing it in parallel. To me, that's the other interesting part I want to work on with the community. The third is this ability to execute things, which is already there. So how do we tie it all together so I can really demonstrate things? For instance, I simulate a failure: I disconnect, I make a host's connection difficult or something like that, and then the alarm allows me to go talk to Nova and live-migrate the VM. If I can get a scenario like that working with a combination of Monasca and Congress, I think that would be a very interesting solution.

[Ken] I think our biggest goal here: it seems sort of complex, but we're trying to do something pretty simple, get something we can contribute back to the community, and kind of start the conversation. I do think that, as Fabio alluded to, constraints and non-coexistence types of policies get much more complicated.
We definitely want to get to that point, and work with the community to figure out the right way to provide much more detail around policy and how you would want to see a policy executed. But we were trying to find a good starting point, and that's where we're at today: let's just start with this, see where it goes, get the community involved, get feedback, and then, as we want to add better and more explicit policies, we can work on that together and decide what the next step is after this. Great question. Any other questions? Thank you all very much for your time today; it was great.