Okay, today we'll cover, at a high level, the changes and the workflow that were introduced in order to use quality of service: the API-level changes and a bit of the detailed implementation. We're also going to cover a real-life customer use case and future work, and hopefully we'll get to show you a demo as well. So, very ambitious, and I would really like to get started.

Okay, so to kickstart the session, I was looking for a definition, and the definition I found is actually a good starting point; it gets me to the first point I wanted to emphasize. So let's review the definition: what is network quality of service? It's the ability to guarantee certain network requirements in order to satisfy an SLA between the application provider and the end users. It's an awesome definition, a very high-level definition. For those of you who've been around OpenStack Neutron for the last couple of years, you probably know that the original blueprint submitted for quality of service was introduced in Havana. Anyone remember Havana? It's over two years ago, and this definition is actually one of the reasons why it took so long. There's no de facto industry standard; it's too high-level a definition. There are multiple aspects to quality of service, and obviously everyone can take it in their own direction. We have multiple vendors in the community; we wanted to accommodate everything, we wanted to do everything, and we wanted to do it very well. Well, two years.
We actually didn't deliver anything. So what we decided to do was acknowledge the fact that there is a variety of ways to define quality of service. Even if we look only at the Linux domain, which is kind of a narrow domain, you can look at traffic control and you have rate and ceil rate, burst and ceil burst, and there are queues and there are classes; tons of stuff. And you go up to the libvirt layer, which gives you a bit of abstraction, which is good: you have min, max and burst, much better. But still, there's no consensus, there's no standard. That was one of the problems we had to deal with.

So we took a step back and we actually looked into what problem we want to solve, which was also an interesting question to ask. We came up with three things, basically, that we would like to solve. The first is that we would like to give the cloud administrator the ability to control the physical devices in the data center, the physical resources. We want him to control the bandwidth and the way he shares the bandwidth between different tenants. We want him to be able to provide different SLAs to different types of networks, which is something we all want to do. And we maybe want to enable him to give different quality of service to different tenants and be able to charge accordingly. So those are the things we identified as something we want to do, which is still really high-level stuff.

Okay, so we took another step and we said: okay, no more general definitions, let's go into a use case. Let's solve a use case in Liberty.
Let's deliver something that our community, our users, can actually use. So we came up with the noisy neighbor problem, also known as the chatty neighbor problem. For those of you who are not familiar with it, it all starts with a shared resource, in our case the network bandwidth, and we want to share it among multiple consumers, in our case the VMs that share a single hypervisor. They all want the same resource, like these two cute otters: they want the same biscuit. But if one of them takes a bigger bite, more than his fair share of the biscuit, I'm sure the other otter won't be so pleased. And it's the same with our virtual machines: if one of the virtual machines becomes very chatty, the other VMs suffer, and we would like to avoid that problem. This is the problem we actually focused on in Liberty, and this is what we were trying to solve. One more thing we kept in mind is that there is a wide range of use cases, so we had to implement something in a generic way.
We have to be able to extend it; we want to make extension easy for other vendors and for other types of rules. That was something we kept in mind throughout the implementation, and I'm sure you're going to see that throughout the presentation. One more thing: we actually wanted to deliver something in Liberty, and that was our biggest challenge. We wanted the work to start and end in one cycle, to scope it really well. Focusing on that one use case enabled us to do that, because it eliminated three really big challenges.

So let's see what we did not handle in Liberty but hopefully will be able to handle in future cycles. First, we didn't handle any of the physical layer. We do not configure any quality of service on the physical switches or physical routers; we do not handle any of that in Liberty. Second, we only handle the traffic that is leaving the virtual machine, the VM egress traffic. Why? Two reasons: one, we are able to solve our use case without handling the incoming traffic; and two, rate limiting the traffic coming into the virtual machine is a more complicated problem to solve. Not super complicated, and we have plans for Mitaka, but still it was out of scope. The last one, and my favorite, is that we did not handle any integration with other projects, specifically not with the Nova scheduler.
It's really challenging; we really wanted to get something done, and that was the reason we postponed it. For example, to implement a minimum bandwidth guarantee we would have to integrate with the Nova scheduler, because we want to make sure that when you schedule a VM on a hypervisor you'll be able to give it the minimum bandwidth you want and not overcommit the bandwidth. So this is again something that is planned for Mitaka, not handled for this use case.

So, what we did: we met in the Red Hat Tel Aviv office and we had a coding sprint for three days. We had five companies on site participating in this effort (Red Hat, Midokura, Mellanox, VMware and Huawei), and we had multiple remote participants. The core team members in Neutron were really helpful; I'd especially like to thank Miguel and Ihar, who led this effort and are actually here in the room. So thank you guys, you did awesome work. It's one of the key features we were able to accomplish in the Liberty cycle.

So we were coding, and this is what we came up with. Maybe one sentence before that: we were focusing on the OVS default implementation, so everything I'm going to mention at this point is for OVS. Later we'll also mention SR-IOV, which is another mechanism driver that also implements the API. And we were focusing on rate limiting. Remember the use case;
it's going to follow us throughout the presentation.

So, the first entity is a policy. A policy is basically a container of rules. It has the trivial parameters like name, ID, description, shared and tenant ID. You can go ahead and create a policy with `neutron qos-policy-create`, put in the name and description, and there you have it. Then you can associate a policy either with a Neutron port or with a Neutron network. If you associate a policy with a Neutron network, all the ports in that network inherit the policy from the network. You can always override the policy at the port level.

The next API model we added is the rule. We added an abstract entity, the quality of service rule, to be able to extend it with multiple types in the future. The first type we implemented in the Liberty cycle is the quality of service bandwidth limit rule. It has two parameters, the max rate in kilobits per second and the max burst in kilobits; you configure these two parameters and go ahead and apply your policy. One of the things we're going to add, and it's a blueprint that was already approved for Mitaka, is the DSCP marking rule. It's a new type that is going to be implemented in this cycle, and it's going to be very similar to this rule. It only has one parameter, the DSCP mark, which is an integer. There you go, it's as simple as that.

Okay, so we have a quality of service policy. It can be associated with multiple networks, and it can be applied on multiple ports. For now it can only have one quality of service bandwidth limit rule. Going forward, one of the future work items
I'm going to present later is the network classifier, and once you are able to classify traffic, you'll be able to associate different types of rules with different types of traffic, and then you'll be able to have more than one rule in a policy, even more than one bandwidth limit rule in the policy.

Okay, so the workflow: you create a policy, you create rules in that policy, and you associate the policy with a port. By default the permission model enables only the cloud administrator to create a policy, and the cloud administrator can go ahead and associate the policy with a network. The tenants themselves cannot disassociate the policy from the network, which means it's a way for the cloud administrator to force some kind of quality of service on tenants. Another option is for the cloud administrator to mark the policy as shared; if a policy is shared, the tenants can go ahead and disassociate the policy from the network. The third option is that the cloud administrator fully trusts the tenants: you go ahead, change the policy.json file, and enable the tenants to create their own policies, associate them with their networks, and so on.

Two more things I'd like to mention. The first is that a change to a policy immediately propagates to the ports, with no downtime. So if you have a running VM associated with a policy that, for example, limits the bandwidth to 5,000 kbps, and you go ahead and increase that to 7,000 because we are very generous today, then the VM immediately gets more bandwidth. No downtime. That's cool. The other thing I want to mention before we deep dive into the implementation is that by default the feature is not enabled; it's implemented as a service plugin extension. So if you want to use it, you have to go to the configuration file and enable it. It's all specified in the user manual.
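To make the workflow concrete, here is a sketch of how it looks from the CLI, following the Liberty user guide. The policy name, rule values, network name and UUID placeholders are illustrative:

```shell
# Enable the feature first (off by default), as described in the manual:
#   neutron.conf:               service_plugins = qos
#   ml2_conf.ini ([ml2]):       extension_drivers = qos
#   OVS agent config ([agent]): extensions = qos

# The admin creates a policy and puts a bandwidth-limit rule in it
neutron qos-policy-create bw-limiter --description "limit noisy neighbors"
neutron qos-bandwidth-limit-rule-create bw-limiter --max-kbps 5000 --max-burst-kbps 500

# Associate it with a whole network (all ports inherit it) ...
neutron net-update private --qos-policy bw-limiter

# ... or override it on a single port
neutron port-update <port-uuid> --qos-policy bw-limiter

# Updating the rule propagates to running VMs with no downtime
neutron qos-bandwidth-limit-rule-update <rule-uuid> bw-limiter --max-kbps 7000
```

Disassociating is done the same way, with `--no-qos-policy` on the port or network update.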
We have very good documentation for this feature; I hope you read it, I think it's good, so go ahead and try it. It's very useful. Now, let's start the deep dive.

Just before I get into the functionality we support, I would like to remind you of some terminology, which is actually always a bit confusing. Quality of service can be applied on either ingress or egress traffic, or maybe both. When we talk about VM ingress traffic, traffic that enters the virtual machine, it is actually the bridge egress traffic; and when we talk about VM egress traffic, traffic sent out of the VM, it is actually the bridge ingress traffic. In the case of Open vSwitch, each virtual machine device is represented by a tap device on the physical server, and it's possible to apply the policy we need for a VM on the interface of the bridge.

Currently Open vSwitch supports two different options for applying quality of service. One is policing on traffic that ingresses the bridge, which is a quite simple quality of service mechanism that drops packets if they exceed the configured rate. The other option, which is currently not used by the work we did, is applying shaping on the egress traffic, which is a more sophisticated mechanism that actually queues the exceeding packets. For the quality of service rate limit implementation that we did for Liberty, as mentioned before, we want to limit the traffic that egresses the VM, so we do it by applying ingress policing on the tap interface of the OVS bridge. Once a user defines a bandwidth limit rule and associates it with a port that is realized by Open vSwitch connectivity, what happens at the driver layer is just the execution of Open vSwitch control commands, deploying the attributes we have on the tap interface.
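Under the hood, what the driver does is roughly equivalent to the following OVS commands; the tap device name and values are illustrative:

```shell
# Police traffic entering the bridge from the VM (i.e. VM egress traffic)
# at 5000 kbps with a 500 kbit burst allowance
ovs-vsctl set interface tap0a1b2c3d ingress_policing_rate=5000 \
                                    ingress_policing_burst=500

# Setting the rate back to 0 removes the policing configuration
ovs-vsctl set interface tap0a1b2c3d ingress_policing_rate=0 \
                                    ingress_policing_burst=0
```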
We currently support two attributes: the ingress policing rate, which is an integer measured in kilobits per second, and the ingress policing burst, which is measured in kilobits. It is advised to set the burst size to be at least as large as the interface MTU and, to allow the algorithm to be more forgiving, to give it at least 10% of the policing rate itself.

Another mechanism driver currently supporting quality of service is SR-IOV. Let me just explain a bit what SR-IOV is all about. The SR-IOV technology allows a PCI device to appear as multiple separate PCI devices in the host. So we have one single physical port which can be shared among multiple virtual machines, and it can provide near-native performance to the guests. Whatever bandwidth we have for the physical port is shared across all virtual functions of that physical port. OpenStack has supported SR-IOV virtual function passthrough as a network interface to the guests since the Juno release. In order to apply a rate limit on the SR-IOV port we use ip link commands, and we need to know the exact physical device, meaning the device of the physical function that hosts this virtual function, and the virtual function index. Then we apply the requested rate on that virtual function.

The ip link utility actually has a number of limitations. First of all, the measurement is done in megabits per second, not kilobits per second. So when the user specifies the rate in kilobits per second, as in the current API, it has to be rounded to the nearest megabits per second. It also doesn't support floating-point numbers, only integers, so it will always be the nearest whole number. And it also doesn't support the burst.
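Since ip link only accepts whole megabits per second, the API's kbps value has to be rounded before it can be applied. A small sketch of that conversion, assuming simple round-to-nearest; the device name and VF index in the comment are illustrative:

```shell
# API value in kbps; ip link wants an integer number of Mbps
kbps=2500
mbps=$(( (kbps + 500) / 1000 ))   # round to nearest whole Mbps

# The driver then runs something like:
#   ip link set eth0 vf 3 rate "$mbps"
echo "$mbps"
```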
So the burst will just be ignored by the implementation at this moment.

Now we'll deep dive into the implementation details. Quality of service was implemented as a service plugin that implements the quality of service API. The plugin handles API requests for quality of service policy and rule related operations. It is responsible for persisting the quality of service models in the database, which holds the policy, the bandwidth limit rule, and also the bindings between the core resources, such as network and port, and the quality of service policy. We have pluggable notification drivers that are used to propagate quality of service object modifications to the backends. Currently, for the reference implementation, we have a message-queue-based notification driver that is responsible for sending notifications to the L2 agents.

As part of this implementation, some infrastructure mechanisms were added in order to support a pluggable way to add additional rules and to support rolling upgrades in the future. We have the RPC callbacks mechanism, where both sides register: the server side registers as the provider of the information, and the L2 agent subscribes to the changes. This is a middleware that we use to propagate the changes from the server to the agent. The data that is passed between the service plugin and the agent actually uses the oslo.versionedobjects library, and Ihar here did a tremendous job introducing this mechanism, so it's a very easy and forward-looking model for propagating changes between the server and the agent. Now, to the agent side.
We also had some infrastructure changes there. We introduced the agent extensions layer, which allows loading different L2 agent extensions, and currently we have the quality of service agent extension. This is a generic layer that can be reused between different agents. In the case of the quality of service agent extension, it uses a quality of service agent driver (we currently have drivers for SR-IOV and for Open vSwitch) to do the specific configuration on the underlying technology, while the quality of service agent extension itself handles the generic part that can be reused between different drivers.

To integrate this functionality with the core resources, we didn't want to follow the most common but very abused pattern of mixins. So we introduced the core resource extension, which defines the interface that the L2 plugin should implement in order to delegate the extension's association with the L2 core resource to the handling service plugin. In our case there is a quality of service core resource extension that propagates the policy ID association with a port or network to the QoS service plugin for further handling, and also stores it in the database. For the reference implementation, in the ML2 plugin we have an ML2 quality of service extension which simply invokes the quality of service core resource extension and uses the service plugin to provide all the required work for the quality of service models.

Now I would like to talk about a few use cases for the extensibility that can be done with quality of service. We envisioned the extensibility from the very beginning, and it was actually one of the major requirements for our implementation. The most common case is adding a new rule type to the model, and as I mentioned, we already have this happening for Mitaka with the DSCP marking rule.
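If the DSCP marking rule lands with the same CLI shape as the bandwidth limit rule, using it could plausibly look like this. The command name and flag are an assumption based on the approved blueprint, not a shipped interface:

```shell
# Hypothetical command for the planned DSCP marking rule,
# mirroring the existing bandwidth-limit rule commands:
neutron qos-dscp-marking-rule-create my-policy --dscp-mark 26
```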
It's actually quite easy to add a new rule. It should initially be defined in the QoS API as a new API resource, and the proper handling method should be added in the quality of service plugin. Obviously, a new data model should be defined for this rule type. Since, as I mentioned before, we use versioned objects passed between server and agent to propagate the changes of the quality of service policy and rules, we'll have to introduce the versioned object for this rule and bump the quality of service policy version to reflect that there is a new rule type that extends the current policy model. To manage this new resource, we obviously also need to update the Neutron CLI with the handling of the new resource. Both neutron server and neutron client were implemented in such a way that it is very easy to add a new rule type with little modification of the general code.

Another use case we envision is that quality of service is going to be implemented by more mechanism drivers or L2 plugins. First of all, a mechanism driver or plugin should declare which quality of service rules it is going to support. This just requires populating an attribute of the mechanism driver class with the list of supported rule types. In the case of the ML2 plugin, the supported rule types are the common subset of all the rule types declared by the active mechanism drivers in the current deployment. On the service plugin side, whoever wants to add support for a backend will have to add the notification driver that propagates the model changes to that backend for further handling.

Last but not least, the L2 agent extensibility. Let's say Linux bridge, or maybe some vendor-maintained L2 agents.
They want to support quality of service on the agent side, and they can reuse the infrastructure we already added in the Open vSwitch and SR-IOV L2 agents: the quality of service agent extension, managed by the agent extension manager. In such a way, it only requires implementing the quality of service agent API driver in order to configure the underlying networking technology used by that L2 agent.

Now we would like to explain the workflow we have for the most common user operations. First of all, what happens when a quality of service policy is attached to a port? Initially the user just created the policy, populated it with the required rule, and then a port is created or updated with this policy. The ML2 plugin will send a notification to the L2 agents about this port change, and the L2 agent will queue the change to be handled during the general daemon loop that it executes. During this loop execution, the agent queries the server to get all the device details (actually the Neutron port details) for the devices it needs to update. When the get-device-details call comes to the ML2 plugin, after it populates all the common parts such as the MAC address, IP and so on, it invokes the registered ML2 extensions, one of them being quality of service. The quality of service extension gets the details of the policy associated with this port from the quality of service plugin, using the same core extension I explained before. Once this data is back at the agent, the agent first of all does its regular L2 connectivity work, and then the agent extension manager invokes the registered agent extensions. When it comes to the quality of service agent extension, what it does is pull the details of the policy from the server using the RPC callbacks infrastructure, and get in return the quality of service rules. Once it gets the rules, it invokes the registered quality of service driver
(currently the OVS-based one or the SR-IOV one) and requests the configuration actions: first of all it removes the previous policy and rules, and then it configures the new ones. The generic layer of the quality of service agent extension stores the mapping between the policy and the port, so once a policy is updated and the notification arrives, it can do the reconfiguration of the bound ports.

To support this policy update, the agent side registers for RPC notifications upon policy changes; on the server side, the quality of service plugin registers itself as a producer of the information for policy changes. As I mentioned before, we introduced some general infrastructure for RPC callbacks, which hides the usual RPC topics work that we normally do to propagate RPC messages between agent and server. Once some policy is changed by the user, the RPC callbacks mechanism is notified by the quality of service plugin that the policy has changed, and then the policy changes are pushed to all registered L2 agents. When it reaches the agent, which has the mapping I mentioned before between port and policy, the quality of service agent extension goes over all the ports related to this policy, removes the old rules, and configures the new rules.

Hi, I would like to talk about a customer use case that started at the beginning of this year, around February. A customer contacted us with several issues to solve and asked us to help him build an OpenStack cloud for his needs. The requirement list wasn't so clear at the beginning, so we met several times and discussed the requirements; after a few meetings,
we had a clear list of the requirements. The requirements split into two areas: the first one is the network area, and the second one is the virtualization and cloud area.

For the virtualization and cloud area, the customer asked us to provide virtualization with support for multi-tenancy, and for each tenant the customer asked us to provide a policy with tenant-specific policy settings. Each tenant should run several applications on top of his virtualization; for this case we also provided containers, a list of containers, on top of the virtualization, on the VMs, the virtual servers. Also, we used the standard SR-IOV, the standard SR-IOV in the community, in order to provide a virtual function for each VM. For each virtual function we also support a QoS policy with rate limiting; the next stage, for next year, is to also support bandwidth guarantees. For the network side, the customer asked us also to provide auto-provisioning for the policy, in order to enforce the policy with the rate limit on the network side as well, not just on the VM side, and also to support HA for the network side. As you can see in this slide,
we also provided something we call VF LAG, a virtual function LAG, which is transparent for the VM, for the tenant; the user doesn't see the LAG. The LAG is supported by Mellanox in the NIC; it's transparent for the user, with the policy and the hash function implemented on our NIC side. Also, we provided an ML2 SDN mechanism driver plugin that supports the policy as well. The ML2 SDN plugin actually publishes the policy properties outside, to any SDN controller. At Mellanox we have a product called Mellanox NEO; Mellanox NEO gets those notifications and configures the entire network with the policy, and it also provides auto-provisioning of segmentation, like VLANs, or, in the case of the InfiniBand solution at Mellanox, we provide PKey isolation.

Okay, this is the VF LAG that I mentioned. In the VF LAG driver, as you can see, we split into two virtual functions; each virtual function is associated with a specific network port. It's transparent for the user: the user doesn't see the two virtual functions, only one, and for this virtual function we define the policy. Also, please note that for a LAG we can support active-active, active-passive and LACP; in this case we took only the active-passive option because the user just asked us to implement the core support. Thank you.

Okay, so we've seen what was implemented in the Liberty cycle. We've seen that we only cover a basic use case; there's a lot more work that needs to get done. Some of the things we've only started, some of them we already started to work on, and for some of them we would really love more working, coding hands. So let's see what's currently baking. So first, we have marking.
So we have the marking We have DSP marking that was mentioned a couple of times during the session the spec was submitted and approved for the I really hope we'll get the implementation fully done within the cycle So we'll have a DSP marking rule We also have a patch already available to implement the cost API changes on top of linux bridge agent If you guys are interested so it's up for review and you can look take a look at that One of the interesting more interesting work that is going to be done Hopefully in the Miteca cycle is traffic classifier It don't it not only applies to quality of service I've been to a session about a network function chaining that they mentioned They already have some kind of implementation to turn network classifiers Also, I talked to one of the guys about be it's being useful in security group So it's a key feature probably needed by many features Hopefully we'll get started with it in the Miteca cycle will get the network classifier in and then we can apply Quality of service rule on specific types of traffic. So it's really cool stuff You don't see it in the slide But there's a walk to integrate a our back mechanism that was introduced in liberty our back is a role-based access control a finer grained Access of permission model. Let's call it like that if you want to create a policy and only share it with me Specific tenants and not with all tenants. You'll be able to do that Obviously, we also have to handle the upgrade mechanism and few other a marking So a lot of work to be done. We are starting to take a look at some of the of the things that needs to get done Hopefully they will be completed in the Miteca cycle. So thank you. I think now it's time for questions. If you guys have any Yeah, thanks Nadir from zero stack. 
So I wanted to clarify: you said at the beginning that your work focuses on the hypervisor only, but the example we just saw supported it on the top-of-rack switch as well.

That's a Mellanox-specific implementation that involves the physical layer. But if you go to the reference, the API is generic, right? It's in Neutron; every plugin can implement it and extend it in the way that fits its specific implementation. In the reference implementation we only handle the virtual switches on the hypervisor.

Thanks. More questions? Yes? Thank you, you did great work. It's a good question. So the question was about policing and minimum bandwidth guarantee. This is something we are definitely going to look into in Mitaka. There are two aspects. One is the integration with the Nova scheduler, to place the VM in such a way that the hypervisor is not overcommitted versus the minimum bandwidth guarantee. The other is to do the policing versus the reference implementation with quality of service. Both things will hopefully be handled in the Mitaka cycle. At least there's one design session at this summit about integrating it with the Nova scheduler, so let's hope it goes okay. We'll see how it goes.

More questions? Yes? I just have a quick question, sorry, I work in relate. For the policy update example you gave, do you expect any change on the API side? In other words, in the M release, can we make an API call and attach a policy on a per-port basis?

Yes, if I understand the question correctly, you're asking if you can go and add quality of service to existing ports. That's the question? That is correct. So to my understanding, yes. You have to install and upgrade your code, of course, you have to enable quality of service, and then you need to create a policy and associate it with the existing port.

Is that something already available in the L release? Yes. This is Liberty.
You can use it, and open bugs. Okay, so sorry guys, we don't have time, but we do have a demo. The demo is available both on Miguel's blog (Miguel is the core team member who led some of this work) and at the Mellanox booth. So we won't have time to present it here, but go ahead and check it out. Also, the slides will be available with a link to Miguel's blog, so you can check it from there. Do we have time for more questions? I guess not. So thank you, guys.