Okay, hello everyone. I'm Gal Sagie from the Huawei research center. Presenting with me today are Eran Gampel, also from the Huawei research center, and Li Ma from AWCloud. We are going to present the Dragonflow project. Has anyone heard about the project before? Okay.

We won't be able to cover all the technicalities in 40 minutes, but our goal in this presentation is to show you that Dragonflow is definitely an interesting project, something that you should check out, and that you can come talk with us about. Dragonflow is part of the OpenStack big tent, so we have our own design summit sessions where you can come talk with us and get a little deeper into the project.

Dragonflow was created from feedback we received from running an OpenStack public cloud at large scale, and when I say scale I mean many hypervisors. Dragonflow's mission is to overcome some of the challenges we found in the reference implementation, in terms of data plane performance but, more importantly, in the scale of the control plane and how many hypervisors we can support.

One major objective of Dragonflow is to keep everything simple and lightweight in terms of the code base. We are doing everything as part of the OpenStack project and, as I mentioned, we are a big tent project. You will see throughout this presentation that in Dragonflow we focus on distributed networking services. We reuse a lot of other open source frameworks and components that are production grade, instead of trying to reinvent many of these things, and we'll see this as we go through the presentation.

So our Dragonflow environment looks like this: we have a local Dragonflow controller sitting on each of the compute nodes in our setup, and all of these controllers are synchronized through a logically centralized, distributed database.
This database holds policy-level information. There are two important points about this overview. First, the database, and I'll touch on this going forward, is pluggable, so we can use any key-value database framework out there with Dragonflow, and I'll mention in a few seconds why this is important. The other important point is that this database holds policy-level abstractions and information, and nothing else. That makes the data distributed to all the compute nodes relatively small, and it lets us have smart logic in the compute nodes that knows how to take this policy and translate it according to the hardware or the solution running at the edge, on the compute nodes. This makes it very easy for us to do smart integration with smart NICs and all kinds of hardware offload capabilities.

Going a little deeper into how Dragonflow looks: we have the pluggable database, and we already have support for several databases, like RAMCloud, etcd, Redis and ZooKeeper. The process of adding a database to work with Dragonflow is relatively easy; we found that making this layer pluggable is not too complicated. On the other side we have the Dragonflow applications, which are very flexible modules that take this policy and translate it into an OpenFlow pipeline that is installed in the local OVS, or any other switch that may be there. Another small thing to notice about the diagram is that in addition to the database we also have a pluggable pub/sub abstraction. We recognized, as we moved along in the project, that some databases don't have an efficient publish-subscribe mechanism and some don't have one at all, and we wanted to optimize these two things separately. So it's very flexible for the user: if your database supports something that notifies about changes you can use it, and if it doesn't you can use something else, and that helps achieve the scale we are looking for.

Some of the features we have done for Mitaka: layer 2 is implemented with all the common tunneling protocols. We have distributed layer 3 done only with OVS flows, so no agents and no namespaces. We have a distributed DHCP application, which I'll touch on in a second, the pluggable databases and the pluggable pub/sub mechanism. We already have a nice integration with OVS connection tracking support for security groups, so we removed the need for the Linux bridge, and we have a nice design that actually reduces the number of flows needed to implement security groups. And distributed DNAT: if you're familiar with the reference implementation, this means that if your compute node has a NIC to the external public network, your DNAT traffic doesn't need to traverse the network node.

So, the pluggable database framework. This is actually a very critical point in Dragonflow, and in our view a critical point in how you scale an environment. The first thing we did was say, okay, let's try to implement something of our own, something we can optimize for our use cases. We consulted the many people who work with us who are very familiar with databases, we'd even call them DB experts, and they said this is a very bad idea.
Implementing all of our requirements, the high availability, the redundancy, the scale that we need, while keeping it consistent enough, would take very long to implement and even longer to productize, and that was something we couldn't wait for. Then we said, okay, we can pick one solution; there are many open source alternatives out there. But we didn't want to lock ourselves to one solution over another. Each solution has its own characteristics and fits different environments, so we decided to take the pluggable path, and we found that it's rather simple. Instead of locking ourselves down (maybe in a few months there will be a new greatest database solution, so why lock ourselves to one?), we went and implemented this in a pluggable way.

In the first implementation, we distributed all the policy, all the data, to all the compute nodes. That is not the current implementation in Dragonflow; it was just the initial version. The current implementation in this release is something we call selective proactive. What we recognized is that in large environments, where you have tenant isolation, you don't really need to send all the information to all the compute nodes, so we only send the relevant information to each compute node. As a quick example of why this works: here we have two OpenStack networks, and these networks are totally isolated, so VMs on one network can't reach VMs on the other. In this setup we have two compute nodes, each with VMs from only one network. It's obvious from this example that compute node one only needs to get the topology information of network one; it doesn't need to get everything else, and this reduces the load of publishing and subscribing to changes in the environment.

Another nice thing is the pluggable pub/sub. If you look at a common OpenStack cloud environment, you have many Neutron API servers, usually in active-active mode, and they are all talking with all the local controllers. Each one can receive an API change and needs to transfer this change to the controllers; Li Ma will touch on this in a second.
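To make the pluggable pub/sub idea a bit more concrete, it can be thought of as a pair of tiny driver interfaces that the database driver knows nothing about. The sketch below is purely illustrative; the class and method names are assumptions, not the actual Dragonflow driver API.

```python
# Illustrative sketch of a pluggable pub/sub abstraction (class and method
# names are assumptions, not the real Dragonflow driver API).
import abc


class PublisherDriver(abc.ABC):
    """Sends change notifications from the Neutron/API-server side."""

    @abc.abstractmethod
    def publish(self, topic, event, payload):
        """Notify subscribers of `topic` that `event` happened with `payload`."""


class SubscriberDriver(abc.ABC):
    """Runs inside each local controller and listens for changes."""

    @abc.abstractmethod
    def register_topic(self, topic):
        """Start listening for updates on a topic (for example a tenant ID)."""

    @abc.abstractmethod
    def run(self, callback):
        """Block and invoke callback(topic, event, payload) for each update."""


class MessageBusPublisher(PublisherDriver):
    """One possible backend: push updates over a message bus such as ZeroMQ."""

    def __init__(self, send_func):
        self._send = send_func  # injected transport, kept abstract in this sketch

    def publish(self, topic, event, payload):
        self._send({'topic': topic, 'event': event, 'payload': payload})
```

The storage driver and the notification driver are chosen independently, so a deployment could pair, say, etcd for persistence with a message-bus-based publisher for notifications.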
It is very important to keep all of these consistent and to send this information in a reliable way, and we did that. The important thing to note is that we abstracted this from the database, because the database requires certain characteristics but publish-subscribe sometimes requires others, so we wanted to be able to optimize them separately.

Let me touch on a small example of a networking service that we implemented in Dragonflow: distributed DHCP. If you're familiar with the reference implementation, it adds a DHCP namespace on the network node for each network. That means if you have 10 tenants, each with 100 networks, you have 1,000 of these namespaces on the network node, and all the DHCP traffic traverses to those network nodes, and that's before talking about high availability and redundancy for the namespaces. In Dragonflow we decided to take a different approach. We have all of the DHCP information in the database, we can add it to the policy and essentially send it to all the local compute nodes, and we have a DHCP application that answers DHCP offers and ACKs locally. If you look at the implementation, it's relatively easy, and it's very easy to implement networking services like that.

The point I wanted to make with this slide (and this is just one example) is that we are building a very nice infrastructure in Dragonflow: you have the database, you have the mechanism that knows how to distribute the relevant information to all the compute nodes, and you get the high availability of running another controller. When you're writing your networking services, this makes life very easy: you don't need to worry about high availability, how information is dispatched to all the compute nodes, or how they stay in sync with each other. You have this infrastructure, and we are stabilizing it to speed up the development of distributed networking services.
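As a rough illustration of why such a service becomes relatively easy on top of this infrastructure, here is a heavily simplified sketch of a local DHCP responder. It assumes the relevant port data has already been synced into a local cache; all names are hypothetical, and no real DHCP packet handling or flow programming is shown.

```python
# Heavily simplified sketch of a local DHCP responder (illustrative only; the
# real application parses and builds actual DHCP packets and programs OVS flows).

class LocalDhcpApp:
    def __init__(self, port_cache):
        # port_cache: the slice of policy synced to this compute node, e.g.
        # {port_mac: {'ip': ..., 'gateway': ..., 'dns': ...}}
        self.port_cache = port_cache

    def handle_dhcp(self, port_mac, msg_type):
        port = self.port_cache.get(port_mac)
        if port is None:
            return None                      # unknown port, ignore the request
        if msg_type == 'DISCOVER':
            return self._reply('OFFER', port)
        if msg_type == 'REQUEST':
            return self._reply('ACK', port)
        return None

    def _reply(self, reply_type, port):
        return {'type': reply_type,
                'your_ip': port['ip'],
                'router': port['gateway'],
                'dns': port['dns']}
```

Because the DHCP data already lives in the distributed database and is pushed to every compute node, there is no namespace, no agent and no extra hop through a network node.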
I'll hand over now to Li Ma, who is going to show a user's point of view on Dragonflow.

Hello, I am Li Ma and I work for AWCloud. AWCloud is a pure OpenStack player in China, and we have been helping enterprise customers design, build and operate OpenStack clouds since 2012. Currently we have already built several large clouds in China, scaling from 500 to more than 2,000 physical nodes. Here is an example of a typical large-scale deployment for us. It is a public cloud for local enterprises in China, and we cooperated with Dell, Intel and other partners on this deployment. The data centers are located in Guizhou province, which is a hub of the big data industry in China. We have more than 2,500 physical servers deployed in the data center, and so far 500 physical servers have been virtualized by our OpenStack distribution, so it is a large-scale cloud. Here are several pictures taken in and around the data centers.

So we have several large-scale deployments, and in order to run these large OpenStack clouds successfully, our requirement is that each component in the OpenStack installation is scalable and reliable. That is very important for us, especially for the networking part. Currently we use the Neutron OVS plug-in, but as workloads increase we discovered some scalability limitations in our deployments.

The first is messaging. I already shared something about this at the Vancouver Summit, where I presented a distributed messaging system for OpenStack; if any of you are interested in distributed messaging, you can find the video on YouTube. The other issue is a persistent, highly available database. The database layer is critical for almost all the OpenStack components: if your database layer cannot scale out as your cloud grows, you may run into major issues with the rest of your OpenStack deployment. In my view, if any component of your OpenStack installation cannot scale out, the whole system cannot scale out.

For persistent storage, OpenStack currently uses relational database management systems heavily, but relational semantics are too strong for cloud-scale applications; we discovered critical performance loss due to those semantics. According to our experience, we believe a centralized database cluster cannot practically scale out in the data center, so we need a distributed data storage system for OpenStack: one that is optimized for reads, can reach a consistent state for the whole system, is always highly available, and is able to work properly under network partition. These are our requirements for the data storage for OpenStack.

There are two kinds of data storage systems: the ACID systems, which have strong semantics, for example relational databases, and the BASE systems, which are mostly the NoSQL servers, the key-value stores. According to our experience we prefer BASE systems for the data back end. BASE basically means basically available, soft state, eventually consistent, and of course there are many options for this kind of data store. As described before, in Dragonflow we implemented a pluggable key-value interface layer which can plug in almost any key-value data store; for example, we currently support etcd, RAMCloud, ZooKeeper and Redis. So first of all, Dragonflow has a scalable persistent storage which doesn't rely on a relational database.

Is that enough? The persistent storage is scalable and reliable because we take advantage of these NoSQL solutions rather than using a relational database, but the answer is no. In our production systems we also discovered a problem common to almost all third-party SDN solutions that need to integrate with OpenStack Neutron, and that is a database consistency problem. In OpenStack, the Neutron server uses a relational database to store all the network policies, while the SDN solutions all have their own data stores (OpenContrail, OpenDaylight, MidoNet, and also Dragonflow), and these SDN solutions all use key-value stores. So here's the problem.
We have two distinct types of data stores: the relational database for the Neutron server, and the key-value store for the SDN controller. How do we make sure the two data stores are always consistent? This is a huge challenge. The Neutron database is a relational database with strong semantics, and it stores the whole virtualized network topology for OpenStack. On the other hand, the Dragonflow database is a key-value store; it is a distributed database, and it stores the part of the topology used by Dragonflow.

Here are some problems that we discovered in production. Problem one: a Neutron database transaction is committed, but the related operations on the distributed database have failed; clearly the two databases are then not consistent. Problem two: when we concurrently run multiple Neutron API calls on a given Neutron object, the Neutron database can deal with it very well thanks to its relational nature, but what about the key-value store? Problem three: as you all know, Neutron uses nested transactions heavily, so how does the Dragonflow database translate these relational nested transactions into key-value semantics? There are many problems because the two databases are so different.

Here are some options that can help enhance the consistency of the two databases. The first one is to just use one database and remove the Neutron database. That is actually a very complicated solution, especially where ML2 is involved, and it cannot be done in a short period of time. The next one is to introduce the key-value store into Neutron itself, but there are problems there too, because Neutron is coded in a relational manner, and working out how to do this correctly still needs much evaluation and deep discussion in the community. So is there any other simple and straightforward solution that can keep the two databases consistent?

In Dragonflow we introduced a distributed lock for coordinating the two types of transactions. It guarantees the atomicity of a given API call, and it is implemented in the Neutron core plugin layer. It is a project-based lock, so it still allows API concurrency to a certain degree. In practice, when a user runs a Neutron API call, we lock the API session, then we do the Neutron transaction, then we do the key-value operations, and then we release the lock. It is a simple distributed lock that helps us coordinate the two types of transactions, and the implementation detail is much like a two-phase commit.
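A minimal sketch of the project-scoped lock just described might look like the following. This is illustrative only, under the assumption of hypothetical lock and database helper methods; it is not the actual Dragonflow plugin code.

```python
# Rough sketch of the project-scoped lock coordinating the two data stores
# (illustrative only; helper method names are assumptions, not plugin code).
from contextlib import contextmanager


@contextmanager
def project_lock(lock_backend, project_id):
    lock_backend.acquire(project_id)   # lock is per project, so API calls for
    try:                               # different projects still run concurrently
        yield
    finally:
        lock_backend.release(project_id)


def create_network(lock_backend, neutron_db, df_db, project_id, network):
    with project_lock(lock_backend, project_id):
        neutron_db.begin()
        try:
            neutron_db.insert_network(network)                    # relational write
            df_db.create_key('network', network['id'], network)   # key-value write
            neutron_db.commit()
        except Exception:
            neutron_db.rollback()
            raise
```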
In the next stage, we plan to introduce an object synchronization mechanism in which all the objects stored in both databases are versioned. Currently neither Neutron nor Dragonflow has versioned objects, so we will introduce versioned objects for both databases and take advantage of the compare-and-swap operations of the key-value stores, which help update the version safely. Finally, when something unexpected happens, for example the object in the Neutron database is inconsistent with the Dragonflow database, we synchronize the object according to its version: when we read or write an object in the Neutron database, we read the Neutron object with its version, then we do a compare-and-swap according to that version when we write to the SDN database, and finally we notify the Dragonflow local controllers to flush the flows according to the given update. With this kind of mechanism we think we can guarantee that the two distinct databases are always consistent.

Now we want to discuss the roadmap that we have for the project and leave some time for questions. Before I go into the roadmap, I want to discuss the challenges that led us to establish this project and what we are facing in the next releases. One is scalability: currently, with the reference implementation, according to AWCloud's production experience, we can reach around 500 compute nodes, but that is pushing the limits and needs a lot of tweaks; if you have a massive deployment above that, it won't work. Mainly it's the message queue, but there are other limitations too. The second is performance: the performance of the data path is relatively low because there is a lot of extra software stack, the virtual routers and namespaces, going through DHCP, and so on. The last one is operability. We took care of part of the operability, meaning that we don't need to manage a lot of namespaces and DHCP agents. For this release we did distributed DNAT, so we took out that part of managing namespaces, but we still use centralized SNAT, so one of the features on the roadmap is to take out the namespace for the virtual router altogether and implement distributed SNAT as well.

For scalability, here is what we are planning and what our roadmap looks like. Currently, with the reference implementation, we are around 500 nodes. We are now testing the Mitaka release, going into bigger and bigger simulated environments; we have already reached 2,000, and we hope to release the numbers in the next month or two, so we are almost there for the 2,000 figure. In the testing we are doing currently, we are focused on Redis as the database and Redis as the pub/sub mechanism. For the next release, our scalability target is to reach 4,000 compute nodes in one pod, meaning one OpenStack region. We think it will be with one of the databases, we are not sure which; it could be Redis with ZeroMQ, since ZeroMQ for the pub/sub is showing a lot of advantages in terms of scale. We also introduced the selective proactive distribution that Gal described, which allows us to scale even further. Currently we do it by tenant: if a compute node has VMs from only one tenant, it will get the objects and the topology of only that tenant (a minimal sketch of this idea follows).
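As a purely illustrative sketch of what this tenant-based filtering can look like on the local controller side (the subscriber and northbound API objects are assumed, duck-typed stand-ins, not the real Dragonflow interfaces):

```python
# Illustrative sketch of selective proactive distribution on the local
# controller: subscribe only to the tenants that actually have VMs on this
# host (the subscriber and northbound API objects are assumed stand-ins).

def tenant_topics(local_ports):
    """local_ports: iterable of {'tenant_id': ...} for ports bound to this host."""
    return {port['tenant_id'] for port in local_ports}


def selective_sync(subscriber, nb_api, local_ports, local_cache):
    for tenant in tenant_topics(local_ports):
        subscriber.register_topic(tenant)                  # future updates
        for obj in nb_api.get_objects(tenant_id=tenant):   # initial sync
            local_cache[obj['id']] = obj
    return local_cache
```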
Of course, we also have to exclude public objects, like public networks shared between tenants, which need to be distributed to everyone. In upcoming releases we are open to going to a lower level, because as Gal showed, even within one tenant you can have two isolated networks, and distributing per network between the compute nodes. For N+2, as we call it, not the Newton release but the one after, we want to be able to reach around 10,000 nodes. We think one major thing that will allow us to do that is moving to a lazy mode, meaning a reactive model: we won't distribute all the data to all the nodes but will pull the data whenever we need it. Of course, we will pay some latency for that. So that's our roadmap; the project is very ambitious, we know, but I can say that where we stand on the 2,000 number, we feel very good.

As for the rest of the roadmap: for Mitaka we have the additional DB drivers, ZooKeeper and Redis, the selective data distribution that Gal covered and that I talked about a little, and the pluggable pub/sub mechanism. The pluggable pub/sub mechanism helps with databases that, like RAMCloud for instance, don't have a notify option, so you can use the pub/sub to notify. As Gal said, if the database supports it you can use it, but what we found is that the databases are not optimized for our exact use case; if we have an applicative pub/sub that sends the events, it works much better, with lower latency and much more scale. Distributed DNAT is in, with no namespaces; it's currently implemented only with flows. And we introduced security groups with a nice mechanism, which Gal has a blog post about, that reduces the number of flows, so we don't have one flow per connected port, we have fewer than that.

What are our plans for the next release? Hierarchical port binding, which I will touch on in a minute; we got a lot of requests for that, mainly for offloading tunneling to the top-of-rack switch. Containers, which I'll touch on in the next slide as well. We will support service chaining, SFC: we are planning to support SFC as the base, but we want to add another type that we call topology service injection, and I will touch on that a little further on. Also inter-cloud connectivity, and an L2 gateway, which is like a software emulation using some of the code that is in OVS, but allows us to implement the L2 gateway switch in software and optimize scale and performance.

First of all, about ML2: we are currently a core plugin in Neutron, and we are switching to ML2 mainly to support hierarchical port binding. The main use case is supporting VLAN up to the top-of-rack switch and offloading the VXLAN tunneling to the ToR. For this we will have to support another feature, VLAN segmentation, which we don't have today, so VLAN will be supported as well for the Newton release. This will also allow another controller, ODL or any other SDN controller (this is just an example), to control the hardware overlay and the underlay, while in Dragonflow we want to focus on the virtual overlay, and only on the virtual overlay, and on network service distribution.

Another feature that we get a lot of requests for, and that we will support for the Newton release, is container support. We will support nested containers inside the VM. We don't have the spec yet; it may be with an OVS inside the VM, and we have an option without it, where we want to use ipvlan, or maybe a nested OVS, so we have a lot of options.
We want to use the IP villain Maybe we have a lot of option or with a nested the OBS If someone is interested you can come to the To the IRC meeting and we are now designing it. So it's in design the Container and we will support of course the career integration and Support the career driver in in dragon flow so a networking service chain this is we from the Work session that we did Today we understand this is a very very important Feature so for sure will support SFC as it is We understand that it's something that can drive more adoption to to dragon flow and it's a needed functionality But we other than supporting SFC We want to introduce a new a new type of service and we call it topology based service injection and allowing you to bring an SDN application that can be centralized SDN application and can take Control of part of the user topology and get of course in a secure way We will take care that it's Doesn't override the other tenants and it doesn't interact with other application and we would do the abstraction that it will this SDN application cannot touch and Interfere with other things, but we wanted to be something like Surface injection look you say I want my SDN application To be in the router. So it's something similar like IP table as a the post route prayer route Interaction so you can do it over the user topology and because we have all as Mention earlier we have all the user topology Inside each compute node we think it's doable and we didn't define yet the API how the user will configure it and How it will do it, but we have a lot of use case that are not A feasible with the regular SFC and will be feasible with the this service injection looks Other other application that we want to support and this is Showing again, how you can push the smartness to the edge one of them is IGMP application in I I don't know all the Implementation of the network in the cloud But some of them what they do when they want to give a VM multicast and if you want to want to send the multicast It's translated on the overlay to broadcast and it being broadcast to all the compute nodes or all the hypervisor The IGMP application is already in review. We have the spec. We have the code. We didn't merge it yet so what it basically will do they the Dragonflow controller will answer to IGMP join and IGMP live right to the distributed database which address is listening and then when someone wants to send on a Multicast address, he will send it only to the compute node that SVM that listen on this address. This is for IPTV or other use cases mostly telco We want to support distributed load balance there, but not north south east-west Brute force prevention is already in review First of all, we have it for DCP. So the VM cannot compromise the local DCP agent So we have like great limiting on the amount of DCP request that the VM can do in a minute or in a second DNS service it's something that we didn't started yet, but we have to start Currently the local DCP just provide the DNS address But we think that DNS service and the distributed load balancer the east-west go really together But we didn't finalize it yet Another interesting application that we are currently developing is the distributed metadata proxy I don't know if you are all familiar but currently in the network node You have a distributed the metadata proxy that all it does is Adding an HTTP adder and then go into the Nova and metadata server. 
Another interesting application that we are currently developing is a distributed metadata proxy. I don't know if you are all familiar with it, but currently on the network node you have a metadata proxy, and all it does is add an HTTP header and then forward the request to the Nova metadata server. We are trying to eliminate this service altogether and have it as another distributed service inside Dragonflow. We also want port fault detection: we have some ideas on how to implement port fault detection via flows and to have an application for that.

The documentation for Dragonflow is available in our wiki, we have a Launchpad with bugs and blueprints, and we have a weekly IRC meeting. From the work sessions we got the feedback that we need another meeting time that is good for the North American time zone, so we will probably alternate, one time that is good for China and another that is good for the US, and we will update that soon. We are always available on the #openstack-dragonflow IRC channel. We have a Vagrant deployment tool that deploys multiple compute nodes and one controller, so it's fairly easy to test a deployment and start playing with it.

Another thing that we try to do, and I think at this point it's still true and we are making a lot of effort to keep it that way, is to keep it simple. If you go into the Dragonflow code you will see it's very simple; it's not a lot of code, and it's all written in Python, so there is the performance limitation of that, but it's very easy to get involved and start working on Dragonflow. The layers are well abstracted, so if you just want to develop an application, even a new application, you can use all the infrastructure and not worry about DB consistency, high availability, anything; that has been taken care of by the infrastructure.

We have the work sessions; we had two of them already, we have two more tomorrow, and on Friday we have the regular development working session. We see more and more people in the work sessions, and of course everyone is invited. In tomorrow's session we are going to talk about the next phase of publish-subscribe: currently only the Neutron servers are publishers, and the next step is that we need publishers from the compute nodes as well, so multiple publishers and multiple subscribers, including the compute nodes themselves. In the other one we are going to talk about SFC and service chaining. And now, if someone has questions, we will be happy to answer.

So I have a couple of questions, but one is about the NAT stuff. Have you guys thought about implementing some kind of scalable, stateless NAT at the edge layer? Stateless NAT, so you could basically have multiple NAT nodes and just load balance between them using layer 3 ECMP.

So you mean for SNAT, for the distributed SNAT?

Well, mostly for floating IPs, for DNAT, but for that too, I suppose.

DNAT we implemented already. So you are saying that not all the compute nodes would have it, that there would be a NAT layer at the edge, separate from the compute nodes, so you do NAT at the edge of your cloud? It's actually possible, and it's something that we thought about, but currently you can choose: if the compute node has an interface on the public network, you can offload the floating IP directly there. It's actually not so far off; the approach you are describing is something that we may well do, and something like this definitely makes sense. For SNAT it is different.
It's much more complicated, because it's stateful NAT, and we don't see ourselves distributing it to all the nodes, but it's an interesting use case and we should think about it, for DNAT as well. Currently DNAT is basically only about offloading at the local compute node.

So for the distributed DNAT, don't you have to have some kind of shared VLAN across all racks then, a shared layer 2 basically across all racks, to bring that external network down to the compute nodes?

Right, you need public network connectivity.

Yes, but it has to be a shared layer 2, right? A floating IP is attached to one instance, one VM, reached via layer 2 broadcast.

Yes, we need that. And it makes sense, what you suggested; it could be our next step, and we need to think about it. Currently we developed it only for the local compute node, but it makes sense to have edge devices that serve multiple compute nodes, maybe per rack.

Okay. And the last question: is it production ready as it stands today?

It's very close to it. What we are doing now is testing; we haven't released the results yet, but we are doing data path tests that are showing really good results, and control plane tests, and it's in the process of being pushed to a testing pod in our company's public cloud. AWCloud is also looking at pushing it, so it is on the verge of production ready, and it's currently being tested for production.

Hi, it appears to me that you have two layers of pluggable database, one at the controller and the other at the compute nodes. Is that correct?

No. What we have is basically the database: you have the database solution, and then you have clients. You have clients on all your compute nodes, for the controllers, reading it, and you have the same client on the Neutron server, writing.

Okay. So now, on the control side, I understand the rationale of using a NoSQL or key-value store. Do you customize the hashing mechanism to somehow more intelligently map, you know, the tenant network state to which shard that state goes to, or do you just blindly do consistent hashing?

This is something we discussed. We are planning to do this based on the topology, so the selective proactive distribution is basically how we would shard it, according to tenants or according to the topology, but it's not there yet.

Are you asking about consistency?

No, I'm asking about, you know, eventually when you scale out, you will determine which state goes to which shard or which node. You can do it blindly, which may have implications, or you can do it in a more topology-aware way, which can give you more scale-out capabilities.

Yeah. So currently we do it by the project, the tenant ID, but we hope to drill down more. Thanks. Are there any other questions?

Thanks for asking that question; I had the same one about the pluggable DBs, whether you had the database on the compute nodes as well, so thanks for asking. So there is only one database, on the controller side, right? The other question I had was: what is the cost of your distributed lock? Isn't it limiting your scalability?
Okay, I'll let Li Ma answer that, since he implemented it.

Yes, the distributed lock is just the first phase. In the next release we plan to use the versioned objects to get rid of the lock problem. Currently we use the distributed lock because it is easy to implement and it helps guarantee the API atomicity, to make sure the same data is committed to the two databases, but it is just the first phase.

And also, I think in terms of scale the main issue is how you distribute this information to all the local controllers; it's less about the Neutron API writing to the database. Most of the load from OpenStack to the database is reads, usually from Horizon or from Nova, so we haven't seen any bottleneck on the API side yet. And we use the database itself as the mechanism for the distributed lock, the SQL database, yes.

Okay, thank you. Can I just ask another question? I assume that in the controller database you are only holding the abstract topology information, right? You are not holding, like in OVN, pretty much a mapping of the OpenFlow configuration?

Right now we are holding the topology, as you say, but in later phases we see a lot of applications that could use this to distribute other information.

So my question is how big the data is that we are talking about. This gentleman was talking about 2,000 nodes; how large does that translate into? Because it's not so large, right? So what's the need to scale out that database layer? It would be fine if I just ran MySQL with multiple replicas.

Because of the number of clients. Even some of the NoSQL databases that we tested had problems scaling out to hundreds and thousands of nodes; only some of them know how to handle this really well. When you want to have that number of clients, you need something that is not centralized. And it's not only writes; it is mostly read transactions.

Yes, but if it's reads, you can scale out the reads of MySQL, right, by adding replicas?

From thousands of nodes it's almost impossible. And it's not only reads; it will be writes as well, because we will have to add the port status. Currently we don't put port status, multicast information and other information from the compute nodes; we just write the chassis ID once, when the hypervisor comes up. But later we will have to do a lot of writes from the compute nodes themselves. And even for reads we found it very hard; you need some kind of cache layer, which these NoSQL stores provide.

Thanks. Okay, thank you very much.