Okay, hi, good afternoon. My name is Travis Broughton. I'm with Intel in the Open Source Technology Center, and today I'm here with Jamal and Melvin to talk about some work that PLUMgrid did in the OpenStack Innovation Center. How many of you are familiar with the OpenStack Innovation Center? Quick show of hands. Anyone? Okay, so for the rest of you: OSIC is a collaboration between Intel and Rackspace under Intel's Cloud for All initiative, and our goal is to improve scalability, ease of use, ease of operations, and deployment at scale for enterprise-class usages within the OpenStack community. We follow a path in that pursuit. First, we want to grow the number of OpenStack contributors; more people contributing to OpenStack will hopefully increase the number of contributions upstream; and finally, we want to use that to enable OpenStack innovation at a larger and increasing scale. In the first area, growing the number of OpenStack contributors, between Intel and Rackspace we've had over 200 individuals come through and get trained, and that adds up to a lot of time spent growing the community and onboarding people to become OpenStack upstream contributors. Then there are the direct contributions: those people are committing code, contributing reviews, and contributing blueprints. We focus on six main areas, again geared towards enterprise usages and scalability: manageability, reliability and resilience, scalability, high availability, security and compliance, and simplicity. We've contributed to over 25 OpenStack projects, done 115 blueprints, over 40,000 code reviews, and almost 30,000 patch sets, and that's in the past year that the OpenStack Innovation Center has been in existence.
We announced this in Tokyo, we gave an update in Austin, and this continues that momentum. The last piece is that even though we've brought on board a number of contributors, and we have people working for both of our companies who are contributing upstream, we know there are a lot of folks out there who have great ideas and are great developers but may have a harder time scrounging up testing capacity compared to a large-scale hosting provider or somebody who manufactures CPUs. For that, we've launched the OSIC cloud. We've got a thousand nodes in Dallas that are managed by Rackspace, and another thousand nodes coming online in California, available to the community for a number of different usages. Over the past year of the cluster's existence, we've had over 60 community projects come through. People get a reservation for three weeks and are able to use between 50 and 200 nodes, carved up either as a bare-metal reservation where they can stand up their own stack, or from a cluster that is running an OpenStack cloud available for all sorts of users to access. You saw this morning in the keynote the OpenStack hackathon in Guadalajara; the hackathon participants used the OSIC cluster. You also saw that the infrastructure team is running a number of integration tests in the OSIC cluster. Today I'm happy to have one of these 63 cluster projects, one of the companies that has made upstream use of the cluster, here with PLUMgrid. One of the requirements of using the cluster is that you share your results with the community, and what better way to share the results than at the OpenStack Summit. All right. Thanks, Travis. So today I'm going to share the tests that PLUMgrid did on the OSIC cluster.
First I'm going to talk about the infrastructure we had and the different versions of OpenStack we installed, and then talk about some of the results we got. At the end, I'm going to share some of the challenges we had during the overall installation and what we learned from the tests. So let's start with the overall infrastructure. We were provided with a 131-physical-node cluster at one of OSIC's data centers, and the total time we had the cluster was close to three weeks. It was back in April that we got our first cluster access, and the installation we did was an OpenStack-Ansible installation of the Liberty version. Along with Liberty we installed the PLUMgrid Open Networking Suite (ONS), which is an SDN platform for OpenStack clouds; I'll talk briefly about what it is, because I'm going to refer to it later in the talk. We also installed PLUMgrid CloudApex, which is our cloud visualization and monitoring tool. At a high level, since this was the first time we had the cluster, the overall testing we performed was about how the different components within OpenStack, not just those related to SDN, interact at this scale. I'll also share at the end the different tests we did and their results, but initially our main goal was to figure out and test how the different components interact with each other at this node count.
Then we got the same 131-node physical cluster again in October, for the three weeks immediately before the summit, and this time we had a different goal. On the installation side we did a Mitaka-based installation, so we used the latest OpenStack version, and again on the PLUMgrid side we used PLUMgrid Open Networking Suite version 6, which is our latest release, along with the latest CloudApex release. On the results side, last time we kept things within certain limits and tested a broad spectrum of scenarios; in this latest three weeks of testing we tried to max out the number of virtual machines and then test the system at that number, and I'll share those tests at the end as well. Moving forward, a quick overview of PLUMgrid ONS. PLUMgrid provides an SDN-based networking solution for your OpenStack and container clouds. The PLUMgrid Open Networking Suite is integrated with OpenStack: we have a Neutron plugin, and what we do is replace the OVS plugin and install the PLUMgrid plugin in its place. You use the same Neutron API layer, but the backend implementation is done by PLUMgrid ONS. Alongside ONS is CloudApex, our monitoring and operational tool, which integrates with PLUMgrid ONS and gets all its information from ONS. One of the key differentiators is that we don't use OVS as our data plane component; we replace OVS with the IOVisor component. IOVisor is an open source Linux-based community project, and from a very high-level perspective it's an extendable, programmable data plane that allows us to install dynamic network functions into a running system.
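To make the plugin swap concrete: in a stock Neutron deployment the core plugin is set in neutron.conf, and replacing the default OVS-based backend with an SDN backend is essentially a one-line change there. The sketch below is illustrative; the exact class path comes from the networking-plumgrid project and should be verified against the release actually installed.

```ini
# /etc/neutron/neutron.conf (sketch; the plugin class path is illustrative)
[DEFAULT]
# Default ML2/OVS setup being replaced:
#   core_plugin = ml2
# PLUMgrid backend, as packaged by the networking-plumgrid project
# (verify the path against your installed release):
core_plugin = networking_plumgrid.neutron.plugins.plugin.NeutronPluginPLUMgridV2
```

Because only the backend changes, tenants keep using the same Neutron API calls they would against OVS.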
So with IOVisor we can provide functionality around security, segmentation, and policies, all in a very distributed manner. ONS is a complete software-only solution: the IOVisor component gets installed in your compute hypervisors. We use a VXLAN-based overlay mechanism, and on top of that overlay we expose the concept of virtual domains. Even in the previous slide you saw some comments around tenants, and I'd written "virtual domains" beside them as well. In a nutshell, from an OpenStack perspective, each separate tenant in OpenStack is a separate virtual domain within PLUMgrid. With this, the virtual domain becomes a virtual data center for each tenant: within that virtual domain you can have completely heterogeneous topologies and create different kinds of network functions, either those natively available within the platform or third-party functions, irrespective of what any other tenant is doing in its own virtual domain. That isolation and segmentation is key, and again it's fully distributed, so scale and flexibility come with it. Very quickly, since we're going to talk about the infrastructure, here is how PLUMgrid ONS gets installed within the OpenStack infrastructure. From an overall overview point of view,
we have three basic components. There's the control plane architecture, which is known as the PLUMgrid Directors; the Directors are the control plane for us, and they manage the resources underneath. Then there's the data plane component: as I said before, IOVisor is our data plane. The IOVisor edge component gets installed inside the kernel of your compute nodes; there is no user-space agent or anything like that, just a kernel module installed inside the kernel of your compute nodes that provides all the networking and policy functionality right there in the kernel. And then, since we run in an overlay model, we have a gateway, which is basically the same IOVisor kernel module installed on any bare-metal x86 server; it's a transition from the VXLAN overlay to a VLAN model, so since we run in overlay, the gateway encapsulates and decapsulates the traffic for us. And the fourth piece is CloudApex.
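As an aside on the VXLAN overlay just described: a VXLAN packet carries the tenant's frame behind an extra outer Ethernet/IP/UDP header plus an 8-byte VXLAN header holding a 24-bit VNI (RFC 7348), which is what keeps per-tenant traffic isolated on the wire, and the gateway's job is essentially to strip that encapsulation and map traffic onto a VLAN. Here is a toy sketch of just the VXLAN header math, not PLUMgrid code:

```python
import struct

VXLAN_FLAGS = 0x08          # "I" flag: the VNI field is valid (RFC 7348)
OVERHEAD = 14 + 20 + 8 + 8  # outer Ethernet + IPv4 + UDP + VXLAN headers

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    assert 0 <= vni < 2 ** 24
    # Word 1: flags in the top byte; word 2: VNI in the top 24 bits.
    return struct.pack("!II", VXLAN_FLAGS << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Recover the VNI, as a decapsulating gateway would."""
    _, word2 = struct.unpack("!II", header)
    return word2 >> 8
```

Each virtual domain's traffic rides under its own VNI, and the 50 bytes of encapsulation overhead are why overlay deployments typically raise the underlay MTU.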
CloudApex is our monitoring tool. Since we're running inside the kernel via IOVisor anyway, we have all that information readily available to us, and based on that information we built this cloud visualization tool. It's a product that came out earlier this year and is now going through its second release, but we've seen a lot of traction with our current customer base and our prospects, since it addresses a major area for us. Even for this cluster, you'll see a couple of images here that show how easy it becomes, from an operational perspective, to look around and get a feel for things. Moving on to our testbed setup: as I said before, we had 131 physical nodes, and we distributed them in a highly available architecture. We had three OpenStack controller nodes that were also hosting the PLUMgrid control plane in LXC containers. There was an infra host that was basically hosting a package repository for us, and since we were installing with OpenStack-Ansible (OSA), there is a deploy node you need to set up just to do the deployment, so that was our infra node. For compute, we had 121 compute nodes, each a normal compute node with the PLUMgrid IOVisor edge component installed inside its kernel. And for external traffic, traffic from within the cloud going out towards the legacy world or the internet, you have the gateway servers, four gateway servers.
Again, these are just bare-metal x86 servers with an OS, with the PLUMgrid IOVisor gateway component installed inside the kernel, basically just a transition from the overlay VXLAN to a VLAN model. Okay, moving forward to the installation. I've just kept some images here; one point is that all the testing we did, including the installation and what we learned, is included in the reports that we have published. One report, from the earlier round of testing, is already published, and the second one, which we just completed a few weeks ago, is going to be published very soon as well, so all the installation details will be part of that. I've kept a couple of images from that report: on the left-hand side is an overview of the OpenStack dashboard; on the right is the PLUMgrid console. The PLUMgrid console is a configuration manager for us from an SDN and networking perspective, and what you see in the image is an external network topology. Within the PLUMgrid SDN, the provider network is a separate virtual domain for us, which we call a service virtual domain, and it can then be shared across multiple tenants. It becomes a way to have an external network shared across multiple tenants, where you can also apply some common policies across your tenants, still in a distributed manner: there is no hairpinning of traffic, it's still distributed inside the kernel, but from a security and policy perspective you have a global place where you can apply those policies. So this image just depicts a bunch of different networks connecting to the outside world. The next image is from CloudApex, which again is our visualization tool.
What you see in the picture is a complete overview of your infrastructure. I'm sure you probably can't read the text at the top, but on the left-hand side, the big blocks of color you see at the top are the tenants, or virtual domains, just like tenants from an OpenStack perspective, and at the top you can see that the total number of tenants is 144. That's the number we tested with, along with the total number of workloads in the cloud. The bigger boxes at the top show your tenants, and the smaller square boxes are your workloads; those workloads can be VMs or containers, whichever you're deploying in your cloud. Right now it's showing around 4,000 virtual machines, and containers are zero, since this was just a virtual-machine test. On the bottom left is the physical setup, with all the physical servers laid out in racks; we run LLDP, so we know the exact rack layout as well. On the right-hand side is the details panel. The fancy colors you're seeing reflect some metric testing our engineers did, like the current bandwidth and traffic analysis of each VM across the different tenants; the red and yellow colors show that some VMs have higher traffic and some have less. From an operational point of view this helps a lot, since from one interactive dashboard you can get to know all the details of your cluster. Here are a couple more views of CloudApex. At the top left is the security view. Again, since you have different tenants, from an admin perspective
it can get really difficult to figure out, even when there are only basic issues during your testing, what the problem might be. You have multiple different layers within your cloud, so it could be a physical problem, a virtual problem, a security-group problem, or a problem in some logical construct, and this view helps us get into those details. This is our dynamic security view: each tenant gets its own dynamic security view, and at the top left you see a lot of red lines, showing that certain security groups within that tenant have rejected flows, while the white and gray lines show accepted flows. Again, from an operational point of view, this was tested in the sense that you can quickly figure out whether there are rejected flows that you need to change, and vice versa. Moving on to the testing part: we ran different test suites during the time allotted to us. One of the first things we ran, right after the installation, was some functional API validation testing. We used the Tempest suite for the API validation tests, and we ran a bunch of tests covering the different components within OpenStack. I've kept an image here of all the tests we did; it's just an image from the report, and all the details about the exact tests and their results are completely documented inside the report that's already published. From a summary perspective, we didn't find any deviation from the usual behavior that OpenStack should exhibit, so from a solution perspective the good news was that every component of the installation was working fine. For scale purposes, we used Heat templates. We created a template for each tenant, and for the external network
as well. Each template had a bunch of different network function constructs for that particular tenant. We created a typical external network topology consisting of a couple of layer-2 constructs, which we call bridges, then the layer-3 constructs, the virtual router and the dynamic router, and then NAT as well. All of these functions are natively available as part of the Open Networking Suite and, again, are completely distributed, being done by IOVisor inside the kernel, so there is no centralized node in the path, not even the control plane; we actually exercise this in the HA tests coming up next. So we created 140 tenants, each with that particular external network topology, and 4,000 VMs as part of the overall testing. One question that is important for us, which we go through even in our internal testing, is this: it's one thing to reach a certain number, whatever is possible in the environment you're given, but at that number, is the system still highly available? Is it still an HA architecture? We perform tests like that internally with PLUMgrid ONS and CloudApex.
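The per-tenant Heat template mentioned a moment ago might look roughly like the sketch below, using standard Heat resource types. The names, counts, and properties are illustrative, not PLUMgrid's actual test template; their bridge, dynamic-router, and NAT constructs sit behind the normal Neutron API, so plain Neutron resources are what a template would reference.

```yaml
heat_template_version: 2015-10-15
description: Illustrative per-tenant topology (not the actual OSIC test template)

resources:
  tenant_net:
    type: OS::Neutron::Net
  tenant_subnet:
    type: OS::Neutron::Subnet
    properties:
      network: { get_resource: tenant_net }
      cidr: 10.10.0.0/24
  tenant_router:
    type: OS::Neutron::Router
    properties:
      external_gateway_info: { network: ext-net }   # shared provider network
  router_iface:
    type: OS::Neutron::RouterInterface
    properties:
      router: { get_resource: tenant_router }
      subnet: { get_resource: tenant_subnet }
  vms:
    type: OS::Heat::ResourceGroup
    properties:
      count: 28   # ~140 tenants x ~28 VMs approaches the 4,000-VM total
      resource_def:
        type: OS::Nova::Server
        properties:
          image: cirros
          flavor: m1.tiny
          networks: [{ network: { get_resource: tenant_net } }]
```

Stamping one such stack out per tenant is what lets a test like this scale the tenant count and VM count independently.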
We did similar tests at the OSIC cluster as well. There is a bit more detail in the OSIC report, but I've kept a high-level summary here. With the maximum number of tenants and VMs that we had in the data center at the time, we did a single-director failure test. As I mentioned, we have three control plane components acting active-active-active, so they are highly available, and each of these control plane components runs some processes, all distributed among the three. If we kill one of the directors, it shouldn't affect the topology, since everything should be evenly redistributed among the other two. That's what we tested: we killed one of the director LXCs, and all the processes running on that particular LXC got restarted on the other two. From the data plane perspective, since the traffic never goes to the control plane, there should never be any problem on the data plane, and the data plane was completely verified for all the VMs; there was no issue on that side. A similar test was done for full cluster failure: you have a three-node control plane cluster, so what happens if the whole cluster fails? We did a test where we killed all three director LXCs. Again, since IOVisor is our forwarding plane,
it's intelligent enough: all the data about the currently deployed topology is already there, so there shouldn't be any downtime even when your director cluster is down, which we call headless mode. We tested exactly that here: we killed all three directors and then successfully verified connectivity across all the VMs we had. You'll get a bit more detail from the published report, but these are the high-level results. The last bit of testing we did was Rally testing. Rally basically gives you benchmarking numbers at different scales for different operations you can do from an SDN perspective. Some of the tests are listed on the right-hand side: creating networks and listing them, creating routers and listing them, creating security groups and listing them, subnets, and creating VMs and listing them. We ran the Rally tests at three different points: first with no load on the system, so there are no VMs and you just run the Rally tests; next at a certain number of tenants and 160 VMs; and third at 80 tenants and 6,800 VMs. We have a bunch of graphs with a lot of detail about the numbers we got, and all these graphs are explained in detail inside the OSIC report as well; at a high level, I'm just showing the graphs.
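Rally runs like those above are driven by a task file that lists scenarios and a load profile. A sketch along these lines uses Rally's stock Neutron and Nova scenarios; the iteration counts, concurrency, and context values here are illustrative, not the ones used in the OSIC runs.

```json
{
  "NeutronNetworks.create_and_list_networks": [
    {
      "runner": {"type": "constant", "times": 100, "concurrency": 10},
      "context": {"users": {"tenants": 5, "users_per_tenant": 1}}
    }
  ],
  "NeutronNetworks.create_and_list_routers": [
    {
      "runner": {"type": "constant", "times": 100, "concurrency": 10},
      "context": {"users": {"tenants": 5, "users_per_tenant": 1}}
    }
  ],
  "NovaServers.boot_and_list_server": [
    {
      "args": {"image": {"name": "cirros"}, "flavor": {"name": "m1.tiny"}},
      "runner": {"type": "constant", "times": 50, "concurrency": 5},
      "context": {"users": {"tenants": 5, "users_per_tenant": 1}}
    }
  ]
}
```

Running the same task file with the cloud empty, then at 160 VMs, then at 6,800 VMs is what makes the three sets of API-latency graphs directly comparable.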
You'd see a lot more detail in the OSIC report, but at a high level what we saw was what we expected: as the scale grew from no load to a certain number, the overall API time increased, but there wasn't any alarming increase or any major red flags; it was the usual increase we expected. All right, so let's talk about some of the learnings we had during the course of these three weeks of testing. From an installation perspective, we didn't face any major issues; it took us about two days to do the complete 130-node installation. One minor tweak: at certain points we found that during the OSA installation, the deploy node SSHes into each of the target nodes, all the compute nodes, and downloads a bunch of things, and there were a couple of SSH timeout errors. The fix was to increase the retry counts and timeout values within the playbooks, and that takes care of it. Other than that, there weren't any major issues from an installation perspective. From the testing phase, one of the major things we faced at the start, since we were using the community playbooks for the OSA installation, was that at a certain scale the API calls and the Horizon dashboard became extremely slow: it would take time just to scroll between the pages of Horizon, and even a basic API call was taking some time. So we went back to the community, got a lot of help from the Rackspace team, and tweaked some things based on their recommendations.
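The SSH timeout fix mentioned in the installation learnings amounts to raising Ansible's connection timeout and retry settings. With stock Ansible those live in ansible.cfg; the values below are illustrative, and where exactly OpenStack-Ansible picks them up can vary by release.

```ini
# ansible.cfg (illustrative values)
[defaults]
timeout = 120            # seconds to wait for an SSH connection

[ssh_connection]
retries = 5              # retry transient SSH failures before giving up
pipelining = True        # fewer SSH round-trips per task
```

At a hundred-plus hosts, occasional transient SSH failures are almost guaranteed, so retrying is usually preferable to re-running a whole playbook.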
One was the usual suspect, which in this case is RabbitMQ. Around RabbitMQ, we applied some optimization patches that you need when you're running at a certain scale, and on production systems there is some tuning you need to do on the system limits and some kernel parameters, so we applied those tunings, along with some thread pool sizes for the Nova and Neutron services. Since we were running the community OSA playbooks, there were some changes you need to apply when you're operating at that particular scale, and those were done on the recommendation of the Rackspace team. When we made these changes, there was a marked difference in API call response time and on the Horizon dashboard as well. When we started the October round of testing, we applied them right from the start and didn't see the problems again. Another common issue you face with Horizon is that at a certain scale it gets slow, which is something we have seen in the community as well. What we did in the end, since we had limited time, was change the landing page: on the initial page, Horizon has to run a lot of queries for all the data it needs to show, so the easier approach is to change the post-login landing page to Projects or Instances or something like that, so logging in doesn't take a long time. So those are some of the major points from the challenges; again, you'll see much more detailed discussion in our report. And yep, that's about it.
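On the Horizon landing-page change just mentioned: Horizon is a Django application, so one way to redirect users after login is the standard Django setting in Horizon's local_settings.py. The file path and panel URL below are common defaults and may differ per distribution and release; this is a hedged sketch, not the exact change made in the OSIC tests.

```python
# /etc/horizon/local_settings.py (path varies by distribution)
# Send users to the instances panel after login instead of the
# query-heavy default overview page.
LOGIN_REDIRECT_URL = '/project/instances/'
```

The overview page aggregates usage across many services, so skipping it avoids that burst of queries on every login.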
I think I'll hand over to Melvin. Thanks. So Travis talked about essentially what OSIC is, its mission and tenets, and how we're carrying out that mission at a high level, and that shows in the contributors: you can see the numbers for the contributors, which translate to contributions, which translate, of course, to things getting fixed in OpenStack and potentially being adopted at a larger clip. Jamal presented PLUMgrid's successful use of OSIC: 131 nodes, OpenStack-Ansible Liberty, ONS v4, I don't know if you guys paid attention to that, but they deployed ONS v4 and CloudApex v1 in April, took the lessons that were learned, and in October bumped up to ONS v6 and CloudApex v2. That speaks to the usefulness of the cluster for learning things in a staging environment so that you can push into production with confidence. Operations, essentially, as he was saying, is charged with ensuring that the cluster is available and stays available. So one might argue the logical next step, or the question that comes to mind, is: how do I get access to
OSIC? I don't want to gloss over all of this, and I'm not going to try to explain all of it, but essentially you go to osic.org and submit a request there. There's a process that happens, your request gets approved, and you get your nodes. What we do in operations, among other things, is make sure those nodes work: make sure there are no issues with the hard drives, make sure the NIC cards are working perfectly fine, so that when you get them, they're ready to go out of the box, so to speak. And then once you're done using them, we have a cleanup process; we have a provisioning and a cleanup process that we use to try to reduce the time from all the steps that happen here to you getting the nodes and actually using them. For requesting resources, you can go to osic.org/clusters, and you can learn more about the cluster itself at that bit.ly link, or "bitly," however you want to say it. That link will tell you what's available, give you more details about the flow chart I showed, and list the specifications for the nodes as well. One thing we're looking at doing is moving away from the website, because right now you have to do emailing and so forth and fill out a form, to possibly using GitHub. It's still in the works, but it's actually going to be pretty good for us, and for you as well, because you can see what folks have requested and get a chance to see how that process happens, so you can skip some of the delay between requesting resources and getting those resources. And that's pretty much it; I'll pass it back over to Jamal for questions. Yeah, if you have any questions, I'll be happy to take them. Yep. Okay, so one question: did you check the data path on the deployment that you did?
A data path meaning performance? Yeah, your VMs; you've instantiated about 4,000 VMs, but what for? I mean, you didn't really need to create VMs unless they were doing something; just ports would have been enough. We were actually, by creating the VMs, those 4,000 VMs, really putting load on the components within OpenStack, and we did learn that beyond a certain number of VMs, if you're hitting 2,500 or 3,000, there are different things you discover even with something as simple as nova list: how much time does it take just to do the nova list? So there were other goals from that perspective. Specifically on performance numbers, we didn't do any performance testing of the data path; for that you don't need 2,000 or 4,000 VMs, you can do it with far fewer, but we didn't do any performance tests in this round. Sorry, so maybe I misunderstood; I thought that your product is in networking. Yeah, but you still have something to check there. So yes, the overall goal was to check, from a scale perspective, what the different components are and how they work at this particular number. We didn't do specific performance tests at scale; what we did was create multiple different network components at this scale and see what happens when you're creating those components and running Rally tests at this scale. The performance tests and the other base testing we do internally when we're qualifying our releases do include some performance tests; I'd be happy to share those numbers with you offline, but right now in this cluster
we actually didn't touch any performance numbers. Okay, so my first question is actually for you, I think: if someone wants to start testing on OSIC, what's the pipeline like? You mean in terms of cycle time from request to actually getting capacity? It varies; that's why I said you fill out all this stuff, but you may miss something, so the hope is that by moving it to GitHub we get out of the delay of emails and calls. You make a request, and there's a governance board that decides who gets what; the governance board may come back and say, hey, we need this information, and rather than that being lost in a bunch of emails, it's on GitHub, in comments on the issue that you submit, and then we're waiting on you to reply back. We use tags to show the status of those requests. I think from approval to starting testing, we're running about six to eight weeks out; there's a bit of a queue, and everyone gets a three-week slot, so all of the clusters are basically always consumed, and we're generally accepting requests for the next slot. But we do have the OpenStack project where we have a developer cloud, and the cycle time to get into that is much shorter. So if you're looking for VMs, okay; if you're looking for bare metal, you're probably looking at six to eight weeks. Is that at a much larger scale than this, like a thousand nodes? Not a thousand nodes; we've got it carved up, and I think 200 is the largest single pool we can allocate right now. Actually it's 240 to 242. 242, yeah. Hi, in my case, I didn't know about this IOVisor and found it very interesting. I wonder if you could explain a little bit more about it; for example, is it easier to troubleshoot than OVS, where you can just run regular tcpdump? And in the disaster recovery scenario that you tested, sorry,
I wonder: once the control plane is shut down, I reckon that because the control plane is down, it cannot react to topology changes, for example, right? That is another assumption I have. And yes, I wonder if you could comment a little more on this layer-2/layer-3 component. So one question is, what is IOVisor? IOVisor is an open source Linux-based community project. It's built on eBPF technology; it runs on eBPF, and from a high-level perspective it's a programmable, extendable data plane. We don't use OVS; since we use IOVisor, we replace the OVS plugin within the OpenStack installation and use the IOVisor data plane component instead. And the second question you asked was around... IOVisor is a much broader topic, so we can always talk offline as well; I think we're about out of time. The second question was around the HA part. What happens with HA is that once you bring the control plane down, since the control plane is down, you're not able to change anything; that is how it works. If a physical link goes down, all the VMs on that particular host would obviously go down as well, but from a control plane architecture perspective, once the control plane is down you cannot make changes in the data plane. Whatever is there right now, though, all the communication that is happening continues to happen, because all the forwarding tables and all the information are already there in the data plane. Yeah, so DPDK basically bypasses the kernel for performance; we don't use DPDK, we use XDP for IOVisor, and we can talk offline; it's quite a wide topic. Yeah, I think that's it. Yeah, all right.
Thanks, guys, for showing up, and I think that's about it.