Good afternoon, everyone. Hope everyone can hear us; if you can't, just wave your hands crazily and we'll do our best to speak up. Or just wave anyway. My name is Jonathan LaCour. I'm the Vice President of Cloud at DreamHost. My background is development and engineering, and I've spent the better part of the last two and a half years learning the plight of an operator and of operations, and that brings me to Jeremy. So, I've been with DreamHost fourteen years. I started out strictly on the systems side and have been doing more and more development, as you kind of have to once you get into this OpenStack thing. We're going to spend some time today talking about exactly what's been going on there, and the pitfalls, and the bonuses we've been getting out of everything. Yep.

All right, so first I'll give you a quick introduction to DreamHost itself, if you've not heard of us. We've been around for a long time in internet years; we're kind of like grandfathers, you know, sixteen, seventeen years old. We have over 400,000 customers at this point, hosting 1.2-million-plus domains on our hosting platform. That's kind of our bread and butter, something we've been doing for a while, and we're really proud of it, and now we're doing cloud services as well. We have a big open-source focus: all of our work over the years on our hosting platform has been the result of the fantastic work that communities have done on things like Linux, Apache, PHP, and WordPress. We have over 750,000 WordPress sites hosted at DreamHost. That's a lot of WordPress. That's a lot of PHP.
We like to say we kill trees with PHP because of all the CPU it uses up. We're also the creators of Ceph; our wonderful friends at Inktank are just now over at Red Hat, and Sage Weil, the creator of Ceph, was one of our co-founders. So we are very familiar with Ceph, and we're users of Ceph. We're big contributors to Ceilometer, Neutron, Oslo, and all sorts of other things; even myself, the VP, I created a framework called Pecan, which is used by a lot of OpenStack projects now for their API services. So we're big on open source, and we're an OpenStack Foundation member; we're very committed to OpenStack. And I'm going to give you some caveats here, because I'm about to be very candid in this presentation, and so is Jeremy, and we're going to say a lot of things about some of the troubles we've had with OpenStack. But I want to be very clear that we're very committed to OpenStack and excited about the community. All the things we're talking about come from our experiences growing and learning, we're contributing a lot of fixes back upstream, and we want to share as much of this knowledge with others as possible. I think public cloud is a use case for OpenStack that isn't talked about nearly as much as private cloud, and it has a different kind of requirements, so we're going to try to bring those to light here and help other people help us make OpenStack even better for public cloud. So, in spite of any constructive criticism, we're as committed as ever.

So first I want to talk about the fantasy. Two years ago, when we embarked upon this great journey of getting into OpenStack, what did we have in mind? What were our assumptions? What were our goals? What was our vision for this product?
So the goal was to create DreamCompute, a public cloud built on OpenStack. We felt that the world could really use a public cloud from a smaller player like ourselves. Like I said, we have four hundred thousand customers, and we felt we could offer them a cloud service they could really use and enjoy. And we had this great Ceph thing that we had been working on that we really wanted to get out there, using the block storage that Ceph provides for our cloud.

The ultimate vision was to be a truly open cloud. We are very open about everything that runs DreamCompute: what the switching hardware is, what operating system we run on the hosts, what the hypervisors are, every single thing about it. All of the code that we use and generate, the cookbooks, the monitoring, we try to open source; we try to give everything back and go upstream. So we really wanted to create a genuinely open cloud. We wanted to use Ceph for block storage; we are the first public cloud that I know of using Ceph for block storage, and that provides some really interesting features we'll talk about later. We also wanted to get ahead of the game and jump on board with virtual networking. This was important for a number of reasons, primarily the security that comes with layer-2 isolation for tenants: every tenant in DreamCompute is going to get their own layer-2 domain, completely virtualized and isolated from every other tenant in OpenStack. We wanted people to be able to create these networks whenever they wanted, to be able to completely program the network, and we also wanted to do that with IPv6 support from the ground up. This was very important to us as a small provider; we don't have millions and millions of IPv4 addresses anymore, and getting more now is actually very difficult.
You have to go to auction and buy them for a lot of money. And IPv6 is here, or not entirely here yet, but we want to help push it forward, so that was another critical part of our vision. And we wanted integrated authentication. We wanted DreamCompute to be part of the DreamHost family of products: we wanted people to be able to use credentials they had already created, and the same sort of management abilities, to connect to DreamCompute. So this was the vision; this is what we had in mind. Now Jeremy is going to walk you through the operations side and the physical side of that vision.

So, you can see it's nothing super crazy. We actually use VMware NSX for the network virtualization, and you probably can't see this diagram, but don't worry, we're going to talk you through it and publish the slides as well. The top row is the main compute pods, basically three racks in our environment. Everything is controlled by a separate control pod, which, strangely enough, is all running on a VMware cluster: all of our endpoints for the APIs, the various OpenStack service nodes, that sort of thing. We didn't want to start with OpenStack on OpenStack. Yeah, we wanted to start with, you know, something else, and the goal is to eventually transition off of that. Exactly: when you're starting your first OpenStack cluster, having OpenStack as a dependency is not exactly what you're hoping for. So we decided to go with the tried-and-true, easy, just-buy-it-and-get-it-working solution. So that runs, you know, the Ceph monitors.
We've got Chef nodes, all of our central services, and the API endpoints for Cinder and for Glance and Nova and Neutron; that's all running in that control pod. Back over to our compute pods: it's basically a giant stack of storage boxes on the bottom, Dell R515s if you really care, twelve disks each, three terabytes apiece, and then a bunch of super-beefy AMD hypervisor nodes. Everything's got 64 cores and 192 gigs of RAM, which lets us oversubscribe only two-to-one on the CPUs, which is pretty great, better than most of the other public clouds we're aware of. That's over 3,000 cores, for those of you doing the math. It's a lot of cores. Also, AMD.

And then all of the networking gear. We decided to try something new: the network layout terminates layer 3 at the top of every rack, so each layer-2 domain is very, very small. You don't have to worry about any layer-2 redundancy protocols or any of that crap; we just do OSPF between all of our switches. They're all ODM switches, just white-box stuff running Cumulus Linux: forty-eight 10-gig ports and four 40-gig ports for interconnects between them in a partial mesh. It's pretty awesome. You end up spending 60 bucks a cable instead of 600 from one of the other guys, which is the majority of your savings in the long run. You get fewer stickers, though. You do get fewer stickers, big thing, fewer stickers. Sometimes you might get the wrong sticker. Yeah, which is also true. And that has actually been one of the most rock-solid bits of our architecture so far; it's almost a surprise, really, how well it's been working for us. And all of the architecture decisions we made were really driven by the fact that we're a small team, a very small team. We have two operations engineers today, two very tired operations engineers, working on DreamCompute, and that is an incredible feat and a testament to them and to some of the decisions we made. But those weren't necessarily the easiest decisions, or always the right decisions.
So we're going to go through some of the nightmares now So clearly, you know, it's open-sac. It's a giant complicated problem Project or both So clearly not everything went the way that we wanted it to from the beginning the compute side in particular is What I like to consider an onion not only are there layers upon layers, but you end up crying a lot You've got Nova compute sitting on top of Libvert sitting on top of QEMU 99% of the time they talk together great and then that 1% of the time you're wondering Why is there a QEMU process that nobody thinks should exist and it's Causing, you know some set volume to be unable to be deleted. It's causing ports to stick around in neutron The most bizarre things that you never think should happen all tend to happen all at once To combat that we've kind of written our own suite of software utilities Nova auditors neutron auditors Cinder auditors Auditors for the auditors. Yeah, that's very important as it turns out. In fact, maybe we should just name the project IRS Yeah, it's a good idea and bundle it all together So that's all up on GitHub. I think I think we posted the link later on. Yeah So continuing on Nova Network is no more the future is neutron the future is always great and shiny and just a brand new projects always work It is always true So we were pretty excited about neutron particularly because we're doing a lot of layer 2 virtualization and all this white box stuff and it's all bleeding edge And so, you know, you got three parts bleeding edge. You might as well out of fourth Because why not? That's always smart and sexy and exactly what you want to do Neutron While it was designed for this sort of thing from the ground up it kind of wasn't at the same time They suffer in the same way that a lot of other open stack projects suffer where they are developed by developers who've never run anything and There's a lot of outside influence coming in from various vendors. 
So you've got that dynamic. And I'm a developer, and I've done exactly that, and I love you guys. I'm an ops guy; I do it too when I'm developing something. Right, and we can say the same things about you crazy people. So you end up with situations where feature parity between drivers is almost there, and it's there enough that you always expect things to work, and then they don't. This is a hard, hard problem, so I want to be very clear: we're huge fans, we're deploying Neutron now, we run it, and we've got it working for us. But it is not an easy problem, and when we first adopted Neutron it was pretty early, so a lot of progress has been made since. Yeah, and we've been pushing a lot of code there. And in no way is this anything Neutron-specific; this is something that happens in any complex distributed system, so it's to be expected. You just need to know what to look out for.

So, software-defined networking was so great, and we were so excited, and there were all these great vendors. I mean, there was one at the time, or two or three maybe, most of them in stealth mode; we're talking about three years ago here. So we talked to a bunch of different people, and we ended up selecting Nicira, which is now VMware. I'm still not used to saying that, but it's pretty cool, and we're very excited for them. And they provide just rock-solid L2 network virtualization. We knew that was the case; they were the people we selected because they provided the best of the basics, and it worked really well. But there was this whole need for, you know, L3. What if you can't actually route anything? That's kind of a problem. So we talked to them about that, and at the time they didn't really have anything. Since then they've developed some things that provide L3, and that's great and fantastic, but we didn't have time for that. We needed something today.
So we went out and talked to a bunch of vendors who had software routers or some sort of L3 capability for us, and it turns out that, at the time, none of them really got it, especially cloud. You go to them and say, yeah, I need this for my cloud, and they say, well, how many do you want? I don't know. It depends on how many instances get fired up and, you know, crushed every day. It's a cloud; it's going to change all the time. Can you bill us by the instance-hour? By the what? Yeah, they didn't get it, and they didn't get open source either. So this was a real problem for us.

So, you know, we did what any crazy person would do, and what a lot of developers and operators have done, and said, we'll just do that ourselves. Yeah. That's always the right move. So what we decided to do was start up this project called Akanda. It is a software router that runs as a service VM, so we get distribution across the cluster thanks to Nova and all this other stuff. It's actually just running OpenBSD and PF, OpenBSD 5.4 I think, and we're testing 5.5, so we're able to keep pretty up to date with the latest releases. The routers run one per tenant. You can, well, it's not great, I'll get to this later: you can add multiple networks, but the process is not super great, thanks to OpenBSD. They also run a RESTful API service, so you can push firewall rules, you can push DHCP, your metadata, all of that; it gets configured on the fly as a push service. To maintain all of this, we have what's called the rug, which was, yeah, Mark's name for it; one of our engineers, Mark McClain, named it, and it's because it ties the whole room together. If you've seen The Big Lebowski, that's why it's called the rug. It is the service that essentially glues everything together.
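Conceptually, the rug is an event dispatcher: messages come off the queue, and each one maps to an orchestration action. This toy version uses invented event names and action strings, not the real Akanda message schema:

```python
# A caricature of the rug: watch events coming off the message queue and
# map each one to a layer-3 orchestration action. Event types and actions
# here are illustrative, not the real Akanda messages.

def dispatch(event):
    actions = {
        "instance.create": "ensure a router exists for the tenant network",
        "router.create": "boot an Akanda appliance VM and push its config",
        "network.add": "plug the new network into the tenant's router",
    }
    return actions.get(event["type"], "ignore")

queue = [
    {"type": "instance.create", "tenant": "t1"},
    {"type": "router.create", "tenant": "t1"},
    {"type": "volume.create", "tenant": "t2"},  # not a layer-3 event
]
for event in queue:
    print(event["type"], "->", dispatch(event))
```

The real service obviously does much more (state tracking, retries, talking to the appliance's REST API), but the watch-and-react loop is the core of it.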
It spins up the routers; it's the orchestration of all of our layer-3 services, basically. It watches various message queues for events like "I've created a new VM" or "I've spun up a router" or "I've added a network", and then it does the appropriate things via the API running on the Akanda image.

Now, there are some problems. OpenBSD, we found out just very recently, does not actually do network hot-plugging: it does PCI hot-plugging, and then the PCI devices sit there and don't get configured, which means we can't simply add new networks to a router without a reboot. That's something we're trying to figure out how to overcome. It's rough, but right now most use cases don't actually use more than one network. So again, we're in the nightmare section right now; we're going to get to the dreams in a bit, it's not all so depressing. Yeah, the whole talk is not going to be a downer; we're just opening the kimono here on all the bad stuff. So, some of the interesting issues we've had: Neutron is not super happy about handling really rapid API requests, particularly when you're running NSX on the back end. There are a lot of layers there, a lot of synchronization has to happen, and it gets a little bogged down. Race conditions, it turns out, are a huge deal, and synchronizing state between all these virtual devices and the central process is a hard problem to solve. We recently, I think, probably ironed out the last of the major issues, but it's been years, literally years; it's not an easy problem. Updating the service VMs is hard; that's actually one of the more expensive processes, because you basically have to re-initialize all the state. So if we've got a hundred routers running and we have to reboot them all to do an upgrade to the image, which, you know, isn't just an OS update.
It's also upgrading the appliance application that's running there, or anything else. It basically means re-initializing state for all hundred routers as rapidly as we can, which ends up being quite taxing on pretty much everything. There's no great way to do it other than to just, yeah, kick it over. And DHCP management is one of the things we actually thought would be super easy, and it turns out it's not. DHCP and metadata, even in bare Neutron, suffer from some pretty heavy delays; 30 to 60 seconds, I think, is about what we were seeing. And it turns out, as we'll get to later, that our own home-built system is actually surprisingly good at that compared to what everybody else is using. And all these delays Jeremy is talking about impact our vision: one thing I didn't mention was that we wanted 30-second boot times, 30- to 40-second boot times. That's a big ask; that's very fast. And that's including the OS, everything; not just until the image starts to boot, but up and running and you can SSH in. Thirty seconds is the goal.

Now, Ceph. Because we built it ourselves, partially to be able to run stuff like this, and partially, you know, for the hosting side, which is pretty obvious, we try to use it as extensively as we can. It's backing all of our block storage in DreamCompute. It's behind DreamObjects, our Ceph-based S3 and Swift product. All of our images are stored in it as well, inside Glance, which is great, because it means you've got one cluster storing all of your data, and it's a copy-on-write operation to boot from an image into a volume. Remember: fast, super duper quick.
Yeah, sub-second volume creation from any image in our cluster. Now, we did have one incredibly amazing issue with this. If you remember from the slide before, we've got about 1,100 disks in our cluster. At some point, I think because we started the cluster at such an early version and it's been through a lot of major upgrades, our CRUSH map, which is what distributes all of the data, or maps out how the data should be distributed in your cluster, got into a state where the, I want to say, twenty terabytes of data we had at the time ended up on ten hard disks. Not exactly a massively distributed data store. We had all of these empty disks, and, like, twenty terabytes on just... yeah. So the funny thing is, we didn't notice, because it actually still performed pretty well. But the problem is, when you're redistributing that much data off of only ten hard disks, it doesn't happen quickly. No, it was an angry three weeks of data redistribution, because it was all coming off of ten spindles. We have a solution to this as well, so this will not happen to you; learn from us.

Okay, so Cinder. You know, nova-volume is no more; Cinder is going to be great, it's the future, it's another new project, it can't be bad, right? We've heard this before. So, Cinder is actually great; again, I don't want to tell you it's a bad thing, but there are some interesting things about it. It was forked from Nova, so it's a great start, and you can get a lot done with it, especially if you're doing some very specific things like using local storage or LVM, whatever. If you're using something else, say, I don't know, Ceph, things get a little different, because you can't assume that every operation is going to be hyper fast, right?
Deletes in Ceph in particular are a little slow; creates are super fast, which is kind of what you want. The problem is that cinder-volume is currently designed to run as a single process, and so you can get a whole bunch of delete operations queued up in front of your creates. We actually graph and chart every single boot; we spin up about 6,000 VMs a day and destroy them, and we graph all of the boot times because we want to get them down, down, down. We'd see things spike from the 40-second mark up to 120, 140 seconds, because Cinder was just sitting there going delete, delete, create, create, create, delete. So this is a problem. We have a Ceph and Cinder war-stories panel tomorrow that I'm on, and Jeremy as well; we'll talk a little more about this there, and I'll get to how we fixed this problem in the next section. So: scaling Cinder beyond kind of private-cloud scale is not easy. And to be fair, that makes sense given where Cinder came from, because it was absolutely designed with the idea that once you create a volume, that volume belongs to the node it was created on, since you're dealing with iSCSI; there's a one-to-one mapping between the cinder-volume host and the storage mechanism. With Ceph, any cinder-volume process can handle any volume in the cluster, because they all talk to the same thing, and that's a use case that really wasn't considered when Cinder was created. Which makes sense, because it didn't really exist when Cinder was created, and nothing comparable was really out there either.

So let's talk about Keystone. Like I said, we wanted to do integrated auth, and this seemed like a great idea to us at the time. We were going to use the Keystone plug-in infrastructure, which is there for just this purpose, right?
Well, it turns out, no, it's not exactly there for this purpose, and no, it doesn't actually work particularly well for our use case. Also, it happens to change from release to release in breaking ways. So this is kind of on us: we chose to do something here that was out of the norm, in a way that wasn't the suggested way. We did get it working, but Keystone, as you know, is very central to all of OpenStack. If you make Keystone slow, what you've just done is made all of your OpenStack installation slow; your entire deployment slows down if Keystone is slow. So yeah, we had a solution to this one as well, and it's actually kind of a funny one. We'll get there.

So now, the realities. It's not all bad news. Yeah, we've been working on this for a very long time, and we feel like we've finally gotten over the hump and things are starting to work the way we hoped they would. Image building is actually a surprisingly interesting problem. We've come up with a methodology where I literally just throw a reboot into rc.local and let the images cycle. It's kind of amazing the sorts of bugs you uncover in the boot process by doing that. In particular, on Precise we found there's a deadlock in mountall: about five percent of our boots were just deadlocking, saying it couldn't mount some random thing like /tmp or something else. And it's a bug that, well, there are two bugs: one of them was flagged for backport and forgotten, and the other one was apparently just ignored. By the way, we are having conversations with the appropriate people about getting this fixed. Yeah, this is the kind of thing you turn up when you do this sort of testing. We had a pretty good conversation with them today, and I think we finally talked to the right people.
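That reboot-in-rc.local soak test boils down to counting outcomes over many boot cycles. Here's a toy tally with fabricated log entries; the five-percent figure just mirrors what we saw with the mountall deadlock:

```python
# A back-of-the-envelope version of the reboot soak test: put a reboot in
# rc.local, let an image cycle for a while, and tally how many boots hang.
# The boot outcomes here are fabricated for illustration.

from collections import Counter

boot_log = ["clean"] * 57 + ["mountall-deadlock"] * 3  # 60 boot cycles

counts = Counter(boot_log)
failure_rate = 1 - counts["clean"] / len(boot_log)
print(f"{len(boot_log)} boots, {failure_rate:.0%} failed")
```

The point of the exercise is the sample size: a bug that bites one boot in twenty never shows up when you boot an image twice by hand.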
Things are going to get all ironed out; that's awesome. So, GitHub: this is where we've been storing some of our tools, the Nova and Cinder auditors, and I think I've got a Keystone thing up there. Basically, every time we encounter a problem, we write a script either to fix it or at least to report on it, and toss it up on GitHub.

Now, layer 3 and network virtualization. We've discussed this a bit in the pains, but it turns out that once you deal with it for a while, it starts to work. We hit a bit of a milestone about two months ago. We weren't really sure we were going to get the rug working as well as we needed it to, so what we decided to do was talk to VMware, get a cluster set up, just a small one at test scale on one of our hypervisor boxes, and put together a deathmatch: their layer-3 services with all the bare-bones Neutron stuff against our layer-3 services with Akanda and the rug. And we ran it for two days, three days, something like that, over a weekend, and just compared times, the same boot-time tests we do for everything else, where we find out how long it takes from when cloud-init launches until cloud-init finishes and you can SSH in. It turned out that we were actually two to three times as fast. We've talked to the Neutron guys; apparently they are quite aware of some of the timing issues, particularly with the metadata service in Neutron, and I think Aaron said Juno was going to see a lot of fixes there, so it's looking promising. Another point on this front: this is not a statement that the L3 services from VMware aren't good. They're actually great, but there were a couple of things specific to us where they didn't make sense. One: no IPv6 support, which is something we said was one of our tenets, something we really wanted to have. We were open, two months ago, to just saying, well, we gave it the good old college try, and IPv6: bye. But, you know, there were also some other things.
There's DHCP: the way you have to do DHCP in the L3 services for NSX is very different from the way we're doing it, and we actually ended up preferring ours and being able to scale ours better. So yeah, we were surprised here, actually; we were expecting to be defeated, but we weren't. One other interesting thing: as a bit of a workaround for those times when DHCP does lag a little, we configured all of our images to try using the config drive first. Which, it turns out, Nova just kind of does for you even if you don't tell it to. There's an option to always use the config drive, and basically, as far as I can tell, what that does is disable DHCP; but the config drive is actually always there even if you don't tell it to, so with cloud-init the image will just try to mount something with the right label, "config-2" if I remember correctly, which is kind of odd. And then it works, and it actually allows your image to boot up a little faster. It might take a little while for the networking to actually get packets in, but it means you don't have that two-minute timeout, or whatever it is, for DHCP; we end up being just a few seconds slower.

So, as Jonathan said earlier, our testing consists of basically spinning up about six thousand VMs a day, tearing them down instantly, and graphing it: at every point where we can determine something has completed an important stage in the boot process, we graph a data point, and if anything fails, if at the end we can't SSH in or we see in the log that cloud-init wasn't able to do its thing, we graph that as another data point. So we have failure counts, success counts, and timing info for all of it. And this keeps us up at night. We lead off every stand-up meeting every morning looking at what our current success rate is, how fast our boot times are, how many failures did you get, and why.
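The bookkeeping behind those graphs is conceptually simple: one timestamped data point per boot stage, plus a success flag per boot. A toy version, with made-up stage names and timings rather than our actual probes:

```python
# Minimal boot-time bookkeeping: record seconds-since-launch per boot stage,
# then compute the success rate and the time-to-SSH distribution.
# Stage names and numbers are examples, not our real instrumentation.

from statistics import median

boots = [
    {"stages": {"api_accept": 1.0, "active": 12.0, "ssh_ok": 34.0}},
    {"stages": {"api_accept": 1.2, "active": 15.0, "ssh_ok": 41.0}},
    {"stages": {"api_accept": 0.9, "active": 13.0}},  # never became reachable
]

ok = [b for b in boots if "ssh_ok" in b["stages"]]
success_rate = len(ok) / len(boots)
ssh_times = [b["stages"]["ssh_ok"] for b in ok]

print(f"success rate: {success_rate:.1%}")
print(f"median time to SSH: {median(ssh_times):.1f}s")
```

Feed a day's worth of these into your graphing system and the spikes (like the Cinder delete backlog) become obvious at a glance.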
So this has been all of our conversation lately, which is good. Yeah, our QA guy actually wrote a little script that uses PhantomJS to download all the graphs and just email them every morning. It's pretty great. It's really nice when you get back to the 99.998 percent success rate; it's not so great when you see the 85 percent.

Now, Ceph. I don't know if the CERN guys are in the room, but apparently, yes, apparently we're tied for the biggest operational Ceph clusters running in the world, at three petabytes. And as I'm sure they'd be happy to tell you, it's not always great. Any cluster that size, no matter what workload it's running, you're always going to run into some problems, just like with OpenStack. After we had that data-distribution problem, I basically spent two days just writing Nagios plugins to check every single last thing I could figure out we could have caught if we had known to look for it. Those are also up on GitHub, under my own repo, because, to be fair, most of these problems are caused by human error, which you really should be monitoring for. I mean, if you're going to monitor the one thing that's most unreliable in your cloud, it's the people who run it. Yeah. And honestly, we don't know how long it had been in that state; we don't know if it was somebody who fat-fingered something or an upgrade gone bad. That's on us, and that's what all this stuff is supposed to catch. And even if you're a small shop and can't afford Inktank's professional services, you're not on your own.
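Most of those plugins reduce to one question: is the data actually spread out the way CRUSH is supposed to spread it? A sketch of that kind of check, with made-up utilization percentages rather than real `ceph osd df` output:

```python
# Sketch of a "did our data collapse onto a few disks?" check, the kind of
# Nagios plugin we wrote after the CRUSH incident. Utilization percentages
# are fabricated; a real plugin would pull them from the cluster.

def check_spread(osd_util, warn_ratio=3.0):
    """Warn if the busiest OSD holds several times the average utilization."""
    avg = sum(osd_util) / len(osd_util)
    worst = max(osd_util)
    return "WARN" if avg > 0 and worst > warn_ratio * avg else "OK"

healthy = [61, 58, 63, 59, 60, 62]
lopsided = [0, 0, 0, 0, 0, 95]  # the "all our data on a few disks" failure

print("healthy cluster:", check_spread(healthy))
print("lopsided cluster:", check_spread(lopsided))
```

A check like this would have flagged our twenty-terabytes-on-ten-spindles situation long before anyone noticed it by hand.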
The Ceph developers are on IRC, and they're very active on the mailing lists. You'll be hard-pressed to find any time, day or night, when you won't get a response within a couple of minutes on something that's gone horribly wrong, or horribly great, if you feel like sharing. Or you can get Sage's cell-phone number, which I figured out, and that's a really great way to get things going. I'm not sure you'll be able to do that, though.

So let's talk about scaling Cinder. More workers, fewer problems, right? Anything you're going to operate at scale, you need to be able to divvy the work up amongst a bunch of different workers; you can't just have a single-threaded operation sitting there pulling things off a queue. It's senseless. So, you know, we tried a bunch of different things, including some of the stuff we heard the CERN guys were doing, and what we ended up doing, the thing we're working on lately, is we actually implemented an RPC driver that will intelligently send certain operations to different topics, and then we made a small patch to cinder-volume so that it can bind to only specific topics. In this way we can create different pools of Cinder workers, a bunch of cinder-volume workers that each process only specific things. So I can have a pool of cinder-volume nodes designed purely to make fast operations happen as quickly as possible, and a separate pool to relegate the slow operations to, so that they don't slow down the fast ones. And this helps even out the lumpiness: what we were seeing in our boot times was these big spikes, and it's now starting to flatten out, and I'm hoping the RPC change, which we've been talking to the Cinder devs about, finishes the job. Well, if I remember correctly, that one-line patch to Cinder was actually a bug fix. Yeah, it was a feature that was supposed to work, and didn't. Note to everyone: if you configure cinder-volume to listen on a different topic, it just won't. But now it will. We're going to upstream that.

All right, Keystone. So, we don't always have the answer to these problems, where we can say, here's the smart trick we did to fix it. In this particular case, we met with some of the Keystone devs and talked to them a lot about our use case, talked to some of the people on our security team, did a lot of thinking, and decided to give up. We went back to vanilla Keystone, and you know what? Life is much better. And at the end of the day, we actually prefer it this way, now that we're here, because we have the ability to create a bunch of different users in a very nice way, and we can integrate with our control panel. This is where failures are actually kind of good sometimes, because they make you rethink why you're using some feature or developing something; after you've been developing it for a year, circumstances change, and it's good to go back. De-scoping is okay. We did that in a couple of other places that I haven't mentioned, especially on the networking side, and if you have questions about those, we'll have a couple of minutes left to talk about them as well.

All right, so on to the dreams. What did we get right in the end? Fewer slides in this part, guys. Yep. So, packaging and testing. Like, I'm sure, probably half the room here, we roll our own packages. We use Jenkins, Jenkins Job Builder, and FPM, and we package everything inside a virtualenv, because everything's running on a bunch of Precise boxes and we use virtualenvs for dependency management, basically. Thanks to Jenkins Job Builder, we're able to pretty easily target any particular patch set or tree or branch in a git repo, so we can pull in new stuff, and we can run our own patches on top of something if we decide to fork. It ends up being really flexible, and we're liking it a lot.
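Circling back to that Cinder change for a moment: the heart of the topic-routing idea is just a classifier in front of the RPC cast. The topic names and the fast/slow split below are our own convention, a sketch rather than the actual driver code:

```python
# Sketch of the idea behind our Cinder RPC change: route slow operations
# (deletes, which are slow against Ceph) to one topic and fast operations
# (copy-on-write creates) to another, so dedicated worker pools handle each.
# The topic names and the operation split are illustrative, not stock Cinder.

SLOW_OPS = {"delete_volume", "delete_snapshot"}

def pick_topic(method, base="cinder-volume"):
    pool = "slow" if method in SLOW_OPS else "fast"
    return f"{base}.{pool}"

calls = ["create_volume", "delete_volume", "create_volume", "delete_snapshot"]
for method in calls:
    print(method, "->", pick_topic(method))
```

Workers bound only to the fast topic never sit behind a queue of deletes, which is exactly the lumpiness we were seeing in the boot-time graphs.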
It works really well. Any packages that get built through this — which in most cases happens on a nightly basis, or as it detects changes going into upstream GitHub — automatically get rolled out to our staging cluster and burn in for a little while before a manual promotion into our production cluster.

Whiteboxing: whitebox switches, as I mentioned earlier, with Cumulus, who are a great partner. They've been public for how long now — six months? A while, yeah. They're pretty awesome. Basically you end up with a Linux box — essentially Debian, in our particular case — that shows up with 48 Ethernet interfaces at 10 gig, plus four 40-gig ports that we use for interconnects. On the gear we have, I think you can do QSFP breakout on those and get four 10-gig ports out of every 40-gig port if you really need some more density. On the newer Trident 2 stuff that's just starting to trickle out now, you're looking at, I think, 36 40-gig ports — or four times that number of 10-gig ports — which is just insane. The other nice thing about buying whitebox switches: wow, do you save some cash.
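The breakout arithmetic above is easy to check. A toy sketch, using the port counts quoted in the talk and the standard 4×10G breakout per QSFP+ 40-gig port:

```python
def ten_gig_ports(native_10g, qsfp_40g):
    """Total 10 GbE ports if every 40 GbE QSFP+ port is broken out 4x."""
    return native_10g + 4 * qsfp_40g

# The 48x10G + 4x40G switches described above:
print(ten_gig_ports(48, 4))   # 64 usable 10-gig ports
# A Trident 2 box with 36 40-gig ports, fully broken out:
print(ten_gig_ports(0, 36))   # 144 10-gig ports
```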
The list price on one of these switches is about six thousand dollars if you're buying one — and if you're building a cloud, you are not buying one, you're buying a lot of them, which drives that cost down a lot. Cumulus Linux is actually really inexpensive as well, and their pricing is published too — no games, really great. Oh, and they're doing demos with us at our booth throughout the week, a couple of times a day through Thursday, I think. We'll actually have a live session to one of our switches, so you can see us poke around on it — on the live cluster, actually.

One of the crazier things, which I didn't realize until I was putting together a talk about six months ago: as cheap as the switches are, where we really save is the interconnects. We run just copper for the 10 gig because we're doing pretty short runs, but even with copper, other vendors were going to charge us like six to eight hundred dollars a cable, which is absurd. At least as of a year ago, the last time we had to buy some, we were paying like 60 bucks a pop. At the end of the day, I think our costs were about 10% of what we were paying for our hosting infrastructure using the normal players in the game.

All right, we're running close to our end, so we're going to speed it up a little bit. Oh good, this is the last slide before the end.

So, the Ceph stuff. It's probably one of the better storage platforms that I've used, and I don't say that just because we developed it. Being in the hosting business for 16 years, we've seen a lot of crap come through our door in storage devices. Ceph really does solve all the problems we've had with everything else, and introduces not too many new issues, comparatively. Lightning-fast boots — the OpenStack use case for it is pretty incredible. You literally upload an image into Glance — preferably a raw image, with cloud-init in it — and we can build these images at like three gigabytes, so it's not super huge. It gets cloned into a volume the moment you start up an instance — that takes three-tenths of a second — and then it starts booting. cloud-init will automatically resize that volume to whatever size you want it to be, which is almost just as instantaneous, so you can run an 80-gig instance off of a 3-gig image and be ready to run a website in like 30 seconds.

My favorite thing about Ceph, as the person who runs the business unit, is that I can operate it with a very small number of people, because it's self-managing and self-healing. That doesn't mean these guys aren't sometimes pulling their hair out and working really hard on it, but it has enabled us to do, with a couple of people, something that would normally take a team ten times that large, minimum. Yeah — unless we're doing an upgrade, because it's a cluster of roughly a hundred machines and upgrades have to be fairly well orchestrated. Particularly with major version upgrades, you have to follow a very set regimen for the order in which you upgrade everything, because disk formats change and all this other stuff. But outside of those instances, yeah, we sleep well. Nothing really goes wrong, thankfully — and if it does, it's usually our fault. Yep.

So what does this all mean? Well, good news: the beta is expanding on May 27th. We're looking forward to adding in more people — a lot of you have already signed up. We're really excited to scale it up now, and we really think we've got this thing working, so we're looking forward to your feedback, especially from you guys in the OpenStack community, because if you do see things crop up, you'll help us find out what the problem is and get it fixed. So the future is bright; we're out of the weeds. Why aren't we going straight to GA? Well, we want it to be perfect. No, no — we don't want it to be perfect.
We want it to be really good, right? We feel like it's good now, but it's not quite where we want it to be. So come help us make that happen: come by our booth and see a demo of Cumulus Linux running on our whitebox switches, in our actual cloud, and come see us tomorrow for the Ceph and Cinder war stories talk — at 11:50, I think. Thanks!

Questions? I think we have time for maybe a couple — if you could go to the mic, that would be great.

What hardware vendor are we using for the ODM switches? I think they've changed names now — they're Agema, I think. Yes, Agema. And what was the previous name? Delta Networks. Yeah.

Another question: you mentioned you guys are doing a vCPU-to-CPU ratio of two to one — is that two vCPUs to one hyperthread, or an actual core? An actual core, yes.

So, you mentioned performance — do you use something like Rally to test performance? And a second question: with something like 6,000 VMs booting at once, and Keystone currently running in an eventlet-based model, do you run it under something like Apache? We use Apache for Keystone. As for Rally, we do not use it; we have our own test suites. We've done some Tempest stuff as well, and now with RefStack coming there are some other things we're planning on doing there. Testing is important to us — all the things, bring them on. So right now you're not using Rally? Not yet.

Okay, but do you package, say, Apache and all those in the virtualenv kind of thing? Not Apache, no — it's basically the OpenStack code and a few extra things. For Apache and the other system services we're just using the stock packages. Thanks.

You mentioned that you created a Nagios plugin for Ceph — how do you deal with monitoring the other parts of OpenStack?
Well, we're not doing a whole lot of monitoring of the rest of OpenStack just yet — mostly it's our performance tests; if any of the endpoints go down, they'll catch it. We will eventually monitor every endpoint individually and explicitly; we're just not there yet. We are using Logstash, Grafana, and Graphite — we've got a lot of things in place, we just haven't written every single monitor yet. Basically everything's being looked at; it's just not alerting anybody.

You mentioned that you guys gave up on some changes you were making to Keystone — maybe I missed it, but what was not vanilla about your initial Keystone? We had implemented a custom plugin to Keystone — middleware or plugin, I'm not exactly sure; Rosario, you can probably help me answer this. It was a plugin. So it was a plugin, and it was hooking into our auth to make it possible to use those credentials. And because of both the way the Keystone plugin mechanism was implemented, and the fact that it was unstable — meaning it was changing in a breaking way from version to version — we couldn't get it to go fast, and we felt it was going to be a burden to maintain. So we just said forget it.

And just one more question, on the Ceph stuff: you guys have shared storage across all the nodes? They all see the same storage? Mm-hmm, yeah. That's it for time — yeah, I think we're maybe out of time, so come and talk to us later. We'll both be at the booth quite a bit over the next couple of days. Thank you.