Now then, the question is, does this work? Well, so much for the quicker working. All righty, apologies for that. Minor technical difficulties; needed to turn it off and on again. It's funny how often that happens in this industry. So we're giving the talk before lunch. If anyone wants to leave now because their belly is telling them they're hungry, I won't hold it against you. So I'm from LivingSocial. I should probably skip to the next slide. I joined them about three months ago to push their cloud efforts; I moved over from HP Cloud Services. We're a reasonably large Ruby on Rails shop, and we do daily deals. It would be remiss of me not to thank the other person who helped me prepare this talk, Paul, down in the first row. He's done a whole bunch of the technical stuff behind this, all the nitty-gritty of actually asking OpenStack to do some work. The slides are going to go up to Speaker Deck. They may well put you to sleep; I'm not going to be held liable for that. Excellent bedtime reading. So what's the problem? Performance in OpenStack has been improving, but it isn't where we need it to be today. Performance isn't really rocket science. At some point you hit a point where you're not getting any benefit back for the investment you're putting in to making things faster. But we're not really at that point today. And some of the people who are running massively scaled-out clouds have had to do a whole bunch of tuning, tweaking, enhancement, and code changes in order to operate at the scale they're operating at. Now, what we're all really interested in is making the experience for the user better. The user is our customer. Sometimes it's an internal user. But the user is also the person who usually phones you up to complain. How many of you have ever had a user phone up and say, "That was too fast, you've really got to slow this down"? It just doesn't happen. I wish I had users that were that easy to please. 
But they're not. Users almost always want things to go faster. They usually phone up and say, hey, things are too slow, how can you make them quicker for me? And that's a multifaceted problem. Now, I'm not the guy who's just going to turn up and say that things have never improved. We've been doing this for a little while now, from Cactus to Diablo to Essex to Folsom, and every single release brought enhancements. Way back when, we weren't setting the default table engine for MySQL, so a lot of installations were ending up on MyISAM. We had no indexes. People who were spinning up and tearing down a lot of instances were pretty quickly running into full table scans, and yeah, things sucked. All I'm saying is that things can improve faster if we as a community focus a little bit more on this. So I'd characterize things today as being like a car. I like cars, so it's a really good thing to put on a slide. That's a pretty reasonable car. It's a bit old, it's reliable, but it can be a bit slow. What we'd really like is for OpenStack to be faster. We'd like it to be more scalable. We'd like it to deal with the small shops and the large shops. We'd like the experience of operating OpenStack to be really neat. But am I just standing up here crying and saying, hey, there's this big deal, I care about performance, why should you care about performance? The answer is that we're a lot like you. We develop changes to OpenStack, not big ones. We operate OpenStack. We're engineers. And we're also users. We see potential in the cloud. We see opportunity. We want to make things in our business better. We want to use hardware more efficiently. We want to gain operational efficiencies. But to do that, for a lot of our use cases, we actually need things to be a little bit quicker. 
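That missing-index problem is easy to demonstrate. Here's a minimal sketch using SQLite as a stand-in for MySQL, with a hypothetical instances-style table: the query planner switches from a full table scan to an index search the moment the index exists.

```python
import sqlite3

# In-memory stand-in for a Nova-style instances table (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instances (id INTEGER, project_id TEXT, deleted INTEGER)")
con.executemany(
    "INSERT INTO instances VALUES (?, ?, ?)",
    [(i, "p%d" % (i % 50), i % 2) for i in range(10000)],
)

query = "SELECT * FROM instances WHERE project_id = 'p7' AND deleted = 0"

# Without an index, the planner has no choice but a full table scan.
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Add the index a schema migration would normally create, then re-plan.
con.execute("CREATE INDEX ix_instances_project ON instances (project_id)")
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

plan_before = " ".join(row[-1] for row in before)  # contains "SCAN"
plan_after = " ".join(row[-1] for row in after)    # contains "USING INDEX"
```

The scan cost grows linearly with the number of rows that have ever existed, which is exactly why high instance churn made un-indexed deployments hurt.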
So every time I run into OpenStack being a bit slow, I go back to my childhood and think, hey, I was watching Star Trek when that dude said, "Scotty, we need more power." And the answer is that today the warp drives are maxed out. We need a new warp drive. So why do we care so much about performance? What are we doing that actually benefits from OpenStack being quicker than it is today? The answer is, largely, we're building a platform as a service that we call AirSpace. AirSpace is very similar to other platforms as a service that you folks in the audience will be aware of; you could compare it to Cloud Foundry. It's very LivingSocial-centric in that, right now, it only does Ruby and a couple of other things. With a platform as a service, when you want to roll out a new version of an app, normally you're introducing a whole new set of instances running the new bundle of the application, and you're removing the instances that were hosting the old version. This means we have pretty high instance churn. Over the course of a day, a week, or a month, we deal with hundreds, thousands, or tens of thousands of instances being created and destroyed. The speed at which we can create and destroy them is important to us. We don't really want to be waiting. You want to deploy a new application; you want it to get out there safely, but you also want it to get out there quickly. When you look at seasonal demand and auto-scaling in a platform as a service, you have to keep enough slack capacity to deal with any normal capacity increase that could arrive within your provisioning interval. If it takes you five minutes to get new instances online, you have to say, OK, what if my site gets posted to reddit.com and I get hammered for five minutes? It's going to take me five minutes to react. So it's your reaction time. 
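That slack-capacity argument is just arithmetic. A hedged sketch, where the growth rate and per-instance capacity numbers are made-up illustrations, not AirSpace's real figures:

```python
def slack_instances(peak_growth_rps_per_min, reaction_minutes, rps_per_instance):
    """Instances you must keep idle to absorb demand growth during your reaction time."""
    demand_growth = peak_growth_rps_per_min * reaction_minutes  # worst-case extra load
    # Round up: you cannot run a fraction of an instance.
    return -(-demand_growth // rps_per_instance)

# Same hypothetical traffic, 5-minute provisioning vs. 10-second provisioning:
slow = slack_instances(peak_growth_rps_per_min=120, reaction_minutes=5, rps_per_instance=100)
fast = slack_instances(peak_growth_rps_per_min=120, reaction_minutes=1 / 6, rps_per_instance=100)
```

Shrinking the reaction time from minutes to seconds shrinks the idle capacity you pay for in direct proportion.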
If we can get instances spun up and online more quickly, that makes developing a platform as a service significantly easier, because our reaction time is a lot lower. Now, there'll be a bunch of you in the audience who've used public cloud, a bunch who've used private cloud, and a bunch who've used a mixture of both. The question you have to ask yourself is, when you're spinning up instances, how long does it take to get them online to an SSH prompt with working networking? Now, I've used public cloud as well. It varies by cloud provider, but the answer is usually high numbers of seconds or low numbers of minutes. And that is kind of crappy. It's not their fault that things take this long to spin up; a lot of it is down to software. And that should become apparent when we start talking about our hardware stack. Our goal is that we need to be able to get, in the 95th-percentile case, instances online in 10 seconds or less. This is actually completely achievable, because popular images, such as the Ubuntu Precise image we use, are going to be cached on most of the Nova compute nodes within a fairly short amount of time anyway. So you're not transferring the image from Glance across the network; you're usually just pulling it from the local cache on the Nova compute host. So yeah, in summary, performance does matter. Other than AirSpace, what can you do with a cloud that's quicker? Well, everyone's talking about testing these days. There are a lot of sessions here covering how we improve OpenStack through more gates to trunk. I'm sure that all of you are developing code that has tests, whether that's simple unit tests or complex integration tests. And a lot of you will be spinning up virtual machines using Vagrant or jclouds, or you might have Fog or many other things tied into your pipeline. The problem is that spinning up those instances is right now quite slow. 
So even if you only spend five seconds actually testing the code once the instance is online and has your code on it, you could spend 50 seconds waiting for that instance. The seasonal-spikes thing I already talked about. Happier customers spend more money; this is a proven fact. If the website takes 10 seconds to respond, then they think, screw this, I'm going to go to someone who can deliver me a website in less than 10 seconds. And engineers are effectively famous for grumbling. If you can't get a server quickly, it takes a week or two weeks to get physical hardware. That's the big boon of cloud: you can get servers on demand, within certain capacity limitations of your overall cloud, quickly. So what do we do? Right now I've talked pretty much solely about problems, and it needs to be a glass-half-full equation, not a glass-half-empty equation. Solutions are undoubtedly better than problems. However, this is a really big space. We have Nova, Swift, Keystone, Glance, and all these other things that are coming in, and the metering thing, whose name has escaped me for the moment. There we go. And how do you tackle that across the board? It's a pretty daunting thing. I firmly believe that when you're looking at performance, dividing and conquering and breaking down the problem space is absolutely the first step. For us, that just means taking a two-pronged approach and looking at our hardware and our software. Now, we're currently reasonably traditional. We have lots of applications running on one piece of physical iron; we've got very little virtualization going on in our infrastructure. So the private cloud we're building is to enable us to move to a model where we have an app in an instance, and it's the only app in that instance. That gives us much tighter control over system and application dependencies, fewer conflicts, and generally things work better. However, this comes with a bit of a warning. 
What we pick, in terms of hardware especially, is not necessarily gonna work for you. Hardware is one of those things where you need to figure out how much capacity you need, whether you wanna pay the premium to go for solid state, whether you wanna go for 10 gig, whether you wanna go for 40 gig. What works for us is not guaranteed to rock your world. However, if you're building a relatively small cloud and what you want is for it to go really freaking fast, then yeah: hardware. Where would we be without it? We wouldn't have a cloud, that's for sure. Hardware consists primarily of servers and networking. We worked for quite a long time to figure out what the appropriate server for us is. And one of the applications that we really wanna run on the cloud is databases. Databases are a real pain in the ass because they use a lot of random IO. In traditional clouds, many, many public clouds (Amazon's recently changed this by introducing an SSD instance), the ephemeral IO performance is terrible. And the persistent IO performance through something like EBS can be pretty hit and miss. I've heard a lot of stories about people requesting an EBS volume, benchmarking it, throwing it away, and doing it again and again and again until they get one that has acceptable performance. You'll see talks from major Amazon users saying that. We actually don't use persistent storage at all. We're ephemeral everywhere. And for cases where we care about the data and we want data persistence, we have MySQL master-slave replication and things like that going on. The stack that we've got is entirely solid state. No hard drives in our private cloud at all. That may sound a little bit extreme, and when I saw the number at the bottom of the quote for buying it, I thought that as well. But it delivers us significant benefits. The RAID card we actually had to have certified by the vendor specifically; it's unusually performant with SSD workloads. 
The sequential throughput that you get from a RAID 10 array of SSDs normally gets pretty close to maxing out your PCI Express bus, which is entertaining to watch. You know, you run dd and IO benchmarks and think, wow, this is great. But databases only do a bunch of sequential workload when they're doing backups and things like that; the day-to-day workload is usually pretty random, from selects, inserts, updates, deletes. What has this got us? Well, we measure things in cloud units, and we allocate pretty much identical instance sizes to platform-as-a-service workloads. That isn't exactly right-sizing. Some things are more CPU-intensive, some things are more memory-intensive, some things are more disk-intensive. But it is easy. The nodes that we've got today give us the correct balance of CPU and memory. We can get a really good amount of capacity loaded onto one server. We get exceptional ephemeral IO performance; we've benchmarked each of the boxes at around about 200,000 IOPS through the RAID card. However, we're not using enterprise MLC, and I made a specific note of that. The write durability of consumer MLC is significantly lower. It's very important to understand what your write workload is on a day-by-day basis, because eventually you are gonna wear out the NAND and the drive is just gonna die. SLC is there for extreme write durability; it also comes with extreme cost. And as an example of cost: consumer MLC, you're looking at maybe $200, $300, $400 a drive depending on capacity. Enterprise MLC, you're looking at $1,000-plus per drive for essentially the same capacity. We've actually taken the decision, in building this hardware platform, that we don't really care that the drives will die in 18 months' time. The operational cost of replacing them is by far and away justified by the IO performance that we get as a result of running SSD versus hard drive, or versus running eMLC. Now, the key thing: servers shouldn't be a bottleneck. 
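The wear-out reasoning is simple arithmetic. A hedged sketch, where the endurance rating, write rate, and write-amplification factor are illustrative assumptions rather than LivingSocial's actual numbers:

```python
def drive_lifetime_days(endurance_tbw, daily_writes_gb, write_amplification=2.0):
    """Days until the NAND's rated endurance is consumed at a steady write rate."""
    total_writable_gb = endurance_tbw * 1000.0          # TBW -> GB, decimal units
    effective_daily_gb = daily_writes_gb * write_amplification
    return total_writable_gb / effective_daily_gb

# Hypothetical consumer-MLC drive rated around 70 TBW, seeing roughly 64 GB/day
# of host writes with 2x write amplification: about 18 months, which is the
# replace-them-and-eat-the-cost planning horizon described above.
days = drive_lifetime_days(endurance_tbw=70, daily_writes_gb=64)
months = days / 30.4
```

Knowing the day-by-day write workload is exactly what makes this calculation possible, which is why the talk stresses measuring it.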
We really wanna get to the place where our software is the bottleneck, where OpenStack is the bottleneck. Servers not being the bottleneck is this wonderful land. But servers not being the bottleneck doesn't really work if you've also got network bottlenecks. And clouds tax networks more than many traditional enterprise workloads. In addition to the multi-tenancy that you have on an individual compute host, you've got multiple network streams coming in and out, whereas previously you might just have had one database stream coming in and out. So the network is kind of important. We kind of fell in love with our vendor on this one; I'm not ashamed to admit it. I really like Arista. We've been doing some very, very neat things. We run two times 10 gig down into the server, it's non-blocking on the switch, and we're 40 gig up to the availability-zone spine. We are only two-to-one oversubscribed at any point, and we have non-blocking connectivity in the rack. The really, really neat thing about what we're doing here is that the switch runs Linux. You can log in, do a show run, you can pipe it to grep, you can pipe it to more, you can drop to a bash shell. You can access all of the switch's internals through the Python SysDB bindings. We are not doing as much automation on the switch today as we would like. Automation on switching is one of these new, developing fields, and I can see Mara looking at me from the audience just down there. We're getting pretty close to the point where you can run Chef on a switch, you can run Puppet on a switch. You can export Facter facts or Ohai data from the switch and send it back to the Chef server. So we can now discover, from both ends of the network, the application and the server, connect the two together, and configure VLANs on ports automatically, that kind of thing. The network shouldn't be a bottleneck either. 
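The two-to-one figure falls straight out of the link math. A sketch with hypothetical port counts (the rack layout here is an assumption for illustration, not the actual deployment):

```python
def oversubscription(servers, downlink_gbps_per_server, uplinks, uplink_gbps):
    """Ratio of total server-facing bandwidth to spine-facing bandwidth."""
    down = servers * downlink_gbps_per_server
    up = uplinks * uplink_gbps
    return down / up

# e.g. 16 servers at 2x10G each in the rack, with 4x40G uplinks to the spine:
ratio = oversubscription(servers=16, downlink_gbps_per_server=20, uplinks=4, uplink_gbps=40)
```

Inside the rack the switch fabric is non-blocking, so the only contention point is that rack-to-spine ratio.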
If you're building a cloud and the network's a bottleneck with the kind of hardware that we've got, you've not really done it right; you've got an imbalance. So what do we run for software? Well, I shouldn't really be talking here if we weren't running OpenStack. We are in production running Cloudscaling OCS on an Ubuntu 12.04 operating system. It's running a pretty straight KVM hypervisor. It is OpenStack Essex, there's no two ways about it. There are people doing interesting things with OpenStack from a packaging and vendor perspective, but the Cloudscaling OCS offering is fundamentally OpenStack Essex today. And they've been talking this week about OCS 2.0, which is their move to Folsom, and we should be adopting that in the near future. That means that, from a what-do-I-do-with-this-cloud perspective, I understand it. I can log into those boxes. They run the same services as all of your clouds. What it has bought me is support and time to market. However, most of the stuff covered in this talk we've actually been doing in our dev environment, which will be very familiar to all of you: DevStack. DevStack's been used for quite a while as a gate to get changes into OpenStack. If you break DevStack, hey, your change doesn't get merged. So where should you do performance analysis with DevStack? Well, you should probably try to get close to the hardware you're actually running in production, because otherwise it's not really an apples-to-apples comparison. There are certainly operational tweaks that you can do, and OCS does a whole bunch of these to make your clouds faster. But when you get right down to it, whether you've got lots of API nodes behind a load balancer or one API node behind a load balancer, the lowest common denominator is the single API node. If you can improve the performance of that, then as you scale out more of them you benefit from the performance even more. It's a multiplicative deal. 
As I said, we grabbed most of our data from DevStack today. We actually have patches that we're gonna be distributing for the application performance management component that we're gonna talk about, and we're hoping to get those folded into our OCS install pretty soon. So yeah, what now? So far I've talked at you for apparently 35 slides and haven't really told you anything about solutions, other than, hey, we picked some awesome hardware. Well, when you've got hardware and software, how do you know that things are broken? Support calls are really, really imprecise. You've all been there. You get this guy on the other end of the line and he says, my email didn't turn up in three seconds flat. And you say, but email's not an instant-messaging kind of protocol; there's a whole load of handoff and queuing and stuff that happens. But he doesn't care. He thinks he's in the right, and it's not fast enough, and it damn well better get quicker. And the same applies to the cloud. People get in touch and they say, my instance took five minutes to provision, what are you doing? And yeah, that sucks, but we need better data, and monitoring is how we provide it. The old-school thought, and this still has a place in all of your infrastructures, is: is the service listening on the right port? Can I do a simple request and get some valid data back? This is the guy that pages you at three o'clock in the morning. It's the really annoying guy that pages you at three o'clock in the morning saying my service is down, and you need to get up out of bed and go fix it, or someone needs to. This applies at both the system and the service level. At an individual system layer, you've probably got HA at the service level. If a component crashes on a system, say Nova Compute dies on a compute host, yeah, life carries on. You probably want to go fix that at some point in time. Is it worth waking you up at three o'clock in the morning? Yeah, maybe not. 
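The old-school check described above is a few lines of Python. A minimal sketch, assuming a plain TCP service; a real check would go on to issue that simple request and validate the data that comes back:

```python
import socket

def service_is_up(host, port, timeout=2.0):
    """Old-school liveness: can we complete a TCP handshake on the right port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. page someone if a (hypothetical) Nova API endpoint stops listening:
# if not service_is_up("nova-api.example.com", 8774):
#     page_oncall()
```

This is the binary, one-or-zero signal the talk contrasts with latency-based monitoring: it tells you the service is dead, not that it is slow.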
If the entire Nova service stops responding to API calls and you can no longer get new instances or stop instances, you probably want to get out of bed and fix that one. But what gets you out of bed is a determination that needs to be made on a case-by-case, business-by-business basis. Now, in the web application world, and we do a whole bunch of web development with Ruby on Rails, there is this new thinking. And the new thinking is that we care about the end user and we care about their experience. So you hear a lot about people doing A/B testing for site changes: do we expose this new feature to all of our users, or to 1%, and see if it increases the sales volume that we get from that sample group? This typically will answer things for you like, how long did the site take to show on the end user's PC? And that'll include rendering time, time spent in the actual web server, time spent in the application. The nifty thing about this is that you can actually break down, to an individual component and action level, where your application is spending all of its time. So for every single function that you've got in Rails, for every single controller, you can say, what's going on? The difficulty with something like OpenStack is that we have a user interface, it's called Horizon, it's Django, and it's pretty easy to strap New Relic or some other APM system onto. But that doesn't tell you much about all of your Nova APIs, Swift APIs, Glance APIs. And most users are consuming the cloud through an API, not through Horizon. Those of you who are operating internal clouds may have developed your own web interface, which is more intuitive to users in your business; it may have chargeback capability. If you're operating a public cloud, you may have your own control panel. So the lowest common denominator, again, is the actual API. It's the individual API node. 
It's the request, that one request, which says, create me a new instance, or give me the details of an instance. And the difference here is that instead of being alerted on a one-or-zero, binary basis (is the service up, is the service down?), what I'm interested in is the average (well, not quite the average) user experience. So if a decent percentage of users' experience is slow, say 10%, I probably need to go fix that, because customer satisfaction is gonna drop. However, if it's just an outlier, and we get a whole load of those in OpenStack, then I probably don't care. The difficulty is that our applications are quite complex. A create operation against a Nova API is naturally more expensive than a "show me something that's already running". So you have to be able to break it down at a controller and an action level. Anyway, maybe some of you learned something from those slides so far; most of you didn't. So we should move to pretty pictures, because then I can talk about what we're actually doing. I go to a lot of operations conferences, and things falling over, things going badly, is always referred to as disaster porn. There are certain things in OpenStack today, when you actually run a cloud over, say, a month or two months or six months, where it doesn't really deal with this ever-growing number of instances in the database. So you get pretty close to disaster porn: you get this nasty spike or increase in query latency. The first thing that we do is really simple. We monitor this at the OpenStack layer and also at our individual application layer. We use a tool called Boundary. There's no reason that you can't do this inside your own data center; this happens to be a software-as-a-service solution. They have some pretty high-profile customers from the web world. I think they have GitHub signed; they certainly have LivingSocial signed. 
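Alerting on "a decent percentage of users are slow" rather than up/down boils down to a percentile check. A minimal sketch, with a made-up latency threshold:

```python
def percentile(samples, pct):
    """Nearest-rank percentile; good enough for an alerting decision."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def should_page(latencies_ms, pct=90, threshold_ms=500.0):
    """Page when the slowest ~10% of users cross the threshold; ignore lone outliers."""
    return percentile(latencies_ms, pct) > threshold_ms

# 95 fast requests plus 5 huge outliers: the 90th percentile stays healthy, so
# no page, even though a naive max() check would have screamed.
calm = [40.0] * 95 + [4000.0] * 5
# 20% of requests degraded: the 90th percentile crosses the line, so page.
degraded = [40.0] * 80 + [900.0] * 20
```

This is exactly the distinction drawn above: outliers are tolerated, but a broadly degraded user experience is not.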
And what it does is, you tell it which servers are running which applications, and which ports are involved, and it looks at all of the cross-chatter. Every single flow over your network gets reported back through IPFIX. So it's an open protocol; there's no magic going on. What they do is take all of that IPFIX data and present it to you in a useful way. I don't know how many of you are network engineers and have worked with, like, sFlow and NetFlow collectors, IPFIX collectors. In general, they all suck until you spend a hell of a lot of money. And this is pretty much democratizing the concept of IPFIX for your servers. Now, the really, really neat thing, and I only discovered this when I started at LivingSocial, is that we send a lot of email. I don't know how many of you get our emails, but we send a lot of email to a lot of people that subscribe. And I was logged into Boundary in my first week at LivingSocial, looking at the overall TCP network flow of our data center, and all of a sudden we spiked up from doing very little traffic to an additional, you know, some number of gigabits of network traffic. And I was like, what's going on? So I drilled down into the traffic and discovered that it was just an email burst going out, and that these happen three, four, five times a day. So I saw this real-time spike, and then I went back and noticed it's actually entirely normal; we just send a lot of email. But yeah, I talked about things getting worse over time, and this is an example of things getting worse over time. Latency trends tell you a lot more than your average monitoring system. Even if you're applying some fairly smart trending, you don't notice the spike if it's shallow enough. You know, your software is gonna tell you, hey, you're just more popular, you're just doing more business. I would hope that monitoring would notice a spike like that, but who knows? 
And this particular spike is, believe it or not, over about a two-hour period. Now, we were hammering OpenStack to pieces when we were running this particular test, but that's a spike in Glance's response time over a two-hour period. Well, it's Nova, but most of the time is spent in Glance. And it's because, when you run a "go get me a list of my currently running instances", it's going to Glance every single time to fetch information about the image and everything else that's going on. And it turns out that's really slow when you've got a lot of history in your database tables, and we probably need to think about pruning. So before I get on to the rest of the demo stuff: we've actually done some patching. We're using Tracelytics. It's another one of the software-as-a-service things; it's not free. If someone in the audience likes the idea of some of the stuff that I'm showing and wants to go and do this in open source, more power to you, I will happily use it. But it's not that expensive, and it has given us really valuable insight into what our cloud is doing. It patches in relatively easily: it's about a five-line patch to Glance, a five-line patch to Nova, and that particular GitHub repository has instructions for you to apply it. It's basically a three-step process: patch and bounce the OpenStack process, preferably after installing the library. If you don't install the Oboe middleware, it will no-op; it has a try/catch, so it won't hurt you. And you also need to install a little agent on the box, which takes all the traces and sends them off to their cloud service. And then you get a whole load of pretty displays. So what it tells you is stuff like this. You feed a whole bunch of requests into individual services. You can tell it about individual applications, you can tell it about multiple servers, you can tell it about many other things. And it tells you how slow things are. 
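For a flavour of how that kind of patch hooks in, here's a minimal hand-rolled sketch of a timing middleware wrapped around a WSGI app. It's in the same spirit as, but is not the actual API of, the Oboe instrumentation:

```python
import time

class TimingMiddleware:
    """Wrap a WSGI app and report per-request wall-clock time to a collector."""

    def __init__(self, app, report):
        self.app = app
        self.report = report  # e.g. the local agent that ships traces off-box

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self.report(environ.get("PATH_INFO", "?"), elapsed_ms)

# Tiny demo app standing in for a Nova/Glance API endpoint:
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

samples = []
wrapped = TimingMiddleware(demo_app, lambda path, ms: samples.append((path, ms)))
body = wrapped({"PATH_INFO": "/v2/images"}, lambda status, headers: None)
```

Because OpenStack services are WSGI pipelines, a wrapper like this is why the real patch can be only a handful of lines per service.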
And if you start seeing this increasing, that's a kind of nasty indicator that things are getting bad. The same can be applied to Keystone; the same can be applied to Nova. Now, that's interesting, but that's an aggregated graph: all of the controllers and all of the actions. So this spike right in the middle, around 40 milliseconds or 38 milliseconds: if that was the 90th percentile out at like 140, 160, then hey, we've got problems. 40 milliseconds isn't too bad. When you wanna see how much time it's spending elsewhere: this is calls to Nova, and Nova obviously is making a bunch of calls to Keystone, Glance, and other services. But the really neat thing is that we can see how much time is spent in the HTTP client making calls. It actually pulls together the traces. So it's not as simple as just taking traces and doing timing; it has the relationships between the traces. You can see that the HTTP client is initiating this connection to Glance, it traces that Glance interaction as well, and it then bolts it all together into this overall graph. This is a whole bunch of requests over not that long a period. If you're spending up to a second (and this graph doesn't really catch the outliers very well), if you're spending up to a second just waiting for Nova to respond, and you want your instance to be online in 10 seconds or less, you've just burned 10% of your time waiting for Nova to get out of bed. That's not particularly useful. This screen is much, much, much more useful, because every single one of those dots is a traced interaction between a user and your service. Again, this is Nova. I'm gonna pick on Nova here because it's the most complex of all of the pieces. Like I said when I was talking about Boundary, all of this is feasible to do. Tracing, timing: I think the Swift guys already have some stuff built into their middleware for doing basic timing analysis. 
The problem is that you gather all of this data, and there are relationships between all of the various calls that services are making to each other, and you don't really know what to do with it. It's not presented in a consumable manner. You give a developer, like, a five-meg tarball of, here are 100 traces, 200 traces, and they go, well, okay, great, but this isn't helping me hone in on my problem any. With this particular interface, you can go click on an individual area, and it'll drill down and just show you the individual five, 10, 20, however many traces for that area. And the bottom line is that all of those dots at the bottom are really nice, fast operations. This only goes up to 1.6 seconds; there are operations in Nova today which are significantly slower than that. There are fewer of them, but those are the dots that we need to be bringing down over time in order to improve overall service performance. The ones that go over 1.6 seconds I haven't included here, because they're outliers as far as the overall picture goes, but they're outliers that we understand. Those are usually create operations, and believe it or not, those usually take about five or six seconds. Now, someone is at some point going to ask me, well, okay, you've got lots of pretty dots on that scatter plot; how the hell do I tell where time is being spent in OpenStack, so I can go and make it better and fix it? Well, that's a perfectly valid question. And the answer is that you go and look at an individual trace. You can get this as a graphical, big page that you can scroll down, and indeed this has like six pages of information beneath it, for every single one of those entries and exits. It times down to the millisecond. For any SQL queries that are being made, it gives you the full query, so you know exactly how much time you're spending in the database. Now, I don't know how visible that is from the back. I apologize. 
One of the interesting things that you can see is that we make a hell of a lot of calls to Keystone. Over the last 24 hours of hammering Nova, we've noticed that you make like 1,000 or 1,500 calls to Nova, and that results in about 18,000 calls to Keystone. Now, when every call to Keystone takes around about 80 milliseconds to respond, even at the best of times, this can add up. The saving grace is that, as you can see just over here, quite a lot of the Keystone calls are actually made in parallel. So you're only eating 80, 100 milliseconds to make a whole bunch of calls to Keystone. That's the saving grace. The downside is that all of our SQLAlchemy stuff is being serialized. So even if your database query is relatively quick, you're still entering and exiting SQLAlchemy a hell of a lot. That's a really difficult thing to solve; I'm reliably informed it's down to how eventlet is being used. But it is basically adding a number of seconds to some of the more expensive queries, and that's a problem. So what do we do? Well, we can go and look at an individual trace and try to make it faster. But me doing that on an individual basis, I can try to identify the things that I care about most, that deliver the most bang for buck, and I can contribute those changes back. And that's great. But I'm only an individual, and the team that I'm on is quite small. We're sure that we could get some help from Cloudscaling on these efforts, but we fundamentally need to treat this as a community problem. If we take some of the larger users, and I can see HP people in the room, I can see Rackspace people in the room, you're operating clouds at a larger scale than most people. If you start actually looking at all of this, and I'm sure that you are, the changes that you make bubble downstream to everybody who is using the software. So it doesn't really matter who makes the change; we all benefit. 
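One obvious way to attack an 18,000-calls-at-80-milliseconds problem is to stop making most of the calls. Here's a hedged sketch of a TTL cache for token validations; the interface is hypothetical, not Keystone middleware's actual API:

```python
import time

class TokenValidationCache:
    """Remember recent token-validation results so repeat checks skip Keystone."""

    def __init__(self, validate, ttl_seconds=300.0, clock=time.monotonic):
        self.validate = validate      # the expensive call out to Keystone
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}

    def check(self, token):
        now = self.clock()
        hit = self._cache.get(token)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]             # fresh enough: no round trip
        result = self.validate(token)
        self._cache[token] = (result, now)
        return result

calls = []
def fake_keystone(token):
    calls.append(token)               # stands in for an ~80 ms HTTP round trip
    return {"token": token, "valid": True}

cache = TokenValidationCache(fake_keystone)
for _ in range(100):
    cache.check("abc123")             # 100 checks, only the first hits "Keystone"
```

The trade-off is staleness: a revoked token stays valid for up to the TTL, so the window has to match your security requirements.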
So what I'd like to talk about, and we can do this through the Q&A, is how do we make things better at an accelerated pace? I already said that Cactus to Diablo was good, Diablo to Essex was good, Essex to Folsom was good, and the number of contributors in the community is increasing. So Folsom to Grizzly is looking good already. But how do we make performance a priority? We have a security working group; I can see one of the guys involved in that at the back of the room. Performance is a whole-stack thing. There's a lot of interdependency and complexity between Nova, Keystone, and Glance. That's the nature of the beast. It does mean that, as a Glance team or a Nova team, you can certainly go and fix things in Glance or Nova, but we do need to step back and take the big-picture view of the whole software stack at some point. In a similar way to how Vish is talking about introducing Grenade as a Nova gate, so that we can test rolling forwards from one version to another, I would like to see a better understanding of the performance implications of code that people are merging to master. And I think this is probably feasible and probably not that complicated. We already have the DevStack gating: if you break it, your code doesn't merge. We need to hook more capability in there to understand exactly what's going on. Now, there is another talk coming up later today about the millions of options in Nova's configuration and how you can make Nova go faster. That's awesome, but we also need documentation covering it on OpenStack.org. And we do need to start asking ourselves, how do we fix the bigger architectural issues? The serialization around SQLAlchemy is, I'm sure everyone will tell me, absolutely necessary. But it also sucks from a performance perspective, because you're serializing a whole bunch of things that each take a certain amount of time.
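A performance gate along those lines could be as simple as comparing a benchmark run against a stored baseline and failing the merge when the regression exceeds some budget. A minimal sketch, with made-up names and a made-up 10% threshold:

```python
def regressed(baseline_ms, measured_ms, tolerance=0.10):
    """True if the measured timing is more than `tolerance` slower than baseline."""
    return measured_ms > baseline_ms * (1 + tolerance)

# Example: the stored baseline says an instance-list call takes 120 ms.
within_budget = not regressed(120, 125)  # 4% slower: merge proceeds
blocked = regressed(120, 150)            # 25% slower: gate fails the change
```

The hard part in practice is getting stable enough numbers out of CI hardware that a 10% threshold isn't all noise, which is why per-patch timing usually needs repeated runs or a dedicated benchmark host.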
So you're committing yourself to saying this is going to take 700 milliseconds minimum to service. Whereas if you could safely parallelize some of those queries, then you could make big savings pretty easily. Well, big savings are difficult. I'm going to quickly run through some credits before I go to Q&A, because someone paid me to come here, and I should say so. I work for these guys. We use these guys' software; it's awesome. These guys do awesome networking stuff; if you like Linux on your switches, you should check it out. And TraceLogic spent a whole bunch of time with me actually getting their stuff working. We actually got a phone call from them yesterday saying, we've been looking at the traces that you're doing, and we think our tracing software is broken. We've never seen anything that does this many interactions in this kind of way, so we think our tracing software might be the problem. And the answer was, actually, no, that's just how OpenStack works today. Which is, yeah. When the performance company phones you up to say their own tracing software is broken... So, thank you very much for listening. I apologize again for being late; I have consumed about 30-odd minutes of your time. I am interested in any questions. The question was, does adding all of this tracing actually introduce a performance impact itself? The answer is yes, it does. The way that the tracing works is on a sampled basis, so you're not sampling every single hit that comes into an API. The sampling level is configurable. The right way to think about it is that you're looking for a sampling level that has statistical significance. For a lot of people, that's around 10, 20, or 30%. It does depend on how many hits per second or per minute you're doing. The tracing overhead with this tracer in Python appears to be about 1.5%. If we were looking at 15% overhead and a 30% sample rate, then, okay, wow, we just introduced a pretty big performance penalty just by tracing.
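Sampled tracing of that sort can be sketched in a few lines (the function name is hypothetical; the point is the arithmetic: a ~1.5% per-trace cost at a 20-30% sample rate amortizes to well under 1% overall):

```python
import random

def should_trace(sample_rate):
    """Decide per request whether to record a full trace."""
    return random.random() < sample_rate

# With a 20% sample rate and ~1.5% per-trace overhead, the amortized
# cost across all requests is roughly 0.2 * 1.5% = 0.3%.
random.seed(42)  # fixed seed so the illustration is repeatable
sampled = sum(should_trace(0.2) for _ in range(10_000))
# sampled lands close to 2,000 of the 10,000 simulated requests
```

That is the sense in which 1.5% gets written off, while a hypothetical 15% per-trace cost at the same sample rates would be a real penalty.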
One and a half percent is something that we usually write off. The language-specific implementations differ slightly because of the way they're architected, but Ruby and Python are actually pretty good with this software. Sure, great question. Open sourcing our platform as a service: yes is the short answer. We are still frantically developing it. It turns out that when we thought, hey, let's go and write our own PaaS, we thought it would be easier than it is. Turns out it's actually quite hard work; there are a lot of moving components. The architecture is actually really neat, and I will certainly be encouraging some of the guys on that team to turn up to conferences in the near future and present it. As to when it will be released, I'm not sure. I believe it's slated for around the start of next year, but I could be wrong on that. It is designed such that adding more languages to it is pretty simple. We're obviously pretty focused on Ruby on Rails, but I'm sure that we'll probably throw it out there with three or four languages supported and ask for contributions. Can you put the URL back up? For my slide deck, yeah, I can skip to the next slide because it has a helpful reminder. We are basically hiring as fast as we can find people. I'm pretty sure that everyone has one of these slides in their deck; it seems to be an aberration when someone doesn't say at the end of their talk, "we're hiring." If anyone's interested, we have interesting ops problems and interesting engineering problems, and we'd love to talk to you. Any more questions? Nope, we're good.