All right, cool. First of all, thank you all for coming. Today I'm going to talk to you about OpenStack at Two Sigma, a quantitative investment management company in New York City. My name is Xu Chen, and I also go by Simon. I currently manage a very talented group of engineers who leverage OpenStack as the building block for a large-scale, high-performance, and highly available private cloud that supports a diverse set of workloads at Two Sigma.

In case some of you don't know about Two Sigma: it was founded in 2001 by a statistician and a computer scientist, John Overdeck and David Siegel, with the goal of applying cutting-edge technology to the data-rich world of finance, which provides a fertile ground for exploration. We have a vast amount of data, fast feedback, and unbounded opportunity. We currently have about 750 employees, over 500 of whom are software engineers, and together we manage about 25 billion dollars in assets globally.

As a technology company, Two Sigma has a fairly large compute infrastructure deployed across multiple data centers in the United States. Before we started the OpenStack project, we already had a fairly sophisticated build system that could consistently and automatically deliver physical machines from a variety of vendors to application owners. The drawbacks of that system were obvious: very long turnaround times and a lack of sharing across different applications. So we decided we really wanted to leverage the latest and greatest cloud technology to revitalize our infrastructure and address those problems, and at the same time push our application owners to rebuild, or build, applications that are distributed, that scale out, that are suitable for the cloud, and that cut ties to the physical infrastructure entirely.

When we decided to build a private cloud there were obviously a number of options available, OpenStack being one, along with CloudStack and Eucalyptus. Back in 2013, OpenStack was already a pretty clear leader, and I can tell you honestly, after 18 months of working on this project, we have no regrets about going with OpenStack at all. Today is really about the journey we have been through. I just mentioned a little about how we got started; I'm going to talk about where we are now, and hopefully, at the end, a little about where we hope to get next.

All right. Before I get into the more technical details of our cloud deployment, I do want to mention a few things about enterprise cloud integration. What does it really mean to build a private enterprise cloud versus a public cloud? The biggest difference is probably policy control. The public cloud is like the wild west: everyone can do whatever they want, everyone has root, and if you want to roll your own images you are free to do that. A private cloud is more about policy controls: you want people to be able to do what they want, but under certain rules. I'm not going to go through all of these, but essentially, in our environment, no one has root, you cannot roll your own images, and you cannot install arbitrary packages. What I do want to mention is that even with the existing OpenStack framework, it is actually very easy to enforce those enterprise policy controls, simply by customizing, for example, the policy.json of each individual OpenStack component.
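As a purely illustrative sketch of that kind of restriction: the rule names below come from a stock Glance policy file of that era, the "cloud_admin" role is a made-up stand-in for whatever role a cloud team might use, and the real file is JSON — it is shown here as the dictionary it parses to.

```python
# Illustration only: this kind of rule lives in each service's policy.json;
# shown here as the dictionary that file parses to. Rule names vary between
# OpenStack releases, and "cloud_admin" is a hypothetical role name.
GLANCE_POLICY = {
    "context_is_admin": "role:admin",
    "default": "",
    # Only the cloud team may register or publish images -- users cannot
    # roll their own.
    "add_image": "role:cloud_admin",
    "publicize_image": "role:cloud_admin",
}
```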
The second difference is probably the level of expectation your cloud users have. On a public cloud you essentially get a set of API documents, you are given a forum to complain in, and if your VMs are really slow, well, tough luck. For a private cloud there is more hand-holding involved, in the sense that you really want your application owners to succeed, because in return that is a success story for your private cloud as well. Just quoting Jonathan from the keynote: we are trying to maximize developer productivity by actually doing things better on the cloud. So if there are performance problems, or any other reason application owners can't get their application to work well on the cloud, you do want to help them, because in the end that drives cloud adoption.

The final key difference is the level of integration. In any decent-sized enterprise there are usually already very strict ways of doing user authentication, and there is an existing network infrastructure with an existing IP address space design and routing policies. At Two Sigma we also already have a pretty decent software ecosystem that allows developers to easily and quickly build, test, and deploy software. You don't want to build your private cloud in a complete silo so that it smells and feels like the public cloud; you want to make your private cloud an integral part of your entire ecosystem. Then, when you spin up a VM, it is no different from an existing physical machine, except that it is faster to spin up and you can get more of them. That really helps you drive cloud adoption.

At a very high level, maybe 30,000 feet, our cloud deployment is probably similar to most others. We have multiple data centers in the United States, we deploy the cloud into many of them, we deploy multiple availability zones in each of those data centers, and we ensure that within every availability zone there is no single point of failure. One thing I do want to mention is that we chose to make each availability zone an entirely separate OpenStack deployment, so nothing is shared across availability zones. There are probably a gazillion ways to deploy OpenStack right now, but we made this decision for a number of reasons. First of all, what we found, and it is actually not that surprising, is that software is much less reliable than hardware. So having multiple availability zones depend on the same software components, say the same RabbitMQ cluster, doesn't really make much sense; you will have unforeseen failures from time to time.
Secondly, since we are trying to leverage cutting-edge technology, we try to do something slightly different in every new zone we build, and completely isolating the availability zones from each other lets us try something new each time, because it is essentially a fresh environment with fewer users, yet still a much larger-scale deployment than a regular lab environment. Finally, if we can ensure that each availability zone is completely isolated from the others, planned maintenance becomes much easier, because we can tell users with a straight face that we are going to maintain this particular zone and nothing else will be impacted, so they can steer their workloads away from that zone.

This design makes sense in many ways, but it also brings a list of challenges. One challenge is on the infrastructure side, because as a cloud engineering team we now have to build and manage multiple zones at the same time. We solve that through automation: we use Ansible, and assuming all the hardware is in place, we can bring up an entire zone in less than a day. And even though we try something new in each new zone, we then run the same Ansible plays to upgrade the existing zones, so that all the zones stay essentially consistent. This setup also has a cost in terms of user experience, because users now have to remember multiple API endpoints and multiple Horizon dashboards to work with multiple zones. So we built a number of software artifacts to help users navigate across zones, which I'm going to talk about in a little bit.

All right, so how do users get access to OpenStack, our private cloud? At Two Sigma we have a fairly sophisticated configuration management database, which we literally call CMDB. When a new employee, say Alice, comes on board, a new user entity and a Kerberos principal are created, and as Alice joins certain projects, say big data, an association is created saying this user is part of that user group. What we do is run a cron job that periodically synchronizes information from CMDB into OpenStack, in particular into the Keystone instance of every availability zone. Each user group in CMDB maps to a tenant, or project, in Keystone; the user's Kerberos credential becomes a user in OpenStack; and the membership can then be established pretty easily.
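A minimal sketch of what that sync loop might look like, using the Keystone v3 client; the cmdb module, group and role names, and the admin-token authentication are all hypothetical stand-ins for our in-house details, and error handling and pagination are omitted.

```python
# Minimal sketch of the periodic CMDB-to-Keystone sync (hypothetical names).
from keystoneclient.v3 import client as ks_client

import cmdb  # hypothetical in-house CMDB client


def sync_zone(keystone_endpoint, admin_token):
    keystone = ks_client.Client(token=admin_token, endpoint=keystone_endpoint)

    projects = {p.name: p for p in keystone.projects.list()}
    users = {u.name: u for u in keystone.users.list()}
    member = keystone.roles.find(name='member')

    for group in cmdb.user_groups():               # e.g. "bigdata"
        project = projects.get(group.name) or keystone.projects.create(
            name=group.name, domain='default')
        for principal in group.members:            # e.g. "alice"
            user = users.get(principal) or keystone.users.create(
                name=principal, domain='default')
            # The Kerberos principal becomes a member of the mapped project.
            keystone.roles.grant(member, user=user, project=project)
```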
Remember, this all happens automatically in the background, so no manual intervention is needed. Now, when this user, Alice, launches a VM, say one called hadoop1, on top of OpenStack, we have another process called Tweety Bird which listens on the notification queue in RabbitMQ. It registers the fully qualified domain name in our DNS system and pushes the new VM into CMDB, so that you have one centralized place to search for all your VMs. The reason we synchronize all this information into Keystone, instead of letting Keystone query the outside systems, is: what if CMDB goes down? What if the other app server goes down? We don't want that kind of dependency, so we push the data in this direction instead.
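As a rough sketch of the Tweety Bird idea, here is what a listener for instance-create events can look like with oslo.messaging. The dns_client and cmdb_client modules are hypothetical stand-ins for our internal systems, the payload fields are those of the legacy Nova notifications, and the transport URL is assumed to be set in the oslo.messaging configuration.

```python
# Rough sketch of a Tweety Bird-style notification listener (hypothetical
# dns_client/cmdb_client helpers; legacy Nova notification payload assumed).
import oslo_messaging
from oslo_config import cfg

import cmdb_client  # hypothetical
import dns_client   # hypothetical


class InstanceCreatedEndpoint(object):
    filter_rule = oslo_messaging.NotificationFilter(
        event_type='compute.instance.create.end')

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        hostname = payload['hostname']                 # e.g. "hadoop1"
        address = payload['fixed_ips'][0]['address']
        dns_client.register(hostname, address)         # publish the FQDN
        cmdb_client.add_vm(hostname, payload['instance_id'])


transport = oslo_messaging.get_notification_transport(cfg.CONF)
targets = [oslo_messaging.Target(topic='notifications')]
listener = oslo_messaging.get_notification_listener(
    transport, targets, [InstanceCreatedEndpoint()])
listener.start()
listener.wait()
```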
In many enterprises, and particularly in the finance world, Kerberos is the de facto user authentication mechanism. The idea is that every time a user logs on to a workstation, a Kerberos ticket is stashed onto the local file system, and the user then uses that same ticket to request access to all the other existing services. We wanted to extend the same passwordless authentication experience to cloud users. So we customized Keystone: we wrote our own Kerberos authentication plugin and made it part of the Paste pipeline. We then modified all the OpenStack clients to prefer Kerberos authentication, so they go to Keystone, get a token back, and use that same token to talk to the other, vanilla OpenStack services.

Horizon is a little trickier. When you open a browser and go to Horizon, since the user is not authenticated yet, Horizon kicks back a redirect; the browser then goes to Keystone to do Kerberos authentication, and after that Keystone redirects the user back to Horizon, except the URL now has a token embedded, so Horizon can proceed and recognize the user. None of this requires any user interaction; it all happens automatically with a couple of requests in the browser. From the user's perspective, you fire up Horizon and you are already recognized as who you are, with the list of projects you have access to. It is a pretty seamless integration.

What I just described addresses the single sign-on, passwordless authentication problem, but people still have to remember the Keystone URL, or maybe the Horizon dashboard URL, for each zone. We wanted to free them from that too, so we built a simple in-house web service called the cloud API. It is just a web service running on multiple VMs across different OpenStack availability zones behind a load balancer, the way anyone would build a web application. You can use Negotiate, that is Kerberos, authentication, talk to this single API, and say "I want to launch three VMs in the big data project in this particular zone," and the cloud API takes that request and talks to the appropriate zone in the back end on behalf of that user. It is a single entry point for users to interact with our cloud.

And since everyone likes dashboards more than APIs, we are in the process of building our own dashboard. As you can see here, it makes things even simpler: you can say you want one VM in one zone and another VM in a different zone, click a button, and the dashboard talks to the cloud API in the background, which talks to the multiple zones, so you can launch VMs across multiple zones in one shot. There is also a simplified view where the user just says "I want five VMs" and we schedule them wherever we want, so launching VMs becomes pretty much one click.

Besides APIs and GUIs, we have also done quite a bit of customization to help people navigate across multiple availability zones. For example, one thing we found very useful is a maintenance API. Some of our workloads are pretty computationally intensive batch processing jobs which may take tens of hours to finish, and it is not a good idea to schedule that kind of job into an availability zone which you know has a maintenance event scheduled. So we have a simple API that lets the batch job scheduler query for that and decide: OK, if this zone is going to enter maintenance, I'm going to schedule my batch jobs somewhere else.
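Because the cloud API is in-house, everything in this sketch is hypothetical (the URL, paths, and fields); it only illustrates the pattern: the same stashed Kerberos ticket drives a Negotiate-authenticated request, first to skip zones with maintenance scheduled, then to launch VMs.

```python
# Hypothetical sketch of talking to the in-house cloud API with the stashed
# Kerberos ticket (SPNEGO/Negotiate). URL, paths, and fields are made up;
# only the requests/requests_kerberos usage is real.
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

CLOUD_API = 'https://cloudapi.example.com'     # hypothetical endpoint
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

# A batch scheduler can skip zones with a maintenance window scheduled.
zones = requests.get(CLOUD_API + '/v1/zones', auth=auth).json()
usable = [z['name'] for z in zones if not z.get('maintenance_scheduled')]

# Launch three VMs in the bigdata project in the first usable zone.
resp = requests.post(
    CLOUD_API + '/v1/instances', auth=auth,
    json={'project': 'bigdata', 'zone': usable[0], 'count': 3,
          'flavor': 'm1.large', 'image': 'ts-base'})
resp.raise_for_status()
```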
As most people know, security groups are a good way to enforce cross-VM communication patterns, but the problem is that a security group is limited to one particular zone. So if you want an application cluster spanning multiple zones, it is pretty tricky and cumbersome to set the security groups up correctly. What we do is let people enter security group rules in CMDB, and in the background we synchronize those rules into each individual zone, which makes it much easier for people to work with (there is a small sketch of that synchronization at the end of this part). We did something similar for load-balancing-as-a-service: OpenStack has one, but it is also limited to a particular availability zone and a particular tenant. So we built our own custom layer that can load balance to any zone and any machines, even including physical machines. The benefit is that people can gradually shift cloud resources into their existing applications and gradually phase out physical machines, which makes the move to the cloud much easier.

All right, enough about user access; let me talk a little about our highly available deployment strategies. Probably many people are already doing this, but our philosophy is to deploy three instances of every piece of software and make sure they can fail somewhat independently of each other. We run RabbitMQ in cluster mode with mirrored HA queues, and for the database we use MariaDB with Galera. For every OpenStack service we also make sure we run three instances behind a load balancer. In reality, in every availability zone we have three controller physical machines, on which we run multiple VMs, and we guarantee that, say, nova-api is scheduled on two different VMs across different physical machines. So even if we lose an entire rack, or lose a controller VM, nothing really matters.
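Here is the promised sketch of the cross-zone security-group synchronization, using python-neutronclient. The cmdb helper, the rule format, and make_session() (which would return an authenticated keystoneauth session for a given zone) are hypothetical stand-ins, and duplicate-rule checks are omitted.

```python
# Hypothetical sketch: push security-group rules defined in CMDB into every
# availability zone, so a cluster spanning zones sees consistent rules.
from neutronclient.v2_0 import client as neutron_client

import cmdb                      # hypothetical in-house CMDB client
from auth import make_session    # hypothetical per-zone session helper


def sync_security_group(zones, group_name):
    desired = cmdb.security_group_rules(group_name)    # list of rule dicts
    for zone in zones:
        neutron = neutron_client.Client(session=make_session(zone))
        found = neutron.list_security_groups(name=group_name)['security_groups']
        sg = found[0] if found else neutron.create_security_group(
            {'security_group': {'name': group_name}})['security_group']
        for rule in desired:
            neutron.create_security_group_rule({'security_group_rule': {
                'security_group_id': sg['id'],
                'direction': 'ingress',
                'protocol': rule['protocol'],          # e.g. 'tcp'
                'port_range_min': rule['port'],
                'port_range_max': rule['port'],
                'remote_ip_prefix': rule['cidr'],      # e.g. '10.0.0.0/8'
            }})
```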
So Essentially we had to kind of design our own H8 solution Before I talk about that actually I don't I do want to mention that I mean because of the enterprise security restriction We cannot reuse life like overlapping IP addresses So every VM actually get a real to sigma IP addresses that is globally reachable from our internal network And there's no really floating IP addresses. It's some in some sense. They make our design somewhat easier So here is just one example of a what a tenant network look like So what happens is that whenever someone creates a network? We're gonna create two logical routers Which is going to take dot two and dot three within the subnet and then we inject keep alive the instancies Into those namespaces so that they can negotiate who's going to be the dot one gateway to take care of the outbound traffic for the VMs I mean there is probably there actually some supporting keep alive Like of keep alive the right now in neutron But I mean remember what did this like 18 months ago and we actually gonna keep using this because of some flexibility It gives us about rolling upgrades which I'm gonna talk about in a bit So here's the more complete picture right so once you have a subnet then you have you create two logical routers What we do is to plug those two logical routers into two separate external networks And then we guarantee that the two different external network are associated to different top-of-rack switches So in this case What can happen is that the logical router is going to be scheduled to the layer three agent associated to? That particular rack and then we ensure that the two kind of logical routers are scheduled on to different racks for highly afford for high Availability so in this case even if we lose an entire rack or lose that logical lose that That particular layer three agent doesn't matter the other one just takes over So we do additionally is actually to establish use quagga to establish a BGP connection Between the layer three agent to the top-of-rack switches so that we can inject all this tenant subnet Into the network so that they can actually the network infrastructure can handle the inbound can actually forward Correctly the inbound traffic towards the right layer three agent for the subnet Yeah, I mean worked out pretty well so far and I mean we don't actually require We actually provide a fully HA solution without requiring Let's say a big ass layer to like spending tree across the entire data center So all the racks are actually fully routed In terms of like I there is something that we also do for example, I mean Make like kind of maintaining or upgrading neutron is actually pretty painful I would say that I mean pretty much all the open-stack components are pretty easy to upgrade except for neutron because it actually kind of Hits your production traffic So what we do here is that I mean for example if you want to upgrade this particular physical machine of this layer We're in this layer three agent what we do is kind of just kill all the keep alive the instance See so that I mean the dot one gateway is going to shift to somewhere else And then we kill this BGP connection so that obviously the inbound traffic will flow somewhere else So essentially completely dry out the traffic going through this particular layer three gauging And then therefore we can do like kernel upgrades open v-switch upgrade whatever you want And then after you're done you can kind of re-enable all those artifacts and therefore the traffic will shift back So 
All right, storage. This is another fun topic. We decided to let VMs boot only from Cinder volumes, which sit on top of Ceph with three replicas, for a number of reasons. Ceph is pretty awesome: with this setup we can live-migrate VMs, and we actually do that pretty regularly as we do rolling upgrades of our compute hypervisors, kernel upgrades, and all kinds of things. Ceph also allows us to completely rebuild the cluster. For example, we mistakenly started out using bcache and Btrfs to build our Ceph cluster and then swapped them out: we just completely wipe a Ceph node, install a new kernel, redo the file system, put it back, and let it backfill, and it just works. We have probably rebuilt the entire cluster like that two or three times so far.

Ceph is pretty nice and a lot of people are using it, but my advice to people thinking of running it in production is that you really need to know everything about it. I would say that probably 90% of our cloud downtime has actually been related to Ceph, surprisingly. You really need to understand every piece to run it well. We had to customize our hardware setup, and you really have to understand your RAID card firmware and configuration; you need newer kernels, tuned sysctl parameters, jumbo frames enabled, and the Ceph configuration and the Cinder and Glance configuration done right. Everything. That is probably a whole other talk, but it is pretty fun.

Besides the Ceph-based storage, which provides an HA story by itself because the data is replicated three ways and can therefore tolerate up to two physical machine failures, we recently started offering people the ability to launch VMs with locally attached storage, which is maybe the reverse of what most people do. With locally attached storage we make it clear to the application owners that they lose the ability to live-migrate their VM, so if anything goes wrong with the physical machine, their VM is down, possibly for an extended period of time. Most people are OK with that, because they can build HA into their application and get better performance from local storage.
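Going back to the Ceph-backed case for a second, here is roughly what "boot only from Cinder volumes" looks like from the client side: Nova creates a volume from the image and boots the VM from it. Names, sizes, and the make_session() helper are illustrative assumptions.

```python
# Rough sketch of boot-from-volume, the Ceph-backed default path described
# above (hypothetical names and session helper).
from novaclient import client as nova_client

from auth import make_session   # hypothetical keystoneauth session helper

nova = nova_client.Client('2', session=make_session('zone1'))

server = nova.servers.create(
    name='hadoop1',
    image=None,                        # no image-backed ephemeral root disk
    flavor=nova.flavors.find(name='m1.large'),
    block_device_mapping_v2=[{
        'boot_index': 0,
        'source_type': 'image',
        'uuid': 'IMAGE-UUID',          # the approved base image
        'destination_type': 'volume',  # root disk becomes a Ceph-backed volume
        'volume_size': 40,
        'delete_on_termination': True,
    }])
```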
All right, enough about the HA story; I'm going to talk a little about how we actually deploy OpenStack into production. You may ask: what release are you on, Kilo, Juno, whatever? The short answer is that we are not on any particular release; we deploy trunk into production. To many people this might sound scary, but it is actually OK, because if you think about it, the core functionality of OpenStack is pretty stable. In terms of launching VMs and attaching volumes, as long as you stay away from the fancy new features, you are generally fine. In particular, given that this is a private cloud environment, we can just tell users that they are not supposed to use those features because they are not supported. That has allowed us to leverage trunk pretty successfully in our environment.

If you sit back and think about what we really want, trunk is just one means to that end. What we really want is to quickly apply our local patches and roll them out into our production environment, and at the same time have the option to absorb upstream advances to take advantage of new features, bug fixes, and so on. Here is a quick picture of our workflow, with Nova as the example. We fork from upstream Nova into a local branch and apply a local patch: it could be one of our own modifications (we do a lot of them), or it could be a patch that is still being reviewed upstream and not accepted yet, but that we really want to use. From a particular point of our local branch, a Jenkins job takes that commit tag and reruns all the unit tests to make sure the component still builds and runs. Sometimes this fails because we messed up, and then we just apply additional patches. At some point the Jenkins job succeeds, which gives us a successful build.

At that point we have an in-house piece of software we call Packetarm, somewhat similar in spirit to other package-and-deploy tools. Packetarm takes a particular commit tag from the local repo and creates a Python virtual environment that packages all the dependencies, as well as the code base in question, into its own separate virtualenv. You then tarball it and drop it into an object store, and on any machine where you want to deploy that particular version you run a Packetarm deploy, which pulls that object from the object store and puts it onto the local file system. The benefit of this virtualenv-based deployment is that you can have, say, Nova and Keystone deployed on the same machine but each with their own dependencies. You can also deploy multiple versions of Nova on the same machine and switch between them easily, just by changing some pointers to the virtualenv.

Going back to the top of the picture: periodically we merge from upstream, whenever there is something major we want to absorb. The merge sometimes works and sometimes fails, because if we have touched something there can be conflicts with upstream, so there may be some manual merging involved. But once the merge is done, we apply the same Jenkins and Packetarm workflow to deploy the new version.
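A very rough sketch of the idea behind that build step, just to make it concrete; the tool itself is in-house, so the function names, paths, and the object-store upload helper below are all hypothetical.

```python
# Hypothetical sketch of a Packetarm-style build: install a specific commit
# of a component into its own virtualenv, tar it up, and push the tarball to
# an object store keyed by a build UUID.
import subprocess
import tarfile
import uuid

from objectstore import upload_to_object_store   # hypothetical helper


def build(repo_dir, commit, component):
    build_id = str(uuid.uuid4())
    venv = '/opt/builds/{}-{}'.format(component, build_id)
    subprocess.check_call(['git', '-C', repo_dir, 'checkout', commit])
    subprocess.check_call(['virtualenv', venv])
    subprocess.check_call([venv + '/bin/pip', 'install', repo_dir])
    tarball = venv + '.tar.gz'
    with tarfile.open(tarball, 'w:gz') as tf:
        tf.add(venv, arcname='{}-{}'.format(component, build_id))
    upload_to_object_store(tarball, build_id)
    return build_id
```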
As I mentioned, we use Ansible to handle the entire deployment flow. On the left-hand side you can see that each Packetarm build has a unique UUID associated with it. For a particular deployment you just list, for Nova, Keystone, and so on, the UUID of the Packetarm build you want to use; if you want to upgrade a particular component to the next version, you change the UUID and rerun Ansible. Then there is a role-based task list of the things needed to deploy, say, Keystone on a particular machine: in short, it drops configuration files based on templates, drops the particular Packetarm build so the code base is there, and defines how to run the service, in this case by running the keystone-all program. And this is our entire playbook for deploying Keystone with Ansible: it runs the role-based tasks first, dropping the code base and configuration, then stops the current services, performs the database migration, restarts the services, and ensures all the service credentials are provisioned. It is the same playbook we use to bootstrap a new environment as well as to upgrade an existing one.

All right, next topic: monitoring. In an enterprise private cloud environment, monitoring is absolutely critical, because you want to understand how things are going and whether performance is good, and sometimes you want to detect failures before people even notice them. You certainly don't want to wait for someone to tell you your cloud is broken before you start looking. There is a list of mundane things we do, such as service aliveness checks: we use Nagios for regular services like RabbitMQ and MySQL, and we wrote custom checks, for example using a Kerberos credential to talk to Keystone, get a token back, and then go against Nova. Those are the checks we customize to make sure everything works correctly end to end. We also have a cron job that periodically deploys VMs into every availability zone to make sure everything works; if a new VM deployment fails for whatever reason, we get an email notification and go see what is going on.

We obviously collect all the logs. We turn on debug logging by default, because storage is cheap and the debug logs give you more information than you would imagine, and we put all the data into Elasticsearch and visualize it over time. Here is a pretty fun example. Since we built our own custom load-balancing layer, we beam the API response times into Elasticsearch and track them over time. This actually happened last year: after a particular Keystone upgrade, the API response time for Keystone just kept climbing; within the course of a week it went from 100 milliseconds, which was already pretty high, to 250 milliseconds. We looked at the code and found a caching issue in the upstream code; we patched it, did another Packetarm build, ran Ansible, and pretty quickly we were able to solve it — the API response time dropped to 25 milliseconds right away.
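A minimal sketch of the kind of end-to-end check mentioned above, written as a Nagios-style plugin: authenticate, get a token, list servers. The real check authenticates with Kerberos, whereas this sketch uses a plain password plugin, and the URL and account names are placeholders.

```python
#!/usr/bin/env python
# Sketch of an end-to-end aliveness check: get a token from Keystone, then
# list servers through Nova. Placeholder endpoint and credentials.
import sys

from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client as nova_client

try:
    auth = v3.Password(auth_url='https://keystone.zone1.example:5000/v3',
                       username='monitor', password='secret',
                       project_name='monitoring',
                       user_domain_id='default', project_domain_id='default')
    sess = session.Session(auth=auth)
    nova = nova_client.Client('2', session=sess)
    nova.servers.list()
except Exception as exc:        # any failure is a paging condition
    print('CRITICAL: {}'.format(exc))
    sys.exit(2)

print('OK: token issued and nova answered')
sys.exit(0)
```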
Because we run a private cloud environment, we control the operating system and the software that runs inside the VMs, which gives us a lot of flexibility and visibility into what people are doing inside their VMs. What we do is simply install collectd in all the VMs and dump the data into a Kafka queue, and then a Camus job periodically moves the data from Kafka into HDFS for long-term storage and batch processing. For example, we run weekly processing jobs to understand who has been using the cloud, with which resources, and so on, and we are also experimenting with Spark Streaming to analyze the data from Kafka in real time. The nice thing is that Spark runs on Hadoop, and our Hadoop runs on OpenStack, so essentially we are using OpenStack resources to monitor our OpenStack deployment.

Here is a pretty interesting example. We recently started on this query because, since we use Ceph-based storage, everything is shared; everyone uses the same pool of resources, so we wanted to understand which tenant, which user, is actually doing the most disk writes. This is a pretty simple query that asks: over the course of a day, what are the aggregated I/O writes, in bytes, across the different tenants? TS Cloud Data is our data ingestion pipeline, so of course it consumes a pretty good chunk of disk writes. The light blue one is an expected heavy user, because it is a log collection application that ingests the log messages for the entire Two Sigma company; it has hundreds of VMs running in this particular deployment, so of course it does a lot of writes. What was unexpected is that another tenant, with only two VMs running, actually consumed more disk writes than those hundreds of VMs combined. It turned out to be a pretty interesting application: it uses a very specialized database that does some fancy things and a ton of disk writes. Luckily, that application already has built-in HA, so it can live with locally attached disks. We worked with the application owner to migrate the application from Ceph-backed disks to local disks, which gave them better performance and freed up very valuable Ceph resources for the rest of the cloud users.
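A rough sketch of that kind of batch query over the collectd records that Camus lands in HDFS; the path, record format, and field names here are made up for illustration, since the real job reads our internal schema.

```python
# Rough sketch: aggregate disk write bytes per tenant over one day of
# collectd records stored in HDFS (hypothetical path and field names).
import json

from pyspark import SparkContext

sc = SparkContext(appName='tenant-disk-writes')


def parse(line):
    record = json.loads(line)
    return (record['tenant'], record['disk_write_bytes'])


writes = (sc.textFile('hdfs:///metrics/collectd/2015-05-18/*')
            .map(parse)
            .reduceByKey(lambda a, b: a + b)       # total bytes per tenant
            .sortBy(lambda kv: kv[1], ascending=False))

for tenant, total_bytes in writes.take(10):
    print(tenant, total_bytes)
```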
All right, so what's next? I probably rushed a little bit. The big thing next is obviously performance. We really care about performance, and we want to bring compute, storage, and network performance to the next level. In terms of compute, there are a number of things from upstream that we plan to leverage, such as the new NUMA-aware placement and huge pages, which will certainly help with very compute-intensive jobs. In terms of storage, we are looking at, for example, pure SSD-based solutions to see how we can accelerate our applications. For network we are already in pretty good shape, but we also want to push our boundaries, say to get to full line rate; we are evaluating between DPDK-based acceleration and a more fully hardware-offloaded solution.

I hinted at this already: now that we can track the actual usage of different tenants and individual VMs, we want to start working on some kind of usage-based scheduling. The idea is that you don't want to put too many hot VMs on the same physical machine, and you want to mix and match VMs with complementary workloads on the same machine. For example, there might be an intraday trading-analysis VM that you can probably put on the same physical machine as some post-trade analysis that only runs after trading hours.

And finally, the big question is obviously container integration. I think a private cloud infrastructure is very nice in the sense that it gives you a complete operating system and helps with a very smooth transition from a physical infrastructure to a virtualized infrastructure, but containers obviously have their own advantages in terms of lower overhead and, I would say, less orchestration overhead. On the other hand, applications require some modification to live in a container-based environment. So how we combine containers with our OpenStack environment is our next big question to solve.

All right, I think I'm all done, with five minutes left. I'm open to questions.

For Open vSwitch? Oh, sure. The question is whether we release any of our customizations to Open vSwitch. No, because with Open vSwitch we maintain the same abstraction: we want the vanilla Open vSwitch plugin, or ML2 driver, to keep interacting with Open vSwitch, and everything we do is down at the Open vSwitch layer, so it doesn't really go into OpenStack proper.

Yes, hi. My question is a bit around the general reaction: how did people, and management, react when you came to them and said, we're going to do OpenStack and replace what we had? How did you deal with that situation? Was it easy and smooth, or did you have a lot of resistance?

I would say there was definitely a lot of resistance, because people hold on to their ideas: "OK, so I'm giving away my physical machines, and now I have to use this crappy VM somewhere in the cloud which I have no access to." But again, I think it is a process. Like I said, user engagement helps a lot. Not every application works well on the cloud, but every time we work with individual clients who have problems, we try to understand whether it is a resource constraint on the cloud or a problem with their application design, and we actually help them move to the cloud. Overall you make friends that way, and a lot of the naysayers become cloud cheerleaders. Thanks. Thank you.
Yes? So the question is whether we used any virtualization technology before OpenStack. The answer is yes: the build system I mentioned could build not only physical machines but also virtual machines, so we actually started using KVM a long time ago. The problem with that build system is that it treated a VM just like a physical machine: launch a VM, attach it to a local disk, put it into a VLAN, and PXE-boot it just like a physical machine. It took a few hours to build a VM; right now we have cut that down to a few minutes, so that is certainly an improvement.

You described Ceph as a tricky implementation for you — can you describe a little more what is behind it, and what response times, what performance, you are getting from Ceph?

Yeah. Right now, for every Ceph cluster, we deploy about 700 terabytes of usable storage across thirty-something machines, each with six OSDs, and we can sustain about 22,000 IOPS; we can peak at 75,000 IOPS without too much trouble. We have done a fair bit of tuning to get to that level. As for latency, it depends; there are two types of latency, apply latency and commit latency. The apply latency, which is essentially how long it takes Ceph to apply the writes into the in-memory caches, is consistently below one millisecond; the commit latency is slightly higher, but usually in the single-digit milliseconds.

How many people are on your team that support this, and how do their skill sets break out?

We have a very large team — of four people. But I really like my team; we have a pretty diverse skill set. Some people are responsible for integrating the physical machines into this environment, some people focus more on the OpenStack side, and I have recently diversified a little to work on different things. So, yeah.