All right, so let's get started. Good afternoon, everyone, and if you are still jet-lagged, good night, or maybe good morning. Hopefully everyone is awake. By the way, I am Anand, and I work for PayPal Cloud Engineering. I have a first-class group of smart engineers and characters, whatever you want to call them; I work for these guys. And I have Chinmai here with me.

Before we get started, I want to give you a little background on how this began within PayPal and how it translated into us building one of the bigger clouds on OpenStack; we are running serious business on it today. A year back, around this same time, October last year, Chinmai and I were looking to do something different. I used to go to OpenStack conferences like you; some of you are very new to the OpenStack conference. I went to the first one in San Francisco and all the other conferences, and I used to play with DevStack and think about how we could take it to reality. I was at eBay and then moved to PayPal, and I needed some systems to build a small lab to get started with an OpenStack cloud. Of course, I didn't want to spend too much money, so I took a bunch of decommissioned servers from our data centers that were just sitting on the floor, and we racked everything ourselves without taking much help from anyone. Everyone was busy with the end-of-year capacity adds and whatnot, and we didn't want to disturb the other operational teams too much.

Then we built it, went to the executive team, and showed a demo: this is how we are building the cloud, and we want to take it to the next level. They said, okay, it looks good; what do you guys need? We said we would start with a couple of small applications, and we put them into production by the end of last year. We were running a couple of applications, we spent around $200K on servers, everything looked good, and when the next year started the executive team said: what do you guys need? Tell me.

There was a question in one of the panels yesterday: how do you take this thinking to the executive team and convince them? It was very easy for us to convince executive management, but we got challenges bringing the rest of the organization into this, because PayPal is not a small company; it is not a startup that can change things overnight. We had to bring the entire organization into this, and that's where the challenges started. The executive team said: okay, we bought in; now you make sure you bring the rest of the organization along. We talked to multiple teams, and they were not buying in, because there had been a lot of initiatives like this earlier and they had all failed. I was even personally asked a lot of questions at a cloud expo in San Francisco, maybe last November: do you guys really want to build the cloud yourselves, use OpenStack yourselves? Is it really your business? I said, yes.
If you want to innovate, innovate in every area, not only in payments or in our applications. It indirectly helps all our application and product teams to find whatever they need at the infrastructure level, so they can write any application and the infrastructure is just there for them. They should not have to worry about needing some special hardware to run their east-west or north-south traffic; we will figure all of that out. And we did not take OpenStack just to save money on licenses; we used OpenStack to bridge some of the gaps we have in our data center. That's why we approached OpenStack, not simply to save on licenses. Of course, we are still going to buy hardware from multiple vendors, and software from multiple vendors if needed. OpenStack became the medium for us: this is what we use, so all of you vendors, make sure it works with OpenStack, and we are ready to pilot it in our lab. Once it works for our data plane, for the different workloads we are looking at, and performs well, then we go and use your product. OpenStack is the medium for us to connect with all these vendors. When we say there is no vendor lock-in, we are still going to use vendors; it's not that we are going to manufacture all the hardware and software ourselves. We wanted OpenStack as the medium to communicate with our vendors. That's how we started.

And of course, we took the OpenStack code, faced a lot of challenges running it in production, and fixed some of the bugs. There are also some blueprints we are still discussing with the community; we didn't wait for the community, though, we went ahead and implemented them ourselves as plugins and drivers and made it happen. Chinmai is going to talk about that. I'll skip the introduction to PayPal; everyone knows PayPal, no surprise there, so I don't want to go too much into that.

So how are we structuring this presentation? We are mostly going to focus on the challenges we faced instead of explaining what OpenStack is; you have all been filled up with a lot of information about OpenStack already, and I don't want to bore you with it again. So: the specific challenges we ran into building the PayPal cloud on OpenStack for serious business; why and how we chose OpenStack when there were a lot of other open-source options; getting OpenStack ready for production prime time; and some of the success stories we have already realized by using OpenStack. First of all: what are we really trying to solve?
Before OpenStack, it's not that we didn't have any automation; obviously our operations teams had a lot of automation built around provisioning, load-balancer automation, and whatnot. We all had scripts; "cloud" is maybe just the new name for all this automation. The one thing we were missing was a common set of APIs that everyone could understand easily. Instead, a small operations team within the organization creates a bunch of scripts, and if you want something done, you file a ticket and they run the script for you. That all works; it's not that it doesn't work. But if I go to my product development team and say, hey, you have your infrastructure, go and deploy your application, they ask: okay, where is it? Send me the IP address, and so on. We don't want to deal with tickets. That's where the automation breaks. It's not that we didn't have automation; it broke when we wanted to integrate with different systems, like a PaaS. That's why we really wanted a common set of APIs, a standard everyone could easily understand, so they could go ahead and integrate with the infrastructure easily. That's also why we chose to go with open source: it is easier for our vendors too, because they can just say it works with OpenStack, whether they are an established company or a startup, and we can pilot it in our lab. We wanted it to cover both our internal needs and our external engagement with vendors, and it really helped us from that angle.

Also, before we had infrastructure as a service, we used to file tickets with the product development team or other teams, and they would take maybe a week to figure out the different infrastructure components you need: DNS, compute, a couple of VMs here and a couple of VMs there in different data centers. It took time to figure out what every application needs, so it took a long time for application teams to deploy their code. Instead, we wanted to say: these are the APIs you need to call, and you figure out the rest. We have a bunch of smart engineers; if you give them the APIs, they can figure it out, because REST APIs are not rocket science for them. That's how they got started, and initially it looked cool to them: if you can create a VM in two minutes, self-service, you're on your own, and you don't have to talk to anyone to get your infrastructure. They were all excited about it. It became a self-service tool for them, and it really enabled them to build their own small automation around it.
Say I want 10 VMs, but I don't want to click the button every time: they just wrote themselves a small script that integrates with the different APIs, or with the Nova CLI itself. I used to help them early on, showing how to call the Nova APIs and all of that; later I would just point them at GitHub. The code is the documentation for them. I don't need to teach anything, because they are all smart engineers; I don't need to say, this is how you create this, this is the API you need to call. It's all in GitHub; we use the same code, shared across the organization, and they are all very happy.

One use case our PaaS team had was that they wanted to integrate with OpenStack. We created a tenant for them in each data center, gave them the credentials, and they went and created their VMs. After that, they didn't want to click buttons; they wanted a workflow that just calls the APIs to create VMs. They asked: how do we use your OpenStack APIs? I said: just Google it, or go to GitHub, here is the code; take the Nova CLI code and, instead of using the CLI, put the same calls into your own code. As simple as that. They were really excited about how easily they could integrate with open-source code, instead of getting a vendor product and then going for special training to understand the specifics of each individual vendor's product. That's how it became much easier for us to bring the entire company into this whole mix.
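To make that concrete, here is a minimal sketch, assuming Grizzly-era python-novaclient; the credentials, endpoint, image, and flavor names are all placeholders, not real values, of what "take the Nova client code and put it into your own workflow" looks like:

```python
# Hedged sketch: boot a VM through the Nova API the same way the CLI does.
# Credentials, endpoint, image, and flavor names below are placeholders.
from novaclient.v1_1 import client

nova = client.Client("myuser", "mypassword", "mytenant",
                     "http://keystone.example.com:5000/v2.0")

server = nova.servers.create(
    name="web-001",
    image=nova.images.find(name="rhel-base"),    # base image placeholder
    flavor=nova.flavors.find(name="m1.small"))   # flavor placeholder

print("%s %s" % (server.id, server.status))  # poll until ACTIVE in real code
```

A PaaS workflow can wrap exactly this call in a loop, which is all the "small script" most teams needed.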
There were multiple options earlier: OpenStack, CloudStack, Eucalyptus, and we even had our own IaaS within the company. It was a hard decision to bring a new technology into the company, and it was very hard to make the choice with so many alternatives. The one thing that really helped us pick OpenStack was the very good community support, and the backing of the foundation itself, which keeps OpenStack as clean as possible without a lot of pollution within the community. We also met with the board, and they said they would make sure OpenStack stays clean for all cloud users, not biased toward vendor A or vendor B; if we see something, we address it together as a community. That gave us really good confidence to go with OpenStack, where the other open-source options were missing that strong foundation behind them. That was one of the key decision points for us.

And when I say no vendor lock-in: we are not going to use anybody's distribution. That is one of the decisions we made internally. We package everything ourselves using our CI/CD. If I want a stable Cinder or a stable Neutron, I don't want to go to a vendor to get stable code. Instead, we take everything from the community and use only what comes from the community, including the plugins and drivers. That's what we mean by no vendor lock-in. It's not that we are not going to use vendors; of course we use load balancers and virtual networking products, we are not developing everything ourselves. But we want everything from the community rather than a distribution from a particular vendor. The developer community is fast-moving, the code gets cleaned up very often, and there is a summit every six months; those were all really good decision points for us among the other open-source tools available. And of course we don't want to invent everything ourselves; we want to leverage industry best practices. If there are smart architects and engineers outside our company already solving a problem, we want to leverage that instead of reinventing the same thing over and over.

So we decided to go with OpenStack, and when we looked at it there were a lot of challenges: we couldn't just take DevStack as-is and put it into production. We had to tweak some things in OpenStack itself to fit our network topology and some specific use cases within our data centers. With that, I will hand over to Chinmai to go over exactly what we changed within OpenStack to meet our production workload.

Hello, good afternoon, guys. My name is Chinmai, and I am one of the lead engineers on the cloud engineering team. Anand just gave you a brief introduction of how OpenStack came into the picture at PayPal; let me walk you through some of the changes we have done in Nova, Keystone, our DNS-as-a-service, and load-balancer-as-a-service, and hopefully you can take some lessons from it if you want to get to the production level we are at.

This is a picture of our stack, and it's fairly simple. At the infrastructure level we use x86 compute, and we have storage, network, and load balancers; on the software side we run on RHEL, not on Ubuntu yet. The core services we use are Nova, Cinder, Swift, Keystone, Neutron, and Horizon; we used Horizon to begin with, and after that we went on and created our own portal, based on Netflix's Asgard. We have Heat as our orchestration engine, plus load balancing and DNS.

Let's go on to some of the changes, starting with Nova and tuning it for high availability. The main concept in productionizing OpenStack for us is that we look at racks: racks of servers. Inside PayPal production, a rack is split into two fault zones; in the diagram you can see two fault zones per rack, and those are basically our availability zones.
So our main aim in scheduling for production is this: when a tenant spins up VMs, the tenant being one of our internal customers, we want to make sure his VMs land across these different availability zones, and we need the scheduler to take that into consideration. Back in Folsom, we built our own custom scheduler filter, a compute-zone filter. A compute zone is basically a combination of fault zones, that is, of availability zones: you could have, say, FZ1, FZ2, and FZ5 selected into one compute zone, and attach that compute zone to a tenant. That means whenever the tenant spins up VMs, they land on one of those racks. That was one of our custom changes.

Come Grizzly, we made use of host aggregates, in two ways. The first is to define the availability zone of a compute node itself, via the key-value pair you get on the aggregate. The second is to use host aggregates for our web-tier and mid-tier tenants: you can spin up special host aggregates based on compute resources, basically how much RAM and disk the hosts have, by setting key-value attributes. The special thing we did is add an extra table inside Nova that does a tenant-to-host-aggregate mapping. So, coming back to the same point: as soon as a tenant launches VMs, we make sure they land exactly on the hosts in the host aggregates attached to that tenant.

Another thing that is important to note here is the 25% distribution among fault zones. Going back to my example: say a tenant wants to be part of fault zone 1, fault zone 2, and fault zone 5. When he spins up, say, three VMs, we want an equal distribution; he shouldn't land on one rack, so that he actually gets high availability. If a rack goes down, or the top-of-rack switch goes down and there is some networking problem, his services are still up on the other racks. The way we did the distribution is with a special piece we built in the scheduler: per tenant, it calculates how many VMs are in each of the tenant's host aggregates across the availability zones, looks at how full the current availability zone is, and then keeps distributing to make the spread equal. These are the kinds of changes you might need in a production environment.
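The spreading logic itself is simple. Here is a minimal, self-contained sketch of the idea, not PayPal's actual scheduler code; the names and data shapes are illustrative:

```python
# Illustrative sketch: pick the fault zone with the fewest of this tenant's
# VMs, so repeated scheduling converges toward an even spread across zones.
from collections import Counter

def pick_fault_zone(tenant_zones, existing_vms):
    """tenant_zones: zones attached to the tenant, e.g. ["FZ1", "FZ2", "FZ5"].
    existing_vms: (vm_id, zone) pairs for the tenant's current VMs."""
    counts = Counter(zone for _vm, zone in existing_vms)
    return min(tenant_zones, key=lambda zone: counts.get(zone, 0))

# Three successive placements land in three different zones:
vms = []
for i in range(3):
    zone = pick_fault_zone(["FZ1", "FZ2", "FZ5"], vms)
    vms.append(("vm-%d" % i, zone))
print(vms)  # [('vm-0', 'FZ1'), ('vm-1', 'FZ2'), ('vm-2', 'FZ5')]
```

In the real deployment this runs inside the filter scheduler against the tenant-to-host-aggregate mapping table.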
Some of the other Nova changes: instance host naming. This is a classic production use case. In stock DevStack, if you spawn four VMs their names are all the same, and in production you don't want that: you need unique names, because these names get used in DNS and everywhere else in the application. So we created a plugin API that guarantees unique host names across the entire cloud deployment. Basically, we generate an ID at the end, there is a format, and we can even change it per tenant. I'll come back to this under the Keystone changes: we made use of Keystone metadata so the host-name format can be specified per tenant, and every tenant can get a different host-name scheme according to his application requirements.

Auto-assigning floating IPs. This is for specific clouds, and again it's configurable: we needed external connectivity for VMs. Instead of a developer having to create a VM and then manually add a floating IP to it, we inserted another plugin into Nova that talks directly to the Neutron APIs and gets floating IPs onto these VMs as soon as they come up. It's inserted right where the instance-spawn code is. So that is one of the use cases where we needed external connectivity for all VMs; again, it's configurable, and it's a plugin.

And rack-aware networking. In the earlier picture where I showed you racks with distributed fault zones, the way we have configured things is that each rack has a different subnet, and we want to make sure that if a VM lands on a rack, it gets the correct IP address from that rack's subnet. In Grizzly, we built a mapping for every host: we figure out which availability zone it is part of and which subnets are attached to that availability zone. This helps us in both overlay and bridged modes of networking. A colleague of mine will be talking more about the network setup we have; his talk on Friday covers bridged and overlay networking. But basically, with overlays, where a subnet runs across, say, three racks or all racks, your Nova logic has to be intelligent enough to detect that the VM has landed on a compute node that is part of a particular rack, work out the network IDs associated with it, and send those network IDs in the requested-networks data structure to Neutron, so that Neutron gives you the IP from the correct networks. That is one thing we added.
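As a sketch of that last piece, assuming a hand-maintained zone-to-network mapping (in reality the mapping lives in our Nova changes) and the Grizzly-era tuple shape for requested networks:

```python
# Illustrative sketch: choose Neutron network IDs by availability zone so
# the VM draws its IP from the subnet of the rack it landed on.
AZ_TO_NETWORK_IDS = {              # assumed mapping, normally kept in the DB
    "fz1": ["net-uuid-rack1"],
    "fz2": ["net-uuid-rack2"],
}

def requested_networks_for(availability_zone):
    """Build the (network_id, fixed_ip, port_id) tuples Nova passes to
    Neutron when allocating the instance's ports."""
    return [(net_id, None, None)
            for net_id in AZ_TO_NETWORK_IDS[availability_zone]]

print(requested_networks_for("fz1"))  # [('net-uuid-rack1', None, None)]
```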
Leveraging config drives: we pass self-specific information to the VM itself through the config drive, which helps with cloud-init and the post-boot install steps.

Nova conductor: one quick point here, because Grizzly brought in nova-conductor. The whole reason conductor came in was security: you do not want direct connectivity to the database from the compute nodes, because your guests live there. But what this does internally, and this is specifically a scale issue that comes into play when you have thousands of hypervisors running, is that every compute node, instead of talking to MySQL directly, now has to put an RPC message on the queue, and every compute node runs a lot of periodic tasks that have to check in with MySQL. It is a good thing in a small environment, but it's a trade-off: it puts a lot of load on RabbitMQ, and if RabbitMQ has not been scaled correctly it can become a problem, because you start dropping a lot of messages and seeing a lot of errors on the Rabbit side. So you might want to turn the conductor service on or off and see how it works for you; we currently have it turned off.

Now let's go to some of the Keystone changes; these are a bit more standard. The first is integrating with LDAP, both OpenLDAP and AD. We wanted our users not to need any special kind of authentication: they can use their own PayPal credentials and just log into the cloud. A related feature is auto-tenancy. If I were a developer, I wouldn't want to go talk to a cloud operator and say, create a tenant for me, I want this, I want that. This is again specific to some clouds, like developer clouds: a developer just comes in, logs in with his LDAP credentials, and we create a tenant whose name is exactly the same as his user name. It creates an automatic tenant for him and adds his user with the member role, or, configurably, as admin of his own tenant, and he can just start using the cloud: he can spin up VMs inside his own tenant, as admin of that tenant. That's one feature we found useful in particular clouds; production clouds can stay restricted, where you go through the normal process of cloud admins deciding the tenant names and setup, but this helps in the regular developer and QA clouds.

Tenant-based host names and DNS zones: this is the point I mentioned before. We use Keystone metadata to hold tenant-specific settings, so host names can differ per tenant and DNS zones can differ per tenant, because the FQDN we create in our DNS service can be his own something.paypal.com, like x.paypal.com or y.paypal.com. The UI has been changed for this as well: we did a lot of UI work so users can select the DNS zones attached to their tenant. It's pretty cool, and it helps a lot in production environments.

Client-side token caching: a quick point about Keystone performance. When Nova integrates with Neutron, it creates a lot of tokens, and Keystone performance takes a hit, because every authentication and authorization call walks an entire list of tokens; you want to reduce the number of tokens. So on the Nova side, for the calls into Neutron, we have done client-side token caching, and you might want to explore some client-side caching yourselves. I know Keystone is planning to move to certificate-based (PKI) tokens, but I don't know when that will come; until then, while we are using tokens, it's important to keep in mind that you need some caching on the client side.
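A minimal sketch of the client-side caching idea; get_token here is an assumed callable that fetches a fresh token from Keystone, and real code would also handle revocation and clock skew:

```python
# Illustrative sketch: reuse one Keystone token for its lifetime instead of
# requesting a fresh token on every Nova -> Neutron call.
import time

_cache = {"token": None, "expires_at": 0.0}

def cached_token(get_token, ttl_seconds=3600):
    """Return a cached token, fetching a new one only when it has expired."""
    now = time.time()
    if _cache["token"] is None or now >= _cache["expires_at"]:
        _cache["token"] = get_token()          # one round trip to Keystone
        _cache["expires_at"] = now + ttl_seconds
    return _cache["token"]
```

Even this trivial cache collapses thousands of per-call token requests into one per token lifetime.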
The team-admin feature: this one comes from Folsom. In the Essex and Folsom releases there was a general cloud-admin role, and we did not want to give that role to every user who wanted to be an admin, because with it you could literally manipulate anyone's tenants. So this is a specific feature that allows you to become an admin only for your own team. Grizzly handles this to some level, but we still needed it, so we went ahead and developed it anyway, and we have it.

DNS as a service integration; let me talk through this quickly. This is again a classic production case: you have a VM, and in production you want to access it using host names. In a classic AWS world you wouldn't have host names, you would just get IPs, but at the end of the day this is our internal production environment, and you just cannot say we are no longer going to use host names. So you need DNS entries, and this is automatic: it's again plugged into Nova as a plugin, and whenever an instance is spawned or destroyed, we take care of DNS. It's API-driven: it runs on BIND underneath, but we built REST APIs on top, and we call those REST APIs to register the host-name and IP bindings. Right at the point where the instance-spawn success happens, you know the IP and the host name of your guest, and you generate the zone name, which is the next point: project-based zones, which we get from Keystone. You make a Keystone metadata call, get the zone for that tenant, append it to his base host name, hand the mapping to DNS, and it is taken care of. So this is automatic and handled gracefully on both creation and deletion. We handle DNS for floating IPs too: the floating IPs we auto-assign get DNS entries as well. Basically, you need DNS as a service for a classic production use case.
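The shape of that Nova-to-DNS hook is roughly the following; the endpoint path and payload here are illustrative assumptions, not the real service's API:

```python
# Illustrative sketch: on instance-spawn success, register the host name and
# IP with an internal DNS REST service (endpoint and payload are assumed).
import json
import requests  # assumed available

DNS_API = "http://dns.internal.example.com/v1/records"  # placeholder URL

def register_dns(hostname, zone, ip):
    """POST an A-record mapping such as web-001.x.paypal.com -> 10.1.2.3."""
    record = {"fqdn": "%s.%s" % (hostname, zone), "ip": ip, "type": "A"}
    return requests.post(DNS_API, data=json.dumps(record),
                         headers={"Content-Type": "application/json"})

# Called from the spawn-success path; deletion mirrors this with a DELETE.
```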
Load balancers. When your VMs come up and you deploy applications on them, at the end of the day you want to put them in a pool behind a load balancer so they start serving traffic for whatever application is running. So we integrated load balancing as a service with our OpenStack cloud. We have hundreds of hardware load balancers, and we built a load-balancer service, written in Java and REST-API-driven again, that does auto-discovery and registration: you can register each and every physical hardware load balancer by IP inside the service. It has rich tenant-facing and operator-facing APIs covering everything: you can create pools, virtual ports, Layer 7 rules, and SSL certs, and there are operator-facing APIs for things like config sync, config restores, and config backups. All of this can be called from Nova, or from whatever orchestration engine you use: after the VM creation is done, you just call these REST APIs.

Propagating changes to multiple LBs: to integrate the load-balancer service with OpenStack, we wanted the tenant-specific concept, meaning that given a tenant, you should be able to identify which load balancers that tenant is part of, which pools, which ports. The REST APIs we expose are similar to the OpenStack APIs in that they identify everything by tenant ID. So when a change is made for a particular tenant, we make sure it is propagated to all the load balancers that tenant is part of; we keep that in mind and have propagation logic to all of them. It even manages the secondary and the primary load balancers: once the change is done on the secondary, we sync it back to the primary.

And change-management integration: all load-balancer changes are critical, and you want some way of accounting for all the changes you make. This is a message-based change-management integration: whenever a change is fired on a load balancer, for example adding a server to a pool on a particular port, a ticket gets filed automatically. This part is PayPal-specific, but basically the ticket carries the details of who made the change, where it was made, and what exactly was changed, for accounting purposes; again, a production use case you would want.
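As a rough illustration of the tenant-scoped API style (the paths and fields here are assumptions about such a service, not its actual contract):

```python
# Illustrative sketch: add a freshly booted VM to a tenant's pool through a
# tenant-scoped load-balancer REST API, mirroring the OpenStack URL style.
import json
import requests  # assumed available

LB_API = "http://lbservice.internal.example.com/v1"  # placeholder URL

def add_pool_member(tenant_id, pool_id, vm_ip, port):
    """Add a member to a tenant's pool; the service then propagates the
    change to every load balancer the tenant is part of (secondary first,
    then synced back to the primary)."""
    url = "%s/%s/pools/%s/members" % (LB_API, tenant_id, pool_id)
    body = json.dumps({"address": vm_ip, "port": port})
    resp = requests.post(url, data=body,
                         headers={"Content-Type": "application/json"})
    return resp.json()
```

An orchestration engine such as Heat, or Nova itself, calls this right after instance spawn succeeds.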
So those are the changes we made in terms of coding: taking OpenStack and actually changing it to run our production clouds. Let me call Anand back to briefly go through some of the success stories. Thank you.

So, I already shared a couple of the important successes we have had. One is the easy integration with our PaaS: we didn't babysit anyone; because everything is open source and on GitHub, they figured out themselves how to integrate with the IaaS, and that is a big win for us. Another big win is bringing the rest of the organization into the whole mix, all of us working together to build the cloud faster for the enterprise. Those two things I definitely want to mention as big winners for us.

Another thing: we built the cloud, it's all REST APIs, and we have Horizon. What happened was, we wanted to tell our users how they could access multiple different clouds with a single user name and password. For compliance reasons we have to physically separate some environments; we cannot logically group production and non-production onto the same hypervisors, we are not there yet, and I don't know when our compliance will allow both workloads on the same thing. So instead, we wanted a single point where you log in with your user name and password and can then manage multiple regions, and multiple projects, at the same time with that single user name and password. So we built a UI on top of all the APIs directly, instead of going through the CLI. The reason is that we cannot upgrade every environment to the same version within a day: for example, if we are running Grizzly and want to upgrade to Havana, we cannot do it overnight in all the environments, and at the same time users should not see any impact to the APIs or the UI; from a usage point of view, they are not affected. So what we did is a simple configuration, a JSON file: you put in multiple regions, specify the different keys for each configuration, and that's it. After that it's a very simple jar file, written in Java, and you can start managing all the different clouds across regions, running multiple different versions of OpenStack. That really helped us mix and match OpenStack versions within production, instead of being forced to upgrade everything overnight and get into an outage. Here are some of the screens, so I can show you what was built.
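The region configuration is conceptually something like this; the field names below are made up for illustration, not the actual file format:

```json
{
  "regions": [
    {"name": "dc1",
     "keystone": "https://keystone.dc1.example.com:5000/v2.0",
     "openstack_version": "grizzly"},
    {"name": "dc2",
     "keystone": "https://keystone.dc2.example.com:5000/v2.0",
     "openstack_version": "havana"}
  ]
}
```

One portal process reads this file and fronts every region, so a Grizzly cell and a Havana cell can coexist behind the same login.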
Okay, now let me get into some of the pain points we really ran into, which of them we have solved and which we haven't, and finally we'll keep some time for questions and answers. DevStack, yeah, perfect: you get DevStack installed on your laptop and go spin up a couple of VMs, and that's all good; but if you want to run it in production, that's where the challenge starts. Basically, the RabbitMQ issues: with a lot of compute nodes, messages get lost, and how are you going to handle all of that? That's where the pain points started for us, and we learned over time. You can't just say it works, put it into production, and look at it later; that's not going to work for us, because it really affects the end users, the developers, and everyone's productivity is important to us. Before putting anything into production we want to run it through our CI/CD and make sure everything works; that is one reason we do not want to take something straight from a public repo, put it directly into production, and look at things later. We are not comfortable there, so we built our own CI/CD. It's not perfect yet; when we have 80% confidence we take the chance on the remaining 20%, but we cannot take a 100% chance.

Keeping up with trunk: this is one issue we currently have; we are not on trunk, and that is really biting us. For example, we ran into some performance issues and wanted to figure out how to fix them, and when we talked to the community in the IRC channels, the answer was: it's being addressed in Havana, better go run Havana. And we cannot do that, because we want to test it first; before trying new features we don't know the impact. But we want to be as close to trunk as possible, so we can fix things, run them, and contribute back. That's one challenge for us. The same goes for our vendors with their plugins and drivers: we cannot contribute fixes back to the stable version of a previous release, because the community is not approving all the changes we want to push upstream. They say, if you go and run Havana we are fine, because the fixes are in Havana, and we are not going to backport to Grizzly. So that's a challenge we have currently, and we are actively working on our CI/CD to make sure we can stand on trunk.

A single Keystone service: we have multiple regions, multiple data centers, and multiple cells, and today every cell has its own Keystone. But if you have a PaaS layer integrating with the IaaS, you don't want it to manage multiple Keystone endpoints just to talk to your infrastructure; that gets really complex if it has to authenticate against multiple Keystones and hold multiple tokens, and you are not syncing tokens between the regions and the Keystones. So that is a real challenge for us: we really want to get to one single Keystone per region, managing all the cells underneath, so you authenticate once per region instead of having a Keystone per cell or per data center. That is one of our deployment pain points; I think there are some enhancements in Keystone in Havana that we are going to explore, which should address some of the issues we have today.

Performance and scalability again: as the cluster size increases, we have figured out that we are not going to add all hypervisors into the same cell. We made a decision that once the number of hypervisors goes beyond 500 or 600, we do not put more into the same cell or the same Keystone, because of the traffic between the controllers and the computes, specifically RabbitMQ; we faced those kinds of issues, and we want to limit things for now until all of this is fixed in OpenStack itself. We have also been exploring some options with ZeroMQ, but we have not done that yet.

And there were some issues we faced recently where VMs got deleted in the Nova database but were still sitting there on the hypervisor. If you look at the database, everything shows deleted, but you cannot launch a new VM because the hypervisor is still holding those instances. We had to create scripts that run periodically to clean all of that up. This is the kind of thing we have to build into Nova itself, so that it reconciles the actual state in the Nova database with the reality on the compute nodes. We really need to make sure this works in real time, because otherwise you end up losing resources in your cluster: you think you have capacity, but you are really running out of capacity, with those things still lying around in your infrastructure. That is a real reliability issue we had, and it needs to get sorted out very soon.
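The periodic cleanup boils down to a set difference; here is a toy sketch of the reconciliation idea, with both listing functions assumed (in reality they would query the Nova database and libvirt, respectively):

```python
# Illustrative sketch: find "ghost" domains still running on a hypervisor
# that the Nova database already considers deleted.
def find_orphans(nova_instance_uuids, hypervisor_domain_uuids):
    """Return domains present on the host but unknown (deleted) in Nova."""
    return set(hypervisor_domain_uuids) - set(nova_instance_uuids)

# Example: Nova knows only uuid-1, but the host still runs uuid-2.
print(find_orphans(["uuid-1"], ["uuid-1", "uuid-2"]))  # {'uuid-2'}
```

A periodic job that destroys, or at least reports, these orphans stops the cluster from silently leaking capacity.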
And error handling: this is another problem all our teams are really pressing on. Today we have different functions within the organization: we do engineering, and we have a support organization with L1/L2 for different customer issues. They are taking the L1/L2 calls, and when something happens in the infrastructure, can everyone debug it? Can we take an OpenStack error code and say, if this happens, you go do that? Can we educate someone outside the engineering team? I don't think OpenStack has a good solution for that today, a way to hand things over so other people can take care of your infrastructure. Basically, we need better error handling. Also, the log files are lying everywhere, scattered across hypervisors, and you cannot go look at each individual log file to figure out what happened to the VM you created and why it failed. Of course we have log aggregation: we put everything into Elasticsearch and Logstash with Kibana on top, and that's all good; but the problem is that only engineers can understand it. The normal L1/L2 folks, it's not that they can't do it, but they don't have the time: users are calling in, and a small support team can't dig through all your log files to support them. We need a better mechanism to identify and pinpoint failures within the infrastructure fast, to react within minutes instead of taking half an hour or an hour to figure out what's going on. We have to build OpenStack toward that; otherwise it doesn't work for the real operational use cases.

This is the small team that built all of this, and if you want to reach out to us and talk through any of the lessons, we are happy to share, and happy to learn from you guys; you can send an email to cloud@paypal.com. With that, I will open it up for questions.

[Audience question about why we moved away from Horizon.] Yeah, so it was a hard decision for us. We started with Horizon itself. What happened was, we rolled the dashboard out to the end users, and when they created a VM it was too many steps; as an owner, I want one or two clicks to get what I want. If I need extra storage, it should just be attached and formatted; they don't want to do all of that by hand. They want a simple workflow. We looked at Horizon, and we didn't want to make changes to core Horizon, because then we couldn't merge from upstream. That is why we went and looked at different options, at how other people run other public clouds or any other cloud in the world, and we looked at Netflix: they already had a cool workflow built, so we wanted to leverage that. That's how we started with it, and we are going to expand more on it, like integrating with our availability monitoring as well; we are planning to bring everything into the same dashboard.
It's in Java, and one good thing is that we have a lot of Java developers within the organization who want to do some cool stuff, building new UI in HTML and so on, so it lets us leverage them as well; this is what enables us to bring the entire company into it. There was a challenge when we said we were going to work on OpenStack: the rest of the organization said, yeah, it's good, but how is it going to affect my job at the end of the day? That question came from everyone: what's my role in this? We said clearly: you don't want to do the same thing over and over, clicking the same button and running the same script. You have to be an engineer; you can't be the person who comes into the office at 8 o'clock every day, runs the same thing, learns nothing, and goes home in the evening. Instead: the code is there, we are all part of it, we are not restricting you. Go to meetups and whatnot, learn OpenStack, go and contribute; it's good for the company, good for the community, and we are all growing together. It really helped us bring the rest of the organization into the whole mix. The same thing with Asgard: okay, it's in Java; there you go, there's the WAR file, go see what you can do with it.

[Audience question about how many hypervisors we run.] We typically don't share how many, but it's definitely in the multiple thousands of hypervisors.

[Audience question about the caching.] Yes, that's what we were talking about: we introduced caching specifically for Keystone; Keystone performance, exactly. And we got RabbitMQ issues: it was dropping a lot of messages and we had to restart RabbitMQ. There was one more issue we ran into where we couldn't even figure out what was happening: RabbitMQ was running, but the messages were not getting through to the controllers, and the communication between the controllers and the compute nodes wasn't happening. Everything looked up and running, yet it wasn't working. It turned out that to run RabbitMQ you need to keep at least a gig or so of free log space: the logs kept filling up because we didn't have good log rotation, and the node stopped responding to RabbitMQ. It's all challenges within Rabbit itself, but it takes time to identify what's failing.

[Audience question about controllers per cell.] Right now we are managing three controllers for at most 500 to 600 hypervisors. We are building small clouds, not one whole big cloud: if something happens, we don't want it to affect the whole cloud. If one cell is affected, we are still not affecting the real business, so we want to be careful there. We build small clouds, bring everything into cells, and roll them up into a root cell. That's where we are really looking for a single Keystone: you always authenticate against the region, and internally it propagates the tokens to the other cells. One bottleneck today is that we are dealing with multiple Keystones; it's not that we want that, it's that the layers above the IaaS don't want to deal with multiple Keystones. That's why we built our portal the way we did.
You log in with it once and select the region where you want to work, and the same user name and password work for you; when you log in, we authenticate against the multiple Keystones behind the scenes, which is a workaround for us, and we make sure the token works for all the other regions as well, so we are not going to keep asking you for your user name and password.

[Audience question about load balancers.] Load balancers are very critical for our availability; for anyone running serious business in e-commerce, load balancers are unavoidable. We have our hardware, and the load-balancer service we built internally: we are managing hundreds and hundreds of load balancers, with a service built on top of all of this, built to map our network topology. We haven't seen all those use cases covered in Neutron yet. I was one of the people at the San Diego summit who proposed bringing all of this into the same room: we all agreed on the common APIs, on the plugin architecture and the driver architecture, and we built all of that, but the use cases we want to solve are still not in Neutron. My engineering team is really unhappy with me, because I am asking these guys to put most of our engineering effort into operationalizing OpenStack rather than building code, and it is really stopping us from getting all these blueprints created, taken through the community, and contributed back with the real use cases. We are going to focus on that in 2014 for sure. We are building our CI/CD; Jonathan's team is building CI and CD for me, so my developers just check in code, it goes through the different stages in the pipeline, and at the end of the day we have stable code going into production. Once we have that, the engineers will be relieved from the day-to-day operations and will work on the real use cases, and those use cases will be translated into blueprints in the community; that's good for the community and good for us. I don't want to deal with another service or code base within PayPal just to manage all our load balancers, because that is again a cost. And when we designed our load-balancer service internally, we missed some obvious things. I talked to Yusuf from Citrix, who was working on Atlas LB; I don't know whether you guys are aware of Atlas LB. It was started at Rackspace, it's in production at HP today, and Rackspace is using the same thing; he did a really fantastic job bringing all the different aspects of the tenant-facing APIs together. We had a lot of challenges using load balancers for the real use cases, so we built a lot of operator APIs, and we put everything together for the community at the San Diego design summit, but it hasn't yet been realized in Neutron, so we definitely need to put in a lot of effort to make it happen for the real use cases.
So with that, I think we are done; I'll take it from there. Thank you.