Okay, so the company was founded in '98 and is headquartered in San Jose, California, and we have 169 million active accounts. These aren't accounts where people register and then never come back; they are used for transacting online and offline in stores, across 203 markets and 100-plus currencies. We processed billions of payments last year alone, and roughly $235 billion exchanged hands through the PayPal platform. So when it comes to money, it's really mission critical; security and compliance are very important. It was essential for us to keep the existing business running as-is while we innovated on something else, like cloud or PaaS; you cannot disturb the existing business. So when we put together the design goals for the cloud platform three and a half years back, it's not that we didn't have a cloud. We had multiple generations of cloud, in the name of automation: we brought in a bunch of tools from vendors and ran them in our data center; you name it, we had it, different types of tools all over the place. But they were not meeting the sum of the organizational goals we had in mind, like developer agility, or scaling as you grow. Instead, you'd buy a tool and use it for scaling to some extent.
Then it stops working out, so you bring in some other automation tool, but the problem is that you then have to change the other layers of the infrastructure. We wanted to completely eliminate that kind of thing. So we put together some design goals and principles before we even got started with cloud platform engineering; it wasn't going to work for just a year, it had to work for at least five to ten years. We took three main design goals. First: a platform to collaborate internally and externally. What do I mean by internally and externally? Internally: PayPal is a company of close to 15,000 to 18,000 employees, still growing, with a lot of acquisitions coming in. How are we going to keep all of these developers, partners, operations teams, networking teams, and storage teams moving together? We wanted a collaboration platform so that we move in one direction together, rather than the storage team saying, "I've got some cool technology I'm rolling out; you guys integrate with me." That does not work in a large organization where everybody wants to innovate at the same time. We wanted a common platform so we could collaborate better and move faster. And externally? As I said, we don't develop everything ourselves; we don't build switches, routers, firewalls, or load balancers. We wanted to partner with the vendors. But if you do everything on your own internally, they don't actually know how we are automating all these different things in our data center.
So what we did was ask: is there anything we can share so that vendors know what PayPal is using and how to integrate with those platforms? People shouldn't have to care what capabilities each device has; I find a way to expose them through an API so that people just use it. Before you build your product, make sure it's compatible with those APIs, test everything, and ship the product along with the integration points. We wanted that as part of our design goals too. The next goal is very critical as well. It's not that we didn't have automation teams; we had a cloud platform team. But the pattern was: a guy comes in with a meta-tool and says, "I have a framework, you put your script in it and it executes." It doesn't give you the feedback loop you want in terms of logging or mining, or support for multiple parallel executions, or scale-out; the tool controls everything. So we didn't have much flexibility in the tools we had in the past. We wanted the cloud platform to provide agility for the developers, so that they benefit directly: infrastructure is exposed as a service, and they come in and self-service compute APIs, storage APIs, load balancers, firewalls, whatever you name in the data center. It's all self-service through the API.
Otherwise the cloud builder himself is stuck, because he ends up doing all the hard work; instead, we wanted a lot of flexibility and agility for the developers themselves. To name one particular example we ran through: we wanted to keep adding more and more filters. For instance, if you want to distribute a project's VMs across multiple racks, how are you going to do that with a vendor tool you brought in? You have to wait for the vendor to provide that platform capability and then add your features on top. Instead, it should be a matter of just inserting a filter, in a pluggable fashion, while being sure nothing else breaks; then you can keep introducing more of them. We wanted that agility for the developers as well. And one very important thing: well-defined cloud APIs. We built multiple generations of cloud within PayPal, and we had plenty of SOAP APIs, REST APIs, CLIs, different versions. Great, but what happened is that only a few architects would sit together, fight with each other or brainstorm, and come out saying, "This is the API: for creating a VM you pass this bunch of attributes and you get a VM back." Over a period of time we realized we had only a limited set of knowledge, and there are a bunch of other people outside of our organization who are very smart. We wanted to leverage them.
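The "just insert a filter" agility described above can be sketched as a chain of predicates that narrows a candidate host list. This is a minimal illustration in plain Python, not PayPal's (or Nova's) actual scheduler code; all the host attributes and names are made up:

```python
# Illustrative sketch of a pluggable scheduler filter chain. Each filter
# is a predicate that narrows the candidate host list; a new filter is
# just one more entry in the list, and nothing else has to change.

def has_capacity(host, vm):
    # Keep hosts with enough free vCPUs for the requested VM.
    return host["free_vcpus"] >= vm["vcpus"]

def rack_anti_affinity(host, vm):
    # Avoid racks that already hold a VM from the same project.
    return host["rack"] not in vm.get("used_racks", set())

def schedule(hosts, vm, filters):
    candidates = list(hosts)
    for f in filters:
        candidates = [h for h in candidates if f(h, vm)]
    return candidates

hosts = [
    {"name": "hv1", "rack": "r1", "free_vcpus": 8},
    {"name": "hv2", "rack": "r1", "free_vcpus": 2},
    {"name": "hv3", "rack": "r2", "free_vcpus": 16},
]
vm = {"vcpus": 4, "used_racks": {"r1"}}

# Inserting rack_anti_affinity is purely additive to the chain.
print([h["name"] for h in schedule(hosts, vm, [has_capacity, rack_anti_affinity])])
```

The point of the design is that a filter is inserted without touching the platform or the other filters, which is exactly the agility the vendor tools did not offer.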
When we designed some of these APIs, we had to iterate again and again, and we always had gaps in terms of the APIs, their capabilities, and their attributes. So we wanted a well-defined API that is backward compatible, so that the current generation keeps working even as you start using new capabilities; nothing breaks. We definitely wanted well-defined cloud APIs. When we put all those things together, the question was: does anyone in the world have a product for this? We took our existing products, and at the same time, in July 2012, we looked at new options. We did POCs with CloudStack, OpenStack, Eucalyptus, and our own existing tools; I don't want to name them all, but you name it, we had it in our data center. We carefully reviewed each one of them; we took more than two months to evaluate them against the design goals: "You have this, but we want to go there; what will it cost?" During that two-month evaluation period with POCs, OpenStack stood out. Even though we started very early, in the Essex time frame, we decided to go with it because the community was moving really fast, a lot of vendors were coming in, and the Foundation had been formed. OpenStack has been around since 2010, and every year we do a technology audit within the company and compare where we stand against other financial and commerce companies.
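The backward-compatibility goal can be illustrated with a toy "create VM" request handler that gives new optional attributes defaults and tolerates unknown fields, so older clients never break as capabilities grow. This is a sketch under assumed attribute names, not PayPal's real API:

```python
# Illustrative sketch of a backward-compatible API surface. Old clients
# that send only the original attributes keep working: new optional
# attributes are defaulted, and unknown fields are ignored rather than
# rejected, so the API can grow without breaking callers.

KNOWN_DEFAULTS = {
    "image": None,          # required, no default
    "flavor": "small",
    "az": "az1",
    "spread_racks": True,   # a newer capability, defaulted for old clients
}

def create_vm_request(payload):
    if payload.get("image") is None:
        raise ValueError("image is required")
    req = dict(KNOWN_DEFAULTS)
    for key in KNOWN_DEFAULTS:
        if key in payload:
            req[key] = payload[key]
    return req

# An old client that has never heard of 'spread_racks' still gets a
# valid request with the new attribute defaulted.
print(create_vm_request({"image": "ubuntu-14.04", "flavor": "large"}))
```

The design choice being sketched is that compatibility lives in the contract (defaults and tolerance), not in version-forked endpoints.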
We do that audit every year, and in 2012 the verdict was: OpenStack is there, but there's no foundation yet, and we don't want to just jump in and get burned. Then the OpenStack Foundation was formed, about six months before we started. That was the right time, and we decided to go with it. Now, some background on the PayPal cloud, specifically the OpenStack cloud I'm talking about. We started in 2012 with one engineer. Since we had invested a lot of money in our existing infrastructure and platform tools, no one was going to sign up for something new, because everybody comes in, talks about something, and then leaves. So what we got was one engineer (that's me) and 16 decommissioned servers from one of our data centers. We put them in a lab, ran the POC, went back and forth with the executive team, and I'll walk through where we started; I have a brief slide on that. Today it is one of the world's largest OpenStack private clouds; I can probably say that. It's not that we are running only QA or dev workloads just for fun; it's really mission critical, as the first slide showed. It's $245 billion of money that moved through this platform, not a dollar or two.
And we did it without compromising availability, security, or reliability, and more importantly, without disturbing the existing business. It's real; we moved it. We started with 16 decommissioned servers that were supposed to go out of our data center, and today we are running 8,000-plus hypervisors and 400,000-plus cores. I don't actually know who is the largest after this; I believe Walmart is close to this in private cloud. Number of VMs: 82,000-plus, and block storage at two petabytes. As for object storage, we just got started with that; people don't have a lot of object storage requirements here yet, those are still coming in, and we are building the object storage cluster bigger and bigger. There are going to be a lot more use cases going forward. We have 10-plus availability zones, and the largest AZ, which we built last summer, is 2,500 hypervisors with three different cells; it's the largest one we have built so far. In terms of what it enables for the business: we are hosting 100% of web traffic, except business and messaging. That's three and a half years of journey, and it's powering 100% of our platform as a service, a layer on top of IaaS, plus dev, QA, and mergers and acquisitions. Whatever we bring into the company uses the same platform: if an acquisition has colos or whatever, we put them on the roadmap and migrate them into our data centers, to save cost and for a lot of other benefits.
Network monitoring and so on come along with that. We also deployed SDN. "SDN" is a very fancy and overused word in the industry now; we deployed it in 2013 and ran production workloads on it as well. When I say production workloads, it's not all of them; we carefully pick and choose, because you want to be at the bleeding edge, but you don't want to burn yourself on the availability, security, and reliability of the business that we always talk about. So that's where we are today. We also clearly put together a vision; everything starts with a vision, and we are not very different there. The vision is very simple. I talked about it three and a half years back when we started with OpenStack, at other conferences, and at Vancouver and the other summits as well. Nothing has changed.
It remains as it is: provide a platform that enables agility, availability, and innovation for the company, as a cloud platform. That's what we wanted. There are payments, wallet, mobile, social payments, and a lot of other products within PayPal, and all of them are going to use this one platform: platform as a service, and underneath it, infrastructure as a service. Any data center products below that, we just want to abstract away behind the infrastructure-as-a-service API. That was the vision. We also wanted configuration management that goes all the way from PaaS to IaaS so you can correlate things together: if something goes wrong in your switches or routers, you know exactly what is going to be impacted at the application level. We wanted that entire correlation across your data center and your applications, so we definitely needed a configuration management system, plus the monitoring, remediation, and alerting you need to run large-scale infrastructure. You can't just throw those out because you are bringing in something new. And one good thing we had in our infrastructure was very experienced professionals. The cloud vision may have been new in 2012, but remember, this company was founded in '98, so we had a lot of best practices for running one of the largest infrastructures. We had all this talent in the company, and we made sure they would be working on the next-generation cloud platform.
So we leveraged them in full; there was no compromise on that. Now, software comes and goes: you use Puppet for some time, then you bring in Chef, or maybe Ansible, whatever. For infrastructure as a service, we completely standardized on OpenStack; that's the only API we provide. And the platform-as-a-service layer is very critical: it needs to manage the entire application lifecycle for our developers, so they don't have to care how their application gets into production, how it is monitored, or how new code is rolled out. There is complete automation from the developer box to production. Before this platform, we had multiple different cloud footprints: somebody running one stack in dev and QA, somebody running LnP (load and performance) on a different stack, and bare-metal servers in production, which were completely different again. And there was always the complaint: "It worked in my LnP environment, but in production I get more latency for my application." If you are not running the same platform across the board, all the way from the developer box to production, of course you will have differences. What we did: the same API serves all the different varieties of workload, whether dev, QA, LnP, prod, or any mergers and acquisitions we bring into the company. That really saved a lot of troubleshooting time, specifically around application performance, because when we talk about payments, it's very critical.
You can't stand in the queue for a minute to make your payment; it has to be within a second or two, three seconds maximum, or you'll abandon it. It's completely unacceptable if bringing in a cloud platform increases your transaction time by ten seconds. So this solved a lot of problems of the form "it worked in my dev box, I tested it in QA, I tested it in LnP, I don't know what operations did, but it's not working in production." Only the number of VMs and the pool size differ; otherwise we run the same set of infrastructure everywhere. Even the dev VMs run in the same cloud, with the same kinds of flavors used in production; nothing is different. As for the configuration management system: we wrote the CMS internally. It's essentially a CMDB, and it has been open sourced as well, so you can try it out. On the open source side we also use Puppet, Salt, and Ansible for different purposes; I can't go through all the different use cases of where we use each one. And we wrote a bunch of other tools on top of that to manage this cloud itself, a lot of homegrown tools, to run 10-plus availability zones and the largest OpenStack AZ with 2,500 hypervisors. Things will go wrong.
It's infrastructure: the network will break, a switch will go down; one time one of our core routers went down, both active and standby. How are you going to remediate all of that so the business doesn't get impacted? We built a bunch of tools; I think one or two of them are already open source, and we will open source others as well. On the open source side we use Elasticsearch, Logstash, Kibana, and Zabbix for monitoring. Now, the layer above, platform as a service: that also comes and goes. Our platform as a service at PayPal deploys code all the way from the developer box to production in a self-service manner, and we cannot compromise on release management and change management. With thousands of developers rolling code out to production, it's not enough to say, "I checked it into GitHub, the build is great." How do you know it's not going to impact others? You tested, but you still want change management and all of that, so that when something goes wrong you know for sure what was rolled out to the site, and you can immediately roll all of it back. We definitely cannot compromise on any of that. We also pick and choose technologies based on what's coming in the market, and we make sure those APIs are also very clearly defined for the platform as a service. So, going back: this is where we started, in the lab.
Those first unused compute nodes came from our lab, and I've already talked about where we are today; this is the whole journey from where we started. There were a bunch of targets, and we went through multiple phases with clear execution. We didn't just stop; we kept going, and whenever we ran into issues we would reiterate and consistently keep going, rather than getting stuck and going back to revisit the vision itself that we had put together. It is still continuing. We started in August 2012; the POC was completed in the lab, and then we moved to production. One thing we did differently: we put our OpenStack cloud into production directly. A lot of people think, "OK, I'll roll it out in dev and QA first and then promote it to production," but we took a different approach. When we looked at our infrastructure issues, we saw that QA is one of the most challenging environments, because the rate of change is so high, while production is more controlled. There were a lot of discussions around that: "You're saying we're going to production, but how guaranteed is that going to be?" But based on the issues we had run into over the previous year or two, managing the QA environment was more complex than production. Think about the cloud APIs themselves: once you have built a VM and provisioned the pool on the load balancer, production mostly just continues to exist as it is; unless we add more compute nodes and add them to the load balancer, it keeps working as it is. But in QA, every time a developer checks in code, everything changes, and re-wiring happens very often; production is much more stable. Now let me jump into the deployment architecture. I know there is a lot of information from three and a half years and I'm running fast, but if you want to catch up with me, I'll share my email ID; send me your specific questions and I'm happy to answer them. I shared this same slide at Vancouver as well; nothing has changed. It is very similar to what AWS or GCE has in terms of regions and availability zones, and within an availability zone we run multiple cells as well, for scaling purposes. The very interesting thing here: your availability zone is your fault domain. What does that mean? If something goes wrong in your core routers, your application should not get impacted at any cost. But that is not the case in traditional data centers. Say you run 10 or 20 VMs for a particular application; if two or three of them go down, it's a big deal for the application owner: "My application pool capacity went down 30%; I don't care how, you have to bring it back." But if a whole availability zone goes down, how are you going to make sure you spin up more VMs, roll out the code, and bring back the same capacity for your application pool? How many applications are ready for that, first of all?
And how many applications are going through the platform as a service? Remember, platform as a service is still evolving in every company, and we are not very different; a lot of companies are still deploying applications to production the old way, going through multiple phases. Unless you can click a button and spin up the 10 or 20 VMs you lost in one availability zone, add them in another availability zone, deploy the code, and bring them back online, you are exposed during peak traffic. For example, in the US, roughly 11 a.m. to 2 p.m. is peak for any e-commerce company, and it's the same for a payments company. So what if a core router goes down in one availability zone? How are you going to take the traffic in another availability zone within a minute or two? Every second and every minute counts in terms of revenue. How many applications are ready for that?
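The capacity question above comes down to simple arithmetic: when one AZ fails, the surviving AZs must absorb its share of the application pool, and the automation has to know how many VMs to spin up where. A toy sketch, with illustrative numbers only:

```python
# Back-of-the-envelope sketch of AZ failover capacity (illustrative).
# Given an application pool spread over AZs, compute how many
# replacement VMs each surviving AZ must spin up to restore capacity.

def replacements_needed(pool, failed_az):
    lost = pool.get(failed_az, 0)
    survivors = [az for az in pool if az != failed_az]
    # Spread the lost capacity as evenly as possible across survivors.
    per_az, extra = divmod(lost, len(survivors))
    return {az: per_az + (1 if i < extra else 0)
            for i, az in enumerate(survivors)}

pool = {"az1": 10, "az2": 10, "az3": 10}   # a 30-VM application pool
print(replacements_needed(pool, "az2"))     # az1 and az3 each add 5 VMs
```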
So when we built all this, we chose carefully: your availability zone is at least 15 racks, and if those 15 racks go down, your application cannot be impacted. A lot of people said that was going to be impossible. But we didn't want to wait for all the other components to move onto the PaaS and go through auto-scaling and so on; we wanted to move the needle. As the business grows, we wanted to move every legacy workload, bare metal and legacy cloud, onto this. So while today we talk about the AZ as the fault domain, initially we treated a rack, or even half a rack, as the network fault domain, at the top-of-rack switch level. Whenever you create VMs for a particular project, we make sure they are distributed across multiple racks and multiple hypervisors as well. Packing them together is a completely anti-cloud pattern: if one hypervisor goes down and you have five VMs on that particular hypervisor, you're screwed.
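The rack-level spread just described can be sketched as round-robin placement, which bounds how much of a pool any one rack (or its top-of-rack switch) can take down. This is illustrative only, not the actual placement logic:

```python
# Illustrative sketch of spreading a project's VMs round-robin across
# racks, so that losing any single rack removes only a bounded slice
# of the application pool.
from itertools import cycle

def place(vm_count, racks):
    placement = {r: 0 for r in racks}
    for _, rack in zip(range(vm_count), cycle(racks)):
        placement[rack] += 1
    return placement

placement = place(10, ["r1", "r2", "r3", "r4"])
print(placement)                      # {'r1': 3, 'r2': 3, 'r3': 2, 'r4': 2}
worst_loss = max(placement.values())  # at most 3 of the 10 VMs share a rack
```

With a packed placement, a single rack failure could take the whole pool; with the spread, the worst case here is 30% of the pool, which is exactly the kind of bound the fault-domain design is after.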
Today we are at the AZ level, but earlier that wasn't the case. Within an AZ we run a fairly typical data center architecture: the internet, then your core routers (and of course DDoS protection and the edge layers, which I'm not showing here), then the aggregation layer, then the access switches, which are nothing but your top-of-rack switches, and then the racks. The last portion of this slide is very interesting: you have your control plane and data plane and VMs, and where you run your SDN controllers, SDN gateways, SLBs, and firewalls. Earlier we used to have separate infrastructure racks: you buy one or two racks and put all your controllers there. But what about when you need to scale up? You have to scale out your controllers as well, and you can't wait for one more rack to arrive before spinning up more of your control plane. That was the older architecture we started with, but currently we run the control plane on the same infrastructure where our internal customers have their own VMs. They are all on different networks, completely isolated.
There are different firewalls in between, and I'll talk about the VPC next. This solved a lot of problems for us in terms of scaling out the control plane itself. There is still one rack we cannot move away from, and that is the reason we wanted to run our SDN gateways, load balancers, and firewalls in VMs: say you want to add load balancer capacity to a particular zone, but you don't have space in those racks. You can't just buy another rack; you may not have the power, or anywhere to put it. So the goal is NFV-style functions: move the load balancers and firewalls into the same cloud racks, moving away from "pets" racks toward "cattle" racks, so you can scale out your firewalls and load balancers going forward too, and it becomes standardized as well. Another important thing for us: instead of building multiple clouds for different varieties of workloads, like QA, LnP, or maybe an external cloud for things like blog posts that people deploy outside, plus all the security zoning you need within your data centers for compliance reasons, you cannot put everything into the web tier.
Some things have to be protected in a compliance zone. Typically, people handle this by building multiple clouds. What we did instead was introduce a VPC, very similar to what other public cloud providers have. The VPC logically separates the different types of security zoning for us. For example, in the same cloud we run dev, QA, production, LnP, EXT, whatever, but clear boundaries are defined for what each can and cannot do: what images each VPC can use, what DNS zones a particular tenant can use when creating a VM or a load balancer VIP, the DNS names, and so on. All of this is logically grouped and expressed as a bunch of attributes on the VPC. We clearly defined the VPC attributes and attached them to the admin project, because whenever a project is created, we make sure it inherits all those attributes for that particular VPC. So a VPC is a collection of OpenStack tenants (OpenStack projects), and it is a security zone: who can talk to whom, and who can access what.
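The attribute-inheritance idea can be sketched like this. The real system was built as modifications to Keystone, so the dictionaries and attribute names below are purely illustrative stand-ins:

```python
# Illustrative sketch of VPC attribute inheritance: a VPC is a named
# bundle of policy attributes (security zone, allowed images, DNS zone),
# and every project created inside it inherits that bundle.

VPCS = {
    "prod-vpc": {"security_zone": "prod",
                 "allowed_images": ["hardened-base"],
                 "dns_zone": "prod.example.internal"},
    "qa-vpc":   {"security_zone": "qa",
                 "allowed_images": ["base", "hardened-base"],
                 "dns_zone": "qa.example.internal"},
}

def create_project(name, vpc):
    # Inherit the VPC's attributes at project-creation time, so every
    # tenant in the VPC lands inside the same security boundary.
    project = {"name": name, "vpc": vpc}
    project.update(VPCS[vpc])
    return project

p = create_project("payments-api", "prod-vpc")
print(p["dns_zone"])   # prod.example.internal
```

The key property is that tenants never choose their own zoning; it comes from the VPC, which is why the same construct can be realized with VRFs in newer data centers and VLANs plus firewalls in older ones.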
Basically, you clearly define the security zoning and then introduce different technologies to isolate the traffic. In a newer data center we have VRFs; in an older data center you don't have all the fancy newer-generation networking gear, so we use VLANs with firewalls in between to separate the traffic. The construct stays the same from the tenant's point of view, even though we use different technologies to realize it. There were also Keystone changes to make this happen: there is no VPC concept in OpenStack, so we modified Keystone and introduced a lot of modules on top of it. Also, in the overlay world, a VPC is a single large virtual router: each VPC has its own router for tenants to route through. But if you are running on the bridged network, the previous generation, we use VRFs to separate the traffic. In the VPC model, scale is the issue, specifically when you go to overlay. If your VPC has 2,000 hypervisors, can one logical router handle that much traffic? With 2,000 hypervisors, even at 10 to 15 VMs each, that's already 20,000 VMs or more. And if it's an active/standby router, how quickly does traffic fail over if something happens to the active one? Money flows through there; how quick and how seamless is the failover going to be for mission-critical payments?
That's where we are very careful in selecting the size of that particular logical router, so that we don't impact much, and we are evolving as we grow. At the same time, we don't want to say we will never run production on overlays; that is the ultimate goal we want to reach. But we carefully choose which workloads we run on overlay and which we run on the bridged network, so the VPC will scale well over time on the network gear. Okay, that covers that. Now, the challenges in managing 10-plus availability zones: thousands of services across multiple data centers. What do I mean by that? We have controller services, Nova and Neutron, and every hypervisor runs its own Nova service as well. How are we going to make sure all of them are running all the time, so that your cloud health and your cloud resources stay up to date, whatever you are running? We ran into so many issues in keeping all the services running. If something goes wrong, how fast can you detect it, and how fast can you remediate it? Say one of your Nova services went down and you are trying to spin up a VM. Even though you have capacity, you are not going to get a VM, and that is not the place you want to be in. So what kind of automation do you have to constantly monitor and constantly auto-remediate, so that you run your cloud at scale all the time without interruption to the end users? And then, gigabytes of logs. I'm sure, okay.
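The constant-monitor, constant-remediate loop can be sketched as a simple control loop; the service names and the `restart` hook here are stand-ins for whatever monitoring (e.g. Zabbix) and remediation tooling is actually wired up:

```python
# Hypothetical sketch: detect down OpenStack services and auto-remediate
# by restarting them, so a dead nova-compute doesn't silently eat VM builds.
def check_and_remediate(service_status, restart):
    """service_status: {service_name: "up" | "down"}.
    restart: callable invoked for each down service.
    Returns the list of service names remediated."""
    remediated = []
    for name, status in service_status.items():
        if status == "down":
            restart(name)
            remediated.append(name)
    return remediated

restarted = []
status = {"nova-compute@hv-0042": "down", "neutron-server": "up"}
fixed = check_and_remediate(status, restarted.append)
```

In practice the loop runs continuously and the restart step is guarded (retry limits, escalation to a human), but the shape is the same.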
If you are running OpenStack, you know how much log you get every day. Detecting is one thing: your Nova service went down, you have Zabbix monitoring it, it is going to alert. But how quickly can you identify the root cause? With thousands of log files, how are you going to figure out which one is triggering that particular outage? You have to drill down through so many logs, and you have to do a lot of work, because it is not going to be simple. We have to aggregate everything, push it to Logstash and the ELK stack, and do a lot of mining to identify it. We put a lot of effort into collecting all these logs and mining them. And then, so many moving parts. The cloud is not like a database application, where you say: here is my application, I have a middle layer and a database; the data is not getting saved, or one of the database attributes is not getting updated; you go to your log file, find the stack trace, and figure it out. Not here. In this corporate network and data center there are multiple backbones, and different people are responsible for different parts of the infrastructure: the backbone network, the hardware infrastructure, the DNS infrastructure. Anything can go wrong, and how are we going to make sure all of them keep working so that your cloud also works? Sometimes your DNS entry is not getting resolved, but it becomes a cloud problem, even though you are not directly responsible for that particular zone, or for whoever is not replicating from the production network into the cloud, because they are two completely different infrastructures.
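Once the logs are aggregated, root-cause hunting usually starts with a filtered query over the outage window; a toy sketch with made-up log lines, standing in for what an ELK query would do:

```python
# Hypothetical sketch: once logs from every node land in one place (e.g. an
# ELK stack), the first step is filtering ERROR lines for the failing
# service inside the outage window.
def error_lines(log_lines, service, window):
    """Return (timestamp, message) for ERROR entries from `service`
    whose timestamp falls inside the (start, end) window."""
    start, end = window
    hits = []
    for line in log_lines:
        ts, svc, level, msg = line.split(" ", 3)
        if svc == service and level == "ERROR" and start <= ts <= end:
            hits.append((ts, msg))
    return hits

logs = [
    "10:01 nova-compute INFO instance spawned",
    "10:02 nova-compute ERROR AMQP connection reset",
    "10:03 neutron-agent ERROR port binding failed",
]
hits = error_lines(logs, "nova-compute", ("10:00", "10:05"))
```

The real mining is messier (correlating across services, not just one), but it all rests on having every log in one searchable place first.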
So somebody is connecting to their dev box and their DNS entry is not resolving. Maybe the DNS entry got missed when syncing from one infrastructure to the other. There are multiple moving parts, but this becomes your problem. How are we going to solve that, and what kind of partnerships do you establish within the same organization, so that we all have clear SLAs defined between the services, and you can satisfy your own customers as well? For the end user it doesn't matter who you are partnering with; the cloud needs to work across all the different components. And then there are network infrastructure differences over time. As I mentioned, this infrastructure has been built up over more than 10 or 15 years, and it gets retrofitted every three or four years, whenever we decommission infrastructure for a tech refresh. But for those four years you still have to live with the older generations of gear. You might not be able to upgrade some of the firmware, so you still have to deal with them, and it definitely affects the availability of your cloud API itself. Network infrastructure changes, overall failure versus one or two: this is very critical. This is what I always talk about: the dev cloud is actually more complex than the production cloud. Say you are a developer. You spin up a VM, you are doing some R&D work or development, you save something locally, and you don't use volumes or anything. Then the hypervisor it lives on goes down, and for some reason you cannot recover it because the disk went bad or whatever. You have to send it to RMA, you have to reimage, and you lost your VM. "Hey, the cloud is broken," right?
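The missed-DNS-sync failure mode can be caught proactively by diffing the authoritative zone against the replica; a sketch with hypothetical records:

```python
def missing_records(source_zone, replica_zone):
    """Records present in the authoritative zone but absent or stale in
    the replica: exactly the entries that 'become a cloud problem'."""
    return {
        name: ip
        for name, ip in source_zone.items()
        if replica_zone.get(name) != ip
    }

# Hypothetical zone data: one record never made it across the sync.
source = {
    "devbox-17.dev.example.com": "10.1.2.3",
    "api.dev.example.com": "10.1.2.9",
}
replica = {"api.dev.example.com": "10.1.2.9"}
missing = missing_records(source, replica)
```

Running a check like this on a schedule turns "my dev box doesn't resolve" tickets into an alert you see before the user does.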
For them it is 100%: they had only one VM and it is completely gone. Production is actually much simpler. You have 100 VMs for your application, and one or two going down is not a big deal, because it is stateless: you spin up a VM, roll out the code, add it to the load balancer, and it works. But the effort you have to put in so that that one VM does not go down for that particular developer, that is a big deal. So we put a lot of effort into our dev VPC when we rolled it out. We made sure the hypervisors do not go down; our internal goal is a hypervisor failure rate of no more than 2%. When we started it was around 5 to 6%, because of the different generations of hardware in the infrastructure, and the question of how fast you can upgrade all the BIOS versions and firmware and everything else so that your hypervisors stay healthy all the time. And then global Keystone. Since we have 10-plus availability zones, we don't want every developer saying: this is my Keystone one for this availability zone, Keystone two for another one, Keystone ten for the tenth availability zone. So we introduced global Keystone. There is only one endpoint; the customer comes in, gets a token, and that token is usable across all the AZs. We rolled out global Keystone; otherwise it is going to be a nightmare for end users and for anyone integrating with the cloud APIs. And then, sync issues between AZs and cells. "Hey, great, we've got cells." We introduced cells because we wanted to scale out our RabbitMQ infrastructure and so on, but it introduced a lot of other problems as well. There were sync issues between availability zones and cells.
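The global Keystone model described a moment ago boils down to one token whose catalog knows every AZ's endpoints; a toy sketch (the catalog shape and URLs are illustrative, not Keystone's real schema):

```python
# Hypothetical sketch: one auth endpoint, one token, and a catalog that
# maps every availability zone's service endpoints, so users never juggle
# ten per-AZ Keystone endpoints themselves.
def endpoint_for(catalog, service, az):
    """Look up a service endpoint for an availability zone from the
    catalog attached to a single globally valid token."""
    return catalog[service][az]

catalog = {
    "compute": {
        "az1": "https://nova.az1.example.com",
        "az2": "https://nova.az2.example.com",
    }
}
url = endpoint_for(catalog, "compute", "az2")
```

The user authenticates once; picking an AZ is just a catalog lookup, not a different identity service.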
We had to deal with a bunch of bugs to make it work, and it took us some time to figure out. Hey, you are showing 15,000 VMs in this particular AZ; the three cells together are showing 15,000 VMs, but at the AZ level we only have 14,000-plus or something. Later on we realized there were sync issues going on. When we rolled it out we did not do that much scale testing; we did not have that big an infrastructure to simulate it either. And then the firewall between the control plane and the hypervisors. This is another big thing. Since PayPal is a compliance company, there is a lot of security involved, so there is a firewall between your control plane and your hypervisors. You have nova-compute running, but it is a stateful firewall: if there is no interaction between your hypervisor and the controller, the connection gets torn down. And whenever Nova is done with its changes and tries to put messages back onto the queues sitting in the control plane, which is a different infrastructure, a completely different VPC, the message is not going to reach. So how are you going to make sure the state is up to date in the Nova database? There were some bugs in the Havana version that got resolved in Kilo; the Kilo upgrade actually solved a lot of problems, but until then we had to work around it. It took a very long time for us to figure out why. Everything becomes a "RabbitMQ issue," but it is not a RabbitMQ issue; it is an infrastructure issue. People exchange a million-plus messages every second through RabbitMQ.
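A standard mitigation for the stateful-firewall teardown described above is to keep AMQP heartbeats firing well inside the firewall's idle timeout, so the connection never looks idle. A sketch of the sizing rule, with illustrative numbers rather than PayPal's actual timeouts:

```python
def heartbeat_ok(heartbeat_interval_s, firewall_idle_timeout_s):
    """An AMQP heartbeat must fire well inside the firewall's idle
    timeout, or the firewall silently drops the 'idle' connection and
    replies queued for the control plane never arrive.
    Rule of thumb: heartbeat at most half the idle timeout, so even one
    lost heartbeat doesn't leave the connection looking idle."""
    return heartbeat_interval_s * 2 <= firewall_idle_timeout_s

ok = heartbeat_ok(heartbeat_interval_s=60, firewall_idle_timeout_s=180)
bad = heartbeat_ok(heartbeat_interval_s=120, firewall_idle_timeout_s=180)
```

The hard part in the field is that the failure presents as "RabbitMQ is flaky" when the queue itself is perfectly healthy.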
It is not a RabbitMQ issue; it is how we use RabbitMQ. And we talked about generations of hardware, firmware versions, BIOS differences. I'll tell you a classic issue we ran into recently. Our hypervisor failure rate was at five to six percent. Why? Because there was a problem with the hardware, and the firmware itself was buggy: even though the temperature was not reaching the threshold, it was bringing down your hypervisors, very badly. So we had to upgrade. But how are you going to upgrade thousands and thousands of hypervisors overnight? It is impossible. Yet you have to deal with that when you talk about cloud at scale, and you have to make sure your hypervisors are healthy all the time, because they are taking production traffic. You cannot just upgrade all of them overnight. And then config drift management. I'm sure you are going through this as well. If you are running thousands and thousands of hypervisors, you are expected to have config drift. You roll out Puppet, but some hypervisors might have been down, and since we are not running Puppet automatically, you will end up with differences in the config files. So how are you going to make sure the config is up to date when the hypervisor comes back? We had to build some automation around that, so that all of our configs stay up to date as well. And then capacity management and the cloud back office. Great, you are saying the cloud is unlimited, but then we say: oh, we don't have capacity for that. How are you going to handle this? "Give me three months and I am going to buy capacity and add it to the data center," right?
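The drift check amounts to diffing the desired, Puppet-rendered config against what is actually on the hypervisor when it comes back; a sketch with hypothetical option names and values:

```python
def config_drift(desired, actual):
    """Keys whose value on the hypervisor differs from, or is missing
    versus, the desired Puppet-rendered config."""
    return {
        key: {"desired": val, "actual": actual.get(key)}
        for key, val in desired.items()
        if actual.get(key) != val
    }

# Hypothetical config values: the host was down during the last Puppet
# rollout and still carries the old timeout.
desired = {"nova.rpc_response_timeout": "180", "nova.cpu_allocation_ratio": "4.0"}
actual = {"nova.rpc_response_timeout": "60", "nova.cpu_allocation_ratio": "4.0"}
drift = config_drift(desired, actual)
```

An automated pass like this on host boot, before the hypervisor rejoins the pool, is what keeps a missed Puppet run from becoming a production surprise.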
But people will say: the public cloud is not that way, so why are you taking so much time? So we did a lot of automation to minimize the whole life cycle of bringing in new capacity and adding it to the cloud. We built a bunch of automation, and it is still going on; instead of four to six weeks for the capacity-add process, we are bringing it down to weeks. Any cloud provider operating at that scale, with mission-critical workloads running, adding 20 to 30% capacity year over year to the data center, absolutely needs all this automation. On the challenges: we talked about a bunch of things, and I want to cover some specific areas as well. Unpredictable API usage patterns: people do crazy things in dev and QA. A single point of failure for a single VM: we talked about that. And the variety of workloads. It is not just web and mid-tier or some database applications. People bring in a Kafka cluster, then tear it down and spin up something else, and they do a lot of load-and-performance testing in the dev cloud as well. Sometimes they take our load balancers and firewalls down with them. How much data is being pushed through the firewall between the different VPCs? It can choke your firewalls too. You cannot predict that; in production you know exactly what your traffic pattern is. And then ad-hoc connectivity requirements: "I've got this dev cloud, but I want to access something else, somewhere else in the infrastructure," and you have to find a way to unblock them. And then patching the VMs.
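The capacity-add life cycle can be viewed as a pipeline of stages, each either automated or still manual; a toy sketch of the lead-time math, with made-up stage names and durations rather than PayPal's real process:

```python
def lead_time_days(stages):
    """Sum the duration of each capacity-onboarding stage.
    stages: list of (name, days, automated) tuples."""
    return sum(days for _, days, _ in stages)

# Hypothetical before/after: automating the post-procurement stages is
# where the weeks come out of the cycle.
before = [("procure", 14, False), ("rack-and-cable", 7, False),
          ("image-and-burn-in", 14, False), ("add-to-cloud", 7, False)]
after = [("procure", 14, False), ("rack-and-cable", 3, True),
         ("image-and-burn-in", 2, True), ("add-to-cloud", 1, True)]
saved = lead_time_days(before) - lead_time_days(after)
```

The structure matters more than the numbers: procurement stays slow, so everything after it has to be nearly instant.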
There is a lot of security patching coming in. How fast can you patch the dev VMs where you don't have access, where they changed the root password and they are not even using the keys? How are you going to patch them? As an infrastructure provider in the private cloud, you are responsible for both. A public cloud provider does not care about the tenant VMs, but in the private cloud you are responsible for the tenants as well as for the cloud infrastructure itself. You have to patch your hypervisors and the tenant VMs. And then enforcing discipline among thousands and thousands of developers. So with that: I think we had a lot of information we wanted to share in the last 35 minutes. I don't know how much time we have got for Q&A, but I am happy to answer some of your questions, and if you have more, you can send me an email. And if you want to be part of this, if you want to learn more, we are hiring as well. So, any questions? Or is it too much information to digest before asking a question?