Hello everybody, thanks for coming. I'm glad you could find a seat. My name is Eli, and this is my associate Ty. We'll be talking about what we did with OpenStack and Cloud Foundry at Gap. We cut this presentation down from the one I uploaded, because we had less time to talk than I assumed we would. So if you have questions, we'll hang around after, or out in the hall; happy to answer anything. All right, let's get right into it. Platform choice. This was a couple of years ago, and OpenStack was not yet one of the platforms in the reference architecture, so we had to make a few choices. Luckily, we aligned pretty well with what everyone else ended up doing. Our choice was between deploying on our OpenStack or on VMware. When we originally created our private cloud in 2013, it was basically to combat speed of deployment. On the VMware infrastructure, an engineer would create your VM for you, and it might take a week, it might take a couple of weeks, depending on how busy they were. We wanted that to be hours or minutes. So we deployed OpenStack, and it worked great for us. So when we came to look at Cloud Foundry, it was natural for us to decide to put it on OpenStack. With Cloud Foundry, there were a couple of questions we really had when we looked at the business and what we were trying to do. We were trying to create a completely new development ecosystem. We were using the old style, where we wrote code, we had SVN, we would check things in and check them out, but deployments were still manual, and the software was still manually packaged. Gap had written its own revision management control system to handle configuration management, and we were phasing that out and phasing in new ways of doing those things. So we came together and said, here's this whole ecosystem; we call it the rapid delivery platform. And Cloud Foundry was the piece to get us over the hump to microservice-oriented architectures and 12-factor applications. We wanted a very opinionated platform, because we wanted to prevent bad design decisions from developers and give them guardrails to help us get where we really needed to be. When we looked at Cloud Foundry versus other container platforms, like Docker or Kubernetes, it was way too early in that product timeline, and Docker had some issues of its own at the time that really excluded it. We also needed a robust product. We were running a lot of e-commerce on this platform, taking live orders and credit card payment data for all of our websites, for Gap, Old Navy, Banana Republic, and so on. So we needed something with good vendor support and an ecosystem that was going to grow over time. And finally we needed the traditional technical requirements: load balancing, scaling, self-healing, these kinds of things. The Cloud Foundry choice has served us very well in that regard. It is very opinionated about how to do things; regularly we go back to our developers and say, nope, don't do that, the platform doesn't let you do that for a reason. All right, back to OpenStack. We needed to figure out how OpenStack should look, and we combined the Cloud Foundry release with our new Next Generation Clouds. These included object storage, block storage, a better SDN, and of course Nova compute.
So as we were looking at deploying, one of the arguments for VMware, obviously, is HA for all your compute instances. Also, our OpenStack scheduler wasn't rack-aware, it wasn't switch-aware, and we didn't want to overload anything, right? Also, Cloud Foundry is very iSCSI heavy, so we needed to make sure we weren't going to have any network latency issues. And lastly, our Cloud Foundry instances needed a large memory footprint, and generally we did not allow users to build large-memory VMs in OpenStack. If they wanted some crazy 64 gig VM, they could go to VMware and do that; I didn't want it. So we created a dedicated non-routable storage network for iSCSI traffic. That was important to me, and it's worked out well for us, but we lost a little at the top of the rack, maybe a little less compute per rack. We also took on additional expense in creating HA bonds across our network cards, adding an extra network card, and making sure our cards did iSCSI offload and, for the SDN, VXLAN offload, which basically means you need two of the exact same card, right? But it's been great. We also decided to do 32 gig Diego cells, and then of course some smaller instances. Our CPU oversubscription was really low, because we're not CPU bound in Cloud Foundry, we're memory bound. And we never oversubscribe memory; that's just a blanket rule. So we did up our CPU oversubscription to six to one, and we found that to be perfect for Cloud Foundry. But that meant we needed larger memory footprints on each of our compute hosts. So as you see, on a 40-core hypervisor, we needed 512 gig of memory. The great thing about Cloud Foundry and BOSH doing the deploys is that I just picked two random hypervisors and they're really well balanced. I mean, they're within a few gig on disk space, they're using about the same amount of CPU and about the same amount of memory. So it was not a problem doing it this way.
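Just to make those numbers concrete, here's a rough back-of-the-envelope sketch of how one of those hypervisors fills up under a 6:1 CPU ratio and a 1:1 memory rule. The Diego cell flavor shape in it is an illustrative assumption, not our exact flavor definition.

```python
# Rough capacity math for one hypervisor under the ratios described above.
# The Diego cell flavor (4 vCPU / 32 GB) is an illustrative assumption.

PHYSICAL_CORES = 40
PHYSICAL_RAM_GB = 512
CPU_RATIO = 6.0   # 6:1 CPU oversubscription
RAM_RATIO = 1.0   # never oversubscribe memory

schedulable_vcpus = PHYSICAL_CORES * CPU_RATIO      # 240 schedulable vCPUs
schedulable_ram_gb = PHYSICAL_RAM_GB * RAM_RATIO    # 512 GB schedulable RAM

CELL_VCPUS, CELL_RAM_GB = 4, 32   # hypothetical Diego cell flavor

by_cpu = schedulable_vcpus // CELL_VCPUS    # cells that fit by CPU
by_ram = schedulable_ram_gb // CELL_RAM_GB  # cells that fit by RAM

print(f"cells per hypervisor, CPU-bound: {by_cpu:.0f}")
print(f"cells per hypervisor, RAM-bound: {by_ram:.0f}")
# Memory, not CPU, is the binding constraint -- which is exactly why the
# hypervisors needed the larger 512 GB footprint.
```

Running it shows memory runs out long before CPU does, which is the whole reason for the blanket no-memory-oversubscription rule.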
So like I alluded to before, one of the things as we went into this is that we had to provide guardrails for microservices. We wanted to prevent IaaS lift and shift. In our organization, when a lot of people transitioned to new platforms, like when they went from physical hosts to VMware for virtualization, it was just pick it up, transfer it over, and boom, we migrated, we got the new platform in, and we're good to go. Whereas with PCF, you're going to have to refactor your applications. The developers are going to have to go back and rewrite those applications. And even now that we've been on this journey since 2014, 2015, they're still having issues where they try to violate the microservice model, and that's just the nature of the developer mindset, depending on the developers you get. So we really wanted to prevent that lift-and-shift pattern. We also wanted to prevent things like file read/write. Because of how we designed everything, we have some people who use NFS file volumes to read content for websites and dynamically build web pages, right? And they come to us and say, we have to have file I/O for NFS volumes in PCF, you have to enable this; Pivotal wrote something called file volume services for you to implement, put that in, we have to have it. We went through it with Pivotal and our senior management, and the answer that came back was: no, we're not going to do that for you. That's a design pattern that we want to prevent. It causes all kinds of problems with latency. You're going to start having problems talking to NFS shares. How is your container going to talk over the network to make that happen? There are all kinds of problems that's going to bring up. So it's an anti-pattern we don't want to support. The other thing is that the platform handles all the metrics and logging, right? That's a great thing about PCF: you don't have to manage your logging or how you do it, it's managed for you. It goes to Splunk, which is just how we manage our log searching. In that regard, developers don't have to deal with it. We have a New Relic agent that we use for capturing metrics, so you bind your service to the autoscaler and New Relic, and then you get your logs; pushing your application is pretty simple, right? You just write to standard out and you just push your app. That brings some challenges with it, but it's been a good design and planning decision, just taking care of that for them. And then, know your requirements. We had a large team: we had storage engineers, we had OpenStack cloud engineers, we had business process people, we had the lead developers, and we had multiple architects come together and put this together. If we hadn't done that, there's no way it would have succeeded. If the OpenStack private cloud team, which is where I started, came and said, I want to bring PCF in, let's do this and transform the business, nobody would listen to me. You must have senior management buy-in. You must have multiple different groups come together to put this thing in. And you need all that support later, when you go, hey, they're violating design principles and it's causing problems; you need that senior management buy-in to say, yep, you're right, let's help steer that correctly and get it moving. So those are some of the challenges we had doing that. So, guardrails. Like we've talked about, you probably know PCF is pretty opinionated, but in terms of those guardrails: microservices and 12-factor apps are what we basically told people. Hey, 12-factor is the model you want to use to redesign your applications: no state, minimal scheduling, minimal file I/O, those kinds of things. You're chaining from one thing to another: do one thing, do it well, pass off to the next. CI/CD best practices have been a real challenge for us. In getting pipelines enabled, one of the things we find is that we have a team who implemented pipelines, and the problem is every developer shares those pipelines but doesn't know how they work. So when their applications fail, they're not sure how to troubleshoot them. But having a generic pipeline, pushing that out, doing cloud-native blue-green deployments, where you can switch and deploy code without affecting production, multiple instance counts, these kinds of things: those are the guardrails we wanted. And then the tools, the platform specification, the business process, it's all of those things together that really make it a platform. It's not just PCF, which is just a component; it's all the other things you're using as the back-end engine.
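Coming back to the blue-green piece for a second, here's a minimal sketch of what one of those generic pipeline steps boils down to, assuming the v6-era cf CLI flags and made-up app, domain, and route names; a real pipeline would wrap this with smoke tests and rollback handling.

```python
# Minimal blue-green push sketch driven through the cf CLI (v6-era flags).
# App name, domain, and hostname below are hypothetical.
import subprocess

def cf(*args):
    """Run one cf CLI command and stop if it fails."""
    subprocess.run(["cf", *args], check=True)

def blue_green_deploy(app, domain, hostname):
    green = f"{app}-green"
    # 1. Push the new version next to the old one, on a temporary route.
    cf("push", green, "-f", "manifest.yml", "-n", f"{hostname}-green")
    # (smoke-test the green route here before touching production traffic)
    # 2. Map the production route onto the new version; both now get traffic.
    cf("map-route", green, domain, "--hostname", hostname)
    # 3. Drain and remove the old version.
    cf("unmap-route", app, domain, "--hostname", hostname)
    cf("delete", app, "-f")
    # 4. Let the new version take over the canonical name.
    cf("rename", green, app)

if __name__ == "__main__":
    blue_green_deploy("orders-api", "apps.example.com", "orders")
```

The point is just that production traffic only moves once the new instances are already up, so a bad push never takes the route down.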
So how do we prevent bad design decisions, and how do we do that consistently? We have something called an ARB, an architectural review board, and anytime somebody wants to push something out, it goes through that. Probably the most recent example I can use is people wanting to start using Node.js. We had not used Node.js up until that point in Cloud Foundry; we were only using the Java build pack, and we had a couple of teams doing PHP, believe it or not. But when they came to us with Node.js, we said, wait a minute. There's a lot you can do in Node.js, and we're going to restrict what you do with it: here are the things that are approved, and here are the things that are not. The other one was NGINX. They wanted to push NGINX into PCF, and that seems like a really bad design pattern: use NGINX as a proxy inside PCF? Yeah, no, I don't want you to do that. There are all kinds of problems that will happen with that just by the nature of the routing inside PCF, right? Now you're going to start proxying things on top of it and you're going to create a spaghetti mess that nobody can figure out. And that's the real issue here: maintenance, maintainability, and performance. So it really comes down to that whole "yes, you can" thing. It's like when we were having the argument about file volume services, and our lead development architect made a really good point: I can implement a web browser in assembly language and run it on a Commodore 64. Should I do that for the business? No, because nobody can debug it, nobody can maintain it, it's your own code, nobody will know what's wrong with it, it will fail eventually, and you will have to fix it. And the chance of it costing you millions of dollars is a real risk. So that's why I say, yes you can, but should you? Ian Malcolm famously asked the same question in Jurassic Park: your scientists were so preoccupied with whether they could that they didn't stop to think whether they should. It's a bit of a generic quote, but that's really what we're talking about. And then your service offerings: how do we decide which services we're offering and what's correct? We have New Relic, we have the Autoscaler, we have the Splunk tile service that we use to ship all the logs over, and some of those services are good. But the other problem we've run into with service offerings is that if you're in a resource-constrained environment, the services within PCF, such as Rabbit and Redis and MySQL, all take up their own instances that run inside the foundation. They all have their own resource requirements depending on how many people are using them. And then the other issue we run into is that everybody wants to customize their Rabbit usage, their queue sizes, how many replicated queues they have, or they want to customize Redis to their cache sizing. Initially, when we rolled out services in PCF, it was just a shared service: everybody gets the same thing, you can't customize it, sorry, because you're sharing it with everyone else in the foundation. So we kept it generic. And the real issue was resource constraints, plus some bugs within the actual services. With RabbitMQ there are quite a few challenges, things it doesn't handle well, like a sudden service outage: you have applications connected to RabbitMQ and you have queued messages, and if you disconnect and restart that service, you lose the queued messages. That's kind of a big deal when we have sourcing systems feeding into Rabbit, and consumers working out how much product they need to deliver out to different data centers or distribution centers. It's a bit of a problem, right? So that's one of the reasons service offerings have been a challenge for us on private cloud.
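On that queue-loss point, the usual application-side mitigation is durable queues plus persistent messages. Here's a minimal publish-side sketch with the pika client; the connection URL, queue name, and payload are placeholders, and it only helps if the broker keeps its disk across the restart.

```python
# Publish-side sketch with the pika client: a durable queue plus persistent
# messages, so queued work can survive a broker restart (assuming the broker
# keeps its disk). Connection URL, queue name, and payload are placeholders.
import pika

connection = pika.BlockingConnection(
    pika.URLParameters("amqp://user:pass@rabbit.example.com:5672/%2F"))
channel = connection.channel()

# durable=True: the queue definition itself survives a broker restart.
channel.queue_declare(queue="replenishment", durable=True)

# delivery_mode=2: mark the message persistent so it is written to disk
# rather than held only in memory.
channel.basic_publish(
    exchange="",
    routing_key="replenishment",
    body=b"store=1234 sku=987 qty=12",
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```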
Now, in public cloud, that's completely different. We don't have resource constraints there; it's just how much we want to pay for, not how much physical hardware we have to run the capacity on. So things are changing. And with PCF 2.0, things are really changing with services and their architectures, in that you can have customized service instances that are your own, versus a shared service, and that makes a big difference. All right. So, some lessons we have learned on our journey. Some of them will be pretty obvious, or should have been. In OpenStack, we had a problem where CPU soft-lockup errors were happening in VMs with large memory footprints, and we were thinking, what the heck? It turned out to be KSM on the hypervisor; you've got to turn that off. Because when pages are scanned and de-duplication is done, nobody can access that memory for that microsecond, or however long it takes. So disabling KSM was a huge performance gain. Have enough memory on your hypervisors, don't oversubscribe memory, and don't let any memory de-duplication occur. Also, while we really wanted to do Ceph for our object storage, it was not something we could fully implement at the time. So we went with Swift, which works great for blob storage but not great for Cloud Foundry; there's not quite enough S3 support. We considered Ceph, and we ended up going with Riak for our blob storage. And then block storage is a critical piece of the architecture for Cloud Foundry. Your block storage has to be available, it has to be accessible, and it has to stay accessible long term, because Cloud Foundry might not talk to that block storage for a month, but when it needs to talk to it, it needs to talk to it. So tweaking multipathd is critical, and tweaking your iSCSI subsystems is critical. We found that some of our hypervisors, for no particular reason, just weren't multipathing; there was one path. And we ended up working a lot with the storage vendors and the storage team to tweak our configs on each hypervisor. So it's an important consideration. We actually spent some time between Rackspace, ourselves, and EMC on the XtremIO, and there are actually problems with the XtremIO OpenStack drivers talking to multipathd where they can get to the point of only consuming a single path, and they can also overload the XtremIO queue size. So we had to tune things on the XtremIO side to handle our iSCSI requests. That's just one example of why this is critical, and here's how it relates to PCF: you do a BOSH deployment for your Diego cells, you're going to roll out your stemcell updates, and it's been a couple of months since you've done that. When it goes to disconnect the block storage from the Diego cells, it times out. And when that happens, the BOSH deployment fails, and all it basically tells you is, I can't talk to the block storage. You can go into OpenStack and force those operations through, but they do block deployments until you resolve the issue. And the cause is that the hypervisor has lost the multipath handle it uses to talk to that block storage. So it's one of the issues you've got to manage.
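Going back to the KSM point for a second, this is roughly the kind of thing we ended up enforcing on every hypervisor. The sketch below just pokes the standard KSM sysfs interface; in practice you'd persist it through your config management rather than running a one-off script.

```python
# Check and disable KSM on a compute host through its sysfs interface.
# Run as root on the hypervisor.
KSM_RUN = "/sys/kernel/mm/ksm/run"
KSM_PAGES_SHARED = "/sys/kernel/mm/ksm/pages_shared"

with open(KSM_PAGES_SHARED) as f:
    print("pages currently de-duplicated by KSM:", f.read().strip())

# 0 = stop the ksmd daemon (already-merged pages stay shared);
# writing 2 instead also unmerges everything that was shared.
with open(KSM_RUN, "w") as f:
    f.write("0")
```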
From the Cloud Foundry side, one of the biggest headaches I get, and I get it regularly, is that our object store is a critical piece, right? Riak generally works pretty well. We use Riak and Riak CS; we haven't migrated to the new Riak KV yet. Riak CS is the cluster services piece, the REST API you use to talk to your buckets, and Riak is the federated storage cluster itself. What we've come to find in our Riak stores is that occasionally our Riak nodes die, and it's an out-of-memory problem; it's a memory leak is what it is. The problem is with the Riak CS nodes. We have a VIP fronting them, an F5 hardware load balancer, and we've got five Riak nodes, so we should be able to tolerate two failures and be okay. But the Riak CS cluster service talks to the Riak store on localhost, so if you lose the localhost Riak store, you've lost that Riak CS node as well. And it's round robin from the F5, because there's no great health check to figure out whether Riak itself has died. So while I have a load balancer to balance traffic across, I don't have a way to intelligently ask, is this node healthy or not? Now, if your object store dies in your Cloud Foundry instance, most of your running applications don't notice. What you're really blocking is deployments: they can't talk to the object store to upload things, and you can't download a build pack if you have foundation-based build packs. If you're trying to push new things or create new containers, it has to talk to the object store. So it doesn't affect your running production cluster, but it does affect people's ability to push new things. That's why the object store is kind of critical. The second thing was the services; I mentioned Redis and Rabbit, and we had some issues with those. We have a constant business problem where nobody wants to take any downtime, everybody wants to be up 24/7, and the teams don't want to babysit their applications. Those are all common demands, and it's really about how you handle them. The real issue we ran into is that people complained quite a bit when we would do maintenance and the only effect would be that we restarted the services within PCF. Well, the problem is, for Rabbit for example, if you don't have special reconnect code, and Pivotal provides it on their website, if you don't have reconnect code for your Rabbit connections, then when we restart the Rabbit service your application crashes. It loses its binding; it's gone. And you have to restart it. That seems like such a simple thing, but when you're doing it at 2 a.m. and nobody's awake until 10 a.m. the next day, it's a big deal. So communicating with your customers in that regard is one of the biggest lessons we learned, along with providing best practices: hey, you need this reconnect code; if you're using Rabbit in PCF, you must have this as part of your application. It will allow that service to restart, and you can reconnect and move on.
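To give you the flavor of that reconnect pattern, here's a minimal consumer sketch using the pika client; the queue name and credentials are placeholders, and Spring AMQP users would turn on the equivalent recovery settings instead.

```python
# Consumer-side reconnect sketch with the pika client. The point is that a
# broker restart becomes a short pause instead of a crashed application.
import time
import pika

PARAMS = pika.URLParameters("amqp://user:pass@rabbit.example.com:5672/%2F")

def handle(channel, method, properties, body):
    print("got message:", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

while True:
    try:
        connection = pika.BlockingConnection(PARAMS)
        channel = connection.channel()
        channel.queue_declare(queue="orders", durable=True)
        channel.basic_consume(queue="orders", on_message_callback=handle)
        channel.start_consuming()
    except pika.exceptions.AMQPConnectionError:
        # The broker went away (for example, a platform maintenance window
        # restarted the service). Back off briefly, then rebuild the
        # connection and the consumer instead of dying.
        time.sleep(5)
```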
The other one was scalability. We had a lot of people who, when we would do maintenance and their applications restarted, would get upset, and we'd say, you know, you can manage this if you deploy more than one instance of your application; it's a best practice for PCF. Okay, well, let's go take a look. I have approximately 500 unique applications deployed in my production PCF, and I did a quick query against the REST API and found that 380 of them were single-instance deployments. Guys, you have to deploy multiple instances to get resilience when something dies and breaks. The platform has self-healing if you follow the rules, right? Then the other one is build packs. We had to take a public build pack, fork it, and put in all of our custom CAs, because we're our own certificate authority; we issue all of our own certificates for internal stuff that isn't public facing. Well, there's no simple way to inject that into a Cloud Foundry app that gets deployed unless it's part of the build pack. So now we have a build pack stuck at the 3.8 Java build pack version, which is really old and has problems. It has memory management problems; we get a lot of out-of-memory errors. You can end up with the heap set right up against the memory you're actually giving the container: if you set it to four gig and your app's container is at four gig, you will get out-of-memory problems on the 3.8 Java build pack. If you move up to the 4.x build pack, you don't have that problem anymore; the memory management uses a different algorithm, it's handled differently, it knows how to pad beyond the size you set, and it handles it better. But getting people migrated, right? If we just suddenly pull the old one out of the foundation, what happens? People can't deploy anymore, because they're baking it into their Gradle scripts, their CI/CD, the actual build pack named in the manifest when they push. Which is not a best practice, but if you're pulling things down and maintaining them yourself, this is one of the ramifications you have to deal with. So it's really just end-user communication, really talking to your developers and having a two-way conversation. Not just, here's your foundation, go deploy your thing, and when it fails, let me know. It's really more of a: what are you doing, what is it you're trying to accomplish, okay, here's what's going on, here are some best practices, here's how you consume this platform. That's really what we learned. And it's not magic, right? Everybody knows you have to talk to people, you have to communicate, you have to have two-way communication between your systems people and your developers. That's really what stood out as we sat back and took a look at the whole thing. Sounds so easy, doesn't it? Yeah, it does. It really does seem like such a simple thing.
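And going back to that instance-count query for a second, it was roughly this; the sketch assumes the v2 Cloud Controller API and a token pulled from `cf oauth-token`, with placeholder environment variable names.

```python
# Count single-instance apps through the v2 Cloud Controller API.
# CF_API is the API endpoint, CF_TOKEN is the output of `cf oauth-token`.
import os
import requests

API = os.environ["CF_API"]        # e.g. https://api.sys.example.com
HEADERS = {"Authorization": os.environ["CF_TOKEN"]}

single = total = 0
url = f"{API}/v2/apps?results-per-page=100"
while url:
    # If the API cert is signed by an internal CA, point `verify=` at that bundle.
    page = requests.get(url, headers=HEADERS).json()
    for app in page["resources"]:
        entity = app["entity"]
        if entity["state"] == "STARTED":
            total += 1
            if entity["instances"] == 1:
                single += 1
    # next_url is a relative path like /v2/apps?page=2, or null on the last page.
    url = API + page["next_url"] if page.get("next_url") else None

print(f"{single} of {total} running apps are single-instance")
```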
But then the final one I have up there is app restarts versus downtime. When I'm in the change review boards, when we have all of our changes up about what we want to do and whether we have management approval to do it, they ask, is there any downtime? And we say, no: the platform stays up, it never goes down, applications that are running stay running. Now, you could have an app restart. Well, that's downtime. No, it's not; it's an app restart, it's a container. This happens all the time; they're throwaway. That's not a monolithic application. So that education also extends to the people in business and management who have been used to managing IT systems but are not in the weeds on the technical details. You have to communicate and educate them: hey, it's a bit different now, here's how it works; remember how we did this before? Well, it's like this now. And they have to believe you, right? You have to get over that whole trust hurdle with them and say, hey, I understand the job you're trying to do with the change management process, and here's what I'm trying to achieve with mine; when we work together, we get there faster. So that's really the user education piece, and it can be really challenging, especially depending on whether or not you have a hostile environment. All right, so he reminded me: one of the slides we pulled out was a little diagram of how, when he was talking about single-instance deployments, we managed our HA, and that's by separating physical resources at the aggregate level, which is part of the reference architecture, but also separating them across the data center. So this rack, which is part of the first PCF, or Cloud Foundry, aggregate, has this power feed, this end-of-row switch, this core switch, while this rack has completely different ones. And this rack might share a few, because you only have so many core switches, but the idea is that if you lose a UPS, if you lose a switch, if you lose a rack, you stay up. So that's something to consider if you're laying out OpenStack for this. And I believe that brings us to the summary, which you've heard ten times, but: plan ahead, design for failure, partner with people, and educate everyone, including yourself. I don't like that part. So we are pretty much at time, but any questions? Happy to answer. All right, thank you for coming.