Well, welcome everyone. Thanks for joining this afternoon. We're going to talk about the performance of VM-based versus containerized Cloud Foundry. I'm Jeff Hobbs, director of engineering at SUSE, and Vlad is one of our system architects. So what are we comparing? We're looking at two standard installations of Cloud Foundry. One is cf-deployment with BOSH on GCP, Google Cloud Platform — the VM-based one. And then there's the standard Helm installation of SUSE Cloud Foundry, the containerized Cloud Foundry, on Google Kubernetes Engine. So you're really looking at the BOSH versus Fissile approaches — two names we all manage to mix up too many times. So what are we doing? What's our goal? We're really trying to validate. You've probably seen a lot of talks about containerizing Cloud Foundry, and there's the presumption that, hey, nothing changes from the user experience, and it's all going to be great from the operator experience. But there are a lot of non-functional requirements in software, and one of the big, important ones is performance. Did it get a lot worse? Did it actually get a lot better? Did it get a little better? What's going on? So we're not really looking for a better-or-worse verdict, just making sure there weren't any critical flaws, because we've gone through exercises like this before and realized, oops, we've added two or three hops in here, and that makes things worse. So it's addressing one of those non-functional requirements in software. The next is: what can we improve? What learnings can we take away from this? So we'll go into the methodology, and here I'll pass off to Vlad, who'll explain exactly what the environments were and the tests that were run. All right. OK, so first, about the environments: we had two VM-based environments and two container environments, split into minimal and heavy. I'm going to explain exactly how much they cost, how many CPUs they had, and so on. We wanted to see: if we take the minimal BOSH deployment as referenced by cf-deployment, what can we get out of that? And if we deploy something at the same cost in the container world, what do we get out of that? So this is how a minimal environment looks for BOSH. There are three VM flavors in a BOSH deployment: tiny VMs with one CPU and 3.75 GB of memory, small with two CPUs, and small high memory with four CPUs, each with the respective memory. This is the cost for each of those VM types. In total, for the minimal deployment you can currently do with cf-deployment on GCP, you reach an estimated cost of about $780 per month, and you get 22 CPUs and 93.5 GB of memory out of it. So that's one of the environments. Its counterpart in the Fissile world: in GKE, you provision nodes that become the workers of your Kubernetes cluster. We used the small flavor, and because it's a homogeneous environment, we just have one type of VM in our cluster. We tried to estimate so that we'd get roughly the same cost. For our environment, we're a bit below. There's a small hiccup here — I think we should have used small high memory, which would have given us a better match on the CPU and memory counts. So the containerized approach takes a bit of a performance hit here, because it has less memory and less CPU at roughly the same cost. So anyway, that's the minimal environment.
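To make the cost-matching exercise concrete, here is a rough back-of-the-envelope sketch. Only the $780 budget and the 22 CPU / 93.5 GB BOSH baseline come from the talk; the per-node prices and the high-memory node size are placeholder assumptions, not the slide's actual GCP rates.

```python
# Back-of-the-envelope cost matching: given the minimal BOSH deployment's
# ~$780/month budget, how many homogeneous GKE nodes of a flavor can we
# afford, and what capacity does that buy? Prices below (and the 26 GB
# high-memory size) are placeholder assumptions, not the slide's numbers.
BUDGET = 780  # $/month, minimal BOSH deployment (22 vCPUs, 93.5 GB total)

FLAVORS = {
    "small":         {"vcpu": 2, "mem_gb": 7.5, "price": 55},   # assumed $
    "small-highmem": {"vcpu": 4, "mem_gb": 26.0, "price": 110},  # assumed $
}

for name, f in FLAVORS.items():
    n = int(BUDGET // f["price"])  # whole nodes the budget covers
    print(f"{name:>14}: {n:2d} nodes -> {n * f['vcpu']:3d} vCPUs, "
          f"{n * f['mem_gb']:6.1f} GB  (BOSH baseline: 22 vCPUs, 93.5 GB)")
```

With real GCP rates plugged in, this is the comparison the talk used to decide how many nodes of one flavor land closest to the VM deployment's cost and capacity.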
And then we have the heavy one. We wanted to push the minimal environments to the brink, where they couldn't accept more applications, run the tests, and see how they perform — and then create more space and see what happens when everything is OK and we don't run out of resources. In this case, you essentially add five more cells to the BOSH deployment, which means four more small high memory VMs, so you get a ton more memory. You also add some routers and some APIs. You end up with a much beefier system. It'll cost you around $1,500 a month, but it'll accept more applications. Again, the Fissile-based environment — the containerized one — is homogeneous. We used 19 small VMs, which gives you 142 GB of RAM and 38 CPUs. Again, had we used small high memory, we would have gotten around the same cost but a much better match on memory. So this is what we're running on. Yes, just to clarify: what we were doing here was trying to create cost-equivalent systems and then see, at cost equivalency, what kind of performance we were getting. We could as well have pushed further and done all small high memory; it would look a little different, but we weren't targeting that. We have a couple of slides at the end for what your results might look like if you did want to do that. So what do we get? Oh, one more thing: what do the tests look like? We used the Locust framework, a Python-based framework for writing agents that do things; it measures how the requests go, and you get reports at the end. Our agents do Cloud Foundry things and can perform these types of actions: they can push the Dora application; they can make requests to random applications — they look at the applications that have been deployed, try to make a request, and see whether it succeeds; they can delete a random application; they can call the stress endpoint of an app; and they can list applications. The Dora application, if you're familiar with it from the Cloud Foundry acceptance tests, has a bunch of endpoints you can use to trigger behavior. One of those is a stress behavior, which essentially starts eating CPU and making intensive IO calls — acting like a bad app or a malicious agent. So sometimes these tests will call the stress endpoint, simulating a bad actor in the cloud. The numbers you see in parentheses are the weights of those actions: it's 50 times more likely to make a request to an app than it is to push an app. We run 20 of these agents across five hours and see what happens. And of course, it's much less likely to delete an app, because we want to build up — we don't want to keep churning and never reach the limit of our cluster; we actually want to push it to the brink. And it's much less likely to actually get a bad actor, but that happens as well. Once it does, essentially one of your CPUs is running at 100% on whatever that stress routine does.
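Here is a rough sketch of the kind of Locust agent just described — illustrative, not the talk's actual test suite. The cf CLI wrapper, the apps domain, and most of the weights are assumptions (the 50:1 request-to-push ratio is from the talk); it assumes a recent Locust (2.x) and the cf CLI on the PATH.

```python
# Sketch of a weighted Locust agent doing Cloud Foundry things.
# cf failures surface as task errors via check=True / raised exceptions.
import random
import subprocess
import time

import requests
from locust import User, between, task

APPS = []                 # apps pushed so far (per worker process)
DOMAIN = "example.com"    # hypothetical shared apps domain


def timed_cf(*args):
    """Run a cf CLI command, returning wall-clock duration in milliseconds."""
    start = time.monotonic()
    subprocess.run(["cf", *args], check=True, capture_output=True)
    return (time.monotonic() - start) * 1000


class CloudFoundryAgent(User):
    wait_time = between(1, 5)

    def record(self, name, elapsed_ms, exc=None):
        # Feed the measurement into Locust's stats so it lands in the report.
        self.environment.events.request.fire(
            request_type="cf", name=name, response_time=elapsed_ms,
            response_length=0, exception=exc, context={})

    @task(2)
    def push_dora(self):
        name = f"dora-{random.randrange(10**6)}"
        self.record("push", timed_cf("push", name, "-p", "dora", "-m", "256M"))
        APPS.append(name)

    @task(100)  # 50x more likely than a push, as on the slide
    def request_random_app(self):
        if APPS:
            start = time.monotonic()
            resp = requests.get(f"http://{random.choice(APPS)}.{DOMAIN}/")
            self.record("request", (time.monotonic() - start) * 1000,
                        None if resp.ok else Exception(resp.status_code))

    @task(5)
    def list_apps(self):
        self.record("list", timed_cf("apps"))

    @task(1)  # rare: delete, so the cluster keeps filling up
    def delete_random_app(self):
        if APPS:
            name = APPS.pop(random.randrange(len(APPS)))
            self.record("delete", timed_cf("delete", name, "-f"))

    @task(1)  # rare: turn an app into a CPU/IO-hungry bad actor;
    def stress_random_app(self):  # endpoint path as in the CF acceptance-test Dora app
        if APPS:
            requests.get(f"http://{random.choice(APPS)}.{DOMAIN}/stress_testers")
```

You'd run it with something like `locust -f agents.py --headless --users 20 --run-time 5h` to approximate the 20-agents-over-five-hours setup from the talk.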
OK, so results. We're going to go through some charts, and I'll try to explain what they mean and what we're trying to learn from them. First, pushing applications in the minimal environment: success versus failure. In green, we have how many apps were successfully pushed in the Fissile environment; in blue, how many were pushed successfully in the BOSH environment; red is failure for BOSH, and orange is failure for Fissile. What you want to see, when you have sufficient space, is that failures are zero or nil. You'll see that in the large, heavy tests — we never reach the capacity of that cloud, so those will be zero. You never want to see errors here. So what do we see? How can you interpret this? Well, using roughly the same amount of money — around $780 — we can push many more apps in the Fissile-based deployment. It'll take more applications, because you can pack more in, given that it's a homogeneous environment and you run cells on each of the nodes that make up your cluster. And obviously, there are fewer errors, because you push more and more applications and reach capacity later. Then, push duration: how long does it take to push an app on each of these environments? You don't want to see a large difference here. We do see that the BOSH one, on average, takes a bit longer in the minimal environment, and it goes up as the cluster runs out of resources. The Fissile one also runs out of resources and levels off here. But the difference is not that large: we have 50 seconds here and 55 seconds there. Not that big of a difference. And then application requests: how long does it take to send a request to the application and get a response back? Here it's very similar; there's little difference. The Fissile one is a bit faster — I don't know exactly why. It's maybe something we should investigate, and maybe the cf-deployment topology can change a bit. Our supposition is that GKE is just operating closer to bare metal, so you're taking one layer of virtualization out and getting a little advantage. OK, so those were the charts for the minimal deployment. What we can learn from them: there's no real difference when it comes to application requests, so the networking part seems to be the same — you're not taking a hit by doing containers in containers. And you do get a benefit in application density: you can push more and get more out of your cluster, so for the same amount of money you could run more applications. OK, so this is the heavy environment. In this case, we have much more space; the cluster doesn't run out of resources, so you can push many more apps, and everything runs OK. Here we see that the BOSH VM-based one was able to push a few more applications than the Fissile-based one. There are no errors for either, which is great: you can keep pushing apps, they're both stable, and they both let the developer run their applications without error. How long does it take to push an app? You want this number to be lower. In the case of BOSH, when we added resources, the time to push went down a lot, and the average dropped. For Fissile, it stayed the same as in the minimal test. We think this is because of our buildpacks: the buildpacks we ship with SCF have bits for more stacks — we have openSUSE 42 and SLE — so the buildpacks are bigger, they might generate more traffic, and it might take longer to push these applications. It could also be that we're not doing as much caching, because we have online buildpacks. So it's probably one of those. It's something we learned, and we're going to take a look at it.
And this also makes sense: if we push applications a bit slower, it makes sense that BOSH gains an advantage in how many applications you can push inside five hours. So a bit of an advantage there. And finally, the application request duration. This one is basically the same — you can't see any difference between BOSH and Fissile, which is great. In the end, I think this is the one we were most scared about. Once you push your application, you're not going to push one app every 10 seconds forever, but you are going to get a lot of requests to your applications, and this is what matters — that we don't see discrepancies here. Oh, and by the way, the Fissile deployment with Helm was done with load balancers, and the VM-based one was done with load balancers as well, on GCP. As you probably know, in Kubernetes we also have Services, so there's an extra network layer in there — I'll show a rough sketch of that in a moment. Everyone was, of course, afraid that it might cost us some time. But it doesn't, so that's great. OK. So we learned that we don't seem to have made any huge errors when containerizing Cloud Foundry. We are, after all, running the exact same bits — that's why we're certified by the Foundation — so the exact same bits that run in VMs are running in containers as well. The infrastructure that Kubernetes adds doesn't slow down application requests. We do have to learn from these differences. I think the VM team — the cf-deployment team — could take a look at these and think about how to reduce the initial footprint that a VM deployment brings, because there are a lot of resources lost when you just start with 15 VMs: you spend a lot of money, and you can't really push a lot of apps. And of course, we can look at the reason why push takes a bit longer with Fissile. So what's next? We want to add more test scenarios — binding to applications, applications that use services, things like that. We want to add these performance tests to our pipeline: when we release SCF, we run it through a whole battery of tests, including CATs and RATs, plus our own set of acceptance tests; we want to add these performance tests and grab a snapshot of the data each time we release. And then, of course, we want to investigate all the small discrepancies that we saw. You've probably heard a lot about containerization and Eirini. For containerization, with regard to performance, you have to think about adoption of cf-deployment: we want to be able to deploy our containerized Cloud Foundry the same way the VM-based one is deployed, have the same composition of roles, and be able to manage it using a Kube operator — which means we won't have monit anymore. We'll further reduce the number of things that run in order to have a CF deployment, so our footprint will be even smaller. One aspect of the containerization is, again, that we're seeking to keep a certified implementation, and there are some limitations in containerization — or in going beyond containerization to Kubification. This is part of the ongoing conversation we're having: if we're not thinking only in terms of VMs anymore, how else could we refactor and gain further efficiencies? The operator is a big piece of that.
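To picture the Service layer mentioned a moment ago — the extra hop everyone worried about — here is a minimal sketch of a LoadBalancer Service sitting in front of the router pods. Names, labels, and ports are hypothetical, not SCF's actual Helm output.

```python
# Minimal sketch of a LoadBalancer Service in front of the gorouter pods.
# Names, labels, and ports are illustrative, not SCF's real manifest.
import yaml  # PyYAML

router_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "router-public"},
    "spec": {
        "type": "LoadBalancer",          # cloud LB -> Service -> router pod
        "selector": {"app": "router"},   # matches the router pods
        "ports": [
            {"name": "http",  "port": 80,  "targetPort": 8080},
            {"name": "https", "port": 443, "targetPort": 443},
        ],
    },
}
print(yaml.safe_dump(router_service, sort_keys=False))
```

The measurement in the talk suggests this extra Service indirection doesn't show up in application request latency.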
And an even bigger piece is? Eirini. So — not to compete with the Eirini talks from earlier — one note: all of these numbers were measured, as we mentioned, with Diego and the containers-in-containers approach. Eirini is pure CF push to Kubernetes scheduling: you remove Diego, remove all of the extra stuff that Diego adds, and rely purely on Kubernetes for your scheduling. Apps would go right into a namespace next to your other apps. This would also improve general efficiency: you don't have to worry about how many Diego cells you have; you're really only thinking at the platform layer — how much space does my Kubernetes platform have? I wanted to add one more thing on this. We do have some control now over how we optimize the topology when you deploy to Kubernetes. A good example is anti-affinity rules between routers and cells. We noticed that if they were collocated — if the same Kubernetes node were to run your router and your cell — performance would drop drastically, because the router needs a lot of CPU, your app might compete with that, and then everything gets slowed down a lot. We can do that today, but with the Kube operator we might get even better at it, and we might be able to share that knowledge with cf-deployment. So that's going to be pretty great — I'll show a rough sketch of such a rule in a minute. So, per Jeff's request, we have a bonus round in the slides: what if you wanted to run a minimal SCF? We looked at the minimal BOSH — it takes 15 VMs or so. But what if I wanted to run a very small footprint of SCF? What does that look like, and how does it compare to the minimal BOSH? What we usually do when we deploy SCF minimally is use around 20 GB of RAM. That means three small VMs here, which gets you about 22 GB of memory at a cost of approximately $247 a month. So what does that buy you? Again, we see the success-versus-failure chart for apps. You see that BOSH does allow more applications in, but remember, it's also three times as expensive. And we have successes for Fissile: you can push about 80 — I think it was 83 — applications. Each Dora app is set to use 256 MB of memory. So in this case you can run 83, and BOSH can run 120 — but you paid a lot more for the BOSH one. It turns out you'd be paying about $6 per app in the BOSH case (roughly $780 for 120 apps) and about $3 per app in the Fissile case (roughly $247 for 83 apps) — if, of course, you pack all of your applications into the smallest environment possible. Again, average push duration: the Fissile one is a bit slower here, but otherwise similar — you get an average of 80 seconds to push an application with the Fissile one versus about 50 to 60 seconds for the BOSH one. And again, the application request duration is virtually the same; there's almost no difference. The Fissile one takes just a bit longer because we're running in such a constrained environment. But you can get started more quickly, and this is a much more affordable number, I think, to get started with a CF deployment. Another thing we're taking away from all these tests: as you saw, it's easiest in a Kubernetes environment to use a homogeneous set of nodes. We chose small; we didn't choose small high memory. We think — we didn't test it — that small high memory wouldn't give us the proper vCPU-to-memory ratio. Or rather, there'd be too few vCPUs, with a ton of memory around each one. We'd actually like to test that and see what the real impact is. And part of all of this is that we do these tests and then recommend reference architectures.
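Returning to the router/cell anti-affinity point from a moment ago, here is a minimal sketch of what such a rule can look like in a pod spec, in the same Python-to-YAML style as before. The labels are illustrative, not SCF's actual chart values.

```python
# Minimal sketch of a pod anti-affinity rule that keeps router pods off
# the nodes running Diego cells. Labels here are illustrative.
import yaml  # PyYAML

router_anti_affinity = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                # topologyKey defines what "collocated" means: same node
                "topologyKey": "kubernetes.io/hostname",
                "labelSelector": {
                    "matchLabels": {"app": "diego-cell"},
                },
            }],
        },
    },
}
print(yaml.safe_dump(router_anti_affinity, sort_keys=False))
```

A hard (`required`) rule like this refuses to schedule a router onto any node already running a cell; a `preferred` variant would merely bias the scheduler away from collocation.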
We've been recommending roughly this one-to-four CPU-to-memory ratio, and to move to something different, we'd have to see what the real impact is. So that's another takeaway for us. Because if you did that, we could have had just one small high memory and cut another 33% off of this cost. But I'm not quite sure what the app push duration would have looked like. And now I have to approve another couple thousand dollars' worth of cloud time for Vlad to figure this out. Yeah. I think the bottom line is that if you were ever worried that a containerized approach was not as stable or not as performant, you don't have to worry. It turns out it is stable, and it'll save you money. So yeah, we think it's a good choice. That's it — let's take some questions. I guess I'll make an observation first. Does it mean that containers on containers on containers doesn't matter? There were a lot of questions about that. One, you're adding in some extra with just Kubernetes itself — ingress controllers and all this — and then containers in containers on the back end. And the answer is: no, apparently it doesn't matter. You are sharing the kernel; it's not like you're running VMs in VMs — that would be a terrible idea. With containers in containers, you're still using the same kernel as the host. Now, if you were doing heavy math calculations — I don't know, computing pi for some reason — you might find there's a difference there, maybe. But that's not the right test for this layer. Yeah, so for our purposes, for web applications, I don't think it matters. Do you intend to try different workloads, instead of just Dora? Oh yeah, definitely, and more frameworks. Dora is Ruby, by the way. We want to push different apps too and exercise all of the buildpacks. Any questions? I can't believe there aren't any questions. I have a question too — but yeah, go for it. Was the BOSH deployment using online buildpacks as well? No, the default is the built-in ones now. OK. My question is more from a scientist's perspective. I know you touched on the methodology and everything, but the key question is: can somebody else reproduce this? Do you have enough information out there that people can go and try it — on GKE, or in other places? My goal is to have a Docker image you point at a Cloud Foundry cluster, and it spits out these charts for you. I'm almost there, yeah? Almost there. It's going to be open source, and you'll just be able to run the image, point it, and it'll spit out these charts for you. Cool. Thank you. No other questions? OK. Go run your Cloud Foundry on containers — it seems like it'll work just fine if you do. Thank you, Vlad. And Jeff, appreciate it. Thanks so much.