Okay, so today we're going to talk about all the architectures you're actually exposed to. You might have gone to many sessions and seen a lot of things changing, right? So you may be wondering: is somebody looking at the performance, scalability, and hardening of all these different frameworks? That's the first thing. But you may also be wondering: all these things are talking to each other, they're getting integrated, so is somebody looking at whether they can scale when they work together? That is exactly what I'm going to talk about today. Luckily, we have folks looking at exactly how these frameworks work together, right from the architecture point of view. Most of the time you look at this as an end user, from multiple industries, because you need to run your business. For instance, some of the healthcare providers I work with use Cloud Foundry and have consumer solutions rolled out to almost 11 million customers, and they expect a certain level of performance and scalability. The same is the case with banking, airlines, car rental, and consumer appliances. So when all of these industry customers look at the changing architectures, it's cool that you're getting new things, but how are things being performance-tested and hardened so that you can go to production? When you look at all these open source frameworks, Kubernetes, Cloud Foundry, Istio, containerd, these are all individual communities working on getting these frameworks out, right?
But what is happening is that all these things are merging now, or at least there are points of integration between them. All of you might have seen Eirini, for instance, where you have Cloud Foundry and Kubernetes working together. And some of you might have seen the project where the routing components of Cloud Foundry are being replaced with Istio, because Istio gives you more qualities of service, so we would like to absorb and exploit that. So one specific thing I'm going to talk about today is the integration part. There are three aspects we're going to get into. The first is the current architecture itself, because most of your applications and systems are running on the current Cloud Foundry architecture; Eirini is coming, but you already have things in production. So I'll start with what we look at from a performance and scalability point of view there. Mainly I'm going to talk about the platform today, because when you talk about performance and scale, there is an application point of view and a platform point of view, and today we'll spend more time on the platform. Then we will get into the new Cloud Foundry architectures, whether it is Eirini or the Istio integration with Cloud Foundry. And lastly, we're going to talk about observability, because as you can see, there are multiple components and frameworks working in tandem, so you need observability of your transactions: how you can effectively trace a transaction through all these different frameworks and then correlate those traces together. You need a kind of integrated dashboard that makes the operator's life much easier.
So I'm going to give some of my point of view on that, and you can look at what is there right now and where we would like to get. My name is Surya Duggirala, I'm from IBM. I'm fortunate to work very closely with all three major open communities. I'm the co-lead of the Istio Performance and Scalability workgroup, jointly with Mandar from Google. I'm also working on Cloud Foundry, and I'm looking at the Kubernetes scheduler; we'll get to some of the work we are doing in our research lab, which we will be open sourcing soon. So when we talk about hardening a Cloud Foundry platform, I start from four different aspects. The first one is containers, because everything is a container, whether it's a garden container or, as you might have seen in the previous talk, containerd, since everybody is moving toward containerd; if you look at Kubernetes from around 1.11 onwards, the runtime is containerd, right? When you standardize on containers, you have to look at how many containers you can actually pack. Container density is very important: if you have a cell in Cloud Foundry, or a node in Kubernetes, how many pods can you pack in a node, and how many garden containers can you pack in a cell? There are certain things you need to look at. For instance, a VM using paravirtualization (PV) versus hardware virtualization (HVM): you can see a significant difference. We saw this when we moved to Diego. We were initially using PV, and on a 4-vCPU, 32 GB cell you could pack only about 30 containers before the cell saturated. Then we switched from PV to HVM, and that's the only thing we changed.
All of a sudden you could go almost 7x, up to about 200 garden containers, before saturating that cell. So understanding how the virtualization technology impacts density is important; as a cloud provider, you should be looking into that. Then the front-door network hops. As you build out your Cloud Foundry platform, there is all the stuff that stands between your client, like a browser, and the gorouter, because the gorouter is where the actual Cloud Foundry boundary starts. You may have multiple components in front: a DataPower appliance, a proxy server for security and other purposes, and so on. So how many hops in the front-door networking layer does it take to reach the actual application, residing in a garden container or now in Kubernetes? That's important, and I'll show some data on why. Then cell size considerations: you may have 4-vCPU or 8-vCPU cells, and finding the right cell size is really important, because when you push and stage an application you may see severe CPU spikes, so you have to really consider, for your use case, what size of cell is right. And then, apart from Cloud Foundry itself, you have many support services: a vulnerability manager scanning the system for vulnerabilities, various networking components running inside, and they all add CPU. You need to look into those too, because when you add all of these together, even in an idle condition you may be using almost 20 to 30% of your capacity.
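To make that last point concrete, here is a minimal sketch (my own illustrative helper, not part of any Cloud Foundry tooling) of the capacity math: given a cell's vCPU count and the idle overhead fraction consumed by support services, how much is actually left for application workloads.

```python
def usable_capacity(total_vcpus, overhead_fraction):
    """Estimate vCPUs left for application workloads on a cell.

    overhead_fraction: share of capacity consumed at idle by support
    services (vulnerability scanners, networking agents, etc.); the talk
    cites roughly 20-30% as a realistic range.
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return total_vcpus * (1 - overhead_fraction)

# A 4-vCPU cell with 25% idle overhead leaves 3 vCPUs for applications.
print(usable_capacity(4, 0.25))  # -> 3.0
```

The point is simply that cell sizing has to account for this fixed tax: on small cells, 20-30% overhead eats a much larger share of what applications can use.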
So as a Cloud Foundry platform operator or provider, you should be looking into all four of those. These considerations apply to the current architecture, and they are equally valid for the future architectures. Here are some of the potential performance issues we came across. If you are running microservices, you know about long tail latency, right? Long tail latency is really bad for microservices because it can impact your scalability. That is one thing we have seen, and then staging failures: when you're pushing something, staging failures are a common thing if you are not taking care of some of the issues we mentioned. And then BFF, backend for frontend, which is a popular microservices architecture. You have a question? Yes, the question is about long tail latency. Say you have a median (50th percentile) latency and a 99th percentile latency; the gap between the median or average latency and the 99th percentile is the long tail latency. Your average latency may be something like 10 milliseconds while the tail is 2,000 milliseconds. That is bad for microservices, because one small service can hold your whole application for ransom, right? Then there is BFF scalability, and service integration: all your applications use multiple services, so how the services integrated with your applications actually perform is also important. What we did so far is come up with four optimizations to address each of those issues. The first one is gorouter keep-alive, to reduce the long tail latencies.
We introduced keep-alive support in the gorouter, upstream keep-alive from the gorouters to the backends. Then we redesigned the buildpack mechanism so that you don't get the CPU spikes, because we moved to a layered file system rather than a flat file system; that is the OCI-image-based buildpack work, which solved those CPU spikes and helped us avoid staging failures in a congested cell. Then for the BFF latency issues, container-to-container networking eliminated many of the network hops. And then there are the schedulers inside some of the runtimes you may use, in some middleware: they can be impacted if the backend service you're accessing has high latency, and they may not be agile enough when that latency is significantly high. As an example, say you're talking to an application on a mainframe: typical mainframe latency should be around 100 to 200 milliseconds, but if you see 800 milliseconds or more than a second, that will impact the runtime's algorithms. So these are typical things you can look at in the existing architecture, and you will face the same things in the new architectures too. But the new architectures bring more issues, which need to be looked into. There are three of them. There is CFCR, because now we are containerizing the actual Cloud Foundry runtime environment itself; we have CFEE from IBM and the BOSH-based container runtime work from Pivotal, so CFCR brings Kubernetes in at the container level. Then there is CFAR with all the Eirini work you are familiar with now, which brings Kubernetes in at the application scheduling level. And then of course Istio, that's the third part, right?
On the CFCR side, IBM's product for Cloud Foundry running on Kubernetes is CFEE, the Cloud Foundry Enterprise Environment. What we have seen there is significant optimization at the front door: going from the previous architecture to CFEE, we reduced the front-end layers enough that we can clearly see the improvement in scalability in the new architecture. From an Eirini project point of view, we are now changing from the Diego scheduler to the Kubernetes scheduler, and when you go to Kubernetes there are a few things you really need to look at. For instance, you now need to think Kubernetes-centric about the algorithms. Diego has certain CPU-sharing algorithms; that's how the platform scheduler provides resources to your applications. Kubernetes does it differently, through the requests and limits you define for your application pods. These give three different quality-of-service classes: the Guaranteed QoS class if you specify requests and limits exactly the same; the Burstable class if you keep requests smaller so you can burst up to the limits; and if you don't specify anything, it is BestEffort. From the platform's perspective, top priority goes to the first one, the Guaranteed class. So you now need to take all of this into account, and look at how your application is managing the resources it has. And then memory is the scarce resource here; we need to understand how the total memory, the RSS, is actually calculated.
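The three QoS classes described above can be sketched as a small classifier. This is a simplified, single-container sketch of the Kubernetes rules (real pods aggregate across all containers and require both CPU and memory to be set for Guaranteed); the function and dict shapes are my own illustration, not a Kubernetes API.

```python
def qos_class(requests, limits):
    """Simplified Kubernetes QoS classification for one container.

    requests/limits: dicts like {"cpu": "500m", "memory": "256Mi"}, or {}.
    """
    # If only limits are set, Kubernetes defaults requests to the limits.
    if limits and not requests:
        requests = dict(limits)
    if not requests and not limits:
        return "BestEffort"          # nothing specified: lowest priority
    if requests and limits and requests == limits:
        return "Guaranteed"          # requests == limits: top priority
    return "Burstable"               # requests < limits: can burst

print(qos_class({"cpu": "1", "memory": "1Gi"}, {"cpu": "1", "memory": "1Gi"}))  # Guaranteed
print(qos_class({"cpu": "250m"}, {"cpu": "1"}))                                 # Burstable
print(qos_class({}, {}))                                                        # BestEffort
```

The practical consequence for an Eirini-style migration is that an app pushed without explicit requests/limits lands in BestEffort and is the first to be evicted under memory pressure, which is very different from how Diego shares CPU.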
One good thing with Kubernetes, compared to Docker, is how the page cache is treated: the page cache is not counted against you, whereas with Docker the page cache you accumulate normally adds to the overall memory number, right? So you need to consider these things; if you have Java workloads, for instance, the heap and non-heap native areas all add up to become that RSS. Another major thing you really need to look at with Kubernetes is the default scheduler. In my opinion the default scheduler needs some work, because it cannot identify a cluster that is not balanced. Say the default scheduler places your pods across multiple nodes in your cluster, and one node is saturated while other nodes have resources free. The default scheduler only takes into consideration the static requests that you specify as a customer; it doesn't look at how resources are actually used at the node level, so it takes no account of dynamic node-level resource usage. That's really not good, because you can't keep a balanced cluster, and your application may hit a bottleneck. In fact, I brought this up at KubeCon last year; we wanted a rescheduler, and the community is working on a descheduler, but even that doesn't give us enough here. So what we did is come up with a smart scheduler. The algorithm takes into consideration the dynamic CPU and memory usage at the node level and feeds that back into the default scheduler. It is an extender to the default scheduler, so the default scheduler can now consider how much CPU and memory the nodes are actually consuming, in addition to the static requests and limits you specified. It has support for overcommit as well.
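The idea behind that extender can be sketched as a scoring function. The real Kubernetes scheduler extender is an HTTP webhook whose "prioritize" verb returns per-node scores that are added to the default scheduler's own scores; this is only an illustrative sketch of the scoring logic (function name, data shapes, and the 0-10 scale applied here are my assumptions, not the actual open-sourced scheduler).

```python
def prioritize(nodes):
    """Score nodes by observed (dynamic) utilization, higher = preferred.

    nodes: {"node-a": {"cpu": 0.9, "memory": 0.7}, ...} where values are
    live utilization fractions, e.g. as reported by a metrics pipeline.
    Nodes with lower real usage score higher, so pods drift toward
    genuinely idle nodes instead of ones that merely have unreserved
    static requests.
    """
    scores = {}
    for name, usage in nodes.items():
        busiest = max(usage["cpu"], usage["memory"])  # tightest resource wins
        scores[name] = round((1 - busiest) * 10)      # map to a 0-10 score
    return scores

nodes = {
    "node-a": {"cpu": 0.90, "memory": 0.70},  # saturated in practice
    "node-b": {"cpu": 0.20, "memory": 0.30},  # mostly idle
}
print(prioritize(nodes))  # node-b scores higher than node-a
```

Because the extender only adds scores, the default scheduler's static request/limit filtering still applies; the dynamic signal just breaks ties in favor of nodes that are actually idle.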
We have this working, and we are planning to open source it so it will be useful for everybody. Another thing we have is the Istio service mesh. Istio gives you four major qualities of service: connectivity, security, control, and observability. Because all of these are essential for any microservice, instead of Cloud Foundry developing them from scratch, we are trying to bring them in by adopting Istio. How are we adopting Istio? We have a plan to exploit it in four different areas. The first one is north-south traffic, at the edge, where the traffic today goes through the gorouter. The plan is to replace the gorouter with Istio's Envoy proxy, and Cloud Foundry develops a component called Copilot. Some of you might have seen the weighted routing demo at the main tent; that's a feature you get from Istio. There we have some concerns from a performance point of view, because Copilot gets its data from the Cloud Controller API as well as the Diego components, and it needs to keep state. Then there is an adapter for Cloud Foundry that is developed and integrated as part of Pilot, which is an Istio component. So the scalability of Istio's Pilot is important for us, to manage the thousands of routes we are used to within Cloud Foundry. That's one touch point. The second is that Envoy is used both as the front proxy and as the sidecar, so you need to make sure Envoy is designed for and scales to our needs. Those are the two main things we are working on from the Istio community point of view. I'll show you some of what we worked on in the recent 1.1 release.
Then we have the security stuff, and east-west traffic, where the sidecar proxy attached to each and every service also has to be looked into. These are some of the main things we have worked on; the slides are available, so you can go look at all of this. We have opened many GitHub issues to address these and add new features. For instance, we introduced namespace filtering as part of the Pilot scalability work, which is useful for Cloud Foundry scalability. And we have externalized some of the configuration parameters in Istio. Envoy is a good example: by default, when Envoy launches, it spawns multiple worker threads, and the number of worker threads used to be almost equal to the number of hardware threads on the host. So think about a 64-core cell: you're talking about almost 128 worker threads per Envoy. If you have many Envoy sidecars, you quickly get to thousands of worker threads; each one takes more memory, and you get context switching too. So we externalized a parameter called concurrency. These are some of the tuning parameters we introduced in Istio 1.1 that enable the Istio integration with Cloud Foundry and let us tune the pieces that the integration touches. We did get a lot of these optimizations into Istio 1.1, but we still have a laundry list of things beyond 1.1, which I think we are targeting for Istio 1.2. So as you can see, because of this integration, we were able to take the requirements from Cloud Foundry, like the number of routes it has to scale to, into the Istio community and change the Istio design itself.
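The worker-thread arithmetic above is worth making explicit. This is a back-of-the-envelope sketch (my own helper, using the talk's numbers); it assumes the default behavior described above, namely roughly one Envoy worker per hardware thread unless a concurrency cap is set.

```python
def total_worker_threads(sidecars, hw_threads, concurrency=None):
    """Estimate total Envoy worker threads on one host.

    By default Envoy spawns about one worker per hardware thread;
    a 'concurrency' setting caps the workers per proxy instead.
    """
    per_proxy = concurrency if concurrency else hw_threads
    return sidecars * per_proxy

# 50 sidecars on a 64-core box with 128 hardware threads (SMT):
print(total_worker_threads(50, 128))     # -> 6400 threads at the default
print(total_worker_threads(50, 128, 2))  # -> 100 threads with concurrency=2
```

Thousands of mostly idle worker threads cost memory and context switches, which is exactly why exposing the concurrency knob mattered for dense Cloud Foundry cells.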
That's a very good example of working together as partners. Then there is the sidecar proxy itself. Today, if you have a proxy sidecar and a service, the traffic has to go through the kernel, and going through the kernel costs network bandwidth and adds to the latency. So what we are trying to do now is avoid going through the kernel and stay in user space, because both processes live in the same space there. Those are some of the architecture-level optimizations we are working toward, so that Istio is not only useful and scalable in itself, but also scales to the massive number of routes we need to support for Cloud Foundry. When you put all these things together, we need to look at observability, because now you need to monitor Cloud Foundry, your applications, the Kubernetes layer, and now Istio. You are looking at multiple frameworks. Of course, your application doesn't care, because it is just the application you deploy to Cloud Foundry, but at the end of the day, if you have an issue, you need to look into all of these layers. For application-centric monitoring, you look at APM tools, and that's where you correlate the transactions. For runtime-centric monitoring, you have the different buildpacks, and you need to see what's happening in them, whether you have any scalability issue, what bottlenecks you have. Then you come one level down to Cloud Foundry itself, one layer further down to the Kubernetes orchestrator, and then of course the Istio service mesh. Each one of them, as I'll show you, has its own dashboard. If you look at this, this is our Cloud Foundry Enterprise Environment, and it has its own.
It gathers data with Prometheus and puts it in a Grafana dashboard. You can clearly see the data points at the CFEE control-plane and data-plane level, but it gives you only that level, plus some data points from the underlying Kube layer. From a performance point of view, we created another Grafana dashboard to add additional things like node-level and I/O-centric data, pod-level data, and container-level data; these are again Grafana dashboards. But if you look at the Kube side itself, you have multiple services; for instance, Sysdig is one of the monitoring tools we use as the monitoring solution in IKS, and that gives a separate view, where of course you can drill down and get the data. So now you have data coming from the Cloud Foundry dashboard and some data coming from Sysdig, and there may be some overlap there. And then think about Istio. For Istio we created an integrated dashboard; again Prometheus is used, with different adapters, and you have a dashboard for Pilot, for Mixer, for the individual components. Istio also has its own visualization: there is Kiali, and there is Vizceral. So as an end customer you can be overwhelmed, because you have multiple dashboards to look at. This is one area where we are trying to see if we can have just one master dashboard that plugs into all of these, with visibility into all the pieces when you deploy an application. These are the things we are looking at, because this has a direct impact on understanding the performance and scalability of your applications. There are some references here; you can take a look at some of the new features in Cloud Foundry and how the integration with Istio is happening. So again, just one thought I want to pass on before we end.
There are many, many new architectures and new features coming as part of this integrated Cloud Foundry platform. From the open community perspective, there are multiple workgroups working together: for instance the Istio one that I'm running, some engineering interlocks, Cloud Foundry folks here and there, and the Istio teams. We are getting the requirements and data from Cloud Foundry, and that's enhancing the Istio framework itself; and in turn, the new Istio features and architectures are something the Cloud Foundry teams are trying to exploit. And from a Kube perspective, as I mentioned, we came up with this new scheduler, which can later be exploited in Eirini. So as you can see, all these things are working together. We have a lot of work to do, but at the end of the day, having one single integrated observability dashboard will really make life easier for everybody, and people are thinking in that direction too. I think we are right on time. So any questions? Thank you.