So our talk is going to be about the consumerization of health care and what we're doing at Kaiser Permanente on Cloud Foundry. We'll talk a little bit about some problems as well, but first we'll do a quick round of introductions. So, Surya, go for it. Hey, good afternoon. My name is Surya Dugral. I'm from IBM, STSM for One Cloud Architecture, here with me, Alex. Alex Rubin. I'm a principal architect at Kaiser Permanente. Part of my daily duties is to work with different teams across KP to talk to them about the capabilities we have on the platform, the services we have on the platform, how to utilize those services, and best practices around microservice application design, and also to learn a little bit as the teams are progressing, as they're taking their applications from POC all the way through development and into production: to learn where they stumble and what some of the issues are, so we can amend best practices and look for tools and capabilities that will address some of those areas. Then we'll take all of that knowledge, wrap it back up, and give it to teams to help them run better and faster. Yeah, with that, today we're going to talk about health care in general, how Cloud Foundry is being used in the health care industry, and why you need cloud in health care. We'll talk about, as Alex mentioned, KP's experiences running on Cloud Foundry: what exactly worked well and what didn't work so well with the Rx application, or many health care applications. Then we will talk about what we did to make these applications perform better, scale better, and work better. And of course we still have a few open items for the community, so we're going to talk about some of those things, and that's how we're going to spend the next 30 minutes. So, health care and cloud, right? As some of you might have seen, I was talking about banking on Cloud Foundry.
Now how about the health care industry? You can see there are multiple issues with health care industry applications that the cloud is actually trying to solve. For instance, cost is one of the considerations, one of the reasons the health care industry is actually looking at cloud. That's not all, that's just one of the factors, but you can also look at digital disruption and multi-speed IT, because of the speed at which you can deliver the latest features and requirements that consumers are actually looking for. That's the second one. And of course the third is agility and embracing different kinds of business models. So I will let Alex talk about the reasons why KP chose to go to cloud and selected Cloud Foundry. Alex? All right, so, Kaiser Permanente's digital journey. I'll spend a little bit of time talking about where we are, where we started, where we're going, and some of the problems that we've hit. So first of all, can I get a show of hands: who knows who Kaiser Permanente is? Anybody know? Okay, pretty good. Okay, great. For those folks who may not be aware, Kaiser Permanente is one of the top 10 largest integrated health care providers in the United States. We operate 39 hospitals. We have over 670 different medical offices and outpatient facilities, 200,000-plus employees in the family, and 11.8 million members altogether. So lots of members that we're serving, and a pretty big company. We've been around for 70-plus years, so there are legacy applications as well as the new microservice applications that we'll talk about here. But one of the key things for us is this whole idea that as consumers, as individuals, we like to be engaged with in different ways, right? There are many different things that we prefer. We prefer to be connected via different devices, so maybe communicating via cell phones or social media.
We have all of the IoT devices around too. So for folks who may be looking at, after hospital stays, how do you improve the overall quality? You may be deciding that it's better for you to recover at home. And consequently, that also happens to be a cheaper alternative than staying in the hospital. Taking care of some of the chronic conditions is also better done if you have devices that are helping you manage those conditions. So there are lots of opportunities, right? Even for things like farmers markets in the community: being able to know where to go for a farmers market, or being able to connect to your doctor in a way that's easier for you. Maybe that doesn't include going to the hospital; maybe it includes doing an electronic visit, just seeing a doctor over video chat and asking questions. All of these are preferred ways for us, but they're also, in the end, making health care cheaper for everyone, right? Because if you don't show up at the hospital, things just get a lot simpler. So with that in mind, in order for us to build all kinds of experiences and applications that enable you to connect in all these different ways and give you value as members and as employees, we need to create a digital platform, a digital foundation, where we can start building some of these components and delivering some of that value. We don't want to be continuously rebuilding the same capabilities; we want to innovate. And that means you have to have a layer that allows developers to very quickly deploy applications and utilize different services. A fail-fast kind of mentality, right? So you want to create an app. You want to use a SQL DB. Well, you tried it. Maybe you got some feedback from users. You have to change your data model. So now you're going to go and use a NoSQL DB. How can we do that very quickly? I just want to be able to spin it up and go, right?
I don't want to have to wait a few weeks to stand up one of those databases and then haggle over who's going to maintain that database and upgrade those servers, etc., right? From a cost perspective, that's very expensive and gets very laborious. So what we wanted is a platform where we can really leverage all of these capabilities, not just from a runtime perspective, but also from a services perspective. And we picked IBM Cloud, IBM Dedicated Cloud, as the platform for us. So we've adopted that, and over the last few years we've been developing and integrating all the different capabilities so that we can deliver value. Just providing a platform is not enough, though, right? As I mentioned, we're over 70 years old, and that implies that we do have a lot of systems of record and a lot of COTS products, as any large enterprise does. Nothing ever really goes away; it just gets added to. And so, in order to deliver value to our members, we have to be contextual, and we really have to know some of the background and history around how some of these things have developed. So we need to be able to access backend systems of record, which also means that we have to tie everything together. We can't just look at a brand-new greenfield application in the cloud that you develop and that runs in isolation. It means you have to have full-fledged integration with your systems of record, and you have to be able to track transactions. You have to be able to look at logs holistically. You have to be able to set up CI/CD pipelines for your deployments, and also think about how the versions play together and how integration happens. So there's a lot going on in really operationalizing something like this, right?
So once we were able to integrate Cloud Foundry, in this case IBM Cloud, into our enterprise a little bit better, we were more comfortable running different applications on it. We started out with just workforce applications, and now we're looking at more complex applications which are member-facing. As part of that, we have KP.org, which is our premier member-facing site and one of the main ways members can connect with us. Millions of users come to the site day in and day out to do all kinds of different things: scheduling an appointment, sending a message to your doctor, refilling your prescriptions, all kinds of capabilities, right? Originally this application was a monolithic app running on WAS. What we wanted to do is refactor this application into a microservice world and be able to provide better experiences to our users, allow teams to decouple from each other and really run fast owning different aspects of this, and be able to integrate together in the end to provide a holistic experience. So as part of that, we started building out these capabilities and brought in integration to the enterprise. What you see here at the top is what users do and what their experience is. In this box here, you see that we have KP data centers, and this is the IBM Cloud, and you can see some of the microservices that are running in IBM Cloud. So at a high level, you basically have a user. The user might authenticate against some on-prem system. Then they get this kind of welcome screen and they can decide where to proceed from there. So they may select, let's say, the pharmacy application or some other application. And they will get static content from AEM; we're using AEM for our content management system.
And then the dynamic content comes from the microservices that are sitting over here in IBM Cloud. What typically happens is a request goes into the microservices tier here, and that tier will be composed of maybe a gateway written in Node.js, and we'll have some business microservices written in Java. So it's polyglot. Some of these will actually send requests to the backend systems of record, maybe to the electronic medical record or to other systems of record, to fetch the data and provide you with the relevant information that you really care about. Then those requests are processed here and sent back. We're also using a bunch of different services. This just shows a few, but in addition to that we have object storage, push notifications, Postgres, a whole bunch of other ones, right? So this is just an example with a few. And then you can build out these complex systems. Now, when you have something like this, it looks easy on the slide, right? Not a lot of components. But it's a simplified slide; there's a lot to it, and you really do have to start thinking about: how are we going to have different teams running and building out all these different capabilities? How is this all going to fit together? How is the platform going to hold up? How is the performance going to be? You want to optimize a lot of these things. So next, what I'm going to do is talk about a few challenges that we've hit along the way. Not everything was absolutely perfect at all times, and some of these challenges were interesting. And then I'll pass over to Surya to talk about some of the lessons learned and best practices, how we're working together to address this, and also where the community can help us with some of these issues.
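The gateway-plus-business-services tier described above is the backend-for-frontend (BFF) pattern. As a rough sketch of the aggregation idea only (the real tier uses a Node.js gateway and Java services; every service, field, and member name below is made up for illustration):

```python
# Minimal sketch of a backend-for-frontend (BFF) gateway.
# In the talk the gateway is Node.js and the services are Java; this
# Python version only illustrates the aggregation pattern itself.

def member_service(member_id):
    # Stand-in for a business microservice that would call a backend
    # system of record (e.g., the electronic medical record).
    return {"id": member_id, "name": "Test Member"}

def pharmacy_service(member_id):
    # Stand-in for a second business microservice.
    return {"refills": [{"rx": "hypothetical-123", "status": "ready"}]}

def bff_pharmacy_view(member_id):
    """Aggregate several service calls into one response shaped for a
    single front-end screen -- the core of the BFF pattern."""
    member = member_service(member_id)
    pharmacy = pharmacy_service(member_id)
    return {
        "memberName": member["name"],
        "refills": pharmacy["refills"],
    }

print(bff_pharmacy_view("m-001"))
```

The point of the pattern is that each front end (web, mobile) gets its own thin gateway shaping data for its screens, so the business services stay generic.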
So one of the things that we had an interesting problem with is multi-tenancies. So basically, we have multiple Cloud Foundry environments that we're running in. And one of our environments is a Dev environment where lots of people are deploying code. And so what happened is, you know, we have some projects, as I mentioned early on, they're like in POC stage. So they'll just do sandbox, they'll deploy from their laptop, you know, they'll push some sample apps or do some tech POCs to see if certain frameworks run and how fast they run, et cetera. We'll have other teams that are going to be going through a standard, you know, pipeline from Dev to QA to UAT, et cetera, right? So you might have somebody running standard set of tests in QA, UAT, just checking the functionality. And you'll have other teams that are going to be running performance testing and they're just going to be pushing their apps to try to see, hey, how much can I squeeze out of this app in terms of transactions? Am I fine? You know, where do I need to go patch? What do I need to go optimize in terms of my application capabilities and how fast they run, right? And so what we found is at one point in that environment, we saw like this behavior where this kind of shows you IBM Cloud Dashboard that kind of gives you a highlight of what your resources are doing at the high level. And you can see there's a bunch of Diego cells here. And you can see that a bunch of these Diego cells are kind of pegged at like 100% CPU. And there were literally like most of them in the environment were pegged at 100% CPU. So that was a little bit problematic. And you can see kind of all the behavior in the platform here showing kind of this bad behavior. So that was kind of fun. And we'll talk a little bit later in the talk of kind of what happened and how we addressed some of this. Suffice it to say we should care about CPU at this stage. 
The other problem we had is, as I mentioned, performance testing. We wanted to make sure that our developers are able to deploy quickly and that they're getting good results on the platform. So we built out a very simple sample application. The idea behind the app is that I have a data center and some service being made available in that data center. I have a Java Liberty app, in this case, which is going to make requests to the data center. And then I have a Node app which is going to talk to the Java app, which in turn makes requests to the data center. All we are doing is connecting JMeter to that and saying, okay, let me just get some static content, have that content come from here, and pull it with a bunch of calls. Today a bunch of these calls are going through the Go Router. So if you think about Cloud Foundry from the architecture point of view, this connection will be going through the Go Router, this connection here will be going to our on-prem, and this connection also goes through the Go Router. So there are a bunch of connection hops. One of the things we found is, for this particular case, we picked a pretty good backend system, and it was giving about 100-millisecond response times, so things were looking pretty good there. And believe it or not, not all of our systems are able to give you 100-millisecond response times. We do have some legacy systems, and those systems can take quite a while to respond, but we wanted to keep things simple in this particular use case, not show some of that data, and not worry about caching and all of that good stuff, so we just simplified it. And then, measuring from Java directly going to the backend, you can see a spread of 500 to 800 milliseconds now.
As you go up to the Node tier, you have Node making a request to Java, which then goes to the backend and comes all the way back, and now you're looking at five to seven seconds. That's pretty bad. And that's a pretty simple app, right? So we said, okay, definitely there's something wrong here. Where's the problem? Is it the hardware? Is it the network? The platform? The code? What happened? And of course we did not want this type of behavior; once you have a more complex app, things become very, very difficult to manage. So what we've done is we've worked with IBM. We pulled together a more representative use case and provided it to IBM, and they've been looking at it, running tests and tweaks, et cetera, and we've been working hand in hand on some of the best practices around this as well. So this is where I will pass the baton to Surya to talk about some of the solutions and best practices. Thank you, Alex. So as Alex mentioned, we have this health care application and we have some challenges. I can classify them as three different things. One is application-specific issues: as Alex mentioned, we're trying to go from a traditional Java application to a more cloud-native BFF application, so do we have any lessons learned from that? The second is about the platform itself, Cloud Foundry, as we are trying to scale: do we have any inherent issues within Cloud Foundry, or any issues in the cloud platform where we have multiple different layers? So what we did is we took a representative application, the BFF (backend-for-frontend) pattern, because most of the applications that KP uses are based on BFF. You can see we have two different types, two-tier and three-tier: the Node application calling a Java API, which in turn calls the SOR, the systems of record. So you can see that there are two different types.
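The response-time spreads quoted above (500 to 800 milliseconds direct, five to seven seconds through the Node tier) are easiest to compare as percentiles. A minimal way to summarize raw load-test samples; the simulated numbers below are illustrative, not measurements from the talk:

```python
# Sketch: summarizing load-test response times as percentiles, the way
# a JMeter-style run is usually compared. Sample values are simulated.
import random

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

# Simulated run: most requests around 600 ms, with a small long tail.
rng = random.Random(42)
samples = [rng.gauss(600, 80) for _ in range(980)]
samples += [rng.uniform(2000, 7000) for _ in range(20)]

for p in (50, 90, 98, 99):
    print("p%d: %.0f ms" % (p, percentile(samples, p)))
```

Comparing p50 against p98/p99 like this is what surfaces the long-tail problem discussed next: the median can look fine while the 99th percentile is an order of magnitude worse.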
So there are the different network issues that we have within this. Basically, we wanted to see whether we have any latency issues and, if so, where those latency issues are actually coming from; and we also saw some long-tail latencies with this. So let me start with the application first. From a BFF point of view, what we identified is that a lot of these recommendations are applicable not only to health care but to any other industry as well. When you look at the backend services, because, as Alex mentioned, we have the systems of record that these applications are accessing: when we are designing these applications, we need to take care of the backend service latency, because with some cloud applications, when you have very high backend service latency, some of the runtimes will misbehave. So you need to tweak and tune the runtimes; that is one of the things. And then when we talk about microservices, again, the main value proposition of microservices is the Z-axis scalability. So we need to make sure that we are finding the knee of the curve, and that we are scaling and sizing in such a way that we have those things built into the application and the topology as well. Then we went into Cloud Foundry itself. From a best-practices perspective, with Cloud Foundry, when you're using the Go Router, every call (let's say you have the BFF Node calling a Java API, which calls a second Java API) will go back all the way to the firewall, then get into the DataPower, then get into the Go Router, and then get into the second instance. So there are lots of network hops that these transactions will go through. Those are some of the inherent design issues, but there are certain things that we have within Cloud Foundry, like the new features that we're actually driving, for instance the Go Router keep-alive.
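Before the data, a toy model of why per-request connection setup creates a long tail. All numbers here are assumptions, not measurements: roughly 100 ms of service time, a few milliseconds for a normal TCP handshake, and an approximately one-second SYN-retransmit penalty on a small fraction of handshakes. Without upstream keep-alive, every request through the router risks paying that penalty:

```python
# Toy model of long-tail latency from per-request connection setup.
# Assumed numbers: ~100 ms service time, ~5 ms normal TCP handshake,
# and a ~1 s SYN-retransmit penalty on a small fraction of handshakes.
import random

def handshake_ms(rng):
    # Occasionally a dropped SYN forces a retransmit after about one
    # second -- a classic source of a latency tail.
    return 1000.0 if rng.random() < 0.02 else 5.0

def run(n, keep_alive, seed=7):
    """Simulate n requests over one reused upstream connection, or one
    fresh connection per request when keep_alive is False."""
    rng = random.Random(seed)
    latencies = []
    for i in range(n):
        lat = 100.0  # assumed steady service time
        if i == 0 or not keep_alive:
            lat += handshake_ms(rng)  # connection setup cost
        latencies.append(lat)
    return latencies

def pctl(samples, q):
    s = sorted(samples)
    return s[min(len(s) - 1, int(q / 100.0 * len(s)))]

no_ka, ka = run(1000, keep_alive=False), run(1000, keep_alive=True)
print("p99 without keep-alive:", pctl(no_ka, 99), "ms")
print("p99 with keep-alive:   ", pctl(ka, 99), "ms")
```

With reuse, the handshake (and its occasional retransmit) is amortized over the life of the connection, which is why enabling keep-alive mainly shows up in the high percentiles rather than the median.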
So starting with version 253, upstream keep-alive is enabled for the Go Router. We need to use that, and I'll show some data on why we need to use it and what exactly it solves. And then container-to-container: if you want to avoid all these kinds of network hops, there is a new feature called container-to-container networking, with the CNI plugin, where an application can call another service directly without going back all the way to the firewall. So that will help. And then another thing, as Alex mentioned: okay, I'm almost saturating my Diego cells, almost everything is saturating, so what exactly is happening? There are two things happening there. One is the test environment; the second is the production environment. The test environment is more impacted because there are constant pushes happening there. Each time you do a cf push, if it is a Java application, you will see a significant spike; the bigger the droplet size, the bigger the spike is going to be. So those are some of the things, and we have a new feature in Cloud Foundry, I think resolved in 279, called the OCI layered file system. Basically, the buildpack mechanism now uses a layered file system, like Docker, rather than a flat file system, so that will actually reduce these CPU spikes. Then there are other things you need to look into, like the cgroup algorithm: when you push an app, the placement algorithm doesn't take into consideration whether that particular cell has enough CPU left, because the algorithm is based on memory. So that also has an impact, because a cell can be completely saturated from a CPU perspective, but the push will still go to that cell, and then staging may fail. Some of those things we need to take into consideration.
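To make the memory-based placement and CPU-shares point concrete: containers are generally given CPU shares in proportion to their memory limit (the exact constants vary by release, so treat this as an illustration, not exact Garden behavior). Under contention, an app's CPU slice therefore follows its memory size, while the placement decision never looks at CPU at all:

```python
# Illustration only (not exact Garden constants): Diego containers get
# CPU shares proportional to their memory limit, so under contention an
# app's CPU slice follows its memory size, not its actual CPU need.

def cpu_fraction_under_contention(memory_limits_mb, app_index):
    """CPU fraction one app gets when every container on the cell is
    CPU-hungry, assuming shares are proportional to memory limits."""
    total = sum(memory_limits_mb)
    return memory_limits_mb[app_index] / total

# Three apps packed onto one Diego cell: 1 GB, 1 GB, and 4 GB.
limits = [1024, 1024, 4096]
for i, mb in enumerate(limits):
    share = cpu_fraction_under_contention(limits, i)
    print("app %d (%d MB): %.0f%% of the cell's CPU" % (i, mb, share * 100))
```

Note that a CPU-hungry 1 GB app on this cell is capped at roughly a sixth of the CPU once neighbors get busy, which is exactly why performance tests give different results depending on how busy the cell they land on happens to be.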
So you can see here. Another major issue that we saw was long-tail latency. We saw certain transactions where, if you go from the 98th percentile to the 99th percentile, there's almost a 10x difference, which is really bad for microservices. Resolving long-tail latencies is a really tough one, but we identified that once you enable the Go Router keep-alive, we could resolve the long-tail latency. So I suggest all of you take advantage of that to eliminate long-tail latency. You can clearly see the difference: on the left-hand side is without the Go Router keep-alive, and on the right-hand side is with it, the difference in latency just from enabling the keep-alive. This is with the BFF application after enabling the keep-alive for the Go Router; you can clearly see the difference between the 90th and 99th percentiles is much lower. Another thing that we found is the front door: because you have a front door in front of the Go Router, like the DataPower and other things, if those front-door layers are not tuned right, for instance if DataPower is not tuned, then you can see a significant jump in latency. On the right-hand side you can see how much the misconfigured front door impacts the overall latency of the microservices. And this is what Alex was talking about with the spikes. We did a temporary fix and a long-term fix. The temporary fix is that we doubled the capacity: we went from four vCPUs to eight vCPUs in a cell, so that you won't be saturating and you will have enough headroom to get pushes going through fine without any staging issues. Of course, the long-term fix is the OCI layered file system, which will tone those CPU spikes down. And after applying all that, we can clearly see that we now have the health care microservice application performing and scaling. As you can see, the knee of the curve is where one single instance of this microservice application tops out; beyond that, the only thing that increases is latency, so you have to scale horizontally. So these are some of the lessons learned, and I will let Alex talk about some of the pain points from the Cloud Foundry fabric itself that we have identified. Yeah, thanks. So one of the things we learned is that doing active monitoring for things like CPU utilization is a good idea, and if you're working in a managed environment, you may not have a lot of visibility into those things, so you have to work with your provider to understand how you're going to manage that. They may have a view into the virtual machines and how they're running; you may not have access to that level. And if you don't have access to that level, then you'd better be talking to them: okay, great, how do I get to that level? How am I going to know when I hit that barrier? So we've developed some code, some scripts, and we have some dashboards, and now we're tracking how our applications behave. We're trying to proactively look at those things and carefully analyze what's happening. The other thing we found is that it's interesting, because teams will optimize for the areas where you set quotas. So you tell your team, hey, we're being charged by memory and we have to optimize for memory; and if you know Cloud Foundry, a lot of quotas can be set based on memory, and there's not a lot you can do based on network throughput or CPU utilization. So what you find is clever developers who say, well, okay, if I have to make a trade-off between being more CPU-intensive with lower memory versus less CPU-intensive with higher memory, guess where I'm going to go, right? So they're making these decisions, and in the end that
only makes the problem worse. If you're not tracking it, it just gets worse. The other thing is that performance testing becomes an interesting challenge. What happens is your teams can run performance tests one day, and from the point of view of cgroups, it will allow your CPU to spike: if you happen to be on a Diego cell which is not very busy, you can use up more CPU on that cell for a particular application. Now, what happens if you land on a busy Diego cell? Then you only get whatever CPU shares are available for your application, based on the amount of memory that is allocated. So one of the challenges we've seen is that depending on where your application, or a specific application instance, lands, your performance testing may show different results. In some cases you may think everything is great; in other cases you may think things aren't so great. You have to be careful about that and keep in mind that there's some consistency here that you need to worry about. And a few other things in terms of support for workload rebalancing and support for CPU quotas: those are things where we would appreciate folks from the community weighing in. There's a lot of talk at this conference about Kubernetes, how Kubernetes will potentially work with Cloud Foundry and maybe in the future be part of the runtime for Cloud Foundry. Those are all interesting conversations for us, because we want to make sure that we understand, and can control, some of the additional capabilities and resources that Kubernetes may be able to provide us with knobs to control better. So yeah, with that, there is one more session I have tomorrow, around 3:45, that talks about how you compare Cloud Foundry with Kubernetes. And also, I think Don from IBM already announced CFEE, the IBM Cloud Foundry Enterprise Environment, so you will actually get to see some of the data from CFEE, and we'll also talk a little bit about Istio, how Istio can be supported on Cloud Foundry, what's the future, and all that. We have a talk tomorrow at 3:45. So with that, any questions? I'm sorry... okay, go ahead. What did I monitor? Okay, the question is, what did we use for monitoring? That's a great question. For monitoring, we're using Dynatrace. We have a lot of flexibility into what we can look at with Dynatrace, so that's our product of choice right now. Yeah, one is Dynatrace, but also, I think you have some custom... Yeah, of course, there are also some custom scripts that we've written, as I mentioned. Because it's a managed platform, we may not be able to install agents on the VMs to get some of the data, so what we end up doing is writing scripts against the APIs, pulling the data that way, and then dumping that data into a store and doing some analytics with dashboards on top of that. So that's the other way to integrate. And yeah, great. Thank you. Thank you, folks.
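The "scripts against the APIs" approach can be sketched roughly like this. The JSON shape below mirrors the Cloud Controller v2 per-app stats response (`/v2/apps/:guid/stats`), but treat the field names and units as assumptions to verify against your own API version; the threshold is arbitrary:

```python
# Sketch of polling app stats and flagging hot instances, the kind of
# custom monitoring script mentioned in the Q&A. In practice the stats
# dict would come from an authenticated GET against the Cloud
# Controller; here it is supplied directly so the sketch is runnable.

def hot_instances(stats, cpu_threshold=0.85):
    """Return sorted instance indices whose CPU usage exceeds the
    threshold. `stats` maps instance index -> instance report with a
    stats.usage.cpu field (fraction of one core, an assumption)."""
    hot = []
    for idx, inst in stats.items():
        cpu = inst.get("stats", {}).get("usage", {}).get("cpu", 0.0)
        if cpu >= cpu_threshold:
            hot.append(idx)
    return sorted(hot)

sample = {
    "0": {"stats": {"usage": {"cpu": 0.97, "mem": 300000000}}},
    "1": {"stats": {"usage": {"cpu": 0.12, "mem": 280000000}}},
}
print(hot_instances(sample))
```

A cron job running something like this and writing the results to a store is enough to build the kind of CPU dashboards described above, without installing agents on the platform VMs.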