OK, so good afternoon. Welcome to Hybrid Cloud: The HIPAA-Compliant Enterprise with Kubernetes. Today we're going to talk about how a large healthcare provider in southwest Pennsylvania manages HIPAA compliance on on-prem systems and in the cloud, and we'll learn a little bit about our journey along the way, how we went from our on-prem systems to the cloud.

If you've noticed my swag, I work for Heptio now, but when I wrote this talk I actually worked for UPMC. So from this point on I'll be a UPMC employee until the last slide. At UPMC I was a software architect, responsible for bringing containers and Kubernetes to UPMC and UPMC Enterprises. My job there was to help move applications from developers' workstations to the cloud, and do that in a HIPAA-compliant way. I've done a lot of open source, and you'll see that throughout this talk; if you check out the UPMC Enterprises GitHub repo, you'll see lots of things that we've written.

So who is UPMC? They're the University of Pittsburgh Medical Center: a $16 billion enterprise, 80,000 employees, more than 30 hospitals, a 3.2-million-member health plan, as well as some other stuff. And the other stuff is where UPMC Enterprises comes into play. UPMC Enterprises sits outside of UPMC, and its job is to create revenue streams for the hospital. We do that in a number of ways. One is to write software and then commercialize it. We'll also write software, spin a company off of that and have that company be successful, or we'll invest in companies and do the same kind of thing. So that's where I worked, doing R&D kind of work, somewhat out in front of the traditional hospital business.

A lot of those applications needed to access clinical data. If you think of an enterprise of 80,000 employees, there are lots of systems, but when you think of clinical data in a healthcare environment, you get a picture like this. This slide is about a year old, but there are over a thousand different applications within UPMC with varying data sources: everything from electronic health record systems to radiology to lab work, and it goes on and on. Data is now counted in petabytes per year. There's a big diversity in the data: the kinds of data it contains, the performance of accessing it, the access patterns for how you get to it, as well as the sensitivity. Everything from mental health status to HIV status to evidence of child abuse; you can imagine this is the most sensitive data you could deal with, and we need to treat it correctly.

With that background, let's talk about what HIPAA is. HIPAA is the Health Insurance Portability and Accountability Act. If you've read through HIPAA, there are lots of different pieces and parts, and lots of white papers you can dig into. The piece I want to focus on today is the part that prohibits the disclosure and misuse of information about private individuals. TL;DR, we want to keep your patient data safe, and only let the people who should access it access it. We're talking about things like Social Security numbers and medical record numbers, or MRNs, which are IDs within your EHR that identify you uniquely. Even your name is PHI. Basically, anything that can describe you specifically, we want to protect, because with that data we can go back and find out who you are uniquely. So why create HIPAA?
Why make up these rules and regulations? The idea is that we want to come up with "standards," and I have standards in quotes because there isn't a single standard, but we want to figure out ways we can talk about this data. It's one thing for us to manage our own data correctly, but then you have to think about when we move to public clouds or other places: they're managing our data now. Or if you have vendors that have feeds of your data, you're only as strong as your weakest link. If I have the most secure network, but my vendor is pulling data and they're weak, then I still have an issue with my data. So we're going to talk about some things around standards and how this data gets handled.

If you don't want to read through all of that, I'll summarize the main topics we need to cover to meet HIPAA compliance. The first one is encryption at rest: while data is sitting on a disk, not moving, or sitting in a cache, that data needs to be encrypted; it can't be in plain text. You need encryption in transit: when data moves from point A to point B, that needs to happen over a TLS connection, over whatever protocol you're managing it through. We need auditing down to the user level: I need to know who accessed the data and which patient they looked for, and be able to tie that specifically to that user. We have a lot of service accounts in the hospital, accounts that just have full access to a system, and those aren't good enough; knowing that service account 1234 accessed Steve Sloka's data doesn't help me audit who looked at what data, so I need auditing down to the individual user. And I need a BAA, and this is with vendors and other third parties; that's a business associate agreement, basically a contract between you and a third party that spells out how you're going to manage this data in a HIPAA-compliant way. Cool. Those are the rules we're going to focus on today in terms of meeting compliance.

So for on-premises, I'll give you an overview, and again, I'm going to try to be open and clear about what we're doing. You may see things that could be done better; it's just the reality of where we've grown from in our Kubernetes journey. For infrastructure, right now we run Red Hat Atomic, and we're looking at moving to CoreOS Tectonic. This is all running on VMware servers within our corporate data centers. We started our Kubernetes journey in January of 2015, back when nodes were still called minions and it was still beta; it was only about six months old and didn't GA until that summer. So we were on the bleeding edge of Red Hat Atomic as well as Kubernetes, plus our own learning of how this stuff worked.

For storage, we use NFS, for good or bad, and we also use local storage. Some applications we have may have 20 users, right? So the complexity of being cloud native, with dynamic persistent volumes and all those fun topics, sometimes outweighs the simplicity of just writing data to a local disk for those small applications. Sometimes do the simplest thing that solves your problem and don't get too crazy.

For load balancers on-prem, we use a lot of F5. F5 can route to node ports, and we're looking at doing some ingress routing as well. We also do just straight node ports.
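Just to illustrate that node-port pattern, a Service along these lines would do it; this is a minimal sketch, and the names and port numbers are made up, not our actual configs:

```yaml
# Hypothetical example: expose an app on a NodePort so an external
# load balancer (or an engineer) can hit <any-node-ip>:30080.
apiVersion: v1
kind: Service
metadata:
  name: patient-portal        # made-up application name
spec:
  type: NodePort
  selector:
    app: patient-portal       # matches the pods backing this service
  ports:
    - port: 80                # cluster-internal port
      targetPort: 8080        # container port
      nodePort: 30080         # fixed port opened on every node
```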
So an engineer may push an application, expose it at a node port, and then hit one of those nodes and just access it that way.

In terms of workloads, we run a lot of CI/CD. We have rules that say our source control systems must live on-prem, so our GitLab runners and our Jenkins workers build images and push them out to Amazon, where we're actually going to run these things in production. We also have some local agents, things like data collection and proxying, which we'll get into in a few slides; those run in our internal Kubernetes clusters.

I have this slide just to show you that on-prem is harder. It's not impossible, but I think the one thing on-prem is missing, at least in our environments, is good APIs. Given a good API you can do anything, right? And that's what cloud environments typically give us: the ability to script out machines, load balancers, and those sorts of things. I'm not saying it's impossible on-prem; it's just that for the infrastructure we have, those are the struggles we ran into. That's why we made this move to the cloud, just because it was simpler and faster for us to get things done and keep things moving. You've probably heard that story over and over, I'm sure.

So now we're in public cloud mode, and our cloud infrastructure looks like this. We run everything on AWS. We have a few services in Azure for authentication to our back-end AD systems, but mostly everything is in AWS. We run a single region across multiple availability zones; typically us-east-1, Virginia, across three zones. We run multiple worker autoscaling groups. An autoscaling group is kind of like your deployment: it says I want to have n number of these machines running. Those reference launch configurations, which are like the spec in your deployment: hey, I want 10 machines of this type. We mix those up, so you can have general purpose machines running, machines with GPUs, maybe high memory machines, and you can mix and match those to meet the capacity our users need. Everything runs on Container Linux from CoreOS, and it's some custom CloudFormation we've put together that deploys those environments. Again, I mentioned the CI pipelines; those push to Elastic Container Registry in AWS. We run classic ELBs, and we do a lot of VPNs to the on-prem systems; from on-prem to the cloud VPCs, we'll create those VPN tunnels.

Here's a look at what our Kubernetes cluster looks like when we deploy it to AWS. We deploy following the NIST templates, from the National Institute of Standards and Technology. That outlines basically two VPCs: on the left you have an application VPC, on the right there's a management VPC, and they're peered together. Everything in the application VPC is private to the world; there are no public addresses. The only access is via the bastion hosts in that management VPC. We manage the users on those bastion hosts with OpsWorks, and if you're an AWS guru that sounds funny, but we're using its ability to drop SSH keys onto those boxes. The cool thing about OpsWorks is that it will manage SSH keys across regions: you can have a machine in every region, I can drop an SSH key in there and it will deploy it for me, and if I kill that key it will revoke it from all the machines. So now I kind of get key management at the edge for free from AWS.
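Going back to the autoscaling groups and launch configurations for a second, a minimal CloudFormation sketch of that pattern might look like this; the AMI, subnets, security group, and sizes are placeholders, not our production templates:

```yaml
# Hypothetical sketch: a launch configuration (the "spec") plus an
# autoscaling group (the "deployment") for a pool of worker nodes.
Resources:
  WorkerLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-00000000          # placeholder Container Linux AMI
      InstanceType: m4.large         # e.g. a general purpose worker pool
      SecurityGroups:
        - !Ref WorkerSecurityGroup   # assumed to be defined elsewhere
  WorkerAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      LaunchConfigurationName: !Ref WorkerLaunchConfig
      MinSize: "3"
      MaxSize: "10"
      DesiredCapacity: "3"
      VPCZoneIdentifier:             # spread across three private AZ subnets
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
        - !Ref PrivateSubnetC
```

You'd have one of these pairs per worker pool (general purpose, GPU, high memory) and adjust the instance type and sizes per pool.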
Again, this is open source out on the UPMC Enterprises GitHub, so if you're curious about how that looks, feel free to dig in.

This screen is going to be super hard to read, so don't try, but it shows you all of the HIPAA-compliant offerings in AWS. As part of that BAA, AWS says, hey, we'll certify these services for you to use, and the idea is that we can only put PHI in a service that AWS has certified. If it's not on that list, we can't put PHI there. It doesn't mean you can't use the service; it just means you can't put PHI in it. At one point ElastiCache, which is like Redis for the cloud, was not compliant, so we couldn't use it for storing patient data, but we could use it for, say, a phone number lookup or a zip code lookup. As long as you don't put PHI in those systems, you'll still be compliant and you won't mess that up, as long as you can manage that cleanly within your environments.

So not everything is compliant, like I said. Typically new services aren't: things that came out at re:Invent last week mostly aren't going to be compliant while they're new. EKS, which was just announced, probably won't be compliant for a while; it's still in preview. And there are gotchas, like the spot between an internet gateway and an ELB that's not encrypted. If I have an on-prem system and a VPN tunnel to my VPC, that spot right there is not encrypted; I have a full tunnel, and that little gap basically breaks my encryption in transit story. We have systems on-prem that don't do TLS, they just don't, so if we fire those streams up to the cloud through this encrypted VPN tunnel, we'll miss HIPAA compliance because that little spot doesn't meet compliance in AWS's cloud. These are the kinds of gotchas you have to watch out for; it's not always super simple. And that's why you pull in a good solutions architect from AWS. This is a good plug for Mike Kuhns; if everyone goes and tweets him now and says good job, Mike, I'd appreciate it. He's been a great help to us: he helps you build out your solutions, know what new things are coming, and how things can adapt. He's been a great resource to help us manage ourselves in the cloud, because it is honestly a lot to deal with.

So let's talk about some workloads, things we run on Kubernetes. We run some stateless applications, and you think of a stateless application as easy peasy, right? Stateless apps were the first workload we ever ran; that was my first Hello World demo on Kubernetes. They're self-contained, there's no storage, there are no dependencies, and we can scale them with deployments. They're super simple: all you really have to manage there is encryption in transit, and you're pretty much set. So those are easy; we won't talk too much about those.

What gets interesting are stateful applications. I need to store state somewhere in my application, so how do I do that? Well, there are struggles we run into. How do you manage persistent storage? How do you resize and upgrade: I have this application running and I want to upgrade it and make it bigger, how do I do that? How do I reconfigure it? Am I templating it: I have this thing specced out somehow and I scale it up to a different number of replicas, how do I store that back in my configuration so I can redeploy it easily? How do I back it up and make sure I can restore in case of disasters?
Well, typically we use stateful sets for that, right? StatefulSets are kind of the story today of how we do this, and what they do is let you have a persistent volume attached to a pod. I think of My Buddy, the toy from the 80s; I grew up in the 80s. It's my buddy and me: wherever your pod goes, your volume is going to follow. And you basically get one volume per claim. This solves one problem, the issue of how we manage persistent storage. In AWS's world we can use EBS volumes, which are network-attached storage, and as long as I stay in the same availability zone, that volume will follow the pod around, no problem. But the problem is that we still have the other pieces missing: how do we resize, how do we reconfigure, how do we back it up? We haven't solved that story with stateful sets.

So the answer, then, is maybe an operator. CoreOS coined this term about a year or so ago; basically, an operator is a tool that lets you capture application-specific operational knowledge for a complex app in code. Now, instead of having a YAML file describe your application, you have an operator, and the operator's job is to know how to do those things: how to back it up, how to scale it, how to resize it. But I want to simplify this and think of it this way: let's just make it work like a cloud-provided offering. I don't care that it's an operator and that it has CRDs and all these crazy things that are fun and interesting; as a user, all I care about is that I can deploy this and have it look like my AWS resources. Anytime I can offload work to AWS, I'm going to do that, because it's easy. Think of an RDS database, say MySQL: I can tell AWS, hey, build me a MySQL database, check a box to scale it across zones, back it up at night, and do patches at 3 a.m., and then I'm hands off. What I want is for these operators to work the same way: I'll have a bit of YAML with my CRD describing that I want a cluster of this size, scaled across zones, and backed up automatically.

And that leads us to the Elasticsearch operator. This is an open source tool that I've written, and it grew out of our need for Elasticsearch. We have a large system pulling in documents from the hospital, and we wanted full text search on that system. Amazon offers Elasticsearch, but it's not HIPAA compliant, so again we had to build around it, and that's where this was born.

The Elasticsearch operator basically mimics the cloud offering. It has full TLS, including automatic cert generation: if you don't provide certs, it will generate its own, because again we want full encryption in transit. For encryption at rest, we use encrypted EBS volumes: the operator creates storage classes behind the scenes, and those storage classes have the checkbox set to encrypt the volumes, so through KMS you'll be encrypting those EBS volumes automatically. It implements encryption in transit via Search Guard. Search Guard is a tool from floragunn; they have a commercial offering, but there's an open source version as well, and that's worked really well for us over the last few months. It also spans availability zones: when you deploy this, I mentioned we go across three AZs in AWS, and it will make sure the data nodes are distributed evenly across those zones.
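The encryption-at-rest piece is the easiest to picture. Roughly, the kind of storage class involved looks like this; this is a sketch, not the operator's exact output, and the zone and KMS key are placeholders:

```yaml
# Sketch: an AWS EBS storage class with encryption turned on, so any
# persistent volume provisioned from it is encrypted via KMS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: es-data-us-east-1a
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
  zone: us-east-1a   # one class per AZ keeps each volume next to its pod
  # kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/placeholder  # optional; defaults to the account's EBS key
```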
So in the case of a zone failure, you'll have the proper replicas in the right zones. It does automatic snapshots to S3, and it deploys add-ons as well: things like Kibana, which is a tool to visualize your data, and Cerebro, a tool to manage Elasticsearch itself, so you can get stats off of it. If you look at this, it's pretty much meeting the needs I wanted in terms of the pieces that were missing for stateful applications. But there are some gotchas, some work to do, things we haven't been able to work through yet. Things like shard allocation and zone awareness: I mentioned we're distributing nodes across the zones, but we still have to tell Elasticsearch how to allocate shards across those zones properly. And some things like the Elasticsearch cluster status we don't have implemented yet; that will let us enable rolling restarts and upgrades. Cool.

The next thing I want to talk about is the Kong operator. This is the second operator we've written, built around the Kong open source API gateway. We use this gateway so we can isolate and centralize traffic handling within our applications, and the problem we wanted to solve was enabling HMAC authentication for clients.

Here's a picture of what we would typically do, and this may be familiar to some people. At the top we have our client, which is going to send traffic into our cluster, and we have three APIs. Each one of those APIs implements authentication and logging and rate limiting and all those different features. The problem is that each team is spending time building these things out and validating them; they have to patch them if things go wrong; they may be different stacks, one Ruby, one Java; and if there's a vulnerability, you may fix one and not the other. It's just hard to manage. We want to get to a world that looks like this: at the bottom, the teams don't have to deal with all those things in the middle box. Basically, all traffic routes from the client through the Kong API gateway, and that's what does the logging and authentication and all those bits. This way our teams are freed up from worrying about all those repeatable things and can focus on just building their application. The only thing they really have to do is read some headers that come off of Kong, to know who a user is and what group they're in, that sort of thing. So here we've greatly simplified the developer experience.

Right now this is still using third-party resources, because we're still on a 1.6 cluster, but you can still express the configuration as YAML, which means we can check it into source control and code review it. To configure Kong you use its RESTful admin API, and you can't check a RESTful API into source control, unless you wrote some bash, which would be kind of weird. That's why we wrote this operator. Now we can describe the services, which plugins to turn on, and what the upstreams map to in the cluster, all in YAML; we write that to the cluster, and then magically Kong spins up and configures itself.

Here's what this might look like. Here we have two namespaces, and say each namespace is an application: namespace A is application one, namespace B is application two.
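To give a flavor of describing that Kong config in YAML, a resource might look roughly like this; the kind, group, and field names are purely illustrative and are not the actual schema of the operator:

```yaml
# Hypothetical shape of a Kong operator resource: which API to expose,
# which plugins to enable, and which in-cluster service is the upstream.
apiVersion: kong.example.com/v1      # illustrative group/version
kind: KongApi                        # illustrative kind
metadata:
  name: patient-search
  namespace: app-a                   # "application one" from the example
spec:
  upstream: http://patient-search.app-a.svc.cluster.local:8080
  plugins:
    - name: hmac-auth                # Kong's HMAC authentication plugin
    - name: rate-limiting
    - name: file-log                 # centralized request logging
```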
You'll see that we've deployed a Kong pod in every namespace. The idea is that traffic will hit Kong, get authenticated, and then get passed to the application. The problem we've now introduced with this model is that nothing stops pod one from talking to pod two directly, and by doing that, they've circumvented the authentication and logging and all those things we wanted to implement centrally. We don't want that. What we want is to enforce that traffic from one application pod to another routes through Kong again, so an application talking to another application is forced through those same authentication mechanisms we've implemented.

The way we enforce this is with network policies. Network policies are implemented by a policy controller, and that controller writes iptables rules behind the scenes to enforce them. What you do is create label queries: you say, hey, this application will only accept requests from pods with, say, a kong label, and that makes sure only Kong pods can talk to the application pods. I won't go into great detail about how network policies work; basically, you can limit the traffic you send around your cluster, and in 1.8 you can now also limit egress traffic, so you can set network policies around traffic that leaves the cluster as well.

I must have talked really fast, because I spun right through that. So that's the end of my slides. Again, I'm Steve Sloka; I work for Heptio now as a customer success engineer. I hope this was helpful and beneficial. If you have any questions, I can answer those now.

Sure, yeah. So the question was, how long ago did we build this, and what would we do differently now that we know what's going on? Yeah, the service mesh stuff like Istio is interesting, because the problem is having full TLS between services. We end up writing a lot of self-signed certs and passing them all around, and it just becomes a mess because it's hard to manage. We've looked at doing things like using Vault for TLS management, to become our own central CA, but we never did that, because then we'd have to manage another stateful layer across zones. But yeah, I think something like a mesh would be helpful, because then you can turn on TLS between all the pipes, and it also enhances network policies in the sense of the demo today with Tigera: hey, let's put policies on every single pod and then we can manage the traffic. Yeah, for sure, I think that's what I would do. I just don't think Istio is there yet in terms of being production ready. I think it's coming, I think it's neat, but I don't know that we could go there tomorrow.

Yes, yeah, everything. Yeah, everything needs to be encrypted, yeah. Yeah, so the question was that if you use Istio, you're going to have a sidecar next to your pod, and then basically traffic is going to route over localhost. I don't know the HIPAA answer to that, I guess. I think it never leaves the network namespace of the pod, so it should be okay, but yeah, sure. We're not using Istio, so all of our traffic is always TLS as it is, because we don't have that option today. So, something to check into, yeah.
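Coming back to the network policy enforcement from the talk, a minimal policy along these lines is the idea; the labels and namespace here are made up for illustration:

```yaml
# Sketch: application pods in this namespace accept ingress traffic only
# from pods carrying the app: kong label, so other application pods can't
# bypass the gateway's authentication and logging.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: only-kong-ingress
  namespace: app-a
spec:
  podSelector:
    matchLabels:
      app: patient-search        # the protected application pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: kong          # only the Kong gateway pods
      ports:
        - protocol: TCP
          port: 8080
```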
Yes, so yeah, this gentleman was saying it should be okay because you're not going over the wire, so traffic isn't going to route anywhere, yeah. And a lot of this, too: nothing in AWS stops you from using a non-compliant service, just your own legal obligations do. My console doesn't get smaller or have fewer things in it; it's all still there, and it's up to you to make sure you don't do the wrong thing, which is kind of cool, but kind of scary. And it's always changing. That's why it's nice having AWS on your side, because every month there's something new that wasn't compliant that now is.

Yeah, yes, yes, in this example, yes, we are. And each team manages their own Kong implementation, so they're in charge of "here's my Kong and here are my upstream APIs." Right, yeah, so the question was where we deploy this YAML: we deploy a Kong pod per namespace, and they're not sidecars, they're just separate services, yeah. And right, we were thinking about having one Kong namespace and then writing network policies out from there, but it got tricky, and our initial version was, let's keep it simple and deploy it this way and just eat the overhead of Kong, which is basically NGINX. If you're not familiar with Kong, it's just running another NGINX-ingress kind of pod, so it's not really high overhead, I guess, yeah. Yes, no, I don't know if we had a specific version of that; no, I'm not sure.

So yeah, the question is that there are services in Google Cloud you're using that aren't HIPAA compliant yet, but the data is encrypted. I think you'd have to check with Google to see if there would be a legal issue. Again, you can do whatever you want; there's nothing stopping you, but it's a matter of, if you had a data breach, I think Google's going to say, hey, that's not our job. Yeah, yeah, that's compliant, so that is still compliant; it's just a matter of whether there's a legal issue you want to deal with, because if Google isn't certifying that service as HIPAA compliant, they're not going to help you out if there is a breach. But you are doing the right things, you know what I mean? Like, say API Gateway is a service in AWS that's compliant, but caching is not; if I stored cached PHI in that API gateway, that sounds weird I know, but I think they still wouldn't help me if I had a breach, because they said that's not compliant, so you shouldn't do that. But you are doing the right things in terms of encrypting data, for sure. Yeah, you would be compliant, yes.

Sort of, not really. So we have Fluentd running everywhere, and we typically just write to standard out to keep things simple. We prepend the log line with a header, something like HIPAA or some unique value, and then we can put a filter on that, so when Fluentd picks up those records, it can dump them to an S3 bucket somewhere, yeah.
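A rough sketch of that kind of Fluentd filter, carried in a ConfigMap; the tag pattern, bucket, and the HIPAA marker are placeholders, and the real pipelines differ per team:

```yaml
# Sketch: grep out records whose log line starts with a HIPAA audit marker
# and ship just those to an S3 bucket via fluent-plugin-s3.
# (A full pipeline would copy/relabel so regular logs still flow elsewhere.)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-hipaa-audit
data:
  fluent.conf: |
    <filter kubernetes.**>
      @type grep
      <regexp>
        key log
        pattern /^HIPAA/
      </regexp>
    </filter>
    <match kubernetes.**>
      @type s3
      s3_bucket hipaa-audit-logs       # placeholder bucket
      s3_region us-east-1
      path audit/
      <buffer time>
        timekey 3600                   # flush roughly hourly
      </buffer>
    </match>
```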
We also have one team, and we have lots of teams doing this in different ways, which is probably not unusual, but one team wrote a little sidecar to dump it to S3. They just write to a place on disk; that was because they were running on-prem writing to an NFS share, and then we switched to the cloud and S3 was the right place for it. That's the best thing we've come up with so far, yeah. Yeah, it is, it's always hard, yeah. Yeah, and then it becomes this, yeah. So, wait, in the back. I'm sorry, what was that? He's repeating.

So the question was whether HIPAA compliance requires antivirus running on all the nodes. Well, I guess we don't have that; we're running CoreOS, yeah. Yeah, that is interesting. I know we have lots of people in our environment wanting to run security scans and all those things, and then they find out they're running on CoreOS and they just don't deploy, because there's no agent, you can't install anything. So yeah, we don't have anything like that, at least on our Kubernetes infrastructure. I know there are a few Windows instances that people run that may have those on there, but yeah, there's a trade-off. So double check; don't just say Steve said you could do it, you know. Yeah, cool, anything else? Yeah.

Yeah, so the question was about compliance around the images we're building and the binaries, those sorts of things. We don't have any rules around that yet. I know we're looking at implementing Twistlock as one of our solutions on the on-prem systems, and that will help us do image scanning for vulnerabilities in the images. On the open source side we're also looking at Clair from CoreOS, which does scanning and is a free open source thing, and there's a thing called Kate: basically, when a new pod spins up, Kate will scan it with Clair on the fly. So you may scan an image up front, and then the next day a vulnerability comes out; Kate will basically say, hey, this thing was good yesterday, but it just spun up now and there are issues with it. We haven't actually implemented any of that yet, but yeah, those are things that I would do, I guess.

Yeah, no, it's goofy; HIPAA compliance isn't cut-and-dried rules. The better you build your networks and the more secure you are, the better, so all the best practices come into play. I think it's great: the fewer vulnerabilities you have, the less room for error you're going to have, so I think it's always a good thing to work that way. Yeah, anybody else? Yeah.

So the question was whether network policies are enough to satisfy our compliance. I think they were, at least the last time we did this. We typically deploy an environment, or an AWS account, per application or per team, which keeps things pretty small, so we didn't have big multi-tenant clusters where we had to deal with lots of different applications. So yeah, those were enough, I think, to satisfy our folks, yeah, yes.

Yeah, so the question was, how do we deal with incident response? Say we get compromised, what do we do? We have a whole cloud team that manages these environments, and at UPMC there's a big split between development and deployment, so engineers can't access production; at least they're off the hook in terms of working on that.
I know they have ticketing systems, and the architecture has those bastion boxes, so if we had some compromise, we could turn off those bastion boxes and basically isolate ourselves. And if I could figure out which container it was, I would just remove the labels from it so it comes out of the deployment but stays running where it is, and then Kube will spin up a new version of that pod to keep the service going. In the cloud environments we don't have any local data; I guess we could detach the EBS volumes if we had to, and that's how we would isolate that, yeah. And most of our data stores in the public cloud, aside from Elasticsearch, are AWS managed services, things like Aurora and ElastiCache and the queuing service, so they're all things outside of our clusters that we just talk to, yeah.

Anybody else have questions? We're out of time, so. Hey, thank you again, appreciate it.