Right, so welcome back everyone. This is session room number two. I hope you are all well rested after the break. We have Lucas Fernandez Aragon and Maulik Shah, both from Red Hat, and the topic will be implementing high availability for the cloud. Okay guys, you can take it away.

Thank you, thank you very much, good morning. I'm Lucas, a software engineer in the AI Services team at Red Hat. I'm a full stack developer who now works on the SRE side of the project. I've been working on the Red Hat OpenShift Data Science platform for about nine months now, and today we're going to talk about the high availability features we implemented in it. I love music, I used to play in a few bands, and I love cats. And here's my teammate Maulik.

I'm Maulik, I'm also a software engineer on the same team as Lucas, which is AI Services here at Red Hat. I've been working on this project, called RHODS, for my whole career after graduating, which is about four years. I also contribute most of these components to the upstream project, which is Open Data Hub. I enjoy video games, and I love all the cats that my teammates have.

So let's start talking about high availability. Just a quick introduction: high availability is a term used to describe the period of time when a service is available, as well as the time required by a system to respond to a user request. In terms of a service deployment, high availability is the ability of that service or component to assure performance over a period of time. When setting up deployments, minimizing downtime and service interruptions is one of our highest priorities; downtimes are a key driver of cost, errors, and user dissatisfaction. Take into account that a system that guarantees 99 percent availability can still have up to 3.65 days of downtime in a year. That's a lot.
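As a quick sanity check on those numbers, the downtime budget for an availability target is just the fraction of a year you are allowed to be down. A small illustrative calculation, not from the talk's codebase:

```python
# Downtime budget implied by an availability target over one (non-leap) year.
def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of allowed downtime per year for a given availability percentage."""
    hours_per_year = 365 * 24  # 8760 hours
    return hours_per_year * (1 - availability_pct / 100)

print(round(downtime_hours_per_year(99.0) / 24, 2))  # 3.65 days at 99%
print(round(downtime_hours_per_year(99.99), 2))      # 0.88 hours at 99.99%
```

This is why the talk's jump from "two nines" to "four nines" matters: each extra nine shrinks the yearly budget by a factor of ten.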
So we are aiming for 99.99 percent availability, which is less than an hour of downtime a year. The main goal with high availability is to identify the single points of failure in our system; these single points of failure are elements that could cause a service interruption. Now we're going to talk about how we identified them and how we implemented high availability in our components.

The first of them is JupyterHub. JupyterHub is the most popular multi-user Jupyter notebook environment; it's the way to go when you need to serve notebooks to a large number of users. Well, what's a notebook? It's an application for data scientists and AI engineers for creating and sharing computational documents. They're widely used by people around the world to process data and to create, train, and share AI models, without local setups and dependency management. As we can see in the picture here, the architecture can easily be adapted to fit our services, with the Hub server at the center. That's the one controlling the database, authentication management, and the spawners, the actual notebooks, with traffic being redirected by the proxy. And now Maulik is going to talk about Traefik.

Yep, so JupyterHub and its proxy were the two biggest points of failure, tightly coupled in the service. The goal was how to get this to high availability. JupyterHub has a built-in in-memory proxy: a user request comes in, the proxy checks what's in memory, and the request is routed to the correct pod. But what if the JupyterHub server pod went down? We would lose all the proxy information. So the first goal was to decouple the proxy into this external Traefik proxy, which is basically an edge router. It follows a set of simple rules stored in an etcd cluster, or any key-value store, where for an incoming request it just says:
"I know what to do with this, I'll just send it to the right pod." If it doesn't know what to do with it, if the rule doesn't exist yet, it just sends the request to the main JupyterHub server for login, authentication, and spawning operations. And that's pretty much all Traefik does. We ended up using it because it's super lightweight, it's highly resilient, it's horizontally scalable, and that fits all our purposes: we are not burning a bunch of compute just to run a proxy, and if anything goes down, another replica just takes over and it keeps working.

So this is the architecture we're talking about here, for JupyterHub and Traefik. A user logs into the dashboard; the request goes to a load balancer and from there to Traefik. A new request is routed to the JupyterHub server pod, which sees that the user does not have a running notebook pod. It spawns a notebook pod for the user, backed by a PVC, and writes a Traefik routing rule to a ConfigMap; Traefik is constantly watching that ConfigMap for updates to the routing rules. So once the notebook pod is up and running and the rule is committed to the ConfigMap, Traefik routes the user's requests straight to the notebook pod. Now, JupyterHub is a stateful application, which needs a lock on the database. But how do we make the database persistent and highly available? So, for production clusters,
we ended up using RDS, which is the Amazon Relational Database Service, and for internal clusters we ended up using a PostgreSQL operator, which provides a good enough solution for our internal purposes, but not for customer-facing production use cases.

One interesting thing we did here: Traefik normally wants an etcd backend, but at its core it just needs somewhere it can read the rules as key-value pairs. So instead of running yet another service that could fail, we tried to leverage the Kubernetes API itself, in this case a ConfigMap, where the JupyterHub server writes to it exclusively and Traefik just reads from it. It works, and it's persistent, backed by the Kubernetes cluster itself; if the ConfigMap is failing, you probably have bigger problems with the cluster than with our service.

So this is the final architecture that we ended up going ahead with, and we have been able to achieve 99.99 percent uptime in almost all our clusters, barring some weird problems whose root cause was not our service itself.

So, moving ahead with the demo. Here I have a cluster with the whole project running. As we can see, we have three Traefik proxy pods; like I said, it's horizontally scalable.
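The routing behaviour described in that walkthrough boils down to a prefix lookup with a fallback: known path prefixes map to notebook pods, and anything without a rule goes to the hub for login and spawning. A rough sketch; the function and backend names are invented for illustration, not taken from the actual Traefik or JupyterHub code:

```python
# Illustrative rule table: path prefix -> backend, as stored in the ConfigMap.
HUB_BACKEND = "jupyterhub-server"

def route(path: str, rules: dict) -> str:
    """Route to the user's notebook pod if a rule matches, else to the hub."""
    for prefix, backend in rules.items():
        if path.startswith(prefix):
            return backend
    return HUB_BACKEND  # no rule yet: login/auth/spawn requests go to the hub

rules = {"/user/alice/": "notebook-pod-alice"}
print(route("/user/alice/lab", rules))  # notebook-pod-alice
print(route("/hub/login", rules))       # jupyterhub-server
```

The key property is that the fallback is the only path that depends on the hub being up, which is what lets already-running notebooks survive a hub outage later in the demo.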
We have the dashboards running, and we have the JupyterHub server running. As we said, JupyterHub is a stateful application, so you can't horizontally scale it. What we ended up doing was a leader-election kind of setup, where there is always exactly one leader pod. You can see that once a pod starts, it tries to acquire the leader lease. Well, this one could, so it started leading; that's this same pod. If we go to a separate JupyterHub pod, it's also trying to acquire the leader lease, but it just found that a leader was already elected. So we have this one pod which is now the leader.

Now, if that pod were to go down, for example in this scenario, I'm just waiting for one of these other pods to take over. It takes around five seconds, so the only downtime we see in this system is five or six seconds generally. Yep, as you can see, a new leader was just elected: this pod, which had been running for some time, instead of the one that went down, got elected as leader and started serving.

In this setup, all JupyterHub does is authentication and spawn operations, so during leader election the only downtime we really have is on spawn operations. But if a user already has a running pod, let's scale down JupyterHub. Even if JupyterHub is down altogether, I as a user already have a running pod, and the horizontally scalable Traefik proxy would just take me to it. Oh, maybe I'm not authenticated. Yeah, let's check. Yep, one second, I'll show this; it works.

Well, let's go to the Traefik part of it now. If we were to scale down Traefik to maybe one available pod in this scenario: yep, it still works just fine. It's probably going to ask me to log in here, so let's do this: a loop, for i in range 900, and give it some time to do that.
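The lease behaviour shown in that demo can be sketched in a few lines: each pod periodically tries to acquire a shared lease, and whoever holds an unexpired lease is the leader. This is a toy model, not the real sidecar implementation; the class and pod names are invented:

```python
# Toy lease-based leader election: free or expired lease -> acquirable;
# the current holder may renew it.
LEASE_DURATION = 5.0  # seconds; roughly the failover window seen in the demo

class Lease:
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, pod, now):
        """Acquire the lease if it is free, held by us, or expired."""
        if self.holder in (None, pod) or now >= self.expires_at:
            self.holder = pod
            self.expires_at = now + LEASE_DURATION
            return True
        return False

lease = Lease()
print(lease.try_acquire("hub-a", now=0.0))  # True: hub-a becomes leader
print(lease.try_acquire("hub-b", now=1.0))  # False: hub-b stays on standby
print(lease.try_acquire("hub-b", now=6.0))  # True: lease expired, failover
```

The failover delay the speakers observe (five to six seconds) corresponds to the lease duration: a standby pod can only take over once the dead leader's lease has expired.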
It's now doing its thing. Now I should have a valid session cookie. So let's scale down JupyterHub, and yep, JupyterHub's gone. Why is it doing this? I know why; let's try again. So if I have a valid user notebook pod running, it should take me to it even if JupyterHub is completely offline. What this does is minimize the downtime for most of the users, except the ones who are trying to log in while leader election is taking place. We haven't seen a scenario where JupyterHub was completely out with no actual leader; the only downtime we have had there was barely five or six seconds for the new election to take place.

Yet another solution that we ended up custom implementing: custom watchers. So, where are the ConfigMaps? Yeah, as we can see here, we have this traefik-rules ConfigMap. Whenever a new user pod comes up, JupyterHub writes the routing information to this ConfigMap, which Traefik would constantly read. Well, not in a loop: using the Kubernetes APIs, we notify Traefik that there is an update to the ConfigMap and tell it to pick up the new file. This is how we were able to achieve 99.99 percent uptime for JupyterHub. The thing I wanted to highlight is that we can have similar strategies for any application.
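The "watch, don't poll" pattern just described can be modelled simply: the proxy keeps a cached copy of the rules and reloads only when the ConfigMap's version changes, the way a Kubernetes watch event would signal an update. The classes below are illustrative stand-ins for the real Kubernetes API objects:

```python
# Simulated ConfigMap with a monotonically increasing version, plus a
# reader that reloads its rule cache only when that version changes.
class ConfigMapStore:
    def __init__(self):
        self.data = {}
        self.version = 0

    def write(self, key, value):
        self.data[key] = value
        self.version += 1  # Kubernetes bumps resourceVersion on every update

class RuleCache:
    def __init__(self, store):
        self.store = store
        self.seen_version = -1
        self.rules = {}
        self.reloads = 0

    def sync(self):
        """Reload rules only if the store changed since the last sync."""
        if self.store.version != self.seen_version:
            self.rules = dict(self.store.data)
            self.seen_version = self.store.version
            self.reloads += 1

store = ConfigMapStore()
cache = RuleCache(store)
cache.sync(); cache.sync()                         # second sync is a no-op
store.write("/user/alice/", "notebook-pod-alice")
cache.sync()
print(cache.reloads, cache.rules)                  # 2 reloads, one rule cached
```

This mirrors the design choice in the talk: the proxy does cheap no-op checks most of the time and only pays for a reload when the hub has actually committed a new rule.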
Well, if it's horizontally scalable, you just scale it. If there's something where you need persistence, you could use Kubernetes Secrets or ConfigMaps as the backing storage for information like that. For stateful applications, you can still run multiple pods; all you need to do is add leader-election sidecars, and you would always have that one active pod and two hot-swappable ones ready to go. And that was the bigger focus of this, beyond JupyterHub: how do we take any service and get it to a highly available state, or if not to full HA, then as close to it as possible, so you don't have to worry about getting paged at three in the night.

Yeah, and just like you've said, for example with leader election, the current implementation led to some drawbacks that we need to address. For example, we need to add liveness and readiness probes, because when we added leader election, it's a mechanism like any other, and it will have some problems. For example, with load balancing we were directing traffic to all three pods, even the ones that weren't elected, so we need to implement a readiness probe in order to say: hey, I'm the one who should be receiving the traffic. It could also lead to fencing problems, in which two or more leaders could be elected.
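The readiness-probe fix just mentioned amounts to tying "ready for traffic" to "currently holds the lease". A toy illustration of the idea; the function and setup are invented, and a real probe would be an HTTP endpoint checked by the kubelet:

```python
# Only the pod currently holding the leader lease reports Ready, so the
# service load balancer stops sending traffic to the standby pods.
def is_ready(pod, current_leader):
    """Readiness check: only the elected leader should receive traffic."""
    return pod == current_leader

pods = ["hub-a", "hub-b", "hub-c"]
ready = [p for p in pods if is_ready(p, "hub-b")]
print(ready)  # ['hub-b']
```

With this check in place, the load-balancing problem the speakers describe goes away: standby pods stay alive and ready to take over, but never appear in the service's endpoint list.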
It's a lock problem, a race condition, that kind of thing. So yeah, with a liveness probe, when this problem occurs, the pod that was wrongly elected leader will be killed. Those are things we need to take into account: when we add new mechanisms like this, they can introduce other single points of failure that we need to address. That leads us to the lessons learned.

We have a lot of things to cover here, and our three main topics are these. First, simplify things. For example, for leader election we tried to use another image, but we didn't realize it was not supported anymore, and we wanted a more basic leader-election mechanism, so we ended up implementing it ourselves. Instead of diving into complex architectures, we identified our needs, and it turned out that a simple implementation was the thing we needed, so we went for it.

Then, think about single points of failure. For example, it's never a good idea to run a database on the same host as the service, because if something goes wrong, there is a risk of catastrophic failure. And this leads to our last point: things break. You must always bear in mind that things could break, and we as developers need to build in redundancy and have as few points of failure as possible. That's why we have the ConfigMaps for the routing rules, replication of both the Traefik and notebook server pods, probes for the leader-election mechanism, both Amazon RDS and PostgreSQL for data redundancy, and more. So yeah, we always bear that in mind, and we try to act up front to prevent those issues.

Last but not least, we have all the repositories with the actual code and the built images; if you want to test them or look at the code, you can go ahead and check them out. So yeah, I hope you enjoyed our talk, and I guess let's proceed to the Q&A.

Yeah, let's go with the Q&A. I don't see any questions from the audience, and we still have a couple of minutes left. No questions in the Q&A, no questions in the chat. So I think that's it for now, but of course Lucas might be hanging around in the chat afterwards, so you can find him there. Thank you, thank you guys, thank you for the presentation. Anytime, and I'm going to post the links here for sure.