Today we'll cover the motivation, why we're doing this, and then we'll take you through the architecture we've built. We're trying to set up a live demo for you; we'll see if the demo gods are with us. We did sacrifice a few clusters this morning, so let's see if that comes through. And then obviously we'll take some questions and have some fun together. My name is Pedro Oliveira, I'm a Senior Solutions Architect at Spectro Cloud, and I'm joined here today by my esteemed friend Tyler. Pleased to be here today, honored to be presenting at KubeCon. I'm a Principal Software Engineer at Spectro Cloud, and I manage our advanced projects team, so we get to build POCs with all the latest, greatest, most exciting Kubernetes tech and sometimes integrate it with the core platform. One example being two-node HA. You get to play with the cool stuff. Yeah, exactly. Cool.

So let's take a step back and picture the scene: we're an enterprise running petrol stations, coffee shops, or maybe a retail chain. The fact of the matter is you'll have a lot of locations, and perhaps you want to deploy Kubernetes applications there at the edge, because you're going through a transformation and want to give your customers a better experience, or perhaps you're running inferencing at the edge, because these days everything is about AI. As you do that, your applications need to be highly available, because they're critical. There are situations where you can get away with one node running those applications; if your mean time to resolution when the application goes down is a few hours, or maybe a few days, and you can sustain that, then that's okay. But what we actually see with most of the customers and enterprises we work with is a massive need to run business-critical applications at the edge, and when that's the case you need highly available infrastructure and, obviously, highly available applications.
What this means is that things really start to stack up. In the data center things are easy: we have highly available power, highly available virtual machines and bare-metal nodes, life is good. At the edge that's not really the case. We're often highly constrained on power, on physical size, and also on money, because things really start to add up. High availability at the edge typically means multiplying everything by three. And why is that? Everyone will be very familiar with etcd, and typically etcd is the culprit. etcd is the distributed key-value store for Kubernetes, and it uses Raft, a consensus algorithm, to figure out who the leader is and who's in charge. Now, you can run etcd with two nodes, but you have no failover: if one of those two nodes goes down, etcd has a catastrophic failure because you lose quorum. And if you've been doing Kubernetes long enough to have had to restore etcd, you'll know how hard that is, no matter how easy the CKA makes it look.

So we were really looking at how to move away from this, in a way that, as the title says, takes us from three down to two. Are there other alternatives? We spoke about single node: yes, you can go single node, but then you have no failover. If that's okay with you, if your critical applications can be down for hours or maybe days, fine, but then you may have to ship an engineer out, because the store manager doesn't know anything about Kubernetes or computers or whatever it might be. Two-node etcd is pretty much the same as running a one-node cluster. Let's do the maths: if we're doing things at scale, thousands or tens of thousands of devices, the probability of an SSD failing somewhere across 2,000 devices is really high. There will be failures, and when they happen etcd gets locked and we're in a conundrum. External etcd? Yes, we can go that route, but then we're just adding more complexity to our architecture. The same goes for cloud or external witnesses, where you move etcd to the cloud or have a witness arbitrate who's the leader, who's right and who's wrong, all that kind of stuff. The problem, again, is you're adding complexity, and not only that: what happens if your connectivity is spotty, if you have intermittent connectivity, or if you have to run air-gapped? Those options don't cover the whole range of cases.

So we really went out on a quest, me and Tyler. I put my blonde wig on, Tyler is Frodo; you've kind of never seen them in the same room together. On this great quest we set out after a few things, and these became our principles for deploying two-node highly available applications at the edge. First, we're focusing on two-node infrastructure with highly available stateless applications, and the key here is highly available applications: we're not focusing on infrastructure availability, we're focusing on application availability. Second, architecturally simple: when it comes to external etcd or an external witness, we don't want to deal with any of that. We want to be architecturally simple so it's easy to deploy, manage, and maintain. Third, no external dependencies, as I mentioned. And what we really didn't want to run into is the classic computer-science conundrums: we don't want to have to solve the Two Generals Problem, we don't want to go down that rabbit hole, and we don't want to redesign control plane logic, none of that. We really tried to keep things simple, and when I say we, I really mean Tyler; he's the brains behind it.
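(A quick aside to make the quorum arithmetic above concrete: Raft, which etcd uses, needs a strict majority of members alive to elect a leader and commit writes, which is why two etcd nodes tolerate no more failures than one.)

```
quorum(n) = floor(n/2) + 1          members needed to elect a leader / commit writes
n = 1 -> quorum 1, tolerates 0 failures
n = 2 -> quorum 2, tolerates 0 failures (either node down means no quorum)
n = 3 -> quorum 2, tolerates 1 failure  (hence the usual three-node minimum)
```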
The other thing I just wanted to point out is what I mean by highly available applications. You can obviously use things like node affinity and anti-affinity rules, all that kind of stuff, but in the application we're actually running for our demo to the gods, we're using topology spread constraints; there's a sketch of what that looks like below. These spread our application across the two nodes we have available, and that's what makes the application highly available.

So, on to the money: how does it work? Here we see the standard configuration on the happy path, with two machines. Fundamentally they're powered by Kairos, a meta Linux distribution that lets you pick your OS and Kubernetes version, in this case k3s, and ship that bootable image as a container; that's what gets us k3s in the first place. I've split out the control plane components just for illustration purposes; in reality they're obviously bundled. Sitting beneath that we have a couple of things, Kine and Postgres, which are layered into the image during build time so they're immediately available on boot. Then we have our liveness agent which, alongside some other Spectro Cloud-specific controllers, manages the state of the system. There are a couple of different state machines: one executes every time the machine turns on, and there's a continuous reconciliation loop we call the liveness service. We'll get into that in more detail, but the gist is that the leader machine is configured so that Kine points at Postgres running on localhost. Postgres is also running on localhost on the follower, but the follower has a separate Kine configuration that points at the leader's Postgres database. There's a little bit of back and forth, not that convoluted, to ensure the databases have the same authentication, and then we have logical replication configured to copy state: every write that hits the leader gets streamed over to the follower, eventually consistent. And the liveness agents are constantly pinging each other with health checks; they'll change the state of the system depending on what they find.
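As a rough sketch of the topology spread constraints mentioned above: the Deployment below spreads two replicas across the two nodes by hostname. The name, labels, and image here are invented for illustration; the talk doesn't show the actual manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mario                # hypothetical name for the demo app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mario
  template:
    metadata:
      labels:
        app: mario
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # replica counts per node may differ by at most 1
          topologyKey: kubernetes.io/hostname  # treat each node as its own topology domain
          whenUnsatisfiable: DoNotSchedule     # hard requirement: spread across nodes or stay pending
          labelSelector:
            matchLabels:
              app: mario
      containers:
        - name: mario
          image: registry.example.com/mario:latest  # placeholder image
          ports:
            - containerPort: 8000
```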
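The leader/follower Kine split can be pictured with k3s's standard `datastore-endpoint` setting, which is what runs Kine against a SQL backend. This is a minimal sketch under that assumption; the hostnames and credentials are placeholders, and the talk doesn't show Spectro Cloud's actual configuration.

```yaml
# Sketch of /etc/rancher/k3s/config.yaml on the LEADER:
# Kine translates the etcd API to SQL against the local Postgres.
datastore-endpoint: "postgres://k3s:secret@localhost:5432/kine?sslmode=disable"

# Sketch of /etc/rancher/k3s/config.yaml on the FOLLOWER:
# Same Kine, but pointed at the leader's Postgres; a promotion
# rewrites this to localhost and restarts the services.
datastore-endpoint: "postgres://k3s:secret@leader.internal:5432/kine?sslmode=disable"
```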
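And the leader-to-follower copy they describe maps onto stock Postgres logical replication. A generic sketch with illustrative names (the leader must run with `wal_level = logical`):

```sql
-- On the leader: publish every table in the datastore database.
CREATE PUBLICATION kine_pub FOR ALL TABLES;

-- On the follower: subscribing first copies the existing rows, then
-- streams every subsequent write from the leader (eventually consistent).
CREATE SUBSCRIPTION kine_sub
  CONNECTION 'host=leader.internal dbname=kine user=replicator password=secret'
  PUBLICATION kine_pub;
```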
(Our demo cluster is still coming up, we have to reconfigure it, so we'll move on with the architecture and hopefully come back with a real demo for you.)

Okay, so assume the leader goes down. What we call a promotion will happen: the follower decides it has to be in charge. Maybe the power cord just got unplugged, maybe the hard drive failed, it could be anything, but we're in a situation where all of a sudden the leader is no longer online. I mentioned health checks: we do an ICMP ping between the two hosts, plus a TCP connection to the API server and to the kube-vip endpoint. I didn't mention that earlier, but we're using kube-vip to load-balance between the two control planes. In the standard configuration, any traffic routed to the follower just ends up, via the Kine configuration, sending its writes to the database on the leader.

So we have this default period, it's configurable, but every 30 seconds we run these checks, and if enough of them fail we decide to initiate the promotion. In that case we stop all of these services, initiated by the liveness agent, and then do a little bit of magic to swap the Kine endpoint: we reconfigure Kine so that instead of pointing at the leader it points at Postgres running on localhost. Then, with a little bit of SQL, we massage things so the database is the way we want it when it boots up again. This was really a learning for me: logical replication doesn't replicate sequences. The database is going to look okay, but really it has holes in it; there are high-valued IDs, and then you turn k3s on and all of a sudden rows come in underneath them, violating a constraint of the Kine table, which is that you shouldn't have a previous revision higher than the current ID. Anyway, we figured that out. We also delete the default kubernetes Service endpoints, which forces k3s to drop the websocket tunnel to the impaired host; that just makes k3s a little happier. Once we've massaged things the way we want, we turn the services back on, and now we're in the single-node operating state, which is fine.
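The sequence gotcha above comes from the fact that logical replication copies rows but not sequence positions, so after a promotion the new primary's ID sequence lags behind the highest replicated row. A hedged sketch of the kind of fix involved, assuming Kine's default `kine` table with a serial `id` column (the talk doesn't show the agent's actual SQL):

```sql
-- Advance the id sequence past the highest replicated row before Kine
-- starts writing again; otherwise new rows would get ids below existing
-- revisions, violating Kine's prev_revision < id ordering assumption.
SELECT setval(
  pg_get_serial_sequence('kine', 'id'),
  (SELECT COALESCE(MAX(id), 1) FROM kine)
);
```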
There might have been a small amount of data loss, and I want to be clear about that. In CAP-theorem terms, I'd say we're AP with a big asterisk, or a medium asterisk. We can survive network partitions, as I've illustrated here; if things go down, the leadership just resumes. But had there been a write that hit the leader at the moment it went down, that write would be lost. In terms of availability, like we said, our goal was high availability for a stateless app, and for the app there is no downtime. During the reboot process I explained, it does take about three minutes for Kine to come back, so during that time we basically lose API server writability. At the edge that's maybe not such a big concern, because the application is running and that's what matters; there are often not a lot of changes hitting the API server at a remote edge location. Three minutes go by, the services come back up, and we're good to go again. That's the asterisk on availability.

At that point we have a couple of options. Maybe the power cord just got unplugged, in which case you plug it back in, the original leader, call it leader prime, turns back on, and we initiate what's called a demotion. The second option is the host is irremediably damaged, in which case we need a replacement: we ship the failed host off, another one arrives in the mail at the retail store or what have you, you plug it in, make a few edits in our user interface, which we'll hopefully show you soon, and we perform a host replacement.

But first, let's consider what happens in the demotion scenario. Leader prime just reboots. We have another state machine called the Kine endpoint reconciler; it's a one-shot systemd unit that inspects the Kine configuration on boot. Prior to letting Kine turn on, it reaches out to the liveness agent on the other node and asks, what's the current state of the world from your perspective? It's going to respect the opinion of the other node over what it sees if there's a difference. Essentially it says: okay, I think I'm the leader, but your state is fresher than mine and you think you're the leader, so I must not be the leader. In that case the Kine reconfiguration happens so it points at the new leader, the local database gets dropped completely, and we reconfigure the replication from zero, a complete recopy when the machine starts; that's how we survive network partitions. Now this machine is our new follower, a warm standby at least from the database perspective. You can see we've got logical replication configured in the opposite direction, and similarly Kine, and of course the health checks resume at this point, so we're able to flip-flop back and forth as needed.

So what if we aren't able to do that, because it's not as simple as an unplugged machine? In this case, like I said, we get the new device in the mail and plug it in. It becomes available in a pool of edge nodes in our user interface; this can also be done declaratively, not through the UI. Essentially the new follower boots with an updated view of what the cluster should look like, and it announces that via an API: the two liveness agents are also running liveness servers, so they can communicate that way, and it just says, hey, I'm your new follower, forget about the follower you used to know. What that looks like is pictured here, and I think we're familiar with the diagram at this point: we've got the replacement host on the left, and it basically tells the leader that it's the new peer to communicate with; replication gets set up, the database is copied, and we're good to go.

What about upgrades? There are options, and we've chosen a simple option that does involve a little bit of downtime. We basically disable the liveness checks to ensure we can upgrade both hosts without any change in the behavior of the state machine. We tell both of them to just stop caring, then we arbitrarily upgrade one and then the other. There are other ways to do this, but they would require two state changes per machine, and that's a lot of change, which involves risk, so for now this is what we've opted to do. Like I said, there are no promotions or demotions; the machines retain their original roles, but there is some downtime for the API server while the leader reboots. We use system-upgrade-controller to stream the new Kairos image, or it can be loaded in with local content, but essentially there's a reboot involved in the upgrade, and for the time it takes the machine to boot, the API server on the leader can't accept writes. Of course you can still read via the follower, and the applications stay up.
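For reference, Rancher's system-upgrade-controller drives upgrades through a Plan custom resource. A generic sketch of what streaming a new Kairos image might look like, with an invented image and version tag; the actual plan used in the product isn't shown in the talk.

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: kairos-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1                   # one node at a time, since each upgrade reboots the host
  version: v3.0.0                  # hypothetical target tag, appended to the image below
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os      # match every node in the two-node cluster
        operator: In
        values: ["linux"]
  upgrade:
    image: quay.io/kairos/opensuse # placeholder Kairos upgrade image
```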
Logical replication is really just a form of change data capture, which is a common pattern where something happens on one machine and you want to stream it in near real time to another in order to maintain consistent state, or as close to it as possible. We initially worked with a tool called Marmot, but due to some issues with FIPS compatibility we ultimately landed on Postgres. Marmot uses NATS under the hood, and we're really hoping NATS will be FIPS-compliant soon, but that was a small hiccup. etcd mirroring is another thing we might consider in the future; Postgres's logical replication is just pretty battle-tested, so we went with that for now. I already mentioned the sequences, but that's just a good thing to know: you should read all the docs before you try anything. Zero-downtime upgrades are maybe something for the future. And the big thing we still need to look at is persistent storage, with Longhorn as the most likely candidate; that will allow us to have synchronous data replication, perhaps at the cost of performance. Nothing about this architecture precludes stateful workloads, it's just that if they need high performance, that might be a concern.

Okay, I'm back. So we'll show you a little bit of a demo. The plan was to show you three NUCs, not because we're doing three nodes, we're doing two nodes, and the goal was to replace one. The demo gods are not with us today, at least in this room, so we're going to use our backup plan, our US data center lab. We have the two-node cluster at the bottom; if Tyler goes to it, you'll see effectively the same thing, two-node edge HA running on two virtual machines instead of bare metal, but the concept is obviously the same. You'll see we have two nodes, a leader and a follower. This is Palette, our Spectro Cloud product, which lets you manage and orchestrate clusters at scale. Tyler is now VPNing into our lab, and we'll see our highly available application, which is Super Mario, a game box that lets you play the good, highly pixelated games from back in the day. What we're going to do is show you Mario running, and then we're going to kill, or pause, one of the VMs. We'll pause the leader, so we'll show you a promotion: the follower has to take over, and you'll see most of what Tyler just showed you in the logs, Kine and k3s restarting, and so on. We'll just need a quick bit of setup here.
So what Tyler is doing now is grabbing the liveness agent logs, the screen in the middle; that screen will show you when the liveness checks fail. As you heard earlier, we run checks every so often, and if those checks fail sequentially we fail over: things like network partitioning, unplugging, or complete failure of the device. If you click on port 8000, that's our application, that's Mario, as you can see, and we're happy to share this if you want to play in your own time. It's a pretty cool application; I promise you he's really playing it behind the computer, I think you can even hear it. Now we'll grab the kubeconfig for the cluster; as Tyler said, we're using kube-vip, so we're load-balancing between the API servers. I'll just check which IP it is and find the node. Can you show me the IP of the leader? My internet is having a good time. 107? Awesome, thank you, this one. Okay, so I'm going to kill the leader right now, just power it off, and we'll start to see some liveness checks failing, and the status is going to change from ready. Tyler, if you want to go back to Super Mario: the game is still playable. And we can see some liveness failures going through. We have a couple of liveness checks that have failed here, and as soon as another iteration goes through we'll have sufficient failures to initiate promotion; we're still below the failure threshold of three, and we're printing out which checks have failed. Give it another 15 seconds or so and we'll see more happen. Okay, so the control plane endpoint is down and our kubectl get nodes failed, which is expected; it'll be about three minutes before that comes back up. We see there that this machine was the follower, and it says "initiating promotion"; because of the grep I have we didn't see all the logs, one second. I believe it has killed Kine, or will be killing it soon. Let's see if the user interface shows that the edge host is down yet. We'll have to give it some time, and since we're always constrained on time, we'll wrap this part up, open up for questions, and come back to it.

So, as you saw, we did get to play some Super Mario, and like I said, we're more than happy to share this with you guys. The idea here is that we're building this architecture so that at scale, as you deploy Kubernetes clusters across different industries, like I mentioned, your petrol stations, retail stores, coffee shops, and you start pushing out all these highly available Kubernetes clusters, things start to add up. If you want to make your CFO happy, having all of these high availability capabilities with a 33% reduction in hardware cost alone will be a reason to put a smile on his face. And when we're talking about thousands, most likely tens of thousands of devices, which is the kind of scale we tend to deal with for our customers, we're looking at hundreds of thousands of euros, dollars, pounds in savings on hardware alone, and then you have to account for cabling, engineers on site, and so on. The other very important thing to talk about today is sustainability: if you're shipping out fewer devices, you're reducing your footprint, less material, less power drawn, and so on.

So, back to the dynamic duo of two highly available nodes. That's not me and Tyler, it's the two nodes, although I would be Batman, he can be Robin. So what are the next steps, where do we go from here? As Tyler said, this is a project we're running in our advanced projects team, which is run by our CEO and led by Tyler.
We're announcing an exclusive tech preview in this talk, and we have about five places left; three customers are already running this tech preview with us today. So if you'd like to join, please come and talk to us. We also have another talk with a customer of ours, Dentsply, who are rolling out 3,000 edge devices this year for dentists; that's going to be at 2:55 in S04. And come talk to us, let's schedule a meeting, we can take a deep dive into this, play some Mario or Aladdin or Doom, whatever you choose, and please give us some feedback. So if you have questions, we're more than happy to hear them.

Yes, we have questions. Thank you guys, thanks for a great talk. Do you implement any split-brain mitigations, like STONITH or something like that? Yeah, so a split-brain scenario we would resolve by requiring a selection of which node you want to be the winner. Like I said before, there's the opportunity for a little bit of loss of state if some writes hit the leader and then it failed before they replicated to the follower. But let's say they're both unable to communicate, like with a network partition, most likely. So the follower would promote, you'd have two leaders, and you'd have inconsistency; now you just need to decide which one you want to win. All you do is reboot the machine, assuming you've restored the connectivity: you reboot the one you don't want to be the leader, it will get demoted, it'll drop its database, replication will copy the content of the leader you selected, and then you'll be back in action.

Sorry, can we get the microphone to him very quickly? I mean, they are going to compete, right? Because if you have two nodes that both think they're primary, they're going to compete, and how are they going to resolve who wins? That's why I was asking about STONITH: should one node shoot the other in the head, if you have that kind of approach? How is this going to resolve? Sorry, STONITH? Like the solution from back when there was redundancy for routers: you have two competing nodes, there's a kind of competition, one tries to kill the other, and the one that wins, wins, and we have a winner. But okay, I get that this isn't quite that. Well, I mean, they couldn't communicate, so how would one kill the other? But they both have a state file which represents their understanding of the state of the world, and assuming the network barrier is removed, one of them will have a newer timestamp, and that one wins automatically. That's the deal-breaker, it's just that simple: they communicate, ask for each other's view of things, and whichever one loses that tie-break, the timestamp comparison, demotes. And we can go deeper on that afterwards; we just want to give other people the opportunity to ask questions.

Are there other questions? Yes, there's one over there; we're making you walk, sorry. Why not buy the cheapest Raspberry Pi and build three members with etcd, and not have all this mess? Sorry, what was the question? Why not use Raspberry Pis and have three of them? Yeah, that's a very good question, actually. The problem is that Raspberry Pis are not enterprise-grade.
When we're looking at enterprise-grade solutions, most of our enterprise customers do not want to use Raspberry Pis; they want something more ruggedized that they're familiar with: Intel, Dell, Lenovo, all of those thin clients are what they're more comfortable with. We could use Raspberry Pis, that's fine, and like we said earlier, we support Kairos, an open source project that we also maintain (we have Mauro here doing a talk later), and Kairos supports Raspberry Pis as well. So to your point, if that's what you want to do, use etcd with three Raspberry Pis, by all means, you can do that. We're just giving you an alternative: if you want to reduce costs and still use an enterprise-grade solution, Dell, Lenovo, Intel NUCs are sometimes quite expensive, especially when you start adding GPUs to these things. We're talking anywhere from five hundred to two thousand dollars a box, up to ten thousand dollars a box, so it starts to add up. And it's not just hardware: there's cabling and other considerations, which might sound like nothing for a couple of cables, but multiply that by ten thousand and you're saving costs in places other than the hardware. Also, you get rid of etcd, so no more etcd restores; maybe Postgres ones instead, but pick your poison.

Are there any other questions? Yes, sorry about that, and thank you so much for the questions, we love the engagement, by the way. How do you bootstrap after a full power failure of both nodes? Very good question. The way it works with our solution is ISO bootstrapping. In these NUCs we have here, I actually have a bunch of USB sticks for the talks and demos we're doing. You bootstrap the NUC with an ISO, and that ISO has everything it needs to talk back to Palette; there's no inbound connectivity, it's all outbound. So you bootstrap the NUC, Dell box, whatever it might be, with an ISO that carries the initial operating system it needs to communicate with Palette. It might also carry all the required local content, all the different dependencies, so it doesn't have to pull anything from the internet, like the operating system. Once it's actually registered in Palette, you can swap it into the node pool of a running cluster, or if you're creating a cluster, you just add that registered node to the cluster, and Palette will bootstrap Kubernetes on top of it. So, answering the question very simply: with an ISO on a USB stick, and it's headless, zero-touch provisioning. You can ship these boxes to your retail stores, factories, whatever it might be, plug in the power cable and the network or wireless, and once they reach out and talk to Palette, it's zero-touch provisioning: you deploy the cluster, it builds, and life is good.