Hey everyone, thanks for joining us for our KubeCon session: we're marketers, and if we can learn distributed systems, so can you. My name is Betty Junod, I'm a senior director of multi-cloud solutions at VMware, and I'm here with my buddy Paul Burt. Hi, I work at HashiCorp; I'm a senior PMM there working on open source stuff. Awesome. So we are super excited to talk to you about something that's near and dear to our hearts.

First off, the slides and the reading list. We've got an extensive reading list at the end of this deck; it can be found at this link, and we will make these slides available through the CNCF so y'all can download them.

So first, let's start with why. We're all here because we're interested in Kubernetes, but have we thought about why? Why do we want Kubernetes? Because everyone else does? There's FOMO, everyone's using it. But we really wanted to start with the question of why, and the why behind the why is that we want distributed systems. And distributed systems are all about failure: fundamentally accepting that systems will fail. Your applications will fail, your servers will fail, your network will fail. Anything can fail at any given time, and it's not that someone is at fault for that; it's a specific mindset that applies to your design. If you accept that failure is inevitable in any layer of the stack, then you will design your systems to fundamentally handle that failure. Handling it means two things: not only building your system so that it is more resistant to failure, so you can prevent it, but also planning for how to handle failures gracefully when they do happen across the environment. What will you optimize for in the event of a failure?

Failure is inevitable, and we know this because we used to live in a world where we deployed a single app to a single server and said, this is great, it's running. But we all knew that if that server went down, the whole thing failed; you can't get to the app anymore. So what did we do? We said we need more servers. Great, we went from one to three to five or whatever, and now we have a cluster of servers, so we have some fault tolerance built in: if one of them fails, we can write some rules to route the traffic to another machine. And this is not a new problem.
It's not something that started just recently because of Kubernetes; it's a problem we've been trying to solve for a very long time. It started way back in 1969 with ARPANET. This was a military and academic project built around a question: can we build a decentralized system so that if a headquarters location suffered a catastrophic failure, for whatever reason, the other systems and command centers could still be online, because the information has been distributed? This is the first sketch of ARPANET when it was launched in 1969. You can see there are four locations, distributed across different cities and states, connected to each other through dedicated phone lines, one of the earliest systems whiteboards.

What happens when you start to distribute systems, when you scale out even just within a cluster and then start distributing those systems across different geographic locations, is that you have to answer a whole lot of new questions. As you distribute systems, you need to understand how all of them know what the correct plan is. What was the intended state for the system? How should it behave? What should be running? And then, how will those systems talk to each other so they know what's going on, whether everything is behaving the way it should? And when a system finds out that something isn't, what are its instructions? What is it supposed to do to get back to the desired state?

A simple example: for those of you who've ever played the game of telephone as a child, you've got you and your ten friends lined up in a row, the first person whispers a little message into the second person's ear, and by the time it gets to the other end it is nothing like what you first said. Distributed systems are in many ways trying to solve that problem, but it's not an easy problem to solve. Distributed systems are hard, and Paul, can you go into why it's so hard?
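Before Paul digs into the theory, here is a minimal, hypothetical sketch of that "get back to the desired state" idea: a loop that compares a desired replica count to the observed count and nudges the system toward it one step at a time. The names and numbers are made up for illustration; this is not how any particular orchestrator implements its control loop.

```go
package main

import "fmt"

// reconcile nudges the actual replica count toward the desired count,
// one step per iteration, the way an orchestrator's control loop
// repeatedly compares desired state to observed state.
func reconcile(desired, actual int) int {
	switch {
	case actual < desired:
		actual++ // pretend we scheduled one more replica
	case actual > desired:
		actual-- // pretend we removed one replica
	}
	return actual
}

func main() {
	desired, actual := 5, 2 // hypothetical numbers
	for actual != desired {
		actual = reconcile(desired, actual)
		fmt.Printf("actual replicas: %d (desired %d)\n", actual, desired)
	}
}
```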
Definitely, thanks for asking. We'll talk about some theory up front. The CAP theorem is the big one you need to know about; it's the classic "choose two of three," and in this case we're choosing between consistency, availability, and tolerance of network partitions. That last one turns out to be really hard to avoid on the modern internet, so essentially we're choosing between consistency and availability: a CP system or an AP system. We'll take a look at what consistency and availability mean in a second with some examples. Suffice to say, most modern systems, and what we'll really look at in detail with Raft, use a leader-and-follower topology, and then they have to find consensus, or figure out availability in the face of a failure, based on that topology. They also have mechanisms baked in to achieve quorum in case a leader fails: if you can still communicate with a majority of the cluster, in this case two out of three nodes, you can still have consensus and return responses (see the quorum sketch below). We'll look at elections and how all of that works in a bit, too.

First up is availability. Availability means we always return a response, even if there's a failure, and in this case that means the data might be slightly out of date or out of sync. The server that got disconnected can still return responses if it needs to; it just means our data is a little bit stale. That isn't a big deal for things like likes on social media, so an AP system is a great way to go for that sort of workload. On the flip side, if we're dealing with money or shopping carts, we have to have the correct data, so what we really want in that case is a CP system. And Kubernetes, as we'll learn, is based on Raft, which generally gives you a CP system.
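A quorum is just a strict majority of the cluster. As a minimal sketch (hypothetical numbers, not any particular system's API), here is the arithmetic that decides whether enough nodes have acknowledged something for the cluster to treat it as agreed:

```go
package main

import "fmt"

// hasQuorum reports whether acks is a strict majority of clusterSize.
// With 3 nodes, 2 acks is enough; with 5 nodes you need 3; and so on.
func hasQuorum(acks, clusterSize int) bool {
	return acks >= clusterSize/2+1
}

func main() {
	fmt.Println(hasQuorum(2, 3)) // true: one node can fail and we still agree
	fmt.Println(hasQuorum(1, 3)) // false: a single node can't speak for the cluster
	fmt.Println(hasQuorum(3, 5)) // true
}
```

This is also why these clusters are usually sized with an odd number of nodes: adding a fourth node to a three-node cluster raises the quorum requirement without letting you tolerate any more failures.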
So we want consistency between all of the nodes to ensure that we're returning the correct data. Eric Brewer is the person who originally came up with the CAP theorem, and he's noted in a follow-up that he thinks we can do more interesting things with consistency and availability, using them in contextual moments to provide a faster or better service, because trying to achieve consistency can slow us down. There are some really interesting theorems and protocols that follow from that, like TAPIR and PACELC, but we'll leave those in the notes for you to discover, an exercise best left to the reader. We'll just talk about the CAP theorem as it relates to Raft, which is what most modern distributed systems are based on, as we'll learn.

But before we dive into that, let's talk about what consistency is, specifically, because database consistency is a bit different from distributed systems consistency. With database consistency, what you're really looking for is that all the rules of the database are followed: the schema, the constraints you've placed on things, all of that gets enforced when a transaction is committed. With distributed systems consistency, what you're looking for is that a majority of the nodes in your cluster are in agreement when it comes to returning a response; you're checking that there's agreement on the data you're returning to make sure it's correct.

Availability is a similar issue: high availability is slightly different from some of the other kinds of availability that get discussed. There's a great IBM paper that talks about high availability in the abstract. For some people, high availability might just mean having a good battery system connected and a regular backup schedule, whereas at a larger scale it can extend to using multiple clouds or multiple regions; the scale you take high availability to depends on you. Distributed systems availability means that when there is a partition or an issue with the system, you still give a response, even though your data might be slightly stale.

One thing to note about the CAP theorem is that it's really a boundary. It makes us take the problem seriously, the fact that we can't just do everything, which was an issue when we first started designing distributed systems, but it doesn't necessarily help us build distributed systems. Thankfully there are things that do help us build them. So do you want to talk about Raft, Betty? Well, actually, you mentioned a lot about Raft, and the big question is: WTF is it? It seems like a lot of systems are based on it, and I know you did a lot of preparation on this, so I'll tee you up for that.

All right, thank you. So Raft's goal is to be a more understandable Paxos. And what is Paxos? That's a good question; we should cover that. Paxos is a formal system that was built to help design distributed systems; it's been proven correct with a language like TLA+. The issue with Paxos is that it's really hard to understand. In getting his paper reviewed, Diego Ongaro, the creator of Raft, got comments about how ridiculously hard Paxos was, so in addition to correctness and performance, one of the goals of Raft is to be a subset of Paxos that is also understandable. We'll cover that in a second here.
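Paul walks through Raft's write path, heartbeats, and elections next. As a preview, here is a minimal, hypothetical sketch of the one mechanism that makes elections work: each follower waits for heartbeats with a randomized timeout, and if the timeout fires first, it bumps its term and stands for election. The timeout range and the channel standing in for a heartbeat are for illustration only; this is not Raft's actual wire protocol.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runFollower waits for leader heartbeats. If none arrives within a
// randomized timeout (roughly 150-300 ms, as in the Raft paper), the
// follower increments its term and would start an election.
func runFollower(heartbeats <-chan struct{}) {
	term := 1
	for {
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeats:
			// Leader is alive; loop again, which resets the timeout.
		case <-time.After(timeout):
			term++
			fmt.Printf("no heartbeat for %v: becoming candidate for term %d\n", timeout, term)
			return // in real Raft we'd vote for ourselves and request votes
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// Simulate a leader that sends three heartbeats, then crashes.
		for i := 0; i < 3; i++ {
			hb <- struct{}{}
			time.Sleep(50 * time.Millisecond)
		}
	}()
	runFollower(hb)
}
```

The randomization is the important part: because each follower picks a different timeout, one of them usually stands for election before the others, which keeps split votes rare.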
One thing to note with all of this is that as the size of a Raft-based cluster grows, it gets a little slower, because the networking gets more complicated: we have to talk to each of the nodes and get a majority to agree before the leader can deliver a response. That happens for reads, writes, and election results; everything is majority rule.

As an example, let's talk about how writes happen; you can work out how reads happen from that. The leader gets a write request and stages the write in the replicated log. The replicated log is the shared data state; it's linearizable, so we can replicate it deterministically. The leader then sends a message out to the other systems saying, "I'd like you to stage this on your log," and they each give the leader a response. Once it hears back from a majority of the systems, it tells everyone to go ahead and commit, and then the entry is committed to the log and that is the state of truth. When a read request comes in after that, the leader just has to check with the majority to make sure it is still in agreement with them.

As part of this, a heartbeat message goes out to each of the followers, either when the leader is taking one of these actions, like an AppendEntries message, or just naturally as a standalone heartbeat. The heartbeat tells the followers that they are still connected to the leader and everything is good. If they stop receiving the heartbeat, they know something has happened to the leader: either a network partition or the leader has crashed. In that case, after a random interval somewhere in the hundred-to-several-hundred-millisecond range, one of the followers promotes itself to a candidate, votes for itself, and asks the other followers to vote for it as well. Other followers may also promote themselves to candidacy and vote for themselves, but each member of the system can only vote once, so eventually one of the candidates gets a majority of the votes and is elected leader. Each time this happens, a term number is incremented, and that term helps the system recover from a failure, or know which log is the most up to date when the cluster is restored or the network comes back online.

It's a very democratic system. It is. And it kind of flips that game of telephone on its head, where it's always one person sending the message. Right, sure. And to your point earlier about how we communicate and how we know what's correct: a lot of it is just talking to the other people in the chain and making sure they agree with you, rather than just passing the message along without that checking. This gets complicated relatively quickly, and the creator of Raft has some great talks out there. There are a lot of other talks about Raft too, and we have links, with more detail on the actual specification, that you can check out.

So, Paul, this sounds amazing, because what we're trying to do, like we said, is design for failure, and this is important for bigger and bigger systems, right?
Because you can't manually touch everything; as an operator you'd be getting a gazillion pings a day. A lot of this is about having the system do that checking autonomously and adjusting for these things in real time. But sometimes, with all that automation built in, it can be problematic if something is off. I know you've got a great example for us to talk through of a real-world cascading failure caused by exactly that automation. Yeah, let's take a look.

Target, thankfully, shared an example; thank you to all of the companies that share their failure stories, because that's how we learn. Their environment is complex: they had an upgrade to their Kafka cluster, which handled messaging and collected logs for all of their various systems, across a fairly heterogeneous network. The upgrade caused intermittent network issues, and the Kafka system slowed down. As a result, all of the Kubernetes systems that had logging sidecars decided they were having issues, became unhealthy, and needed all of their pods rescheduled, which triggered a thundering herd. So everything spins up at the same time and just hits the system, right? Yeah, exactly, that's the gist of a thundering herd: an event comes in, everything wakes up at once, and because it's all automated, it all pushes forward at once. It's trying to do good things, but because it's all synchronized at the exact same moment, it hits the system like a tsunami wave and everything starts toppling like dominoes. In this case, I think Target said over 41,000 nodes were spun up very quickly and added to their service discovery before everything calmed down. So this can get really big and really nasty very quickly; it's just one of the fun things you get to deal with when you're working with distributed systems. So that's the theory. Do you want to talk about some modern implementations, Betty?

Yeah. We'll go through four different examples of distributed systems and how they approach things differently. One thing that's really important for us as a community to understand is that we hear a lot of "X versus Y, which is better," and really, we're only at this point because of work that has happened over decades. It's not really an X versus Y; there are generational aspects to these technologies, and in many cases different ones are better for different types of workloads or personas. So this is an interesting area to dive into.

Let's go back to our timeline a little bit. We had ARPANET way back in 1969. The next set of innovations that impact what we're doing with distributed systems came in the 80s and 90s. First is the x86 CPU standard, which brings forward the era of commodity hardware, which then leads to mass virtualization.
So we get a broadly accessible abstraction layer over physical compute with modern virtualization in the late 90s. ARPANET, coincidentally, ended in 1989, which is also the same year the first ISP offered internet access; make of that what you will, but those two things are pretty foundational to the next era, the 2000s.

What's interesting here is that we've had more innovation in the distributed systems space in the last 20 years than in the previous decades, and that's really a confluence of a number of things. We have mass adoption of the internet since the late 80s, and with that came the birth of web-scale companies, companies that fundamentally deliver their experience over the internet. The likes of Google, Facebook, Twitter, and Netflix were all founded from the late 90s into the 2000s, and that's only possible with the internet and broad-based connectivity for everyone. So you had that global infrastructure plumbing, plus the availability of commodity hardware and concepts like virtualization. People were able to buy lots of servers and run big data centers, and from that they needed to solve the problem of making better use of their individual hardware resources. Then cloud computing, once people started using things in the cloud, further abstracted this concept. The last bits are containers and open source. There is a clear line in orchestration systems between the pre- and post-container eras, because what containers did is change the problem from being just a hardware-level, data-center-pooling thing you needed to fix to: I'm doing some stuff in cloud, I'm doing some stuff on prem, and I've also blown up the construct, because now we're distributing the little application components themselves across the distributed system. We've added more layers; we've diced up the stack even more.

So let's start with Mesos. Mesos started in 2009 out of Berkeley, and what they were looking at is data center orchestration: how do I cluster a bunch of systems in the data center to make it look like, effectively, one giant server? I could have 10,000 servers underneath, but it looks like one server, so I can then schedule a workload on top of that and just pull from a pool of resources. That workload may be using compute and memory from three or four machines.
It's almost like the reverse of virtualization: one of their premises was to eliminate the concept of VMs and instead just look at it as isolation of resources that can be assigned to a workload.

So what are Mesos' strengths? Scale: it was run in production at Twitter for a long time, and it can scale to tens of thousands of machines while presenting itself as one machine to the operator. Another strength is modularity, so that things like frameworks were pluggable and could be developed independently of the core architecture. It was very popular for workloads like Kafka and the other data-intensive systems listed below, which is specifically what Twitter used it for; there are lots of blog posts on how they used it for their data processing, because those workloads are so memory-intensive. An interesting thing about Mesos is that earlier this year it almost went to the Attic, which is Apache's way of ending the life cycle of a project, but it didn't quite make it there. It's still running in production in a lot of places, but it's not something we see a lot of today, because of the shift over to Kubernetes. However, it's still very popular for certain workloads.

So how are apps represented in Mesos? Apps are represented as frameworks. A framework is something that people can develop; it can be a scheduler, it can be built for certain types of workloads, and so on. It actually includes two pieces: a scheduler, which is what talks to the master to understand how much resource needs to be applied to this thing, and an executor, which is really the task itself. Mesos started before containers and later added support for them, so the task itself can now run containers, and with the advent of Kubernetes and the concept of pods, they also added support for something called task groups, so you can have a collection of containers as part of this.

And how do the apps communicate? What's interesting with Mesos is that they drew a clear line: while networking is very important in the data center, a large part of it is out of scope for what Mesos will actually orchestrate. Instead, they made the networking pluggable and integrate with existing network solutions. As an example, Mesos supports two container runtimes, the Mesos containerizer and Docker containers, and with that they support both the CNI spec and the Docker container networking spec. So they let you enable IP-per-container: you can create a network and then attach containers to it.

Next we'll talk about Swarm, and full disclosure, I worked at Docker for about five years, so I was there for the history of Swarm. In fact, I joined the company right after the first generation of Swarm was introduced in 2014. So what is Swarm? It's the way to cluster Docker Engines, and Swarm has actually had two lives: the original Swarm is now called classic Swarm, and then around 2016 something called swarm mode was introduced. The first generation of Swarm was simple clustering; the networking had some ability to link containers to each other, and it was done directly in Compose files,
It was directly done in compose files, which is the which is how you Write an application And swarms focus was really around the ease of use the ease of use and simplicity Not from the perspective of like lacking You know, we're not going to build a bunch. We weren't going to build a bunch of features But from the perspective of like building for the developer, right? It was very much a developer experience Let's have simple commands in the cli to do these things at a multi multi No level connecting multi container level and the idea of have the simplicity for having fewer components Even as the migration happened to the next generation of swarm mode It was the idea that you didn't need all of these other bits in order to make a clustered environment So swarm mode has a lot of similar kind of constructs as does kubernetes constructs But what they did is actually built that all into the docker engine So as you instantiate new nodes Turning one of those into a manager node or worker node is single command Things like joining and leaving clusters single commands also with a lot of built-in defaults So upon instantiating it as a clustered node things like pki Other security certs and tokens were all kind of automatically handled by the swarm itself So really focusing on that You know the ease of the experience And how our applications Represented well in the docker construct. There's something called compose file. It's a yaml And there's also a compose version one and compose version two And because as applications and the use of containers got more interesting and And more complicated as people started using more of those technologies compose v2 also took those things into account And so that it defines the the the containers how they should be instantiated what You know What images are going to be built from services and the environment how they should be scheduled and how they should be networked They're all part of that or part of that file The definition of services is a little bit different in in compose as it is in kubernetes And I know a lot of kubernetes developers love compose still and use it. It's just translated to kubernetes Yeah, it's been super popular. I just I just saw on twitter today that one of my Someone I know they're now actually receiving Applications from their software vendors as compose files. So it's great It's a very simple way to define things and you can kind of define all of these things in an order And with that, you know, specifically like networking, right once you start having Multi-container applications. They fundamentally need to talk to each other, right? So in In the docker construct of swarm mode, they there's, you know, swarm classic swarm in swarm mode In classics form you did things like links and you would actually define in the file like i'm going to link container a to container b With swarm mode, actually, there was a number of default network drivers that were provided. 
So things like bridge networking; host networking, where you don't need any isolation and everything on the host can talk to each other; overlay networks, so that containers across a number of different hosts can talk to each other; as well as macvlan, which is also popular in Kubernetes. Swarm shipped with those network drivers, allowing the operator to set up whatever they wanted: you can instantiate a network and then attach your containers to it, or define them in the Compose file, as well as use a number of network plugins, which is where you could plug in various network solutions from the ecosystem.

And with that, Paul, since you work at HashiCorp, do you want to go into Nomad? Definitely. HashiCorp builds Nomad, and Nomad has a lot of strengths similar to some of the other systems we've talked about. One big strength that I think differentiates it is that it's really easy to plug into an existing PaaS platform, because it's very simple and flexible; it has a lot fewer moving parts than most of these other systems. It's also a bit more flexible about what it manages or schedules, so it can run a bunch of different types of processes in addition to containers. And multi-cluster and federation are already features, so it's fun to play around with if you like experimenting with that stuff. Similar to what you mentioned with Docker, Betty, a job file contains everything you would expect to see in a Deployment and a Service; it's sort of a monolithic file, based on the HashiCorp Configuration Language, which is a close cousin of YAML, just slightly more readable and friendly; it borrows some ideas from TOML and that sort of thing. As far as apps communicating with each other, Nomad prefers to keep things pluggable, so you can run it with Consul doing service discovery and service-mesh-type work for you, or you can run it bare, by itself, in which case it shares the host network for any of the applications running on top of it, and you're free to customize it however you like.

Should we cover k3s next? Yeah. Cool. So this is sort of Kubernetes' little brother. k3s is smaller and lighter, and it's a single binary. It's a batteries-included solution, similar to other distributions of Kubernetes where all the tools you need to make it work are packaged with it as it's installed; it contains Helm, for instance. It's unique in that most of the systems we've talked about so far operate on Raft or some Raft-like system, whereas this can run on top of a traditional relational database instead of etcd. It can also run on etcd, but the database option comes with trade-offs: in the case of a failure, things may not be quite as correct as they would be in a system designed around Raft and etcd. It's a nice trade-off, though, if you want something like a managed Postgres service from the cloud handling your state management layer, which is what etcd generally does for Kubernetes. k3s is great because you get the same Kubernetes ecosystem, since it's just a slimmed-down version of Kubernetes. And same deal:
you also network things together in a very similar way. I think they include their own load balancer as an add-on, which is nice, but for the most part it's basically Kubernetes under the covers.

So we should probably talk about Kubernetes, since we've been building up to it. Yeah, let's dive in. You may be aware of pods, the control plane, and all of those components, but where etcd and the Raft component live within Kubernetes is in the control plane. It's the set of nodes that etcd gets deployed on, and the control plane may include other components like CoreDNS, the API server, the controller manager, kube-proxy, and the scheduler, which helps place workloads as they come in. This is all just the stuff that your kubectl connects to when you're trying to deploy something; it makes the decisions and does reconciliation for you. This is the brains of Kubernetes, and what all of these components do, for the most part, is push their state down to etcd. etcd is the most complicated, or the most frustrating, part of the distributed system for a lot of people. It's the source of truth, right, Paul? Yeah, exactly, you nailed it.

Then, when we look at one of the worker nodes connected to Kubernetes, where the actual apps get deployed, there are things like kube-proxy and the kubelet, which receive communication from the control plane and send communication back; they actually execute the commands they receive. And then we need something to run containers, so containerd is running there as well. You can probably explain containerd better than I can; what's containerd, Betty? Yeah, when you look at Docker and the Docker Engine, containerd was the component that served as the core container runtime. What we did years ago at Docker was take that part, donate it to the CNCF, and make it part of the community, so that everyone in the ecosystem can leverage it for the core runtime aspects of containers. Awesome.

So now containerd is part of Kubernetes. Similarly, etcd is part of Kubernetes, and as we said, etcd is where a lot of state gets pushed. So if you're really looking to learn distributed systems and want to learn more, start learning about the failure modes of etcd; as Betty said earlier, failure is the key to distributed systems, and we have a lot of great content that teaches this in the notes for this presentation.

And speaking of failures, I think you can help us understand Byzantine fault tolerance, or the Byzantine generals problem. Yes, this is an interesting one, and a great human example of it is this: you're out with your group of friends one night, one of your friends trips and falls, and they're obviously bleeding. You say, "Hey buddy, are you all right?"
And they say, "I'm totally fine." You can obviously see that there's a problem, but that person has decided to tell you they're fine, and now the rest of your friends are trying to figure out what to do. The distributed systems version is that you're sending traffic and data, and the system is not returning responses or behaving like it should, but all the health checks come back fine. There's a mismatch between what you know the state should be and how the system is actually behaving, and the rest of the cluster can't figure out what to do about it. Yeah, and that's a problem for Raft. Raft saves us from a lot of problems, but most protocols can't save us from all problems, so there's always going to be stuff like this that lurks in the shadows and will bite you when you're working on distributed systems.

Closing things out, I really like this example from Kent C. Dodds; he's got a master's degree in management information systems. When you're evaluating any of this type of technology, it's incredibly complex and incredibly difficult, and the beauty of working in the open and doing things collaboratively through open source and the community is that we all get to share this burden together and learn from each other. We learned from the Target example earlier because they were gracious enough to share it with us, and there are a lot of other great examples you can learn from if you get involved with the community.

Yeah, and we've got this really long reading list for you, with links to everything we've mentioned in the talk as well as other things. There are also vendor options that manage some of these bits for you, so you can get some of the experience without having to handle all of the lifecycle work yourself, and there are fully hosted options in the cloud where you can focus on actually deploying the application side and maybe not have to worry about managing all of the internals.