Hi, everyone. Welcome to the third day of the conference. I hope you're all doing well and have been enjoying the conference so far. Today we're going to talk about the eight fallacies of distributed cloud-native communities. I am Nabarun. I'm a staff engineer at VMware. I maintain the Kubernetes project, I'm one of the steering committee members, and I'm also a chair of SIG Contributor Experience. Hi, my name is Madhav. I also work at VMware. I contribute to areas in Kubernetes like API machinery, scalability, architecture, and contributor experience. I'm a TL for contributor experience and also a GitHub admin of the project. So we've all heard the term distributed systems. But to set some context, let's say you have an app. It takes a request, it gives out a response, and things are good. But for some reason the app starts becoming popular. You see a spike in traffic. You have a lot of requests that you weren't capable of handling just yet. So now you don't know what to do. So you say, you know what, I'm going to buy a bigger machine to run my app. Now you have a bigger machine, you have more capacity, you can handle all this increased load on your application, and things are good. Then your app becomes even more popular and you have a lot more requests coming in. So now you have two choices. What do I do here? Do I buy a bigger machine? You can, but there's only a certain size of machine that you can go up to. There are physical limits as well as very valid monetary limits to this sort of exercise. So one thing that people started to do is say, you know what, I'm going to run my application on a set of machines that are cheaper to use but that communicate over a network. So this is what a super high-level TL;DR of a distributed system might look like.
You have a bunch of heterogeneous machines, not necessarily similar to each other, that talk to each other over a network. Now, you might like your distributed system to also be replicated across multiple regions of the world, so that if something goes down, you have something else to pick up the slack. So now it might look something like this. All of these also communicate with each other. The nice part is that you have all of these machines communicating over a network, across all of these globally distributed regions, all working towards achieving a shared goal: in this case, serving the user that is interacting with your application. Having a globally distributed set of machines talking over a network gives us all sorts of nice benefits. If one set of machines is unavailable, as I said, we still keep working towards the shared goal; we have something else to pick up the slack. Not all machines need to be specialized to do the same thing. For example, the yellow box over there can be optimized to run backend code, the blue box can be optimized to run a database, and so on. Not every machine needs to be good at everything that needs to be done in order to achieve the shared goal. They can each have their own strengths, communicate asynchronously, sometimes synchronously, and achieve that goal. Machines can also work in parallel and get more work done in the same amount of time without needing synchronous communication. Sometimes they will need synchronous communication. And interestingly, synchronous communication is the key factor that limits the scalability of a distributed system. If you want to get more formal about that, this is captured by the universal scalability law, which you can look up later in case you're interested.
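For reference, the universal scalability law mentioned above is usually written roughly like this (Gunther's formulation; the symbols are the conventional ones from the literature, not from this talk):

```latex
% Relative throughput C(N) on N machines:
%   \alpha = contention (waiting on shared, serialized work)
%   \beta  = coherency (cost of keeping machines in sync,
%            i.e. synchronous communication)
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}
```

The $\beta N(N-1)$ coherency term grows quadratically, so past some $N$, adding machines actually reduces throughput. That is the formal sense in which synchronous communication limits scalability.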
But again, there is no free lunch: with all that niceness comes a slew of challenges in distributed systems. When things go wrong, who fixes them? Do I need to intervene as an admin and restart a machine manually, or do something there? Can the system heal itself? What's the process here? Who's going to fix it? Messages can arrive super late, or sometimes not at all, due to no fault of anyone or anything: sometimes a natural disaster or a lightning bolt can strike a network cable, and network connectivity might break. All the things that make distributed systems powerful, all the things that let distributed systems achieve a shared goal, are the very things that can break and make them fail. So why are we doing this? Why are we adopting this so widely? And importantly, why do we like to inflict pain on ourselves in this manner? As our system grows, so do its complexity and the challenges that come with it. But it's this set of challenges that comes out of a globally distributed set of machines that makes distributed systems a really interesting, elegant, and beautiful field of study. Interestingly enough, most of the challenges that I spoke about are not solvable. In fact, the formal name for these challenges in the literature is impossibility results. Impossible is right there in the name. So what do we do? What's important, and often the solution, or at least the first step towards the solution, is acknowledging and understanding that these challenges exist. Because once you accept that these are challenges with physical limits that you cannot overcome, you start innovating around them and building systems that are resilient to them. So we've talked about distributed systems, but let's start talking about cloud-native communities. This was our model of distributed systems.
Cloud-native communities don't look very different. Over here, we have a globally distributed set of people, all collaborating towards a common goal. Again, some folks can become unavailable, but that's all right; we help each other out here. Here too, folks can keep working in parallel without needing to communicate with each other. Sometimes they will need to, and we will talk about that a little bit more. Again, with all the niceness, we get a bunch of challenges. And these are challenges that are arguably more difficult to solve than the ones we spoke about in distributed systems, because these involve actual humans. Challenges like maintainer burnout, onboarding new contributors, time zone differences, language barriers, cultural barriers, and so on. We've had so many amazing talks over the last week on these topics that you should definitely check out. As before, some, and even most, of these challenges are not solvable, but our job as maintainers, contributors, and end users interacting with cloud-native projects and communities is to acknowledge and accept these challenges, and to be empathetic and exercise kindness around them. Now, as our community grows, so do its complexity and the challenges that come with it. So, distributed systems plus cloud-native communities: needless to say, there are similarities between the two. And once we enter this land of complexity, it's important to know how to navigate it better, how to navigate it the right way. When distributed systems started becoming mainstream, a set of fallacies was introduced to guide programmers as to what not to do. This was navigating complexity by knowing what not to do. These were called the eight fallacies of distributed computing, or distributed systems, and they are listed over there.
Now, what we want to say is that as cloud-native communities grow, evolve, and rightfully become more complex, we need a set of fallacies to help us navigate, sustain, and support them. So we want to introduce the eight fallacies of distributed cloud-native communities. We have the fallacies for distributed computing and distributed systems, and then these are the eight fallacies of distributed cloud-native communities. Having said that, let's dive into what each of these means, what you should keep in mind, and how you can benefit from them as an end user or a contributor or a maintainer, whatever your role is when interacting with a cloud-native community. The first one that we are going to look at is: the network is reliable. Software applications are often written with very little error handling for networking errors. This results in the application stalling while it waits for a response from whatever network service it's talking to. The solution is very simple: retry, and have mechanisms that let you retry without putting undue bandwidth pressure on the network or on the application. Now, let's correlate that to something we see in cloud-native communities. Take, for example, a massively distributed code base. People might expect the quality of every merge to the code to be the same. However, that's not the case. With every merge, there can be issues. Anything that can go wrong will go wrong; we've been hearing about Murphy's law for a long time. Bugs, regressions, and CVEs can be introduced in any commit to the code base, and this is entirely possible. What this means is that this can affect the release timelines of the project, and it can affect the code quality. To give a very close example from the Kubernetes community, Madhav and I talked about this in one of our talks at a previous KubeCon, and this is a problem that happened a few releases ago.
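Coming back to the retry point for a moment: one common way to retry without putting undue load on the network is exponential backoff with jitter. Here is a minimal sketch in Python, assuming the operation raises OSError on transient network failures (the function and parameter names are illustrative, not from any particular library):

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `call` on transient failure, backing off exponentially with
    jitter so that many clients retrying at once don't stampede the service."""
    for attempt in range(max_attempts):
        try:
            return call()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # exponential backoff, capped at max_delay, with full jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter is what keeps the retries from clustering: without it, every client that failed at the same moment would retry at the same moment too.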
So, in the way Go handles its crypto libraries, the behavior of a signing mechanism changed, which effectively broke our CSR validations in Kubernetes, which resulted in a cascade of issues, which resulted in us even discussing whether to delay our release timeline. Now, this doesn't happen very often. We try to make sure that timelines are adhered to, but we have to realize that timelines are optimistic, and that is the expectation we want every cloud-native project, or any project that you maintain, to have. The next fallacy that we want to talk about is: latency is zero. If you ignore network latency, there can be packet losses and unbounded traffic, which results in wasted network bandwidth. But how does it relate to cloud-native communities? If you were in the keynotes in the past days, you might have seen that we have over 175 cloud-native projects in the landscape, and it's huge. These are some stats from the 2022 project report. At this point, Chris mentioned that we have 220,000-plus contributors in the cloud-native ecosystem, and it's highly distributed across geographies. People are spread across the world, and with a highly distributed network of people, we run into the problem of time zones, and time zones are very hard. If you want to schedule a meeting to talk to someone, say you are in the US and they are in Asia, it's going to be either night or day for one of you and the opposite for the other, and that creates problems. To give a very good example of what we have seen in the Kubernetes community: Kubernetes has a release team which does releases over and over, every cycle. When I was the Kubernetes 1.21 release team lead, the team consisted of members from UTC+8 to UTC-8.
So I had almost 16 time zones of people, and I needed to make sure that everyone's opinions were heard and that they took adequate part in the discussion. The problem with such distributed contributor bases is that feedback loops can't be tight; they have to be loose. You can't rely on synchronous communication; it is nearly impossible. What you need to do is communicate asynchronously as much as possible to reduce the overhead. Don't expect that once you ask a question, you will get an answer instantaneously. You should take some amount of overhead into account. One piece of advice that we have seen around the Kubernetes community, and it works really well at scale, is to discuss in a meeting but not make decisions there. Decisions should always be made by lazy consensus, taking all opinions into account. Awesome. So let's talk about the next one. I want to talk about the fallacy: bandwidth is infinite. If you ignore the physical limits of bandwidth in networks, you can get a lot of bottlenecks in your system. The parallel I want to draw to cloud-native communities is: maintainer bandwidth is infinite. It isn't, and a lack of bandwidth does not mean a lack of interest. We unfortunately live in a world that is far from ideal and peaceful, and as a result, our communities are going to be affected either directly or indirectly. Which is why, in times like this, we need to be extra empathetic while interacting with communities. Maintainers love the projects they maintain, and they really want to help you out, no matter what your queries or requests are. But when life happens, to you, to your friends, to your extended community, there is a tried and tested formula for maintainer burnout: a feeling of lack of control, along with a dash of lack of empathy from the people talking to a maintainer, or to anyone in a community, will result in maintainer burnout sometime down the line.
And we need to be very mindful of this, especially in the world we live in today. It's always good to ask questions, to request new things, and to interact with open source communities, with all the niceness that comes with that. But along with this, we need to make sure that we are raising our levels of empathy. That is: help maintainers help you, as we heard from Nikita in the keynote. Also, provide the fuel for the journey you're asking your maintainer to take for you. Or even better, hop in the car yourself and take the journey with the maintainer. That's the best way to get something done in a community, and that's the best way to grow in a community: help your maintainers, lend a helping hand, and help the maintainer help you. That being said, we move on to the next one, which is: topologies don't change. The fallacy assumes that network topology never changes; in reality, topologies change quite often, with effects on bandwidth and latency. The parallel I want to draw to cloud native is: commitments don't change. Nabarun talked a little bit about how timelines are optimistic, not concrete; I want to elaborate on that a little further. We have this law called Hyrum's law, which is popular in API design and compatibility discussions. It says that with a sufficiently large number of users of an API, it does not matter what you promise in the contract: all observable behaviors of the system will be depended on by somebody. What this essentially means is that as your project grows over time, there is no way you can know all the ways in which it's being used. There is no way you can know all the ways you didn't intend your project to be used, but in which it is being used.
This also means that a project can be used in a diverse set of ways, ways that the creators, maintainers, and contributors of the project didn't foresee. We saw a great example of this in the keynote by Tim Hockin today: we need to adapt to what the industry is telling us, because Kubernetes was founded almost a decade ago, but it's super relevant now and it's becoming more and more relevant for the future. So how do we adapt to these growing changes? Projects still want to accommodate all of these different ways that things can break, all of these different ways that you want to use the project. So if you're using a project in a novel way, go tell your maintainers, because that is invaluable feedback they get from you as an end user. You don't even have to contribute code to a project to be called a contributor; if you contribute feedback as an end user, that is invaluable. So go tell your maintainers about the novel way you're using a project, the hacky way you're using a project. For example, I wrote a Kubernetes operator to play Morse code messages on my Caps Lock LED. Kubernetes wasn't meant for that, but it's fun, so why not, right? Go tell your maintainers that you wanted to do that. However, sometimes a project can go into survival, firefighting mode: as the user base of a project grows, so do its need and commitment to minimize blast radius and maximize compatibility. Whenever that happens, this is what innovating in open source starts to look like, periodically, not always, but periodically. As a result, your niche breakage might not get fixed in a timely manner, or ever. And this is just the reality; there's nothing wrong with it. This is a very valid, acceptable thing. And this is something we also need to keep in mind: if a project makes a commitment to you, that commitment can change depending on what the circumstances are.
If you really, really want something fixed, or if you really want something in the project, lend a helping hand: fix the fire that is blocking someone from fixing it for you, or fix it yourself and, again, help maintainers help you. Nabarun mentioned the talk that we did on firefighting, in which we did an analysis of the firefighting that happens in the Kubernetes community and the sustainability aspects of that. If you're interested in that sort of thing, you can check it out. Awesome. So now we'll talk about another fallacy of distributed computing, which is: the network is secure. When we write distributed systems, our software might assume certain security primitives, but at the end of the day, complacency regarding any of those security primitives will result in us being blindsided by malicious users and programs that will try to take advantage of the software and elevate privileges they should not have access to. Have you ever downloaded the Kubernetes source code archive? Here's one of the links that you can download the archive from, and if you have never tried it, you should try it at least once. Should you try it from the URL that I showed you? Probably not. I'll tell you why: because it's a malicious payload. If you look at the URL, it's a kubernetes.zip domain, not actually a valid Kubernetes source code URL. There's a great talk by Sean and John about how you can do supply chain attacks using .zip domains and why .zip domains are big security risks. The talk already happened two days back, so I highly suggest that people who are interested in software supply chain security go ahead and have a look at it. Now, where should you download Kubernetes from? You should always download from verified sources.
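Verified sources publish checksums (and signatures) alongside their artifacts. As a rough illustration of the checksum half of that, here is a minimal SHA-256 check in Python (the helper name is made up for this sketch; it's not a Kubernetes tool):

```python
import hashlib

def verify_sha256(path, expected_hex):
    """Compare a file's SHA-256 digest against a published checksum.
    Read in chunks so large release artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

You would compare the downloaded artifact against the checksum file published next to it. Signature verification, which is what Sigstore does, is a separate and stronger step: it tells you who built the artifact, not just that it arrived intact.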
For Kubernetes, there's a website called downloadkubernetes.com where you can get all the artifacts that we ship across releases, along with certain things called signatures and checksums. But you don't need to believe me on whether this website is verified or not. You should go to this website and, as I was saying about checksums and signatures, download the artifacts and the signatures and verify on your own whether they were actually built by the Kubernetes maintainers or not. There's a great deal of documentation on this; there's a link to the documentation, which gives steps on how you can use Sigstore to verify the artifacts that we ship against the signatures provided on the website. And you can see who the signing entity is, which gives you the identity of the entity that actually built that package. The fallacy here is: the software supply chain is secure. It's not. You have to do your bit to make sure the software supply chain is secure. There are a lot of ways you can do that; one of the prominent ways these days, also supported by the OpenSSF, is the SLSA framework. If you go to slsa.dev, it defines certain levels of software supply chain integrity that you can take inspiration from, or follow right away, to make your software supply chain secure for your consumers and for yourself as well. I highly suggest going ahead and looking at that documentation. The next fallacy that we want to talk about is: there is only one administrator. The fallacy is that a very large distributed system has a single administrator; in reality, multiple administrators may institute conflicting policies. In the case of communities, most projects have more than one maintainer, and maintainers can have different motivations or different visions for the project.
To give some context about Kubernetes: Kubernetes is a very large project, and maintenance of any large open source project is very, very hard. To put some numbers on that, the Kubernetes contributor base is almost 83,000-plus as of October, with 1,800 org members across some 350-odd repos, and this creates a lot of problems. To solve that, we have a multi-tiered governance structure where the Kubernetes steering committee takes care of the non-code governance aspects of the community and delegates all areas of code-related governance to specific SIGs, special interest groups. They build policies and charters to govern their area of code, and the steering committee oversees them. This is true for a lot of projects. Even if you look at the CNCF, it has the technical oversight committee as well as the governing board handling different aspects of the foundation. Now, I said maintainers can have different visions for the project, right? And that's fine; it's very natural. But this incoherence in vision should not affect the long-term sustainability of the project. To take an example, Kubernetes puts in checks and balances. The steering committee instituted a policy that if we want to make a decision on a community-wide change, we need a two-thirds quorum. So with seven members on the committee, we need votes from five members to even pass a resolution. Similarly, other projects also have multiple maintainers, and they may each have their own agenda. Their visions may differ in the long run, but what they need to do is compromise to come to a common conclusion on any resolution or problem in the community. So essentially, the fallacy is that compromise is a rarity and not the norm, but the reality is that compromise is the norm and not a rarity. Coming to the last two fallacies of our talk, the next one is: transport cost is zero.
So you have hidden costs while running and maintaining a network, and if you ignore them, you quickly run into shortfalls. The parallel I want to draw to cloud native is: the cost of sustainably onboarding contributors is zero. That's wrong, and it's something I also talked about in another talk of mine just the day before yesterday, the ContribEx maintainer track session. New contributors are the lifeblood of any open source community and are crucial from a sustainability point of view. They come into the community through this thing called the contributor funnel. New contributors come in, and existing project maintainers help these new contributors grow and become what I call episodic contributors, or ECs. Episodic contributors are people who have proof of effort: they have done the work, they have managed to contribute. And these are the people who are potential maintainers of the project down the line. In other words, these are the people who tend to become maintainers in the future, the lifeblood of the sustainability of any open source project. Ideally, they become maintainers, the cycle continues, and all is good. But all is not good. The cost of converting episodic contributors to maintainers is often quite high, and there are multiple reasons for this. One of them is that maintainer bandwidth is not infinite, as we saw. But also, as a community and a project grow in size, so does the amount of undocumented context in them. If a contributor were to become an owner of a project area and start leading and maintaining it, they would need, or it would certainly help them to have, all the context behind everything that that project area includes.
However, since all or some of that context is not documented in accessible places, there may be decisions made by word of mouth, or institutional knowledge that exists only in, for example, my head, which I haven't put out for other people to know, so that I become a single point of failure. These sorts of things hinder people from becoming project owners in the future. As a result, episodic contributors start leaving. And this is kind of serious, because episodic contributors are the next maintainers, and if they leave, what's going to happen next? But we still need new people, so let's do more outreach, let's get more new contributors into the funnel. And hopefully that solves the problem, but maintainer bandwidth is still finite; that still has not increased. As a result, in the Kubernetes project, as large as it is, we don't have a mechanism for new contributors to get the help they need: maintainer bandwidth is finite, and episodic contributors are leaving, so new contributors don't know what's going on, and they also start to leave. And this isn't just a story in my head that I'm telling you because it's fun to tell: these are actual stats from the Kubernetes community. DevStats is the source for this, and you can look this up if you are curious about what the episodic contributor and new contributor health is for your own project. The top line is episodic contributors year on year, and the bottom line is new contributors year on year, and both of them have started to decline after a certain point. Now, as a project and its community grow, we as maintainers need to put conscious effort into uplifting and growing existing contributors in the project in order to avoid gridlock.
If we do this well and do it right, not only will this help projects and sustainability, but it will also help the new contributors that are coming in, because now they have more people to reach out to, more people they can seek help from. And we've done quite a good job of this in the Kubernetes community. One-on-one mentoring does not scale, so we started doing mentoring cohorts, and we've grown a bunch of new maintainers in different areas of the project. There's still a lot of work to do, but these are steps in the right direction. So if you need more people, getting newer contributors might not work the way you thought it would if your project is sufficiently large. Finally, the last fallacy of our talk: the network is homogeneous. If you assume a homogeneous network architecture, you're going to run into issues. The parallel I want to draw here is: staffing across project areas is homogeneous. Now, a community can almost feel like a black box when you first interact with it, but the more time you spend, the more the different parts of the community start to emerge. Open source communities are a web of socio-technical dependencies: you have multiple people interacting with each other, and you have technical dependencies like the CI infrastructure, the build infrastructure, all of those things. Soon it's not hard to see that there are critical dependencies in the project, things that, if not there, might wreak havoc in your project. And since I've mentioned critical dependencies, it is almost mandatory for me to display the XKCD comic strip of single-maintainer dependencies here. In a more general sense, not all areas of an open source project are staffed in proportion to their workload or their criticality. In fact, it is the non-shiny areas of a project that are critical and often understaffed.
So when the community still feels like a black box, it's easy to do the math in your head: oh, you have so many contributors, you have so many resources, why isn't initiative XYZ moving forward? Well, it isn't moving forward because there are critical dependencies that aren't staffed, and if those break, you won't have a project to complain about. Those are some of the reasons why it's not moving forward. You know how you can make it move forward? Come put some of your employees to work on the project. Understanding the staffing needs of a project you rely on is critical from a business continuity point of view, as I said. And sometimes, funding contributors to work on areas that you don't directly rely on but that are critical for the project itself can be the best thing you do for the project, and for yourself as well. So let's conclude. Some of the fallacies that we talked about have a solution, and some of them may not. What we wanted to highlight is that it's important for every community or project to be cognizant of these fallacies: to know what they're getting into, to know the pitfalls of running a community or running a very large open source project. This is mostly to ensure a healthy contributor base, because you have to think about the long-term sustainability of your project. To summarize the realities that we talked about: timelines are optimistic. We should prefer communicating asynchronously. We should be extra empathetic to the maintainers and help maintainers help you. If you use a project in unique ways, you should contribute your feedback, and your skills as well. You should make sure your software supply chain is secure. You should take into account a diverse set of opinions. And with large communities, you should spend effort on growing existing folks in addition to getting new contributors.
And the critical areas of a project are the ones which are often understaffed, and we should know which those areas are and try to make sure they are staffed adequately in the future. From the Kubernetes community point of view, we have a couple of sessions going on. There's a session going on right now if you want to hear from the Kubernetes contributors and maintainers directly about what challenges we face and how we solve them. This is happening at W470AB, and it's going on until three o'clock. So if you want to learn from us, drop by the session. If you want to know what the Kubernetes steering committee does and what kinds of things we have done to ensure the project is sustainable in the long run, you can come to the steering committee maintainer session as well, at three o'clock. I think it's at W19 something, but you can check the schedule for the room and come to the session. And Madhav mentioned contributor experience: in SIG Contributor Experience, we have observed and learned a lot from what is happening in the community, how we can solve the problems, and how we can grow. There was a session on Tuesday talking about a lot of these aspects; you should check out the recording when it's available. And with that, we are at the end of our session. If you have any feedback, please scan this QR code and provide us feedback on the session, or if you have anything to ask us, you can also ask, and if you drop your socials, we can get back to you. I see that we have seven seconds left for the session, so we can take questions in the hallway, probably not on the recording. Thank you, everyone, for attending the session.