Hi, welcome everyone. This is a talk about writing scalable distributed systems using Elixir. I'm Odit, a software developer at Nelinso, which is an employee-owned software consultancy. We have a desk right outside if you want more information on that. We write high-throughput, web-scale services for a living, and this talk is primarily about a live game engine that I built for one of our clients, Insider.in. It's out on prod if you want to check it out.

Briefly, this is the agenda of the talk. First, we'll discuss why Elixir is a good fit for writing a distributed system. Then we'll take a peek under the hood, figure out how things work on the Erlang VM, and introduce some of the constructs that Elixir provides for writing distributed systems. Once we have that tool set, we will go through writing a dummy live game engine, then rinse and repeat: make it better, more redundant, more powerful, and so on.

So why should you use Elixir for writing a distributed system? By the way, throughout the talk I may use Elixir and Erlang interchangeably, because even though they are two different languages, at their heart the philosophy for writing a distributed system is the same. Before that, let's have a very brief conversation about why distributed systems are hard in general. I won't go into any depth at all, because a detailed discussion of this topic would keep us here for a very long time. State, computation, reliability, and ordering of events are easy to reason about when you are in a single-node context, whereas once you move to a distributed system, the mental overhead of reasoning about them becomes very large. Take state: a database running on a single node is the primary example, and it's very easy to reason about because you have a single source of truth. You don't have to worry about replication or sharding, and you can easily follow data as it evolves. Add just a read-only replica to this setup, and now you have to worry about replication, about failures, and about what happens during that small window when your primary is failing over to the secondary. So, like I said, the complexity simply explodes once you are in a distributed system.

So why use Elixir? The very first thing is that Elixir is distributed out of the box. You don't have to use any external library; you just add something to your configuration file or run one command on your terminal, which I'll show later, and bang, you have a distributed cluster of Erlang nodes running in parallel. The second thing I want to highlight is that Elixir uses asynchronous message passing across the board. This may not seem that powerful; however, it ensures that your communication mechanism remains the same whether you are talking to a process running on your own VM or to a process running somewhere on the AWS cloud. The mental overhead of thinking about how that communication happens is exactly the same. The third thing is that there is no sharing at all: every process in Elixir or Erlang owns its own data. Since there is no shared data, we have sidestepped the entire set of problems that comes from concurrent access to data.
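A minimal sketch of that symmetry, assuming a process has been registered under the name game_server on each node (the name and the node address here are placeholders, not anything from the talk):

```elixir
# Sending to a registered process on the local VM:
send(:game_server, {:answer, "player-1", :b})

# Sending to the same name on a remote node: the call shape is identical,
# only the address changes to a {name, node} tuple.
send({:game_server, :"game@10.0.0.2"}, {:answer, "player-1", :b})
```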
The other things are that Elixir provides excellent primitives for dealing with concurrency, and hence, by extension, parallel computation. It also provides primitives for handling fault tolerance. I'm pretty sure many of you have heard about supervisors; if not, we'll have a small chat about them in the following section.

All right, the next section: what goes on under the hood, meaning what happens inside the Erlang VM. The Erlang VM is known as the BEAM, which stands for Björn's Erlang Abstract Machine; Björn is the person who maintains the code base for the BEAM. There also used to be JAM, Joe's Abstract Machine, which was maintained by Joe Armstrong. The BEAM is like any other virtual machine: it compiles your code down to bytecode and then runs it on the machine. Erlang, Elixir, Gleam, LFE. These are some of the languages that compile down to BEAM bytecode. That is it.

All right, let's get into the thick of things. What is a process? A process in Erlang terminology actually means a thread of execution, so think green threads, because they are super lightweight. They have a very small memory footprint, they are very fast to create and terminate, and the scheduling overhead for them is very low. A typical process that you create in Elixir will cost you something like 600 bytes of memory. Because of this small memory footprint, you can potentially create millions and millions of processes, even on a single machine.

The next point is that processes communicate via message passing. The only way to tell a process to do something is by sending it a message. You cannot go in and fiddle with its state; you cannot access its state at all. You can only send it a message, and then, at the discretion of the process, it will do whatever it wants to do. And every process is single-threaded. What I mean by that is that a process will read the first message from its mailbox, execute it, update its state, read the next message, execute, update, and so on, until it gets suspended.

I'm not sure if that diagram is visible, but this is what the memory layout of an Elixir process looks like. At the top you have the process control block, which contains things like the process's ID, its name if it has one, what the initial call was, and a pointer to the latest mailbox message. Then there is a contiguous block of memory: at the top we have the stack, and at the bottom we have the heap. The stack grows from top to bottom, whereas the heap grows from the bottom up. The stack contains things like local variables and function parameters, whereas the heap contains larger things like your mailbox messages. In such a layout, garbage collection happens when the stack meets the heap: either you get too many messages and the heap grows too much, or the local variables for your function's execution grow too much and the stack touches the heap. That's when garbage collection happens. To deal with it, the BEAM does either compaction or a full copy. What I mean by that is: if the new memory requirement is small enough to be satisfied within the same memory block, the stack and the heap are compacted and the memory in the middle is freed up; otherwise, the entire memory block is copied over to a completely new memory location so that more memory can be allocated. One thing worth noticing is that garbage collection always runs on the process's own CPU schedule. What I mean by that is that garbage collection only runs when your process has the CPU, so it can essentially eat into the CPU time of your process, and if you're doing too much garbage collection, the throughput of your process can go down.
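A small illustrative sketch of how cheap and self-contained a process is; the exact byte counts vary by VM version and word size, so the numbers in the comments are only indicative:

```elixir
# Spawn an idle process that just waits for a :stop message.
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

Process.info(pid, :memory)              # e.g. {:memory, 2688} - the full footprint in bytes
Process.info(pid, :garbage_collection)  # per-process GC settings and statistics

# GC is per process: forcing a collection here touches only this process's
# heap, never some global heap shared with other processes.
:erlang.garbage_collect(pid)

send(pid, :stop)
```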
Then there are the schedulers. These are the things (they are not processes themselves) that actually provide processes with CPU time. On the right, I'm not sure if it's visible, but there is a finite state machine of all the different states a process can be in: there's runnable, from runnable it can transition to running, then there's garbage collecting, then exiting, and so on. Schedulers keep a queue of all the runnable processes, and what they do is pick up the first process from this queue and give it the CPU to start executing. The schedulers are soft preemptive. What I mean by that is that even if a process has overshot its CPU quota, the scheduler will not suspend it; it lets it run until the process reaches the next valid state, be it suspended, exiting, or waiting. What this buys you is that the scheduler does not have to worry about intermediate execution state, and that allows for faster scheduling, because you never need to persist "my process was in the middle of this execution, so I need to store these stack variables somewhere" and so on.

So this is how you would write a process in Elixir. Apologies for the very small font size; I had to smoosh it in. This is the more interesting part: what I'm doing here is starting a receive block, which says "I will block until I get a message in my mailbox." When I do get a message, if it is exit, I just shut down; otherwise, I print the message and then call myself recursively. On the right-hand side, in the GIF, you can see I'm starting that same process. I send it a message, hello, and you can see it gets printed on the console; if I send exit, it shuts down, and if I check for the process's aliveness, it's not alive anymore.

All right. Processes by default don't have any names associated with them. However, they do have unique PIDs, and that's the only way you can talk to them: you say "send a message to the process with this PID." Process registration is a way around that, so you can give a process a more meaningful name. There are several strategies for doing that. On the left, we have the built-in strategies that are present in both Elixir and Erlang; on the right-hand side, we have some libraries that do the same. No name basically means you don't give any name to your process, and the only way to communicate with it is via the PID; if you don't have a reference to the PID, you cannot send that particular process any message. Local registration means that all the processes running on your local VM can contact the locally registered process; if you're on a different VM altogether, you will not be able to contact it. Global means a globally registered process, so anyone anywhere can send it a message. The two libraries, pg2 and Swarm, are used for the same thing. Registry, with a capital R, is something more specific to Elixir: it's built in rather than an external library, but it only exists in Elixir, which is why it's on the right-hand side.
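The slide code isn't reproduced in these notes, so here is a rough reconstruction of the loop and the registration strategies just described; the module and names are my own:

```elixir
defmodule Echo do
  # Block until a message arrives; shut down on :exit, otherwise print and recurse.
  def loop do
    receive do
      :exit ->
        IO.puts("shutting down")

      msg ->
        IO.inspect(msg, label: "got")
        loop()
    end
  end
end

pid = spawn(&Echo.loop/0)
send(pid, "hello")                     # printed by the process

# Registration gives the PID a meaningful name:
Process.register(pid, :echo)           # local: only visible on this VM
# :global.register_name(:echo, pid)    # global: reachable from any connected node
send(:echo, "hi again")

send(pid, :exit)
Process.alive?(pid)                    # false once the loop has returned
```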
All right, so supervisors. Supervisors are specialized processes that have only one job, which is to monitor other processes. They are the ones who help us create really fault-tolerant applications by automatically restarting the child processes they are monitoring. Erlang is touted to have nine nines of availability, and the way you get there is by stacking supervisors on top of supervisors so that you have a complete supervision tree, so that your application almost never dies.

Right. The next thing I want to talk about is GenServers. Everything we discussed about processes so far also applies to a GenServer; it's just a better abstraction over state, although under the hood it's still just a process. In production code, you would typically never use a naked process; you would almost always use a GenServer, and we'll see why. If you remember how we did message passing to a process, at the very end of the receive block we were calling ourselves recursively. GenServers abstract that particular thing away; that behavior is implicit in a GenServer. The ways of communicating with a GenServer are these three functions. Info is exactly the same as sending a process a plain message. Cast is also like info; however, instead of a PID, you can give it the process name, and under the hood it figures out which PID is associated with that name and sends the message to that particular process. Call is more interesting, because call does what cast does (it does the name-to-PID translation), but it also has a timeout. So in a way it allows you to write synchronous-looking code on top of asynchronous message passing. What happens when I do GenServer.call is that I send a message to the process and then, for whatever the timeout is, I wait that long, doing nothing, for that process to reply to me. If that does not happen, I get a timeout, and then I, as a process, can do whatever I want on that timeout.

All right, let's discuss distributed nodes really quickly. Elixir VMs are known as nodes in Elixir land. To have a distributed Elixir cluster, you can connect them either by providing startup configuration or by manually giving the connect command, which we'll see in a moment. The interesting thing here is that they always form a fully connected mesh network, which basically means that if you have five nodes in your cluster, each of them keeps a separate TCP connection to every other node, so the cluster is fully connected. The other important thing is that each of these nodes keeps sending heartbeat messages to the other nodes so that everyone is aware of its liveness. In this GIF, on the left-hand side and the right-hand side, I start two Erlang VMs. You can see that Node.list on both sides is empty right now; then I just say Node.connect, give it the other VM's name, and you can see that Node.list is magically populated on this side with the other node and on that side with this node. Now I register the shell on the right-hand side under the name console, send that console a message from this side, and on the right-hand side, if I flush, I get that message there.
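Roughly what that demo amounts to, assuming two nodes started on the same machine with iex --sname a and iex --sname b (the host name below is a placeholder):

```elixir
# On node a:
Node.list()                        # []
Node.connect(:"b@my-host")         # true
Node.list()                        # [:"b@my-host"]

# On node b, register the shell process so it can be addressed by name:
Process.register(self(), :console)

# Back on node a, send the remote console a message:
send({:console, :"b@my-host"}, :hello)

# On node b, flush() then prints :hello
```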
So you now have a connected Elixir cluster, but you would not do it this way on prod; you would obviously use something like Kubernetes or maybe some sort of DNS configuration. There's a library out there called libcluster. What it does is leverage the Kubernetes metadata API to figure out when a new pod has spun up, and then it automatically adds it to your cluster. The interesting thing to note here is that while automatic discovery of new nodes does not happen out of the box, the cluster will automatically detect that an existing node has died because of missed heartbeats.

So I think we now have all the tools we need to write a live game engine. If you look at the live quiz engine from a very high level, these are its primary responsibilities: first it publishes a question, then it receives and evaluates answers from the players, and finally it broadcasts the result. If you squint, you can see that it's sort of a restricted chat. Think of an IRC server: when you send it a message, it broadcasts it to all the other clients in your IRC room. Here it's more restricted, because the server first publishes the question, and when it receives your answer it does not broadcast the answer itself; it broadcasts just a summary.

Moving on: for this demo I have written a TCP listener. What it does is spawn a new TCP connection for every new player that connects. This is just for demo purposes; in actual production you would typically use something else. In our project we use something called MQTT, which is also a very lightweight protocol for pub-sub and message passing. MQTT is more interesting because it provides things like delivery guarantees, such as exactly-once or at-least-once.

So let's move on to the very first approach we might take to writing this live game engine, which is to have one single process that sends the question, receives the answers, does the summarization, and then broadcasts the summary. It would look something like this. The interesting bits are these: what I'm doing here is starting just a single process, registering it under the name single, and this registration is global, so anyone can reach it. Then, if you look at the incoming function, which is being passed to the listener in the previous code as a callback, what it does is say "whenever you get a new message, just queue that message up in the mailbox of this single process." This approach would work for a smaller number of players, but it is absolutely single-threaded, because you have just one process doing all the work: it publishes the question, it evaluates the answers, and once all the answers have been evaluated, it publishes the summary. The problem, obviously, is that since you have just a single process, its message queue builds up as the number of players grows, the process keeps executing messages one after the other, and eventually the total time for your round becomes really huge. You don't want that.
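Before moving to the next approach, here is a rough sketch of what this single-process engine might look like, written as a GenServer (as mentioned earlier, you would rarely use a bare process in production). The module name and the incoming/1 callback are assumptions based on the description, not the actual demo code:

```elixir
defmodule SingleEngine do
  use GenServer

  def start_link(_opts) do
    # Registered globally, so any node in the cluster can reach it as :single.
    GenServer.start_link(__MODULE__, %{}, name: {:global, :single})
  end

  # The TCP listener hands every inbound player message to this callback,
  # which just forwards it into the single process's mailbox.
  def incoming(message) do
    :global.send(:single, message)
  end

  def init(answers), do: {:ok, answers}

  # One process does everything: collect answers, summarize, broadcast.
  def handle_info({:answer, player, choice}, answers) do
    {:noreply, Map.put(answers, player, choice)}
  end

  def handle_info(:round_over, answers) do
    broadcast_summary(answers)
    {:noreply, %{}}
  end

  defp broadcast_summary(answers) do
    # Stand-in for pushing the summary back out over the player connections.
    IO.inspect(Enum.frequencies(Map.values(answers)), label: "round summary")
  end
end
```

Plain send/2 messages land in handle_info/2, which is why the listener callback can stay a one-liner.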
So let's go to the other extreme: let's create a process for every player. It would look something like this: for every new connection that you get, you start a worker and pass that worker's incoming function as the callback for the connection. What happens now is that for every player you have a new worker process running that can evaluate that player's answer. This approach is much faster, because there is a one-to-one mapping between players and processes, so you have a lot of concurrency. However, this approach will not work for our live game engine. It would work perfectly if you had a regular service with consistent traffic across the board, but if you look at a live game, the traffic is very spiky: you publish a question, immediately everyone answers, you get this traffic spike, and then it just dies down until the next round comes up. In that scenario, if you have, say, a million players playing your live game, you would suddenly have a million processes that are up for scheduling, and the run queue at your scheduler would be very large. So although each individual process would be very fast at evaluating its response, the time for that process to actually get CPU would be longer. This will work, but it will not scale very well for a very large user base.

Right. So the third strategy is to use something balanced: have N worker processes and have each worker process map to M players. If you look at the code here, what's happening is that when you start the aggregator process, at the very start it spins up a bunch of worker processes, and then as and when a new connection happens, as and when a new player joins, you hash that particular player to a particular worker ID, so that every worker now supports a bunch of players. In this scenario you are obviously doing sharding, and this approach definitely works; it's also the approach discussed in Discord's blog post about scaling to 5 million concurrent users. One other thing that I missed while talking about the one-to-one and N-to-M strategies is that since these processes are independent, you are not actually restricted by the size of one host machine. These processes can reside on any machine, so if you have a five-machine cluster, you can say that this machine hosts one process and that machine hosts twenty processes, and so on.
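A sketch of that balanced approach with hypothetical names: N workers are started up front, and each player is hashed onto one of them, so every worker serves a shard of the players:

```elixir
defmodule WorkerPool do
  @pool_size 8

  # Start N workers up front and register each under a predictable name.
  def start do
    for i <- 0..(@pool_size - 1) do
      pid = spawn(fn -> worker_loop() end)
      Process.register(pid, :"worker_#{i}")
    end
  end

  # Deterministically map a player id onto one of the workers.
  def route(player_id, message) do
    index = :erlang.phash2(player_id, @pool_size)
    send(:"worker_#{index}", {player_id, message})
  end

  defp worker_loop do
    receive do
      {player_id, message} ->
        # Evaluate this player's answer here.
        IO.inspect({player_id, message}, label: "handled by #{inspect(self())}")
        worker_loop()
    end
  end
end
```

Because :erlang.phash2/2 is deterministic, every message from the same player lands on the same worker, which is what lets each worker keep the state for its shard of players.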
All right, so we do have a live game engine now. It sort of works; it's mostly an MVP, because if you take it to production it will fail for some of the reasons we're going to discuss now. First, we don't have any supervision at all. The worker processes are just being spun up at the top level, so the very first thing you would do is add supervisors on top of them, so that even if a worker process dies, there is something to restart it. The other thing you would potentially do is separate out the processes. In all three approaches we discussed, the same process was doing all the heavy lifting: it was maintaining state, accepting answers, and executing business logic. Typically, in production, you would want to separate these three responsibilities into three different processes. That not only modularizes your code, it also lets you apply different supervision strategies. For instance, you never want to lose state, so you would guard that process much more carefully, whereas the process executing business logic can be restarted as and when you require. You would still supervise it, but the supervision strategy for a process executing business logic and for a process maintaining state would certainly be different.

The third thing I want to talk about is distributed processes. We discussed that in the one-to-one and N-to-M approaches you can actually distribute these processes across nodes. That works; however, if you remember, at the very start I said that an Erlang or Elixir distributed cluster forms a fully connected mesh, and each pair of nodes in that mesh is connected by a single TCP connection. So if you are sending many messages between two particular nodes, they are all serialized over that one TCP connection. It is fine to distribute processes across nodes, but you need to be aware that the communication between two different nodes is being serialized over a single TCP channel. Typically, to address this, you would do something called an island architecture, where processes communicate freely with each other within the same island, meaning on the same Erlang node, and whenever they need to communicate with a process on a different node altogether, that is a more deliberate, specialized operation. That reduces your cross-node chatter, so you don't hit network latency or the serialization over a single TCP connection as often.

All right, that's all I have. These are some of the references that I used for these slides. Thank you.