Okay, thank you everyone for coming. This is "PG2 and You: Getting Distributed with Elixir." My name is Eric Entin. You might know me on Twitter, GitHub, or IRC as antipax, and I work for a company called FanDuel, where we have a couple of Elixir projects in production, all of which are distributed.

Just to get started, I wanted to get a show of hands. Who here has never written any Elixir before? Okay, cool. Who's been playing around with it a bunch in their free time? Okay, cool. Who's doing it at work at all? Okay, cool. And who has something in production? This is unbelievable, honestly. At last year's ElixirConf, maybe five percent of the people in the room had something in production. So I think there's been a real shift here, from it being something that people are just interested in to something that people are getting actual real work done with, and I just think that's awesome.

One thing that I think people have a little bit of trouble with when they learn Elixir is making the jump from Elixir on a single node to Elixir on multiple nodes. I think we can all agree that distributed applications are awesome, right? You get fault tolerance, you get better performance, all the advantages that we're familiar with from distributed applications. And if you've used it, I think we can all agree that distributed Elixir is really awesome. I mean, just look at how much fun these guys are having. I think we all know that what makes distributed Elixir really awesome is distributed Erlang and OTP. If anybody hasn't seen this before, this is from a YouTube video called "Erlang: The Movie", and I highly suggest you watch it. It's really entertaining. And it's kind of mind-blowing, right?
Because you watch this video and you see them interacting with the Erlang shell in, like, the '80s, and they're still typing the exact same things that we're typing today. It really hasn't changed that much.

So even though distributed Erlang, OTP, and distributed Elixir are really awesome, distributed applications are hard. They're really hard, for a number of reasons. There's just the nature of physics: the speed of light makes distributed applications hard. The complexity of distributed applications makes them hard. All that stuff. And even though we all love Elixir and Erlang, Erlang/OTP is not magic. Thank you to whoever created that thing on Google image search. I don't know if anybody here has ever done a presentation and just typed something into Google image search, hoping that maybe someone has created an image that fits a term, but when I typed "Erlang is magic," this is what came up. It's unbelievable.

But OTP does provide some really nice tools that make distributed applications a little bit easier to build, and one of these tools, shockingly, is called PG2. PG2 provides distributed named process groups. Okay, so what is that? PG2 allows us to create, join, and query groups of processes across a cluster. In detail, we can access a group of processes by a common name. This is similar to process registration, where you give a process a name and can find it later on the local node, except in this case there's a set of processes that are given a name, and they can be spread across multiple nodes. We can send a message to one, some, or all of these group members, and if a member process terminates, it's automatically removed from the group. That's really useful and important. So that's the high-level idea of what it does.
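The create, join, and query operations described above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the group name is made up, and it assumes Erlang/OTP 23 or earlier, since the `:pg2` module was removed in OTP 24.

```elixir
# Assumes Erlang/OTP 23 or earlier, where :pg2 still ships with OTP.
# The group name is a hypothetical example; any term can be a group name.
group = {:example, :chat_listeners}

# Creating a group that already exists is not an error.
:ok = :pg2.create(group)

# Any pid can be added to a group, not only self().
member =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

:ok = :pg2.join(group, member)
:ok = :pg2.join(group, self())

# Query the group and send a message to every member.
for pid <- :pg2.get_members(group), do: send(pid, {:hello, group})

# When a member exits, pg2's monitor removes it from the group,
# so no explicit leave call is needed here.
send(member, :stop)
```

On a connected cluster the same calls work unchanged, and `:pg2.get_members/1` returns pids from every node that has joined the group.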
And I'm sure some people here still don't really get it. I always find it's really useful to take a look at an example of how the tool is actually used. So we're going to take a look at a real-world example of PG2 in use in a chat app, the classic Elixir example, using Phoenix channels. I'm sure you're all familiar with Phoenix channels at this point, but for anybody that isn't, it's an API that Phoenix provides that gives you bidirectional communication for soft real-time functionality.

So let's take a look at a diagram of how our Phoenix chat app works on a single node. We have a Phoenix server, we have a couple of devices, and let's say one of these devices wants to send a message to the chat room. It sends the message, and then our Phoenix server simply sends that message to the other devices. That's classic broadcast, and that's how a chat app can work on a single node: we receive a message and we broadcast it to the rest of our listeners. That's easy, right? It's not complicated at all. I'm sure people built that themselves before Phoenix existed.

But what about when we outgrow one server? I've put a little note here that it's probably for redundancy, because you're probably not going to hit Phoenix's performance limits on a single node. So let's take a look at our distributed Phoenix chat app now. We still have a Phoenix server, but now we have a second Phoenix server. This little line in the middle just represents the fact that, except for the servers themselves, any item in this diagram is connected only to the server on its own side of the line.
So we have a tablet and a phone connected to the first Phoenix server, and a browser connected to the second Phoenix server. Let's see what happens when our first client wants to send a message. We send the message to the first server. But what do we do now? We have this browser client that's connected to a separate Phoenix server, and this phone client, which is still connected to the same server. So we can't use the exact same strategy we used before.

What we can do is send the message from our server to our other servers and to any clients that are connected locally, and then our other server can send that message on to any clients connected to it that are interested in that message. This is a familiar pattern, right? This is fan-out. I'm sure people have seen this before, and it can scale out to as many servers as we have. Instead of one arrow between this first server and the second one, we have arrows between multiple servers.

But the question is, how is this actually implemented in Phoenix? How do we do that? If we take a look at a really simplified version of how Phoenix PubSub works, we have a pub/sub server which manages some channels, multiple sockets can be interested in each channel, and then we have clients which are attached to those sockets. Again, this is an incredibly simplified version of how this works. But the really important part here is that this pub/sub server is a process that's running on each node. That's the important thing to note.

Okay, so going back to our distributed chat app, we have a question. If we have a pub/sub server on each of these nodes, and we want to send this message from the first server to the second server, how do we find the pub/sub server on our other nodes?
We can't just register it, because that's a local thing, right? I mean, technically we can kind of get around that, but it's not the best way to do it. What we can use is PG2. And Justin Schneck asked me to put this GIF in a slide when I posted on Twitter, so this was the only place it fit.

So, by having our Phoenix pub/sub servers join a PG2 group, we can fan our messages out across the cluster. I just wanted to put a little disclaimer here that in practice pub/sub is implemented via adapters, so we can use PG2 or Redis. And as Chris was talking about earlier, maybe one day we can actually use Phoenix Presence for process groups, so maybe in the future Phoenix won't even need PG2. So this talk is already irrelevant. Not really.

But let's take a really quick look at a really simplified code example of how this actually works within Phoenix. Here we have a super simplified Phoenix pub/sub server that uses PG2. It's a GenServer, and when the server starts it's going to create a group. One thing you may notice here, which is an advantage of PG2 over normal process registration, is that your group's name doesn't have to be just an atom. It can be any Elixir term. In this case our group's name is a tuple containing the atom `:phx` and a server name. The server name is just a way to separate out the pub/sub servers if you have multiple Phoenix apps within one node, so it's pretty much going to be your OTP app's name, just so you have an idea what that's for.

After we create this group, we're going to join it. Another thing that you may notice is that when we join a group, it doesn't have to be ourselves. It could be any process that we have the PID of. So after we've joined this group, what do we do with it?
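The slide itself is not part of this transcript, so what follows is a reconstruction under stated assumptions rather than the code that was shown: the module name `SketchPubSub.Server` and the `group/1` helper are hypothetical, and the sketch needs Erlang/OTP 23 or earlier because `:pg2` was removed in OTP 24.

```elixir
defmodule SketchPubSub.Server do
  # Hypothetical, simplified stand-in for the server described in the talk.
  # Requires Erlang/OTP 23 or earlier (:pg2 was removed in OTP 24).
  use GenServer

  def start_link(server_name) do
    GenServer.start_link(__MODULE__, server_name, name: server_name)
  end

  # Group names can be any term; here, a tuple of :phx and the server name.
  def group(server_name), do: {:phx, server_name}

  @impl true
  def init(server_name) do
    group = group(server_name)
    :ok = :pg2.create(group)
    # join/2 accepts any pid; the server simply adds itself.
    :ok = :pg2.join(group, self())
    {:ok, %{server_name: server_name}}
  end
end
```

After `SketchPubSub.Server.start_link(:my_app_pubsub)` has run on each node, `:pg2.get_members({:phx, :my_app_pubsub})` returns one pid per node, which is what makes the fan-out possible.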
If we look a little bit further down in the code, we'll see something similar to this. It's a broadcast function, which takes a server name, a topic, and a message. All we're going to do is use PG2 to get the members of our group, which are going to be PIDs, and then for each of those PIDs we're going to send it a message saying that we're broadcasting this topic and this message. Then, in our usual `handle_info` callback, we're going to receive that message and call this pretend `Local.broadcast` function, which actually goes and checks whether there are any clients connected to the server we're currently running on that are interested in this message and that we can broadcast it to.

In practice, Phoenix PubSub is more complicated, thanks to extensive optimization. You won't actually see a message being sent to every member of the group. If we're the ones sending the message, we'll just broadcast locally right away. We won't send a message to ourselves for no reason, so we only send the message to other nodes. There are a couple of other optimizations in there, but this is essentially how it works.

I think the interesting thing here is that that's kind of the sum total of PG2. That's all the functionality in PG2: you can create groups, you can join them, you can get their members. That's basically it. But one really interesting thing we can do with PG2 is look at how it works, because it's actually using all these pieces of the OTP toolbox to implement its functionality.
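The broadcast path described earlier in this part of the talk can be sketched as follows. This is an illustration under assumptions, not Phoenix's source: the module name `FanoutSketch` is hypothetical, the locally connected clients are modeled as a plain list of pids, the local-node optimization is left out, and `:pg2` requires Erlang/OTP 23 or earlier.

```elixir
defmodule FanoutSketch do
  # Hypothetical, simplified pub/sub server; Phoenix's real implementation
  # differs. Requires Erlang/OTP 23 or earlier (:pg2 was removed in OTP 24).
  use GenServer

  def start_link(server_name, local_subscribers) do
    GenServer.start_link(__MODULE__, {server_name, local_subscribers})
  end

  # Send the topic and message to every pub/sub server in the group,
  # including the local one (the real code skips that hop).
  def broadcast(server_name, topic, message) do
    case :pg2.get_members({:phx, server_name}) do
      pids when is_list(pids) ->
        Enum.each(pids, &send(&1, {:broadcast, topic, message}))

      {:error, {:no_such_group, _}} = error ->
        error
    end
  end

  @impl true
  def init({server_name, local_subscribers}) do
    group = {:phx, server_name}
    :ok = :pg2.create(group)
    :ok = :pg2.join(group, self())
    {:ok, local_subscribers}
  end

  # Each server receives the forwarded message and delivers it locally.
  @impl true
  def handle_info({:broadcast, topic, message}, local_subscribers) do
    local_broadcast(local_subscribers, topic, message)
    {:noreply, local_subscribers}
  end

  # Stand-in for the "pretend" local broadcast function from the talk.
  defp local_broadcast(subscribers, topic, message) do
    Enum.each(subscribers, &send(&1, {:message, topic, message}))
  end
end
```

Calling `FanoutSketch.broadcast(:my_app, "room:lobby", "hi")` from any node reaches every joined server, and each server then forwards the message to its own subscribers.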
And by reading the code of PG2 and understanding how it works, we can also learn about how it behaves under load. So in order to answer these questions and learn exactly how PG2 works, I actually translated the code to Elixir. Introducing RePG2, because I didn't want to call it PG3. It's a highly documented translation of the original Erlang PG2 implementation to Elixir, purely for educational purposes.

I'm sure some people are asking why exactly you would do this. You could just read the PG2 code, right? And that's true. The true specification of behavior is the code itself, so by reading the code of our favorite software we can get a deeper understanding. But not everyone knows Erlang, and despite the high quality of the implementation, PG2's Erlang code is not necessarily easy to read even if you do know Erlang. The OTP code is pretty old, so sometimes there are a lot of warts in there. And anyway, sometimes you just do it for fun because you're bored.

With these goals in mind, I set out some guiding principles for the translation, because you could translate code from Erlang to Elixir in a lot of different ways and make a lot of different trade-offs. First of all, RePG2 code should be idiomatic, easy to read, and fully documented, perhaps even over-documented, Elixir. RePG2 should be identical to PG2 in terms of its functionality and performance characteristics, even if it has been refactored to increase clarity. And code which exists purely for backwards compatibility may be eliminated in the interest of clarity. So the idea here is that essentially I'm trying to preserve everything
that's important in terms of how the original Erlang PG2 implementation works, while removing the stuff that maybe makes it harder to understand at a glance. Additionally, tests were written with ExUnit for full RePG2 code coverage, including a distributed suite that interacts with multiple nodes.

So what are some differences between RePG2 and PG2, given these principles? As I mentioned, RePG2 does not have the same backwards compatibility as PG2, and it's only been tested on OTP 18.3 and Elixir 1.2.4, so you can see that I wrote this a while ago.

Also, you might not know this, but PG2 is started under something called `kernel_safe_sup`, which is a special OTP kernel supervisor for important services that it considers safe to restart. You as a user don't get to put processes into this supervisor. There are other services in there, and PG2 is one of them. I'm actually drawing a blank right now on the others, I know there are ones in there, but all the really important stuff that you wouldn't want to possibly have affected by user code is going to be in this special supervisor. RePG2 is just a normal OTP application. Also, PG2 will actually start itself if it hasn't been started yet when you use it. RePG2 expects to be added to the applications in your `mix.exs`, and it won't start itself.

So how much work was this translation? It really wasn't much work at all, because PG2 is really tiny. In fact, PG2 is only 333 lines of code. That's kind of mind-blowing, that you get what seems like a really complicated piece of functionality with such a small amount of code. And again, it's this simple because of other useful tools that OTP provides, and RePG2 uses all these tools as well.
So if we take a look at PG2's OTP toolbox, we have a couple of things that we're probably familiar with, and maybe some things that we're not. We have GenServer, which is probably familiar to everyone here. We have ETS, which is also probably pretty familiar. We have the `:global` module, which I think some people probably haven't heard of before. And then we have node and process monitoring. I'm sure people are aware of process monitoring, but maybe they haven't seen node monitoring before.

Let's take a look at each of these individually. All of this applies equally to RePG2, because it uses these tools in the same way. And by looking at how PG2 uses these tools, I'll be able to give you as much of a high-level understanding of how it works as I can today.

First of all, looking at how it uses GenServer: each node which is using PG2 has a PG2 server process running, and this server process serves as the central point of interaction for PG2 between and within each node. That's a pretty common pattern.

In terms of ETS, if you haven't heard of it before, ETS is an in-memory concurrent storage solution for Elixir terms. In PG2, ETS is used to store process groups and their memberships. One advantage of this is that reads can happen from any process, but in order to avoid race conditions, writes are serialized through the node's PG2 server. ETS is a really useful tool. It's in the Getting Started guide on the elixir-lang homepage, right?
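The read-from-anywhere, write-through-one-server pattern just described can be sketched like this. It is a generic illustration of the pattern, not PG2's actual table layout: the module name `GroupTable`, the table name, and the `join/2` and `members/1` functions are all hypothetical.

```elixir
defmodule GroupTable do
  # Hypothetical sketch of the pattern: any process reads from ETS directly,
  # while writes are serialized through a single GenServer.
  use GenServer

  @table :group_table_sketch

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  # Reads go straight to ETS from the calling process; no message passing.
  def members(group) do
    @table
    |> :ets.lookup(group)
    |> Enum.map(fn {_group, pid} -> pid end)
  end

  # Writes go through the server, so they are applied one at a time.
  def join(group, pid) do
    GenServer.call(__MODULE__, {:join, group, pid})
  end

  @impl true
  def init(:ok) do
    # :protected means any process may read, but only the owner may write.
    :ets.new(@table, [:named_table, :bag, :protected, read_concurrency: true])
    {:ok, nil}
  end

  @impl true
  def handle_call({:join, group, pid}, _from, state) do
    :ets.insert(@table, {group, pid})
    {:reply, :ok, state}
  end
end
```

Because the table is `:protected`, a write attempted from any process other than the server fails, which enforces the serialization rather than leaving it to convention.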
So obviously it's pretty important, and this pattern in particular is very common. In fact, that Getting Started guide on the elixir-lang site goes through this exact kind of pattern, where you can read from any process but you serialize your writes through a single server in order to prevent race conditions.

The `:global` module provides a few different things. One thing it provides is cluster-global name registration for processes, which you probably shouldn't use because it has a number of performance issues. But it provides a couple of other things, including cluster locks. One function that `:global` provides is called `trans`, for "transaction." This function acquires a lock across the entire cluster, using any Elixir term as a key, runs a provided function, and after the function completes, the lock is released.

One thing that PG2 does is combine this `trans` function with GenServer's `multi_call`, which allows us to call all the processes registered with a given name within a cluster. By doing that, PG2 can ensure that only one process across the entire cluster can modify any given group at a time. What this means is that any time we join, create, or leave a group, we're acquiring a lock across the entire cluster, which ensures that when we send our message to the PG2 server on each node, we're the only process that's allowed to do that at that time. This pattern can be very useful in our own code. We have to be careful, because we're introducing a lock.
We're introducing network round trips, all that kind of stuff. But a global transaction plus a multi-call can be very useful for synchronizing activity across an entire cluster.

And then we have node and process monitoring. One interesting thing that a lot of people are probably not familiar with is a function on the `:net_kernel` module called `monitor_nodes`. This allows the calling process to register for notifications about nodes connecting to and disconnecting from the cluster. That's really useful. You can use it when a new node joins to start processes on it, or do whatever you have to do. In PG2's case, when the PG2 server receives a notification that a new node has connected, it merges the groups and memberships between itself and the new member's PG2 server. Additionally, PG2 registers a monitor for each process which joins a group. If the monitor reports that the process is down, which could be either because the process died or because its node disconnected, the process's membership is removed from the local view of the data.

That's pretty much how PG2 works. At a high level, that's really it.

So what are some key insights that we can take away from those PG2 implementation details? First of all, PG2 uses global locks, and that's not ideal. While reading group memberships is very fast, modifying them is a globally locked operation requiring multiple network round trips. What that means is that we may run into problems with lock overhead if our groups contain a large number of memberships.
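The globally locked write path mentioned above, a global transaction wrapped around a multi-call, can be sketched as follows. This is an illustration of the pattern and not PG2's source: the module name `LockedUpdate` and its functions are hypothetical, and the state is a plain map rather than an ETS table.

```elixir
defmodule LockedUpdate do
  # Hypothetical sketch of "global transaction plus multi-call".
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Take a cluster-wide lock keyed on {module, key}, then ask the server
  # registered under this module's name on every connected node to apply
  # the write. Only one caller per key can be inside the function at a time.
  def put_everywhere(key, value) do
    :global.trans({{__MODULE__, key}, self()}, fn ->
      GenServer.multi_call(__MODULE__, {:put, key, value})
    end)
  end

  def get_local(key), do: GenServer.call(__MODULE__, {:get, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end
end
```

`put_everywhere/2` returns the multi-call result, a tuple of per-node replies and a list of nodes that did not answer. Every call pays for acquiring the lock plus a round trip to each node, which is the overhead the talk warns about for groups with many memberships.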
So PG2 may not be the solution for you if you have thousands of processes that you're trying to put into a group across a cluster and that you want to send a message to. That's probably not the right solution. However, if you have maybe only one process per node going into a group, this might be fine, and especially if they don't restart very frequently, it may be totally okay.

Another insight is that PG2 is actually a distributed database. We have some data that is distributed across multiple nodes. We can query it, and we can write to it. It's a distributed database. And when we talk about distributed databases, we talk about the CAP theorem. In terms of CAP, PG2 is AP: it's available and partition tolerant. The reason it's AP is that each side of a cluster partition will only see groups and memberships from nodes that are reachable. However, PG2 is eventually consistent, in that it will automatically heal from any partitions. Again, as I mentioned before, when a new node joins, we merge all the data from those two nodes together.

This is one of the reasons why PG2 can be so simple: process groups are uniquely easy to distribute, due to monitors and the fact that conflicts can be easily resolved by merging. Normally in a distributed database you have to determine what to do when there are two conflicting writes. You have to determine what to do when a new node joins or a node leaves, right?
With process groups, the semantics are basically "if I can't reach this process, I don't want it in my group locally anyway." When we see that the process has gone down, thanks to the monitor, we just remove it from our local view of the data, and that's fine. And when we want to resolve conflicts, we can just merge all the data together, because these PIDs are unique. It's very easy to handle the kinds of problems that normally show up in a distributed database.

So I'll leave you with that. Overall, PG2 is an amazingly powerful tool, but you should be aware of the caveats before you use it. That's maybe one of the reasons why what Chris was talking about today is so cool. When we move to a totally decentralized, lock-free implementation of process groups, we can have process groups with maybe thousands of processes in them, and that's not a problem.

You should check out RePG2 for more examples. I wanted to put a whole bunch of code on the screen today, but that never really turns out so great. So check out RePG2 for more examples, including how to build a distributed test suite, which I think is something that people are probably interested in seeing how to do. RePG2 has a little bit of extra code that handles starting and stopping the other nodes in the test for you, so you don't have to manually start nodes when you run your tests, which is kind of nice.

And that's my talk. So thank you all for coming. I'm happy to answer questions about this talk, or my work at FanDuel, or `mix xref`, or `mix test --stale`, or any of the other stuff that I've worked on.

Hi. How do you connect Erlang nodes to each other?
That is, if you have an autoscaling deployment, since you need to be able to connect nodes together to create a PG2 group, right?

Yeah, sure. We actually do this on EC2 at FanDuel. The way we implemented it is that each node that should be connected to the others is given an EC2 tag. Then, when a node starts, one of the workers that's added to its supervision tree goes and queries the EC2 API every 30 seconds for all the nodes that have that tag, and it just tries to connect to all of the ones it isn't already connected to. So it's pretty simple.

I guess, expanding on that, how does it signal to the cluster that it is actually gone and not just disconnected?

Okay, sure. So part of your OTP configuration, your Erlang configuration, is a tick rate. I think the default tick rate is 15 seconds, and the way that works is that every 15 seconds, every node that's connected in your cluster is going to essentially ping every other node. What's also configurable is the number of these ticks that can pass by without a response from another node before that node is considered no longer connected to the cluster. So once that number of ticks passes, the node is considered gone. It can come back, in the case we were talking about before, once the network issue or whatever it was resolves. Every 30 seconds, when it checks the tags to find all the other machines, it'll eventually come around and find a node that it can connect to, and it'll reconnect.

When you're starting to design an application, and you know that at some point you're probably going to want to use PG2, is it better to design it that way and use it from the start? Or do you find it's easier to assume you're only running everything locally and then add PG2 on later?

Yeah, good question.
I think one of the really cool things about Elixir and Erlang in general is the location independence. You don't really need to know what node a process is running on if you have its PID, because the PID contains information about the node. So what I would say is that as long as you provide some kind of interface that's like "do thing for name X with data Y," then even if initially that only runs the code in the current process, or sends a message to a single local process, in the future you can enhance it to get the members of a PG2 group and do something more complicated. And in general, adding PG2 to an application is pretty easy. It's really just adding that group create and that group join to the `init` function of the processes that you want to be in a group. So I don't think you necessarily need to design it with PG2 in mind initially, but when you find that you need the tool, it's usually pretty easy to add.

Just a quick follow-up to that. Using PG2 for a singleton sort of process, like a process group that always contains exactly one process, is that a pattern that's used commonly?

Yeah, I mean, I think there are tools that are better at doing that. Generally, if you want a single process in a group, then that's technically a process registry and not a process group. So you have the `:global` module's registration stuff as an option if you want to do that. If you only want to do things locally, you have gproc as an option.
Maybe you've seen that before. If you only want to do it locally and you want your names to not just be an atom, you can use gproc. And then there are a number of options in terms of an actual distributed process registry, and they all make different trade-offs in terms of their failure characteristics, their performance, and that kind of stuff. But yeah, I have seen that before. I have seen just a single process in a PG2 group, and it works fine. You just have to make sure that really only one process is getting into that group, which can be complicated depending on how things are set up.

I think we have time for two more questions.

Hi, can you maybe give us some examples of how you at FanDuel use PG2 or distribute your applications?

Yeah, sure, absolutely. We use PG2 in multiple projects. One project is basically a work queue, but instead of a queue advertising some work and workers grabbing it, the queue itself connects to nodes and runs work on those nodes, using Erlang's various facilities for doing that kind of thing. So it's kind of an inverse of how a work queue usually works. Those workers join a PG2 group in order to be found across the cluster.

Additionally, we use PG2 in another product, which is a game. We have a number of back-end servers, each of which has a process that's kind of the gatekeeper process for that node join a PG2 group. That process is responsible for returning answers to requests for information about things like how loaded this node is with games at the moment, whether this is a good node to put a new game on, and that kind of stuff.
So it can be really useful for being able to query multiple nodes for information about what they're doing, that sort of stuff. So those are two ways that we use PG2.

Anyone else? Well, thank you, guys.