So we're talking about Phoenix, obviously. And I have a cheesy tagline, new heights, because I gave a talk a year or two ago called Phoenix Takes Flight, so now we've reached new heights. But really, this talk is about where Phoenix has been over the last six months, especially since last ElixirConf. At ElixirConf last year, I talked about some exciting new things we had planned, and today is about some new milestones we've hit, plus a demo of the presence features we've implemented recently. The first big milestone is that we released Programming Phoenix. This just came out a week and a half ago. It's available as an ebook, and it should be in print this month. It was about eight months of work, and I'm super excited about it, but I'm also just happy for it to be done, because writing a book is a terrible experience; then you forget it's terrible and you want to do it again. So check it out. But one of the biggest things that happened is we hit 2 million channel clients on a single server. This was a huge milestone for us. These weren't just raw WebSocket connections: these were active PubSub subscribers running supervised channel processes that were being monitored, so we could notify the client if the channel crashed or disconnected. So this was doing quite a bit of work, and getting this kind of scale is not something I thought we would even be able to do. I'm going to talk a lot about the process we went through to get there. It took a ton of work, something like two weeks just to orchestrate the whole benchmarking setup. If you've ever tried to benchmark Elixir or Phoenix applications, it's actually very hard to even max out your CPU. To properly benchmark 2 million channel clients, we had to provision 45 Rackspace servers running Tsung, an Erlang tool for benchmarking HTTP and WebSocket connections. So we had a whole fleet of 45 servers, each opening 60,000 connections to one rather large server, and we got 2 million connections. It wasn't a tiny server: it was a 40-core box with 128 gigabytes of memory, which seems crazy, but it's only $1,300 a month on Rackspace, and a similar box would probably be $1,000 a month on EC2. And really, if you can support 2 million active users for $1,000 a month, that's groundbreaking. I think we're going to see some innovative new applications and software-as-a-service products come in and probably eat some other company's lunch because of the reduced overhead we can get. It used about 83 gigabytes of memory, so we did stress the box quite a bit, but it still had plenty of room left; we just ran out of connections to open. And if you go back here, we were actually capped at 2 million connections. We had about 2.5 million theoretical connections across those 45 servers, but when I was provisioning the big box, I set the ulimit, the file descriptor limit, to 2 million as some crazy high number; surely we'd never hit 2 million. And that was our cap, and I had forgotten I had done it. By the time we realized it was actually the ulimit, I figured 2 million was enough for the Hacker News post, and we were paying by the minute, mind you, for about 50 servers, so at that point I was done with it. It was a ton of work to do proper benchmarks. Gary Rennie is on the Phoenix core team.
He was instrumental in helping set this up and provision all these boxes, as was Justin Schneck, whose company donated the time for about 40 servers on their Rackspace account. And it was kind of fun. The app we were benchmarking was a chat application, and we had 2 million clients join the same chat room. We never thought this would actually be a usable application, and someone said, let's go try the app out now. We thought for sure that if you sent a message, it would just crash the app or something terrible would happen. So immediately José, instead of just posting a message, copied and pasted an entire Wikipedia article into the chat message input and sent it. And it actually went out to everyone in about two seconds. This is the network traffic chart that Tsung graphs out, and on the right there you can see the Wikipedia article: a huge, enormous network spike, because we had pushed a Wikipedia article out to 2 million active subscribers on Rackspace servers. So we were geeking out about being able to do this, and then we momentarily panicked, because Rackspace charges you for bandwidth, Justin's company had given us 45 servers, and we just kept doing this, sending a Wikipedia article to 2 million active subscribers across the Rackspace network. After a while we were like, wait, is this costing bandwidth internally? Since it was internal traffic it was fine, but for a moment it was like, OK, this might not be a good idea. But it's crazy that this worked. We were blown away. So this is a video I took. We had 2 million users in this chat room, and I brought up two browser tabs. When the input disappears on the left, that's when I hit Enter, and about two seconds later the message shows up in the other browser tab. Just to give you an idea of what's happening: I hit Enter, the message pushes over the channel, and then I broadcast to everyone in the chat room, which is 2 million active WebSocket connections. On the server, that means a message send to 2 million active processes, then serializing to JSON and pushing it down the tubes to 2 million connections. And this goes out in two seconds. This is not something I thought would be possible. It isn't a use case we were trying to optimize for; we were just trying to max out our PubSub system and tease out bottlenecks. So I'm blown away that this is possible. You probably wouldn't have a chat room with 2 million people, that wouldn't be a good experience, but you might have a Nerves network. If you imagine an actual use case for this, you might have weather stations throughout the state, or disaster scenarios where you're communicating information across some ad hoc network. I can't think of a use case for 2 million subscribers on the same topic yet, but the fact that we can do it is really exciting to me. The most exciting part of all this, though, was the optimization story, going from what we had initially with the Phoenix 1.0 release to that 2 million mark. That's probably the most fulfilling process of my entire Elixir and Phoenix journey. Just to give you an idea of where we started: with Phoenix 1.0, I had written the channel layer using OTP principles, but I wasn't really focusing on performance.
I hadn't really benchmarked it, just because it's so hard to benchmark these things. So Phoenix 1.0 was all about a best-effort, best-hunch guess: this should be fast. Write it, make it work, then make it fast; that strategy. We got 1.0 out, and then I was able to dedicate time to these optimizations. So I'll walk you through the optimization story. We initially set this up and got 30,000 active subscribers on one server, which was way too low. We weren't thrilled with this. This is where the crushing self-doubt sets in: Phoenix is supposed to be scalable, so what happened? Is my code terrible? Am I going to have to rewrite everything? But we immediately doubled performance, and the cool thing was we actually simplified the code. The diff here added 14 lines of code and removed 16, and we doubled performance. So I was like, OK, that's pretty cool, but 60,000 subscribers isn't the WhatsApp scale I was hoping for. Then I fired up Observer and found another bottleneck almost immediately, and fixing that bottleneck simplified the code again. We actually removed code again: added five lines, removed 38. And we ended up with an order of magnitude more performance than where we started. At this stage I'm getting pretty excited, because now we have hundreds of thousands of connections and I've actually removed code. Gary and I, while we were running this, kept saying, surely there's no more low-hanging fruit. After two almost immediate performance improvements that removed code, I was thinking this was too good to be true, that this must be some hard limit where I'd have to rewrite everything. But after that, we had our biggest performance improvement, which was a one-line code change. It increased our arrival rate, the number of connections per second we could establish, by ten times, and we added 120,000 subscribers. We were only capped at 450,000 at this point because we had started with a smaller 16-gigabyte Rackspace box to test, and we ran out of memory on that box. So the optimizations you see here, the diffs, are what it took to go from something that supported 30,000 subscribers to 2 million. And this blew me away, mainly because you drink the Kool-Aid of whatever platform you're a part of. It's unavoidable, there's nothing wrong with it, and Kool-Aid is delicious. You listen to talks, you watch videos, you read books, and we preach all these things: functional programming leads to more maintainable code, the Erlang VM is super scalable, the tooling is super robust. But actually having that live up to the hype, and all along my journey of creating Phoenix the hype has lived up to reality, is incredibly satisfying. That's what happened here: I actually removed code from an initial best-guess effort, and with very little work went from mediocre performance to world-class performance. And mainly because we were using built-in tooling; I didn't have to do any complex performance analysis or profiling on this code. What happened was, when we initially hit that 30,000 to 60,000 limit, we fired up Observer on the remote node. So we ran the server on Rackspace, and we were still able to run Observer.
If you type :observer.start in IEx, you get a GUI, and you can actually connect your local Observer to a remote node. So during an active benchmark, I fired up Observer, clicked on the Processes tab, and sorted by message queue size. Processes have a mailbox, and any time you see a bunch of messages queuing up in a process's message queue, that process is becoming a bottleneck: it has a pile of messages in its mailbox that it's still waiting to process. I saw the timer server show up at the top of this list with hundreds of messages in its message queue, and I was like, what the heck, why is the timer server here? And I immediately realized that I had mistakenly used :timer.send_interval instead of Process.send_after or :erlang.start_timer. I was doing this as part of the PubSub system for timeouts. What :timer.send_interval does is rely on a central timer server process to send you a message on that interval. So what happened was, at 10 to 15 thousand new connections a second, every connection was calling :timer.send_interval, and that timer server, that one process, was getting tens of thousands of messages per second sent to it. Instead, we could just do a Process.send_after, and once we get that message, do another Process.send_after. Just removing that gave us a five-fold increase in performance. The shape of the fix is sketched below.
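As a minimal sketch (not the actual Phoenix diff; the module name and interval are made up for illustration), the before and after look like this:

    defmodule Heartbeat do
      use GenServer
      @interval 3_000  # illustrative interval

      def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

      # Before: :timer.send_interval(@interval, :tick) here would register
      # every process with the shared :timer server, a single-process
      # bottleneck when thousands of processes start per second.
      def init(:ok) do
        Process.send_after(self(), :tick, @interval)
        {:ok, %{}}
      end

      # After: each process schedules its own next tick, so no shared
      # process is involved at all.
      def handle_info(:tick, state) do
        # ... do the periodic work here, then reschedule ourselves
        Process.send_after(self(), :tick, @interval)
        {:noreply, state}
      end
    end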
And this is how we did most of our performance optimizations and how we teased out bottlenecks. We just checked the table viewer to look at our ETS tables, and we checked message queue sizes to see queues building up. That was all we had to do to optimize from 30,000 connections to 2 million. I thought we were going to have to dip down into actual profiling or bytecode, but it was way easier than it should have been. I'm still kind of blown away that it happened this way. And I found a tweet last night, actually at midnight: Adam Kittleson tweeted that he was playing around with Observer and spotted a memory leak, and it took him 30 seconds to see what it was and where it was. And I was like, wow, this is perfect for my slide deck, so I screenshotted it and added it. Then three minutes later, a GitHub issue was opened by Adam Kittleson: a memory leak in Phoenix PubSub. So I said he was trolling me; right after I added that to my slide deck, it turned out he had been using Observer to diagnose a memory leak in Phoenix itself. Just odd timing. It turns out this is a bug I definitely have to fix, but it should be very seldom hit: if you never unsubscribe in the Phoenix PubSub layer, which most people never do, it's not a problem. But it's definitely an issue. The coolest part of the story is that Adam has been doing Elixir for quite a while, he's active in the community, but he's not on the Phoenix core team. He saw memory growing in his application, he wasn't familiar with the Phoenix code base, he fired up Observer, and in 30 seconds he was like, oh, there's a bug right here in Phoenix. For me, it was just another story of how easy it is to diagnose these issues in running systems. But now I have to fix it; that's for the flight home. Our ten-times increase in arrival rate was a one-line code change, and this is where you really need to know your ETS types. ETS is Erlang Term Storage: it lets you place any kind of Erlang or Elixir data structure into a table and reference it without talking to a process that holds the state. And we changed the table type. We started with a bag, which is like a key-value store, and going to a duplicate_bag is what gave us that order-of-magnitude performance increase. What happened is that the bag table type doesn't allow duplicate objects under the same key, so every insert has to scan the existing objects for that key, and insertion time grows linearly. A duplicate_bag allows exact duplicates and skips that check, so insertion is constant time. And for us, every subscriber entry is unique anyway, so we could go with a duplicate_bag, and we got huge improvements from that.
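In code, the change really is that small. A hedged sketch (the table name and options are illustrative, not the actual Phoenix PubSub source):

    # Before: :ets.new(:subscribers, [:bag, :public])
    # A bag scans the existing objects under a key on every insert, so
    # insert cost grows with the number of subscribers per topic.

    # After: a duplicate_bag skips the duplicate-object check, giving
    # constant-time inserts. Safe here because every {topic, pid}
    # subscriber entry is unique anyway.
    table = :ets.new(:subscribers, [:duplicate_bag, :public])

    true = :ets.insert(table, {"rooms:lobby", self()})
    :ets.lookup(table, "rooms:lobby")
    #=> [{"rooms:lobby", #PID<0.105.0>}]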
You can see, on the left-hand side, when we had the initial table type: the green line is the number of open connections and the blue line is the number of successfully established connections, and they immediately diverge during the benchmark run. On the right-hand side, they stay nice and tight. So that was an enormous improvement from changing one line of code. But at this point we hit kind of the top end of what we could do. We had almost 50 servers up and running, we could run 2 million subscribers, and we almost stopped there. But occasionally one of those smaller Rackspace servers that was opening 60,000 connections would run out of memory, because just holding that many open connections maxed out its memory, and it would crash. And when it crashed, the PubSub local server on each node would have to process 60,000 down messages at once, which would lead to timeouts on that single PubSub local process. So we were wondering: is this a real use case? Do we need to fix these timeouts? Are you ever going to have tens of thousands of users signing off at the same time? Maybe not, though at the end of a broadcast, if you're doing some kind of conferencing or video, you might have thousands of users drop off. At the same time, broadcasts to 2 million subscribers were taking five to ten seconds. At a million they took about two seconds, but at 2 million they took several, and we were thinking, wouldn't it be nice to get that back down to a couple of seconds? So we were at the limits of our ETS subscriptions and our single PubSub local process. Then Justin Schneck had the idea: couldn't we shard the subscriptions? And we immediately realized that by sharding the subscriptions, we could also pool our local PubSub processes. If you ever have a bottleneck in a process, the first thing you do is create a pool of those processes to fix timeouts. So by sharding our ETS tables, we were also able to pool the local servers. We went from an architecture where we used pg2 to broker messages across nodes to a single PubSub local server on each node, backed by one ETS table, to one where we created ETS shards, with a local PubSub server per shard managing its own table. And we still use pg2 just the same. This was about 150 lines of code to change, so it wasn't insane. Then we were thinking: what do we actually shard by? We knew we should shard, but we don't have users; these could be anonymous users. What do you shard by? But it turns out we do have a unique identifier: we have a process ID. And this is code lifted almost straight out of the Phoenix PubSub library; if you want to convert a PID to the shard it should be part of, this is all it takes. The nice thing about Erlang is that every term can be converted to a binary representation, so we can just call :erlang.term_to_binary on a PID and get its binary form. Then you check the Erlang docs, which tell you exactly what that binary representation is, and you can lift out the piece you want. When I inspect a PID, I get something like #PID<0.57.0>; if you want that number 57, you can just binary-pattern-match it out. And then picking a shard is just math: divide that value by the shard count and take the remainder.
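Reconstructed from memory rather than copied verbatim, the lookup goes roughly like this. It leans on the external term format of the day, where a local PID's binary ends with a 4-byte id, a 4-byte serial, and a 1-byte creation (recent OTP releases encode a 4-byte creation, so the offsets would differ):

    defmodule Shard do
      @shards 8  # illustrative shard count

      # Map a pid onto a shard by pulling the numeric id out of its
      # binary (external term format) representation.
      def for_pid(pid) when is_pid(pid) do
        bin = :erlang.term_to_binary(pid)
        # Skip everything up to the last 9 bytes, then read the
        # 32-bit id and ignore the trailing serial + creation bits.
        prefix = (byte_size(bin) - 9) * 8
        <<_::size(prefix), id::size(32), _::size(40)>> = bin
        rem(id, @shards)
      end
    end

    # Shard.for_pid(self()) #=> an integer in 0..7, stable for this pid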
So with that, this PID knows which PubSub server to ask for a subscription, and it knows which ETS table to write to. Sharding solved our timeout issues, because we could distribute those down messages across the shards, and it also let us parallelize our broadcasts. This is what got our broadcasts back down to two seconds: when I broadcast to room 123, I have to read all of those subscriptions out of ETS and then send a message to each of those processes. But now that ETS is sharded, we just say: what's your shard count, three? OK, spawn three tasks, read from each shard, and broadcast in parallel to anyone who happened to be in room 123. So this is what kept our broadcasts super short and also fixed the timeout issues we had at the very, very top end of those 2 million connections. So that was PubSub. Like I said, the most gratifying experience I've had, just because everything lived up to the hype. We ended up with something that was actually less code, and we went from a best-effort approach to a world-class solution with shockingly little effort. The hardest part of all this was setting up those 45 servers and forgetting to set the ulimit properly. But it was really gratifying. Following that, Phoenix Presence was my next focus, and I talked about this at ElixirConf last year. There are some tricky things to solve here, and the most exciting part for me is that we're putting cutting-edge CS research into practice. Most people, when they first look at presence, say: this is stupid simple. What do you need cutting-edge CS research for? Are you just going off and doing something fun? But there are actually some really tricky problems to solve, and I'll pitch you on why we're going with the new ideas. If we look at presence, initially you say: OK, I want to know who's online. If user one opens a browser tab and user two opens a browser tab, I want to show a list of users, right? What's so hard about this? Well, the problem is most people start here, and they'll say: to implement this, I'll just create some presence GenServer that monitors channel processes. Any time a channel joins, it calls a function and I broadcast that the user joined; any time I see that process go down, I broadcast that they left and remove them from some internal state. So they'll implement something that looks a lot like the sketch below; I've seen code just like this. When they get a join on their channel, they call something like Presence.add (this is pseudo API), which does a GenServer call that broadcasts the join event, adds the user to a map in the server's internal state, and continues on. When the server gets a down event, it broadcasts that the user left and removes them from the internal state. People will sometimes use a named Agent instead of a GenServer here, but a named Agent can't monitor, so the worst version is to trap exits in the channel and call the Agent when it crashes.
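Here's a hedged sketch of that naive single-node approach (Presence.add and the broadcast are pseudo API, as above); this is the shape we're about to poke holes in:

    defmodule NaivePresence do
      use GenServer

      def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

      # Called from the channel's join callback.
      def add(pid, user_id), do: GenServer.call(__MODULE__, {:add, pid, user_id})

      def init(state), do: {:ok, state}

      def handle_call({:add, pid, user_id}, _from, state) do
        Process.monitor(pid)
        broadcast({:join, user_id})
        {:reply, :ok, Map.put(state, pid, user_id)}
      end

      # When a monitored channel process dies, broadcast a leave.
      def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
        {user_id, state} = Map.pop(state, pid)
        broadcast({:leave, user_id})
        {:noreply, state}
      end

      defp broadcast(event) do
        # stand-in for Phoenix.PubSub.broadcast(MyApp.PubSub, "presence", event)
        IO.inspect(event, label: "presence")
      end
    end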
So best case, most folks end up there, but let's walk through the problems that happen; I've been down the same path myself. What happens if user one opens three browser tabs? We get three duplicate join events, and now what do we do? Is that three user ones? Do we show them in the list three times? Do we have the client de-dupe? That would be one solution, but then what happens when the user closes one of those tabs? They left one tab, the server broadcasts a leave, and the client removes them from the list, but they're still there, in two other tabs or on two other devices. The problem is that most people don't think of presence as allowing multiple unique presences for the same user. A user could be online from their iPad and their iPhone and their browser at the same time, or just from multiple browser tabs. A user hasn't left until they've left from every connected device. So we need a better solution. But it gets worse, because the often overlooked problem is that someone will develop this on their laptop with a named GenServer or named Agent, then deploy it and add another node, and now that presence server is running on each node and only lists the users that happened to connect to that node. We have a data synchronization problem, so this can't work. At this point, most languages and web frameworks would say: aha, I know how to handle shared data. Put it in a database, deploy Redis, problem solved. And I don't want to crap on Redis; I've used Redis many times in the past and I think it's awesome. The issue is that runtimes that don't support a great distribution story, which is most of them, push you towards bad solutions. And presence is unique in that it's all ephemeral state: either the user is there or they're not, and when they're gone, that data is gone. There's no reason to put ephemeral state into cold storage, and we have a great place to put ephemeral state in Elixir: in processes. But let's walk through it. Best case, most frameworks and languages end up with a solution where some process runs on each node and reads and writes to Redis or another storage engine. Problem solved. And this works, but there's a critical issue that most people avoid or just don't tackle at all: what if node two happens to catch on fire? Boom, node two is gone. Now all its entries in Redis, or whatever your storage engine is, are orphaned, because the process responsible for cleaning up those connected users died unexpectedly. And you could say: OK, maybe we use a storage engine that supports expiration? So you end up with more and more convoluted layers. Well, what if the user stays connected beyond their expiration time? Do we have to heartbeat every entry in the storage engine to bump its timer? Or you have every node try to clean up after any other node it saw die; but imagine you have dozens of nodes: all those nodes are competing to write to Redis to clean up after everyone else, and it becomes a non-scalable solution. So there's a better approach, because this one can't work. We have to handle a couple of problems. For local-node concerns, we have to account for multiple unique presences for the same user. For multi-node concerns, we have to handle node-down events, which almost no one handles properly, and we have to replicate the data across the cluster. This is where the cutting-edge CS research comes in. Our ideal solution is one with no single source of truth, because we don't want to rely on a central storage engine, and that in turn gives us no single point of failure. So we get real benefits: the system becomes more robust and more performant, because there's no single source of truth that everyone has to hit for data. And we can do this with a CRDT plus a heartbeat or gossip protocol. I'm not going to get deep into CRDTs; Alexander Songe gave a great talk about them at last ElixirConf, and he's the one implementing our CRDT for Phoenix Presence. CRDT stands for Conflict-free Replicated Data Type, and if you know CRDTs, Phoenix Presence implements an ORSWOT, an observe-remove set without tombstones. I'm not going to talk through that here, but check it out. What CRDTs give us at a high level is replication without remote synchronization, and this is huge. With a CRDT, you implement update strategies in a way that makes conflicts mathematically impossible. We don't need global locks or some consensus algorithm every time we add or remove data. We just apply the change, and if changes are applied out of order, or messages are missed, or messages arrive more than once, it doesn't matter: the CRDT has strong eventual consistency as long as every node eventually receives all messages. CRDTs are very, very hard to implement correctly; if the code is almost correct, it's not correct at all. They have to be mathematically correct. If you're interested, Phoenix.Tracker has a Phoenix.Tracker.State module that implements the ORSWOT; you can check that out. And then we're using a simple heartbeat protocol for replication, though we're interested in implementing a true gossip protocol. With a heartbeat protocol, every node sends every other node a heartbeat periodically, which eventually breaks down for a ton of nodes, whereas a true gossip protocol does infection-style information dissemination: each node contacts only a few other nodes, and information spreads through the cluster the way an infection or epidemic spreads through a society. But those are harder to implement, and the Erlang VM itself uses heartbeats for its own cluster membership. So if it works well enough for dozens of nodes, our approach is: let's see how far this goes, and the first person who needs more than about 50 nodes can sponsor us to build a gossip protocol.
So yeah, the heartbeat is just every node sending every other node a message saying, hey, I'm here, and we piggyback the CRDT delta on it as data changes. If no data changes, it just heartbeats with nothing. If data does change, the heartbeat carries a message saying: here are the users I've seen join and leave. And if node one sees node two miss a few heartbeat windows, it's going to say: node two is either gone or we're in some kind of netsplit scenario; I cannot see this node anymore, so I have to presume it's down, and then we clean up any local users we saw that were part of node two. But we can use some other neat things here. The problem with having no shared database is: what if we miss some of those messages? How do we detect that we've missed messages, and what do we do when we find out we're behind? We can't just go read the state from a database, so what do we do? We can use vector clocks, technically version vectors, in this case. Vector clocks seem like a very fancy name; I didn't know what they were a few months ago. I took the MongoDB approach: I read a Wikipedia article, implemented it, and it kind of worked. No, we did more than that. But they actually were pretty straightforward to understand, and the Wikipedia article is quite good. All a vector clock is, is an integer value per node: any time a node's internal state changes, it bumps its integer. And every time node one receives a heartbeat from another node, that node says: here's my current vector clock, and here's a map of the vector clocks I've seen for everyone else. These are just integer values. So let's say we heartbeat every two seconds. When node one goes to heartbeat, it checks its vector clocks. If node three went from vector clock one to two, that means node three has updates for us. And if we also saw that node four went from two to three, that means node four has changed information and has updates too. So if we find out we're behind, because we can see those clocks bump and know we missed some window, what do we do? We could send a message to every node we've missed data from, but with dozens of nodes that means sending dozens of messages across the cluster and getting dozens of data structures back in response. Instead, we can collapse the clocks. We can easily optimize by looking at those integer values: we know node three and node four both have data for us, but node four says it has seen node three at vector clock two. So we can send just node four a message saying, hey, catch me up with the data I missed, and we're guaranteed to get every change node three has as well. Hopefully that makes sense; it's just simple integer values. If we know one node has seen up-to-date information from everyone else we care about, we can send one message instead of dozens. It's an easy optimization for catching up on information we missed: whether we've gone offline, or there's been a netsplit and we've missed a ton of messages, we can ask the minimum number of nodes to catch us up, as in the sketch below.
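As a hedged, simplified sketch of that clock-collapsing idea (the data shapes and the greedy selection are invented for illustration; Phoenix.Tracker's real replica handling is more involved): given our last-seen clocks and the clock maps our peers advertise, we pick a small set of peers whose combined view covers every node we're behind on.

    defmodule CatchUp do
      # ours:  %{node => integer} - the latest clock we hold per node
      # views: %{peer => %{node => integer}} - clocks each peer reports seeing

      # Which nodes are we behind on, and at what clock?
      def behind(ours, views) do
        for {_peer, view} <- views,
            {node, clock} <- view,
            clock > Map.get(ours, node, 0),
            reduce: %{} do
          acc -> Map.update(acc, node, clock, &max(&1, clock))
        end
      end

      # Greedily pick peers whose advertised views cover all missing
      # nodes, so we send a handful of transfer requests, not dozens.
      def peers_to_ask(ours, views), do: pick(behind(ours, views), views, [])

      defp pick(behind, views, picked)
           when map_size(behind) == 0 or map_size(views) == 0,
           do: picked

      defp pick(behind, views, picked) do
        {peer, view} =
          Enum.max_by(views, fn {_p, view} -> length(covered(behind, view)) end)

        case covered(behind, view) do
          [] -> picked
          nodes -> pick(Map.drop(behind, nodes), Map.delete(views, peer), [peer | picked])
        end
      end

      defp covered(behind, view),
        do: for({node, clock} <- behind, Map.get(view, node, 0) >= clock, do: node)
    end

With the talk's example, CatchUp.peers_to_ask(%{n3: 1, n4: 2}, %{n4: %{n3: 2, n4: 3}}) returns [:n4]: one transfer request covers both nodes' missed changes.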
So that's the cutting-edge CS research we're trying to put into practice. But not everyone cares about that, right? It's super cool that we can do this, but a lot of people would just say: Redis is cool, I use Redis, let's just use Redis. And that's not the point. The point is that, at the end of the day, no one has to be concerned about any of this except the Phoenix core team; we just like this stuff. At the end of the day, the actual user-facing API is trivial. What you get out of this is the removal of operational overhead. You don't have to worry about your DevOps team deploying Redis, and you don't have to get a call at 2 a.m. because your Redis box has gone offline and now your entire application infrastructure is broken. We can just ignore all that. As long as you can deploy a node, those nodes are going to self-heal and catch up on data they missed. So it's going to make your life better, and at the end of the day you don't have to sacrifice a nice API either. This is the whole server API: if you want to show a list of users in your channel, after you've joined the channel you say Presence.track: track my socket, track this user ID, and you can even store metadata, any ephemeral state you've set in the UI, like a status set to away or online in a chat app. So if I set my status to away on my desktop, but my phone shows my status as online, which one do we pick? That's a client concern. We support multiple presences, and the client can look at the metadata and say: if you've set yourself to away anywhere, we'll make that the priority. Again, this is up to your application, but you have access to all that data. And then we can push all current users down with Presence.list. This is the actual API that we have working today, and that's all you have to do on the server. The fact that we're using a CRDT and gossip underneath doesn't matter. And since this is all backed by Phoenix PubSub internally, it will even work on Heroku, which is nice because you can't actually cluster nodes on Heroku; since we're piggybacking on PubSub, we have PubSub adapters that work there. Another thing you can do, and this is optional, is implement a fetch callback on your presence module. Imagine I have a list of users and I'm replicating this data across the cluster; I might have hundreds or thousands of users that get sent across in the CRDT. If 100 users join, I don't want to run 100 database queries to fetch extra profile information for each user. And we don't want to replicate that data, or shovel it all in as presence metadata, because then we have a caching problem; metadata should only be ephemeral state. So we give you the fetch callback, which means instead of 100 queries for 100 users, we hand you those 100 entries of user IDs, you make one database query for any data you want, and you return a map. Now you've done one database query per CRDT update that gets replicated across the cluster. This is optional; not everyone will need it, but if you need to pull in that extra information, we give you an API to do it. A sketch of how it fits together follows.
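Here's a hedged sketch of the server side (the MyApp module names, the user_id assign, and the stand-in profile data are invented; the track, list, and fetch calls are the API described in the talk):

    defmodule MyApp.Presence do
      use Phoenix.Presence,
        otp_app: :my_app,
        pubsub_server: MyApp.PubSub

      # Optional fetch callback: enrich a whole batch of replicated
      # entries with a single query, instead of one query per user.
      def fetch(_topic, presences) do
        # e.g. one Repo query over Map.keys(presences) would go here
        for {user_id, %{metas: metas}} <- presences, into: %{} do
          {user_id, %{metas: metas, name: "user #{user_id}"}}  # stand-in data
        end
      end
    end

    defmodule MyApp.RoomChannel do
      use Phoenix.Channel
      alias MyApp.Presence

      def join("rooms:" <> _id, _params, socket) do
        send(self(), :after_join)
        {:ok, socket}
      end

      def handle_info(:after_join, socket) do
        # Track this user with some ephemeral metadata, then push the
        # full presence list down to the newly joined client.
        {:ok, _ref} = Presence.track(socket, socket.assigns.user_id, %{status: "online"})
        push(socket, "presence_state", Presence.list(socket))
        {:noreply, socket}
      end
    end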
And then on the JavaScript side, it's almost as simple as the server. We have a new Presence object that you can import from phoenix. You listen on the channel, and any time the server pushes down the entire user list, you just call Presence.syncState. This is almost like a CRDT on the client; it's not technically a CRDT, but you have some initial state, and if you have a disconnect event you're going to get the whole presence list again, so syncState says: sync this state from the server with my local state. And you have an optional callback you can add for any joins and leaves that you see. Then, if you want to react to users joining and leaving, you get a presence diff event, and you can call Presence.syncDiff, which syncs just the diff of users joining and leaving; again, you can pass onJoin and onLeave callbacks to react to those differently. You'll be able to detect whether a user has joined for the first time or joined from an additional device. You have access to all of that, and it's super simple to use. And for the list of users, you just call Presence.list. Again, this is not just listing every user; it's listing every user's presences. This is where you'd be able to say: I have a list of users, but maybe someone is online on one device and away on another; you can select which one you want by passing a callback function here. And now I want to show a demo. This is super basic; all that backstory just to show you this application that's been hidden away here. It's just a simple little React app, and all it does is show a list of users, right? So now I'm going to open a new tab, and these are running on two different nodes. You probably can't see it with this font size, but this is port 4001, node one, and port 4002 is node two. We can see that nodes one and two are listing each other's information. But now I'm going to open a new tab on node two and add a user named Brian on node two. And you saw that, within a second, Brian on node two appeared on node one. We have a heartbeat of one second right now, so every second it replicates that information across the cluster. So that data was replicated across, that user showed up, and we could attach metadata about their status if we wanted to. And now, back in my two original tabs, I'm going to close Brian's tab. We should see him immediately disappear from node two, and then within a second he should disappear from node one. And he's gone. So we have the basic case: a user's process joined and left, and that works. But we've also handled the tricky case of node down. Both these servers are running, clustered together, and I'm going to kill node two now. Node one is on the top and node two is on the bottom. Node two is now gone, and we should see node one detect this failure in just a moment. There it is. Node one says it has seen a node down for node two, because it missed those heartbeats. And if we go back to our browser tab, we can see that the user who was on node two is gone from node one. So we handle that tricky edge case, the one almost no one handles properly, for you. But the cool thing is recovery. So if I go back, I'm going to have this cluster recover. Node one is still running, and I'm going to start node two back up.
Node two has lost all its information, but node one already detected: hey, a node just came up. And wait a second, this is the coolest part: they synced each other up to date with that information. We can see that node one sent a transfer request to node two, because as far as node one is concerned it saw a new node come up, and that node could have been on the other side of a netsplit; it may have new users connected to it. So they both sent a message to each other saying: hey, please catch me up with data. And then we can see that node two received a transfer acknowledgement from node one and got the replicated data about everyone else in the cluster. If we go back to our browser tab, we can see that node two is connected. Even the browser client, phoenix.js, has exponential backoff recovery, so the browser tab on the right reconnected to its channel process, which tracked the node two presence again, which replicated the data back across to the left. So there's Phoenix Presence. It looks stupid simple to the user, but there are actually a ton of problems to solve correctly. And it's not quite ready yet. This works, but we still have some problems to solve on our CRDT. The delta we send across kind of works: right now we have to send the whole CRDT instead of just the CRDT delta, which would not scale well for thousands of users. But we will solve that. I can't give a hard release date, but maybe in the next month or two I'd love to have this out. The JavaScript API works and the server API is in place; it's just a matter of implementing this correctly. And if you have experience with CRDTs or gossip protocols, we'd love someone to help verify our implementation. So that's Phoenix Presence. Now I want to talk about Ecto 2.0. This is out in beta, and there are a lot of really cool things happening that are worth mentioning. The biggest thing, at least the biggest feature I wanted, is subqueries. This is something we've really longed for at DockYard. Previously in Ecto, if you wanted to query, let's say, the average number of visits per post, you'd write a query, but then you couldn't do a sub-select from that query. In SQL you'd say select from a nested select, and you just couldn't do that in Ecto; it was impossible without raw SQL fragments. But in Ecto 2.0 I can say: given this query for visits per post, select the average number of visits for only the top ten posts. We can just say from p in subquery(query) and select the average of the visits. This lets us compose queries as subqueries, and it just works for everything; you can even give it aggregates, select the average, and it will do a sub-select on that. For me this is a huge feature, because it cleans up a lot of nasty raw SQL. Something like the sketch below.
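A hedged sketch of what that composition looks like (the Post schema, its visits field, and Repo are invented for illustration):

    import Ecto.Query

    # A hypothetical query: the ten most-visited posts.
    top_ten =
      from p in Post,
        order_by: [desc: p.visits],
        limit: 10,
        select: %{visits: p.visits}

    # Ecto 2.0: treat that query as a subquery and aggregate over it,
    # i.e. SELECT avg(visits) FROM (SELECT ... LIMIT 10) AS p
    avg_top_ten =
      from p in subquery(top_ten),
        select: avg(p.visits)

    Repo.one(avg_top_ten)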
We also got many-to-many associations, which is probably the most requested feature. You just say many_to_many, give it the join table, and it just works. So that's awesome. And we got some other great features. Concurrent transactional tests were a huge one. Before, in any new Phoenix app you generated, you had to set async: false in your tests any time you hit the database; now we can run each test in its own transaction, so you can run your tests in parallel. Tests are going to be a lot faster, and it just works; you don't even have to think about it. And now you can also insert data directly with structs. Prior to Ecto 2.0, you had to do build_assoc, building the association and then putting it into your parent. Now you can just provide the raw structs: I can give a post a list of comment structs, and Ecto will properly insert that data. And then these are just very much ideas, right? The coolest thing for me, throughout this whole PubSub and Presence process: if you watched the first ElixirConf a couple of years ago, I talked about distributed web services and service discovery, and that's something I always thought would be very far off for me to look into. But it turns out that solving these presence problems pretty much gives us service discovery; I've even implemented a proof of concept of it. What we have is Presence, internally implemented on a Phoenix.Tracker behaviour. The tracker does cluster membership, because we have to detect nodes going up and down, and we can also store data and run callbacks on nodes going up and down. That's pretty much service discovery. If I have a process that wants to say "I can do user registration", it can just use the existing presence implementation and register that presence, with its ID being the name of the service it's a part of, and that data just gets replicated across the cluster. And if we saw a node go down that happened to run user registration, we could dynamically spawn another user registration service. Or, as Joe and I were talking about, we let you store metadata on presences. So imagine your whole cluster starting up multiple workers; the web crawler is a good example, because crawlers are kind of expensive. I could start up X number of them on the cluster and then do a presence list: instead of "show me a list of users that are online", it's "show me the services that do web crawling", and the metadata could be the number of jobs each service is currently processing. Then you just sort by the least number of jobs and you have load balancing: a client would call the service with the fewest jobs in its queue. So you can do some really neat things on top of presence, outside of the web context, outside of the chat room context, as in the sketch below.
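As a hedged sketch of that idea, not a shipped feature (the MyApp.Tracker name, the "services" topic, and the job_count metadata are all invented; it assumes a tracker process implementing the Phoenix.Tracker behaviour is already running):

    # A crawler worker registers itself under a shared "services" topic,
    # with its current job count as replicated metadata.
    {:ok, _ref} =
      Phoenix.Tracker.track(MyApp.Tracker, self(), "services", "crawler:#{node()}", %{
        job_count: 0
      })

    # A client lists crawler services cluster-wide and picks the least
    # loaded one: poor-man's load balancing from presence metadata.
    {crawler, _meta} =
      MyApp.Tracker
      |> Phoenix.Tracker.list("services")
      |> Enum.min_by(fn {_key, meta} -> meta.job_count end)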
And then in the future, I really want to look into optimizing presence with a true gossip protocol. SWIM is one we're looking at, which does epidemic, or infection-style, information dissemination. I have a SWIM branch, but that's not something we want to go down until we see the heartbeat hitting some kind of wall; I think anyone who hits that wall is going to be doing extremely well, and at that point they can sponsor the project. And we've been talking with the Nerves project; I really think Phoenix could flourish in the embedded space. We're already seeing it: at ElixirConf, Justin Schneck built a Phoenix-channel-powered game board, sending gyro information from his iPhone over channels to the Raspberry Pi controlling the gyros on the board. I think we'll see really neat applications of Phoenix outside the web space, especially in the embedded space. And now that we have presence information and service discovery, you can really do some neat things in the Nerves world, but we have to find a way to make those two worlds fit together, because my world is HTTP, and the embedded world is beyond any layer I've ever operated at. But I'd love to see how we can collaborate and make some really neat cluster-type features. Throughout all this, and especially optimizing the PubSub layer, I've come away with a few thoughts. One is that good platforms drive you towards optimal solutions. We started the PubSub layer with a best-effort hunch: this should work, right? I'm using supervisors, I'm spawning processes, I'm told this is how you scale. And we ended up close to an optimal solution. At 30,000 users it wasn't great, but by actually removing code and simplifying a little here and there, we ended up with something that could support 2 million, and we didn't have to do a great rewrite. It was slight tweaks, and knowing how to actually use ETS tables properly. To me, that's the whole point of the Erlang and Elixir ecosystem: we were driven towards that solution. It wasn't "how do we implement this? let me go open a design pattern book"; it was "I have these primitives in the language and the standard library and the frameworks", and we just used them and ended up with a world-class solution. So everyone here can trust that following these principles is going to lead to fast, maintainable programs. We hear that time and time again, but it has lived up to the hype. These problems have been solved well, and they've been borne out in industry for the last 20 years. And what I've said from the beginning is that fast code does not have to equal dense code, and the PubSub optimizations proved that. This wasn't handling web requests, some standard thing we know how to solve; it was a complex PubSub implementation with sharding, but I didn't have to make my code really complex to write fast code. And the flip side of that is that productive code does not have to equal slow code. I can write code as a happy path, the beautiful code that I like. I think some people frown upon calling code beautiful, but for me coding is a very creative process, and I like to write code that I perceive as something of beauty. And I don't have to throw that away; we can keep it, and that doesn't mean we have to write something slow. We don't have to drop down into pointer arithmetic for the problems we want to solve. For me, that's an extremely gratifying experience. Really, the easiest way to put this is that good platforms let you focus on what matters, and that's your application. You don't have to add complex layers of caching. We heard stories out of Bleacher Report, where your lanyards come from: they were able to remove all the caching layers from their Rails API when they rewrote that API with Phoenix. They hit the database directly, got an order of magnitude performance improvement, and they run their whole platform on two servers. And they only run two servers because they want redundancy in case one catches on fire.
And that's just from removing caching, using the same database. So the adage that the database is your bottleneck anyway is just not true; it hasn't been borne out in the real world. For me, it means you're able to focus on what actually matters. You don't have to focus on "OK, now how do I optimize this?". You code for the happy path, and when it comes down to it and you're benchmarking, you have tools like Observer available, and profiling tasks being built into Mix, to diagnose what goes wrong; within 30 seconds, Adam found the memory leak, and you can fix it. But good platforms are nothing without good communities. So I'm really, really pleased with how the communities are progressing, both Elixir and Phoenix; it's the same community in my opinion, and we want to keep it that way. For me, the biggest happiness I get out of this whole project is that we all level up together. There's really good mentorship in the community. New users come in, and oftentimes on Slack or IRC almost everyone will say, I'm sorry, I apologize in advance for the newbie questions. And I always say that new questions are the point of Slack and IRC. It's really cool to watch a new user come in apologizing in advance for their questions; we say, no, this is what this is for, we help them out, and two weeks later I see them answering those same questions for someone else. So I think if we continue to do that, we've got a world-class VM and we can eat the world. We've got the VM, right? And we've got Nerves; I think we're going to eat the world on the embedded side. And we've got to eat the world on the web side with Phoenix, obviously. Combined with a good community, I really think we're going places, and this has been such a pleasure to be a part of. The last thing I want to do is thank DockYard. As Brian said, I joined DockYard a few months ago, and I've been working almost full time on the framework since then. So everything you heard about today, from those PubSub optimizations to presence, is directly related to DockYard's support. It's been such a good experience to be part of the DockYard team. So check them out. And that's all I have.