Hello, everyone. My name is Ravi Chandra Padmala. I also go by Nina. I work at nilenso. We build a variety of web services on the server side, across a variety of languages. I was fortunate enough to write Elixir code for about a year, spanning last year and some of this year. In the process, I gave a talk at Functional Conf last year, and towards the end of that talk someone, Mr. Saurabh Nanda from Vacation Labs, asked me what people mean when they say "let it crash", and what exactly you let crash. This talk is mostly a response to that, but you'll find that along the way it also touches upon a variety of other things.

Also, a quick show of hands: who here didn't attend Andrea's talk earlier today? Okay. All right. About half of my talk intersects with his, which might even be helpful: we can move faster on some of the slides and slower on some of the specifics, though that might also mean I talk about things that aren't necessarily on the wall in a bit more detail. There's no code on the slides and there are no pictures, so you'll have to use a lot of imagination.

First things first, very quickly, another question for everyone: is anyone here not able to confidently explain what a GenServer is? Yeah, okay, maybe let me flip that: is everyone here able to explain what a GenServer is? Can I get a show of hands? Okay, cool. That helps.

Since this is a general functional programming conference and not an Erlang/Elixir-specific one, it's worth saying very briefly what Erlang is. It's a programming language, a single programming language, but if you look at the Beam VM, with Erlang and Elixir and a couple of other smaller communities, there are a variety of languages on it. What makes it stand apart?
One of the key differences that sets it apart from most languages is how you build systems in it. During the course of this talk I'll use the words Erlang and Elixir interchangeably; my experience has mostly been with Elixir, and the right term would probably be a "Beam system" or something of that sort, but please treat the two words as equivalent here. What makes the language stand apart is largely the idea of independent processes communicating via messages and inboxes. It comes with a lot of distributed-systems tooling out of the box, and it has been doing this for a long time now; I think it's only in the last 10 or 15 years that we've finally been able to take advantage of it — or rather, more people are able to take advantage of it, and more people need it. So it's an 80s language design for modern problems.

Who uses it? WhatsApp, most famously. AdRoll, because I saw a tweet about them hiring. And if you're using RabbitMQ or ejabberd or CouchDB, you're using it too.

Getting back to the question I asked you earlier: OTP is a library that comes with Erlang and Elixir and defines a set of common patterns, or what are called behaviours, over processes — over actors. A GenServer is effectively just a process: you can send it asynchronous messages, like any regular process, but in addition it gives you a standard way to send it synchronous messages as well. Similarly, there are supervisors, which do more or less one task: restarting processes, with some parameters. Now that you have processes that can respond to messages, and processes that can restart other processes, we can talk about what started this whole talk in the first place, which is... oh, has that been there the whole time? Oh, okay. Cool. Just give me a second, sorry.
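To make both ideas concrete, here's a minimal sketch in Elixir — the `Counter` module and its messages are made up for illustration, not from the talk. It shows a GenServer handling one synchronous call and one asynchronous cast, running under a supervisor that restarts it when it crashes:

```elixir
defmodule Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)

  # Synchronous message: the caller blocks until the reply arrives.
  def value, do: GenServer.call(__MODULE__, :value)

  # Asynchronous message: fire and forget, like any plain process message.
  def increment, do: GenServer.cast(__MODULE__, :increment)

  @impl true
  def init(n), do: {:ok, n}

  @impl true
  def handle_call(:value, _from, n), do: {:reply, n, n}

  @impl true
  def handle_cast(:increment, n), do: {:noreply, n + 1}
end

{:ok, _sup} = Supervisor.start_link([Counter], strategy: :one_for_one)
Counter.increment()
IO.inspect(Counter.value())   # 1

# Kill the process; the supervisor restarts it with fresh state.
Process.exit(Process.whereis(Counter), :kill)
Process.sleep(50)
IO.inspect(Counter.value())   # 0 again - the restart threw the state away
```

Note the last line: a restart gives you a fresh process, not your old state, which is exactly the trade-off the rest of the talk is about.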
I'm using a new presentation tool. All right, sorry. These are the slides; they're not very fancy, so you didn't miss much. The punchline was: now that you have supervisors, and they can restart your processes, you can finally let your processes crash. That was the sell; that's what we were told we could do. Well, not quite. At this point — and this was something I couldn't answer for myself even two months into the project — it's difficult to say what you're letting crash: what is it that crashes, and why is that okay?

Before that, though, I think it's valuable to establish a common vocabulary across all of us here. Let's define faults as the system going into an incorrect state; errors as unhandled faults that propagate through some part of the system; and failures as the catastrophic scenario where an error finally makes the system stop responding or terminate. Let's go with this lingo from here on. When you're talking about crashes, and when you say "let it crash", you're in the space anywhere between a fault and an error. Generally, when people say "let it crash", they don't mean let the server stop responding or let it terminate. Just to be clear, that's what we're talking about when we say "let it crash".

Now, the earliest we can handle this is at the point when it's still a fault and not yet an error, and there are some patterns for dealing with faults. I have a random sample of fault-tolerance patterns here. They're useful to talk about, and a lot of them come from distributed systems, in that these are patterns you would use for fault tolerance in distributed systems — though we're not really talking about distributed systems today, it will keep coming in and out of the conversation. So: there's load shedding, or overload protection; circuit breakers; bulkheads; and fail fast as a pattern.
For load shedding and circuit breakers: load shedding is listeners dropping messages when they recognise they're under too much load, and you can potentially use some library or other for this. Circuit breakers are about temporarily avoiding calls to services that you know are unlikely to be up right now, and there are libraries for that as well — there's one literally called circuit_breaker. The other two patterns are probably something you'd do yourself; a library can't help you too much, but then there isn't much to do either. Bulkheads means having separate processes, or separate pools, for different kinds of work. If you have a pool of workers, maybe you want three pools instead: one dealing with write traffic, one with read traffic, and one with some sort of critical traffic that is more important than either of the other two. Similarly, anything they use can be partitioned: if you have connection pools or caches or other processes that these depend on, you can bulkhead those too, picking and choosing based on what you need. Failing fast is what it says: you fail as soon as you know you're going to fail, instead of letting more things happen before you finally fail. There isn't much more to say about that.

Now that we have some ways to deal with faults, can we confidently say "let it crash"? Not quite. We've defined what crashes are, but we haven't defined what crashes. Yes, a crash is something between a fault and an error, but what goes there, what makes it go there, and how is that even okay? Well, it's the usual stuff: unsafe input, external resources, divide by zero, any other sort of bug, effectively. But wait — aren't these bugs? Are we saying bugs are okay now? Not quite. And this is where — a lot of this is covered in Andrea's talk again; if anyone on the internet hasn't watched that, please watch it.
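As a sketch of the bulkhead idea — pool names and the `max_children` limits here are made up for illustration — separate `Task.Supervisor`s give each traffic class its own pool, so exhausting or crashing one cannot starve the others:

```elixir
# Three isolated pools: read, write, and critical traffic each get their
# own Task.Supervisor with its own concurrency cap.
children = [
  {Task.Supervisor, name: ReadPool, max_children: 50},
  {Task.Supervisor, name: WritePool, max_children: 10},
  {Task.Supervisor, name: CriticalPool, max_children: 5}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Work is dispatched to the pool matching its traffic class; a flood of
# writes can fill WritePool without touching ReadPool's capacity.
task = Task.Supervisor.async(ReadPool, fn -> :read_result end)
IO.inspect(Task.await(task))   # :read_result
```

When a pool is full, `Task.Supervisor` refuses to start more children, which doubles as a crude form of load shedding at that partition.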
So, are we saying bugs are okay now? Not quite. But I think it's valuable to distinguish between business-logic bugs and availability- and reliability-related bugs. This is not a complete set — these aren't the only two kinds of bugs you'll have — but it's worth making the distinction for the purposes of our conversation right now. I'm not saying the latter kind is okay and can be ignored; rather, it's more often the kind where, yes, restarting or retrying is fine, and is in fact the way to solve the problem, in that there isn't much you can do in your logic or your code to handle it.

Availability- and reliability-related issues can show up as, say, being unable to get a connection from a connection pool within your timeout, and the timeout causing you to raise an exception, because there's nothing for you to meaningfully do at that point. This isn't something you see on a daily basis; it happens once you've reached your system's limits and gone beyond them. In a situation like that, there's very little you can do in code, and you just let things restart or retry.

So at this point, maybe we can say "let it crash": we know what is going to crash and when, and we've narrowed it down a little. But what ended up happening for us was that we looked at all these things while building our system and thought, yeah, this is all taken care of, because we're paying attention to these things as we build. But there's still some care that's necessary.
When we design our system, beyond just noting at a high level that these kinds of bugs, or these kinds of circumstances, will cause certain processes to restart — and that that is acceptable — the consequences, depending on the design of your system, can go beyond that individual process, potentially taking your entire system down. To avoid that, we had to spend a lot of time thinking about how we design process trees, and that will be the majority of the rest of this talk.

There are a couple of ideas we can borrow from distributed systems and apply here as well — the kind of thing you'd do at a company running web services at scale. As you grow, you end up tearing out, or teasing out, parts of your system into smaller parts in a specific fashion: the single function that is your bottleneck becomes a separate service, or the stateful part of your service becomes a separate service. In a similar fashion, I think you can take a lot of what we've learned from distributed systems and apply it to process trees within a single node.

So you want to separate and pool expensive resources: connection pools, caches. This is something you'd do in any other language as well, but the idea behind separating them into their own processes is that they're expensive to retry or recompute. Once you lose state — whether that state is a connection, because connections are expensive to set up, or a cache, which is far more difficult to repopulate — you're at a loss, and restarting that process is a lot more expensive than restarting your stateless processes. So you do want to keep those separate, and you want to pay attention to making your stateful processes more robust, maybe even more robust than your stateless processes.
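One way to sketch this separation — `Cache` and `Worker` here are invented stand-ins, not the talk's actual system — is to keep the expensive stateful process and the cheap stateless one as siblings under a `:one_for_one` supervisor, so that restarting the stateless one never touches the cache's state:

```elixir
# An expensive-to-repopulate stateful process.
defmodule Cache do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  def put(k, v), do: GenServer.call(__MODULE__, {:put, k, v})
  def get(k), do: GenServer.call(__MODULE__, {:get, k})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:put, k, v}, _from, s), do: {:reply, :ok, Map.put(s, k, v)}
  def handle_call({:get, k}, _from, s), do: {:reply, Map.get(s, k), s}
end

# A cheap stateless process we are happy to restart freely.
defmodule Worker do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(s), do: {:ok, s}
end

# :one_for_one - a crashing Worker is restarted alone; Cache keeps its state.
{:ok, _} = Supervisor.start_link([Cache, Worker], strategy: :one_for_one)
Cache.put(:greeting, "hello")
Process.exit(Process.whereis(Worker), :kill)
Process.sleep(50)
IO.inspect(Cache.get(:greeting))   # "hello" - survived the worker crash
```

If the two concerns lived in one process, the same crash would have cost you the cache as well.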
To the point that — I mean, there's the obvious scenario that most people don't account for or design for, which is when you depend on something too much. Say you have a Redis cache with a 98% hit rate, and that's what's keeping all your systems up: what happens when it goes down? Most of us encounter this issue at some point. You do want to design for that, but you also want to design so that it doesn't fail in the first place, so you don't have to deal with the consequences.

You also want processes to be isolated in general, and I haven't found a general principle for figuring out how to isolate processes; it depends slightly too heavily on what domain your business, your code, is modelling. We'll talk about specific examples in a bit.

You also want processes to handle messages idempotently — or, if I may use the word, "rejectably". In certain cases, if you have message queues and workers or job processors working off them, you can do retries, and idempotency helps there; that's a pretty straightforward approach. You put some task in a queue, you try it, you fail, and because it's idempotent, you can try it again. There are other scenarios where, say, you have an HTTP API, a client calls you once, and you fail for some reason. You could retry, but you're not sure that's what you want to do. You want to say "I've failed", in whatever fashion — by returning a 500 or whatever — and allow the client to decide whether they want to retry. You want retries to be possible, either from the client or within yourself, and you want the consequences of all this to affect a small surface area.
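Here's a tiny sketch of the "rejectable" half of that idea — `Handler` and its message shapes are hypothetical. Instead of crashing or silently retrying, the server replies with an explicit error, which plays the same role as returning a 500 and letting the client decide:

```elixir
defmodule Handler do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def handle(req), do: GenServer.call(__MODULE__, {:handle, req})

  @impl true
  def init(s), do: {:ok, s}

  @impl true
  def handle_call({:handle, req}, _from, s) do
    case req do
      {:ok_input, v} -> {:reply, {:ok, v}, s}
      # Reject anything else explicitly - the caller chooses whether to retry.
      _ -> {:reply, {:error, :rejected}, s}
    end
  end
end

{:ok, _} = Handler.start_link(nil)
IO.inspect(Handler.handle({:ok_input, 42}))   # {:ok, 42}
IO.inspect(Handler.handle(:garbage))          # {:error, :rejected}
```

The key property: a bad request produces a value the caller can act on, not a crash that propagates.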
The most important thing we learned — which was maybe not even that surprising, actually — is that you want to communicate very mindfully across your processes. You want to pay attention to queueing in process inboxes at different layers, and this blows up very quickly as you have more processes. We had a system that was more than just a bunch of request processes satisfying requests and dying: there were some long-running processes, and some others they coordinated with, and so on. With that kind of setup, you end up having to deal with distributed-systems problems, but within one system, and at every layer. Effectively, you can almost treat every process as a microservice, mentally, because that's the amount of attention you often — or at least sometimes — want to pay when designing its interfaces. You want to think about queueing, and about capacity planning across the entire system. If you have a connection pool of some size, and a cache with some number of replicas, you want to make sure that works with whatever your inflow looks like.

So, taking these as ideas we can apply when designing processes, let's try to build a web server. This is maybe even a very obvious example; we spoke about it a couple of minutes ago, and we saw some very pretty slides earlier today about a very similar thing. We've said we want isolated processes, so the obvious step is one process per request, and we know this is fine; it seems like a straightforward implementation. We've said we want messages to be either idempotent or rejectable, and we've spoken about this example before: we're able to, say, return a 500 to any client calling us and allow them to retry. So, cool: at this point, we have a reliable request-handler process.
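The one-process-per-request idea can be sketched like this — `ReqSup` and `handle_request` are invented names, and real servers like Cowboy do considerably more. Each request runs in its own unlinked task, so a crash in one request is contained and translated into an error reply, the "return a 500 and let the client retry" behaviour:

```elixir
# A supervisor that owns one task per in-flight request.
{:ok, _} = Task.Supervisor.start_link(name: ReqSup)

handle_request = fn fun ->
  # async_nolink: the request process crashing does not crash us.
  task = Task.Supervisor.async_nolink(ReqSup, fun)

  case Task.yield(task, 1000) || Task.shutdown(task) do
    {:ok, result} -> {:ok, result}
    # Crashed or timed out: reject with a "500" and let the client retry.
    _ -> {:error, 500}
  end
end

IO.inspect(handle_request.(fn -> :page end))          # {:ok, :page}
IO.inspect(handle_request.(fn -> raise "boom" end))   # {:error, 500}
```

The second call logs a crash report, which is the point: the fault is visible, but it stays inside one request's process.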
Yes — but if any of you have written a web server, you'll have a question about what to do with the connection-accepting processes. Do you even want to be bound by one process there? Do you want one thing listening for new connections and firing off new processes, or do you want to parallelise that? All right, so maybe you have a pool there as well, isolated from your request processes. Is this it? Well, sort of. I think we've built a part of Ranch, the generic socket acceptor library that you can build things on top of, and that Cowboy is built on top of.

But this is a pretty narrow example, so let's take the "hello world" of Erlang: building a chat server. To quickly define what a chat server is — if people are familiar with IRC, let's say we follow that design. We have rooms; you can join rooms and leave rooms; rooms are places where a number of clients are connected, as long as they've joined and not exited; and when anyone sends a message to a room, that message gets fanned out to everyone else in the room. This is our design; this is what we want to build.

Some interesting things about this problem, as we're seeing it — I've listed them down so we don't have to think about it. We have processes sending messages to other processes, most likely, assuming we want to separate our client-handling processes from our room process. Some processes are stateful: this room process clearly has some state — it knows who's connected, what messages are enqueued, and what messages have been fanned out. If you want to go one step further and build a messaging platform for the present day, you'll end up having to store history somewhere, and that might be an entirely different process altogether. And you have dependencies across processes: all of these obviously depend on each other.
The clients need the room to exist if they're trying to join it; the rooms may need the room-history process to exist, since the clients will need it if they want to look at history; and so on. And all of these processes may share external dependencies: if you want to, say, push every message to Kafka so you can do some analysis on it and figure out automated response suggestions, or whatever it is people are doing these days, all of your clients end up depending on whatever that Kafka process is.

So let's pick apart each of these points. We've said we have processes sending messages to other processes, and we want processes to be isolated. If we look at the design we've described so far — one process per client, one process per room, and say one process per room history, if we're doing that — your processes are isolated to some degree. If a client dies, your room should be able to handle that fact (we'll get to this in a second) and just keep fanning out messages to everyone else. Similarly, if the room-history process dies, you should still be able to have a conversation with people, even though you can't see your previous messages.

And we've said processes should deal with messages in an idempotent or rejectable fashion. As long as the message has entered your system — it's not always entirely straightforward, but it's often, and in our case certainly, straightforward to make it idempotent. Once I say "hello", and once you've tagged that message, it's easy to make sure that everything behind that point treats the message idempotently.

Some processes are stateful, so let's talk about that. We've said we'll separate and pool expensive resources: connections, connection pools, caches.
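The tagging idea can be sketched like this — `Dedup` and its message shape are made up for illustration. Each message carries an id; a replayed id is recognised and ignored, so anything upstream can retry freely:

```elixir
defmodule Dedup do
  use GenServer

  # State: {set of ids already seen, log of accepted texts (newest first)}.
  def start_link(_),
    do: GenServer.start_link(__MODULE__, {MapSet.new(), []}, name: __MODULE__)

  def deliver(id, text), do: GenServer.call(__MODULE__, {:deliver, id, text})
  def log, do: GenServer.call(__MODULE__, :log)

  @impl true
  def init(s), do: {:ok, s}

  @impl true
  def handle_call({:deliver, id, text}, _from, {seen, log}) do
    if MapSet.member?(seen, id) do
      {:reply, :duplicate, {seen, log}}   # replayed message: no effect
    else
      {:reply, :ok, {MapSet.put(seen, id), [text | log]}}
    end
  end

  def handle_call(:log, _from, {seen, log}), do: {:reply, Enum.reverse(log), {seen, log}}
end

{:ok, _} = Dedup.start_link(nil)
:ok = Dedup.deliver(1, "hello")
:duplicate = Dedup.deliver(1, "hello")   # the retry is absorbed
IO.inspect(Dedup.log())                  # ["hello"]
```

In a real system the seen-set would need an eviction policy, but the shape is the same: tag at the boundary, dedupe behind it.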
Stateful processes are also expensive, and expensive in very different ways depending on how you acquire that state and how you maintain it. In certain situations, you can actually throw away stateful processes just as easily as you would stateless ones — assuming you're able to somehow rehydrate them (I'm making up a lot of words during this talk). If your process is rehydratable in some fashion, and the cost of rehydration is low, it's entirely possible to treat these stateful processes with maybe even slightly less respect. But you do need that, as long as you care about the state. There are also situations where you don't: maybe you can run in production at full load without your cache being up, because your system is designed to do that and only uses the cache as an additional optimisation. Then you don't need to regenerate the cache when you come back up; it repopulates over time, as it naturally would.

You also want these stateful processes to be isolated, and you especially want them to handle messages idempotently, because if your state is broken, that can have consequences across your system in very strange ways. And, to the degree possible, keep your stateless processes separate from your stateful processes, however tempting mixing them might be. Simply from a system-design standpoint, you want to be able to restart a stateless process even when you don't need to, or to be able to say "I can run 10 of these" without having to replicate data across them. You might not need that today — having one room process might be fine right now — but you might need 10 later. There's a reason rooms have member limits on most of these communication platforms, and maybe to handle that you want rooms of rooms, and so on. So you definitely want to keep your stateful processes as separate from your stateless ones as possible.
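Rehydration can be sketched like this — `Rehydrated` and `load_from_store/0` are invented, with a plain function standing in for whatever durable store (database, event log) the state really lives in. Because `init/1` rebuilds the state on every start, restarting the process is cheap enough to tolerate:

```elixir
defmodule Rehydrated do
  use GenServer

  # Stand-in for reading from a database or event log.
  def load_from_store, do: %{greeting: "hello"}

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def get(k), do: GenServer.call(__MODULE__, {:get, k})

  @impl true
  def init(nil), do: {:ok, load_from_store()}   # rehydrate on every (re)start

  @impl true
  def handle_call({:get, k}, _from, s), do: {:reply, Map.get(s, k), s}
end

{:ok, _} = Supervisor.start_link([Rehydrated], strategy: :one_for_one)
Process.exit(Process.whereis(Rehydrated), :kill)
Process.sleep(50)
IO.inspect(Rehydrated.get(:greeting))   # "hello" - state rebuilt after restart
```

How much respect the process deserves then comes down to how expensive `load_from_store/0` really is.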
We've also said we have dependencies across processes; we've spoken about this briefly too. The key thing here — and we found this out the hard way, but thankfully very early on; the fact that you find these issues early is, I'd say, a direct effect of Erlang putting them front and center all the time, because you're always dealing with distributed-systems problems on your laptop — is that when you have dependencies across processes, you want to communicate very mindfully. The most trivial failure is when one process makes a synchronous call to another, and the caller dies because the synchronous call has a timeout and it fired. This is easy to debug and easy to understand, and you typically won't hit it until you have some amount of scale, at which point you go, "oh wait, call has a timeout — I forgot."

The more contrived case — and one you might hit sooner than you'd imagine — is when, in a more complex system, a message comes in from the external world and hops across different processes, each with its own assumptions, and some of those hops aren't even synchronous. Maybe your job processor, which receives its jobs asynchronously, eventually makes a synchronous call to some external dependency, and ends up holding onto that dependency for longer than necessary, with the result that something that deals with actual traffic fails. And even this is a slightly simplified version of what we had to deal with. So you really do want to give these processes all the respect you'd give services, so you don't have to face the music later.

And, as we've mentioned, these processes may share external dependencies.
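The trivial case can be made concrete — the `Slow` module here is invented, with a deliberately slow handler. `GenServer.call/3` takes an explicit timeout, and when it fires, it's the *caller* that exits; we catch the exit here only to make that visible:

```elixir
defmodule Slow do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(s), do: {:ok, s}

  @impl true
  def handle_call(:work, _from, s) do
    Process.sleep(200)   # slower than the caller is willing to wait
    {:reply, :done, s}
  end
end

{:ok, _} = Slow.start_link(nil)

result =
  try do
    GenServer.call(Slow, :work, 50)   # 50ms timeout < 200ms of work
  catch
    :exit, {:timeout, _} -> :caller_timed_out
  end

IO.inspect(result)   # :caller_timed_out
```

In real code you usually would not catch this; you'd let the caller crash and be restarted, which is exactly how a forgotten timeout propagates a failure upstream.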
Here you want to do the same thing: separate the connection pools that access those external dependencies. Maybe bulkheads are something to consider again, or any of the other patterns for dealing with what is a general distributed-systems problem. You also want to keep in mind what we said about communicating mindfully: if you're using libraries to access these external dependencies, pay attention to the communication patterns between the process that's wrapping the library and your other processes.

So, we've done all this, and at this point we know more or less what we'd have to do to build a chat server. Are we done? Well, sort of. The broader idea I was forced to digest over the ten months I worked on this project was that, while we know distributed systems come with a lot of complexity, Erlang — or in my case, Elixir — forced me and everyone else on my team to deal with that complexity at every point, and for good reason. At the end of it, once we'd addressed it, we had a very formidable system, and we were able to be quite confident about how it dealt with load and failure. So, once we digested this and acted on everything we learned, we were able to say, with some confidence: let it crash — but ensure it recovers. That's my talk. Thank you, everyone.

Can you give us some examples of the kinds of issues you saw in your system and how you fixed them?

Yeah, so I think some of them were the trivial case: at some point we were making a lot of synchronous calls in a lot of places where we didn't need to, and that was forcing failures to propagate in a fashion we didn't want — forcing processes to expect other processes to be up.
A lot of the interfaces we initially had were more tightly coupled across processes. For example, in the chat-server case, the hop between the clients and the thing coordinating them was synchronous, and then internally synchronous again, and all of this ended up with cascading failures across the system — in some situations, something as trivial as one Redis call failing took down the entire system. So one of the first things we did, very early on, was to make our interfaces as loosely coupled as possible, and that solved a lot of problems for us — something like 90% of our issues, from the early days to slightly beyond.

Comparing Erlang and Elixir, Elixir has a lot of libraries packaged with some of these patterns already built. Did you use any of those libraries, or did you build all the patterns for your design in a custom fashion?

I'm unable to recollect exactly right now, but I think we ended up using one or two libraries to deal with these kinds of problems. More than libraries, though, I think the change in thinking helped us a lot more — basically us recognising that our processes need to be more polite to other processes. Once you start thinking that way, you're not really changing too much: you still have a process sending a message to another process, but you're putting a lot more thought into how that happens.

The example you shared — idempotent and rejectable — is that a pattern, or an anti-pattern of sorts?

That's what I wanted to say about that example: there's no general thing you can do to make your message processing idempotent. It's going to be very tailored to what your message is and how it can be idempotent. In some situations, it can't be.
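The "loosen the interface" fix can be sketched like this — `Notifier` and its message shapes are hypothetical. Instead of a blocking `call`, the client sends a `cast` carrying its own pid and receives the result later as a plain message, so it no longer depends on the server being up and fast at that exact moment:

```elixir
defmodule Notifier do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Asynchronous: hand over the work and our pid, return immediately.
  def request(payload), do: GenServer.cast(__MODULE__, {:work, payload, self()})

  @impl true
  def init(s), do: {:ok, s}

  @impl true
  def handle_cast({:work, payload, from}, s) do
    send(from, {:done, payload})   # reply as a plain message, whenever ready
    {:noreply, s}
  end
end

{:ok, _} = Notifier.start_link(nil)
:ok = Notifier.request(:hello)

receive do
  {:done, payload} -> IO.inspect(payload)   # :hello
after
  1000 -> IO.inspect(:no_reply)             # server slow or down: we decide what to do
end
```

The `after` clause is where the politeness lives: the caller owns its own timeout and failure policy, instead of inheriting the server's.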
In some situations, you do want things to be exactly-once.

In many of your slides you mentioned processes, and on many of them the first point was "isolated". Aren't Erlang processes already isolated? OS processes definitely are, so I'd guess Erlang processes must be too. Did you do something more?

We didn't do anything to make processes more isolated — that's a very big undertaking, if that's what you're asking. Maybe you mean supervised? No, that's not what I meant either. The idea I was trying to convey is that isolation comes with processes, but what you want is to isolate along the correct lines. When you say your request-lifecycle process is separate from any other process, you're saying this request is isolated from everything else — and that's the isolation I'm talking about: the isolation of this request's handling in particular, not the fact that processes are isolated in general. That's true, yeah — it's separation of concerns, but in a specific fashion, in that you wouldn't want to split things into processes just because they're different concerns. You'd split them because you expect them to be stateful, or to fail in different ways, and to take different things down when they fail — and you want to draw these lines with those concerns in mind, not just in general. Sorry — does that answer your question, though?

Not clearly, no.

Okay, let me give it another shot. No, we didn't have to do anything to make processes more isolated — enough for our purposes, anyway. The idea I was trying to convey is that when you're designing your process trees, it's not that the processes are isolated, but that they're isolated in the right fashion. You can put everything in one process, in the most extreme example, and be "isolated"; or you can over-isolate and create processes for no reason. Okay. Thank you.
Thank you so much.