 and thank you for having me here. My name is Sasha, and this is going to be a talk about high availability with Erlang. And when I say Erlang, I will really be discussing things available at the runtime level, in the Erlang virtual machine. So whatever I'm discussing applies to any language that runs on this machine: Erlang the language itself, as well as Elixir or any other BEAM language. Now, high availability is a property of software. And like any other property, it sometimes makes sense; other times maybe it doesn't, maybe it's even counterproductive. So I want to first describe the kind of software that can benefit from this property. I call such software a software system, which is a vague term; it could mean different things to different people. But for the purpose of this talk, what I consider a software system is a piece of software which, first and foremost, runs in production. So I'm putting aside any kind of one-offs, experiments, prototypes, explorations. It's production-grade software. And once you put it in production, it has to run for a long period of time, possibly years or maybe even decades, and it has to operate continuously. Think web servers or backends; they are poster-child examples. You deploy them on day one, and people can use them any time of the day, any day of the year, for many years. So they have to work continuously. And they also need to do many different things: many tasks or jobs or activities are pending in the system at any point in time. Maybe your request and my request, those are two different things. Maybe we need to manage some caches, some state, run some background jobs, talk to other services. Every single thing we need to do to provide some part of the service is an activity in the system, and we need to do many of those different activities. And most of them are mutually unrelated.
Like, I pick any box out of the bunch, and maybe it will have a couple of dependencies. And even these dependencies are not permanent; they happen at particular points in time, from time to time. These boxes work together so they can provide some part of the service, but for the most part these things are unrelated. And that's what I consider to be a software system. So as I said, any kind of web backend, regardless of the business domain, scale, or complexity, is a software system per this admittedly vague, informal definition: it has to work continuously, and it has to do many different things, most of which are unrelated. Now ideally, in such a piece of software, everything would always work, for everyone. Every single task would succeed. You make your request, and the chances are 100% that you will get your response within a reasonable time. Which is obviously an illusion; we cannot expect this to happen. Things will go wrong. The system runs for a long time, so there are many chances for things to go wrong. We as humans will write bugs, we will make mistakes. And even if we could write perfect software, it runs on hardware, and hardware can fail us. We may depend on external dependencies, like databases or whatever-as-a-service, and those things can fail on their own; they are out of our control. Maybe networking fails between them; networking is always unreliable. So things go wrong, and something will go wrong; we need to accept that fact. The next best thing, of course, is to improve our chances of succeeding, to somehow proactively work towards that goal. So when you make a request, the chances are not 100%, but they are very high, that it will work for you. And even if it fails for you, the chances are still very high that, at that same point in time, it will work for most everyone else, maybe for everyone else. And this, loosely speaking, is availability.
We keep up an illusion for most people that the system is working flawlessly, and maybe some others see a glitch here and there, but people are happy. And really, this is an essential property of a software system. If a piece of software can benefit from availability, it needs availability, because the only alternative is that it fails a lot: it's frequently failing software. And if you have such a piece of software, then end users will be unhappy, they will look for alternatives elsewhere, and the system ceases to exist. So it can only really be used if it's available enough per some criteria, which of course you need to choose carefully and properly. Now, this is not an easy challenge. So how can we meet this challenge? How do we improve our chances? Trying to profit from the property that we are running many different, mostly unrelated things, we can try to isolate failures of individual things, right? A single little yellow box fails; maybe it doesn't have a lot of dependencies, so most of the rest can keep going and provide most of our service. This is what I like to call mitigation, or isolation, of failure effects. And this failing box might be some kind of permanent service in the system, a piece of software that needs to run continuously, so obviously we need to bring it back online. We need a self-healing system that can restore the full service as soon as possible; otherwise we will lose our system piece by piece, it will slowly decay. Those two things, in my opinion, comprise the fault-tolerance property. And it goes without saying that we need to send our responses, or execute these tasks, within some reasonable time, whatever that means from case to case. Because if, I don't know, milliseconds somehow turn into minutes or maybe days, technically we could argue that the system is working.
But for all intents and purposes it's useless, it's worse than useless: it spends time and the results are not used. And we need scalability so we can address the load challenge, right? Because hopefully, once we put the system in production, it will become more used; more people will be attracted, they will arrive, the load will increase, and we will reach the maximum capacity of our current hardware. Obviously we want to address this quickly, by adding additional hardware and having the system take advantage of it immediately, without us needing to change a single line of code. And those are, in my opinion, the most important properties that can improve our availability. It's not binary, of course, but the more we strive for each individual property, the more we improve our chances of success. And Erlang is a technology made particularly for this challenge, right? That's the reason why we have Erlang, and this is the challenge it has been solving in production for some 20 to 25 years, in large, diverse systems, in many different domains. The technology has really proven itself over time, it's well battle-tested, and it has of course been continuously improved and evolved in the process. It is also the only technology, to my knowledge, which tackles the challenge at the runtime level, at the very foundation of the software, making some guarantees, providing some tools, and making some trade-offs particularly well suited for software systems. That's the reason why I personally consider Erlang the default choice for any kind of software system, which also includes any kind of web backend or server. It's the first thing I will look for, because it's the tool made particularly for that job. And I want to make it perfectly clear that it does not happen by magic, right? It's not like you write your software in Erlang or Elixir and it somehow becomes available. I mean, to some extent, but not really to the full extent.
So you need to work for that. What Erlang really gives you is a set of simple, very low-level, lightweight tools, but with strong guarantees, so you can work towards that goal in a predictable, systematic way. It's kind of straightforward. Everything, all of these properties, revolves around the concurrency model in Erlang. Erlang has its own implementation of concurrency. The unit of concurrency, the concurrent entity, is the Erlang process, which is not an OS process and not an OS thread; it's a much lighter thing. You create a single process within a couple of microseconds, and it requires about two kilobytes of memory for its stack and heap, which can of course grow if it needs more. And you can really create these things in abundance, in piles: something like 134 million is the hard limit on a single instance of the Erlang VM. So you can really run a lot of these things. A process is a computational thing: a sequence of expressions, like a single-threaded program. Every piece of code in Erlang or Elixir runs in some process, and one process can create additional processes. So you have your typical code, flowing top to bottom, a bunch of expressions, and somewhere in there you can spawn another process. I invoke this spawn function, I pass some lambda, and this creates a separate process; the lambda runs in there concurrently to me, possibly even in parallel. As soon as that process is started, I can move on and do some other stuff, and when the lambda is done, that process terminates. It's a pretty straightforward thing. And these processes are completely separated. For me, in Erlang, concurrency is completely synonymous with the word separation. Two concurrent things are two separate things: they are born separately, live separately, have their own separate execution paths, terminate at their own convenience, and share nothing. There is no shared memory.
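As a rough sketch of that spawn call (shown here in Elixir; Erlang's own spawn/1 works the same way):

```elixir
# spawn/1 takes a zero-arity function and runs it in a brand-new process;
# the caller gets the new process's identifier back immediately and moves on
pid = spawn(fn -> :math.pow(2, 10) end)

# the lambda runs concurrently; its process terminates when the lambda returns
IO.puts("spawned #{inspect(pid)}, moving on")
```

The caller never waits for the lambda to finish; the two execution paths are fully separate from this point on.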
They really are two separate things. This is very important from the standpoint of software systems. But of course it's a system, right? It cannot be every process for itself; some of these things need to work together, to cooperate, and in Erlang we use communication for cooperation. Processes can send each other messages, where a message is an arbitrary piece of data: whatever term you can construct, an integer, a string, maybe a deeply nested list of key-value pairs, whatever. To send a message, the sender process needs the address of the receiver. This is the PID, the process ID, uniquely identifying a process. So you somehow obtain this PID, you shape your message, you send it by invoking the send function, and then the sender can move on and do whatever it needs to do. The content of the message is copied in memory, usually a full copy with some exceptions. This copy is placed in the mailbox of the receiver, and the receiver is not interrupted in any way. It's a sequential program; it may be busy doing something else. When it has the time, when it sees fit, it pulls one message from the mailbox using the receive construct, usually the oldest one, and it does something with the message: it processes it, it handles it. And if there is no message, receive will block until a message arrives or a specified timeout expires. So this is really like UDP, just without the networking: I have a message, I throw it to you, I move on, and you pick it up when you have the time and do something about it. On top of this we build request-response patterns. Basically we program both parties: the sender sends a message and then immediately awaits a response, and the receiver is programmed to take the message, do something with it, produce the response message, and send it back to the sender. So that's a request-response style, again without networking.
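A minimal Elixir sketch of both patterns just described: the fire-and-forget send, and a receive with a timeout:

```elixir
parent = self()

# fire-and-forget: send/2 copies the term into the receiver's mailbox
# and returns immediately, without interrupting the receiver
spawn(fn -> send(parent, {:reply, :pong}) end)

# receive blocks until a matching message arrives or the timeout expires
result =
  receive do
    {:reply, value} -> value
  after
    5_000 -> :timeout
  end

IO.inspect(result)
```

The `after` clause is the timeout mentioned above; without it, receive would wait indefinitely for a matching message.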
Using these two communication patterns and spawn, we frequently build so-called server processes, which are very important; in my experience they are the most frequently built thing in Erlang or Elixir. This is a kind of low-level sketch — we have some higher-level abstractions — but the idea is very simple. A server process is like a single-threaded server with some state that changes over time. What I do is spawn a process, and in this process I loop for a long time, maybe for as long as the system is running. I enter the loop with some state, which is again an arbitrary piece of data that I need. I somehow pick the initial state, maybe use some default values, maybe fetch some stuff from the database. Either way, I enter the loop, and then I'm receiving, awaiting a message. It's a server, right? It's a reactive, passive thing; a message has to arrive to make it do something. When a message arrives, based on the current state and the message content, I will do something. I will have some kind of switch-case, branching-like construct. Maybe I will do some computation, maybe I will fetch some stuff from the database or store some stuff to the database, maybe I will talk to different processes in the system, maybe I will send a response back to the caller. Ultimately I need to decide what my next state is, and resume the loop with that next state. Rinse and repeat, so the next message operates on the next state. A single-threaded server: it is either handling a message or awaiting one. I want to make this a bit more concrete. I could have, for example, a single server process responsible for a single bank account. At the very least, the state would be a number: the money that we hopefully have. And I implement the server process to handle particular, well-defined messages, such as a withdraw, deposit, or balance message.
In the case of a balance message, I will send a response with the amount back to the caller. And then we have a bunch of client processes, from this perspective. They could be HTTP request handlers, or maybe some batch background processing jobs, and they can use these messages to interact with this particular bank account. They can even send their messages at the same point in time, and still the messages will be handled one by one, in order of arrival, because the server process, like any other process, is a sequential thing. In this way, a server process acts as a point of synchronization for competing requests, preventing race conditions and ensuring the consistency and integrity of its own state. And it goes without saying that if we have a bunch of these bank accounts, and server processes for them, then we can handle them concurrently, maybe even in parallel. So this is how we reason about concurrency in Erlang, or Elixir for that matter. In my experience, it's a much saner approach to concurrency for the vast majority of situations: it's easier to manage, easier to grasp. And precisely because concurrency is fairly easy and manageable, and because it's cheap, we use it a lot in Erlang. We write highly concurrent systems. The very rough idea is that whatever I can somehow label as a distinct logical activity in the system should usually be powered by at least one separate, dedicated Erlang process. And frequently there's space for further subdivision, for some lower-level technical reasons. For example, when I'm handling requests in a web server, each request will be handled in its own separate process. I make an HTTP request, a process is spawned there, and the computation, the handling of my request, happens there. The response is shaped there and sent back to me, and then that thing closes on the server.
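Here is a small Elixir sketch of such a bank-account server process. The module and message names are my own, and in production you would use GenServer, which packages exactly this pattern:

```elixir
defmodule Account do
  # spawn a server process whose loop state is the current balance
  def start(initial_balance \\ 0), do: spawn(fn -> loop(initial_balance) end)

  # fire-and-forget messages
  def deposit(pid, amount), do: send(pid, {:deposit, amount})
  def withdraw(pid, amount), do: send(pid, {:withdraw, amount})

  # request-response: send a message, then await the reply
  def balance(pid) do
    send(pid, {:balance, self()})

    receive do
      {:balance, amount} -> amount
    end
  end

  defp loop(balance) do
    # handle one message at a time; the next message sees the next state
    receive do
      {:deposit, amount} -> loop(balance + amount)
      {:withdraw, amount} -> loop(balance - amount)
      {:balance, caller} ->
        send(caller, {:balance, balance})
        loop(balance)
    end
  end
end
```

Even if many clients send at the same moment, their messages are queued and handled one by one, so the balance can never be corrupted by a race.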
And if I want to handle some server-wide state — things that outlive, that extend beyond, the scope of a single request handler, like a server-wide in-memory cache, user session data, or, maybe for multiplayer games, the state of each individual game — I will have server processes for each type of data I want to manage, and then these request handlers can use messages to interact with those things, to fetch some parts of their state or make them change some parts of their state. And if I want to do some background jobs, like periodically fetching stuff from the database, populating caches, purging caches, or maybe running some long processing jobs, I will have a separate process for each of these things. If I want to talk to the database, I will have a pool of connection processes. A single connection process holds a socket to the database server, and thus it ensures that on this particular connection there can be at most one query running at any point in time. And then of course I have a pool of these things, so I can exploit the concurrency features of the database and hopefully run some things in parallel. So that's just the tip of the iceberg. There's a lot of space for concurrency, and we exploit that space in Erlang. We write highly concurrent systems. In my experience, smaller systems might go for a few hundred or maybe a few thousand of these things at peak times; medium to large ones, tens or hundreds of thousands, sometimes even millions of processes at some point in time. A lot of concurrency — that is the important part. And because concurrency is separation, and because of some other guarantees and tooling that we get with Erlang, we can get a lot of benefits with respect to availability in software systems. So, a little bit about the runtime. When I start my concurrent system, what I'm really doing is starting a single operating system process and just a handful of threads.
The most important of those OS threads are the scheduler threads. By default we have one per CPU core — it's completely configurable — and these threads run our processes. They are pumping the work, and they are of course spreading the load amongst themselves, which is how we get multi-core by default, and hopefully we are more efficient. This also makes the system vertically scalable. If I'm maxing out on the current machine, I put the system on a bigger machine with more cores, and the load should hopefully just spread. This is a natural consequence of the fact that I have taken a single big chunk of work, split it across a lot of smaller, mostly independent parts, and then given the virtual machine the opportunity to spread that load. So that's vertical scalability, which is a pretty nice property. Also, by having these small chunks of work, I get a lot of benefits in fault tolerance. What happens is that a process crashing is an isolated event. A process crashes, maybe because of some exception; everyone else keeps running, and because there is no memory sharing, the crashed thing — by default, at least — will not leave any inconsistent garbage for someone else to trip over. So you write your concurrent system, you have maybe 1,000 things running, one of them fails, you still have 999 things running, providing most of your service. Which is again what I like to call mitigation of the effects of failures: split the big chunk of work into very small pieces, and then the failure of a single piece is usually not that dramatic. Also, a crash in Erlang is not only isolated, it's also not silent, meaning anyone in the system, any process, can be notified about the crash of someone else. As a process, I can invoke some function and say to the VM: let me know if this other thing terminates.
If that happens, I will get a message, a plain Erlang message just like any other, and this message describes the crash. It says: process foo has terminated with reason bar. And I handle this message by doing something. Maybe I will start another foo in place of the previous one, to resume the service. Or maybe I will take over foo's role. Or maybe I will crash myself as well, because maybe my work does not make sense without foo's work anymore. Or maybe I will send some notifications. There are a bunch of patterns we can apply, and this allows us to approach the problem of self-healing. On top of these simple guarantees, we have some abstractions available as part of the OTP framework that ships directly with Erlang. One such abstraction is the Supervisor, which is implemented in plain Erlang, really just built on top of these simple but powerful guarantees. You could implement it yourself, which is a good exercise, but of course for production you want to use the real thing, because it handles a lot of edge cases and subtle nuances. Now, a Supervisor is a generic implementation of a server process which we can use to start other processes in the system. And usually we make this distinction between the Supervisor and the Worker. The Worker is the process doing the real work; this is where the action is happening, where you're handling requests and producing some service. And the Supervisor is what we use to start those things, watch over them, and restart them if they fail. So what happens is: I start the Supervisor and I tell it, start me this Worker with these arguments, and that is what it will do. It will also expect to be notified if the Worker terminates. And if that Worker process terminates, the Supervisor gets a message and handles it by starting another Worker with the same set of arguments, thus ensuring that the service is restored, fully resumed. And of course there's no sharing.
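That crash-notification mechanism can be sketched with a monitor. Shown in Elixir; Process.monitor/1 wraps Erlang's monitoring primitive, and the "foo terminated with reason bar" message arrives as a plain `:DOWN` tuple:

```elixir
# spawn a process that will crash on demand
pid =
  spawn(fn ->
    receive do
      :crash -> exit(:bar)
    end
  end)

# ask the VM to notify us when pid terminates
ref = Process.monitor(pid)

send(pid, :crash)

# the notification is an ordinary message describing the crash
reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  end

IO.puts("process terminated with reason #{inspect(reason)}")
```

What we do with that message — restart, take over, crash too, notify — is entirely up to the receiving process, which is exactly what a Supervisor systematizes.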
So the new Worker starts with a clean, fresh, stable state, and hopefully has better chances of working for some time. We also frequently build so-called supervision trees. A single Supervisor process can watch over multiple processes, and not all of them need to be workers; we nest these things into these OTP supervision trees. This is idiomatic Erlang. And this really gives us the chance to deal with different things separately. These are all processes, so they run separately, possibly in parallel. You could well have a situation where, on a single core, a Supervisor is recovering from one failure; on another core you're providing a service; on yet another core you're providing some different service; and on core number four you have another Supervisor recovering from another failure. So we deal with those things separately. This supervision, this error recovery, is itself an activity in the system — somewhat self-induced, but still a job we need to do — and in Erlang we do these things separately from producing the work. This also keeps the code of the worker processes more focused on the real work they need to do. They are not riddled with try-catch-everything-do-nothing paranoid constructs. We usually walk down the happy path in there, and frequently we even assert our expectations with pattern matching. We try to fail fast if our expectations are not met, and we rely on this supervision layer to help us handle unexpected failures. This is what is known as "let it crash", and it works really well in production; I've had some good experiences with it. Things fail from time to time: a crash happens, the Supervisor catches it, the error is logged, you see the error reason and a stack trace, and the system keeps going, it recovers. And of course it will fail again, right?
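A minimal supervision sketch in Elixir, using the standard Supervisor module. The worker module name is my own, and Agent is used only to give the worker some state:

```elixir
defmodule CacheWorker do
  use Agent

  # a trivial named worker holding a map as its state
  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

# the supervisor starts the worker, watches it, and restarts it if it dies
{:ok, _sup} = Supervisor.start_link([CacheWorker], strategy: :one_for_one)

old_pid = Process.whereis(CacheWorker)
Process.exit(old_pid, :kill)

# give the supervisor a moment to restart the worker with fresh state
Process.sleep(100)
new_pid = Process.whereis(CacheWorker)
true = is_pid(new_pid) and new_pid != old_pid
```

The restarted worker is a different process with a clean state; the crash was logged, and nothing else in the system had to stop.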
So this just keeps the system afloat when things go wrong. It will of course fail again in the same set of circumstances, so we need to fix the underlying issue, and until we do, we will have these failures repeatedly. This is where I get a lot of help as well, because a system can self-heal, but it cannot self-fix a bug, or self-remove a performance bottleneck if that was the cause. This is something I need to discover, and I get a lot of help from the VM, because the VM is highly introspectable. It has a bunch of functions exposed which I can invoke to get data about the system in general, as well as about each individual process. And built in pure Erlang on top of these simple functions, we have a tool called Observer. It's available out of the box, ships with Erlang, and it summarizes this data in a nice GUI. What's really cool is that I can start this thing on my developer machine and get data about the runtime of my production system, right? All I need is SSH access; I tunnel a couple of ports and I see it running. So there's a summary view with memory usage and the number of processes. I can see the dynamic activity of scheduler utilization or memory usage. I can see a top-like view, which is super useful for discovering bottlenecks in the system. The bottleneck will usually be the processes with a big message queue, because a big message queue basically means that messages arrive faster than the process can handle them; the queue just builds up, and that's usually a bottleneck. I can see my supervision tree, click on individual processes, and get some data about them. Most notably, I can even find out the current state of each process that interests me. And I can turn on traces. This is arguably the most powerful thing, right? I can say, at runtime: for this process, I want to see the messages it sends and the messages it receives.
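The per-process data that Observer displays comes from those introspection functions, which you can also call directly. A small Elixir example (Process.info/2 wraps Erlang's process_info):

```elixir
# system-wide data
IO.puts("processes running: #{:erlang.system_info(:process_count)}")

# per-process data: message queue length and memory footprint
{:message_queue_len, qlen} = Process.info(self(), :message_queue_len)
{:memory, bytes} = Process.info(self(), :memory)
IO.puts("my queue: #{qlen} messages, my memory: #{bytes} bytes")
```

The same calls work against any PID in the system, which is how a top-like view of message-queue sizes is assembled.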
Or: for those processes and these functions, I want to see invocations with input parameters and return values. All of this allows me to understand hairy, intricate problems which I'm not able to reproduce locally; I can get that data from production. And it goes without saying that the sooner I understand the cause, the sooner I can fix the problem, and the sooner I fix the problem, the fewer failures I will have, right? Then, of course, we even have support for deploying these things with no downtime — hot upgrades, without needing to restart the system — and this will further reduce my downtime, further improve my availability, because I don't need to stop those little yellow boxes at all to deploy patches or new versions. So we get a lot of tooling support. Erlang also has first-class distribution support. I can start a bunch of Erlang VMs — we call them Erlang nodes — and connect them into a fully meshed cluster. Once that happens, processes can communicate regardless of locality, using exactly the same communication primitives. The PID can point to a process on the same node on the same machine, or to a process on another node on another machine; it's still the same thing. So in many ways your code is prepared to be fully distributed, to run on multiple machines, because in Erlang we like to say that we are distributed from day one. What happens is that we are dividing the total big chunk of work across a lot of small independent entities, processes, and this is distribution, right? You don't need multiple machines to be distributed. Therefore, in many ways, you're already well along the challenge; you've considered it from day one. It's of course still hard to run on multiple machines, I'm not going to sugarcoat it. The big challenge here, in my opinion, is discovering the PID: who should I send this message to?
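The locality transparency can be sketched with registered names in Elixir. Everything below runs on a single node; the node name in the comment is hypothetical:

```elixir
# a process can be registered under a name on its node
Process.register(self(), :account_service)

# sending via the name uses the exact same primitive as sending to a PID
send(:account_service, :ping)

msg =
  receive do
    :ping -> :ping
  after
    1_000 -> :timeout
  end

IO.puts("received #{inspect(msg)} via registered name")

# in a connected cluster the same call reaches another machine, e.g.
# send({:account_service, :"node2@otherhost"}, :ping)
```

The sending itself is transparent; the hard part remains knowing which PID or name to send to.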
Discovering the PID basically means answering the question: who in the system is responsible for the service I need? It's not an easy question to answer, especially when you have networking involved and run on multiple machines. There are of course helper libraries, some available out of the box, some in the form of third-party libraries — most notably the work by Basho; some of the stuff they did for Riak is open-sourced and can help there. Still not an easy challenge, but at least you're heading in the right direction, and that's already a big win. And this is nice from the standpoint of availability, because we can have more machines supporting the total load, and if we're maxing out, we can add even more machines supporting the total load. And of course we can now even handle failures of entire machines and still provide some part of our service. The final bit I want to talk about are a few seemingly simple but very interesting decisions made at the runtime level. The first thing is the scheduler, right? The thread that runs these processes does many and very frequent context switches. When a process enters the scheduler — basically, when it gets CPU time — it will be there no longer than about one millisecond. No matter what it does, I/O-bound or CPU-bound, it goes out pretty quickly. So we have a lot of these context switches, and this really is an interesting decision, and I think a sensible one from the standpoint of a software system. We're running many different, mostly unrelated things, and these context switches prevent a single activity — even a super busy, super focused one; maybe I'm calculating pi to billions of decimals — from taking over or paralyzing a significant part of the system, or maybe even the whole system.
So Erlang favors fairness of resources for everyone at the expense of maximum efficiency of any single one, which I think makes more sense from the standpoint of a software system. And also, because there is no shared memory — in particular, even the messages passed around are copied in memory, so there's practically nothing shared between these things — Erlang can avoid stop-the-world garbage collections. They don't happen; we don't have whole-system pauses, right? What happens instead is this: a process is in the scheduler and needs more memory. I mentioned at the beginning that a process starts with about two kilobytes for stack and heap. If that's not enough, then prior to the expansion there will be what I like to call a micro GC, taking place in a single scheduler. Meanwhile, all the others might as well be doing some meaningful work, providing some service. Both of these features mean that your response time can be more stable, right? You build your system, you fine-tune it to some desired performance, and you can expect fewer surprises, fewer gotchas, less variance over time as your system is running. Which is again nice, because we want to provide our responses within a reasonable time. So that's it. It's actually extremely simple, in my opinion, and very down to earth, which is precisely why I like it: it can fit in my head, and I can reason about it. And still — and I have experience with this — I feel I can go a long way using just those simple principles. The basic idea is: use a lot of concurrency; concurrency is separation; run different things separately. Some benefits come practically for free, like fault tolerance, responsiveness, and vertical scalability. For the others you need to work — you're headed in the right direction, but you still need to work. And of course, that boils down to: how available do you want to be?
So these are the things I usually like to talk about when discussing Erlang and Elixir, and these are also the things I like to think about when coding my systems. Of course code is important — code is the low-level mechanics, but still important. There are a bunch of good resources out there. I'm going to take the opportunity and promote my own work. This is a discount code which you can use for the next few days — not sure for exactly how long — so if you want to grab the book, now would be a good chance. I have also brought two hard copies as a giveaway. I have them backstage, and I don't have any competition in mind: basically, once the Q&A is done, just find me, and if you're among the first two, the book is yours. And that's basically all I have. So thank you for your attention. Okay, so a couple of questions. The first one is: how do you think Elixir/Erlang relate to HTTP-based microservices architectures? Oh, right. So some smart people in the Erlang and Elixir community tend to call these processes nano-services. One could argue that microservices are kind of a first-class concept in Erlang, because the runtime is a first-class concept in Erlang. These processes are used to organize the runtime, and in my opinion, this is what microservices are used for as well. So it's a much lighter thing, and I think you can go really far without needing those microservices, which to me feel more like an improvisation on top of other runtimes that have no concept of managing the runtime, right? They just run this one thing. So yeah, basically, I feel there is a relation, and one could argue that microservices existed in Erlang — like many other things, in fact — even before they were formalized. Okay, the next one: can Erlang messaging interface with other languages or VMs?
Actually it can, because distributed Erlang is network-based and the message protocol is open, so you can have, for example, a Java operating system process run and act as an Erlang node, right? You connect it to a real Erlang node, and you can exchange messages between them. So that's definitely a possibility, although I'm not sure I would use that possibility personally. If I wanted to build a more heterogeneous system — have the Erlang thing and also run some other things — there are better mechanisms in place. One of them is that you can run these other things as external OS processes: you can start them from Erlang and use pipes for communication, and you can still exchange Erlang terms, or maybe JSON if you prefer, but you can exchange Erlang terms because the serialization format is also documented. Or otherwise, you can just use, in fact, the popular microservices approach with HTTP. So it can be done, people have been doing it, but I personally wouldn't go down that road. Okay, and the last one: what is Elixir not good for? Right, that's a good question. The question for me really boils down to what Erlang is not good for, because Elixir is really a thing that allows us to use the real benefits of Erlang more easily, to make us more productive. And with respect to how I started the talk, with these software systems — you know, it should be kindergarten stuff, but people tend to neglect it, or don't think about it. So I would say that Erlang is good for software systems, and this also means that maybe it's not that good for non-software-systems. Maybe you have pieces of software where it's all or nothing — say, for example, a compiler. I need to compile a bunch of files, and it can only completely succeed or crash somehow, where success is either everything compiled successfully, or you get a syntax error.
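The external-process mechanism mentioned here is the Erlang port. A tiny Elixir sketch, assuming a Unix-like system with `cat` on the PATH:

```elixir
# open a port running `cat`, which echoes whatever we write to its stdin
port = Port.open({:spawn, "cat"}, [:binary])

# we talk to the external program with the same message-sending primitives
send(port, {self(), {:command, "hello\n"}})

# the program's output comes back as an ordinary message
receive do
  {^port, {:data, data}} -> IO.puts("external program replied: #{data}")
after
  5_000 -> IO.puts("no reply")
end
```

From the Erlang side the external program looks much like a process: you send to it and receive from it, and the VM notices if it dies.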
That is also a success, because it successfully reports errors. But there is no high-availability concept here, and with this in mind, maybe some other things are more important, like raw speed: focusing on doing as much work as fast as possible. I would say that for such pieces of software, maybe Erlang and Elixir would not be a good fit. It's also worth mentioning that speed in particular is not really a super high focus of Erlang. It usually works well, but if you're doing a lot of mathematical operations, some geometry stuff, maybe you want to outsource those things out of Erlang — though you can still consider using Erlang as a control plane to manage them. Awesome, thank you so much, Sasha. A final round of applause. Thank you.