So we're gonna be talking about sticky sessions today, and maybe it wasn't such a terrible idea. My name is Dan, I'm from Boulder, Colorado. I work at a little startup in downtown Boulder. It's located right here. We're VictorOps; we're building an on-call and incident management platform. And actually Boulder looks more like this. So it's not a bad place to live. It's actually so nice to live there right now that the real estate market is really booming. Houses are being sold in a weekend with multiple offers. And so about three years ago, my wife and I went to buy a house, and we found this house on the MLS, and we went and looked at it, and the realtor said, well, don't bother, you won't be able to buy this. It's gonna be gone in a weekend. And we said, okay. Three months later, my wife sent me a text message saying, hey, that place is still on the market. And so we went back and looked at it, and I don't know if it's obvious from this photo, but the people who lived there had an interesting style. They had actually mounted a tree decal on the wall. It also had this awesome three-tone paint scheme going on, and it was really messy, and nobody could see past that. And so something that would usually have sold in a weekend in Boulder was actually on the market for multiple months. They had this other problem that everything in the house was dated. It was a house from the late 90s, but nothing had been updated since. So we had all these brass fixtures and this cloudy glass and whatnot. My wife's first comment when she came in was, shouldn't we do something about these fixtures? Go through and upgrade them to brushed chrome, or oil-rubbed bronze, or something like that. And my response, being an engineer, was: we can do that, or we can just wait for brass to become popular again and let the trend come to us, and our house will already be modern and contemporary again.
That apparently does not work. I tried that a year earlier when my wife said, I need to have a white gold band. And I said, well, why don't you get gold gold? And she said, oh no, no, no, it has to be white gold. And this kind of led me to thinking about trends and how architecture happens: we move to one decision, then back to another decision, and it's always this kind of cycle. And there are always reasons for this. It's not like people are arbitrarily choosing trends; there is impetus in the market for it. But if you don't understand the trade-offs and the decisions that you're making when you're talking about these fundamental architectural decisions, then you'll just always be cargo-culting what everybody else is doing. And so what I would like to talk about today is a fundamental trade-off that we make when we're building architectures, and that's the statefulness of that architecture. So the way that I view the world of at least web application development is a very simple architecture. I write my application in PHP or Java or Rails or something like that, and then I get a big instance of MySQL behind it. I buy a whole bunch of web servers, I hook them all up to a relational database, and I kind of call it a day. And the underlying decision that I'm making here is that I'm separating the behavior from the data. But there's another way to do it. We could actually store the data for our applications inside of the application. And this has a lot of interesting side effects. This is actually the way that VictorOps is moving right now. We started off building in the standard architectural way, and now we're moving to actually store state inside of our application. So I'd like to talk about what it means to store state, just so we can talk about where it is in the architecture. Where do we store it today?
Talk a little bit about history and what people have done in the past and how we could maybe revisit some of those ideas. Talk about some motivations for why we store state in particular places. Maybe some dangers to storing it in the application versus storing it in a relational database. And then give you some more information to look at. So when I talk about state, in its simplest terms, it's this distinction between data and behavior. If I have code that runs, the code is typically the behavior, and what the code works on is the data. Another, grander way to think of state is as an implicit coupling to time and space in your code. It's also really tricky to understand which part of an application is stateful. If you look at your code and you think about the order of operations, and that order is an important part of how it behaves, then the order of operations is part of the state. And there are a couple of ways people talk about happens-before. They build synchronization primitives so they can make sure that an application moves in a particular manner. That's all stateful behavior; that's all state you're dealing with when you're trying to build these. If you have a coupling to space, then it depends where something happens, and "where" is kind of a nebulous term. It doesn't mean that it actually occurs in a different physical location. It could be two threads that are talking together. It could be two processes. Or you could have an application deployed in California trying to talk to a server that's in New York. Anywhere you care about which server two requests hit, or whether they hit different threads: again, you're talking about state. But state is also a lens. So this is a very simple function. I wanted to keep it one line, so it doesn't have a base-case check for when it hits zero, which makes it infinite recursion. But the point is it's a simple factorial function.
And most of the time you might look at this and say this is a completely stateless function. There are no variables being assigned. But if you run this on a real computer, computers have maximum stack sizes, they have memory. And so if your maximum stack size is 100 and you try to call this function with 101, it damn well matters that that 100th iteration is happening, because at the 101st iteration it's gonna throw an exception and crash. And so in this case, the state is actually being held on the stack. In this next function, a lot of people would say that this creates a side effect, that it writes to the disk, for instance, and so this is a stateful function. Well, if I change your perspective so that you're looking at this from a web server's perspective, and somebody's coming in and making a GET request, and they don't care that this is writing to a disk, then this is a stateless function. And so whether or not you're talking about state always depends on how you look at things. And that's always important when you're considering where things should go. So when I break down the world between stateless and stateful apps, stateless apps typically store their data inside of a database, and they ship that data from the database over the network to the behavior. Then they do some work on that data and ship it back to the database. So it's kind of a data-shipping paradigm. And typical architectures look like this. I have two requests coming in. It doesn't matter what server they hit. They go to the database, they get their data, they work on the data, it goes back to the database. They're often deployed behind load balancers. And basically the majority of web applications you see on the internet, CRUD (create, read, update, delete) applications, behave in this manner. So kind of like this.
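The slide itself isn't reproduced in the transcript, but based on the description, the function is roughly the following; a minimal Python sketch (names are mine), with a base-case version for contrast:

```python
def fact(n):
    # One line, no base case: looks "stateless", but every call pushes a
    # stack frame, so the state is hiding in the call stack.
    return n * fact(n - 1)

def fact_ok(n):
    # With a base case, the recursion terminates while the stack is
    # still small enough.
    return 1 if n <= 1 else n * fact_ok(n - 1)

print(fact_ok(5))   # 120
try:
    fact(5)         # recurses past the interpreter's stack limit
except RecursionError:
    print("stack exhausted")
```

Calling `fact` with any argument eventually exhausts the stack, which is exactly his point: the "stateless" one-liner crashes or succeeds depending on state you can't see in the code.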
If I break down what's actually happening here when a request comes in, we know that the engineers wrote this part of the system. We know that they went to the internet and downloaded this part of the system, so maybe that part works. These people don't care when things break. They don't care that it's the browser's fault. They don't care that it's the engineer's code, and they don't care that it's the database. To them, this entire thing is stateful. All applications have state. And keeping it separated, maybe that's not the best thing in the world to do. So a stateful application stores the data right next to the behavior, and the data doesn't have to move when it's worked on; "moved" meaning going over a network in this case. So 10 or 15 years ago, when J2EE was still a cool thing to do, people had this concept of sticky sessions. And I should caveat that by saying that people still do sticky sessions today, but if you go on Stack Overflow and you Google "should I use sticky sessions in my application" or whatever, the popular consensus is no, you shouldn't use them. And the basic idea is that every request that comes into a web server from a particular human will go to the same server. So if I have a browser open on my Mac and I make two requests and I'm hitting a web farm, I will always go to server one. So like in this case, two different people: the guy on top is always making a call to server one; the woman on the bottom is always making a call to server three. And this makes it easier to reason about how I do caching, for instance. I can create really simple in-memory structures and just pull them out of memory, and I always know that that person's session is going to be local to my server. And they accomplish this in basically two different ways.
So in order to spread work through a cluster, you need to do some sort of lookup, because when somebody hits you from a browser, you need to know which server should handle that request. And that lookup can basically be distributed or not distributed. Generally we call this lookup a hash table, because you've taken, say, their IP address, run it through a hash function, and stored it with a server's IP address or name, and every time the person hits the load balancer or hits a server, you route them to the same server. And you're using some sort of identifier. It could be a cookie that they have in their browser. It could be their IP address, although we know that won't work well behind NAT. But some way of identifying a particular browser that's hitting your website. And so, to oversimplify the richness here, when that hash table is distributed, you might use some sort of consistent-hashing lookup, or you can send requests to any random server, and those servers know how to route to other servers, so you can kind of move the request through the farm until you find the correct server, and that's who ends up responding. A non-distributed hash table could be a centralized load balancer that just has a hash table sitting in memory; it's not sharing it with anyone else, so if it were to crash, it loses that mapping. Or it could be even as simple as a persistent connection to a server. You just open up a socket and send all of your requests to one particular server all the time, and you've kind of hidden the load balancer concept; the hash table is your computer's list of connections. So why would we even want to do this? I talked a little bit earlier about how it makes it simpler for programmers to store in-memory caches, but there are some other big reasons why stateful applications are a really interesting concept.
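As a concrete sketch of the non-distributed lookup, here's a hypothetical Python routing function (names are mine) that hashes a stable client identifier, such as a cookie value, to pick a server. Note this is plain modulo hashing, not consistent hashing: adding a server remaps most clients, which is precisely the problem consistent hashing solves.

```python
import hashlib

def pick_server(client_id, servers):
    # Hash a stable identifier so the same client always lands on the
    # same server; that determinism is all "sticky" means.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

servers = ["server1", "server2", "server3"]
# The same cookie routes to the same server on every request.
assert pick_server("session-cookie-abc", servers) == \
       pick_server("session-cookie-abc", servers)
```

A real load balancer would also handle the chosen server going down, which is where the distributed variants he mentions come in.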
So a CPU typically does something about four billion times per second. And if I look at the cost of getting information from different levels of a computer, starting with the CPU's caches that sit on die, L1, L2, L3, and then comparing that to main memory, disk, and network, and if I frame each cost in terms of seconds of activity instead of nanoseconds and microseconds, which span orders of magnitude we don't normally think in: if going to the CPU's cache takes one second, then going to main memory takes about two minutes. So that's not super long. If I go to a disk, it takes 14 hours to get that data. And if I have to go over the network, like in that data-shipping paradigm that most stateless apps are built on, it takes six days to get that data. Just another way to frame this: going to your CPU cache is like turning your head and talking to the person sitting at the desk next to you. Going to main memory is like walking across the office to get some piece of information and then coming back to sit down. Disk is like driving from LA to Portland to get some piece of information and then coming back. And then the network: that's like every time you want to answer a question, you drive to New York and drive back. Think about the amount of wasted activity that could be happening while that's occurring. The CPU is just kind of sitting there farting around. So just from a pure performance perspective, we are leaving so much on the table by keeping our data and our behavior separate from each other. There are also correctness concerns when you're programming across the network.
If we keep the behavior and the data coexisting, there are actually proofs in the distributed systems literature about how you can change levels of linearizability and serializability; we can get different guarantees on our system just by keeping things off the network. You don't have partition problems when you don't have a network involved. I also believe as an engineer that ergonomics are very important. The more tools that you have, the more that you have to deal with. A stateful architecture happens all inside processes that you understand. Not that people don't understand MySQL, but it's one less tool to understand if you keep things in native data structures. There's also a whole different realm of resilience that we can build in. One of the first conversations you'll always have when you're building out an architecture is: what do we do if the database is down? Everybody punts on that. They never think about how their code should handle the fact that where their state lives might be unavailable once in a while. And so when that does happen, most applications just absolutely crap the bed. If we keep the state and the behavior together, you can build in different kinds of resilience; you can actually change the way that you think about things and respond. Failure is a lot easier to handle. Whole classes of errors actually go off the table. There are no network-related problems. You still have to be conscious of concurrency issues, but at least those are all local to one process or one thread or something to that effect. So there are a bunch of choices on how you might do this. We talked about sticky sessions, and that was something that was codified in a bunch of different programming languages; J2EE had a really nice implementation of that. But there are a bunch of different decisions that we have to make. One of them is that we need to choose a particular runtime.
There are managed environments where building stateful architectures is not super wonderful. MRI, the default Ruby runtime, is one of those. And you'll notice that when people deploy against MRI or the PHP runtime, what they typically do is spin up a bunch of processes behind a web server, kill them periodically, and allow additional ones to serve new requests. So kind of a CGI exec model, but keeping hot processes. And it's very difficult to have a stateful architecture if you're constantly killing the memory space where you're storing the state. You also typically need some sort of threading model inside of your runtime, because stateful architectures usually have a lot more background processing going on to keep everything polished and working, so you need background threads to do that work. You also typically need some kind of control of your memory; it could be on-heap, it could be off-heap memory that you're just memory-mapping, but you need some way to actually control how much data you're storing. From a framework perspective, you also need to make choices, and these are the application frameworks that we build applications in. You need some way to support making remote calls. Unless you're lucky enough to deploy on a single server, you're probably going to have a cluster of nodes to handle the amount of load that you need to handle, and having those nodes talk to each other is pretty important. You also need to make sure that your frameworks handle concurrency, ideally as a first-class citizen and not just something that was bolted on after the fact with some really low-level mutexes or semaphores that force you to deal with a lot of the concurrency yourself. Then you need some concept of clustering. And clustering is an interesting topic in and of itself. You need some idea of membership, so your nodes all understand who the other ones are.
This can either be dynamic or static. A static cluster is something like a MySQL rollout using Galera, where you actually specify all the IP addresses of the cluster in a config file. In order to add additional capacity, you go back and reconfigure the cluster. That's perfectly fine; people do that. There's also a concept of dynamic clustering, which allows your cluster to be more elastic. You can add capacity, and it will discover the rest of the cluster and then join it on the fly. Both of those are pretty reasonable ways to do it. A couple of examples of frameworks out there that have reasonable runtimes underneath them: Akka is built on top of the JVM. It's an actor-based framework, and it gives you all of those things I talked about before. Erlang is a programming language, but it has an application framework called OTP that's designed to help you build real-time systems, and that's also really good for building stateful architectures. And then the database Riak has an underlying library called riak_core, a distributed systems framework that forms the basis of how Riak distributes data and scales. Or, more generally, it's thought of as a toolkit for building distributed, scalable, fault-tolerant applications; that was straight off the Riak Core GitHub page. You know, a big part of this talk is about the trade-offs that we make when we make these decisions. And so there are of course downsides to building stateful architectures. And it's important to know about these pitfalls before you get to them, so that you don't discover them at 2 a.m. when the problems come up. Probably the biggest problem I have seen rolling out a stateful architecture at VictorOps is serialization.
You know, when you have a database, you always feel confident that you can restart that database and get your data back, because they've spent a huge amount of time making sure that the format they write to disk can always be read back in. And so if you're storing your state inside of your application, and you hope to be able to restart that application at some point, you need to think about the way that you're serializing that state. And there are two different levels of serialization. There's writing things to disk, then changing the underlying model, and still being able to read those things back in from disk; we call that backwards compatibility. The other problem that happens, especially in a cluster, is that you will roll out new parts of your model and then try to send messages to systems that are running the old code, and they need to be able to handle that fact. That's what we call forwards compatibility: I'm receiving messages from the future, effectively, and I need to be able to deserialize them. Both kinds of serialization are extremely important when you're building a stateful architecture. You also need to watch out for thundering herds, especially when your application is starting up or, in the case of a dynamic clustering model, when it's failing over from node to node. This is one of those cases where engineers are terrible at finding it while they're working on their local machines. They very often have small workloads; they test things locally, then roll it out to production, and it craps the bed. This kind of makes sense too. It might be a bit of an obvious thing, but you're serializing all of your data to disk and then restarting and trying to fully hydrate a working database, so of course there's going to be a lot of load on the system when that happens.
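To make the two directions concrete, here's a hypothetical Python sketch of a versioned record (field names are mine, not from the talk): missing fields get defaults when reading old data, which is backwards compatibility, and unknown fields written by newer code are ignored rather than rejected, which is forwards compatibility.

```python
import json

# Current schema with a default for every field; pretend "theme" was
# added in a later version of the code.
DEFAULTS = {"version": 2, "user": "", "theme": "light"}

def serialize(record):
    return json.dumps(record)

def deserialize(payload):
    data = json.loads(payload)
    record = dict(DEFAULTS)
    for key, value in data.items():
        if key in record:
            record[key] = value
        # else: unknown field from a newer writer; skip it instead of
        # crashing on a message "from the future"
    return record

# Backwards compatibility: an old payload with no "theme" still loads.
old = deserialize('{"version": 1, "user": "dan"}')
# Forwards compatibility: a newer payload with an extra field still loads.
new = deserialize('{"version": 3, "user": "dan", "locale": "en-US"}')
```

Real systems usually lean on a schema-evolution-aware format (Protocol Buffers, Avro, Thrift) rather than hand-rolling this, but the two compatibility directions are the same.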
I've actually heard reports from people running stateful architectures in the real world where their clusters take hours to restart, and you do have to work around that. You also need to be very careful about the way that you use memory. Again, this is something we take for granted in the way relational databases or NoSQL databases handle their structures: they can page data to disk and keep in memory only what they need to be working on. But given those performance numbers I gave earlier, it's very, very tempting for engineers building a stateful application to just keep unbounded in-memory data structures. And again, it works great on their laptop, but in production it'll grow to such an extent that it starts paging, for instance, and that completely changes the performance profile of the application. This is something we take for granted in stateless architectures. The good thing is there's a ton of inspiration out there for how you might do this. Basically any distributed database is a stateful architecture. So if you need code examples, design examples, white papers, or case studies, you can read things from the Riak team or the Cassandra team, or even read the Dynamo paper, and get general ideas for how DHTs work in production. There are also some framework examples. Akka has a Distributed Data module that helps you build CRDTs into your application. A CRDT is a conflict-free replicated data type. It allows us to build eventually consistent data structures with really sane merge semantics, so that I could have a counter that increments on any node, and I know that it will eventually come to consensus without having to lock across the entire cluster. The Orleans project actually comes out of Microsoft Research; they very specifically set out to build a stateful web development framework.
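The counter he describes is essentially a grow-only counter (G-counter), one of the simplest CRDTs. A minimal Python sketch, under the usual G-counter assumption that each node increments only its own slot:

```python
class GCounter:
    # Grow-only counter CRDT: each node tracks its own increments, and
    # merging takes the per-node maximum, so replicas converge to the
    # same total without any cluster-wide lock.
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node_id, amount=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        nodes = set(self.counts) | set(other.counts)
        return GCounter({n: max(self.counts.get(n, 0),
                                other.counts.get(n, 0)) for n in nodes})

a, b = GCounter(), GCounter()
a.increment("node-a")
a.increment("node-a")
b.increment("node-b")
# Merging in either order yields the same total: replicas converge.
assert a.merge(b).value() == b.merge(a).value() == 3
```

Because merge is commutative, associative, and idempotent, nodes can exchange state in any order, any number of times, and still agree; that is the "sane merge semantics" being referred to.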
And they ended up deploying it; I think the Halo backend is written on top of this. Unison is a Haskell-based framework for building stateful architectures. Its author actually took this to the next level: in addition to being a framework, it also comes with an IDE for building your applications, and a language. So it's a language, a framework, and an IDE for building distributed systems, and on top of that you can build really nice stateful architectures. So: we talked a little bit about what it means to have state and what that looks like, because it's kind of a tricky subject. We talked about how we store things today in that two-tier or multi-tier web architecture, where things are stored in relational databases and then shipped over to the processes that use them. We talked a little bit about sticky sessions and why people didn't like them, some motivation behind why you might want to build a stateful architecture, some caveats to making that decision, and some more information about all of that. So that's all I got. But if there are any questions, I would be happy to answer them. Yes? [Audience question] ...and then when you read a request, you can relate that cookie to the cache. And most frameworks, like Rails, or Django in Python, promote these patterns. So obviously when this is done in production, these sessions are stored outside of the app server, which means, as you mention, it's wasteful to go over the network and read the stuff. So what's a pragmatic alternative to this, if someone has an application like this, which a lot of people do, to improve the performance of reading and writing to a state store? You went over some mechanisms, but they would all require a big overhaul of pre-existing architectures, and that would pretty much require a lot of buy-in from many different people just to make it happen.
So is there any alternative, or any incremental improvement upon the common pattern of just saying, let's store everything in, like, a Redis cache? Sorry, I know it's not really one question. Yeah, I think I can distill it down. The question is: if I'm transitioning from a stateless architecture to a stateful architecture, what are the interim steps? And I think the answer is you have to start off piecemeal. You know, using Redis as an interim cache and shipping things around that way is still the same pattern; you're still keeping the state outside of the same memory space, but it's certainly faster. So I think you cleave off certain parts of your system, and maybe you start building things as microservices, where the new services can be stateful and the old ones can stay stateless. And this might only make sense for certain parts of your architecture. In the case of VictorOps, for instance, we have some backing servers that are stateful, but part of our application is stateless. So it depends: if some of those benefits work for your company, then those are maybe the parts of your system that could become stateful. It doesn't have to be holistic. Yeah? [Audience question] So a user's requests keep coming back to the same node in the cluster. Is there any kind of effort to rebalance that when their session ends? Or when you don't see them for a while, do you take it off that machine, store it away, and then bring it back later for future requests, or anything like that? So, a couple of points. The question is: how do you rebalance live workloads, and how do you rebalance workloads that are no longer important, kind of like dead workloads, right? Yeah, basically those two questions. So rebalancing live workloads is a pretty interesting problem. You actually have to keep in your cluster the performance metrics of the other nodes, and then you can move things around preemptively.
For dead workloads: when we talk about keeping the state in memory, that doesn't mean there's not some way to store that state, or the way to get back to that state, somewhere else. At VictorOps, for instance, we use the event sourcing model. We basically write a log of every state change to Cassandra while it's happening, but that's just a very fast write activity. All state changes themselves happen inside of the application. So when a piece of work is effectively done and needs to go into a dormant state, we can just kill that shard, for instance, and it can always be rehydrated from the log later. Does that make sense? In our case it's more like a transaction log with snapshotting. If you were to think about it in terms of a counter, for instance, we store a log of plus one, plus one, plus one, plus one, plus one, and then periodically you might say, okay, the current count is 100, and that's a snapshot. If you look at write-ahead logs, that's how databases implement this, and the designs behind it. Anything else? Cool. Well, thanks guys.
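In code, the counter example looks something like this; a hypothetical Python sketch of snapshot-plus-log rehydration (the general event sourcing idea, not VictorOps' actual implementation):

```python
def apply_event(state, event):
    # Every state change is recorded as an event; here the only event
    # type is "add n", so applying one is just addition.
    return state + event

def rehydrate(snapshot, events_since_snapshot):
    # On restart, start from the last snapshot and replay only the tail
    # of the log, the same trick databases use with write-ahead logs
    # and checkpoints.
    state = snapshot
    for event in events_since_snapshot:
        state = apply_event(state, event)
    return state

# Snapshot said the count was 100; three "+1" events arrived since then.
assert rehydrate(100, [1, 1, 1]) == 103
# With no snapshot, replay the full log from zero.
assert rehydrate(0, [1] * 5) == 5
```

The snapshot bounds the replay work on restart, which is also why the thundering-herd problem mentioned earlier gets worse if you only ever keep the raw log.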