Hi, okay. Hey, good morning everyone, and thank you for coming. We're going to kick it off this morning with capacity and stability patterns, presented by Brian Pitts. Welcome him on stage!

Hey, so thanks a lot for that introduction. This is going to be a talk on capacity and stability patterns. But before we get into that, or get into me, I'd like to ask a few questions and get a show of hands, so I have a feel for the audience I'm talking to today. How many people here consider themselves to be primarily software developers? Cool, so most of the room. How many people would say they're primarily systems administrators or operators? All right, a few of us are represented here. How many people don't fit either of those groups, but just want the websites you use to actually work? Yeah, and if I opened that up, probably everyone would say that's really what they want.

So, hi, I am Brian Pitts. I work as a systems engineer at Eventbrite, in our Nashville office. I drove up yesterday; this is my first time at PyOhio and my first time in Columbus as an adult. It's been wonderful so far, and I'm really looking forward to the rest of the conference and getting to know you and this great city more. I'm sciurus on Twitter, if you're the sort of person who likes to follow people on Twitter, and I have a website at polibyte.com where you'll be able to find these slides after the presentation.

So, I work as a systems engineer. What does that mean? This is a visualization of our software that I made by going through our configuration management repo and pulling everything out. This is what powers Eventbrite, and it's a lot of stuff. My team is basically responsible for the care and feeding of all of these systems and this infrastructure. Hopefully there are at least a few things on there that look familiar, like Python, Sentry, and uWSGI. If Django's not on there, it should be, in big bold letters, because Eventbrite is primarily a Python shop and primarily a Django shop.

Now, what about Eventbrite? The fun thing about talking at tech conferences is that I usually don't have to introduce Eventbrite, because if you don't know what Eventbrite is, how did you get your ticket? How did they let you in the door?
We call ourselves a global marketplace for live events, which is a big ambition. In the last year we had 600,000 active event organizers, sold over 150 million tickets, and did two billion dollars in ticket sales.

Now, here's the interesting thing about events: some events are bigger than others. This is a graph of calls to one service endpoint involved in purchasing a ticket, and in Eventbrite vernacular it shows what we call an on-sale: a sudden, extreme spike in activity due to a popular event going on sale. A big part of the job for the operators and developers at Eventbrite is designing systems that can cope with this sort of traffic pattern. Now, we are far from perfect at doing this, and the presentation I'm going to give you is far from comprehensive about the ways you can tackle this problem. But my hope is to give you some concrete ideas you can take away to improve the systems you're developing and operating.

The patterns I'm going to talk about today generally fall into one of two categories. The first of those is stability, which I'm defining as continuing to process work in the face of impulses, stresses, or component failures. I said this is how I'm defining it, but actually I'm cheating: this is from Michael Nygard's book Release It!, which is a fantastic book and one of several I'll recommend throughout the presentation. In his definition, an impulse is a rapid shock to the system. It's that hockey-stick graph that's just chugging along and then goes up, up, up when an event goes on sale. Stress is force applied to a system over time. Maybe that data store you're using is on a bad disk and its performance suddenly drops by 10 percent; what's the effect of that over time?

So stability is number one. The second category is capacity, and capacity is the throughput a system can sustain with acceptable response time. Acceptable is the key word here. When I've given this talk before, I've gotten questions about how you define capacity. Sure, I can throw a hundred thousand requests at a system and they'll all eventually come through, but that's not a good way to measure your throughput if the first request only comes back after processing for five minutes, because you've saturated your resources, and all your users are long gone by then. It doesn't matter that technically you could sustain a hundred thousand requests a minute if all of those responses are going to browsers that have already disconnected.

So now I'm going to dive into some of the patterns we've applied. In the past I've opened things up for questions after each pattern, but because this is a big group and I don't want to go over time, here's what we'll do: if you have a question about a pattern or the ways we've applied it, and it's a generally useful question, hold on to it and ask me afterwards; we should have some time. If it's more specific to your scenario, or you want advice, just come find me afterwards. I'll be here over the next two days and I'm super happy to geek out over any of this stuff. Otherwise, why would I even be here doing this, right?

So, pattern number one is bulkheads. The idea behind bulkheads is partitioning systems to prevent cascading failures. This is a nautical term, which is why I threw a diagram of a boat up here. You can see that within the hull of this boat
there are walls separating each compartment. So if we imagine someone coming up and poking a hole in the hull of that boat, water is going to get into the boat, sure, but it's going to be stopped. It's not going to fill the entire boat; instead, the damage is contained. This same design we apply to building ships is what we want to apply to the infrastructure we run.

So let's look at what I think is the simplest architecture diagram you'll see in any presentation about designing web systems. You have your load balancer in front, which the users talk to; you have a pool of web servers sitting behind that; and those web servers talk to a database where the data actually lives. Right now we're happily processing requests; you see I've got all my boxes and cylinders and they're all green. Everything's good.

Now, what if some of these requests end up taking much longer than others? Let's say the average request takes a second, but we have some requests that take five minutes, maybe because we're generating a very complicated report for an organizer who has lots of events in the system. Well, when the first of the month comes, and they decide it's time to go generate all their reports, and they start submitting these requests... oh wait, the web servers aren't so happy anymore. The worker processes that were happily chugging along filling tons of users' requests are now all busy trying to satisfy this series of reports that came in. Our web servers are overloaded and can't handle other users' requests, and this one user has essentially brought down our entire pool of web servers. Not very cool. So what can we do about this?

Well, this is where bulkheads come in handy. We can imagine having two separated pools of web servers: one that handles general requests, and one that's dedicated to the reporting example I gave you. So now, yeah, things still aren't so great; the reporting web servers are still bogged down. But only the people generating reports get that bad experience. Our regular users trying to create events and purchase tickets are still chugging along happily on bulkhead A while the reporting on bulkhead B is bogged down.

At Eventbrite, we do this a lot. We have tons of different pools of web servers, and we route to them based on URL patterns, user agents, just different ways to classify our traffic, so that we can guarantee quality of service to certain types of requests.

Now, this bulkhead design pattern applies to stateful services as well. In the example I gave you about reports, it would be kind of disingenuous to say the reports just bog down the web servers; probably one reason they're slow is that the web server is waiting on a complicated SQL query to return. Well, what if we have a database slave dedicated just to answering those reporting queries, and another database slave for serving general traffic?
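To make that concrete, here's a minimal sketch of what the database side of a bulkhead could look like in Django, using its database router hook. The aliases, hostnames, and app name here are made up for illustration, not Eventbrite's actual configuration:

```python
# settings.py -- two connections, two roles. Hostnames are hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "app",
        "HOST": "db-general.example.internal",
    },
    "reporting": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "app",
        "HOST": "db-reporting.example.internal",
    },
}
DATABASE_ROUTERS = ["routers.ReportingRouter"]

# routers.py
class ReportingRouter:
    """Send reads issued by the reporting app to a dedicated replica,
    so a pile of slow report queries can't starve general traffic."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == "reporting":
            return "reporting"
        return None  # None means "fall through to the default database"
```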
We can make the scope of our bulkhead as large as we need it to be in order to bound the failure cases we're concerned with. In this reporting example, now bulkhead B and its database slave are unhappy generating those reports, but again, the other traffic chugs along just fine.

So next, after bulkheads, I'd like to talk a little bit about canary testing, the gradual rollout of new code. Canary testing is fun because it's probably one of the few terms in software development that we stole from mining. Coal miners used to suffer pretty terrible deaths from the buildup of gases in the mines. Slowly, over time, a gas like carbon monoxide could build up, and people wouldn't realize it was happening even as they suffered its effects, which made things even worse, because you don't notice when you're being poisoned by carbon monoxide. What they realized they could do as a safeguard was take a canary down into the mines with them. The canary is much more sensitive to the effects of the gases than humans are. So while you're down there doing your work, someone keeps an eye on the canary, and if the canary starts to act strange, or, as has happened to our poor bird here, keels over, you know it's time to get out of the mine.

We want to apply this same principle to the code we're developing. We don't want to subject all of our infrastructure and all of our users to new code at once, because if there's a problem, well, then it's too late; everyone is going to suffer. Instead, we roll out the new code or functionality gradually, see how that limited subset does, and only then, if it looks good and we've built confidence after a while, do we finish the rollout.

There are really two ways we do this at Eventbrite. One is what we call baking releases, and I've thrown up part of the dashboard people look at while baking a release. I intended to show you the whole dashboard, but it turns out I'm really bad at CSS, so this is all you get. When we're making a new release of our core product, which we do multiple times a week, sometimes multiple times a day, it goes out to a subset of servers within each of our bulkheads first. Then we keep a close eye on the health of those servers and compare it to the other servers: response times, error rates, user actions on those servers, just generally, does this look okay? Only once we're satisfied that it does will we kick off the rollout to the rest of the server fleet.

The other way we do this canarying is through feature flags. For this we're using a fork of Gargoyle, from Disqus, and I'm showing some code here. The basic idea is that new features, and really many different types of changes, are off by default when they're released. So enabling a feature, running your new code, which is the risky and scary thing, is not actually coupled to the process of releasing your code. In my example here we just have a Gargoyle is_active check for our cool new feature: if the feature is active, we do it; otherwise we keep doing what we were doing before, which we know is safe, performs, and works well.
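The slide code itself isn't in this transcript, but the shape of a Gargoyle-style check looks roughly like this; the switch name and the two responses are invented for illustration:

```python
from django.http import HttpResponse
from gargoyle import gargoyle  # Disqus's feature-switch library

def event_page(request, event_id):
    if gargoyle.is_active("cool_new_feature", request):
        # New, riskier code path: off by default, enabled gradually
        # per user, per percentage, etc., and revocable instantly.
        return HttpResponse("event page, new experience")
    # Old code path we know is safe and performs well.
    return HttpResponse("event page, the old reliable way")
```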
So almost every feature that goes out gets wrapped in some sort of feature flag, and this is a chat transcript showing how someone might submit a proposal and get the flag turned on. A sample rollout might be: first, turn on the new feature only for internal users, within our offices or on our VPN. Then, if it's something that changes the behavior of the product, opt in specific organizers who want to beta test it. Then start a ramp-up across an increasing percentage of users: 10%, 20%, 50%. Finally, take it global, so everyone gets the new code or new experience. The actual ramp-up process will vary depending on what the feature is, how risky it is, and what's actually changing. But the idea is that if at any point it turns out, oops, a new error popped up, or, oops, wow, this got a lot slower, we don't have to do any rollbacks. We could be weeks into the process when we discover a problem, and we can just turn it off instantly.

So next, after feature flags, let's talk about graceful degradation. This is turning functionality off in response to failures or load. On a site like Eventbrite, and probably most of the sites the web developers here work on, some functionality is more critical than the rest. We care a lot more about someone being able to create an event or purchase a ticket than we care about someone getting recommendations for events they might want to check out, for instance. We tend to do a lot of graceful degradation through our feature-flagging framework, which is not the most sophisticated way to do it, but it's something that's worked out okay for us.

One example is degrading in response to failures. In the dark old days, we used to run MongoDB for page tracking, with some reporting features built on top of it, so organizers could see how many people came to their event on different days through different channels. If MongoDB went belly up, we had a feature flag we could flip to just disable page-view tracking, and so instead of servers waiting to talk to MongoDB and getting errors back, they'd simply stop doing that.

Another example would be turning off recommendations during some of our larger on-sales. Nowadays our recommendations are generally served by Ajax and have a dedicated set of resources, but in the past they were generated inline with the event page. So if we wanted maximum capacity for people to hit an event page and purchase tickets, we'd want to avoid doing that work, the extra CPU time and request time it took to generate those recommendations. If we were worried about an on-sale, we could just go flip a flag, and recommendations would stop showing up at the bottom of event pages.

There's also been some development, some experimentation, around automating this work. We have a tool internally we call the velocity engine, which right now we primarily use for fraud purposes, tracking rates of actions across the site. One thing people have played with is also piping certain types of errors through it, so we can say: hey, if this sort of error comes through, turn off this flag; or, hey, based on the rate of certain events we've detected a high-volume situation and want to conserve resources, so here's some optional functionality to turn off automatically. We're still playing with that to see if it's worth it or not.
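Here's a rough sketch of what flag-driven degradation can look like in code. The flag name, the tracker client, and the structure are illustrative, not Eventbrite's actual implementation:

```python
import logging

from gargoyle import gargoyle

log = logging.getLogger(__name__)

def record_page_view(request, event_id, tracker):
    """Best-effort tracking: optional functionality should never take
    down the critical path that calls it."""
    if not gargoyle.is_active("page_view_tracking", request):
        return  # degraded mode: flag flipped off during a backend outage
    try:
        # `tracker` is a hypothetical client for the tracking store.
        tracker.record(event_id=event_id, path=request.path)
    except Exception:
        # Tracking is non-critical; log and move on rather than letting
        # a failing backend break the page for the user.
        log.exception("page view tracking failed")
```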
Kind of related to graceful degradation is load shedding. With load shedding, you purposefully don't handle some requests in order to reserve resources for others. What's different compared to graceful degradation, where you serve a limited experience to all users, is that you serve a completely different experience to users depending on whether they've been shed or not.

For us, the primary example of this is what we call our waiting room. During a very large on-sale, we don't want to have to over-provision to the extent that we could literally handle 100,000 people placing orders at once; that would be prohibitively expensive for the number of times we'd need to do it. So instead, we have a system where, after a certain number of users have entered the order flow for an event, any additional users coming through aren't allowed to enter that flow. Instead, they get redirected to a totally different set of systems, different load balancers and web servers, decoupled from the main site and the main order-processing flow, that simply says: hey, you're in line; as soon as your spot is ready, we'll send you back into the order flow; until then, just chill out here. This is a really critical way we protect the site and keep it up and running even when 100,000 people do show up. And you still get a decent experience: you got a page, you're not just watching a spinner forever, and you're not hitting refresh over and over going "can I get in, can I get in?", which, when your customers do that, increases your load and makes everything worse. You get a nice message explaining what's happening, and then, when your turn comes, you get put through the order flow.

We also do this occasionally in genuine emergencies. Let's say something wasn't load tested properly, maybe there's a really bad query, it's just not going to work out, and there's no way we can handle traffic to a certain feature or a certain event that's going on right now. We might literally go into our load balancers at the edge and block that event: before a request hits anything that actually runs the application code, return an error response that says, we're sorry, this page is unavailable. That's an extreme measure, and it gives some people a really bad experience, but it's worth it if it protects the experience of everyone else.
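As a sketch of the admission-control idea behind the waiting room (my own minimal reconstruction, not Eventbrite's actual system), you can gate entry to the order flow with a counter per event; the capacity number and key names are made up:

```python
import redis

r = redis.StrictRedis(host="localhost")  # assumed local Redis for the sketch

ORDER_FLOW_CAPACITY = 5000  # max concurrent shoppers per event (invented)

def try_enter_order_flow(event_id):
    """Admit a shopper if there's room; otherwise shed them to the
    waiting room, which runs on separate, decoupled infrastructure."""
    key = "order_flow_slots:%s" % event_id
    if r.incr(key) <= ORDER_FLOW_CAPACITY:
        return "order_flow"
    r.decr(key)  # give back the slot we optimistically took
    return "waiting_room"

def leave_order_flow(event_id):
    # Called when a shopper completes or abandons their order.
    r.decr("order_flow_slots:%s" % event_id)
```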
So next is rate limiting, which I'm defining as controlling the amount of work you accept. The idea here is that you want to understand the capacity you have and prevent exceeding it, because it's better to fail fast and give your caller a message that a limit was exceeded than to cause cascading failures because your system is overloaded. The callers of your system get bogged down waiting to talk to you, callers of them get bogged down in turn, and an excess of work beyond capacity in one area can quickly cascade beyond it and take down lots of systems you might not even realize are related. It's better to not accept that work than to say, yep, just wait, I'm going to give you a response, and then have the errors propagate while everyone waits.

We do this in a few different places. At the very edge of our infrastructure, we have some rate limiting in iptables and in Nginx as well, just to catch extreme cases: no, wait, one IP address shouldn't show up and start submitting a thousand queries a second, that's just not legit, let's block that flat out. We also do this in our application, where we have more smarts and can make better judgments about the requests coming in and whether we want to handle them or not, although truthfully, right now most of the rate limiting in the application is centered more around handling and preventing abuse than around protecting the systems. And you really want to be sure you have this anywhere there's a queue within your system, and if you're using external or third-party systems that have queues, find ways to limit those as well.

An example of software that does this really well is Elasticsearch. (I'll give the pitch now: my co-worker John Berryman is giving a cool talk on Elasticsearch tomorrow that you should check out.) One thing that's wonderful about it from an operator's perspective is that for every operation you can do in Elasticsearch, a search operation, an index operation, it has a bounded queue of the work it will accept, and beyond that it immediately rejects your request. That's great, because it means you can't dump 100,000 requests in there and then have 100,000 processes waiting for responses that are never coming. You can only dump in, say, a thousand requests, and beyond that you get an immediate failure, which your client code can understand and handle instead of waiting and causing those cascading failures.
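The bounded-queue idea is easy to replicate in your own services. Here's a small self-contained sketch (mine, not Elasticsearch's internals) that accepts work into a fixed-size queue and rejects immediately once it's full, instead of letting callers pile up:

```python
import queue

class BoundedWorkQueue:
    """Accept at most `maxsize` pending jobs; reject the rest fast."""

    def __init__(self, maxsize=1000):
        self._queue = queue.Queue(maxsize=maxsize)

    def submit(self, job):
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            # Fail fast: the caller learns immediately that we're at
            # capacity, instead of blocking and cascading the overload.
            return False

work = BoundedWorkQueue(maxsize=1000)
if not work.submit({"type": "index", "doc_id": 42}):
    print("503: over capacity, please retry later")
```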
So kind of the flip side of rate limiting would be timeouts. Timeouts are limiting the time you will wait for a request to complete. Rate limiting was on the receiver's side; timeouts are basically the same idea on the sender's side, and it's the same underlying principle: it's better to fail fast than to wait and contribute to cascading failures.

Like rate limiting, there are lots of levels you can apply this at. At the edge we'll have general timeouts on requests to web servers. Internally, as different servers handle requests, those web servers or load balancers have timeouts on how long they'll wait to talk to the next tier, and we have timeouts on requests to data stores as well. For example, if you ever have to wait longer than a second to get your answer back from Redis, something is horribly, terribly wrong; you probably shouldn't keep waiting, so cap it at a second. And crucially, you want timeouts on any calls you make to remote systems that aren't under your control. A great example for us at Eventbrite is our webhook system. Organizers can register to say: whenever someone purchases a ticket to my event, I want you to ping information about that to my web server. There are all sorts of events they can subscribe to. Now, what if they haven't heard this presentation and applied these lessons, and their web server goes down? We don't want our webhook system tied up waiting on requests to some web server that has been crashed by the load of the webhooks it received from us, right? So we need timeouts there to ensure the other webhooks don't get bottlenecked behind the failing ones, so that those webhooks can properly fail after a few seconds and deliveries to other organizers' web servers can keep flowing through.
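In Python, the big thing to remember is that most HTTP clients will wait forever by default. A minimal webhook-delivery sketch with the requests library (illustrative, not our actual webhook worker):

```python
import requests

def deliver_webhook(url, payload):
    """Deliver one webhook; give up quickly if the receiver is slow,
    so one broken endpoint can't bottleneck everyone else's deliveries."""
    try:
        # (connect timeout, read timeout) in seconds. Without a timeout,
        # requests will happily wait forever on a hung server.
        response = requests.post(url, json=payload, timeout=(3.05, 5))
        return response.status_code
    except requests.exceptions.Timeout:
        return None  # record the failure and move on, or retry later
```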
Next would be caching, the idea of saving and reusing results to reduce expensive requests. It's kind of a tricky one, but it's a powerful one. At Eventbrite it tends to take one of two forms. One is saving computed values from within our code base in either Memcached or Redis (we have both, because reasons). The other is saving an entire HTTP response and serving that back without doing any work in your application. For us, we do this a lot with Varnish, though other caching proxies like Squid can do it for you, and Nginx even has some limited capabilities. This is the holy grail: if you can do this right, you never even touch your slow application code. Okay, maybe I should back up: Python doesn't have to be slow, but it might be slow, so if you can have some fancy, fast C code serve back the precomputed result, that is really keen. But it's also tricky, right? Are you sending back the right result? When you're doing caching, you've got to think about invalidation strategies, and that probably deserves its very own slide, so it has one.

There are a few ideas we've played with at Eventbrite, and they've evolved over time. The first idea was basically: just use short TTLs, stupid. If you imagine that hockey-stick-shaped graph I showed you earlier, caching something for even five seconds could make an enormous difference in the number of requests we actually had to handle. That was the genesis of most caching at Eventbrite: let's just try caching it for five seconds, both for computed values within our code and for whole pages like event pages. It's a trade-off: people can see stale data for five seconds, but that's such a short time window that it's probably okay.

But once we wanted higher cache hit rates and had to move beyond that, we needed more sophisticated ideas. For calls within our service layer, we built out something that I think is pretty neat: centralized invalidation logic. We have a daemon that consumes the binary logs from MySQL, which is our primary data store, where the canonical representation of everything in the system lives. This daemon understands that, hey, if a change to the number of available tickets for an event went through, there are these types of cached objects in Redis that I should go attempt to invalidate if they exist. So we build out logic for each service that wants to cache whatever it's responsible for, whether it's tickets or venues or whatever else, to track its representation in MySQL, and then this one centralized place does the invalidation. That was helpful for us because otherwise, with our twisty mess of a code base, there was concern that we wouldn't be invalidating things in the proper places when they were actually updated, and we'd be caching and serving stale data, which would not be good. With this centralized invalidation system we were able to build confidence over time and ramp up to where, for some service calls, we were caching things for as long as 12 hours and actually getting a good hit rate. That was a big win.

The other thing, which is even newer and has been a really promising pattern for us, is what I'm calling the wrapper strategy, which lets us have a dynamic TTL on pages. This is something we've used in our HTTP caching layer. In that layer, if we're just making straight calls to Varnish to decide, based on a URL, whether to serve a cached response or not, that doesn't give us very much control, right? We can't write very crazy logic there, we can't really talk to any of our other data stores, and we can't really understand whether it's safe to cache a given page for longer or not. So what we've ended up doing instead is building both an inner and an outer version of some key pages, like our event pages. The outer page is never actually cached, but it does a minimal amount of work; it's designed to be super fast, and it's really only doing two things.

The first thing it does is look up some data in Redis, again based on that centralized binlog-consuming process we talked about before: when was the data that makes up this event page last updated? The longer an event page has gone without changing, the longer we're willing to cache it. If it hasn't changed in a while, it's less likely to change in the future, and because we're hitting this outer, uncached page first, if the page did change, we would still pick that up quickly.

The second thing the outer, uncached page does is calculate a normalized version of the URL to represent that event page internally. Let's say you're an organizer on our site and you advertise your event through different channels. Maybe you own a music venue and you use our Spotify integration, so people can click through from Spotify, see what concerts are going on in their area, and land on Eventbrite. That click-through passes an affiliate code that we want our code to process and track, and you might have other affiliate codes coming from Facebook and other places. Here we were not caching as effectively as we could, because we had different cache keys for each affiliate code that came through. We were also actually missing data that our analysts cared about, because, guess what, when we did have a cache hit for an event page with an affiliate code, the request never hit our application code, so it never got pushed through to our analytics logs. The analysts realized they didn't have the data to properly understand some of our sales channels, so they weren't happy, so the data engineering team wasn't happy, and eventually basically no one was happy and we had to fix it. So the outer event page, as part of the lightweight work it does, takes care of logging that analytics data and then stripping the affiliate parameters out, so the request for the inner version of the page no longer has to vary on anything except what actually changes the page. The outer page, running in Django application code, makes a request for the inner page (the user never sees this), and that request gets filled by Varnish from the contents of its cache.

In some ways it feels like a Rube Goldberg machine, and if I've explained it poorly, corner me afterwards and I'll try again. But what this has let us do is get better cache hit rates for really active events, because we have that normalized cache key now, and much better hit rates for less active events that are still being requested by users or by bots crawling the site, because they're still in the cache maybe an hour after they were last accessed instead of all being flushed out after five seconds.
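Since this is the part people find hardest to picture, here's a highly simplified sketch of the outer page's two jobs: a staleness-based TTL and URL normalization. All the names, parameters, and thresholds are mine, invented for illustration:

```python
import time
from urllib.parse import urlencode

AFFILIATE_PARAMS = {"aff", "utm_source", "utm_medium"}  # illustrative list

def dynamic_ttl(last_updated_ts):
    """The longer an event page has gone unchanged, the longer we let
    the cache keep it: e.g. a 5-second floor up to a 1-hour ceiling."""
    age = time.time() - last_updated_ts  # seconds since last change
    return int(min(max(age / 10, 5), 3600))

def log_analytics(event_id, params):
    print("analytics:", event_id, params)  # stand-in for the real pipeline

def normalized_inner_url(event_id, query_params):
    """Log affiliate codes for analytics, then strip them, so every
    variant of the page shares one cache key in Varnish."""
    tracked = {k: v for k, v in query_params.items() if k in AFFILIATE_PARAMS}
    log_analytics(event_id, tracked)
    kept = {k: v for k, v in query_params.items() if k not in AFFILIATE_PARAMS}
    return "/inner/event/%s?%s" % (event_id, urlencode(kept))
```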
Finally, capacity planning. This one, I feel, is actually fairly hard, and it's hard in a way that's different from the other patterns I've shown you. The other patterns, as a general operator or developer, you can sort of figure out how to apply based on knowledge you already have. To do capacity planning well, it helps if you have people who are actually comfortable with math and statistics, and it really helps if you're actually gathering the right data about the behavior of your systems. For us, we're not the greatest at this, but we try to do okay, and there are a few things we do.

One thing is load testing. It's very helpful to have an environment in which you can create load you understand and perform experiments: if I do X, I expect Y to happen; and oh, if it didn't happen, why is that? Let me go address the bottlenecks in the system. Let me understand how much work I can get done per unit of whatever I want to scale up, and then have your production capacity be informed by the results you got from your load testing.

The other thing is collecting stats on your servers and your application, around throughput, latency, requests, and queries, and looking at the growth of that over time. Also look at the seasonality of it, which is an interesting component. For us, that turns out to be things like: we need to spin up more API servers on New Year's Eve, because everyone goes out and checks in to lots of New Year's Eve parties on the mobile application just on New Year's night and New Year's Day. A bunch of stuff like that falls out of the data.
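One piece of math worth knowing here is Little's Law: concurrency equals throughput times latency. A back-of-the-envelope sketch of turning load-test results into a server count (every number below is invented for illustration):

```python
import math

# Little's Law: concurrent_requests = arrival_rate * avg_latency
peak_rps = 2000          # target peak requests/second (made up)
avg_latency_s = 0.25     # average response time measured in load testing
workers_per_server = 32  # worker processes one server handled in the test

concurrent = peak_rps * avg_latency_s      # 500 requests in flight at once
servers = concurrent / workers_per_server  # ~15.6 servers at full saturation
headroom = 1.5                             # cushion for spikes and failures

print("provision %d servers" % math.ceil(servers * headroom))  # -> 24
```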
So, to recap: at Eventbrite we have a number of patterns we've adopted to ensure the capacity and stability of our systems, such as bulkheads, canary testing, graceful degradation, rate limiting, timeouts, load shedding, caching, and capacity planning. I hope I've given you some useful ideas for how you can take these and apply them in your own systems. If you want to learn more about this sort of stuff, here are a few books I would recommend. These are all quality. Some, like Release It!, have whole sections dedicated just to working through examples and helping you figure out how to build and apply some of these patterns; others, like The Art of Scalability and the rest up there, have good sections on this and put it in the context of broader systems architecture and operational work. Thanks for listening. I gather I have a little time left, so at this point I'm going to open it up to questions from the group. Yes, you first.

[Question] I probably shouldn't talk about that on camera; find me later.

[Question about tools for injecting failures] Yeah, actually, there are a few we've used a bit; this is something we're not as disciplined about as I feel we should be. I think Vaurien might be one, if I'm getting the name right, that we can interject as a proxy in between requests to force timeouts or force slower responses, and people have tried that. We also sometimes just apply strategies like: what happens if we use iptables to drop a certain percentage of traffic, or drop this service completely, and see how we handle that.

The question is, do we own our infrastructure or do we use cloud hosting? We are at this point totally on AWS. We have a primary site in US East and a DR site in US West. Do you want to follow up real quick? No, we haven't gone in depth on services like Lambda; there are some interesting strategies there, but we're more comfortable with the patterns we've already been applying.

Yes? Yeah, so that's a good question. The question is, if you're building out a culture of pervasive feature flagging, how do you actually clean up your code base? Because it can be quite confusing to have to code around all these flags when you might not even understand whether they're still used. That's generally the responsibility of the developers on the team that owns the feature. At a certain point they'll feel, you know, it's been global for a while, we're not going to roll it back, so they'll do a code cleanup. We do have some tools that attempt to determine whether a feature flag is dead and encourage people to look at those and see if they can be cleaned up. Feature flagging is not without its downsides. I wouldn't want to work anywhere that didn't do this, but it does make some of the coding more difficult, for sure, and it also makes things like integration testing more complicated, because you have to think about the different scenarios and flag states.

Sorry, so, the question is how do we test that flag changes don't break user-visible behavior. I'm absolutely not the best person to answer that. The basic answer is that a lot of manual QA and test-plan development goes into user-facing flagged features, so people test the different conditions and different scenarios. But I would be lying if I didn't say that in the past, when there's been really intensive development across teams in a few areas, that has gotten a lot trickier, and we haven't had a good automated testing strategy around it, which means you rely more on the human testing strategy, and things can slip through.

Yes? So the question was, do we have any analytics on our flags to tell us when certain conditions are being hit, to understand whether they should be cleaned up or not. For the way we use flags, I generally don't think that's applicable or helpful; maybe you can explain to me later what you're thinking. Generally we're setting things up, particularly for a feature ramp-up, in such a way that we already understand the intended behavior. If we're saying we want 50% of users to get variant A and 50% to get variant B, then we know that at that level. It's usually not a flag where a condition is set so selectively that we don't understand what the intended behavior is going to be.

So the question is, how do we handle outages? Things can go wrong 24 hours a day, and I'm not awake 24 hours a day to answer all of them; that would also be terrible. Right now we have team members on my team, the systems engineering team, in an on-call rotation for the entire infrastructure. We're in different time zones, but not too different, so the way we tend to do it is one-week rotations: you're on call for 168 hours, and then someone else is on call. Generally things aren't failing left and right; we're a fairly mature company with good processes and good infrastructure. But yeah, things do go bump in the night, and then someone has to wake up.

Are there other questions? Any hands I'm missing? Ah, yeah, that's a great question. The question is, if you want to learn more about how Eventbrite does this, are there resources we've published? We do have an engineering blog, which has some good posts, actually some great Python posts; I think someone's been doing a series recently on how we're redoing our Python packaging and distribution internally. But to be honest, I don't think we've published a lot on this, so this talk and the video are probably going to be the main public resource. If you want to talk to me more about the things and ideas spinning around in your head, find me. The other thing I would say is that although Eventbrite is in a business where we have to deal with this, lots of companies are, and these books really are great resources to draw on. We didn't invent all this stuff ourselves, and where we did invent, maybe we shouldn't have.

Well, looks like that's it. Thanks again, this has been a lot of fun. Come find me!