Thank you very much, guys. Thanks for coming. A brief agenda: I'll give a brief introduction, then we'll talk about the punchline of the title, the basic configuration options of uWSGI and why you should use them, and then we'll talk about some other stuff. Basically, uWSGI provides worker-management features above and beyond what most WSGI hosts do, and I want to talk about them because they can protect you in production against things you might not be expecting. I also want to talk about additional features you won't get from other WSGI hosts, and then potentially take questions.

I apologize for the click-baity title, trying to stir up some controversy about uWSGI. I love the thing, it's great, but you do have to get some basics right.

So why are we here? We're here to be successful, right? And that means something different for each of us. Some people here are language people, some are data scientists. I'm a distributed-systems engineer: I build reliable, performant systems. That gives you happy users and a profitable business. And maybe you're not a for-profit organization, maybe you're a non-profit; well, it'll make your organization more impactful. So building a reliable, performant system is important to being successful in whatever enterprise you're running.

This is a talk about distributed-systems architecture. Specifically, we're going to zoom all the way in to the individual-service level, and uWSGI is the tool we'll be using to do that. I bring this up because I often talk about different parts of the distributed-systems stack. In 2017 I gave a talk at PyGotham about stateless systems and how they can make your system more testable and more performant. This is the exact opposite of that: we go all the way down to the very smallest level and talk about each individual microservice and how it's hosted.

So why uWSGI?
Well, it is very powerful, the list of features is tremendous, and it's very fast. But it was written in a different time. Back in 2008, when uWSGI was written, people did things differently, and a lot of the defaults reflect that. There are a lot of quotes in here from Unbit, the developers of uWSGI. Is anyone from Unbit here? I was hoping they would be; I was also scared they would be. All right, they're not, thank goodness. But they have production customers, people that pay them money, and a lot of the decisions they make are about catering to those customers, rightly so. So I want to talk to you about how to avoid those problems, and then move on to the things uWSGI provides above and beyond other WSGI hosts.

Almost every item I'm going to present today, at least in the first third, caused an issue for us at some point, of varying difficulty. They might seem like things where you'd ask, "wow, how did that hit you?" Well, I have a very large team of developers, between 40 and 50 at any time, working on services hosted by uWSGI, and those folks are productive and get a lot of work done. They stumble across corners. So if you're about to start a WSGI project, maybe with Flask or Django or anything for that matter, you should consider uWSGI as your host, and you should use this as a reference.

Now, uWSGI 2.1 is supposed to be released at any moment. Back in July 2017 the Unbit developers posted to a mailing list that they had decided to fix all the bad defaults in the 2.1 branch, especially for the Python plugin (because, little-known secret, uWSGI hosts Java services, Ruby services, anything you can imagine, really). The 2.1 branch has not been released as of June 2019; I know this because I checked it every morning leading up to this talk. The first reference to uWSGI 2.1 was in 2014, so who knows when it will be released. Maybe that will provide some respite from these problems, but as it is now, you should use the contents of this talk as a basis if you're going to start any development of a WSGI service using uWSGI.

There's an official "Things to know" doc that uWSGI provides that is very valuable. I recommend you read it; I'm not going to say much more about it other than that, and I reference it every so often throughout this talk. I'd also like to say that we published the contents of this talk in an article at techatbloomberg.com about an hour ago, so if you go to that website on your phone or your laptop, you will see in Markdown form what we're going to talk about today. In the presentation things are abbreviated so there's not too much text on the screen, but on the website each item is expanded on very greatly, so you can understand our reasoning, with specific examples. It's a much better reference than the slides are.

All right, so let's get started. That's it, I can leave now. Do this right now. I will explain each one of them, but this is what you want to do, so let's just get right into it.

The master process is what makes uWSGI special; you want it on. I mean, this is kind of obvious.
We should just get past this point, but I bring it up because there are some circumstances where you want it off, specifically when you're debugging. If you want to test your service, you turn the master process off and run a single process (which you have to with the master process off); then the PID created immediately after you start uWSGI is where your code will be running. So you can use tools like strace, which we heard about earlier in the conference, or any command-line profiling tools, Valgrind for example, and you won't have to deal with all the forking options, follow-fork-mode and child-tracing and all the stuff you normally have to do when you're using a tool like Valgrind to debug or profile a Python service. Otherwise, not that interesting.

Strict config parsing. OK, so by default uWSGI lets you put anything you want in the configuration file. The justification is that you can add non-existent options to your config files as placeholders, custom options, or app-related configuration items. But the truth is that most of us don't do that. When was the last time you passed application-specific configuration options through a framework configuration file? Never; you shouldn't do this. So you should turn strict mode on. Why? Because it happened to us: somebody meant to type something in the config file to control behavior, and they fat-fingered it. They put a wrong character in, they moved the service to production, and the service works, right?
It's not like it didn't work; it's just that that option wasn't set properly, the service moved to production, and then something didn't behave the way you wanted. This is an example of uWSGI having a default behavior that can allow you to do something you don't intend, so you should set this. Now, if you do want the feature of passing configuration options through the uWSGI ini file, absolutely feel free, but that shouldn't be the default, at least in my opinion.

The next option is vacuum. This basically makes uWSGI clean up after itself. A common example: if you run your service on a local Unix socket, without this option the socket will not be deleted for you. Sounds harmless, right? No big deal. Well, in my environment we have multiple users operating on the same machine, and if one of them starts a development instance of their service, the socket gets their username, and they might not have set their umask properly so that all users on the system can read, write, and delete those files. So if another user tries to start a service on that socket location, they get a failure, because they don't have file permissions to delete that socket. Another thing you should just set by default: there's no real reason to leave these temporary files around. If your application is special for some reason and needs them, feel free to set this to false, but by default it should just be on. And if you have a large development team, it might save one employee cursing another one day in the future, because that is what happens.

All right, now we enter a section with a common theme, and that is uWSGI defaults changing the expected behavior of your code. By default, uWSGI disables Python threading, which therefore removes the GIL. (These are all quotes, by the way; the gray-background sections are quotes from the uWSGI developers directly.) They do this because they think it will improve performance. I don't know if it does or not; maybe it does. All the applications I have that are of sufficient complexity end up using threads for something in the background, so I had to turn this on right away. I point it out because, while it's relatively easy to find, the first time you come across this your instinct isn't to think "my service host must be messing up my threads"; your instinct is "ah, I screwed something up in my threading code." You start by looking at your own code, and then, depending on how good you or your employee are at Googling, it can take anywhere from five minutes to a few hours to figure this out. So just turn threads on by default. If you're super performance-sensitive and you don't use threads, you can opt to leave them off intentionally, but it's not a very good default behavior.

Another one is the single-interpreter parameter. uWSGI has a feature (uWSGI really has a lot of features) that allows you to run multiple interpreters in every worker process. uWSGI uses the pre-forking model, so if you want to run 60 requests concurrently, it spins up 60 processes. If you allow multiple interpreters, uWSGI lets you run more than one web application in each process; basically a way to condense your processes and keep your PIDs down. I don't really think that's necessary in the modern environment of containers and virtualization, but back when they wrote uWSGI and people were working on shared physical hardware, keeping the number of PIDs down could be important. This usually has no side effects, but there are some third-party C modules that do not work well under multiple interpreters: they make assumptions about globals and C memory space which don't hold when you run multiple Python interpreters. I actually don't remember how this caught us. I assume it must have, otherwise we wouldn't have found it; sorry, I can't give you the background on how we stumbled onto this one.

And then the last one, which is just frustrating, is die-on-term. By default, uWSGI treats the SIGTERM signal as "reload the application stack." Maybe you've updated your code and you want to refresh what the server is hosting; sure, fine, that's a great feature, I think you should have it. But why bind it to SIGTERM? We all collectively agreed on what SIGTERM should do: it should kill the thing. For some reason they decided to do something else with it, and I just don't get it. The first time you start using uWSGI, someone's going to literally type kill with the PID number because, for whatever reason, they lost their terminal and the service kept running in the background, and it's just not going to stop, and they're going to wonder what's going on. If you're smart you'll do kill -9 and get past it, but someone might want to figure out what's happening, and then you have to find this parameter. So again, it's the common theme of uWSGI changing behavior away from what we all expect programs to do, toward something their use case needed, which we no longer need. If you're starting a service in 2019 and you want to use uWSGI as your host (which you should, despite the little things I'm pointing out), set this as a default.

And this one, sorry, I get excited about this one. This is the worst to me; this is the most egregious configuration parameter I think I've ever seen in any project. Without this parameter, uWSGI will start even if it cannot load your application. That means if you have syntax errors in your application, or some other dependency of your application cannot be loaded...
It just starts anyway. And guess what it does: it serves 500s and 400s to everybody, because that's the only thing it can do. What's insidious about this is that a lot of systems will just check that the server is awake. You have a really basic health-checking system, or some naive orchestration: your service starts, it looks fine, you start redirecting production traffic to it, and everything goes boom. The reason they did this is that back in 2008 it was a common pattern to start uWSGI with nothing loaded and then dynamically inject applications. I can't imagine what the world was like back then; technically I was operating as an engineer professionally, but I don't remember doing it this way. uWSGI did it that way back then, though. So nowadays you want to set need-app = true, and what that does is make startup fail right away if your application has a syntax error. I don't know if you saw the talk today, "Look ma, no HTTP": his whole demo relied on changing code and having an auto-reloader restart his service every time the code changed, so he could see the errors he made immediately. If he were using uWSGI with the default, the whole thing would be for naught, because it would start up and look like it was fine. So don't do it.

This next one is actually just an opinion; it's not anything I can pin on the uWSGI developers. uWSGI is very verbose with logging. Most of us developing an application will have very concise but meaningful application-specific logging, probably because we're feeding our logs into some grander distributed-grep system where we pay by the byte and don't want to spam the thing. So I recommend disabling uWSGI's default request logging.
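Pulled together, the recommendations from this first section fit in a few lines of ini. This is my own sketch, with option names as they appear in the uWSGI docs; log-4xx and log-5xx are the error-logging options discussed next:

```ini
[uwsgi]
strict = true              ; fail to start on unknown config options
master = true              ; run the master process
enable-threads = true      ; don't silently disable Python threading
vacuum = true              ; clean up sockets and temp files on exit
single-interpreter = true  ; one interpreter per worker process
die-on-term = true         ; make SIGTERM terminate instead of reload
need-app = true            ; refuse to start if the app fails to load
disable-logging = true     ; drop the noisy per-request log line...
log-4xx = true             ; ...but always log 4xx responses
log-5xx = true             ; ...and 5xx responses
```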
I don't particularly like the format either, although that's all arbitrary; we all have our own arbitrarily preferred log formats. But what I do think you should do, and this is really what I wanted to talk about, is enable uWSGI's logging for 4xx and 5xx error codes. Why do I say that? Because it is very, very difficult as an application developer to guarantee that you will always catch an exception or an error and log it properly. It is far more likely that some percentage of the time you will intend to capture all your errors but you won't, because the error occurs before your logging handler, or in some other situation your logging handler can't anticipate. So I recommend turning this on to make sure you always have some indication that some traffic failed. Even if you also log it in your application logger, two log lines aren't going to hurt you, but zero will.

All right, and that's kind of it. If you stick to this, and you're running a decent-sized application team, or you might be in the future, you can avoid a lot of significant issues that will either waste developer time or cause potentially major production outages. After this, though, we're going to get a little bit more positive. And again, for those of you taking pictures: check out techatbloomberg.com. All this content will be there, with basically all the things that I said written down, instead of bare-bones like the slides.

One of the things I want to talk to you about is worker management. I feel passionately about this, and most WSGI hosts don't even bother. They're just like: "You need 60?"
They're just like you need 60 Sure, I'll give you 60 and that's kind of like they stop talking about workers at that point You wizgy gives you a real powerful wealth of features in this area The most dearest to my heart is worker recycling What this basically does is make sure your workers don't get too old and there's many metrics for that I have three different methods here max request basically tells you wizgy to restart your workers after a thousand requests That number is pretty small depending on your your situation At bloomberg. We're doing we're doing data services. So there's a lot of financial calculations that take a lot of time So a thousand requests actually might take a little while If it's been alive for more than an hour Or if the process has allocated more than two gigs of memory resident memory And that all of those go into effect after the request is finished So if you have a request that goes from let's say two gig from one gig to eight gigs in one request It will wait until it's finished and we'll clean it up after it reaches eight gigs I point that out because this is very valuable to help you prevent like a certain class of errors If you have a slow memory leak And you recycle your workers you might not ever figure it out Right if you have like a few bytes that are leaked everyone wants them on because an object somewhere That's not cleaned up and you turn on all these options here probably won't be a problem Now obviously you should find that slow memory leak anyway and fix it But that doesn't mean you should let it cause a problem for you in production So these features are great and you should turn them on front by default as far as I can tell There is really no cost to this Except for maybe if your application has a huge startup time These workers are forked off of the parent one anyway So that even isn't really a big problem, but I suppose if you had a huge amount of page tables that the forked time might be significant I don't know. 
I have never personally observed any issue with this whatsoever. And worker-reload-mercy just lets you configure how long to wait before forcefully closing a worker, if for some reason it's holding onto a resource and won't give up control gracefully.

Then there's dynamic worker scaling. This, again, is not strictly required. It's a feature I love because I happen to run multiple WSGI services on the same physical host or the same virtual machine, and I also run multiple versions of the same service. If you've seen my talk about stateless services: we run multiple versions of the same stateless service and basically replay requests to all of them to compare differences in their responses. So if you were to look at one of my machines (500 processes is an obvious exaggeration, but we do often run 96 or 128), and you have 10 or 20 services with that many processes each, your top or your ps can be pretty busy. Dynamic worker scaling lets you start with a lower number of workers and then scale up and down depending on the load on the system. And it's anticipatory: it anticipates how much traffic to expect based on how much traffic you're getting at the moment, and it generally leaves enough of a buffer that spikes won't outpace it. You can see at the bottom there are some parameters about the backlog; that feature in particular makes sure that if scaling isn't keeping up, uWSGI creates emergency workers so you can handle the load. And again, as far as I can tell, this has never been a problem for us; it scales fast enough. It's not a risk, just another no-cost feature you should use. It's also pretty sophisticated.
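A busyness-based scaling setup might look something like this. The numbers are illustrative, not the exact slide, and the cheaper-busyness options come from uWSGI's built-in cheaper_busyness plugin:

```ini
cheaper-algo = busyness
processes = 128                      ; ceiling on worker count
cheaper = 8                          ; floor: never drop below 8 workers
cheaper-initial = 16                 ; workers started at boot
cheaper-step = 16                    ; workers spawned per scale-up
cheaper-overload = 1                 ; re-evaluate busyness every second
cheaper-busyness-max = 70            ; scale up above 70% busyness
cheaper-busyness-min = 20            ; scale down below 20% busyness
cheaper-busyness-backlog-alert = 16  ; listen-queue depth that triggers
cheaper-busyness-backlog-step = 2    ;   emergency workers, two at a time
```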
There are a lot of options and a lot of different algorithms. The slide shows the configuration I like, but if you're going to do this, you should understand your application and check out the algorithms other than the busyness one I've selected.

OK, hard timeouts. This is a really powerful feature that I have not seen in other WSGI hosts. If you turn on harakiri, uWSGI will forcibly kill a worker process if it gets stuck. I point this out because this happens all the time, and harakiri can basically make it a non-issue. I mean, when it happens you should log it, alert on it, figure out why, and fix it. But we had another talk earlier today, I forget the gentleman's name, about using asyncio in production and some of his war stories. Granted, he was talking about asynchronous cases, which is a little different, but he described a situation where a spike of traffic came in, or conversely one of his downstream dependencies slowed down or hit its limit, and his service backed up with requests. He had thousands and thousands of requests in flight, and they could never catch up; even if traffic went down to zero, it would take hours for everything to recover. Well, this is what this feature is for. If a spike of traffic comes in and your system locks up, maybe not because of a deadlock but just because of limited resources, harakiri will clear all of that after 60 seconds, so when traffic does recover, your system will too. Now, granted, you should understand why that queueing condition occurred, expand the capacity of the limited resource, and make your system more reliable in that regard. But there's no reason you have to get woken up in the middle of the night every time. That's what this is for: CYA. Protect yourself.
Don't drown just because somebody made a mistake. We all make mistakes; use the tools that protect you. So you should use harakiri. We use 60 seconds, which is a long time for most people, but like I said, we're doing numerical calculations on mortgage-backed securities and other large financial instruments, and some of our services go much higher than 60. The lower you can set this the better, because it's basically the period during which a queued-up condition can persist; once the period turns over, all the stuck requests clear out.

And there's an analog to this. Hard timeouts are kind of drastic: you are literally kill -9'ing your own workers, and you really don't want to get to that point. This next option should have been in the first section, but I put it here so we could contrast the two. This uWSGI configuration option (py-call-osafterfork) allows worker processes to receive signals; without it, uWSGI does not allow that. I have no idea why; I actually couldn't find the justification, hence there's no quote from the developers here. The reason you want it is that you might want to handle a signal yourself. For example, with the Python signal module you can call signal.alarm and tell the interpreter to wake up a handler after a certain amount of time. So one pattern is: set harakiri at 60 seconds, and use the signal module to set a soft timeout at 59 seconds. Oftentimes I'm serving requests where I have partial responses.
I can return those even if I can't return the full thing. So in our system we use the signal module to wake up any waiting process at 59 seconds, gather whatever data we have, and return it to the client. If that fails for some reason, say the process is actually stuck, then harakiri kicks in and kills the process to make sure we continue to operate properly. These two things go together, but the signal option is really a default you should just set; there's no harm in it. If no one sends your processes signals, fine; but if you want to, this is another one of those "how good are you at Googling" situations.

This next one is very important as well. We had a talk at this conference about observability, about tracing, logging, and monitoring, and how observability is more than just those three things. This is another area where you can get observability into your software. By default, when you start a service host, most of them give you boring ps output that's just the command line used to invoke the service, which is practically useless; I guess it tells you which Python module you loaded, but that's really it. uWSGI is aware of this, and they give you an option called auto-procname that you can put in the config file. It will tell you which process is the master, which processes are workers, and which worker number each one is. That's still a little naive, though.
What if you have more than one uWSGI service on the same box? You might end up with ten uWSGI masters, and which one is the one you want? So you can add a prefix: you can set procname-prefix and give it a service name. That's much better, but it's still not the best we can do: there's an API you can invoke from within your code to specify dynamically what you want the process name to be. Here's an example. It's contrived, it's not exactly what we actually do, but it gets the idea across: you can put context about what that process is doing into its name. In this case we set the service, the username, and the URI they're accessing. What this helps you with is that sometimes you have a problem where there's a lot of slowness on your hosts and a lot of stuck processes, and very quickly, just by running top or ps, you can see a pattern: it's all the same user, or all the same URI, or there's something else common to the problem. Otherwise that might take you a while of digging through strace output or logs; this way you get a quick snapshot. It's not a replacement for anything else, just a free, zero-cost additional tool you can add to your system.

The last section I want to talk to you about is additional uWSGI features. This is a controversial section, because if you use any of these features, you have now lost compatibility with other WSGI hosts. You can no longer switch back to Gunicorn, because you are now leveraging uWSGI-specific functionality in your code. That said, it can still be worth doing. Honestly, we don't switch hosts that often once we get a system into production, and these features can provide a solution to a problem with much less complexity than you'd otherwise need.
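The process-naming example looks roughly like this. It's a sketch: `request_title` and `tag_worker` are names I made up, `uwsgi.setprocname` is the real API, and the `uwsgi` module only exists inside a uWSGI process, hence the no-op fallback.

```python
try:
    import uwsgi  # only importable when running under uWSGI

    def set_process_title(title):
        uwsgi.setprocname(title)
except ImportError:
    def set_process_title(title):
        pass  # no-op outside uWSGI (tests, plain Python)

def request_title(service, user, uri):
    # What we want a stuck worker to show in top/ps output.
    return "%s user=%s uri=%s" % (service, user, uri)

def tag_worker(service, user, uri):
    # Call at the start of each request handler so top/ps shows
    # which user and URI this worker is currently serving.
    title = request_title(service, user, uri)
    set_process_title(title)
    return title
```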
So it's a trade-off between reducing complexity and keeping your WSGI hosts interchangeable. We end up using some of these because the reduction in complexity is just tremendous when you're running a large system.

The first one I want to talk to you about is cron and timers, which is literally what it sounds like: running crons and timers. You can say "do something on the first of every month" or "do something every 20 minutes." It does not do one-time timers; if you just want to do something in 60 seconds, use the signal module. Why would they write something that's already in the standard library? I point this out as useful because sometimes you want your periodic tasks to be synchronized with the software version you're deploying. Sometimes you ship a new service with some functionality you want to run on a timer, and you'd have to update your cron system separately; with this, you can deploy those things in an atomic manner. Going forward this usually isn't too big of a problem, because you can deploy the new features and change the cron timer later. But going backwards can actually be dangerous: you've moved your service forward, there's some bug in it, you want to roll back, but now you have some cron job that expects the new behavior to be there. So this can help you deploy software in a more atomic fashion. And if you just don't like using cron or any of those things, you can do it in the service itself. You can also do it with a decorator instead of at the global level: you put a decorator around a function and uWSGI will call it periodically.

The next thing I want to talk to you about is locks.
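Before locks, here's roughly what that decorator form looks like. `uwsgidecorators` ships with uWSGI and only imports inside a running uWSGI process, so this sketch falls back to stand-in decorators, and the task bodies are invented for illustration.

```python
try:
    # bundled with uWSGI; registers tasks with the master process
    from uwsgidecorators import timer, cron
except ImportError:
    # outside uWSGI: no-op stand-ins so the module still imports
    def timer(secs, **kwargs):
        def wrap(func):
            return func
        return wrap

    def cron(minute, hour, day, month, weekday, **kwargs):
        def wrap(func):
            return func
        return wrap

@timer(1200)  # run every 20 minutes
def refresh_reference_data(signum):
    return "reference data refreshed"

@cron(0, 0, 1, -1, -1)  # midnight on the 1st of every month (-1 = any)
def monthly_rollup(signum):
    return "monthly rollup done"
```

The decorated functions receive the uWSGI signal number when fired; here they just return strings so the sketch is easy to exercise.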
This is actually very difficult to do if you don't have a framework-supplied lock like this one. You'd have to get some code to run before uWSGI forks your processes, create a lock in global space, and share it around; or use an external lock in Redis or something. It's a real pain. I don't think you should write software that needs a lock (if you have to do this, you've probably messed up), but if you do, this is probably a better way than all the alternatives. So I'm just making you aware that it's there, even though I would feel kind of dirty if I had to use it.

Then there's the cache system, which is super cool. This is a cache that uWSGI provides so that all the workers can share information. You could obviously use memcached or Redis as well, of course, but then you have to create a separate system to start those things and monitor them, so it's kind of nice to have it all packaged together. We actually use this for rate limiting: every time a request comes in, we store in the uWSGI cache how many requests a given user is making, and we reject requests once they've reached some threshold. You might say, why don't you have HAProxy or nginx do that? The reason is that we want to throttle people in a more sophisticated manner than one of those proxies can; we actually want to look at the HTTP request in pretty detailed ways in order to do that. So this is a pretty powerful feature. It's got some gotchas; for example, keys or values that are too big for the cache will silently fail to insert, so you need a good test suite around it. But once it's working it's bulletproof, and it's much simpler than standing up an external memcached or Redis instance to get the same job done.

And then there's mules.
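Before mules, here's the shape of that cache-backed rate limiter. It's a sketch: the threshold and key scheme are made up, `uwsgi.cache_get` and `uwsgi.cache_update` are the real calls, and a plain dict stands in for the shared cache when not running under uWSGI.

```python
import time

try:
    import uwsgi  # only available inside a uWSGI process

    def _get_count(key):
        raw = uwsgi.cache_get(key)  # bytes, or None if missing
        return int(raw) if raw else 0

    def _set_count(key, value):
        uwsgi.cache_update(key, str(value).encode())
except ImportError:
    _counts = {}

    def _get_count(key):
        return _counts.get(key, 0)

    def _set_count(key, value):
        _counts[key] = value

REQUESTS_PER_MINUTE = 100  # illustrative threshold

def allow_request(user, now=None):
    # One counter per user per minute window; shared across all
    # workers when the uWSGI cache is backing it.
    now = time.time() if now is None else now
    key = "%s:%d" % (user, int(now // 60))
    count = _get_count(key)
    if count >= REQUESTS_PER_MINUTE:
        return False  # user hit the threshold this minute
    _set_count(key, count + 1)
    return True
```

Note that the read-modify-write here isn't atomic across workers; for a strict limit you'd wrap it with uWSGI's lock, or accept slightly approximate counts, which for throttling is usually fine.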
So mules are worker processes that are not workers They're not servicing requesting your clients, but they get but they're there to do stuff We use this for example to aggregate metrics So all the worker processes will send the mules metrics and then the mules will offload those metrics to your metrics engine Right whether you use data dog or metric or metric tank or griffon or whatnot So you can put a little decorator on that and say target equals mule If you don't say target equals mule it'll run in the in just a random worker process or all worker processes So that's how you specify that and then there's ways to send data there as well Which I include in the longer form article that we posted on tech at bullenberg.com if you want to see that So that's it. I am hiring I am looking for an sre and just to qualify when I say an sre I mean a distributed systems engineer who doesn't get work from product people who want features built whose job it is to Optimize monitor and improve the performance of our system and its reliability for the sake of itself and not to get some Featured on for a business guy We are a pretty great place to work. Please talk to me. Um, I would love to tell you more about the role We have some people already doing this role. So you'd be joining a team. You wouldn't be bootstrapping an organization And but you'd have to come live in new york, which I know is terrible. Who wants to live in new york? That's it. I think we have a few minutes for questions if there aren't Yeah, we have a few minutes for questions again Microphones are there and they're please line up behind the microphone and uh ask away Hello, hi, uh, could you tell me please more about hierarchy option? Why should that I use it? Why should that? 
So, the worker-process recycling will, I think, kick in and restart workers; I'm not a hundred percent sure about that. But generally your worker-process recycling is going to have a long interval. You might say an hour, or a thousand requests. So if you set it on requests, a hung worker will obviously never reach that. If you set it on memory, it will never reach that either. So it's only the time-based one that will catch it, and then the question is: do you want to wait an hour, or sixty seconds? That's really it, to me. I want to wait sixty seconds. If I wait for that hour, many, many bad things can happen. And this actually happens to people: you have a bug in your code and every worker deadlocks. It happens, right? So harakiri is a way to help prevent that, whereas if you used just the normal worker-process recycling, the time interval would be too long.

All right, thank you.

Hi. What about the lazy-apps option, so loading the app and then forking, versus forking and then loading the app in each worker? Have you been bitten by that?

So we prefer to load everything and then fork afterwards. You know, that's a great question; I don't think we have been bitten yet. Doing it lazily makes the process reloading take much longer, because now every worker has to wait to load all of your application code, right?
Which can sometimes take seconds. So we have not been bitten yet, but that's a great feature to be aware of: if there's something about your application that means you can't load it once and fork trivially, you can enable the lazy-apps option and that will prevent it from being a problem.

Usually that kind of stuff happens if you import something that initializes something globally, like a C extension, and then you're doomed without lazy loading.

Absolutely. Thank you.

So, thanks for the talk. I wanted to ask about your deployment environment, because I noticed you have a huge number of workers. Is it containerized? Are these small machines, or big machines with a lot of CPUs?

In our particular situation, we generally have dual-socket physical Linux servers with somewhere between 20 and 28 cores, and generally between a quarter and half a terabyte of RAM. We have some virtual deployments, deployments on VMs, but we don't really prefer them, because we run a very high duty cycle on these machines. We're doing things like Monte Carlo simulations of mortgage-backed securities, which is why I have so many workers. We have a cluster of 500 machines that we run at 100% CPU for 20 hours a day, and if we were to go to a virtual environment, it would just cost more money; we'd need more machines once you add the overhead. So we are capable of running in a containerized environment if we need to, but for money's sake and efficiency's sake, we run on physical hardware.

Thank you. Cool. One last question, very quickly. Do you have any recommendations or gotchas for emperor mode, or would any of your recommendations here change with that, except for the one option you mentioned?

To be totally honest, I don't understand what emperor mode and zerg mode and these things are for. We have a microservice environment where each service can run independently and can be deployed independently.
So I'm honestly completely ignorant of all of that stuff, sorry. But we kind of almost recommend against using emperor mode at all; I don't know why you would use it. I haven't seen the use case for it. I suspect it's one of those things that was the best they could do back in the day, and things have progressed since then.

All right folks, let's thank our speaker again.
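For reference, a sketch of how the options discussed in the Q&A above might look in a uWSGI ini file. The specific values are illustrative only, not recommendations from the talk, and the speaker's team actually prefers the default preforking over lazy-apps:

```ini
[uwsgi]
; harakiri: kill and replace any worker stuck on a single request > 60s
harakiri = 60

; routine worker recycling (request-, memory-, and time-based)
max-requests = 1000
reload-on-rss = 2048
max-worker-lifetime = 3600

; load the app after fork() in each worker (off by default; the speaker
; prefers load-then-fork, but this helps if globals break under forking)
lazy-apps = true

; a named shared cache usable from all workers, e.g. for rate limiting
cache2 = name=ratelimit,items=10000

; one mule process for background work such as metrics aggregation
mules = 1
```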