I'm here to talk a little bit about asynchronous I/O programming, otherwise known as event-driven programming. If you've ever used an async I/O framework, it's probably obvious why it's called event-driven. If you haven't, use one, and it becomes clear. All right, so a little bit of background on asynchronous I/O. This is a technology that is all the rage in the Python world, and it has been all the rage for about 20 years now. In, I think, 1996 or so, a couple of modules landed in the standard library that nobody remembers anymore, called asyncore and asynchat. I don't know if they were removed for Python 3; they're still there in 2.7, anyway. That was the first attempt at solving all the world's concurrency problems for Python with asynchronous I/O. The next attempt came along in the early 2000s: a framework called Twisted, which was a much more ambitious, and probably much closer to successful, attempt to solve all the world's concurrency problems for Python using asynchronous I/O. I included PyQt in this list not because it's a network framework, but because it's interesting that this programming technique is useful for more than just networking. It's also extremely useful for GUI programming, and it's weirdly relevant in other ways that we'll see. Tornado, which is the framework whose use inspired this talk, came along maybe 10 years ago; I'm not exactly sure. And then more recently we have a thing called asyncio, which has been in the standard library for a couple of years now. People keep doing this. Also, it's not just a Python thing: Redis and Nginx are two very trendy, popular, monolithic servers written in C using asynchronous I/O. And if you're programming in Node.js, my understanding of that ecosystem is that it's asynchronous I/O
or nothing; that is how you do concurrency in Node.js. So why is it so popular? Simple: because it works. It's a good technique; it solves the problem that it sets out to solve very well. The problem is how to scale an I/O-intensive workload. And the reason it works so well is that instead of having high overhead, like one thread or even one process for each client, you only have one file descriptor. A file descriptor takes up a lot less memory, both in your process and in the kernel, and you don't have expensive context switches. Or you do, but they're less expensive. Less overhead means you can handle more concurrent clients, and everybody wants to handle more concurrent clients. But there's a catch. The catch: you shall not block. The way this is always explained, if you read the documentation for any of these frameworks, is that you shall not block with other I/O. You're not allowed to go make database queries. You're not allowed to hit your cache server. You have to be careful even about opening local files, because a local file might be on NFS, and that's network I/O. Has anybody ever tried to open a local file on a hard disk that's failing? Talk about blocking I/O. But something they don't mention in the documentation, and they should, is that you shall not block with CPU activity either. Of course you're thinking: how can I not do CPU activity? Anything a computer does that isn't I/O is CPU activity, right? Well, yeah, but if you're adding three integers and appending the result to a list, that's CPU activity, but it's not that much. It's when you do a lot of CPU activity that it blocks the event loop and bad stuff happens, which I'll get to shortly. So what happens when you block the event loop? This is one reason why I threw a GUI framework into the mix, because you've all seen the results of blocking the event loop in GUIs, right?
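None of the frameworks named in the talk are needed to see this effect. Here is a minimal, stdlib-only sketch (my own illustration, using asyncio rather than Tornado) of one blocking call starving every other task on the loop:

```python
import asyncio
import time

async def heartbeat(beats):
    # Stands in for "serving other clients": should tick every 10 ms.
    for _ in range(5):
        beats.append(time.monotonic())
        await asyncio.sleep(0.01)

async def blocker():
    # A blocking call (time.sleep, a huge json.dumps, a synchronous DB
    # query) never yields to the event loop, so every other task stalls.
    time.sleep(0.3)

async def main():
    beats = []
    await asyncio.gather(heartbeat(beats), blocker())
    # Gaps between heartbeats reveal the stall caused by blocker().
    gaps = [b - a for a, b in zip(beats, beats[1:])]
    return max(gaps)

worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap:.3f}s")  # far above the 10 ms schedule
```

The heartbeat is supposed to fire every 10 ms; while `blocker()` holds the loop, it cannot, which is exactly the frozen-GUI and frozen-server symptom described next.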
The application on your desktop freezes up: it doesn't respond to input, it doesn't repaint, you get the stopwatch cursor. I don't have a lot of experience with GUI programming, but I'm pretty sure the reason is usually (a) they're using an event-driven framework, and (b) somebody blocked the event loop. With all the other frameworks I mentioned, Tornado or Twisted or asyncio, whatever, what happens is the server process freezes and cannot talk to any of its clients. And the whole reason you're using an asynchronous I/O framework is to handle thousands or tens of thousands of concurrent clients. If you block the event loop, you've just blocked all of them from any conversation with the server. Bad idea. So the rules, when you're using any asynchronous I/O framework, are pretty simple. Rule number one: all I/O must go through the async framework, or it has to run in a separate thread. These are Python-specific rules; I have no idea what the rules are for Node.js, or for libevent that C programmers use. And then for long-running computation: you can do it in another thread, but you have to be aware of the global interpreter lock, which prevents two threads from running Python bytecode concurrently. In particular, if you have another thread that runs code in a C extension, and that C extension releases the GIL, you're fine: you'll get concurrency and you won't block the event loop. If you can't satisfy those conditions, you have to do your long-running computation in another process. So, consequences of these rules. Qt: I talked about Qt, it's a GUI framework, right? Of course it gives you buttons and menus and mouse clicks and all that GUI stuff, but a lot of GUIs these days, and for the last 20 years, also do network stuff. So Qt includes everything you need for writing TCP servers and clients, and HTTP servers and clients, and lots and lots of networking stuff. And it's all managed through the Qt event loop.
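Those rules can be sketched with the standard library alone. The example below is my own illustration, not code from the talk: it hands work to an executor via `run_in_executor` so the event loop stays free. `hashlib` is used because it releases the GIL on large inputs, so a thread pool suffices; for heavy pure-Python computation you would swap in a process pool, as the rule says.

```python
import asyncio
import concurrent.futures
import hashlib

def crunch(data: bytes) -> str:
    # Pure-Python CPU work would hold the GIL the whole time; hashlib
    # releases it for large buffers, so a thread gives real concurrency.
    # For heavy pure-Python computation, use ProcessPoolExecutor instead.
    return hashlib.sha256(data).hexdigest()

async def main():
    loop = asyncio.get_running_loop()
    payload = b"x" * 10_000_000
    # Rule: don't run this inline in a coroutine; hand it to an executor
    # so the event loop keeps servicing other clients in the meantime.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        digest = await loop.run_in_executor(pool, crunch, payload)
    return digest

digest = asyncio.run(main())
print(digest[:16])
```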
Twisted, which I believe set out to conquer the world by implementing every conceivable network thing, does exactly that. It has TCP and UDP and HTTP and SMTP and IRC and SQL and on and on and on. Not everything, of course, because then it would be an infinite library, but sometimes it's close to being an infinite library. Tornado is much less ambitious than Twisted, I suspect for a reason. It's a web server framework that happens to use asynchronous I/O, so you use it to write HTTP servers. But of course HTTP servers these days spend a lot of their time turning around and making HTTP requests, so there's an HTTP client library in Tornado too. If you think Requests, the library, is the best thing since sliced bread: sorry, you can't use it in your Tornado application, because that I/O does not go through the async framework. These frameworks tend to be all-encompassing. So that's the background on asynchronous I/O. Now, the project where I stumbled across some of the pitfalls of asynchronous programming. This is at work, my day job, and this is one of the products my employer sells. The proposition is: you have stuff on the internet that you care about, and you want to know when bad things, or interesting things, happen. We don't really promise bad things; we promise interesting things that happened to your stuff on the internet. So if you're an ISP, or a fairly big company that has chunks of IP space, you probably have an IP prefix or two or three or 17, and you participate in the global BGP conversation among routers building the global routing table. And if you know about this stuff, you know that BGP hijacks are a thing you don't want to have, and you want to find out when they happen. We'll tell you: we'll raise an alert and tell you.
If you're not in that elite club of BGP people, well, maybe you just have an IP address or five or 10 or 15 that you care about. We'll ping them every couple of minutes, and you'll know when the latency goes up or the loss rate goes up. Or you may have transcended these mere implementation details and moved on to a higher plane of existence: the cloud. And you just care about some AWS availability zone or a DigitalOcean whatchamacallit or a Google Compute thingamabob. That's OK, we'll do the same thing. Under the hood it's the same; we're just pinging an IP address and telling you when the latency goes up, but we call it a cloud zone, so it looks fancy. Anyway, all of this stuff that you have on the internet that you care about is characterized as assets. You can organize your assets into inventories, and then when something interesting happens to one of your assets, whether it's a BGP hijack of a prefix you care about, or ping latency getting high, or the packet loss to AWS in Montreal going through the roof, we'll raise an alert and tell you about it. You can fetch that alert through an API, you can view it in the user interface, we can send you email, we can send you a thingy on PagerDuty or Slack or a couple of other protocols. Alerts are first-class objects. So, surprise, surprise: we store your stuff in a database. It's a relational database, Postgres. The schema is completely boring, obvious, and uninteresting. There's an assets table, an inventories table, an alerts table, a bunch of stuff to tie it all together. Nothing surprising there. Surprise number two: we wrote a service to wrap around this database. It's fairly RESTful, JSON over HTTP, create, read, update, delete, except you can't create alerts; only we can do that internally, obviously. Really boring, not interesting, except it is the subject of this talk. Not the API; you don't care about what POST /assets does.
That's not what I'm talking about; I'm talking about the software, the implementation. So here's the big picture, the architecture of this whole alert system. Stuff goes into the database, and we have a bunch of software that accesses this database through the stuff service. For example, we've got a couple of web interfaces where you can see and update and create and destroy assets and inventories, and you can browse your alerts and do stuff with them. Alert agents: there's an alert agent, roughly speaking, for each category of alerts. So we've got a couple of alert agents that monitor BGP traffic and listen for leaks and hijacks and so forth. We've got other alert agents that do a lot of pings and monitor them for latency spikes and packet-loss spikes. Notification agents send you email, or send you a PagerDuty message, or whatever they're called; I have no idea how PagerDuty works, I didn't implement that agent. And we also let customers interact through an API. This is actually the same code; they're just two instances of it. One of them has authentication and doesn't let you view other people's stuff. The internal one: if you're in the network, you're in, and you can view everybody's stuff, because it's internal. But they all go through to the database. So, in the course of designing, implementing, and deploying this system, we discovered a couple of non-functional requirements. The first one, actually the important one, which we discovered last, is pretty obvious when you draw this diagram: everything depends on the stuff service. If it breaks, lots of things break. So it has to be rock solid, reliable. We discovered this requirement the first time it blew up in our faces in production, on a weekend. The second requirement: web sockets are cool. Do you know what we could do with web sockets, guys? Do you know what we could do?
We could have a reactive web interface, and every time an alert is raised, we could just show it in the user's web browser, because web sockets. That's the requirement that drove the implementation of this project. That is why we used Tornado to implement this service: in anticipation of many concurrent clients doing, I don't know, web-socket-y stuff, because cool. So that's the background. We've nailed down a couple of choices. I've told you it's a relational database; when we use a relational database, it's Postgres, that's just part of our standard technology stack. And we have chosen our server framework, Tornado, because we're going to do web sockets, right? So we have to use Tornado. Yeah, yeah, yeah, it'll be cool. When you have made those two choices, you don't get a third choice about how you're going to access the database layer. You have one choice: a library called Momoko. So, psycopg2 is the most common, most standard, most widely used Python driver for Postgres. Momoko does two things. First, it wraps psycopg2 so that you can use it in a Tornado application with all the I/O going through the event loop, as you're supposed to do. (Except for DNS lookups, but never mind; that's not even in this talk.) And second, it provides a connection pool, because everybody needs a connection pool, so let's throw one in. Now, Momoko is perfectly fine. It does what it says on the label, those two things, and it works, mostly. It's not perfect, because no software is. So here's a rough diagram of our implementation stack: we have our service, which runs on top of Momoko and Tornado and Python, which abstract away psycopg2 and Postgres and the Linux kernel and the real world, all that stuff. So what could possibly go wrong? Have I foreshadowed enough? Over the course of four months, we had three critical outages. Two of them hit in production.
And of course, what are the odds that a critical outage in production will hit on a weekend? Two out of seven, right? Well, of course, both of them came on weekends, naturally. Number one: our event loop was blocked, not by I/O, not by a rogue database query or an NFS file, but by CPU activity, that thing they don't tell you about in the documentation. Number two: we had a deadlock in the connection pool in Momoko. Number three: it blew up in our faces because we tried to open more database connections than Postgres was configured for. So, number one. My boss gets a call, or my boss's boss gets a call, on the weekend. Oh my God, oh my God, everything is broken, nothing is working, the stuff service is hanging, what are we gonna do? For some reason they didn't interrupt my weekend, which was really nice of them; I really appreciate that. I like working for these people. Monday morning, bright and early, I find out about this outage, and I very quickly learn about a great Tornado feature. I can't remember if it's one line of code or just configuration, but we turned it on pretty damn quick: you can get Tornado to detect when the event loop is blocked and immediately log a stack trace. Which is very handy, because I'm sure you all have the same dedicated neural circuitry in your brain that recognizes a stack trace in a log file faster than you recognize your own mother's face. So combine that dedicated neural circuitry with a feature that logs a stack trace whenever the interesting bad thing happens, and we found the guilty code pretty quickly. Of course, I was thinking: oh my God, it's blocked on I/O, we're doing I/O, because that's what all the manuals tell you not to do. But when you have a stack trace, it's hard to ignore the truth, which is that your application was blocked for three or four seconds emitting JSON. It's like, what the hell, you know?
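The Tornado feature in question is, I believe, `IOLoop.set_blocking_log_threshold()`, and my understanding is that it is built on SIGALRM. Here is a stdlib-only sketch of the same idea, POSIX only; the names (`run_callback`, `slow_json_like_work`) are my own, not Tornado's:

```python
import signal
import time
import traceback

captured = []

def alarm_handler(signum, frame):
    # Fires if a single callback runs longer than the threshold:
    # grab the stack so the log shows exactly what was blocking.
    captured.append("".join(traceback.format_stack(frame)))

signal.signal(signal.SIGALRM, alarm_handler)

def run_callback(callback, threshold=0.05):
    # Arm a one-shot timer around each event-loop callback, roughly
    # the way Tornado's blocking-log feature does (as I understand it).
    signal.setitimer(signal.ITIMER_REAL, threshold)
    try:
        callback()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm

def slow_json_like_work():
    time.sleep(0.2)  # stand-in for dumping 100 MB of JSON

run_callback(slow_json_like_work)
print("blocked-callback stack traces captured:", len(captured))
```

The captured stack names the offending function, which is exactly why the logged trace made the guilty code so easy to find.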
I mean, who thinks about the CPU overhead of emitting JSON, right? You're just iterating over some nested lists and dictionaries and spitting out 1,000 bytes or 10,000 bytes, whatever. Maybe it takes a microsecond, maybe it takes 10 microseconds. Well, when it's 100 megabytes of JSON, it actually takes a long time, many seconds. And when you block your event loop for many seconds generating 100 megabytes of JSON, and your web application freezes up for all users, you notice, unfortunately. So we quickly deployed a mitigation, not a fix. Another good Tornado feature lets you run with multiple worker processes. Previously we were just using a single process, a single thread. You can use multiple processes. (Multiple threads would be useless here because of the global interpreter lock, so it has to be multiple processes.) So now, when we hit this design flaw, we only block one of the N worker processes. The others continue working, and hopefully we have enough unblocked worker processes that most clients can continue to get stuff done. Note: it's not fixed. It's only mitigated. Outage number two. Nobody's weekend was interrupted; this was in the middle of the day. Hey, Greg, there's a funny thing happening: the stuff service is frozen. It's just completely wedged. Nothing works, everything is busted, until I restart it. Yeah, okay, pull the other one. A couple of days later, the same thing happens again. Okay, fine, I'll look at it this time. So I looked. This was in our development environment, not in production, so I double-checked: have we got that log-a-stack-trace option turned on? Yeah. Are we doing any dumb I/O? I don't think so. What's going on? So I dug and I dug and I dug. It took a while to get to the bottom of it, but it turns out it was a deadlock. The deadlock was in Momoko, with itself, trying to fetch a connection from the pool. This is obviously something that happens very frequently, on every request.
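An actual fix, as opposed to the multi-process mitigation, would be to generate the JSON cooperatively. This sketch is my own illustration, not what we deployed: the standard `json.JSONEncoder.iterencode()` yields the document in small chunks, so a coroutine can hand control back to the event loop every so often instead of blocking for the whole 100 MB.

```python
import asyncio
import json

async def dump_json_cooperatively(obj, chunk_budget=1000):
    # iterencode yields the JSON document in small string chunks;
    # yielding to the event loop every N chunks keeps other clients
    # responsive even while a huge response is being serialized.
    parts = []
    for i, chunk in enumerate(json.JSONEncoder().iterencode(obj)):
        parts.append(chunk)
        if i % chunk_budget == 0:
            await asyncio.sleep(0)  # let other callbacks run
    return "".join(parts)

# A large-ish payload standing in for the real 100 MB response.
big = [{"id": i, "name": f"asset-{i}"} for i in range(50_000)]
body = asyncio.run(dump_json_cooperatively(big))
print(len(body))
```

The output is byte-for-byte the same JSON; the difference is only how often the loop gets a turn in between.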
So this is weird. Before this, I had a bit of experience with async I/O programming: I used Twisted for a small project about 10 years ago, but not a huge amount of experience. It just never occurred to me that deadlocks could happen. I mean, I have a fair amount of experience with multithreading in Java, a bit of experience with multithreading in Python, and also a fair amount of database experience. So I've seen deadlocks in Java, and I've seen deadlocks in database programming. That's where I expect to see deadlocks. I just never expected a deadlock with async I/O, because I'd never seen one. I'd never seen a blog post about one. I'd never seen manuals talking about one. It just wasn't a thing, as far as I knew. But it is a thing. It's concurrent programming: anything that can go wrong will go wrong. If you can concoct some crazy scenario under which a deadlock will happen, yeah, it'll happen. But I didn't know about it. And I don't remember the details, but I do remember thinking: I don't think this could happen with synchronous programming. (Does that mean I have 10 seconds left? Okay, thank you.) So I think it's specific to async I/O. Not sure; I never did figure it out. The other funny thing about this deadlock was that it would not have happened except for a half-finished refactoring. I thought the refactoring was harmless. I knew I had to finish it sooner or later, but I released it half-finished, and the deadlock came up. I quickly finished the refactoring, the problem went away, and I never really investigated deeply. I never did figure out if this was purely my bug, or purely a Momoko bug, or something I was doing wrong that tickled a bad code path in Momoko. I don't know. Anyway: deadlocks can happen in the funniest places. Outage number three: in production, on a weekend. My boss's boss gets a phone call. Oh my God, oh my God, everything's broken.
Nobody can connect to the user admin database, because everything is broken. Oh my God, what do we do? So, dirty little secret here. In a diagram a few pages back, I showed you the stuff database. There is no stuff database. We just threw a bunch of new tables into the user admin database, because it's the user admin database, let's throw more stuff in there, it's got everything already, right? Honestly, that decision was made before I joined the project and I have no idea what the rationale was. Expedience, I assume. Anyway, that's how it is. So there were two proximate causes for this outage. One was my own damn fault. Some people will tell you, and they're probably right, that capacity planning is a dark art that you must study in an ivory tower for many years to learn how to do. That might be true of some capacity planning. But this kind of capacity planning, where you have two services running on two servers, with one worker process per core, 24 cores each, and you've configured it for 20 database connections max: that's 960 max database connections, and you've configured Postgres for 500 total connections. You should very quickly realize that 960 is greater than 500 and you are cruising for a bruising. Sadly, I failed to do that very, very elementary capacity planning, and it blew up in our faces in production, on a weekend. Oops. They didn't fire me, thankfully. Proximate cause number two is a missing feature in version one of Momoko: that old version never closed connections, never shrank its pool. So when there's a load spike, the pool stays at its high-water mark, which is unfortunate. Also, Tornado behaves somewhat similarly with its worker processes: they are forked once at startup and live forever, which is an inflexible, inelastic model compared to something like Apache, which has a lot more exposure, a lot more use, a lot more users, a lot more people hacking on it, a lot more features.
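That "very, very elementary capacity planning" is literally three multiplications. The numbers below are the ones from the talk:

```python
# Back-of-the-envelope capacity check for outage number three.
servers = 2              # two services running on two servers
workers_per_server = 24  # one worker process per core, 24 cores each
conns_per_worker = 20    # connection-pool maximum per worker

worst_case = servers * workers_per_server * conns_per_worker
postgres_max = 500       # Postgres max_connections

print(f"worst case {worst_case} vs Postgres limit {postgres_max}")
# 960 > 500: under a load spike, workers will fail to get connections.
print("over-subscribed:", worst_case > postgres_max)
```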
Less widely used software has less battle testing and fewer features. So we mitigated this problem in the obvious way. We don't need one worker for each of 24 cores; we don't need up to 24 database connections per worker; and maybe we could double Postgres's max connections from 500 to 1,000. So we raised one number and lowered the other, and the problem went away. We also upgraded to Momoko 2.0, which does have an auto-shrink feature, because the software had been around longer and somebody got around to implementing that feature. A non-trivial upgrade, unfortunately, but there you go, it got done. So, now what? We've had three critical outages, two affecting production, over four months. It's probably a good time to stand back and think about what the hell is going on here. At first glance, all of these outages boil down to: gee, I didn't think that could happen. I didn't know that could happen, like the deadlock, or 100-megabyte JSON responses burning lots of CPU. Or: I didn't think that would affect us. Maybe the JSON response falls into that category, I don't know. Things we didn't know when we deployed. And I like to pin this all on one mistake, which is the use of unfamiliar technology in critical infrastructure. Oh, and my failure to do basic capacity planning, but forget about that. Just do your capacity planning, boys and girls; don't do like I did. Now let's talk about the use of unfamiliar technology in critical infrastructure. I am in no way opposed to unfamiliar technology. If we stop experimenting and learning, we're dead. If nobody ever used unfamiliar technology, we'd all still be using FORTRAN and COBOL. No, we'd all still be skinning animals with stone tools. Stone tools were unfamiliar technology once, too. As for critical infrastructure: as soon as you have paying customers, I think it's tautological. If you've got paying customers, you've got critical infrastructure.
You can't really avoid it, unless you're in that lucky stage where you don't have any paying customers yet. But you really have to try to minimize it. Keep it small, keep it tight, keep its surface area constrained; don't let it grow without bounds. And be very, very careful about combining unfamiliar technology with critical infrastructure. If you feel like playing around with asynchronous I/O because you've never used it, by all means, go for it. It's a great technology with some downsides. The problems that it sets out to solve, it solves pretty well. The problems that it has are known and documented and understood. But use it out on the fringes of your dependency graph, and I'm going to try to stick to this rule myself. Don't put it at the core of your dependency graph, where everything depends on it. Put it where failure of your new experimental component will only break one feature or a few users, not where it will break everything. So, you remember that non-functional requirement, that web sockets are cool and we should do a reactive user interface? Well, we should have built a dedicated service just for that. By the way: shortly after this, I think while this was being implemented, before it went into production, the company that originally implemented it got acquired, priorities shifted, people moved on. And we never actually implemented that reactive web UI. We never did anything with web sockets. It was all a mirage. We never needed Tornado in the first place. So we paid all the costs of asynchronous I/O and got none of the benefits, because we emphasized the wrong damn requirement. So please be aware of your requirements, whether they are documented or not, whether they are functional or non-functional. Be aware that, whether you prioritize them explicitly or not, you are assigning priorities to requirements. And when you pick your implementation framework based on some vague, fuzzy idea, you are prioritizing one requirement and deprioritizing others.
Whether you make that an explicit decision or not; I would recommend making it an explicit decision and being aware of what's most important. That is all. Thank you. Question. Momoko? Do I know about it? Yes, because the last time I gave this talk, somebody mentioned it from the audience. It's Python 3 only, and we are just starting to creep up to Python 2.7. And I believe it only works with the thing in the standard library, asyncio. So maybe someday in the future, but I hope not, because I would rather re-implement this thing right. Other questions? Yes? Well, it's funny you should mention Locust, because some months after this I persuaded my boss that we really needed to do some load testing. So I spent a week or two doing load testing and didn't find any bugs. I didn't find a breaking point, which means I didn't finish the load testing: I pushed it as far as I could and it didn't fall over. Okay, great. Part of the problem was that we already knew about the unbounded JSON response by this time, and the load testing was more about concurrent load. It was not about "all right, let's make this one response completely huge." So I don't know; I don't think it would have helped. Yes? Well, it depends on what framework you're using. If you're writing it, yeah. Yes, yes: when you drink the Kool-Aid, or swallow the red pill, of asynchronous programming, you have to go all the way. That's the thing about asynchronous I/O: you can't drink half the Kool-Aid. You have to go all the way down the rabbit hole. I don't know; I've never used Node.js. Yes? Yeah, Django is for synchronous I/O. You can use Django or you can use Tornado, not both. Well, I don't know that much about Django; I've only dabbled with it. Did anybody hand you a pitcher of Kool-Aid that said asynchronous I/O? I don't think so.
I think Django is just traditional, normal, ordinary multithreaded or multi-process synchronous I/O. Yeah. No, I don't know if we could reproduce that in a single process. Yes? Tornado is all about coroutines. I don't know how Flask does that. It's simpler than that: they've abstracted away a lot of it, so it's relatively simple and elegant, and you don't have to deal with callback hell in Tornado. But callbacks are happening under the hood. So, I mean, Tornado is really, really nice. It is a very, very pleasant framework to use, but it was the wrong tool for this job. No, not in my talk: tornadoweb.org. I mean, it works. It's a great framework. It's just that, you know, a screwdriver is a fantastically elegant tool, unless you're trying to pound a nail into the wall. Yes? No, we still have that problem, fundamentally. And I think the fix is... it's not parsing. If you think about it: parsing 100 megabytes of JSON, sure, that's gonna hurt, but who thinks about generating JSON? Well, I do now, but I didn't before it blew up in my face. This is all about generating JSON. I did think about it: another thread wouldn't be good enough, because it's all CPU. It would have to be in another process. And then you've got a 100-megabyte string that you have to get from one process to the other. I'm sure there are ways to do that, but to me that feels like solving the wrong problem. And no, I admit, we haven't solved that problem yet. I think I know how to; we just haven't. Yeah: Pyramid and SQLAlchemy, just because they're the standard answers in our organization. Yeah, they work. They're the things we know how to use in our particular ecosystem. [Inaudible question about testing.] I don't. It makes testing harder. No: the project I did in Twisted 10 years ago was a couple hundred lines of code over two or three weeks. It was a tiny little thing.
Whereas this beast is 10,000 lines of code, plus tests, with person-years in it. That's not a wild guess. I need dinner.