Awesome. Hello. Welcome. I am Lynn Root. I work for Spotify, and basically I'm a site reliability engineer at Spotify, and what that means is I either break our entire service or get paid to fix it when other people do. I say "at Spotify" because what an SRE does varies widely among different companies. For me it's a combination of backend development, where my team and I run a few services that other engineers use daily, plus a little devops and sysadmin. I am also our FOSS evangelist: I help a lot of teams release their projects and tools under the Spotify GitHub organization. And lastly, I help lead PyLadies, which is a global mentorship group for women and friends in the Python community. And if you want stickers, I have stickers to give away. Just find me afterwards. So before I start, I want to warn you all that I will take all the time allotted for this talk, so there's going to be no time for Q&A. You might think I've purposely done this to avoid Q&A, but I will be here afterwards for the whole conference, so you can just find me and chat. There will also be a link at the end to the whole notebook and the example code that I use. All right, so let's get started. asyncio: the concurrent Python programmer's dream, I guess, the answer to everyone's asynchronous prayers. The asyncio module has various layers of abstraction, allowing developers as much control as they need and are comfortable with. We have simple hello world examples that make it look so effortless, but it's easy to get lulled into a false sense of security. This isn't exactly that helpful, right? It's all fake news. We're led to believe that we're able to do a lot with the async and await API layer. Some tutorials, while great for getting developers' toes wet, try to illustrate real-world examples, but they're just beefed-up hello world examples. Some even misuse parts of asyncio's interface, allowing one to easily fall into the depths of callback hell.
Some get you up and running easily with asyncio, but then you might realize that it's not correct, or not exactly what you want, or it only gets you part of the way there. And while some tutorials and walkthroughs do a lot to improve upon the basic hello world use case, sometimes they're just a basic web crawler. And I don't know about you, but at Spotify, I'm not building web crawlers. But in general, asynchronous programming is just difficult. Whether you use asyncio or Twisted or Tornado, or even Go or Erlang or Haskell, it's just difficult. And so my team at Spotify, which is mostly just me, fell into this false sense of ease that the asyncio community builds. The past couple of services that we built, we felt, were good candidates for asyncio. One of them was a Chaos Monkey-like service for restarting instances at random, and another is an event-driven hostname generation service for our DNS infrastructure. So sure, we needed to make a lot of HTTP requests that should be non-blocking, but these are services that needed to react to pub/sub events, measure the progress of the actions initiated from those events, handle any incomplete actions or external errors, deal with the whole pub/sub message lease management, and measure service level indicators and send metrics. And then we also needed to use some non-asyncio-friendly dependencies. So it got difficult quick. So allow me to provide you with a real-world example that actually comes from the real world. If you get the pun: we're building a chaos monkey, and a mandrill is a monkey. We did build a service that does periodic restarts of our entire fleet of instances at Spotify, and we're going to do that here. We're going to build a service called mayhem-mandrill, which will listen for a pub/sub message and restart a host based off of that message. As we build the service, I'll point out traps that I may or may not have fallen into.
And this will essentially become the resource that I would have liked about a year ago. So at Spotify, we do use a lot of Google products, in this case Google Pub/Sub, but there are a lot of choices out there, and we're just going to simulate a simple pub/sub-like technology with asyncio. This part is quite easy and fun, I guess. And this is where we're starting off: a very simple publisher. We're creating a set number of messages and adding them to the queue, and then we are consuming from that queue. And it's very easy to run, especially with the latest Python 3.7 syntactic sugar. So when we run this, we see that we are able to publish and consume messages, and we're going to work off of this. As you might notice, oops, a little teaser: we haven't actually built a running service. We merely have a pipeline or a batch job right now. So in order to continuously run, we have to use loop.run_forever(). For this, we have to schedule and create tasks out of coroutines and then start the loop. And since we created and started the loop, we should clean it up too. So when we run this updated code, we get this nice little traceback, and then it kind of hangs, so we have to cancel it. You have to interrupt it. So yeah, that's nice and ugly, right? We should probably try to clean that up and run it a bit defensively. We'll first address the exceptions that arise from coroutines. So we'll just go ahead and fake an error, so that the fourth message, say, raises one. If you run it as is, we do get an error line, and it says "exception was never retrieved". And admittedly, this is a part of the asyncio API that's not very friendly. If this were synchronous code, we'd simply get the error that was raised, but here it gets swallowed up in an unretrieved task.
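The starting point described above can be sketched like this. This is a minimal, hypothetical reconstruction (the names publish, consume, and the message format are placeholders, not the talk's exact code), using asyncio.Queue and the Python 3.7 asyncio.run shorthand:

```python
import asyncio
import random
import string


async def publish(queue, n):
    # Publish a set number of fake "restart this host" messages onto the queue.
    for _ in range(n):
        host = "".join(random.choices(string.ascii_lowercase, k=4))
        await queue.put(host)
        print(f"Published message for {host}")


async def consume(queue):
    # Drain the queue, returning the messages we handled.
    consumed = []
    while not queue.empty():
        msg = await queue.get()
        print(f"Consumed message for {msg}")
        consumed.append(msg)
    return consumed


async def main():
    queue = asyncio.Queue()
    await publish(queue, 5)
    return await consume(queue)


# asyncio.run is the Python 3.7+ sugar that creates, runs, and closes the loop.
consumed = asyncio.run(main())
```

Note that, exactly as the talk points out, this is a batch job: publish fully completes before consume starts, and the program exits once the queue is drained.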
So to deal with this, as advised by the docs, we'll need a wrapper around the coroutine to consume the exception and stop the loop. So we'll make a little top-level wrapper that handles the exceptions of the coroutines, and when we run our script, we get something a little bit cleaner. Let me stop right here and quickly review. So far, setting up an asyncio service: we want to surface exceptions so that we can retrieve them, and we want to clean up what we've created. And I will expand on both of these parts a bit later, but it's clean enough for now. Now, we've seen that a lot of tutorials make use of the async and await keywords, and while they're not blocking the event loop, we're still literally iterating through tasks serially, effectively not adding any concurrency. So if we take a look at our script now, we're serially processing each item that we produce and then consume. Even though the event loop isn't blocked, so any other tasks and coroutines going on wouldn't be blocked, we are blocking ourselves. This might be obvious to some, but it isn't to all. We first produce all the messages one by one, and then we consume them one by one. With the loops that we have within the publish and consume coroutines, we block ourselves from moving on to the next message while we await to do something. So while this is technically a working example of a pub/sub-like queue with asyncio, it's not really what we want. We're here to build an event-driven service, or maybe even a batch or pipeline job, and we're not really taking advantage of the concurrency that asyncio can provide. So as an aside, I find asyncio's API actually quite user-friendly, despite what some people might think. It's very easy to get up and running with that event loop.
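The top-level wrapper idea can be sketched as follows. This is an assumption-laden sketch, not the talk's exact code: might_fail and the "fourth message fails" behavior are invented stand-ins, and where the talk's wrapper stops the loop, this bounded demo just records the exception so the effect is visible:

```python
import asyncio

caught = []  # record of exceptions the wrapper retrieved


async def might_fail(msg_id):
    # Fake an error: pretend the fourth message blows up.
    if msg_id == 4:
        raise RuntimeError(f"could not process message {msg_id}")
    return msg_id


async def handle_exception(coro):
    # Top-level wrapper: retrieve the exception here instead of letting it
    # get swallowed inside an unretrieved task. A real service would also
    # stop the loop (or trigger shutdown) at this point.
    try:
        return await coro
    except Exception as e:
        caught.append(e)
        print(f"Caught exception: {e}")


async def main():
    tasks = [asyncio.create_task(handle_exception(might_fail(i)))
             for i in range(1, 6)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
```

Without the wrapper, the error would only surface as "Task exception was never retrieved" when the task is garbage collected; with it, the failure is logged immediately and the other messages still complete.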
When first picking up concurrency, the async and await syntax makes for a very low hurdle to start using asyncio, since it makes the code look very similar to writing synchronous code. But again, this is picking up concurrency: the API is deceptive and misleading. Yes, we are using the event loop and primitives. Yes, it does work. Yes, it might seem faster, but that's probably because you came from 2.7. Welcome to 2014, by the way. To illustrate that there's no difference from synchronous code, here is the same script with the asyncio primitives removed, using just synchronous code. And you can see, just looking at the consumer, there's no real difference other than a couple of awaits. And when we run it, it behaves pretty much the same; the only difference is the randomness part. So part of the problem could be that documentation and tutorial writers are presuming knowledge and the ability to extrapolate from oversimplified examples. But it's mainly because concurrency is just a difficult paradigm to grasp in general. We write our code as we read anything: left to right, top to bottom. Most of us are just not used to the multitasking and context switching that our modern computers allow us. Hell, even if we are familiar with concurrent programming, understanding a concurrent system is just hard. But we're not in over our heads yet. We can still make this simulated chaos monkey service actually concurrent in a rather simple way. So to reiterate our goal here: we want to build an event-driven service that consumes from pub/sub and processes messages as they come in. We could get thousands of messages a second, so as we get a message, we shouldn't block the handling of the next message we receive. To help facilitate this, we will also need to build a service that actually runs forever. We're not going to have a preset number of messages; we'll need to react whenever we're told to restart an instance.
And so the triggering event to publish a restart request could be an on-demand request from a service owner, or it could be a scheduled, gradually rolling restart of the fleet. You don't know. So we'll first mock the publisher to always be publishing restart message requests, and therefore never indicate that it's done. This also means that we're not providing a set number of messages to publish, so I had to rework this function. Here I'm just adding the creation of a unique ID for each message that's produced. So when running it, it happily produces messages. But you might notice that there is a KeyboardInterrupt exception triggered by the ctrl-C, and we don't actually catch that. So we can quickly clean that up. This is just a band-aid, and I'll explain that further on. But now we see something much cleaner. So it's probably hard to see why this is concurrent right now. To help, we're going to add multiple producers to see the concurrency. For that publish function, I'm going to add a publisher ID and have it in our log messages, and then create three publishers real quick. And then when we run, we can see that we have a bunch of publishers going on concurrently. For the rest of the walkthrough, I'm actually going to remove those multiple publishers; I don't want to confuse anything. Now on to the consumer bit. The goal here is to constantly consume messages from a queue and to create non-blocking work based off a newly consumed message, in this case to restart an instance. And the tricky part is that the consumer needs to be written in a way that the consumption of a message from the queue is separate from the work that happens for that message. In other words, we have to simulate being event-driven by regularly pulling messages from the queue, since there's no way to trigger work based off of a new message being available in that queue. There's no way to be push-based.
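The reworked forever-publisher can be sketched like this. Again a hypothetical sketch (the 0.01-second sleep standing in for the randomness, and the demo's cancel-after-a-while harness, are my additions so the example terminates):

```python
import asyncio
import uuid


async def publish(queue):
    # Publish forever: no preset message count, and each message gets a
    # unique ID, just like a real restart-request stream would.
    while True:
        msg_id = str(uuid.uuid4())
        await queue.put(msg_id)
        await asyncio.sleep(0.01)  # stand-in for randomly timed events


async def main():
    queue = asyncio.Queue()
    publisher = asyncio.create_task(publish(queue))
    await asyncio.sleep(0.1)   # let it produce for a little while
    publisher.cancel()         # stop the "forever" publisher for this demo
    return queue.qsize()


produced = asyncio.run(main())
print(f"produced {produced} messages before cancelling")
```

In the real service there is no cancel-after-0.1-seconds; the loop runs forever and shutdown is handled by signals, which comes up later in the talk.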
So we'll first mock the restart work that needs to happen whenever we consume a message. We'll stick a while True loop in the consumer, await the next message on the queue, and then pass it off to restart_host. And then we'll just add it to our loop. And when we run it, we see that messages are being pulled and hosts restarted. We may want to do more than one thing per message. For example, we might want to store the message in a database, for potentially replaying later, as we initiate a restart of a given host. So within the consume function, we could just await both coroutines, and we'll see that it happens just fine: the message is both saved and the host restarted. But we still kind of block the consumption of messages, and we don't necessarily need to await one coroutine after another. These two tasks don't necessarily depend upon one another. Completely sidestepping the potential concern of whether we should restart a host if we haven't saved the message to the database (that's for another time), we can treat them as independent. So instead of awaiting them, we can create tasks to have them scheduled on the loop, basically chucking them over to the loop for it to execute when it can. And so now we have restart and save happening not necessarily serially, but whenever the loop can execute the coroutines. As an aside, sometimes you do want your work to happen serially. Maybe you restart hosts that have an uptime of more than seven days, or maybe you want to check the balance of an account before you debit it. Needing code to be serial, or having steps or dependencies, doesn't mean that you can't be asynchronous. The await on last_restart_date will yield to the loop, but that doesn't mean that restart_host will be the next thing that the loop executes. It just allows other things to happen outside that coroutine. And yes, I admit this was a thing that wasn't immediately apparent to me at first. So: we pulled a message from the queue, and we fanned out work based off of that message.
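The create-tasks-instead-of-awaiting step might look like the sketch below. The names restart_host and save come from the talk; the sleeps, the five-message harness, and the bookkeeping lists are invented for demonstration:

```python
import asyncio

restarted, saved = [], []


async def restart_host(msg):
    await asyncio.sleep(0.01)  # simulate the restart taking some time
    restarted.append(msg)


async def save(msg):
    await asyncio.sleep(0.01)  # simulate a database write
    saved.append(msg)


async def consume(queue):
    while True:
        msg = await queue.get()
        # Don't await serially: chuck both coroutines over to the loop as
        # tasks so the next message isn't blocked behind this one's work.
        asyncio.create_task(save(msg))
        asyncio.create_task(restart_host(msg))
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(f"host-{i}")
    consumer = asyncio.create_task(consume(queue))
    await queue.join()         # wait until every message has been pulled
    await asyncio.sleep(0.05)  # let the fanned-out tasks finish
    consumer.cancel()


asyncio.run(main())
```

The key design point: consume only pulls and schedules, so a slow restart on one host never delays pulling the next message off the queue.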
We now need to perform any finalizing work on that message. For example, we might need to acknowledge the message so it's not re-delivered. We'll separate out the pulling of the message from creating work off of it, and then we can make use of asyncio.gather to add a callback. So when we run it, once both the save coroutine and the restart coroutine are complete, the cleanup will actually be called, and that signifies that the message is done. However, I'm a bit allergic to callbacks, and perhaps we need the cleanup to be non-blocking, so then we can just await it. There we go. Now, much like Google Pub/Sub, let's say that the publisher will re-deliver a message after 10 seconds if it has not yet been acknowledged, but we are able to extend that message deadline. In order to do that, we have to have a coroutine that in essence monitors all the other worker tasks: while we are continuing to do work, this coroutine will extend the message acknowledgement deadline, and then once we're done, we should stop extending the deadline and clean up the message. So one approach is to make use of asyncio's Event primitive, where we can create an event, pass it to our extend coroutine function, and then set it when we're done. And you can see that it keeps extending, and then stops extending when the message is actually done. And if you really like events, you can make use of event.wait() and move the cleanup outside. And so now we've got a little bit of concurrency going on. To review real quick: asyncio is pretty easy to use, but that doesn't automatically mean that you're using it correctly. You can't just throw async and await keywords around blocking code. It's actually a shift in your mental paradigm: you need to think about what work can be farmed out while you do something else, and then you have to think about what dependencies are there and where your code might still need to be sequential.
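The Event-based deadline extension can be sketched as below. This is a compressed, hypothetical version: the 10-second redelivery window is shrunk to hundredths of a second, and extends/log lists stand in for real Pub/Sub deadline calls and acks:

```python
import asyncio


async def extend(event, extends):
    # Keep extending the ack deadline until the work signals it's done.
    while not event.is_set():
        extends.append("extended deadline")
        await asyncio.sleep(0.01)  # stand-in for the redelivery window


async def cleanup(event, log):
    # The event.wait() variant: only ack the message once work is done.
    await event.wait()
    log.append("acked message")


async def handle_message():
    event = asyncio.Event()
    extends, log = [], []
    asyncio.create_task(extend(event, extends))
    asyncio.create_task(cleanup(event, log))
    await asyncio.sleep(0.05)  # simulate the save + restart work
    event.set()                # done: stop extending, trigger cleanup
    await asyncio.sleep(0.02)  # give cleanup a chance to run
    return extends, log


extends, log = asyncio.run(handle_message())
print(f"{len(extends)} extensions, then {log}")
```

The event is the coordination point: extend spins while it's unset, and cleanup is parked on event.wait() until the worker sets it.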
But having steps in your code, like first A and then B and then C, might seem like it's blocking when it's not. Sequential code can still be asynchronous. For instance, I might have to call customer service at some point, but I'm going to be on hold for a while, so I can just put it on speakerphone and then go play with my super-needy cat. So I might be single-threaded as a person, but I can definitely multitask like CPUs do. Earlier we added try/except/finally around our main event loop code, but you probably want your service to gracefully shut down if it receives a signal of some sort: cleaning up open database connections, stopping message consumption, and finishing responding to current requests while not accepting new ones. So if we happen to restart an instance of our own service, we should clean up the mess before we exit. And we've been catching the commonly known KeyboardInterrupt exception, like many other tutorials and libraries, but there are other signals that we should be aware of. Typical ones are SIGHUP, SIGQUIT, and SIGTERM. There are also SIGKILL and SIGSTOP, but those can't be caught, blocked, or ignored. So if we run our current script as is and give it a TERM signal, we find ourselves not actually entering that finally clause where we log and clean everything up. So we've basically got to be aware of where those exceptions happen. I also want to point out that even though we're only ever expecting KeyboardInterrupt, it could happen outside of where we catch the exception, potentially causing the service to end up in an incomplete or otherwise unknown state. So instead of catching KeyboardInterrupt, let's attach a signal handler to the loop. First we'll define the shutdown behavior that we want: we'll simulate closing database connections, returning messages to pub/sub as not-acknowledged so that they can be re-delivered and not just dropped, and actually cancelling tasks.
We don't necessarily need to cancel pending tasks; we could just collect them and allow them to finish. It's up to what we want to do. We also might want to take this opportunity to flush any collected metrics so they're not lost. So now we need to hook this up to our main event loop. I also removed the KeyboardInterrupt catch, since it's now taken care of within the signal handling. So we run this again and send it the TERM signal, and it looks like it cleaned up, but you see that we have this "caught exception" error twice. This is because awaiting cancelled tasks will raise asyncio.CancelledError, which is to be expected, and we can add that to our little handle-exception wrapper as well. So if we run it, we actually see that our coroutines are being cancelled, and not just some random CancelledError exception. So you might be wondering which signals you should care about. Apparently there is no standard. Basically, you should be aware of how you're running your service and handle them accordingly. Also, as a heads up, another misleading API in asyncio is shield. The docs say that it's meant to shield a future from cancellation (sorry, I have a core dev right here), but if you have a coroutine that must not be cancelled during shutdown, shield will not help you. This is because the task that shield creates gets included in asyncio.all_tasks(), and therefore receives the cancellation signal just like the rest of the tasks. To help illustrate, I have a simple async function with a long sleep that we want to shield. And when we run it and cancel it before the 60 seconds are up, we see that we never hit the "done" line and that it's immediately cancelled. So, yeah. TL;DR: we don't actually have nurseries in asyncio core to clean up after ourselves. It's up to us to be responsible: closing up the connections and files that we open, responding to outstanding requests, and basically leaving things how we found them.
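Attaching the signal handlers to the loop might look like the sketch below. This is an assumption-heavy demo, not the talk's exact code: the shutdown body only records the signal and cancels tasks (a real one would also close database connections, nack messages, and flush metrics), the script sends itself SIGTERM so the example terminates, and loop.add_signal_handler is POSIX-only, so this sketch assumes Linux or macOS:

```python
import asyncio
import os
import signal

cleaned_up = []  # which signal triggered shutdown


async def shutdown(signal_name, loop):
    # Real shutdown work would go here: close DB connections, return
    # unacked messages to pub/sub, flush metrics. Then cancel tasks.
    cleaned_up.append(signal_name)
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    # Awaiting cancelled tasks raises CancelledError; return_exceptions
    # collects those instead of blowing up the shutdown coroutine.
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()


async def work_forever():
    while True:
        await asyncio.sleep(0.01)


def main():
    loop = asyncio.new_event_loop()
    for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(
            sig, lambda s=sig: loop.create_task(shutdown(s.name, loop))
        )
    loop.create_task(work_forever())
    # Simulate someone sending us SIGTERM shortly after startup.
    loop.call_later(0.05, os.kill, os.getpid(), signal.SIGTERM)
    try:
        loop.run_forever()
    finally:
        loop.close()


main()
print(f"shut down cleanly on {cleaned_up}")
```

Because the handler is registered on the loop itself, shutdown runs no matter where in the code the signal arrives, which is exactly what the finally clause couldn't guarantee.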
So doing our cleanup in a finally clause isn't enough, since a signal could be sent outside of the try/except clause. As we construct the loop, we should tell it how it should be deconstructed, as soon as possible. That ensures that all of our bases are covered and we're not leaving any artifacts around. And finally, we need to be aware of when our program should shut down, which is closely tied to how we run our program. If it's just a manual script, then SIGINT is fine. If it's a daemonized Docker container, then SIGTERM is probably more appropriate. You may have noticed that we're not actually catching exceptions within restart_host and save, just at the top level. So to show you what I mean, we're going to fake an error where we can't restart a certain host. Running it, we see that a host can't be restarted, and while the service did not crash, and it did save to the database, it did not clean up or ack the message, and the extend on the message deadline will keep spinning. So we've effectively deadlocked on the message. A simple thing to do is to add return_exceptions=True to our asyncio.gather, so that rather than completely dropping an exception, it's returned along with our successful results. However, you can't really see what actually errored out. So what we could do is add a callback, but as I said, I'm allergic, so we can just add a little helper function to process the results afterwards. And when we use something like this, we see that errors are now logged and we can handle them appropriately. So a quick review: exceptions do not crash the system in asyncio programs, unlike in non-asyncio programs, and so they might go unnoticed. We need to account for that. And I personally like using asyncio.gather because the order of the returned results is deterministic, but it's easy to get tripped up with it. By default, it will swallow exceptions while happily continuing to work on other tasks.
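The return_exceptions=True pattern plus a results-processing helper might look like this. A hypothetical sketch: the failing host-3 and the errors list are invented stand-ins for the talk's fake restart error and its logging:

```python
import asyncio

errors = []  # what the results-processing helper collects


async def restart_host(msg):
    # Fake an error: one particular host can't be restarted.
    if msg == "host-3":
        raise RuntimeError(f"could not restart {msg}")
    await asyncio.sleep(0.01)
    return f"restarted {msg}"


async def save(msg):
    await asyncio.sleep(0.01)
    return f"saved {msg}"


async def handle_message(msg):
    # return_exceptions=True: a failure comes back as a result in order,
    # instead of being swallowed or tearing down the sibling coroutine.
    results = await asyncio.gather(
        save(msg), restart_host(msg), return_exceptions=True
    )
    # The callback-free helper step: inspect results and log the errors.
    for result in results:
        if isinstance(result, Exception):
            errors.append(result)
            print(f"Caught error: {result}")


async def main():
    await asyncio.gather(*(handle_message(f"host-{i}") for i in range(5)))


asyncio.run(main())
```

Because gather's result order matches the argument order, you always know which position held the restart versus the save, which is what makes post-processing the results list workable.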
And if an exception is never returned, then weird behavior can happen, like spinning forever on an event that never gets set. All right. So I'm sure, folks, as you started using asyncio, you might have realized that async and await start infecting the rest of your code base. Everything needs to be async. And it's not necessarily a bad thing; it just forces a shift in perspective. So for our code to work with a synchronous dependency, we need to rework our consumer a bit. Not much is needed, actually: I'm still making use of an asyncio consume coroutine to call a non-async consumer, and using a thread pool executor to run that code. As an aside, there's actually a handy little package called asyncio-extras which provides a decorator for synchronous functions; it removes that boilerplate for you, and you can just await the decorated function. But sometimes third-party code throws a wrench at you. If you're unlucky, you're faced with a third-party library that is multithreaded and blocking. For example, Google Pub/Sub's Python library makes use of gRPC under the hood with threading, but it also blocks when opening up a subscription, and it requires a non-async callback for when a message is received. So in typical Google fashion, they have some uber-cool technologies and slightly difficult-to-work-with libraries. The future that their subscriber returns makes use of gRPC for bidirectional communication, and it removes the need for us to periodically pull messages as well as to manage message deadlines. So to illustrate, we can use run_in_executor again. I've made a little helper function to kick off the consumer and the publisher. And to prove that this is now non-blocking, I'm going to create a little dummy coroutine to run alongside run_pubsub. We'll add the two coroutine functions and update main so it's just the run function that we're running, and we can see that it's not blocking.
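The run_in_executor pattern for a blocking dependency can be sketched like this. A minimal, hypothetical stand-in: blocking_consume plays the role of the blocking third-party client, and something_else is the dummy heartbeat coroutine that proves the loop stays responsive:

```python
import asyncio
import concurrent.futures
import functools
import time

heartbeats = []


def blocking_consume(n):
    # Stand-in for a synchronous, blocking third-party client call
    # (e.g. opening a subscription that blocks the calling thread).
    time.sleep(0.05)
    return [f"msg-{i}" for i in range(n)]


async def something_else():
    # Dummy coroutine: if the loop were blocked, these ticks would stall.
    for _ in range(5):
        heartbeats.append("tick")
        await asyncio.sleep(0.01)


async def main():
    loop = asyncio.get_running_loop()
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    # Farm the blocking call out to a thread; await it like a coroutine.
    consume = loop.run_in_executor(
        executor, functools.partial(blocking_consume, 3)
    )
    msgs, _ = await asyncio.gather(consume, something_else())
    return msgs


msgs = asyncio.run(main())
print(msgs, heartbeats)
```

While blocking_consume sleeps in its worker thread, the event loop keeps running something_else, which is the whole point of the executor hand-off.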
But as I said, although it'll do a lot for us, there are a lot of threads in the background, like 15 of them, that the Google Pub/Sub library gives us. So I'm going to reuse that something_else coroutine to periodically get some stats on the threads that are going on. And I've also given our own thread pool executor a name prefix so I can easily tell which threads I created versus what Google created. And when running this, you can see that Google creates a lot of threads. We have the main thread, which is the asyncio event loop; there are five threads from us, because we've given our executor five workers; and the rest is Google's. So the current thread count is something like 22. But all in all, the approach to threaded code isn't that different from non-async code. Until you realize that you have to call asynchronous code from a non-async function within that thread. Obviously we can't just ack a message once we receive it; we have to restart the required host and save the message in our database. So basically, you have to call asynchronous code from a non-async function in a separate thread. Pretty embarrassing, bear with me, and I've got like two minutes to run through this. We'll use the asyncio create-task helper that we defined earlier. Then we realize that, yes, of course, there's no event loop running. And to get a little more color, here are some log lines showing that, indeed, no event loop is running in the thread. Yeah, I can hear people say: read the docs, Lynn. But what if we gave it the loop that we're running in? It kind of works, but it's deceptive. We're just lucky here. Once we share an object between the threaded callback code and the asynchronous code, we've essentially shot ourselves in the foot. And to show you that, I've created a global queue that the consumer will add to, and then we'll read off that queue with handle_message. And you see something funky now: nothing is ever being consumed from that global queue.
And so if we add a line in that function to log the queue size, we can see it gradually increasing. And I'm sure a lot of you see what's going on here: we are not thread-safe. So let's make use of run_coroutine_threadsafe and see what happens. Yes, it finally fucking works. So in my opinion, it's not that difficult to work with synchronous code in asyncio. However, it is difficult to work with threads, particularly with asyncio. So if you must, use the thread-safe APIs that asyncio gives you, or you can just hide it all away and try to ignore it. So in essence, this talk is something that I would have liked to hear about a year ago, so I'm speaking to past Lynn here. But hopefully there are others that benefit from this, from a use case that's not just a simple web crawler. Everything is up there on that URL, so hopefully it's useful to folks. Thank you.
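The thread-safe hand-off can be sketched as follows. A hypothetical minimal demo: threaded_callback plays the role of the non-async callback running in the Pub/Sub client's thread, and handle_message is the coroutine it needs the event loop to run:

```python
import asyncio
import threading

handled = []


async def handle_message(msg):
    # The async work (restart, save, ack) that must run on the event loop.
    await asyncio.sleep(0.01)
    handled.append(msg)


def threaded_callback(msg, loop):
    # Called from a non-asyncio thread: asyncio.run_coroutine_threadsafe
    # schedules the coroutine on the loop via its thread-safe machinery,
    # unlike create_task, which would fail or silently misbehave here.
    future = asyncio.run_coroutine_threadsafe(handle_message(msg), loop)
    future.result(timeout=1)  # block this thread until the loop finishes


async def main():
    loop = asyncio.get_running_loop()
    thread = threading.Thread(target=threaded_callback, args=("host-1", loop))
    thread.start()
    # Keep the loop spinning while the other thread schedules work on it.
    while thread.is_alive():
        await asyncio.sleep(0.01)
    thread.join()


asyncio.run(main())
print(handled)
```

The returned concurrent.futures.Future is what makes the "it finally works" moment observable: the callback thread can safely wait on work that actually executed inside the event loop's thread.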