My name is Yury Selivanov, I'm a co-founder of MagicStack; check out our website, magic.io. I've been an avid Python user since 2008. I think the first version I used was Python 2, but within months I switched to Python 3; I've used it since alpha 2 or so and never looked back, so use Python 3. I've been a CPython core developer since 2013, though I believe I actually started contributing even before that. You might know me from PEP 362, which I co-authored with Larry Hastings and Brett Cannon; that's the inspect.Signature API. I also created PEP 492, the async/await syntax we have in Python 3.5. I'm helping Guido and Victor Stinner maintain asyncio, and I created uvloop; more on that later. Structure of the talk: I actually wanted to tell you so much about how to write high-performance code in Python, with asyncio in particular, but unfortunately I had to cut my slides; something like 50% of them had to go. So we'll briefly start with an overview of async/await, then quickly cover asyncio and uvloop, then try to answer the question of how you should implement your protocols: using sockets, using protocols, or maybe using streams. Then I'll show you something new: a high-performance driver that I open-sourced about two hours ago. And then we'll recap. I have to say there will be no funny cat slides, because performance is hard, so only sad and depressed cats from now on. So let's start. There should be just one obvious way to do it, right? Well, we have five different ways to do coroutines in Python. The first one is callbacks and Deferreds. I think Twisted actually originated this approach; it was at least one of the first major frameworks that used it, and it validated that this style is possible. Then we have Stackless Python and greenlets, and I'm pretty sure everybody has heard of Eventlet and gevent; those are good examples of frameworks that use them.
In short, programs written with gevent look like normal programs. They look like you're using threads, but it's just one process, one thread, and any point in your code can suspend and then resume. It's a lot of dark magic, and as Guido said, it will never be merged into CPython, so those approaches are somewhat on their own. Then we have yield: it's been possible to use generators as coroutines in Python since, I believe, Python 2.5. Twisted has a decorator called inlineCallbacks that lets you write modern-looking coroutine code in Twisted, and you could do that for years. Then in Python 3.3, yield from was introduced, and asyncio benefits from it; most asyncio code is written using yield from. And then in Python 3.5 we got async/await. That's the new way. Why do I think async/await is the answer? First of all, it's dedicated syntax for coroutines. It's concise and readable; it's easy to glance over a large chunk of code and see what's actually going on. You will never confuse coroutines and generators. There is now a dedicated built-in coroutine type; it's actually the first time in Python's history that we have a built-in type just for coroutines. We also have new concepts: async for and async with. And I believe these are rather unique to Python. When we added async and await, a lot of people told us, well, you copied it from C#. Yes, we copied it from C#. But we also introduced new things, and I believe async for and async with are unique; I haven't seen any other imperative language with these constructs. Async/await is also a generic concept. A lot of people think async/await can only work with asyncio. That's not true: asyncio uses async/await, but you can build an entirely new framework and use them on your own.
That's, for instance, what David Beazley did with his framework Curio. He uses async/await in a completely different way from how it's used in asyncio. Also, async/await is fast. If you write something like a Fibonacci calculator, you will see that a coroutine runs only about twice as slow as a regular function. And that's fine, because even in big asyncio programs you don't have nearly as many await calls as you have normal function calls; you can't even compare, it's like 100 times more. So use async/await as much as possible. It won't hurt your performance; you won't see any drawbacks. Now, coroutines are a subtype of generators, but not in the classical Pythonic sense. In CPython they share the same C struct layout; they share like 99% of the implementation, but a coroutine is not an instance of a generator. And you can see this sharing of machinery: if you disassemble a coroutine, you will see that it still uses the YIELD_FROM opcode. Then we have types.coroutine. Originally we introduced it to make old-style yield from coroutines from asyncio compatible with new coroutines that use async/await syntax. You cannot just await on anything; you can only await awaitable objects. So you cannot await the number one, and you cannot await a plain generator. But if you wrap a generator with the types.coroutine decorator, then you can await on it. And again, David Beazley uses this quite creatively in Curio. If you are interested in async/await, I definitely recommend taking a look at how asyncio is implemented and how Curio is implemented; they are two different approaches. And then we have a bunch of protocols for async iterators and async context managers. Let's move on. Let's talk about asyncio, libuv, Cython, and uvloop. Asyncio was originally developed by Guido himself, and I think a lot of it is inspired by Twisted.
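As an aside, the types.coroutine trick described above can be shown in a few lines. This is a minimal sketch, not code from the slides; without the decorator, awaiting the generator would raise a TypeError:

```python
import asyncio
import types

# @types.coroutine marks a generator function as awaitable, bridging
# old yield-style coroutines and the new async/await syntax.
@types.coroutine
def old_style(x):
    yield          # suspend once, generator-style
    return x + 1

async def new_style():
    # Thanks to the decorator, the generator can be awaited here.
    return await old_style(41)

loop = asyncio.new_event_loop()
try:
    print(loop.run_until_complete(new_style()))  # prints 42
finally:
    loop.close()
```

Removing the decorator makes `await old_style(41)` fail, which is exactly the point: you can only await awaitable objects.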
That copying from Twisted is actually good, because Twisted has existed for, I don't know, 15 or 20 years, and it validated that this concept of asynchronous programming in Python actually works. So I think we copied quite a lot from Twisted. And Twisted actually plans to use asyncio at some point, when they fully migrate to Python 3; they will just use the asyncio event loop. A lot of people call asyncio a framework. It's not a framework; I would call it a toolbox. It doesn't implement HTTP, for instance, or any other high-level protocol. It just provides the machinery and APIs for you to develop this kind of stuff. If you want HTTP, you would probably use aiohttp for that. If you want a memcached driver, you go and Google it. It's also part of the standard library, which is both good and bad. Why is it bad? Python has a slow release cadence. We see new major Python releases every year and a half, and bug-fix releases are usually half a year apart. And I would say that for asyncio that's sometimes not enough. Sometimes we discover bugs and want to fix them as soon as possible, but we have to stick to the Python release cycle. But it's also good, because you know that asyncio will stay with us for a while. It will always be supported by someone, because it's part of the standard library. Also, Python has a huge network of buildbots with different architectures and operating systems, and it's quite important to test something as convoluted and as hard as I/O on different platforms. So it's good. Asyncio is quite stable right now, and it will be even more stable pretty soon. So what's inside asyncio? We have a standardized and pluggable event loop; from the beginning, asyncio was designed so that you can swap the event loop implementation for something different. It defines protocols and transports.
That's one way to marry callback-style programming and asyncio: you develop your protocols using low-level primitives, the protocols API. Asyncio also has factories for servers, connections, and streams. This is quite important, because if you implement a server using, say, blocking sockets, and then you implement a second one, you will see that you have lots and lots of boilerplate code that looks the same every time. Asyncio takes care of that and factors all of it out into convenient helpers for creating servers and connections. It also defines futures and tasks. A task is what actually runs a coroutine: it pushes values into coroutines, suspends them, and resumes them. In framework-independent terms it's called a coroutine runner. And futures allow you to interface with callbacks; that's how you bring async/await into something that uses callbacks. Asyncio also has interfaces for creating and communicating with subprocesses asynchronously. It has queues, and by the way, Queue is a very useful class; you should definitely use it. It's exceptionally hard to create an asynchronous queue that supports cancellation and so on without bugs; we were still fixing queue bugs in 3.5.2. Queues are useful for things like connection pools, for instance. Definitely check it out. We also have locks, events, semaphores, all the things everybody knows how to use. And as Łukasz Langa said in his talk at PyCon US a couple of months ago, if you love deadlocks, you can still have them with async/await. So the event loop is the foundation. It's the engine that actually executes async/await code. It also provides factories for tasks and futures. And it's an I/O multiplexer, the engine that actually reads data from and pushes data to the wire.
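As an illustration of the Queue class mentioned above, here is a toy connection pool built on asyncio.Queue. This is a minimal sketch, not a real driver's pool; `connect` stands in for a hypothetical connection factory:

```python
import asyncio

class Pool:
    # Toy connection pool: free connections live in an asyncio.Queue.
    # "connect" is a hypothetical factory, not a real library API.
    def __init__(self, connect, size):
        self._free = asyncio.Queue()
        for _ in range(size):
            self._free.put_nowait(connect())

    async def acquire(self):
        # Suspends the caller if every connection is checked out.
        return await self._free.get()

    def release(self, conn):
        # Hand the connection back; wakes up one waiting acquire().
        self._free.put_nowait(conn)

async def demo():
    pool = Pool(connect=object, size=1)
    conn = await pool.acquire()
    try:
        pass  # ... use the connection ...
    finally:
        pool.release(conn)

loop = asyncio.new_event_loop()
loop.run_until_complete(demo())
loop.close()
```

Because Queue.get suspends on an empty queue, back-pressure on the pool comes for free: callers simply wait until a connection is released.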
The event loop provides low-level APIs for scheduling callbacks, scheduling timeouts, working with subprocesses, and handling Unix signals. And the best part is that you can replace it. That's what we did with uvloop. Uvloop is 99.9% compatible with asyncio. I'm not aware of any incompatibilities, but maybe there are some. As far as I know, you can drop uvloop into pretty much any program and it will just work. It's written in Cython, and by the way, Cython is just amazing. It's unfortunate that it's not more widespread; I think it's underappreciated what you can do with Cython. Essentially, it's a superset of the Python language. You can statically type it, it compiles to C, and you get C speed with a syntax close to Python. So definitely check out Cython and try to use it. Uvloop uses libuv. Libuv is what keeps Node.js running: Node.js uses libuv as its event loop. And that's a good thing, because Node.js is super widespread, so libuv is very, very well tested. Libuv is stable, and it's fast. Uvloop also provides fast tasks and futures, so even your async/await code runs faster on uvloop, by about 30%; that's thanks to libuv and a few hacks. And it has super fast I/O. So how fast is uvloop? Compared to asyncio, it's two to four times faster on simple benchmarks like an echo server. Nobody deploys an echo server in real life, of course, so as soon as you add more Python code it will become slower. But even in real applications, I've seen reports that uvloop runs code about 30% faster. And the latency distribution is much better with uvloop. So it's faster than asyncio. What about other platforms and frameworks? For instance, the same echo server written in Python on uvloop is two times faster than Node.js, and that's kind of interesting, because Node.js itself is built mostly on V8.
That's the JavaScript engine. Node.js uses libuv, which is written in C, with a thin layer of JavaScript on top of it. So uvloop, which uses the same libuv, is still two times faster for almost the same amount of code. It is as fast as Go with GOMAXPROCS set to one, which essentially means Go cannot parallelize the load across multiple CPUs. Still, it's quite an impressive result, because Go is a fully compiled language, and it also has, I think, a slightly more efficient I/O implementation than libuv, just because libuv tries to be generic: it supports Windows and Unix, while Go supports those platforms too but in a slightly different way. And of course uvloop is much faster than Twisted and Tornado, simply because most of uvloop is written in C. So initially my idea for this talk was to end with this slide: just use uvloop, thank you for your time, questions. But unfortunately it's not that easy. So, part three. Let's talk about sockets, streams, and protocols; basically, one obvious way to do it, episode two. What should you choose? Should you use coroutines with the loop.sock_* methods, like sock_recv and sock_sendall? Should you use the high-level streams API? Or maybe the lower-level protocols and transports? Here is an echo server implemented with the loop.sock_* methods, and if you look at it closely, you will see that if you drop the async and await keywords, it looks like normal blocking code that uses the socket module. So it's convenient when you have lots of old-style blocking code: you can easily convert it to async/await. Then there are streams. Here is the streams version of the echo server. It's quite high-level, as you can see; you don't work with sockets anymore, you have a reader and a writer. And here is a low-level implementation of the echo server using protocols.
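The slide itself isn't reproduced here, but a protocol-based echo server along those lines might look like this (a minimal sketch, not the exact slide code):

```python
import asyncio

class EchoProtocol(asyncio.Protocol):
    # The event loop calls data_received with whatever bytes arrived;
    # we answer through the transport. An echo needs no buffering.
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)

loop = asyncio.new_event_loop()
# create_server returns a Server object once the socket is listening.
server = loop.run_until_complete(
    loop.create_server(EchoProtocol, "127.0.0.1", 8888))
try:
    loop.run_forever()
finally:
    server.close()
    loop.close()
```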
So essentially, a protocol is something the event loop pushes data into, and the protocol has a transport to push data back to the client. The key method here is data_received; that's the main method. The event loop pushes data into data_received, the protocol processes it, and then calls transport.write to send the response back to the caller. For an echo server it's quite a simple implementation, but you can imagine it gets pretty hairy for more complex protocols. Now, the downsides. When you use the low-level loop.sock_* methods, the loop cannot buffer for you, so you are responsible for implementing buffering on top. You also have no flow control, which without buffers doesn't make any sense anyway: you don't need flow control until you implement buffers, and once you do, you won't have it, and it's quite a tricky thing to implement correctly. Another reason you shouldn't use them is that the event loop has no idea what you are doing. Let's say you are reading some data. The event loop will add your file descriptor to a selector, which can be epoll or kqueue on Unix, and wait for an event. When the event arrives, it will read the data and push it back to you, but it will also remove the file descriptor from the selector. That's an extra system call, because the loop doesn't know whether you will continue reading, write some data, stop, or close the connection; it cannot predict what's going on. When you use streams (and streams, by the way, are built on top of protocols), the event loop knows your intent: keep sending data to my data_received or to my stream, and when I don't need the data anymore, I'll close the connection myself. The event loop can actually optimize for that. And flow control is kind of important.
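The flow control hinted at above can be sketched with the transport's pause_reading and resume_reading methods. This is an illustrative pattern, not code from the talk; the high-water mark and the `consume` method are hypothetical:

```python
import asyncio

class ThrottledProtocol(asyncio.Protocol):
    # Sketch of protocol-side flow control: pause reading when our
    # internal buffer passes a high-water mark, resume once the
    # application (a hypothetical consumer) drains it.
    HIGH_WATER = 64 * 1024

    def connection_made(self, transport):
        self.transport = transport
        self.buffer = bytearray()
        self._paused = False

    def data_received(self, data):
        self.buffer.extend(data)
        if not self._paused and len(self.buffer) > self.HIGH_WATER:
            self.transport.pause_reading()   # loop stops calling us
            self._paused = True

    def consume(self, n):
        # Called by the application to take n bytes out of the buffer.
        chunk = bytes(self.buffer[:n])
        del self.buffer[:n]
        if self._paused and len(self.buffer) <= self.HIGH_WATER:
            self.transport.resume_reading()  # loop resumes reads
            self._paused = False
        return chunk
```

Pausing the transport propagates back-pressure to the peer through TCP's own window mechanism, which is exactly the "push back on something slow" idea.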
I like this picture because it illustrates that sometimes you have to push back on something that is slow, or something you don't want to consume right now. So which API should you use? Use the loop.sock_* methods when you are quickly prototyping something or porting some existing code. But I would highly recommend you stick to streams, even for porting code; just rewrite it with streams, because streams are much easier to use. You can say: give me exactly this amount of data; or you can tell a stream: read until you see \n, and it will do it. Streams also implement read and write buffers quite efficiently. So you can program entire protocols with streams, and use protocols and transports for performance. If you want exceptional performance, you have to go low-level. So let's focus on protocols and transports, and again, this is important: in your application code you should always use async/await. Never even touch, never think about protocols. That stuff is just for drivers: for PostgreSQL, for memcached, for that kind of code. High-level code should never think about protocols; async/await will be enough. So, protocols. As we mentioned before, the loop pushes data into protocols, protocols send data back using transports, and protocols can implement specialized read and write buffers. They can also do flow control: they can hint the event loop through the transport's pause_reading and resume_reading methods. And you have full control over how I/O is performed: you call transport.write, you can resume data consumption, so you have the tools to control it. So how do you use protocols and transports? There are basically two strategies. The first one is to implement your own abstractions: your own buffering and your own stream abstractions. A good example of that is aiohttp.
Aiohttp has buffers and streams specifically designed to handle and parse the HTTP protocol. That will be slower than implementing everything with callbacks and accelerating it in C, but it's still quite good. The second strategy is to implement the whole protocol parsing in callbacks, and then create a facade that allows you to use async/await. The key reason why this might be the better strategy, and why it can offer better performance, is that you can drop Python completely: you can go low-level, you can use Cython, you can use C. So, part four: asyncpg. This is something I open-sourced just a couple of hours ago. It is right now the fastest PostgreSQL driver for asyncio, and for Python in general. It completely re-implements the protocol from the ground up; it doesn't use libpq, the de facto library for working with PostgreSQL. We just implemented it completely from scratch. It uses the PostgreSQL binary data format, and by the way, when you are implementing a protocol and you have a choice between text and binary, always choose binary. It's more efficient and you can process it much faster, because of how binary formats usually work: you have a length field that tells you how much data follows in this frame, and then comes the next one. So you can split frames much faster and decode types much faster. Always choose binary. Also, not all PostgreSQL types can be encoded to and decoded from text; if you have a recursive composite type, it's just not possible to decode it from text. For asyncpg, we actually forgot about the DB-API completely. There is no DB-API for async/await; what aiopg does, for instance, is sprinkle async and await on top of the existing DB-API. Our idea was: let's build a driver that is tailored for Postgres and uses Postgres features. And we support essentially all built-in Postgres types.
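The length-prefixed framing described above can be sketched in a few lines of pure Python. This is an illustration of the general idea, not asyncpg's parser; the real PostgreSQL wire format also carries a one-byte message-type tag before the length:

```python
import struct

def split_frames(buf):
    # Split a bytes buffer into length-prefixed frames: each frame is
    # a 4-byte big-endian length followed by that many payload bytes.
    # Returns (complete_frames, leftover_bytes).
    frames = []
    pos = 0
    while len(buf) - pos >= 4:
        (length,) = struct.unpack_from("!I", buf, pos)
        if len(buf) - pos - 4 < length:
            break  # incomplete frame: wait for more data
        frames.append(buf[pos + 4 : pos + 4 + length])
        pos += 4 + length
    return frames, buf[pos:]

frames, rest = split_frames(b"\x00\x00\x00\x03abc\x00\x00\x00\x01x")
print(frames, rest)  # [b'abc', b'x'] b''
```

Because the length comes first, the parser never scans the payload looking for delimiters, which is why binary framing is so much cheaper than text parsing.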
Postgres loves prepared statements, because it doesn't need to parse the same query over and over again. When you prepare a statement, there is a structure on the server with a plan, with a parsed query, that already knows how to accept your arguments. So we use prepared statements every time: even when you don't explicitly create them, we keep an LRU cache of prepared statements, and we do that transparently for you. We also dynamically build pipelines for efficiently encoding and decoding data; a pipeline is essentially an array of pointers to C functions that can process the stream with enormous speed. So, this chart compares different Postgres drivers for different languages. The fastest one is asyncpg; it manages to push almost 900,000 queries to the server. The second one is aiopg, another driver, which uses libpq, which is also C, but unfortunately psycopg doesn't provide an efficient async interface, so it's slower. Also, aiopg and psycopg use the text data encoding, so they will always be slower. Then you see two Go implementations, and then the Node.js drivers, which are just 10 times slower. The funny part is that node-postgres (pg) is actually a pure JavaScript implementation of the driver, while pg-native uses libpq, so somehow a lot of JavaScript is faster than C; I have no idea how. Another funny thing about this performance: there is one more library that isn't part of the chart because it's kind of slow. It's called py-postgresql; nobody knows about it, and we used it for several years before we created asyncpg. Anyway, it's a pure Python implementation, and it's as fast as the pure JavaScript implementation. Everybody says Python is slower than JavaScript and you shouldn't use it, but we saw that it's possible to write pure Python code that is as fast as Node.js code, so maybe Python isn't that slow.
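The transparent prepared-statement cache described above can be sketched in pure Python. This is a toy LRU, not asyncpg's actual implementation; `prepare` stands in for the real server round-trip:

```python
from collections import OrderedDict

class StatementCache:
    # Toy LRU cache of prepared statements: prepare each query text
    # once, reuse the statement on repeat queries, evict the least
    # recently used entry when the cache is full.
    def __init__(self, prepare, maxsize=100):
        self._prepare = prepare
        self._maxsize = maxsize
        self._cache = OrderedDict()

    def get(self, query):
        if query in self._cache:
            self._cache.move_to_end(query)   # mark as recently used
            return self._cache[query]
        stmt = self._prepare(query)          # the expensive round-trip
        self._cache[query] = stmt
        if len(self._cache) > self._maxsize:
            self._cache.popitem(last=False)  # evict least recently used
        return stmt
```

The point of doing this inside the driver is that callers get prepared-statement performance without ever calling prepare themselves.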
The asyncpg architecture: the meat of it is implemented in the core protocol. The core protocol is written in Cython and uses callbacks to process the protocol. Then we have a Protocol class that wraps the core protocol and inserts some Future objects into it, so that you can use async/await. The rest of asyncpg is pure Python that implements the high-level API. How do you parse such a protocol? The naive approach would be to use Python bytes and memoryviews, but doing so creates a lot of Python objects, and you will actually see how much time you spend on memory allocation. The solution is to use Cython, go down to the C types, and never even touch Python bytes and memoryviews. This is a preview of the read buffer; its API is a bit bigger than this. The first method is the most important one, feed_data, which is what the protocol's data_received actually calls. Our data_received has just two lines in it: the first pushes the data into the read buffer, and the second calls a function that reads from the buffer. This buffer is tailored for Cython. The second most important call here is try_read_bytes. It either returns a low-level C data type or a null pointer; if it returns a null pointer, you fall back to a read method that returns a Python object, which is much slower. But 99% of the time try_read_bytes succeeds, so we can avoid that. So, again, the high-level logic of asyncpg is built in pure Python, and this is how you can actually use it. You can see it's a pretty high-level API: we prepare a statement, we enter a transaction with async with, and we iterate over a scrollable cursor. Part five: let's recap. Don't be afraid of protocols; use them to implement really, really high-performance drivers, and use Cython for low-level code.
Cython is much easier to code in than C: you can quickly refactor, completely change everything, and it will just work. In application code, don't think about protocols and transports; use only high-level code. Once you have fast database drivers and such, and you use uvloop, everything will be much, much faster. loop.create_future() was introduced in Python 3.5.2; that's a new feature. If you use loop.create_future, uvloop can inject its fast future implementation into your code, because uvloop implements its own version of Future, and it's about 30% faster than the asyncio Future. If you can do binary, go binary. Always profile your code. It's funny: when asyncpg started to work, I benchmarked it against aiopg and it was two times slower, and I didn't understand why, because it should have been faster; there was no way it could be slower. I spent 30 hours without sleep and optimized asyncpg until it was two times, then four times faster. The important lesson is that if that first run had shown asyncpg being 30% faster than aiopg, maybe I wouldn't have spent so much time optimizing it. So always profile, always analyze, and try to push it forward. By the way, Cython can be used in pretty much any project; it's a very useful tool, check it out. And Cython has a useful flag called -a. It generates an HTML representation of your source file with each line highlighted: a line is either blank or a shade of yellow, and the most yellow lines use the most Python C API calls, which is slow. So at a quick glance you can see where the speed goes. Definitely check out that option. Always try to do zero-copy: avoid working with bytes, memoryviews, all that kind of stuff; go low-level with Cython, and never copy Python objects. And one last piece of advice is to implement an efficient buffer for writing data.
What we do for writing messages: we have a write buffer that preallocates a chunk of memory, and then we compose messages with a high-level API without reallocating that memory at all. When the message is ready, we just send it. So we have a high-level API for creating messages, but we don't allocate any memory while doing so. You should also definitely set the TCP_NODELAY flag. We will probably set it by default in asyncio in Python 3.6; right now it's not set. You should do it, because it will speed up the transport.write method. With this flag set on the socket, the socket doesn't wait until it receives a TCP ACK; it sends the data as soon as it gets it. And when you have control over how frequently you call transport.write, you can use TCP_CORK: you cork the channel, do multiple writes to it, then uncork it, and it sends all of your data in as few TCP packets as possible. And the last slide: timeouts. Always implement timeouts as part of the API, so that you don't have to use wait_for, because wait_for is slow: it wraps the coroutine into a task, and that comes with a huge penalty; your code can become about 30% slower if you use wait_for. So design timeouts as part of the API, implement them at the lower level with the loop.call_later method, and it will just work. That's it. Thank you; there's time for maybe one or two questions. [Audience] Hi, thank you for the presentation. I want to ask about using asyncio and your event loop not for high performance, but for high concurrency. Would you use it for high concurrency? I have a scenario with hundreds of thousands of concurrent connections. [Yury] Yes, uvloop is even better for that, because it uses less memory than asyncio. Sorry, I can't hear you well. Uvloop is much better for a highly concurrent application that handles hundreds of thousands of connections, simply because it uses less memory. And again, it's faster.
I've run it with 100,000 connections and it handles them pretty okay. [Audience] Thank you. [Host] Unfortunately, this is all the time we have. Thank you.