Hello everyone! This is my second EuroPython as a speaker, and I'm very happy to be here. I guess you all came here to learn about my failures, right?

Okay, so let's begin with a short introduction. I work at Akamai. We are a content delivery network, and also a cloud and security services provider. We move tens of terabits of traffic, we have really a lot of servers, and you can do really cool stuff with them. If you've ever watched an Apple keynote or played Fortnite, you definitely used Akamai's network without even knowing it. We tend to believe that we move between 10 and 30 percent of all web traffic; it's really difficult to calculate the exact value. So our systems are really distributed, the network is really vast, and recently we've launched a fairly new product, IoT Edge Connect, which is basically MQTT on steroids, so we are also very interested in IoT.

Me personally, I'm currently leading development of one of the most core Akamai metadata systems, which is responsible for safe, secure, and reliable metadata updates in our network. I'm also super happy to be here because my bachelor's and master's projects were done for CERN. CERN is based in Geneva, so not that far from here, and I really fell in love with Switzerland, so I'm really glad I can be here.

Working on metadata updates for such a big network is really fascinating, and there are so many challenges. And actually, something from my field specifically: ten days ago we saw a big outage on the internet caused by a bad metadata update. Do you know what that outage was about?
Yeah, that was Cloudflare. Cloudflare is our competitor, but you know, failures do happen, and I believe that it's not failures that define a company; it's how failures are dealt with, and how transparent the company is about them. I believe that the people at Cloudflare did a really great job dealing with that failure. Do we by chance have anyone from Cloudflare in the room? No one? Okay, even if there's no one here, I think they deserve a big round of applause, because failures do happen, and what matters is how we embrace them.

I think I first heard it from Bryan Cantrill, during one of his presentations (he's usually very emotional and he shouts a lot): production is a war. It's truly a war. There are a lot of victims and everyone's your enemy. Failures do happen in production; we are not able to test every single branch, every single situation that our code might encounter. So it's really important, in my opinion, that we are open and transparent about failures, so that we can all learn from them. That's why I'm really thankful that Cloudflare was so transparent and open about their own failure. We have failures too, and when you're a CDN and you have a failure, there are always those big flashy headlines, like "taking the internet down".

Okay, so let's move on to an introduction to async. This is an advanced-level presentation, so I won't go into too many details, but I want to lay the groundwork and give you two definitions so that we are on the same page. So: what's async?
If we have a single worker doing a single task, like a gopher which for some reason burns books (you have one gopher and a pile of books to burn), then we have serial, sequential, synchronous execution. If we have many workers and we can divide our big pile of books into smaller piles, then we have parallel execution. And async is about a single worker doing a lot of things at the same time. In this beautifully edited image you can see a single gopher which is in the process of doing a lot of things, but it does only one thing at a time; there's nothing parallel here. I think async is really useful when we have many similar things to do but for some reason cannot process them immediately, because we need to wait for some resources or other things.

The second definition is how asynchronous I/O is implemented in Python. It uses an event loop. Basically, if you run an asyncio application, in the main thread you're supposed to have an event loop, which is a loop of tasks that asyncio iterates over constantly. Tasks are scheduled, meaning they are put inside that loop, and later they are checked to see whether their resources are available and their processing can continue.

Now, a bit of history: asyncio didn't invent asynchronous execution in Python.
Before asyncio we had the asyncore and asynchat modules and similar tools. But in September 2012 someone submitted a python-ideas post entitled "asyncore: included batteries don't fit", and it was true: writing asynchronous code in Python was quite troublesome. Later that year Guido started with PEP 3156, and after a period of development, in March 2014, asyncio was released in Python 3.4 as a provisional API. A provisional API means you would actually need to be mad to start using it in production, because a lot of things might change; the idea is "we'll ship it and see how the community reacts". And the most complex bugs usually appear only in production, when people actually start to use it.

In the next release the syntax changed: we got the async and await keywords, plus some additional tools like asynchronous iteration and asynchronous context managers. Then with 3.6 we got asynchronous generators and asynchronous comprehensions. I know that we are currently at 3.7, but I guess my knowledge stopped a bit at 3.6, when I switched projects.

So what does asyncio code look like?
Let's say that we have an application which processes some tasks, and what the processing of these tasks requires is some network I/O, so asyncio fits and we can use it. We have a loop, we need some data to start our task, and then we wrap our coroutine in a task and put it inside the event loop; this is what ensure_future is for. If we are smart, we can actually fetch some data beforehand, and then use one of asyncio's helper functions, asyncio.gather, to schedule a lot of asynchronous tasks at the same time, so that they populate our event loop and are worked on asynchronously.

But the thing is that in asyncio applications you usually have one main thread which does the work, and if you perform arithmetic operations or any other logic in it, you are blocking that main thread. Asyncio is not magic: you have one thread and you're working in it. So if you have something that runs longer than milliseconds (seconds, minutes), you should use thread pool executors. Basically, you create a pool of threads or processes, and then you can just delegate running a specific function to that executor. And of course, when you're dealing with things that take a long time to execute, you want to time out at some point, to avoid blocking all your execution if something gets stuck. Nowadays you can use either the wait_for helper from asyncio, or a really neat context manager which will time out the inner call for you.
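As a minimal runnable sketch of the pattern just described (all names here are illustrative, not from the actual application): gather schedules many coroutines at once, run_in_executor pushes blocking work to a thread pool, and wait_for bounds a call that might get stuck.

```python
import asyncio
import time

def blocking_work(x):
    # A blocking call that would stall the event loop if awaited directly
    time.sleep(0.02)
    return x * 2

async def fetch(x):
    # Stands in for some network I/O
    await asyncio.sleep(0.01)
    return x + 1

async def main():
    loop = asyncio.get_running_loop()
    # Schedule many coroutines at once; they populate the event loop together
    fetched = await asyncio.gather(*(fetch(i) for i in range(5)))
    # Delegate blocking calls to the default thread pool so the loop stays responsive
    doubled = await asyncio.gather(
        *(loop.run_in_executor(None, blocking_work, v) for v in fetched)
    )
    # Guard a potentially stuck call with a timeout
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=0.01)
    except asyncio.TimeoutError:
        pass
    return doubled

results = asyncio.run(main())
```

Note that asyncio.run is the modern (3.7+) entry point; in the Python 3.4 era you would create a loop and call loop.run_until_complete instead.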
So why does it all exist? What's all that about? Well, handling tasks asynchronously is quite useful if you have independent tasks. Of course you could use threads for that, but with the GIL in CPython, if you need a lot of threads to process a lot of tasks, you will start getting contention on the GIL; with 30, 50, 100 threads you will start observing the GIL effect. With asyncio you are not limited by the GIL. Also, like I mentioned before, asyncio didn't invent running code asynchronously in Python: before asyncio there were modules like Tornado and Twisted which were already asynchronous, just each doing it in its own way. And because running asynchronous code was becoming popular, the core developers decided that with some tweaks and modifications, CPython could have better performance for async execution.

Okay, so the first story: a story of synchronous asynchronous code. One of the first projects where we used asyncio was an executor type of app, which basically received some task data and had to process it. So what did we do in the beginning?
We implemented everything and we had this nice main application loop: we would get a task, pre-process it, lock it, and then run it. And the problem that we encountered, we encountered much later. Inside the run_tasks method we basically had the loop which you saw before: it would get data, run the tasks, and await each task. About a year later, when our application was being used more and more, we found out that it actually ran synchronously, not asynchronously. The idea was that when we received some tasks we would not be blocked: we could just schedule them for execution and keep taking new ones, for as long as we had available slots in our executor. But here we were awaiting the run_tasks method and then awaiting every single execution, one by one. Basically, we wrote synchronous code using asyncio.

What you want to do in such a scenario is use ensure_future, which will wrap your coroutine in a Task object and ensure that your task will be run by putting it into the event loop. So in the implementation of the run_tasks method we ensured that our task is scheduled and run, and then we were super happy, all tests passed, and then we went to production and, wow, a new class of errors.
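A sketch of the difference (illustrative names; handle stands in for our task coroutine): the broken variant awaits each coroutine in turn, while the fixed variant schedules everything with ensure_future, keeps references to the scheduled tasks, and retrieves their results and exceptions explicitly.

```python
import asyncio

async def handle(task_id):
    # Stands in for one unit of I/O-bound work
    await asyncio.sleep(0.01)
    if task_id == 3:
        raise ValueError("task 3 failed")
    return task_id

async def run_tasks_broken(ids):
    # The bug: awaiting each coroutine in turn serializes everything
    return [await handle(i) for i in ids]

async def run_tasks_fixed(ids):
    # The fix: schedule first so the tasks run concurrently, and keep
    # references so the event loop cannot drop them while pending
    tasks = [asyncio.ensure_future(handle(i)) for i in ids]
    done, _pending = await asyncio.wait(tasks)
    results, errors = [], []
    for task in done:
        # Retrieving each exception avoids "exception was never retrieved"
        if task.exception() is not None:
            errors.append(task.exception())
        else:
            results.append(task.result())
    return sorted(results), len(errors)

results, error_count = asyncio.run(run_tasks_fixed(range(5)))
```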
Our tasks were being destroyed while still pending, and sometimes we also received huge errors on stdout which were not logged by any standard logger we had: an exception happened in one of our tasks and it was never retrieved. So basically, we moved from a synchronous, slow application to a super-fast application which loses data and throws hundreds of errors.

One thing that we've learned is that you always need to await your futures and your coroutines. Even if you write your coroutine as a self-contained thing that writes its output or its error state somewhere, you still need to keep a reference to it, and also check whether the task was actually done, because the event loop will get rid of it if no one is interested in what's happening with it. So what we ended up doing is: whenever we schedule a task, we always keep a reference to it, and then we just check whether it was done. There is of course some error-handling logic too, but it's a simplified example. So, you know: always await your awaitables.

Okay, another story: a story of dependencies nightmare. After learning from the previous application about all the possible mistakes that you can make when you start with asyncio, we wrote another application, which was an API using Tornado, and Tornado's IOLoop was at that time a wrapper around the asyncio event loop. If you used Tornado and you used Postgres, you wanted to integrate your database operations so that they are asynchronous too, and there's a really nice wrapper called Momoko which integrates with Tornado, asyncio, and psycopg2. Also, if you have an application which is supposed to be really performant, what you want to use is a module called uvloop, which is a wrapper for libuv and is basically a replacement for the standard asyncio loop implementation, which is mostly in Python.
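Opting into uvloop is typically just an event-loop-policy switch. A guarded sketch, since uvloop is a third-party package that may not be installed:

```python
import asyncio

try:
    import uvloop  # third-party, libuv-based drop-in loop (pip install uvloop)
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    loop_kind = "uvloop"
except ImportError:
    loop_kind = "default"  # fall back to the mostly-Python stdlib loop

async def ping():
    await asyncio.sleep(0)
    return "pong"

# new_event_loop() now hands back whichever implementation the policy provides
loop = asyncio.new_event_loop()
result = loop.run_until_complete(ping())
loop.close()
```

Application code stays the same either way; only the loop implementation underneath changes.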
With uvloop, all the operations on the loop itself are moved to a C extension, so it's much faster. Also, it's quite hard to write async tests, and in the early days you had to write a lot of setup code; there's a neat module called asynctest which basically sets everything up for you and makes it easier to write tests.

Okay, so what did we learn? We learned that Tornado uses IOLoop, which is a wrapper around the asyncio loop: a first level of wrapping. Then Momoko is another wrapper, and between 2016 and 2018 I think there was not a single commit made to the Momoko repository, so basically we used a module which was pretty dead. (When I was preparing for this presentation, I saw some movement in the Momoko repository, so maybe it's been brought back to life.) What we've also learned is that uvloop is really great, but some modules depend on the implementation details of asyncio, and asynctest, for example, does that; so if you used uvloop, you could not use asynctest. And the asynctest module at that time was developed by a single developer. It was not super stable, it had problems with resource allocation, and it was not compatible with uvloop, so if we wanted to use it, we couldn't run our tests with the loop that we wanted to use in production.

The other thing is that we started on Python 3.4, so everything was done at that time using the asyncio.coroutine decorator and the yield from expression. We had to migrate all of that syntax to async/await, because async/await was much better integrated with CPython, and there were actually some low-level optimizations which it was recommended to use. We also had this problem because, like I said before, Tornado was using an asynchronous model long before asyncio.
So when Tornado started, there was no yield from expression; there was tornado.gen.coroutine and plain yield. So in some places we had yield, and in some places we had yield from, and it was really, really a mess.

Another story: a story of an asynchronous HTTP client. This was basically another application which we wrote using asyncio; the main logic was here. What it does is it creates an HTTP connection, connects to some API, and then constantly iterates over the data that comes in on the stream. I would say that this code is good and ugly at the same time. It's good because it uses asyncio and it's quite performant: much faster than threading. It's quite ugly because with all the nested context managers you actually end up with half the line length, or even less; at that time (and I don't know if it's already available) there was no way to put multiple async context managers into a single statement so that the indentation doesn't go too far. The code was also ugly because there were no aiter and anext built-ins for asynchronous iteration, so we had to use the __aiter__ and __anext__ dunder methods directly.

And the code was also quite bad, because when we deployed it to production, we saw something like this: a graph of the memory usage of that application. You can see that for some time everything's fine, and then the memory usage just starts going up. The reason was that the server API we were using was an HTTP one, and the API was designed in a way that we would open a stream and then, within a single HTTP request, constantly receive monitoring data from the server; this is not how you're supposed to use HTTP. After days of debugging, we narrowed the issue down to our asynchronous HTTP client: the library doesn't have an option to limit how much data it prefetches.
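As an aside on the iteration styles mentioned above, here is a toy asynchronous iterator (the Stream class is purely illustrative), consumed both with the modern async for and with the explicit dunder calls we had to use back then:

```python
import asyncio

class Stream:
    """Toy asynchronous stream yielding a fixed list of chunks."""
    def __init__(self, chunks):
        self._chunks = list(chunks)

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._chunks:
            raise StopAsyncIteration
        await asyncio.sleep(0)  # stands in for waiting on the socket
        return self._chunks.pop(0)

async def consume_modern():
    # With async for (Python 3.5+), iteration is clean:
    return [chunk async for chunk in Stream(["a", "b", "c"])]

async def consume_dunder():
    # What we effectively had to spell out by hand:
    it = Stream(["a", "b", "c"]).__aiter__()
    out = []
    while True:
        try:
            out.append(await it.__anext__())
        except StopAsyncIteration:
            break
    return out

modern = asyncio.run(consume_modern())
manual = asyncio.run(consume_dunder())
```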
So it creates coroutines for For prefetching data and if our processing Wasn't able to keep up with it. It would constantly fetch more data unless the memory was full so we contacted the developers of that app and they told us that You're using an HTTP client for something that's not HTTP So we're not going to fix it or accept your your fix So in the end we had to switch to a module which had an option to limit prefetching data And Unfortunately, there was no such module available for async.io at that time. So we had to switch to The thread threading implementation another story So this is a story of an elastic search pushers Async bump So Who here worked with elastic search a Lot of the people. Yeah, so we probably know that elastic search is super fast if you configure it correctly Yeah But you know even at some point if you if you're like pushing your your your performance and your Application is really is really fast. You'll actually get to the limit of your cluster If your elastic search cluster So there's a lot of benchmarks that you can benchmark your cluster with and when you approach When you approach those limits from benchmarks, there's basically nothing more you can do so We had an application which would get some data in a Super fast fashion. So let's say it was an asynchronous queue Then it would process it a bit and send to elastic search So, you know, great. 
The entry point was asynchronous, and the output was asynchronous too, because we used an asynchronous module for communication with Elasticsearch. But when we pushed it to production and really put a lot of traffic into it, we saw this: again, the memory usage went really high, and the application was crashing at some point. Apart from that, we also noticed this: the CPU usage of the application. So we started suspecting what was going on. For some time after we start the application, the flow from the asynchronous queue is quite stable. Then, if there is some congestion on the network or anything else going on, we get a little less traffic for some time, and afterwards everything that was congested comes down like a waterfall into our data queue. Message brokers are really fast and can survive a lot, and our application was also super fast, but we had reached the Elasticsearch limit. So we started getting the 429 response code from Elasticsearch, which means Too Many Requests.
We started slowing down, and up to that point everything was fine. But then, after about half an hour more, we started seeing our memory usage grow really high. So we started monitoring how many tasks we were creating to process data and push to Elasticsearch, and also how much time it takes to process a single event, and we found out that our application was indeed super fast. But at some point, when it had already created something like 10 million tasks to execute, the event loop itself would become so slow that we would never recover from that state. I mentioned that in asyncio we have this event loop; there's no magic. We put a task there, and asyncio iterates over that loop, and if you put 10 million or 100 million elements into it, your iterations become slower and slower. After some time we had so many asynchronous tasks to process that the system would never process them: even when the traffic got slower, it was not able to recover and keep up any longer. So what did we have to do?
This is something which is called an "async bump". There's a really great module called aiojobs, which basically allows you to limit how many tasks you are creating. It actually allows you to define two limits: how many tasks you're allowed to run asynchronously at any given time, and how many tasks can wait in your buffer, so that you don't overflow and your application doesn't crash. Then you use its scheduler, which is basically a set of queues, and it blocks if any of the limits is reached.

Yeah, so that's that. After that, we have a story of an asynchronous service communicating with a synchronous, threaded service. We are living in the microservices world now, and almost everything is a microservice right now. We usually tend to believe that if we create our service, it is totally independent from any other services. And it might be true, if you're not doing something like this: if you have an asynchronous service which requires, for its job, communication with a threaded service, basically what you end up with is a cluster of services which behaves like one big threaded, synchronous service. Because our application was, again, fast, performant, low-latency and all that, but for every single task we had to communicate with some other service which was threaded and synchronous. And you know, there's actually a rule that if your microservice depends on some other microservice, then it's not really a microservice. Unfortunately, we learned that the hard way. So what we observed was this. (I'm really pushing the reuse of these images, right?)
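Stepping back to the aiojobs idea for a moment: its two limits (a cap on concurrently running tasks, plus a bounded pending buffer) can also be sketched with stdlib primitives alone. All names here are illustrative, and handle stands in for pushing one event to Elasticsearch:

```python
import asyncio

async def handle(item, results):
    await asyncio.sleep(0.001)  # stands in for one push to Elasticsearch
    results.append(item)

async def producer(queue, n):
    for i in range(n):
        # A bounded queue: put() blocks once the buffer limit is reached,
        # which is exactly the backpressure the original app was missing
        await queue.put(i)
    await queue.put(None)  # sentinel: no more items

async def consumer(queue, limit, results):
    sem = asyncio.Semaphore(limit)  # cap on concurrently running tasks
    pending = []
    while True:
        item = await queue.get()
        if item is None:
            break
        await sem.acquire()
        task = asyncio.ensure_future(handle(item, results))
        task.add_done_callback(lambda _t: sem.release())
        pending.append(task)
    await asyncio.gather(*pending)

async def main():
    results = []
    queue = asyncio.Queue(maxsize=100)  # pending-buffer limit
    await asyncio.gather(
        producer(queue, 500),
        consumer(queue, limit=20, results=results),
    )
    return results

results = asyncio.run(main())
```

aiojobs packages the same idea behind a ready-made scheduler, which is the more convenient option in practice.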
And actually this story happened either during the previous one or really soon after. When we fixed the other one, we saw this, and we started to believe that maybe the applications were contagious or something like that. But this was actually a more general problem; the problem with Elasticsearch was a specific instance of this problem. So we implemented the fix which we used for the previous application, the asynchronous limits for tasks, and it was all in place, and we still saw this. On the other hand, the limits for the previous application were really high: we could process something like 50,000 asynchronous tasks at the same time and it was fine. For this application, when we set the same limits, we would observe that memory would go high and then, after some time, it would drop. And the CPU usage would also sometimes ramp up really high, and then after some time the application would either crash or go down. So that was not the main fix.

The other service was not maintained by us, so we contacted its developers, and we learned that the other service is basically a view over a database: it has something like 20 threads, and if you want some data, it takes a thread, performs some SQL operation, and gives you the data back. So we couldn't have 50,000 asynchronous tasks which would basically wait on 20 threads to get their data. What we've learned is, first of all, that if you're running an asynchronous application, you really need to know what other systems you depend on.
And if somewhere down the line there's a system which doesn't scale very well, which is quite slow, then it might not make sense to have a super-fast asynchronous application which would basically spend most of its time waiting for the slow system.

Okay, so up to this point I told you only bad stories and our problems, with Elasticsearch and with asyncio. But we also have some good ones. After learning all that, we went back to the drawing board, and we had yet another service to implement, so we made sure that it's a really small service which does one thing and does it well. We received the task to reimplement a system that was causing problems. What the system would do is get some data in batches, let's say 20 million entities to process; for each entity it would make a DNS query and then push the results back to some other system. We used AMQP for communication in that system. One thing was that the implementation at that time was synchronous: it was using threads, and it actually had this really nice benchmark which would calculate how many threads you should use to get the best performance. The problem was that in order to process 20 million messages it would need 12 hours, which was too much. So what we did was reimplement it in asyncio. We had AMQP at the front and at the back, and what we had to do in between was DNS queries, which are asynchronous in their nature, because DNS uses UDP: you send something and hopefully it will return to you at some point, no handshakes, just super easy. And this was the main logic of that application.
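The talk doesn't show the actual DNS client we used, but as a hypothetical reconstruction using only the stdlib resolver (loop.getaddrinfo), the main logic might look roughly like this; resolve_all and the concurrency limit are my own illustrative names:

```python
import asyncio

async def resolve(loop, sem, hostname):
    # Limit in-flight queries so we don't flood the resolvers
    async with sem:
        try:
            infos = await loop.getaddrinfo(hostname, None)
            return hostname, infos[0][4][0]  # first resolved address
        except OSError:
            return hostname, None  # resolution failed

async def resolve_all(hostnames, concurrency=100):
    loop = asyncio.get_running_loop()
    sem = asyncio.Semaphore(concurrency)
    pairs = await asyncio.gather(
        *(resolve(loop, sem, h) for h in hostnames)
    )
    return dict(pairs)

results = asyncio.run(resolve_all(["localhost"]))
```

In the real service the hostnames would stream in from AMQP and the results would stream back out the same way.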
We would issue the DNS queries asynchronously, and we went from 12 hours to eight minutes with asyncio. The funny story is that when we first ran the application, it ran great for the first five minutes, and then all the DNS servers stopped responding. We were super afraid that we had caused another headline, like "Akamai DNS servers went down", but it turned out that within five minutes Infosec called us. It's always funny to receive a call from Infosec right after you've deployed a new version of your app. They told us that we had malware on our server, and the malware was trying to bring down the DNS servers. When we told them that no, for that system this is business as usual, there's nothing wrong, it will perform that way, they told us that it's fine, the DNS servers will be fine with that, but next time maybe we should give them some heads-up. So sometimes being too fast can also cause you problems.

Okay, and another story, as I'm slowly running out of time. If you've thought about all the things that you can encounter, and you prepare your application to be an async one, and you use communication which is asynchronous at the entry point and at the output of your application, then the entire architecture, the entire way in which your application is written, is also really simple and nice. For example, nowadays when we use asyncio, we really try to use message brokers, AMQP ones, so that we can basically get the data in an asynchronous way, then do some processing asynchronously, and then push the data out in an asynchronous way. Using this pattern makes your application super easy and simple, and even when you use asynchronous tasks, it's quite easy to debug if you don't have additional threads or any other things in the mix.

Okay, so what are the pros and cons of using asyncio, in my opinion?
Well, you can definitely gain some performance if you're relying on I/O, if your application is I/O-bound: network or DB, and there's an asterisk for DB, because databases are usually threaded applications, so make sure that you have enough workers on the DB side to not bring it down, and to use it to its fullest. You get better resource utilization, because you spend less time on communication and synchronization compared to threads. And if you use asyncio, you are on the technological edge, which gives you new ways to solve old problems, which makes you more creative, and it actually makes you follow Python's progress and contribute, because there are still a lot of features missing regarding asyncio.

And what are the cons? Well, there are still a lot of features missing: async iterators have messy, nondeterministic cleanup, an itertools for async is missing, and a lot of the modules which allow you to interact with popular services like ZooKeeper or Elasticsearch are still quite young, and the implementations are quite early. For some of the systems there's a complete lack of asyncio-compatible modules. So, you know: young implementations with many bugs. And asyncio has actually caused our community to become more and more divided, because a lot of module maintainers have decided that they will not port their modules to asyncio, because they don't like it.

Okay, last slides. So what projects are best suited for asyncio, in my opinion? Microservices, and I'm putting an emphasis on micro: projects with a small list of dependencies, so that you understand how introducing asyncio
will interact with your other dependencies. Also: simple HTTP APIs; projects with a big load but light processing, so a lot of small tasks, not heavy ones; projects where threads are not enough, where you need something like 50,000 asynchronous tasks; and projects where the rest of your technology stack is well understood.

What projects are not suited for asyncio? Projects whose architecture heavily relies on threads: they are difficult to migrate to asyncio. Projects with dependencies heavily using threads. Projects where processing of a single task takes a lot of time, because you don't gain anything from running async in those projects. And projects doing uncommon stuff, like communicating with legacy HTTP services.

Okay, I don't think we have time for Q&A, but I'm here for, I think, an hour or two more, because my plane leaves in the evening. So you can just grab me, we can have a coffee, and you can share some more stories. You can also contact me using either my email or my Twitter, which I think has like one or two tweets, so it's not some bot account; I'm just slowly getting into Twitter. Okay, thank you very much!