Hi, good morning. Thank you for joining us in this first session in the PyCharm room. Our first speaker is David Arcos and his talk is titled Efficient Django. Thanks for coming. In this talk I will speak about Efficient Django. I will give some tips and best practices for avoiding scalability issues and performance bottlenecks. The four main things we will see are the theory, the basic concepts, then measuring, how to find bottlenecks, and finally some tips and tricks. The conclusion, of course, is that Django can scale. So hi, that's me. I'm David Arcos. I've been a Python developer since 2008. I'm a co-organizer of the Python Barcelona meetup, and I'm the CTO of Lead Ratings. Lead Ratings is a startup in Barcelona that does machine learning as a service: we provide a prediction API so our customers can rate their sales leads and then improve their sales conversions. It looks difficult but it's quite straightforward. OK, let's start with the basic concepts. Have you heard of the Pareto principle, the 80-20 rule? It says that for many events, most of the effects come from a few of the causes. This happens in many different fields, and it happens in scalability too. We can focus on optimizing 80% of the tasks and achieve very few results, or focus on just a few vital tasks, the 20%, and achieve most of the results. The difficult thing, of course, is to identify those few tasks. So if we want to improve the performance and scalability of our platform, we need to identify the bottlenecks. Basic concepts on scalability: scalability is usually defined as the potential to grow a system just by adding more hardware, without changing the architecture. It's recommended that you don't store state in the application servers, but in the database. If you keep stateless app servers you can do load balancing and then scale them horizontally, which means just adding more hardware; if the state is not shared, it's very easy to grow.
But then we move the problem to the other side, the database. If the state lives in a single point, the database, that will be difficult to scale. It depends on the database: scaling Mongo, Postgres or Redis is not the same, each has different trade-offs. Improving database performance is quite obvious: on one hand you have to do fewer requests, and on the other you have to do faster, more efficient requests. We will see how later. Doing fewer requests means fewer reads and fewer writes; you can achieve this with caches. For faster requests you can do many things: we will see how to index fields, and you can denormalize your models. Denormalizing means keeping some precalculated data inside the model so you don't have to do expensive operations all the time. About the Django templates: the standard templating engine is good enough. Jinja is a bit better, but either way you have to cache the templates. Django has fragment caching, which means you can cache just little blocks of the templates. You don't need to cache everything at the same time; you can go layer by layer, template by template, and do different caching at different spots. Of course this depends on your system: if you are building an API you don't have templates, but a normal web application will have a lot of code that can benefit from this. The cache is one of the most important things. You can cache almost everything, so the most standard approach is to go layer by layer through your stack and try caching things from the top: Varnish if you are using it, a CDN, the access to the database, the templates, sessions, everything. Django has very good cache documentation and the cache framework is very powerful. The problem is cache invalidation: how do you invalidate the cache? Once a model is updated you have to remove the stale entry. You can do it in many different ways; we will see how later.
So: cache everything. Bottlenecks — now we are moving to the interesting parts. You have to identify the bottleneck in your system: the place that makes your system slow. If you remove a bottleneck your system will go faster, and then you will have another bottleneck. You have to identify that bottleneck, solve it, and rinse and repeat. It depends a lot: different systems will have different bottlenecks. Whether your bottleneck is the CPU, the memory or the database, you can do different things. The point is that first you fix the current bottleneck, then move on to the next one. So how do we find the bottlenecks? Second part: measuring. You can monitor your application, look at data, at numbers, and this helps you find the bottleneck. As they say, you can't improve what you don't measure. So you measure your system to find the bottlenecks, you fix those bottlenecks, and then, because you are measuring, you verify that the bottleneck has been fixed. You keep doing this until the system is efficient, performant and scalable. Easy to say. So, from top to bottom: monitoring. You can monitor the system load, CPU and memory to check the basic stats. The database, of course, is very important: queries per second, response time, even the size of the database. The same for the cache, and for the queue: when you have a system of workers it's important to see how many tasks are queued; if the queue is growing too fast, the bottleneck could be there. And also custom metrics for your application. You can do profiling with Python's cProfile module, the standard module for profiling. Profiling runs your Python code and returns numbers like these: the number of calls, the time spent in each call.
Running time, time per call — these numbers are interesting for finding the slow call, the slow line, and which lines are executed the most, because you can have an idea in your head of how the application performs, but until you measure, it's just a hypothesis. timeit. The timeit module is another standard Python module that does what it says: it times how long your command takes to run. You can use it to time a script or you can embed it in Python code. Here it's calling just a method; timeit runs the snippet many times and calculates the average, the best run, this kind of metrics. The idea here — it says "best of three" — is that as a baseline you usually want the best possible time, because your system has many different variables: the best time is when the cache is prepopulated, when the CPU is not doing other things, when you are not having network problems. So the best measurement works well as a lower bound for your system. ipdb. ipdb is to pdb, the standard Python debugger, what IPython is to the plain shell, so it has a few more features: better tab completion, syntax highlighting, better tracebacks, introspection. You just call ipdb.set_trace() and when your code gets there it will stop and give you a shell where you can keep executing Python. So from a normal Django application running on your machine, you just put a breakpoint there, the runserver will stop, and you can see all the variables in scope; you can keep running, and you have a few commands to continue or to go step by step. This is very useful because when you detect a bug you can just drop a breakpoint there and inspect it, no need to dig through tracebacks. Another very important tool: the Django Debug Toolbar. The Django Debug Toolbar consists of a series of panels; in those panels you can check things about everything, and you can add more panels.
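As a concrete illustration of those cProfile numbers, here is a minimal, self-contained run (the slow function is made up for the demo):

```python
import cProfile
import io
import pstats

def slow_squares(n):
    # Deliberately naive loop to give the profiler something to measure.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_squares(200_000)
profiler.disable()

# Print the top entries sorted by cumulative time: ncalls, tottime,
# percall, cumtime -- exactly the numbers discussed above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The same loop can be timed with `timeit.repeat(...)`, taking the minimum of the runs as the lower-bound baseline described above.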
So you can do the profiling here, you can see the SQL queries, you can EXPLAIN the queries, you can see what's in the system right now, you can see how much time each thing takes, plus information about redirections, about the templates, about cache usage. For me this is the most useful tool for debugging, because when you have a theory, a hypothesis about how your system works, and then the numbers don't make sense, you can go line by line, view by view, and check what's really happening. It's very modular, so you can add more modules. First, django-debug-toolbar-line-profiler embeds the Python line profiler, so you get a new toolbar panel where you can profile the views, the models, everything; it's very useful. And then the Django Debug Panel — not the Debug Toolbar, the Debug Panel — which is an extension for the Chrome browser, because some calls don't return HTML. If we go back: in this picture we see the result of a single page of your application; you click a button that says Django Debug Toolbar and it opens all these panels. But this is an HTML view, and all of this is HTML and JavaScript. Sometimes you are not serving HTML: you are building an API, handling Ajax requests, returning dynamic JSON, whatever. In those cases you cannot embed the HTML in that response, and the Django Debug Panel lets you use the browser instead: with this little extension you can inspect everything you are sending to the server, the same things as with the Debug Toolbar. This is very useful too. OK, tips and tricks. Now that we know the basic concepts, how to measure and how to find the bottlenecks, we will see a few best practices and a few options for fixing performance bottlenecks. First, and most important: databases. Databases are usually slow because the indexes are wrong.
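For reference, the basic Django Debug Toolbar setup looks roughly like this — a sketch following the django-debug-toolbar documentation; details vary by version:

```python
# settings.py (development only) -- sketch based on the
# django-debug-toolbar documentation; adjust to your project.
DEBUG = True

INSTALLED_APPS = [
    # ... your apps ...
    "debug_toolbar",
]

MIDDLEWARE = [
    "debug_toolbar.middleware.DebugToolbarMiddleware",
    # ... the rest of your middleware ...
]

# The toolbar only renders for requests coming from these client IPs.
INTERNAL_IPS = ["127.0.0.1"]
```

You also have to include the toolbar's URLs (a `__debug__/` prefix including `debug_toolbar.urls`) in your URLconf; check the package documentation for the exact form in your version.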
An index in a database — well, it's an index: it makes your queries faster, but you need to have the right indexes. Databases are not as intelligent as they seem; you need to be very specific about what you want to index. The primary key will always be indexed, OK? But then you can add indexes on single fields, with db_index, or composite indexes over more than one field, with index_together. The first one is defined on the field in the model: you just set db_index=True. And index_together is defined in the model's Meta, where you put tuples of several fields. So, for example — this happened to me a few days ago — you can have your own idea of how it's working, but then it's slow. You think it's using an index, because it's a very simple query: a DateTimeField, you are ordering a list of rows by date, but it's very slow. What's happening? If you use the Debug Toolbar, or any of the other tools, you will see at some point that the problem — in my case, in Postgres — was that it was not using the index. Why? Thanks to the Debug Toolbar I found that it needed a composite index: it was sorting by creation time and UUID. Why? This was inside the Django admin, so I understand that it sorts by time and then by UUID. Once I found it, the fix was just adding an index, and the query went from 15 seconds to 3 milliseconds. The difference is huge, and this table was small, just 3.5 million rows. For bigger tables it's very important to be sure you have indexes for your most-used queries — and for the Django admin too, of course. So what's the bad side of indexes? Why don't we add indexes to everything? Why isn't it automatic to have indexes everywhere? Indexes occupy space — space is cheap, but space on the database is problematic — and indexes also make writes slower, because every insert has to update all the indexes. If you have two indexes, it's OK.
If you have 20 indexes it gets more complicated, and if you create composite indexes over many permutations of fields, it will get slow very fast. So use indexes only where you need them, and profile to be sure the right index is being used — the difference is huge, so it's very easy to see whether it's working as expected. Another tip for databases: bulk operations. For example, if you have to do an initial ingest of data and you have thousands of rows and you go one by one, that's thousands of writes to the database. You can use the bulk_create method instead and do bulk insertions, say a thousand or ten thousand at a time. This goes much faster: the database has no problem inserting ten thousand rows in one statement — that query is a bit slower, but the difference in the number of queries is huge. Each query to the database has the overhead of a round trip to the remote server. Sometimes you test on your laptop and it's very fast, but once it's on Amazon or some other provider, you will notice the overhead very quickly. So you can do bulk operations for creating, and you can also do bulk updates and bulk deletes: instead of iterating over all the objects, all the rows, you use these queryset methods. Update is a bit more complex. Why? Because usually when you update a single object you know the value you want to set, but when you update a whole queryset, the value you want is usually dynamic — setting the same value on every row is not a common use case. So you can use F expressions, which set field values based on dynamic data, meaning data that is already in the database. For example, to increase a counter you can use an F expression that says: take the current value of the counter and add to it, this kind of thing.
By the way, these are links, and I will post the slides so you can check them all. Most of them go to the Django documentation, others to other resources. Delete is very easy: no parameters, you just delete a full queryset in a single operation. Another thing to keep in mind: bulk_create does not call the save() method and does not send the signals, and the same goes for update. If your logic depends on Django signals firing on a given model every time you add a row, these bulk operations will not trigger them, so you have to handle that separately. OK, another tip for the database: getting related objects within the same query. Here we have two similar use cases: following foreign keys, or following many-to-many relations. For foreign keys it's easier: you just use the queryset's select_related method and you get each object together with its related objects in the same query. For example, I want each city together with its country: normally that would be one query for the list of cities and then one more query per city to fetch its country. That's inefficient. With select_related I can tell the database to give me the cities and their countries at the same time, in a single JOINed query. That query is a bit slower than the plain one, but much faster than doing n queries. The second method, prefetch_related, is a bit more complex; it's for many-to-many fields, where the relationship is not just a foreign key. This one does an extra query: in addition to the normal query, it fetches the IDs of all the related objects and then does the join in Python. This is important because sometimes databases are very slow doing joins — if you don't have the adequate indexes, or if the join doesn't fit in memory and has to go to the file system. This way you get all the related many-to-many objects with just one extra query.
So you do two queries instead of n. Next: the slow admin. I use the Django admin a lot; I usually extend it and add custom things. The thing is, the default admin can have lots of fields and it doesn't scale very well. You can apply many of the tips we have already seen. For example, list_select_related will do the select_related trick inside the ModelAdmin. You can override get_queryset to do the prefetch_related: you just extend the get_queryset method and call prefetch_related with the fields you need. Ordering: make sure the ordering field is using an index, and the same for the search_fields — if you are searching on non-indexed fields it will be very slow. Then, for foreign key and many-to-many fields, you can do two things. For example, say we have a list of all the cities: there are thousands of cities, which means the admin has to do an extra query to fetch all of them and render them, and you get a select box with thousands of entries. It will be slow — not on the database side, but on your machine, in the browser. If you put the field in readonly_fields it will not be rendered as a select and it will not be editable: you just see the current value. This can be enough, because most of the time in the admin you are not changing these relations. But if you do need to change them, there's the next one: raw_id_fields. This renders the foreign key differently: instead of listing all the possible values — I should have put a picture here — it displays just the ID, a little button to search and a little button to clear. So for our list of cities, the field would just say "city 45", and that resolves the relation without spamming thousands of HTML elements into the browser. raw_id_fields is cool, but it's not very pretty; it's nicer to use the external application django-salmonella.
It's like raw_id_fields, but it also shows you the name of the object you are pointing at; it's a little prettier and more usable. With django-salmonella, instead of seeing "city 45" you see "city Barcelona", which is more usable for the end user. Another little trick: extending the admin templates. In this case, you extend the filter template — the filters you define in the ModelAdmin appear in the sidebar on the right of the admin. If you have, for example, a city filter with thousands of cities, it takes a lot of space and it's slow in the browser; it's a lot of cruft. So you can extend this template and, instead of rendering the standard list, use an HTML select in a normal form. This way it occupies much less space, and it's just a form that, when you submit it, filters by that foreign key. Now, I talked a lot about the cache, and the cache is difficult: you have to invalidate things, you have to know what to cache, you have to do many delicate operations. You know the saying: in computer science there are two hard things, cache invalidation and naming things. django-cachalot — it's not a joke, it's very good software. django-cachalot is a system for caching the ORM queries, the database accesses, and invalidating them automatically. It's a very cool project. There was another project called Johnny Cache; this one is from the same people, I think. It manages the caching at the ORM level automatically: it hooks into the middle of the ORM and caches at table level. That means that as long as a table doesn't change, the cache is still valid; once the table changes, the cache is invalidated. What can go wrong here? If you have a table that you are writing to all the time, this could be a problem, because you will be invalidating the cache all the time.
Anyway, I did some small tests, and even in that case, having the database cache at the ORM level improves performance, because within a single request you may be accessing the same row many times, and just by caching it inside the request you avoid a few extra queries. So even with a lot of writes — well, I would say you have to measure whether it helps your system. It will take some space in the cache, of course, but it's an automatic system, the project has very good code coverage, and it's low-hanging fruit: you just install it, it's very easy to configure, and your application gets much faster for most of the usual cases. If you have some specific needs, you can use django-cachalot's low-level API and cache in specific places, disable some tables, or adapt it to your case. OK: use workers, do the slow stuff later. Sometimes you have to do work that is slow. It could be CPU-bound — the CPU is working a lot because, I don't know, you have to generate a PDF and put it inside a zip file; this kind of thing takes a lot of CPU — and you don't need to do it synchronously. You can set up an asynchronous job system where you queue the work: you have your application servers, of course, but also some workers, and those workers just run the tasks. The tasks can be any kind of task, not only CPU-bound: sometimes you have to fetch a URL, and you don't want to sit waiting on that blocking operation, so it can be done later. If you want to improve performance, you have to identify this slow stuff and move it somewhere else. This is also a very basic tip. Cached sessions: by default Django keeps the sessions in the database, but you can keep them in the cache and that's it.
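In production you would reach for a real job queue such as Celery, but the queue-plus-workers idea can be sketched with stdlib primitives alone — a minimal illustration of the pattern, not a substitute for a proper task queue:

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    # Each worker pulls tasks off the queue until told to stop.
    while True:
        job = jobs.get()
        if job is None:
            break
        # Stand-in for slow work (PDF generation, fetching a URL, ...).
        results.append(f"report-{job}.pdf")
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# The web request just enqueues the work and returns immediately.
for job_id in range(3):
    jobs.put(job_id)

jobs.join()     # wait for the backlog to drain (demo only)
jobs.put(None)  # poison pill to stop the worker
t.join()
```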
So with the non-persistent cache backend it just keeps the session in the cache, and once the cache entry is gone, the user is logged out of your application. If you want persistence, the cached_db backend is very similar: it caches the reads, but it still writes — not as often as the default backend, but it will eventually write the session. Still, all the reads are avoided. Persistent connections to the database: another Django setting, CONN_MAX_AGE, that is disabled by default and you have to enable. It says that a connection to the database can persist for, say, 60 seconds; otherwise Django closes the connection and opens it again, close and open, on every request. You can set it to None and then connections are kept forever, but I think it's better to close connections after a while, because if you have connectivity issues, or issues with the database or the app servers, and something goes wrong while connections are kept open, you can get in trouble: other workers or other app servers won't be able to connect. So set it to a minute, five minutes, something like that. The important thing is to avoid opening thousands of connections every second. More things. This one is not about performance but about scalability: UUIDs. UUIDs are universally unique identifiers. By default Django uses normal primary keys, sequential IDs: the first row is 1, the second is 2, the third, the fourth. UUIDs are different: they are not sorted, not ordered, not incremental; each generated UUID is essentially random. The chance of collision has been calculated and it's negligible, so they will not collide — and even if one did, you would get an error from the database saying this key already exists. The advantage of using UUIDs is that you guarantee uniqueness, so you won't have collisions. What could happen otherwise?
If you have two application servers and the database gets disconnected, or they are partitioned for a while, you could create a new user with ID 25 on one machine and, on the disconnected machine, another user with ID 25 — the same ID. What happens then? You have a conflict, a collision, and that's not nice. Also, UUIDs index very well, because databases have native field types for them storing the hexadecimal value, so it's not comparing strings; it performs very well. Using UUIDs from the beginning makes it very easy to do database sharding later. If you don't, you will eventually have to run a database migration to add UUIDs and remove the sequential IDs everywhere, including the foreign keys, and it's a pain going through all the foreign keys changing those IDs. Do this at the beginning of your project, and when you want to shard the database it will be much easier. OK, slow tests — not a scalability issue, but important anyway. Slow tests used to be a bigger problem: since Django 1.8 we have the --keepdb option and since 1.9 we have the --parallel option. Before that you had to use various hacks, mainly to avoid the migrations. Every time you run the tests, all the migrations for all the apps have to run, and you can have tens or hundreds of migrations. In Django 1.7, squashing the migrations into a single one didn't work very well; in Django 1.8 it worked better, but running all the migrations makes the tests very, very slow. So when you run the tests, just use --keepdb and it will skip the migrations, OK? Running in parallel means the test cases run in parallel: at the beginning, the test runner creates several databases instead of one, and starts running the test cases in each of them. If you combine this with --keepdb, it will be very fast.
Also, for faster tests, you can disable the things you are not using, for example middleware. Middleware is a usual suspect for bottlenecks: if you have custom middleware doing lots of work — going to the database, doing validation, doing authentication, this kind of thing — it will be slow. Installed apps: it's not a big difference, but if you are not testing an app, remove it from INSTALLED_APPS. Password hashers: this one is standard advice from the Django documentation — use a cheaper hasher, MD5 for example. It's not valid for production, but for the unit tests it's enough, because you are not testing the password hashing; you are testing the user creation, say. Logging: you can disable all logging with just one line. Also, use mocking whenever possible. Mocking means that instead of going to an external service or an external database, you write a mock that simulates the external call. For example, if you upload files to Amazon S3 and you do that a thousand times inside your test suite, it will be slow; if you mock it and just keep those files on the local filesystem, in memory, or in /dev/null, whatever, it will be much faster, because you won't have the overhead of going out to the internet all the time. It's also better unit-testing philosophy: test only your logic, not external services that may or may not be working. So, after all of this, the conclusions. The first thing to do is monitor — measure to find the bottlenecks. Once found, optimize only the bottlenecks. Go for the easy stuff: 20% of the lines spend 80% of the time, so find those lines and go for them, and don't try to optimize everything, because optimizing every line defeats the purpose. And once you have fixed the bottleneck, you have fixed that 80%, OK?
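To recap those test-speed tips in code: a fast password hasher, logging switched off, and a mocked external call. The S3 uploader here is a made-up stand-in; `PASSWORD_HASHERS` and `logging.disable` are standard Django/stdlib:

```python
# Recap of the test-speed tweaks above: a settings sketch plus a mock.
# upload_to_s3 is a hypothetical stand-in for a real network call.
import logging
from unittest import mock

# In your test settings: a fast (insecure!) hasher and no log output.
PASSWORD_HASHERS = ["django.contrib.auth.hashers.MD5PasswordHasher"]
logging.disable(logging.CRITICAL)

def upload_to_s3(path):
    # The real thing would hit the network; tests should never get here.
    raise RuntimeError("no network in tests")

def archive_report(path):
    # The logic under test: we only care that it *tries* to upload.
    upload_to_s3(path)
    return f"archived:{path}"

# Mock the slow external call instead of hitting the network.
with mock.patch(f"{__name__}.upload_to_s3") as fake_upload:
    result = archive_report("/tmp/report.pdf")

fake_upload.assert_called_once_with("/tmp/report.pdf")
```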
But now, in the remaining 20%, 80% of that will be another bottleneck, so you have to keep doing this again and again. Rinse and repeat. A few external resources. The official Django documentation is awesome: it has a section on performance and a section on database scalability, both very good. A book: High Performance Django. This book is very good and very oriented toward production systems — performance more than scalability. It's a must-have if you have Django systems in production; it tells you everything. In my talk I focused only on the Django side; in this book you will read about the other pieces — nginx, HAProxy, Varnish, external systems you can use to make things faster. You don't scale Django only with Django things, but with external things too. A blog: the Instagram Engineering blog. Instagram, they say, is the biggest Django project deployed in production nowadays — in all of history, I think. They post a lot of use cases: they posted how they scaled their systems when they launched the Android application a few years ago, when Facebook bought them, and they are posting things all the time as engineers. The data science blog is interesting too; they talk about scalability issues. And this is a document — you can click here or Google for this line — "Latency Numbers Every Programmer Should Know". It's a link to a university page that tells you, for example, how long a round trip takes inside a local data center, or for a connection from Europe to the United States; how long it takes to write one megabyte to an SSD, to read from an SSD, to read from the memory of another machine in your data center; L2 cache, L1 cache, running an instruction in the CPU, a cache hit, a miss, everything.
This resource is very important because it happened to me: I thought, for example, that going to the local hard drive would be faster than going to an external machine in the same data center. It's not true: going to another machine over a network connection, if that machine has the data in memory, is much faster than going to the hard drive. So you should learn these numbers, play with them a bit, and adapt to them, OK? And that's it. Thanks for attending. The slides are already posted on SlideShare. And at Lead Ratings we are looking for engineers and data scientists, so feel free to contact me. That's it; now, if you have questions — anybody? OK, nobody understood anything. Yes. I usually deploy often, so memory leaks are not a problem: I deploy often, so Celery gets restarted, and memory leaks are not usually a problem for me. But yeah, that can happen, of course. Sorry, again? Could be. I have tested ZeroMQ and I liked it a lot, but usually I go with the easiest option, and Celery was good enough. Of course there are many different systems: if your jobs are not time-critical, Celery is OK, but if you need more performance, there are better systems. OK, so if you have any more questions for David, just grab him during a coffee break or during lunch and he'll be happy to answer all of them.