 OK, hello everyone. I'm here today to talk to you about advanced Django query optimization. I'm coming from this small country, so far away from here, called Bulgaria. If you hear about this for the first time, no worries. So who am I? I'm 23 years old. I'm a student in Sofia, which is the capital city of Bulgaria. I'm a last year student. I'm also a Django React web developer. I was very lucky to find great people and friends who really loved their job to build software that solves a real life problem. And I'm still working with them. So we are Hacksoft. We are an outsourcing company which built software for business. And we make programming courses in Bulgaria in Python and Ruby. We organize a big programming conference in Sofia on 15th and 16th of September, so we'll be very happy to see you there. But that's about us. Let's talk about Django. My aim during this talk is to tell you about three types of problems that we've hit while developing Django applications, and we've faced while using the ORM to interact in the database. But first, let's make two assumptions. The ORM's main idea is to make the database interaction easier. We want to avoid writing a SQL, and we just can write a single Python class which represents the database table. And we can call a method like .all that will fetch all the data from the database. And my second assumption is that Python is not the new SQL. What I mean is that SQL is great when you want to make database operations. But using ORM makes you think that you can do the same thing with a simple Python filter which is not a good idea. So the first problem, too much SQL queries. On the project I work at the moment, we had an API which had like 5,000 or 6,000 SQL queries per request, which was the real performance issue. So how we solved that? We know these two methods, select related and prefetch related. They came natively from Django ORM. So how they work? Let's say we have two models, user and shared account. So every user is related to a shared account, and every shared account can have multiple users. So let's say we need to have all the users. We make user.object.o, which under the hood, every query set is just a SQL query. So the query for this expression is select everything from user. So the problem with this query is that if we want to get the account for every user, we need to make an extra query for each object. So we can solve this with select related. You just tell the ORM that it needs to join the table with the shared account. So you get it with the same query. But if you want to get the shared account object with their users, you cannot join. You need to get more than one user for every account. So the way we do that is using prefetch related. We need to select all the accounts, get their IDs, then select the users which have these IDs in class. And the one thing that we do not see here when we use that is you need to fire the first query, then you force making a second query. Actually, prefetch related doesn't optimize your API. So you'll make the second query anyway. So if you don't need the users, you don't need to use the prefetch related. So tip one, you can make a simple SQL join with select related and prefetch related. It's easier. The second problem, too much data. We had an API which made 100 queries, which is not very big count, but we had too many objects. Let's see it in the example. We have a new service web application with three models. We have user, every user has related playlists. And every playlist has related songs. So let's say we need to get the total length of every playlist. We cannot store it in the playlist because as soon as we add a single song, it will change. So we implement property in the playlist model, which will calculate the sum of all songs length. But this is Python, so we can write just like sum from this comprehension to save some lines of code. Let's make the same thing for the user model. We need to get all the playlists and sum their length. But the problem with this code is if you want to get all users with their length of all the playlists, nothing else, just that. And we have 10 users, every user has 10 playlists, every playlist has 10 songs. Then in order to calculate the length for 10 users, we need to fetch 1,110 database rows, which will not be very slow, but if you have 10,000 database rows, it will be a problem, since there is still physical limitations between the software architecture. So here's the second assumption. Python is not the new SQL. SQL is made to make this thing easier. And what we do here is just a sum over a group. It's a sum aggregation over a group of songs, grouped by playlist. And this aggregation looks like this. It's a select quote from the playlist, and the song's total length field is calculated as aggregation over this. And here's our model at the moment. So how can we do that with an ORM expression? How can we generate this query with the plain Python? We can define a query set. The query set is the jungle way for generating SQL query. You can write custom method, which will select or prefetch something, or annotate, which is the ORM syntax for sCloud. Select subquery as something. So let's define a query set. In our query set, we need to group all the songs that make us, then we have to make some aggregation. So we define a method called collect, which annotates the song's total length. And we modify our property in the model so it can use the annotated field. The reason we do that is if we get the playlist from the song with the song.playlist, we don't get it from the query set. So it doesn't know about this collection. We don't go collect there. So we need the Python code anyway. But if we just need the playlist with their song's length, we can define the collect. We can annotate the song's total length. And we should put something there which is equal to the group by class. So the ORM has two functions that will do the job. The first is subquery. Subquery will accept two arguments, query, which is the expression for group by class, and output field, since the database will know what to return as a result of this query. But the ORM doesn't know what type of value it should expect. So the output field should be integer field, since we have some integers. And the query set will be song object with values just the playlist ID. This is the field to group by. The next thing is from the group by, we need just the group for this playlist, not all the groups. So we filter the playlist ID with the outer ref, which is the ORM expression of get me field from the outer query. And the last thing we should do is to make the aggregation itself. In our case, it's just sum, sum of fields of the song. And once we call playlist.object.collect, it will produce the same query that we would do if we didn't use the ORM. We just generated it. OK. OK, we want to do the same thing for the user model. We need to group all the playlists, get their length that we calculated, and return the sum aggregation over them. So we do exactly the same thing. The only difference is that we group the playlists by the user ID, filter them with the user ID, and then we should sum over the aggregation that we implemented in the playlist query set. And we can get it in the first line of the property. The end query will look like this. OK, tip two. If you have an API which calculates something over a group of objects, but you don't need the objects, you can use subquery and autoref. And use the database function to calculate since they're really fast. And you will avoid fetching the objects in the application. And the third problem. We have too much queries, but we have too much data. And the difference from the previous two is that you cannot use select-related. You cannot use profess-related. But let's see it in the example. We have these three models, the same as the beginning. But the only difference is that the length of the song, the real length of the song, is calculated as the length multiplied by 0.8. So if we do the profess-related, if we do the same thing as the beginning, it will select everything. But you need to calculate the song through your length anyway. So you need all the objects anyway. So how can we do that with Django ORM? Django ORM provides an expression wrapper, which is the way Django says you can implement this expression when passing it to this wrapper and define the output field. So what should be the expression? We have just the length multiplied by 0.8. You can get the object field with the f expression, which will be evaluated in the database while fetching the objects, not in the Python code. And you can pass Python value to the database and say, do these things with these values, from the database rows and from the query. So we can just get the length and multiply it by 0.8 and define the output field, which is integer field. But once we do that, the previous group by class that we've written will not be correct since they should depend on the real length, not the length. And we should go to this query set and say values list from some aggregation of not the length, but the expression equal to the real length. You should get it from the song query set. And the problem here is how should we test that? We write an ORM expression, and we think it's correct, but we don't have the test for it. If it's a Python code, you can make a unit test for it, and you can test every case you have. But we said that we need to have the Python code anyway because you can get the object as a relation of the other object. You can get the playlist as a song.playlist, not the playlist.object.collect. And if you had these properties, we can write proper tests for them, proper tests like this. We have a song object. We have the real length expected, and we just assert this. So the only thing we need to do is just to get the object again with the collect, with song objects collect, we get with the same ID, and assert the same thing. This is the only change in our tests. So the good thing here is if you have the proper test for every model's property, with the minimal change, you can see if it works. Thank you very much. You're fast, so we have plenty of time for questions. So maybe you would like to ask some questions how to do it, so I come and give you the microphone. Good talk. One question. You solved one of the problems there, basically through code duplication, in the example of multiplying the length of the song. Do you have any tips on reducing code duplication while maintaining efficiency? Reducing what? Reducing code duplication. In your example, you multiplied the original database length of the song by 0.8. Let me turn it twice. Of course, that's a contrived example, but the... You mean the song total length with the new property? No, by 0.8, to get real length in one of your later examples. In the Quercet? Not in the Quercet. You have one method on the song model, where you multiply the stored length of the song by 0.8 to get the real length. And in your higher level method on the user, you do the same multiplication operation. And that is code duplication. You're doing basically the same operation in two different places. Yeah, it's the same aggregation. Yeah, but that is considered an anti-pattern, something generally to be avoided. In this case, you're doing it for efficiency, but it still carries some risk. Well, you need to group the objects anyway because of the relation. Yeah. But is there a way to do it with less duplication is what I'm asking? I'm not sure I understand you. That's OK. Thank you. Maybe you can clear this up and then you look at the code together, but we haven't found the code place yet. Any other questions about queries making them fast? Oh, yeah. Thank you for the presentation. I have one rather trivial questions. You used the decorator property. Would it be advisable to use the Django own decorator cached property? Would it bring some benefit here? Yeah, it will be really shorter and easier. The only problem that you can face if you use cached property is, let's say, you return money-filled instance in the property, but the database doesn't support this type of value. So you need to get the field collected by the query set and then pass it to the Python class money-filled. And I think this is the only case that you can cache problems with cached property. Thank you. It will be there anyway. Thank you. Sometimes you have the idea of the query you want to write, but then the things you end up writing with the Django or REM don't look at all like the query you had in mind. So are there first good resources to understand what you should be using in the OREM in order to get what you need as SQL? And is there a way to verify that the SQL actually is what you want it to be? Well, I think it's easier to check your OREM expression while you develop when you see the real SQL query. But you better write the OREM expression since it's easier to maintain long-term. And you can use the query just to check the science check. You want what you need to achieve, and you just need to see the real query. And you can easily print the query itself in the Python shell. Yeah, but the OREM expression is definitely easier to maintain long-term. And is there any good resource in a simple place where we can get mapping and this query OREM expressions gives this SQL like a building block, something like we have for the class Bayviews. There's the classic Bayviews site that many people use. Is there the same kind of resource for the OREM? Well, I don't think so. OK, thank you. And with this, you can get the actual SQL query which you will generate. You mean the name of the expression, the annotation? Yeah, let me check the query. Where is it? The query here, the subquery here, has an S clause, the song's total length. It's the same name as the, let me check the annotation, the same as you've written in the annotate method as a keyword argument in the Python function. You can extract it from there. What were your experiences writing raw SQL from Django? You mean in Django application? Yeah, like using the dot raw query set. Well, in our project, we do not have the plain SQL run by the OREM, plain SQL while interacting with the database. But this is very useful when you have, for example, service which integrates with your application and with your application database and calculates some statistics. You need to put the plain SQL queries there. And you can get it from the query set. You can make an OREM expression, get the SQL query, and put it there. And you have tests for it. This is the good part. Thank you. I think the general question was how to find out about the existence of outer ref subquery and so on in Django. Because when you show this, I think people are seeing it for the first time, so maybe to help. This was general. How did you find out about the existence of outer ref and how to use it? Well, we actually have a really heavy API that we needed to optimize. It was extremely important to optimize it since it was very used. So in Django documentation, there is two short explanations how exactly subquery and outer ref actually works. Actually, there were a blog post that explains the behavior of these two properties. But we found it while using it. We just make attempts, then print the SQL query, and see what happens. And that, for example, how we found how, let me check the query set, you need to make group by over the songs for a playlist. But you need to tell the field that you want the group to be grouped by. You need to group the songs by playlist ID. So how should you say that? The only idea of subquery and outer ref with aggregation like some is to make group by, but you don't have an interface for it. It's not explained very well. So the way you do this is to call values or values before making an aggregation. So when you say song.objects.values from playlist ID, the select class will look like select playlist ID from song, nothing else, just that. So once you make a value list with the aggregation, it will return group by this field. So this is the way you can actually do it. Hello. Thanks for the great talk. So you basically found out how to make such queries by trial and error. That's what you're saying? I mean, as I understood, there's no documentation in Django about the. That was the main thing. The other thing was Django documentation, of course. That's a basic and multiple blog posts with examples. Thanks. So do you use MySQL or Postgres for this? And do you know if this works the same on both databases? Because Django RM has some issues with MySQL. There are not supported some parts of it. You mean if you want to generate a query which is not supported by Yoram? Yes, this subquery and outer ref part, does it works on MySQL databases or only on Postgres? Do you know what database do you use on this project? To store you. I think, yes, it will work. OK, any more questions about queries? If there are no queries anymore, then we stop. Thank you very much. Give a big hand to the speaker.