Okay, we will proceed. Marc-André Lemburg will talk about advanced database programming. He's a seasoned Python developer, having been around since 1993. He is also founder and CEO of eGenix.com, one of the founding members of the Python Software Foundation, and a board member of the EuroPython Society, which brought this lovely conference to you. Please give him a warm welcome.

Thank you very much for coming. I'm going to give a little talk about advanced database programming, because in the past we've heard a lot about the easy stuff, so I thought it might be a good idea to talk a bit about the more advanced things. A bit about myself: I'm Marc-André Lemburg. I've been using Python for a very long time. I studied mathematics, I have a company doing Python projects, I'm a Python core developer, a Python Software Foundation member, a member of the EuroPython Society, and I'm based in Düsseldorf.

This is the agenda for the talk. I don't know whether I can actually cover everything on this agenda because of the time constraints, but I'll try.

First, a short introduction to the Python Database API 2.0. How many of you know the Python Database API? Interesting, not that many. The design of the Python Database API started in the mid-90s. That was the 1.0 version, which is now deprecated, and we're now at 2.0. So it's a fairly old standard. Development is ongoing on the Python DB-SIG mailing list; if you want to join the discussion, you just have to subscribe to that list and you can add your thoughts to the standard. It's meant to be a very simple standard that is easy to implement, so that we get as many database modules as possible, and I think that has worked out really well.

There are two main concepts in the Python Database API: the connection object and the cursor object. You use connection objects to connect to the database and also to manage your transactions. If you want to run queries, you open a cursor object and run your queries on it. The cursor works like a cursor in a text processing system or a spreadsheet: you scroll down through your result set and fetch the data into your application.

This is how a typical application using the DB-API looks. First you import your module and get the connect API from that module. You open a connection, passing in the database name, the user name and the password. Then you create a cursor object on the connection and run your queries on the cursor. Finally you free the resources by closing everything again. A minimal sketch of this flow follows below.

That was a very short introduction to the DB-API. The next part is about transactions. Transactions are a very useful feature of databases: you can do work on your database, and if you find that you've made a mistake, you can simply roll back your changes, which is very nice to have. You need this in production systems to work around bugs or input errors from users, so that your database doesn't become corrupt.

However, there are a few mistakes that people often make, and this sometimes causes people to dislike transactions. One common mistake is forgetting to commit changes: people apply a lot of changes on their cursors and connections, then close the application and find that the database hasn't really changed.
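Here is a minimal sketch of the typical DB-API flow described above. The sqlite3 module is used only because it ships with Python and follows the DB-API 2.0 pattern; the table and values are placeholders, and other modules take database name, user name and password in connect() instead.

```python
# Minimal DB-API 2.0 usage sketch; sqlite3 stands in for any
# DB-API compliant database module here.
import sqlite3

# Open a connection (other modules take database name, user, password, etc.)
connection = sqlite3.connect(":memory:")

# Create a cursor on the connection and run queries on it
cursor = connection.cursor()
cursor.execute("CREATE TABLE example (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO example (id, name) VALUES (?, ?)", (1, "demo"))

# The DB-API defaults to transactional mode: without this commit,
# the inserted row would be lost when the connection is closed.
connection.commit()

cursor.execute("SELECT id, name FROM example")
for row in cursor.fetchall():
    print(row)

# Free the resources by closing everything again
cursor.close()
connection.close()
```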
Because the database API defaults to transactional mode, the data is not actually stored unless you do an explicit commit. A workaround is to simply disable transactions, which is of course possible in most databases, but it's not a good workaround: instead of losing your changes, you get data corruption for free.

Another common mistake is keeping transactions running for too long. As I'll explain later in the talk, transactions are basically your units of locking in the database, so you want to keep them short so that you don't lock other processes out of the database.

The best practices are: always use transactions, even if they are sometimes annoying, and don't use autocommit. Keep your transactions short; if they get too long, run them in batches. For example, if you're loading data into your database, it's much more convenient to do that in batches of, say, a thousand rows at a time, followed by a commit; see the sketch below. That keeps the transaction locks on the database short, and the performance stays just fine, so you won't really notice the overhead caused by the transaction mechanism. And if you know you're not going to write to the database, it's good practice to set the read-only flag on the connection, usually via the connection options. The database then knows it has a read-only connection, doesn't have to take transaction locks, and the whole query mechanism runs much faster.

That was the simple level of transactions. Then there is a more advanced level. If you connect to multiple databases and you want transactions to span them, you have to think about what happens when you read data from one database and write it into another. You only want that to succeed if all the databases have actually received the data, and that's what's called distributed transactions. Typical use cases are in accounting, for example: you debit an amount from an account in one database and credit it to an account in another database, and you only want that to succeed if both databases have actually made the change. You have similar requirements in queue processing or when integrating different applications.

The typical buzzword you'll hear around distributed transactions is two-phase commit, which is the standard method of approaching this problem. It works like this: in the first phase, the commit is prepared, so all the databases are asked whether the commit would succeed with high probability. If all databases say yes, you go to the second phase and actually do the commit. There's a tiny probability that some database may still fail in that second phase, and then, I'm sorry, your data is corrupt; you have to work around that somehow, because there's no easy way of undoing the second phase. But most databases handle this fine.

To make it easier to deal with these transactions across multiple databases, there's something called a transaction manager. This is not in Python; it's usually done in some other system.
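A minimal sketch of the batched-commit loading pattern mentioned above. The batch size, table and column names are made up for illustration, and the qmark paramstyle shown should be adapted to whatever your DB-API module uses.

```python
# Load rows in batches and commit after each batch, so that
# transactions stay short and locks are released regularly.
BATCH_SIZE = 1000  # illustrative value; tune per database and workload

def load_rows(connection, rows):
    cursor = connection.cursor()
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(
                "INSERT INTO measurements (sensor, value) VALUES (?, ?)",
                batch)
            connection.commit()   # end the transaction, release the locks
            batch = []
    if batch:
        cursor.executemany(
            "INSERT INTO measurements (sensor, value) VALUES (?, ?)",
            batch)
        connection.commit()
    cursor.close()
```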
For example, there's MQSeries from IBM, there's J2EE, Microsoft has its DTC mechanism, and some database systems such as DB2 and Oracle offer transaction managers of their own. You can sometimes hook into them from Python; there are Python APIs for some of them. Or you can use a database-specific one, like the one that's integrated into Postgres.

In the DB-API we have addressed this with a new set of APIs, the TPC API, modeled after the X/Open standard for these transaction managers. Unfortunately, not many databases support this, and not many database modules actually provide these APIs, so you have to check whether your database supports it. A sketch of the TPC calls follows below.

Okay, next point: concurrent database access. Am I going too fast? Too slow? Concurrent database access is very important if you have multiple processes accessing your database, for example in a web application. Because of the GIL, you would normally set up multiple processes to talk to your database, so it's important to think about how the database deals with the problem of concurrent access. There are a few typical setups, which I've written down here. The most common one is of course the many-writers, many-readers case, so there's no special case you can exploit in the common case, and you definitely need to make compromises in these setups.

When writing an application and thinking about this, you have to ask yourself some questions. Should readers immediately see changes made by other transactions or other processes? Should they even see things that went into the database although the transactions in those processes have not yet been committed? Or should a reader see everything as it was when it started its transaction, so that it doesn't see anything that came into the database afterwards? Databases can handle all these situations; they provide different so-called transaction isolation levels. But they have to implement this using locks, and locks are something you usually try to avoid in your application, because they keep your whole application from running in parallel and from using most of the resources that you have.

I'm going to walk through the typical set of transaction isolation levels. The first one is the easiest to implement: read uncommitted, which basically means you don't have any locks. With the read uncommitted isolation level, all the processes that talk to your database immediately see every change in the database, even uncommitted ones. So strange things can happen. You can have dirty reads, which means you read data from another process that hasn't been committed yet, and you're not sure whether it's actually going to be committed later on. You can have phantom reads: another process adds something to the database and then removes it again, and your process may have read that row before it was removed, so you're basically working with a phantom. And there are some other things to watch out for. If you want to read up on these things, there's a URL down there; it's in the talk slides, you can click on it, and it's a very good explanation of all this.

The next level is read committed. This is the default level in most database systems.
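Referring back to the two-phase commit APIs mentioned above, here is a rough sketch of the optional DB-API TPC extension, assuming a module that implements it (psycopg2 is one such module). The connection strings, tables and amounts are placeholders; whether your database and module actually support these calls has to be checked.

```python
# Sketch of the optional DB-API two-phase commit (TPC) extension,
# assuming two connections whose driver implements it (e.g. psycopg2).
import psycopg2

conn_a = psycopg2.connect("dbname=accounts_a")   # placeholder DSNs
conn_b = psycopg2.connect("dbname=accounts_b")

# A global transaction id shared by both branches
xid_a = conn_a.xid(42, "transfer-0001", "branch-a")
xid_b = conn_b.xid(42, "transfer-0001", "branch-b")

conn_a.tpc_begin(xid_a)
conn_b.tpc_begin(xid_b)
try:
    conn_a.cursor().execute(
        "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn_b.cursor().execute(
        "UPDATE accounts SET balance = balance + 100 WHERE id = 7")

    # Phase 1: ask each database whether the commit would succeed
    conn_a.tpc_prepare()
    conn_b.tpc_prepare()
except Exception:
    conn_a.tpc_rollback()
    conn_b.tpc_rollback()
    raise

# Phase 2: commit on all databases; a failure between these two calls
# is the rare case mentioned in the talk that needs manual repair.
conn_a.tpc_commit()
conn_b.tpc_commit()
```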
When you open your connection, you will usually get this isolation level. It basically says you're only going to see changes that were committed to the database. You can still see changes that were made while your current transaction is running: if some other process commits while you're running your transaction, you will see those changes, but you will not see any uncommitted changes from other processes. The way it works, as in the table I drew up there with the yellow bar, is that the cursor puts a read lock on the current row. If there is a write lock on that row, the database waits for that write lock to be removed. So if some other transaction has written to that row, it will have put a write lock on it, and that write lock is only removed when that transaction commits. Once the other transaction has committed its change, you can go ahead and actually read the row. So this level basically only locks the one row you're looking at.

The next level is repeatable read. This means that the data you've read won't change within the transaction: everything that the database has returned to your application is guaranteed to stay the same throughout the whole transaction. This of course requires more locks on the database; you put locks on everything you pass back to the application.

The highest level is serializable, which basically means that whatever you do on your database, it will stay exactly as it was when you started the transaction, and nothing will change. This requires lots and lots of locks: not only on the things you've read from or written to the database, but on everything you've ever touched, even whole tables if necessary.

All of these levels are necessary for some applications. For example, if you want to run a report, you may want to avoid inconsistencies in the report, so you may want to use the serializable isolation level. The other levels can be used in situations that are not as strict about data consistency.

How do you do this in Python? There are two ways, well, actually three. You usually have an option in the connection settings to set a default isolation level. But you can also do it dynamically in your application, on a per-connection basis, so you can have multiple connections to your database with different isolation levels. You can run a SET TRANSACTION ISOLATION LEVEL statement, for example, or some database modules have special ways of setting the option directly on the connection. What's important to know is that if you want to change the setting on an open connection, you need to make sure that no transaction is currently running on that connection; the easiest way to do that is to just commit or roll back the connection first. A short sketch follows below.

Right, optimizations. You have a database application and of course you want it to run as fast as possible. The first thing you should do is ask yourself what kind of system you're running. An OLTP system means online transaction processing: you're interested in putting lots and lots of data in, and not so much in making complex queries on the database.
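A minimal sketch of changing the isolation level dynamically, as described above. The exact statement syntax and any connection-level option are database and module specific; the SQL shown is only the common form, and the rollback beforehand ensures no transaction is open when the level is changed.

```python
def set_isolation_level(connection, level="SERIALIZABLE"):
    # Make sure no transaction is running on this connection before
    # changing the setting; a rollback (or commit) ends the current one.
    connection.rollback()
    cursor = connection.cursor()
    # The statement form varies between databases, and on some of them
    # it only affects the next transaction; many databases offer a
    # session-level variant or a connection option for a per-connection
    # default. The level should come from a fixed set of known values,
    # never from user input.
    cursor.execute("SET TRANSACTION ISOLATION LEVEL %s" % level)
    cursor.close()
```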
Or you want the other thing, data analysis: you already have all the data, usually huge amounts of data, in your database, and you're interested in doing complex queries, multi-dimensional, faceted search, drill-down, all these things. Then you use an OLAP system. Just as with transactions, the situation is often that you actually want a mix of both, and you want both to run fast. One way to do this, depending on the size of the database you're talking about, is to put an OLAP system next to the system that's actually taking in the data, and then every now and then copy the data over into the OLAP system to analyze it.

At a lower level, in Python, there are a number of problems you can address directly. One is that your queries run too slowly even though they are simple; you're just doing, for example, a select on a few tables and a few columns. The usual way to address that is to add more indexes, because adding indexes is very easy in a database. Some people add indexes to everything they have in the database, and that slows things down, because every time you write to the database, it has to update all those indexes. So you should really only put indexes on columns, or combinations of columns, that you actually need. The best way to find out which tables and columns to index is to use a query analyzer: databases usually offer a way to get information about how a query is executed. You look at that, analyze the output, and then decide which indexes to put on the database, and it will increase performance enormously. If you're using Python, you can in some situations also add caching at the Python level: you read your data from the database and store it in memory for subsequent use. You can even use SQLite for that if you have smaller data sets, and do in-memory processing.

The next point is complex queries running too slowly, for example a report running over millions and millions of rows. Those will usually take a few minutes, depending on how complex they are, and of course users don't want to wait a few minutes. A common strategy is to pre-process some parts of those queries: every, say, 50 minutes you run the queries, put the results into a separate table, and then run your reports on those query tables. And if your queries themselves are too complex, you can address that in Python as well. You simply split up your queries to make them easier for the database to handle, and then combine the results from those queries in Python. A typical example is a report that has aggregates right in the result set. Doing that in SQL is really hard; you can do it, but it's really complex, and it's much easier in Python. For this example you just run two queries, one for the details and one for the aggregates, and then combine everything into a single table and use that in your application. A small sketch follows below.

Tips and tricks. This is just a collection of random stuff that I thought might be interesting for you. A typical problem is record ID creation: you want to put a new row into your database, and you need an ID for that record.
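A minimal sketch of the split-and-combine approach just described. The table, columns and grouping are made up for illustration; the point is simply to run a detail query and an aggregate query separately and merge the result sets in Python.

```python
def sales_report(connection):
    cursor = connection.cursor()

    # Query 1: the detail rows
    cursor.execute(
        "SELECT region, product, amount FROM sales ORDER BY region, product")
    details = cursor.fetchall()

    # Query 2: the aggregates, kept simple for the database
    cursor.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region")
    totals = dict(cursor.fetchall())
    cursor.close()

    # Combine both result sets in Python: emit each detail row with
    # the per-region total appended to it.
    report = []
    for region, product, amount in details:
        report.append((region, product, amount, totals[region]))
    return report
```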
You have a kind of chicken-and-egg problem there, because the typical way of doing it is to use an auto-increment column in your database and let the database deal with it, or to have a sequence in the database and get your IDs from that sequence. The problem with auto-increment, for example, is race conditions: the database takes care of adding the incremented value, and then of course you have to fetch that value again, because you want to continue working on that row. The usual way is to ask the database for the last used ID, and depending on how you do it, you run into race conditions, or you run into context problems, because it's not really clear what the last ID is; some other transaction may also just have inserted something, so it's not clear where to get that last ID from. Another way is to let the auto-increment field insert the ID for you and then query the row back simply by knowing what's in it, but that introduces a performance overhead, so it's not really ideal.

Something we always use in our applications is a very simple, randomized approach. We simply have a big integer field for the ID, we use a good random number generator to generate the ID for us, and we just bet on not having collisions. We use that ID in the row, put it into the database, and it usually succeeds. In the very, very rare cases where it doesn't, you just create a new random number and try again.

How does this work in Python? First you set some constants, so you have a row ID range; what we often do is set the highest bit, so the IDs look nice. Then you have an API, get_random_row_id. This needs to be thread-local, because every thread should generate its own IDs so you don't get any overlaps. Then you have to deal with setting up the random number generator, and the best way is to use SystemRandom to get a good seed: you take the seed, turn it into hex, feed the random number generator with it, and then use that generator in your thread-local. A sketch follows below.

Right, the next point is referential constraints. People are usually very happy about using them. Does everyone know what a referential constraint is? Oh, very few. It basically means this: say you have a table up there and you want to reference a product. Instead of putting the product name directly into your table, you reference the ID of the row in the table that has all the product names, so you just put that ID into your table instead of the name, and in your report you combine those things into nice-looking output. This process of referencing from one table to another is called a referential constraint. Usually you use foreign keys for it, and you can implement one-to-n mappings and n-to-m mappings. The constraints are enforced by the database, and that can sometimes lead to problems: if you have lots of references in your database schema and you want to load data into the database, the load will often fail, because the order in which the data is loaded does not match those referential constraints. With some databases you can switch off constraint checking during the load phase, but it's not really ideal.
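Here is a minimal sketch of the randomized row ID approach described above. The names and the exact ID range (63-bit integers with the highest bit set, to fit a signed BIGINT column) are illustrative assumptions, not the speaker's actual code.

```python
import random
import threading

# Illustrative ID range: positive 63-bit integers with the highest
# usable bit set, so all generated IDs have the same width.
MIN_ROW_ID = 1 << 62
MAX_ROW_ID = (1 << 63) - 1

_local = threading.local()

def get_random_row_id():
    # Each thread keeps its own generator, seeded from the OS entropy
    # pool via SystemRandom, so threads don't share or repeat sequences.
    rng = getattr(_local, "rng", None)
    if rng is None:
        seed = random.SystemRandom().getrandbits(128)
        rng = random.Random(seed)
        _local.rng = rng
    return rng.randint(MIN_ROW_ID, MAX_ROW_ID)
```

On insert, the ID would be used directly; in the rare case of a unique-constraint violation, the caller generates a new random ID and retries.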
Another thing that can happen is that deletes cascade when you delete something, which is not necessarily what you want. So what we do for these things is leave out the referential constraints completely and put everything into the Python database abstraction layer, which has much more knowledge about these things; it's also a good idea to put them into the business logic. Then you avoid all these problems: if you end up in an emergency situation where you have to quickly load data from a backup, you don't have to think about how to turn off the referential constraints, you just load everything and it works.

High availability is another one of those things you have to think about. You have multiple database servers, and when one breaks you want to switch over to another one. There are various systems for doing that, but they tend not to always work perfectly. One thing that can happen is automatic failback: you have a failover where the system switches to a different database server and then automatically switches back, but you can't be sure that all the database servers have actually synchronized by then. You can get what's called a split-brain situation: the data is spread across different servers, but you're not sure all the servers have the same data. And of course some clients may miss the failover event, so they may continue writing to a different database server than the one that is actually in use.

So again, the best thing you can do is move back to Python and manage everything in Python. Then you can also handle the communication between the clients and make sure that all of them know about the failover event, and you can do the failback in an optimal way. You can use this for read optimizations as well: the application can write to one database, have the synchronization between the databases take care of moving the data to the other server, and at the same time read from the other database, which also avoids some of these locking issues. A small sketch of such a read/write split follows below.

So I actually made it: 42 slides in half an hour. That's it, thank you for listening. Any questions? Yes?

Thanks for the talk. You said transactions shouldn't be too long. What's a good measure for the size of a transaction?

If you're writing to the database, I'd say maybe 10 to 100 rows in a single transaction.

Okay, thanks. You may not want to use random numbers for your IDs, since you may break the locality and the internal balanced tree of the index. So it may be a good idea to use auto-increment for the sake of locality; otherwise you may have performance issues when writing to the database.

Okay, yeah. Well, we've measured it and it worked out fine. But in some databases we're using sequences from the database to generate the IDs.

It depends on the implementation of the database. MySQL with InnoDB, for example, may have problems with random IDs.

Okay. We're mostly using Postgres, so maybe.

I have a question, maybe you can help me with that.
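A minimal sketch of the read/write split mentioned above, assuming one primary connection for writes and one replica connection for reads. Replication between the servers is assumed to be handled by the database, and the failover handling itself is left out.

```python
class RoutedConnection:
    """Route writes to the primary server and reads to a replica.

    Sketch only: replication between the servers is assumed to be
    handled by the database, and failover handling is not shown.
    """

    def __init__(self, primary_connection, replica_connection):
        self.primary = primary_connection
        self.replica = replica_connection

    def execute_write(self, statement, parameters=()):
        cursor = self.primary.cursor()
        cursor.execute(statement, parameters)
        self.primary.commit()
        cursor.close()

    def execute_read(self, statement, parameters=()):
        cursor = self.replica.cursor()
        cursor.execute(statement, parameters)
        rows = cursor.fetchall()
        cursor.close()
        return rows
```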
Maybe I'm missing something, because when you were talking about the problem with slow, complicated queries, you suggested splitting them into multiple simple queries and moving some of the computational weight into Python. My question is: why would that even speed up the process? I imagine the database is constructed in such a way that such queries are done with maximum performance, so why would Python do the same logical thing faster?

It's not so much about speeding up the operation; it's about making it possible in the first place, because there are some things you want to do in reports that are not possible in SQL, since SQL is too limited for them. And even where you can do it in SQL, you end up with really huge SQL statements, or you have to resort to stored procedures and so on, which makes things a lot more complicated. So we found it's usually better, instead of wasting time thinking about how to structure your SQL properly and making it very complex, to just run a few simple SQL queries and then do the combination in Python.

Okay, thanks. The word "slow" may have confused me a little bit. Thank you.

Hi, thanks for the talk, it was very interesting. Just a suggestion based on experience, because I've been a DBA in a former life: if you plan to build applications, talk with your database people beforehand. They're usually nice, they don't bite. They can make your life much easier, because there is so much implementation detail involved. So maybe that's a good thing to consider: go out at lunch, talk to them for an hour, and it saves you about eight hours of programming time.

Yes, thank you, that's a very good suggestion. By the way, many of these things are database specific, so you really have to know your database if you want to make proper decisions; they don't necessarily apply to all databases. Some databases are better, some are worse. We found that Postgres is a great foundation. Thank you.

Thank you.