So, next up we have Django Multitenancy, the isolated database approach, by Aditya. So, over to you, Aditya.

Yeah, thanks Berman, thanks for the introduction. Right. So, hey Pythonistas, good evening, and welcome to this presentation on Django and multitenancy: the isolated database approach. A little bit about me: I am Aditya, and I am currently working as a senior software engineer at Innovator. This is the idea that I have been predominantly working on for the past few years. This is my first talk at PyCon and I'm super excited about it. Also, this is probably going to be extra special for all the Django buffs out there who primarily use Django as their web framework of choice. So, without further delay, let's get started.

Say hello to Bob. He's the presentation mascot. He's also my partner in crime and will be helping me drive this presentation. So, there are three major topics that I'll be covering today. I'll start off with a brief introduction to multitenancy, why it is needed, and the approaches commonly used in the industry to solve this. Then I'll specifically throw light on the isolated database approach, why it is essential, and how to go about implementing it. To conclude, I'll talk about some additional implementation challenges that might have to be solved in order to get things production ready.

So, to understand multitenancy, we'll first have to understand how single-tenant deployment systems work. As Bob says, in the beginning, web applications were single-tenant. What does this mean? Well, imagine that we built a wonderful library management system in Django and want to deploy it across multiple customers. The back-end deployment architecture for each customer might be fairly similar to the one shown in the image here, wherein each stack could consist of a load balancer such as NGINX, the primary application server written in Django and served by Gunicorn, and PostgreSQL being used for the database layer. Now, as you might have noticed, this entire stack has to be replicated for each customer we sell our software to. This method of deployment is called a single-tenant deployment scheme, where the term tenant refers to a customer and can be used interchangeably.

So, now that we have a fair understanding of what a single-tenant deployment architecture looks like, let's take things up a notch. Coming to the golden question: what is multitenancy? Well, in the SaaS world, multitenancy refers to an architecture paradigm in which a single instance of a software or web application is capable of serving requests from multiple customers. Great, so you might be wondering, who is using this? Well, some of the biggest SaaS companies in the world, such as Google, Salesforce, Atlassian, Amazon, and Zoho, use multitenant architectures to serve customers at scale.

So, now that we know a lot of companies are using this, let's look at the illustration of the two most popular approaches used in the industry to solve this problem. They are the isolated database approach, where each tenant gets its own database, and the shared database approach, where all tenants share the same database but data isolation is achieved logically by means of relationship constraints. So, now that we have a visual understanding of what a multitenant architecture looks like, let's take a look at its benefits.
So, the advantages of this architecture can be largely derived from the two illustrations that we have on screen here. Firstly, as you might have noticed, all the stateless components of the stack, such as the load balancer and the application server, are common across all tenants, which greatly reduces infrastructure deployment costs. As a consequence, it gets rid of the fragmented versioning problem, because now, when a version upgrade happens, all customers get access to the latest version of the software, resulting in a better overall customer experience. It also largely mitigates the DevOps and maintenance overhead, because now there's only one version of the code base to develop and maintain.

So, now that we've seen how essential multitenancy is, let's see how to implement it using the isolated database approach. Bob is thinking: why the isolated database approach? Why not the shared DB approach or something else? The answer to that question is manifold. In the data-centric world we live in right now, data isolation and privacy are of utmost importance, especially in industries like healthcare. With this approach, it is guaranteed that an accidental data leak or data overlap between tenants can never occur. Also, all existing database utilities, such as backup/restore and tools for database visualization, would work without any modifications, as they all operate at the database level, and so does this approach. Interestingly, this approach works irrespective of the database technology being used, whether SQL or NoSQL, since the concept of a database is universal across both paradigms. Later on, we'll also see that adapting existing or new applications to this approach is fairly easy, as it involves almost no change to code or database schema.

So, now that we've understood the significance of this approach, let's move on to the implementation details. What does the implementation look like? Well, there are two major problems we'd have to solve. The first one is dynamic routing of database queries: all queries made from within the HTTP route handler, or in this case the Django view, should be scoped to a particular tenant's database at runtime. The second one is even more interesting, in that when a new tenant is added, or when an existing tenant's configuration is modified or deleted, the app servers have to be informed of these changes and handle them accordingly in order to serve subsequent requests properly. Now, for the purposes of illustration, Django has been used, but the approaches we'll be discussing to solve both these problems are pretty generic and can be extended to any web framework of choice.

So, let's take a deep dive into how the DB query routing problem can be solved. The objective here is quite simple: we basically have to write a reusable Django application which, when plugged into any existing single-tenant web app, would instantly make it multitenant by routing database queries to the appropriate tenant databases at runtime. So, now that we know what has to be done, let's look at the solution architecture for building this reusable app. Let's first take a look at the various components present in the architecture. Obviously, we have Django itself, which is our primary application server. In addition, we've included Celery in the backend stack in order to facilitate asynchronous processing of tasks.
So, to all the folks out there using Celery as part of the backend stack, don't worry, we've got you covered. Moving to the right, we have the PostgreSQL cluster consisting of both the tenant-one and tenant-two databases. Coming to the center, we have Redis being used as the tenant configuration store, to keep all the tenant-related metadata and their respective database configurations. Then we have the message queue being used by Celery. And finally, we have the ORM routing layer, which is responsible for database query routing.

So, let's take a look at how these components behave in the boot phase. When the Django server boots and the ready method of the tenant router application is fired, it pulls the database configuration of all tenants from the tenant configuration store and loads it in memory. The same mechanism is repeated when a Celery worker boots as well.

Moving on to the execution phase, we see that when an HTTP request is fired, the tenant middleware is invoked and sets the appropriate tenant context in thread-local storage, based on some metadata from the request, like its headers or domain. Now, assume that the route handler, or the Django view, executes a bunch of ORM queries and fires a Celery task as well. Let's first trace the execution flow of an ORM query. When it is fired, it first reaches the ORM routing layer, which consists of a database router class defined according to the Django specification. This class uses the tenant context that was set earlier by the middleware to figure out which database a particular query has to be routed to. Thus, using this mechanism, all ORM queries fired from within any Django view get routed to the appropriate tenant databases at runtime.

Now, let's see how we can achieve the same result with a Celery task as well. Firstly, the before_task_publish signal is fired from within the Django context, where the tenant context that was set by the middleware is injected into the task's metadata before pushing it into the message queue. Then, when the worker picks it up from the queue and the task_prerun signal is fired, the tenant context that was injected earlier is pulled out of the task's metadata and set in the thread-local storage of the worker thread. Now, when an ORM query gets fired from the actual Celery task handler, the same routing mechanism I mentioned earlier kicks in to route the database queries to the appropriate tenant databases.

So, with that, we come to a wrap of the solution architecture. Now that we've understood the behavior and responsibilities of each of these components, let's look at how to configure them in order to get everything working. I'd like to point out that all code snippets and demos in the upcoming slides are part of a GitHub repository whose link can be found in the references section at the very end.

So, the code snippet here shows the Django-specific settings that have to be configured. To begin with, the tenant router app is plugged into the list of installed apps. Next up, we've plugged the tenant context middleware into the list of middlewares; make sure to add this middleware above any other middleware that might require the tenant context to be set. Going on, the custom database router class is plugged into the list of database routers; this is responsible for enabling the dynamic query routing mechanism. Also, the very familiar DATABASES dictionary has to be configured as well.
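Since the on-screen snippet isn't captured in the transcript, here's a rough reconstruction of what such a settings.py could look like. This is only a sketch under assumed names: the tenant_router module paths, the setting shapes, the Redis location, and the service name are all illustrative, and the exact versions live in the linked GitHub repository.

```python
# settings.py -- illustrative sketch, not the repo's exact code
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    # ...
    "tenant_router",  # the reusable tenant router app (hypothetical path)
]

MIDDLEWARE = [
    # kept above any middleware that needs the tenant context to be set
    "tenant_router.middleware.TenantContextMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ...
]

# enables the dynamic query routing mechanism
DATABASE_ROUTERS = ["tenant_router.routers.TenantOrmRouter"]

# no longer the actual per-tenant configuration: just a template that
# the tenant router app expands at boot into aliases like
# 'tenant1-default', 'tenant2-default', and so on
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
    },
}

TENANT_ROUTER_ORM_SETTINGS = {
    # one entry, since only the Django ORM is in use; the exact
    # shape of this setting is documented in the repository
    "django_orm": {},
}

# prefixes every key written to the config store, creating a
# per-service, per-tenant namespace
TENANT_ROUTER_SERVICE_NAME = "hospital_app"

CACHES = {
    # points at the Redis instance seeded with all tenant metadata
    "tenant_router_config_store": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/0",
    },
}
```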
As you might have noticed, the DATABASES dictionary will no longer contain the actual database configuration. Rather, it serves as a template that will later be used when expanding this dictionary into something like tenant1-default, tenant2-default, and so on. The expansion happens when the ready method of the tenant router app is fired during boot-up.

So, let's move on to the configuration specific to the tenant router app. Since the application could use multiple ORMs, the tenant router app has to be informed of all the ORMs being used and their respective configurations. This is achieved by configuring the tenant router ORM settings; here, it contains just one entry, because we're using only the Django ORM. Other possible ways to configure this setting can be found in the GitHub repository. Next up, a per-service, per-tenant configuration namespace has to be created in the tenant config store, so as to avoid collisions with other namespaces. This is accomplished by prefixing all keys stored in the config store with the value provided for the tenant router service name. Also, the CACHES dictionary now has to include an entry with the key tenant router config store, which in this case points to the Redis instance that would be seeded with the metadata and database configuration of all tenants.

So, now that we've seen how to configure the application, let's look at it in action. In this demo, we're going to be looking at a simple hospital management system that lists hospitals and the patients belonging to those hospitals. Firstly, let's verify whether the query routing mechanism works. As you can see, the list of hospitals displayed when the browser is pointing to tenant1.test.com is different from the list displayed when the browser is pointing to tenant2.test.com. This in itself verifies the behavior to a certain extent. To verify further, let's inspect the API calls being made to fetch the list of hospitals. Here, we see that the route being fired is /api/hospital. Let's move to the tenant2 tab and check the same API call. Here again, we see that the route being fired is /api/hospital. So, this confirms that both API calls are being made to the same Django view, but the data being returned is different.

So, how is this happening? Well, the answer lies in the value of the X-Tenant-ID header being passed. Here we can see that the value of this header in the tenant2 tab is tenant2.test.com, whereas the value in the tenant1 tab is tenant1.test.com. Thus, we've verified that queries from within the Django view are being routed to the appropriate tenant databases based on the value of this header.

Now, let's try to verify the same behavior with a Celery task as well. Let's go ahead and add a patient; let's call this patient Bob. When this form is submitted, rather than the patient being added immediately, a Celery task is fired in the background, which adds the patient to the database asynchronously. I'd like to point out that this has been done only for demonstration purposes and is not a real-world scenario. So, let's go ahead and submit the form. Once the patients tab refreshes, we should see the patient appear. Ah, awesome. We have the patient that we just added being listed here. Also, we see that the patients tab is now displaying different data for each of the tenant tabs that we have. To make the mechanics behind this demo concrete, a rough sketch of the pieces involved follows.
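The following is only a sketch of the general mechanism described earlier, under assumed names: TenantContextMiddleware, TenantOrmRouter, the X-Tenant-ID header, the "tenant-default" alias convention, and the task-metadata key are all illustrative, and the mapping from header value to tenant alias is glossed over. The actual implementation lives in the linked GitHub repository.

```python
import threading

from celery import signals

_context = threading.local()

def set_current_tenant(tenant_id):
    _context.tenant_id = tenant_id

def get_current_tenant():
    return getattr(_context, "tenant_id", None)

class TenantContextMiddleware:
    """Resolves the tenant from request metadata and stashes it in
    thread-local storage before the view runs."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # e.g. from a header or the request's domain; resolving this
        # value to a tenant id is simplified here
        set_current_tenant(request.headers.get("X-Tenant-ID"))
        return self.get_response(request)

class TenantOrmRouter:
    """Routes every ORM query to the current tenant's database,
    assuming aliases like 'tenant1-default' were created at boot."""

    def db_for_read(self, model, **hints):
        tenant = get_current_tenant()
        return f"{tenant}-default" if tenant else None

    db_for_write = db_for_read

# Celery side: carry the tenant context across the process boundary.
@signals.before_task_publish.connect
def inject_tenant_context(headers=None, **kwargs):
    # fired in the web process, before the task hits the queue
    headers["tenant_id"] = get_current_tenant()

@signals.task_prerun.connect
def restore_tenant_context(task=None, **kwargs):
    # fired in the worker thread, before the task body executes
    set_current_tenant(task.request.get("tenant_id"))
```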
So, well, with that, we come to the end of this demo. We've successfully verified that database query routing works with both Django and Celery. Let's move on to bigger and better things.

So, now that our first problem is solved, let's move on to the second one, which is even more interesting. The objective here is again pretty simple: we need a mechanism by which events such as tenant addition, update, or deletion are communicated to all running instances of the app server, which in turn have to handle them in order to serve subsequent requests appropriately.

Now, typically in the Django world, a problem like this is tackled using a graceful restart, so let's briefly walk through this architecture. Assume that the DevOps or infrastructure system fires an HTTP call to the app server whenever a tenant gets added, updated, or deleted. When the worker handles this request, it first updates the tenant configuration store and then sends a HUP signal to the master to initiate a graceful restart. When the server reboots, it picks up the updated configuration from the config store and is thus able to serve subsequent requests appropriately.

So, the problem is pretty much solved, but this solution has a couple of major drawbacks. If we think about it, if multiple tenants are onboarded in quick succession while the app server is under high load, then the entire app server cluster would have to be restarted that many times, which seems very unwieldy and impractical. Also, not all components provide a reliable way to achieve a graceful restart. For example, the restart mechanism in Celery works by first sending a TERM signal and waiting for the workers to finish any pending tasks before initiating a worker reboot, so a restart could be held up indefinitely by long-running tasks. Based on all of this, we won't be doing that here.

The solution to this problem is instead to design a publish-subscribe mechanism that communicates all tenant-related events proactively to all running workers and allows them to handle those events subsequently as well. So, instead of the workers restarting, they would be reacting to tenant configuration changes. Let's discuss in detail what the solution architecture for this system looks like.

Let's first take a look at the different components present in this architecture. To the top left, we have the DevOps block representing the DevOps or infrastructure system; this could be AWS, Azure, GCP, etc. Coming to the center, we have Gunicorn consisting of two worker processes. To the bottom right, we have Redis being used as the tenant configuration store. And we also have a pub-sub provider component, which serves as the base infrastructure upon which our entire pub-sub service is built.

Coming to the execution flow: when Gunicorn boots, each worker, apart from the main thread, spawns an additional background thread which keeps listening for events on a particular channel. So, now that the boot phase is done, let's look at the workflow that gets executed when a new tenant is onboarded to the platform. First, the DevOps or infrastructure system spins up the necessary infrastructure in terms of database, cache, etc. on the cloud platform. Then, it makes an HTTP call to the app server to inform it that a new tenant has been added; the payload of this request contains the database configuration of the newly added tenant. Now, when the request is handled by one of the workers, here's what it does.
Firstly, it inserts this database configuration into the tenant configuration store. Then, it runs database migrations on the newly provisioned database. And finally, it publishes a tenant-create event on the pub-sub channel. The pub-sub provider then delivers the event to all active subscribers, which in this case are the background threads of both worker processes. A callback is subsequently invoked in both these threads, which updates the in-memory database configuration (in this case, the DATABASES dict that we find in settings.py) with that of the added tenant. Thus, all subsequent HTTP requests coming from the newly added tenant can be handled by either of the worker processes.

So, the solution seems convincing, but is that all there is to it? Well, let's find out. Bob is back and thinks that we're missing something here. And indeed, we are: what about closing or invalidating the stale database connections? To understand the scenario better, let's look at what happens when a tenant-update event occurs. The callback executed in the background thread has to perform the following operations. The first is updating the in-memory database configuration with that of the modified tenant; this is common to all tenant lifecycle events. The second, however, is specific to the tenant update and delete events, wherein the stale database connection has to be closed or invalidated.

Now, the first operation is simple and can be executed safely, but the second operation is a bit tricky, and here's why. Database connection management in ORMs can be divided into two categories: thread-local connections, where database connections are stored in thread-local storage, and process-level connections, where database connections are stored at the process level and shared by all threads of a process. If the ORM uses thread-local connections, the problem is that closing the stale connection of the main thread is impossible, since the callback runs in the background thread. If the ORM uses process-level connections and the connection is closed in the background thread, then the main thread would suffer from a broken-connection error, since both threads share the same connection object.

So, if we think about it, we'll have to modify our earlier solution such that closing the stale database connections can be done in a safe and reliable way. Let's see how we can achieve this. The architecture diagram here is almost the same as what we saw earlier, but with some minor modifications: a new event queue component has been added to each of the worker processes. Let's see how adding this component solves all our problems.

Let's work through the execution flow again, but this time look at what happens when a tenant-update event occurs. The execution flow is the same until the event is delivered to the background threads of both worker processes. Now, the event handler callback fired in both these threads, instead of performing the actual operations, creates a wrapped event and puts it into the event queue to be consumed later. The implementation of this queue could be as simple as using a collections.deque in Python. The queue is later consumed by the main thread at both the request_started and request_finished signals emitted by Django, where the actual callback, responsible for invalidating the stale database connection and updating the in-memory configuration, gets executed.
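As a rough illustration of that deferred-handling idea, here is a minimal sketch. The names and the event shape (pending_tenant_events, on_pubsub_event, apply_db_config_update, the "type" and "db_alias" keys) are assumptions for illustration, not the repository's actual API.

```python
import collections

from django.core.signals import request_finished, request_started
from django.db import connections

# One queue per worker process. The background pub-sub listener thread
# only ever appends to it; it never touches DB connections itself.
pending_tenant_events = collections.deque()

def on_pubsub_event(event):
    # runs in the background listener thread: just enqueue and return
    pending_tenant_events.append(event)

def apply_db_config_update(event):
    # hypothetical helper: merge the event's payload into the
    # in-memory DATABASES configuration
    ...

def drain_tenant_events(**kwargs):
    # Runs in the main thread, before a view starts or after it
    # finishes, so no request is ever mid-flight on a connection
    # that gets closed here.
    while pending_tenant_events:
        event = pending_tenant_events.popleft()
        if event["type"] in ("update", "delete"):
            connections[event["db_alias"]].close()  # drop the stale connection
        apply_db_config_update(event)

request_started.connect(drain_tenant_events)
request_finished.connect(drain_tenant_events)
```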
So, how does this solve the problems mentioned earlier? Well, in the case of thread-local connections, since the actual callback now gets executed in the main thread, it has access to the thread-local connection that has to be closed. In the case of process-level connections, since the connection is closed before the view starts or after the view finishes, the view would never experience a broken-connection error in between, thereby solving this problem as well. I'd also like to point out that the same mechanism has been extended and applied to Celery; a brief note about this can be found in the GitHub repository, so do check it out.

So, now that we have a working mechanism by which app servers can react to changes in tenant configuration, let's look at how developers can use this API in practice. This API model has been largely inspired by ReactiveX's observable pattern. Let's assume that there's a user-defined Django application which needs to perform some custom operation when a tenant lifecycle event occurs. The tenant channel observable is used to achieve this, by subscribing to the specific lifecycle event and providing a callback that needs to be triggered subsequently. You can also see that the callback receives an additional event argument to provide better processing context. More details about the API and how it works can be found in the GitHub repository.

So, now that we've seen how to use the API, let's look at the settings that have to be configured in order to get things working. First, we need to inform the application of the backend class and location of the pub-sub provider to use; this is achieved by configuring the tenant router pub-sub settings. Currently, only Redis is supported; support for other providers like Kafka can be added iteratively. Also, in order to enable the pub-sub mechanism, a flag named tenant router pubsub enabled has to be explicitly set to true. Next up, in order to reliably start the background event listener thread on boot-up, a couple of callbacks have to be invoked from the Gunicorn server hooks. To understand why it has to be done this way, there's a technical intricacy about the different pre-forking models that components like Gunicorn and Celery use; you can find more information about this in the GitHub repository.

So, now that we're done with the configuration as well, let's see it all in action. In this demo, let's look at the add-tenant workflow, followed by a delete-tenant workflow. Assume that the database for the new tenant has been created; all we have to do now is make an HTTP call to the app server to inform it that a new tenant has been added. So, here we have the create tenant API. Let's go ahead and fire this. Awesome, we have a successful response. At this point, the tenant-create event has been put into the event queue of both workers, but hasn't been processed yet; only when the next HTTP call is made is the event queue consumed. So, let's go ahead and make an HTTP call from the new tenant that we just added in order to verify this behavior. Right, we have a successful response for both the hospitals and patients APIs. This means that our Gunicorn workers have updated their respective in-memory database configurations with that of the newly added tenant. So, let's go ahead and verify the same behavior with a Celery task as well.
So, let's add a patient to do this; let's call this patient Adam. Right, let's go ahead and submit this form. Awesome. It seems that the Celery workers have updated their in-memory database configuration as well. Additionally, we see that all three tabs are now displaying different data in the patients section.

Cool. So, to wrap things up, let's try to delete this tenant. All we have to do is fire an HTTP DELETE request to the app server with this tenant's ID. So, let's go ahead and do that. Awesome, we have a successful response. Also, we see that all API calls made from this new tenant are now erroring out, saying that the tenant with this ID does not exist anymore. The reason is that it has been removed from both the tenant configuration store and the in-memory configuration of the Gunicorn workers. So, with that, we come to the end of this demo as well. It almost seems like magic that we've successfully achieved reactive tenant configuration.

So, now that we have a working model of our entire idea, are there any additional implementation challenges to look out for before trying this out in production? Well, the answer is yes, and the first one is database migrations. The default migrate command would no longer work, as it applies migrations to only one database at a time; since we now have to run migrations across all tenant databases, a custom migrate command might have to be written. Next, caches, similar to databases, could also be tenant-specific. Since Django currently doesn't provide a routing mechanism for caches out of the box, a custom cache handler might have to be implemented to address this use case. The last of our concerns is unit tests. Since the concept of a default database doesn't exist anymore, all unit tests during execution would have to go against a single tenant's database. To achieve this, a custom test runner class might have to be implemented, extending the behavior of the default Django test runner. If you're wondering whether these problems have already been addressed, the answer is yes, and you can find the relevant solutions in the GitHub repository, so do check it out.

Well, with that, we come to a wrap. Hope you understood a fair bit about multitenancy and the challenges involved in implementing it. If you like the idea and my work, please check out my Medium and LinkedIn profiles as well. I guess more than a few questions would have sprung up by now, so let's kickstart the Q&A.

So, I guess we are ready to take the Q&A. Yeah. So, there are no questions as of now; people will probably ask more, but we are running out of time, so we might have to continue elsewhere. Okay, so I guess I will be available in the Zulip chat. Yes, Aditya will be available in the Zulip chat, folks; we will post the link as well. Yeah, people have been asking for the slides; they really liked it, so cool stuff. Oh, okay, thanks a lot. There is one question. Right, let's get to that: what about pytest with Django? How do we implement this? So, regarding unit tests, the thing is that we will be implementing a custom test runner. Whether we are using the default Python standard unittest library or pytest, we can basically specify the path to the test runner to the testing framework that we are using, and it will use that test runner to run all the test cases.
So, the solution is implemented in a way that keeps the test runner and the testing framework separate: the test runner is one piece, and the framework which invokes the test runner is another. So, irrespective of whether you use pytest or any other Python testing framework, if you give it the path to this test runner, it should work.
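To make that answer concrete, here's a minimal sketch of what such a test runner could look like. The names here (SingleTenantTestRunner, activate_tenant, the tenant id, and the settings path) are hypothetical, not the repository's actual API.

```python
# testing.py -- illustrative sketch of a single-tenant test runner
from django.test.runner import DiscoverRunner

def activate_tenant(tenant_id):
    # hypothetical helper: set the tenant context so the database
    # router resolves every query to this tenant's database
    ...

class SingleTenantTestRunner(DiscoverRunner):
    """Pins the whole test run to one tenant's database, since a
    plain 'default' alias no longer exists in this setup."""

    def setup_databases(self, **kwargs):
        activate_tenant("test_tenant")  # all queries now route here
        return super().setup_databases(**kwargs)
```

With Django's standard test command, this would be wired up via something like TEST_RUNNER = "myapp.testing.SingleTenantTestRunner" in settings.py (path hypothetical); other testing frameworks expose their own configuration hooks for pointing at a runner or performing the equivalent setup.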