Yeah, so I'll start anyway. I hope I can keep you all awake through the talk. I'm Ratnadeep Debnath, also known as rtnpro. I work at Zapier, I contribute to open source, and I mentor people getting started with open source contributions. I'm also a Bhakti Yoga practitioner, and I work remotely from a place called Mayapur, which is 130 kilometers from Kolkata in West Bengal, India, where the world's biggest temple is coming up. To start with, I'd like to set some context so that you can relate to the talk better. As I said, I work at Zapier. So what do we do at Zapier? We help you connect your apps and automate workflows. You can connect the apps you use every day and do some cool stuff. For example, you can write a zap, that is, a workflow, which checks your inbox for new messages, and if a message has an attachment, uploads the attachment to Dropbox and sends the Dropbox link to Slack. You can automate similar workflows in your life, all using our web editor, without writing a single line of code. So what is Zapier under the hood? We are a Python/Django shop. We run more than 10 million tasks a day to automate all these workflows. We have around 250 application servers that talk to our databases, and we need to scale to keep up with the new users joining our platform. We won't win any awards for our application architecture; it's plain vanilla. We have multiple application servers running multiple workers. For the database we use RDS MySQL, in a single-master, multiple-replica setup. The app servers connect to the primary for reads and writes, and to the replicas for read-intensive work. However, this model has some shortcomings: our apps end up making too many connections to the database. We tried Django connection pooling, but it was not enough. It doesn't scale beyond a single process, let alone across multiple servers.
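To see why per-process pooling cannot cap the total, here is a back-of-envelope count. Only the 250 servers and the roughly 6,000-connection threshold come from the talk; the per-server numbers are made up:

```python
# Each Django process pools only its own connections, so totals multiply
# across workers and servers: pooling helps inside a process, not globally.
app_servers = 250        # from the talk
workers_per_server = 8   # assumed
conns_per_worker = 4     # assumed persistent connections per worker
total = app_servers * workers_per_server * conns_per_worker
print(total)  # 8000, already past the ~6,000 connections where MySQL stalls
```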
Even with Django pooling, we ended up in situations where our applications made too many connections to the database, and MySQL really doesn't like having a lot of connections open. We had a threshold: above around 6,000 connections, our DB would start slowing down, and then things would go spiraling down from there. This caused downtimes at peak hours, which is not great. So far, we had used duct tape to get around this problem. When you hit the limits of an RDS MySQL instance, one approach is adding more replicas to handle more reads. But for us the master was also a bottleneck, and we have only a single master. When we could not solve a problem by adding replicas, we would hit a button in AWS, scale up RDS, pay some additional dollars, and get away with it for a while. But not forever. This has limitations: it cannot go on forever, we are limited by the hardware available in a particular AWS availability zone, and it is cost prohibitive. So what now? We started with a vision: we wanted to hit 10x scale without 10x cost. We considered AWS Aurora, but it is still limited to a single master, and it would have involved some re-architecture, which we did not want to do at that point. So what would an ideal solution look like for us? We started looking for a global database proxy with a good, efficient connection pool, something that works across multiple servers, not just a single process. So we surveyed the ecosystem for available solutions. First, we looked at generic proxies like Nginx and HAProxy, but soon realized that they are just proxies: they are not database-aware, and they do not do the kind of connection pooling we needed. Then we hit upon MariaDB MaxScale. It is database-aware, which was good news for us, and it has connection pooling, something we wanted. However, it is not aware of transactions, which was a letdown, because we need support for transactions. And it doesn't support query-based routing.
So what do I mean by query-based routing? You can define rules like: route all read traffic, for example SELECT queries, to your replicas, and route all write queries to the primary database. You cannot do that with MariaDB MaxScale. And on top of that, it is proprietary, and at Zapier we prefer using open-source software wherever possible. Then we hit upon ProxySQL. It is database-aware, like its earlier counterpart. It supports connection pooling. It is transaction-aware, so it understands database transactions: good news for us. It also supports query-based routing. And on top of that, it's open source: a win-win for us. So our choice, as the title on the slide says, was obviously ProxySQL, because it is featureful: it is transaction-aware, it supports connection pooling, and, as I told you before, it's open source. On top of that, it has an awesome community. ProxySQL is not developed by a single person or a single organization; there are collaborators from various organizations, an active GitHub issue list and pull requests, and an active mailing list, and you get support from all these channels. There is also a wider community around it: various DB consulting firms, like Percona and Severalnines, support ProxySQL and have generated awesome content about it, with benchmarks and good blog posts on how to set it up efficiently. So we felt that with this community, we were in good hands with ProxySQL. So how does ProxySQL actually work? ProxySQL sits between your application servers and your database. Your application servers, notorious as they are, can spawn thousands of connections to ProxySQL, while ProxySQL maintains just a few hundred connections to the database and does intelligent multiplexing, so it can deliver the throughput while putting less load on the database. So how do you get started with ProxySQL? You install ProxySQL using your distribution's package manager.
It's available in most distributions. You put a big proxysql.cnf in /etc/proxysql.cnf. I'll go through the important parts here. You configure your backend database servers in mysql_servers. You configure the users that ProxySQL will use to connect to your backend databases, and the user that the applications will use to connect to ProxySQL. And then you define some query rules. You can define port-based query rules, which are the easiest to get started with, but you can also write user-based rules and query-pattern-based rules, so you have a lot of options here. And then you start ProxySQL. Once it is running, you also have the option to update its configuration live, without restarting ProxySQL. That means that when you have to update the configuration, for example to add new backend DBs, you don't have any downtime. If you search for ProxySQL on the internet, you'll find a lot of articles with good benchmarks and good numbers generated by the community. But before taking things to production, we wanted to try it for ourselves. So we used a tool called sysbench to run benchmarks on ProxySQL. Sysbench is a cool tool which can throw thousands of concurrent connections at your database, put it under load, and generate metrics from that. We ran the tests in multiple modes, read-only and read-write, across two scenarios: in one case we hit the database, a small RDS instance, directly, and in the other case we hit it via ProxySQL, and we recorded the metrics. These are the metrics we recorded; I'll analyze them here. Comparing the numbers, we see that for latency, ProxySQL has around 10% higher latency than direct access to the database. That is expected: it's another layer on top of the database.
In terms of transactional QPS, ProxySQL again delivers about 10% less than direct access to the database. For other queries, the QPS is again around 10% lower. But the interesting thing to note here is the connection counts. With direct access, we had 100 concurrent connections hitting the database. With ProxySQL, we had 100 concurrent connections hitting ProxySQL from the applications, but ProxySQL maintained a much smaller number of connections to the database. In the read-only case it used a tenth of the connections: just 10 backend connections to handle 100 client connections. In the read-write case it used 20 connections, a fifth of the total coming from the application. So ProxySQL is able to deliver almost 90% of the performance while maintaining a tenth or a fifth of the connections to the database, which is a great result. It gives you enough room to scale on top of your current database setup, on top of a single master. So ProxySQL has a number of advantages. But is there a catch? How far can it multiplex? Upstream claims that ProxySQL, tweaked the right way, can multiplex almost indefinitely, but there are some gotchas: you don't want high latency on connection initialization. Based on our benchmarks and our tests in staging and production, we saw easily 10x gains for read queries and 5x gains for write queries, which are pretty good numbers for us. ProxySQL also boasts about zero-downtime configuration updates. That means you can add or replace backend databases without restarting your app and without restarting ProxySQL, so there is no downtime. But it sounds almost too good to be true. So, being the skeptics we are, we decided to write our own tests and verify this.
We had to write these tests because, while the docs did say that ProxySQL supports zero-downtime live updates, they did not have examples or explanations, so we were somewhat in doubt. At this link you can find the tests we wrote to verify it, and what we found was really cool. Whenever you replace an existing backend server in ProxySQL with a new one, ProxySQL lets the in-flight transactions complete rather than killing them right away, and for newer transactions that arrive, it starts routing them to the new database. So you do not lose any in-flight transactions during a live update. Seems good. We use Django, so we wanted to find out how well ProxySQL plays with Django, and specifically with the Django ORM. What works with Django? Typical reads and writes work well. Django performs transactions using SET autocommit=0 and SET autocommit=1 blocks. That is not the typical way to do transactions, and DBAs don't like it. ProxySQL did not support it earlier, but based on a requirement from the community, they eventually added a config parameter that makes ProxySQL treat autocommit=0/1 blocks as transactions: you set mysql-autocommit_false_is_transaction to true to make it identify autocommit=0 as the start of a transaction. What this does is disable multiplexing during transactions, so that one connection inside a transaction won't get spread across multiple backend database connections. And you can use port-based routing to get started with ProxySQL under Django without any code refactoring. What doesn't work? Officially, savepoint queries, that is, nested transactions, are documented as unsupported with ProxySQL. Long-lived connections from the application also don't work that well, so if you have application-side pooling, you might consider disabling it or tweaking the timeouts.
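Two of the points above, the port-based routing that needs no code refactor and the short-lived connections, both show up in plain Django settings. A minimal sketch, where the hostname, credentials, and the second listener port 6034 are assumptions (6033 is ProxySQL's default MySQL listener):

```python
# Illustrative Django DATABASES for port-based routing through ProxySQL.
# Each ProxySQL listener port is mapped, on the ProxySQL side, to a
# hostgroup, so the app picks primary vs. replicas just by port.
DATABASES = {
    "default": {  # read-write alias: listener mapped to the primary
        "ENGINE": "django.db.backends.mysql",
        "NAME": "app",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "proxysql.internal",
        "PORT": "6033",
        "CONN_MAX_AGE": 15,  # short-lived: long app connections misbehave
    },
    "replica": {  # read-only alias: listener mapped to the replicas
        "ENGINE": "django.db.backends.mysql",
        "NAME": "app",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "proxysql.internal",
        "PORT": "6034",
        "CONN_MAX_AGE": 15,
    },
}
```

With this layout the ORM needs no changes beyond the usual `using("replica")` calls or a database router; the read/write decision is made by which port the connection arrives on.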
By timeouts, I mean the time for which the database connections are kept alive. At Zapier, and I think in many organizations, you will be using savepoints, that is, nested transactions: you want to do a partial commit and then roll back or commit as necessary. So this was key to going live with ProxySQL, and it was a blocker for production. It is not officially supported, but we had a hypothesis: since in Django all savepoint queries are wrapped inside autocommit=0/1 blocks, and since ProxySQL disables multiplexing during that period, it might just work. It was only a hypothesis, so again we started writing tests, and we found that it works in certain cases. Savepoints work when they are encapsulated within autocommit=0 and autocommit=1 blocks, so that ProxySQL can treat the whole thing as a transaction. And we needed to turn on another variable, mysql-enforce_autocommit_on_reads, to get this working across isolation levels. In our initial tests, even a savepoint encapsulated inside autocommit blocks did not work at the REPEATABLE READ transaction isolation level, but with this variable set it works well for both READ COMMITTED and REPEATABLE READ. Also, ProxySQL needs to have as many backend connections as there are concurrent transactions hitting it: if you have 100 concurrent transactions hitting ProxySQL, you need 100 backend connections ready to handle them, because multiplexing is disabled during transactions. So if you over-provision ProxySQL a bit, by say a hundred or so extra backend connections, it should work pretty well. And here is the source code of the Django ProxySQL demo and the tests we did; you can refer to it later. Once we were happy that we could go to production with ProxySQL, we started planning our deployment strategy. When it comes to deployment, there are multiple ways you can deploy ProxySQL.
One approach is the application sidecar; the other is a centralized cluster. Don't confuse this sidecar with the sidecar from the container world, although the meaning is similar. The sidecar approach looks like this: you have ProxySQL running alongside the application on the same host, the application talks to the local ProxySQL, and each of these individual ProxySQL instances talks to the database separately. It has some advantages and some disadvantages. Since ProxySQL resides on the same host, there is very little latency between the application and ProxySQL, hardly any network latency, so access is fast, and the load on each ProxySQL instance is low. The disadvantages: it's a pain to maintain such a setup in a traditional deployment model using EC2, Ansible, and Packer. It is quite simple to do in the Docker world with Kubernetes, but that's how it is. Monitoring so many instances of ProxySQL is a pain, and even understanding what failed and what didn't is painful. Maintaining ProxySQL instances deployed across, say, 250 application servers is a painful process. Then there's the second approach, the centralized cluster, where you have a highly available ProxySQL setup running between the applications and the database: the applications connect to this cluster, and the ProxySQL cluster talks to the database. The advantages of this approach are that it's easy to maintain and easy to monitor; it's a single set of a few nodes that you have to look after. The disadvantages: you now introduce a network hop between your application and the ProxySQL nodes, which adds latency, and if you put an ELB in front of it to load-balance traffic, you'll have a few more milliseconds of latency. And since it's a central cluster, it concentrates a higher load on ProxySQL, unlike the previous approach where the load on each ProxySQL was low.
Is that a disadvantage? It sounds like it. But ProxySQL shines brightest when it's under load. This is a benchmark generated by the Percona folks which shows that as the number of connection threads from the application to the database or to ProxySQL increases, throughput increases up to a point. After that, performance starts degrading: throughput decreases in the case of direct database access, whereas ProxySQL reaches a plateau after around 200 threads. And this is how you want your systems to behave at scale: you do not want performance to degrade with increasing load. ProxySQL maintains its average performance even beyond that threshold, whereas direct access to the database just drops drastically. So our choice was a centralized ProxySQL cluster, because it's more performant and it's easier to maintain and monitor. Now, no deployment is complete without monitoring, so we also implemented monitoring for ProxySQL. Our monitoring story uses Datadog, and this is a snapshot of the Datadog dashboard we use for ProxySQL. We monitor key metrics like network traffic, the number of CPU cores it is using, the amount of memory it is using, the number of client connections it is handling per node, the number of connections in use, and the number of free connections in the pool. That tells you whether you are in good shape: do you have enough connections to handle a spike in connection traffic? It also shows you average latency. So how do you monitor ProxySQL? ProxySQL has a database where it dumps metrics, the stats database. You need to write some custom script to scrape the metrics from it and push them to your favorite monitoring tool. This is the link to the script we use to monitor ProxySQL, and below it is the link to the wiki page that explains which metric is for what.
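A scraper along those lines can be quite small. This sketch uses the real stats_mysql_connection_pool table and column names from ProxySQL's stats schema, but the metric names and Datadog-style tags are assumptions, not Zapier's actual script:

```python
# Hedged sketch of a ProxySQL stats scraper. ProxySQL's admin interface is
# a MySQL-protocol server (port 6032 by default) exposing stats tables such
# as stats_mysql_connection_pool; a cron-style script queries it and ships
# the numbers to a monitoring tool.

def rows_to_metrics(rows):
    """Convert (hostgroup, srv_host, ConnUsed, ConnFree) rows from
    stats_mysql_connection_pool into flat metric dicts for shipping."""
    metrics = []
    for hostgroup, host, used, free in rows:
        tags = ["hostgroup:%s" % hostgroup, "backend:%s" % host]
        metrics.append({"name": "proxysql.pool.conn_used",
                        "value": int(used), "tags": tags})
        metrics.append({"name": "proxysql.pool.conn_free",
                        "value": int(free), "tags": tags})
    return metrics

# In production these rows would come from a query against the admin port:
#   SELECT hostgroup, srv_host, ConnUsed, ConnFree
#   FROM stats_mysql_connection_pool;
print(rows_to_metrics([(0, "db-primary", "3", "7")]))
```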
You can go through that and figure out which metrics you might need. Now for logging. We also need logging for any production system, and the logging story with ProxySQL is a bit dicey, in that it doesn't support out-of-the-box logging integration with frameworks like Graylog or Logstash. It does basic logging to stdout, which is not that useful. ProxySQL does support query logging, which is helpful for debugging, but we have it disabled in production anyway. The query log is in a binary format, so you cannot just grep and sort through it; you need additional tools to make sense of the binary query logs. However, some folks from Zendesk are working on a pull request which will add support for JSON-based query logging, so that it can be routed to other log aggregators. For now we just rely on application logs, which tell us about operational errors coming from ProxySQL, and we have set up alerts on those so we can act whenever something happens. Now, failure is a thing that can happen anytime, anywhere, so you have to be ready for it, and ProxySQL is no exception: it has its failure modes. One kind of failure is when a ProxySQL node dies: how does the application react to that? You can get around it by setting up a highly available ProxySQL cluster, so that it is not a single point of failure and you can handle one node going down. You can also set up graceful shutdown for ProxySQL so that in-flight connections are not killed: ProxySQL gets enough time to finish the connections it is handling and then close them gracefully. Then, when you start using ProxySQL, you might see this error in the first few days: "Max connect timeout reached while reaching hostgroup". The root cause here isn't really ProxySQL: MySQL is really bad at handling connection spikes, and it takes real time to create new connections.
So whenever there's a connection spike in your application, say you have just rolled ProxySQL out and enabled it on all the apps, and they all start connecting at once, MySQL will choke while creating connections, and then ProxySQL will start throwing this error at you. One way around it is to increase the mysql-connect_timeout_server_max parameter, so your clients can wait longer, but I don't see that as a real fix: I would not want it above 10 seconds or so, so you can push it up to a limit, but not much. What helps more is a variable called mysql-free_connections_pct. It controls what percentage of the connections ProxySQL has opened it keeps free in the pool, so it can hand one to the app immediately whenever there is a need. We set it to a fairly high value, because we want our ProxySQL to be ready to handle new connections. You can also ramp traffic up to ProxySQL slowly, or have a script that warms ProxySQL up, so that it has enough connections in the pool before you route application traffic to it. Then there's a common failure you may also face with ProxySQL: "MySQL server has gone away". It means an application established a long-lived connection to the backend, the connection died in between, the application didn't notice, tried to run a query anyway, and got this operational error. What can you do? ProxySQL already has an efficient connection pool, so there's no real need for connection pooling in the application on top of it. You can either disable connection pooling in your application, or keep your database connections alive for only a short time, say 15 seconds. For us, 15 seconds works very well.
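For reference, runtime variables like the free-connections percentage are changed through ProxySQL's admin interface. A sketch, where the variable name is real but the value 80 is an assumed setting, not necessarily ours:

```sql
-- Connect to the ProxySQL admin interface (port 6032 by default), then:
UPDATE global_variables
   SET variable_value = '80'
 WHERE variable_name = 'mysql-free_connections_pct';

LOAD MYSQL VARIABLES TO RUNTIME;  -- apply live, no restart needed
SAVE MYSQL VARIABLES TO DISK;     -- persist across restarts
```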
And you can shut down or rotate ProxySQL more gracefully: when a shutdown signal comes, rather than killing ProxySQL immediately, you handle the signal, wait for some time (longer than the lifetime of your database connections), let ProxySQL gracefully close all its connections, and only then stop the process. Then there's another error we faced when we went live in production just last week: "Too many connections". It happens when more connections come to ProxySQL (and this applies to plain MySQL as well) than a limit you have set, the max_connections parameter. If the number of client connections exceeds it, or gets close, you start getting this error. The solution is simply to bump that value up. General advice: ProxySQL mostly works out of the box, but you will want to refactor your app a bit to make it more resilient to database connection and operational failures, so that it recovers automatically without you needing to restart the app. So what does the end result look like? We had around 5x gains for writes and 10x gains for reads, and some cost savings as well: we got rid of two out of our four AWS replicas, saving around $1,500 a month. The load on our master is now lower, giving us enough legroom to handle traffic spikes during peak hours and peak seasons. This is a live snapshot pulled from our Datadog metrics dashboard. You see that latency is almost the same as before, but in the right-hand corner you see that the number of connections made to the primary DB has dropped drastically: initially that value averaged around 1.5K, and it has come down to around 300, about one fifth. And that still includes the free connections sitting idle in the pool.
The count includes both connections in use and free connections, so the real number in use is even lower, but we have over-provisioned ProxySQL anyway. For reads, we initially had four replicas handling over 1,000 connections each; now that has come down to around 400 connections in total, and again, that includes free connections lying idle. And across all these changes, the latency and the throughput of the database have remained almost the same. So we have been able to achieve the same performance while maintaining far fewer connections to the database. What is in store for ProxySQL at Zapier in the future? Currently we use Django-based routing, because that was the simple approach to take: we did not need to refactor our code base. In the future, we'd like to try ProxySQL's query-based routing, the way DBAs do it, with a central place where you configure how queries are routed to your various databases. ProxySQL also has a cool query caching feature, much more efficient than MySQL's query cache. However, it is a very simple, TTL-based system, which means that once a cache entry is set, you have no way to invalidate it. Leveraging it would need some refactoring in our app, but we want to give it a shot in the future. And so far we have not tapped into live-updating the ProxySQL config; it needs some additional tooling. In our version-one deployment we just went with immutable deployments, but going forward we want to implement live updates, so that we never need to restart ProxySQL and can keep serving traffic without any errors or downtime. These are the resources I used for my talk: the link to the ProxySQL website, the Docker image we use to run ProxySQL, the benchmark source code, and the demo Django test app we built for ProxySQL.
You can find them on GitHub. So, over to questions.

Q: Hi, you mentioned that to prevent ProxySQL from being a single point of failure you would want to cluster it. How do you cluster it?
A: You can have ProxySQL running in an ASG behind a load balancer. In our case, actually, I did not delve into the topic, but we run ProxySQL on Kubernetes; that would be another talk. You can have multiple ProxySQL instances and put an ELB in front of them. That adds some latency, but in our experience the added latency is only one to two milliseconds, which was an acceptable figure for us. The other thing you can do is DNS-based load balancing, with weighted records, but with DNS there is the trade-off of TTLs kicking in: you won't get updates immediately, they'll take some time.
Q: But then won't you lose the benefits of connection pooling? Connection pooling holds a connection to a particular server.
A: The ELB has the ability to hold connections open for some time.
Q: Okay, thank you. Quick question: you mentioned you were running it in front of AWS RDS instances. Does it actually handle failovers, like those funky RDS failovers?
A: We have not yet delved into failovers with ProxySQL, but it should be able to handle them, because with live configuration updates you just need some tooling to update the config, and it will fail over gracefully. We have not explored that yet, so I cannot say for sure, but it is a good area to explore. One minute. I just remembered: ProxySQL also has health checks, so it can do this for you. But we have not implemented that, so I was not confident enough to claim it up front.
Q: (inaudible)
A: Sorry, I cannot hear you. Yes, Kubernetes services, using a load-balancer-type Service, with a lot of hacks around that. We are not using an ELB, because I think ELB cannot work with non-HTTP traffic like this.
We tried it and it did not work well for us in the early stages; it might work now.

Q: Is it possible to configure two masters, so that one master takes the traffic and the other takes over automatically when a failover happens?
A: We are exploring that. ProxySQL works with setups like Galera clusters as well, which are multi-master, and it has the ability to write rules that send traffic to shards. We have not explored that area, but it does support features like handling shards and so on. So it works in multi-master setups, but we have not used it that way.
Q: It was not exactly multi-master I was asking about; it's similar to the earlier question. Are there health checks, so that if my primary master fails, traffic is automatically diverted?
A: Yes, it has health checks. It monitors the backends, so whenever there is an error it can switch between backends.
Q: One more follow-up question about the old master, or really any backend.
A: What it does is let the in-flight connections complete, while in parallel it starts routing new connections to the new DB. It doesn't kill the in-flight transactions.
Q: Won't that lead to inconsistency?
A: Obviously, yes, there are various challenges there. ProxySQL won't know about the inconsistency; you have to orchestrate how you want to do the failover. But it won't kill your transactions, which is what you want from the proxy.

No more questions? Then I think I'm done.

Q: One question. We saw the previous presentation, the one on Vitess. Could you have used such a solution instead of what you built?
A: Oh, we are considering it. When we started researching this last year, we were not that confident about running something stateful like Vitess, but Jitain's talk cleared up some doubts about actually running a stateful application like that on Kubernetes. So back then we set Vitess aside, but we are considering it for other things, if only as an experiment.
Q: I have a question about hardware. What type of hardware do you need to choose for ProxySQL?
A: ProxySQL is very CPU-intensive. In our setup we have six ProxySQL pods running, each using about two CPU cores, but it is very light on memory. So it is CPU-intensive and light on memory: you should pick compute-optimized instances.
Q: Have you enabled the query cache in ProxySQL in your environment?
A: We have not enabled the query cache yet, because this is our first implementation with ProxySQL; once we are more confident, we will go step by step. It is really efficient, but we still have to check how much refactoring it needs on our side. There's one question at the back as well.
Q: You said that ProxySQL will parse your query and figure out whether to send it to a replica or to the master. Now, how do you handle things like replica lag? If you're hitting a replica, how do you ensure that you're getting correct data?
A: ProxySQL will not magically route your traffic. You have to write query rules for it, and it is totally up to you to write the regexes that fit how you want traffic routed. For example, you can write rules so that SELECT ... FOR UPDATE queries go to the primary and all other SELECT queries go to the replicas.
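Rules like that last example live in ProxySQL's mysql_query_rules table. A hedged sketch, where the rule ids and the hostgroup numbering (0 for the writer, 1 for the readers) are assumptions:

```sql
-- Rules are evaluated in ascending rule_id order; apply=1 stops evaluation.
INSERT INTO mysql_query_rules
       (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT .* FOR UPDATE', 0, 1);  -- locking reads -> writer

INSERT INTO mysql_query_rules
       (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT', 1, 1);                -- plain reads -> readers

-- Anything unmatched falls back to the user's default hostgroup (the writer).
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```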
But in the case of transactions, I think it disables multiplexing altogether, so it just sends all the queries over one particular connection to the master.
Q: What about segregating reads and writes? I have been using MaxScale, and MaxScale can segregate them.
A: The documentation I read earlier may be outdated. At the time we were evaluating all these services, what I read, and I might be mistaken, was that it did not support that kind of query-rule segregation: you could not write query rules saying, route SELECT queries on this table here, and INSERT queries on that table to that particular database.
Q: You can.
A: Okay, then, good to know.