Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune.

Hi guys, welcome to another database seminar. We're excited today to have Andrey Borodin. He is a brand new engineer at Amazon, but prior to this he was at Yandex, where he built the Odyssey proxy that he's going to talk about today. In addition to this, he's also a PostgreSQL contributor and he has a PhD in computer science. So Andrey, thank you so much for being here; the floor is yours. And I ask the audience: if you have any questions for Andrey as he's going along, unmute yourself and ask at any time. This should be a conversation with him, not him talking to me over Zoom for an hour. So Andrey, the floor is yours. You have an hour. Go for it.

Thank you. Hi everyone. As mentioned, my name is Andrey. I'm a Postgres hacker, a Postgres developer, and I have been all in on open source for a long time. Today we are going to talk about Odyssey. This is a connection pooler, a Postgres connection pooler.

Here is the team behind Odyssey. Odyssey is a purely open source project; you can use it for whatever purposes you like, just don't sue us. These are not all of the contributors, only the ones I will mention here. On the left you can see Dmitry Simonenko; he designed the overall architecture. In the center is Kirill Reshke; he designed most of the hardcore features and did some performance tuning. And on the right side is me; I was primarily responsible for rolling out Odyssey in production and for handling the unpredictable stuff that happened there.

A few words about me. I have been contributing to Postgres since 2016; my first patch was committed that year, into Postgres 10. I maintain several open source projects. Some of them are stable in production, some are just pure research projects, and I know which one is which.

Odyssey was developed for Yandex services that are based on Postgres. For example, Yandex Mail serves half a billion mailboxes. I have no idea how many active users there are, since some people have more than one mailbox, but the workload is about one million requests per second to the database, and it has been quite stable for some years already. There is also Yandex Cloud, serving other services, where we have something like 10 petabytes of data. I don't know the exact figures right now, since I'm not a Yandex employee anymore, but it must be on that scale.

A typical cluster for us has three nodes: a primary node and two quorum-replication standby nodes. The standby nodes usually serve the read-only workload, while mutating transactions go through the primary node. We never do analytics on Postgres; we use logical replication to copy the data to ClickHouse or Greenplum or some other systems. And we ship write-ahead logs and backups to our S3 storage.

And let me make a bold statement, maybe too bold for me, but still: most Postgres installations under load need some connection proxy. This is a problem of Postgres that must be solved. If you're designing your own service with Postgres, you can start right away, but when the service becomes loaded, you need something that stands between your driver and the Postgres database. Why so? How did it happen that we need proxy tools? Let's check out a fresh Postgres out of the box and run pgbench. Here you can see that we run 10,000 connections.
I used a random virtual machine with eight cores and got 86 transactions per second. This is very small. And the easiest way to tune this is to just reduce concurrency. If we reduce concurrency to 1,000 active clients, we see that we now have 500 transactions per second. Going further, we get some sensible performance out of the cluster. We can tune a lot more; this is not where you should stop, this is the out-of-the-box configuration where nothing was tuned. But just reducing the concurrency brings an order of magnitude more performance.

How does this happen? This is a heritage of the Postgres architecture. One server connection, which is called a backend, is one Unix process. If, for some reason, it crashes, it does not necessarily bring down the whole database; this is a kind of isolation. But the cost it brings is that this per-connection process has its own relation cache: it knows some tables, their columns, and the types of those columns. It has its own cache of compiled PL/pgSQL statements. It has its own cache of query plans. So most caches are tied to the backend. We also fork a process for every connection, which is expensive, but let's leave that for now.

Moving on, inside the Postgres core many algorithms depend linearly on the number of active connections. For example, here is a small snippet of taking an MVCC snapshot, which just walks through the whole so-called proc array, the array of all connections, checks whether each connection is active and, if it is, analyzes its state. This array is pre-allocated at the start of the postmaster; the memory consumed by each connection is pre-allocated upon start. And many, many small algorithms depend on the size of the proc array. That's why we always want to make the proc array smaller, to make these algorithms faster (a simplified sketch of this scan follows below).

The community recognizes this as a problem. For example, in 2020 Andres Freund from Microsoft made a huge effort to make Postgres more scalable in terms of snapshot handling; you can read his blog post on making idle connections affect performance less. Let's look at a chart from his blog post. The first thing you can see here is that one Postgres box, on an optimal workload, can serve up to two million queries per second. This is a lot, actually: two million read-only transactions per second, two million snapshot acquisitions, two million index scans and heap scans, on one virtual machine with 72 virtual CPUs. Yet we see that, before Postgres 14, performance degrades even from simply adding idle connections, connections that do not do anything. To quantify things, we can say that performance with 10,000 idle connections is at least three times lower than performance with 100 server connections.

This is repeated in many blog posts. For example, here is a blog post by Percona, where they used a different benchmark tool, sysbench, which can be closer to different real workloads. And here we see that with some kind of connection pooling we get reasonable and stable performance, while without a pooler we get, in some cases, much better performance. So the basic idea of connection pooling is to reduce concurrency within the database, closer to some global optimum. And on some benchmarks we observed that this global optimum really exists. This is a benchmark from an unrelated patch, but it was run across a range of concurrent client counts, and we clearly see here that concurrency around the number of CPUs is optimal for the database. This is not always the case.
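To picture that linear dependence on the proc array: here is a simplified sketch of the scan, with made-up structure and field names rather than the real Postgres internals, showing why every snapshot costs work proportional to max_connections even when most slots are idle.

```c
/* Simplified sketch of a snapshot scan over a preallocated "proc array".
 * Structure and field names are illustrative, not the real Postgres internals. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     active;    /* is a backend using this slot? */
    uint32_t xid;       /* transaction id the backend runs, 0 if none */
} ProcSlot;

typedef struct {
    uint32_t xmin;           /* oldest xid still considered running */
    uint32_t running[1024];  /* xids of in-progress transactions */
    int      nrunning;
} Snapshot;

/* Every snapshot walks the whole array, so the cost grows with
 * max_connections even if almost all slots are idle. */
static void take_snapshot(const ProcSlot *procs, int max_connections, Snapshot *snap)
{
    snap->xmin = UINT32_MAX;
    snap->nrunning = 0;
    for (int i = 0; i < max_connections; i++) {
        if (!procs[i].active || procs[i].xid == 0)
            continue;
        if (procs[i].xid < snap->xmin)
            snap->xmin = procs[i].xid;
        if (snap->nrunning < 1024)
            snap->running[snap->nrunning++] = procs[i].xid;
    }
}

int main(void)
{
    int max_connections = 10000;  /* the whole array is preallocated at postmaster start */
    ProcSlot *procs = calloc(max_connections, sizeof(ProcSlot));
    procs[3].active = true;
    procs[3].xid = 42;            /* a single busy backend */

    Snapshot snap;
    take_snapshot(procs, max_connections, &snap);
    printf("xmin=%u, running=%d (scanned %d slots)\n",
           (unsigned) snap.xmin, snap.nrunning, max_connections);
    free(procs);
    return 0;
}
```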
Benchmarks with more than 100 CPUs show that for Postgres the optimum is still somewhere below 100 active concurrent queries, active concurrent transactions. But things change when we move from an in-memory workload to a disk-based workload. Then this chart shifts to the right, and the optimum is usually reached around 300 to 400 concurrent backends. I think this is connected to the queue depth of modern storage devices, but it still looks the same: there is some optimum and then a decline in performance.

This is just pure benchmark stuff. Where things get real is that there is no such thing as a homogeneous workload. Usually you have different microservices asking the database different queries, and this is a kind of isolation for your applications. There is some service that is not very important, which you limit to about 10 concurrent queries, and then you have a big business-critical service, which is super important for your business, and you limit it to 200 concurrent queries. Why do we usually do this? Because one small microservice should not be able to kill the entire production database and eat all its resources, all its database quota, and we don't have any other mechanism to quota the performance of individual microservices. So with this isolation we protect microservices from each other. And these limits are still DBA-driven: you need some smart DBA who knows which service is critical and which is not, that this service will perform better with 50 concurrent connections, that this other service is batch and needs 500 concurrent connections, and that they don't interfere with each other's working set. But that's how it happens today; an automatic solution for choosing the number of actively concurrent queries is yet to be found.

Typically, if you use things out of the box, you get error-based concurrency control. What does that mean? When the application queries a database, it first obtains a connection. On the lucky path it will get it from an application-side pool; if not, it will have to establish a new connection to the database. Then it runs the query, commits, and if something failed, it needs a back-off period to wait, and then it retries from the beginning: acquire a new connection and rerun the query. Any OLTP application must be prepared to retry both the connection and the query (a small sketch of such retry logic follows below). Small interruptions of network connectivity happen here and there, and any application that does not retry its queries to the database will be very sensitive to transient network outages. That's where this error-based concurrency limiting is designed in, but it doesn't work well in practice, because developers usually just forget to add the retry.

Instead, we can allocate some connections in advance and limit the number of concurrently used connections. The easiest way is to just use application-side pooling: use a driver which has its own connection pool on the client side. Another option is to use some proxy, which multiplexes lots of client connections onto a small number of server connections. And finally, there is ongoing work on built-in pooling. For example, in MySQL there is an enterprise version with a connection pool built in. In Postgres we wanted this too, but for the last two years the work has not been very active, and the advancement has come from scaling regular connections rather than from built-in connection pooling.

If you have client-side... yes?

What are the numbers one, two, four, seven? Did those mean anything there, or...?

No, those are just the bitwise combinations of the three options.

Yes, I got it.
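Going back to the retry point for a moment, here is a minimal sketch of what retrying both the connection and the query can look like with libpq. The connection string, the query, and the back-off constants are placeholders, not anything prescribed by the talk.

```c
/* Minimal retry-with-back-off sketch for an OLTP client using libpq.
 * Connection string, query, and back-off constants are placeholders. */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

#define MAX_ATTEMPTS 5

static int run_with_retry(const char *conninfo, const char *query)
{
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        PGconn *conn = PQconnectdb(conninfo);        /* (re)acquire a connection */
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            sleep(attempt);                          /* back off, then retry */
            continue;
        }

        PGresult *res = PQexec(conn, query);
        ExecStatusType st = PQresultStatus(res);
        if (st == PGRES_TUPLES_OK || st == PGRES_COMMAND_OK) {
            PQclear(res);
            PQfinish(conn);
            return 0;                                /* success */
        }

        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        sleep(attempt);                              /* back off, then rerun the query */
    }
    return -1;                                       /* gave up after MAX_ATTEMPTS */
}

int main(void)
{
    return run_with_retry("host=localhost dbname=postgres", "SELECT 1");
}
```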
In practice you usually have several layers of proxies, and you still have to use application-side pooling, and some users use built-in pooling; for example, if you are using the proprietary Postgres Pro, you have built-in pooling already there. And in some cases you have all three kinds of pooling at the same time.

Got it. I mean, maybe there's a way to... actually, yeah, you're going to go through it all, keep going, sorry. No problem.

If you have application-side pooling, everything just works. This is the first step when you encounter problems with database connectivity: you just enable your application-side pooling and everything continues to scale well, or well enough. But at some point you see that the application backend is now a single point of failure. It's stateless, so you can simply add more backends. The same thing happens when you have microservices: each microservice has a number of backends, each backend has a number of connection pools, each pool has a number of connections. Then you understand that one data center is still a point of failure, and you go to multiple availability zones. Now you have a lot of scaled-out stateless backends. Stateless backends create a lot of idle connections that are only occasionally active, but they can still affect performance; their sole purpose is to be prepared to switch the workload from one availability zone to another. And finally, when you have sharded Postgres, you have some application-side sharding which decides which shard to go to, and you can end up with thousands and tens of thousands of server connections. In Yandex Mail, one shard currently handles up to 30,000 concurrent inbound TCP connections.

Let's get back to the slide where I was establishing 10,000 connections to the local Postgres. You can see that establishing those 10,000 connections took 15 seconds, 15 CPU seconds. This is a lot, and this is on eight cores. So if we have 30,000 inbound connections and some network switch is rebooted, we have to re-establish 30,000 new connections, and that costs us simply 10 minutes of CPU time due to a small network glitch. This is kind of frustrating. And we haven't even encrypted anything yet.

What? How often does that happen? How can a network connection be broken? In your experience at Yandex, was it like once a minute, once an hour, once a day?

Switches are upgraded, and when their firmware is upgraded, this can happen quite often. Firmware was upgraded something like once a month; I think there is some certification process that requires the firmware to be upgraded every four weeks or so. When the firmware is upgraded, the switch is rebooted. Usually it is rebooted in waves, so it's not that the whole data center goes down and then has to restore all its connections, but a single database still suffers from these network connectivity problems. There are also other sources of breakage on the way from a client to the server. Sometimes we have a connection between availability zones that goes through the external internet, and the external internet is simply unpredictable: someone can just unplug a link and plug it back in, and all the connections are gone.

What is the cost of having one server connection? First of all, when you're going through the open internet, you have to have an encrypted connection, and the TLS handshake costs up to 100 milliseconds of cryptography. If you have very good CPU cores, you can reduce it to maybe 15 or even 10 milliseconds of cryptographic work, but it's still a lot of CPU time.
Next, when you connect, Postgres has to call fork, with subsequent page faults in the newly forked backend, and yes, this can consume up to 100 milliseconds of CPU time again. This will be mostly CPU stalled on page faults or in system calls, but it will still cost you some performance. Next, having a server connection adds a small epsilon of cost to each transaction, because the internal Postgres structures become bigger. And finally, there is a very cache-unfriendly copy of the system catalog in every single backend, which hurts when the number of backends is in the thousands.

What can we do about it? The basic idea is that the Postgres protocol is very simple. It is a message protocol: messages fly from the frontend to the backend and vice versa. You issue a simple query and get back some data rows and an invitation for the next query, which is called ReadyForQuery. With pipelining you can even send queries before getting ReadyForQuery: you can issue several queries and get the results interleaved with ReadyForQuery messages. This ReadyForQuery message contains one bit: are you in a transaction or are you not. If you are not in a transaction, the pooler is free to use the server connection to serve some other client's workload. And this small bit is what enables all the proxy poolers that exist right now (a small sketch of this message follows below).

There is a whole zoo of different proxy poolers. PgPool-II is a good tool, but it tries to solve too many different problems at once, and attacking many problems didn't end well in our benchmarks. Crunchy Proxy is a not very commonly used proxy written in Go, and it's easy to study what's inside it. PgBouncer is the de facto standard: most cloud providers more or less give you PgBouncer-based Postgres clusters. There are also some newcomers like PgCat and SPQR, and I will touch on them at the end.

When choosing a pooler in 2015, we decided that PgBouncer was the best for us. That was seven years ago. But PgBouncer was not perfect; we saw a lot of problems. First, you need to teach your developers not to use stuff that survives across transaction boundaries. For example, temporary tables and advisory locks do not survive it. You also have session parameters like timeouts and locale parameters such as date format, currency, and so on. If you change, say, the decimal separator (in Western cultures you use a dot, in some Eastern cultures you use a comma to separate the fraction from the integer part) in transaction pooling mode, someone else gets your server connection with an unexpectedly changed decimal separator. All of this is solvable, and session parameters can mostly be handled, yet temporary tables, advisory locks, and prepared statements remain a problem for transaction pooling. This is the main reason why not every installation uses PgBouncer.

PgBouncer is open source, which is cool, but it's hard to find someone to pay for advancing it, and for many years it was not very actively developed. Recently, Peter Eisentraut and Jelte Fennema have done a good job moving PgBouncer forward, but at the time we started Odyssey, we saw that no new versions were coming out. A PgBouncer connection still costs us some CPU: when we connect to PgBouncer, we have to do a TLS handshake, which is not free. And PgBouncer processes connections in first-in-first-out order, while clients impose a three-second connection timeout.
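To make the earlier "one bit in ReadyForQuery" point concrete: the message is a 'Z' type byte, a length, and a single status byte. Here is a minimal sketch, not Odyssey's actual code, of how a transaction pooler could decide from that byte whether a server connection can be handed to another client.

```c
/* Sketch of inspecting the ReadyForQuery ('Z') message that a Postgres
 * server sends after each query cycle. Not Odyssey's actual code. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <arpa/inet.h>

/* Wire format: 'Z' | int32 length (= 5) | one status byte:
 *   'I' = idle (not in a transaction)
 *   'T' = inside a transaction block
 *   'E' = inside a failed transaction block */
static bool server_is_detachable(const uint8_t *msg, size_t len)
{
    uint32_t body_len;

    if (len < 6 || msg[0] != 'Z')
        return false;
    memcpy(&body_len, msg + 1, 4);
    if (ntohl(body_len) != 5)
        return false;
    /* Only an idle server connection may be handed to another client. */
    return msg[5] == 'I';
}

int main(void)
{
    const uint8_t idle[]  = { 'Z', 0, 0, 0, 5, 'I' };
    const uint8_t in_tx[] = { 'Z', 0, 0, 0, 5, 'T' };

    printf("idle: detachable=%d\n", server_is_detachable(idle, sizeof idle));
    printf("in transaction: detachable=%d\n", server_is_detachable(in_tx, sizeof in_tx));
    return 0;
}
```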
What do this FIFO processing and the three-second timeout mean in practice? When your network switch is rebooted, from time to time you see a wave of incoming TLS connections. They are limited by the three-second timeout and processed in order of arrival, and it can happen that nobody manages to get in at all. Here are some monitoring charts from an incident: when clients connect to the server, we see that all the CPU is consumed by PgBouncer, and this CPU is purely cryptographic work, which we cannot optimize; it's already quite well optimized in OpenSSL. The correct solution for this problem is that when we are above 100% utilization, we must invert the priorities: we must limit the TLS handshakes and process first the clients that arrived last, because they have a better chance of surviving the connection timeout. But for some reason we chose another option. We decided that we just needed more PgBouncers. We built a high-availability proxy which proxies traffic through many PgBouncers and then on to Postgres. This solution didn't work well, because our proxy did not understand the protocol well, and it also sometimes used too many PgBouncer connections.

The next solution we used was port reuse, which, by the way, has since been committed into PgBouncer, a year or so ago. You just start many PgBouncers on the same TCP port, and the operating system is responsible for handing out new TCP connections to the different processes (the socket option behind this is sketched below). It worked quite well, but if one of the PgBouncers is overloaded with something, it does not accept new TCP connections. So it may happen that one PgBouncer accepts a lot more connections than the others, because the others were thinking about something or were busy when a new wave of TCP connections arrived.

We worked with this setup for quite a long time, but at some point we understood that we needed too many PgBouncers, and three PgBouncers were just not enough. We used a two-level cascading system, with an external array of PgBouncer processes, an internal array of PgBouncer processes, and only then Postgres. And in some cases even that was not enough: as you can see here, 16 PgBouncer processes saturate their CPUs completely. I've heard that Skype or some other big tech company was running three layers of PgBouncers, but we decided that enough was enough; we had to create new software. Cascades of PgBouncers were really hard to maintain.

There were other problems too. For example, if PgBouncer is running some heavy workload against the database and for some reason all client connections are lost, PgBouncer will wait until every query has actually finished executing. There was a pull request to solve this problem, to just issue cancels for those queries, and this pull request was ignored for some years; that was another reason to just start a new thing. And finally, we had a long list of features that we wanted. Basically, we were building a cloud solution, and PgBouncer's configs didn't fit the cloud requirements. This is how Odyssey started. These are all the problems that led us to building Odyssey, and all these problems should now be solved in Odyssey. Well, not all of them, actually.

From a bird's-eye overview, we support two platforms: Linux on both amd64 and arm64. Odyssey is a pure C project; we do not use any C++. We have almost no dependencies; nearly all the code is handwritten. We do now depend on Postgres itself, because we take SCRAM authentication from Postgres; we don't like homemade cryptography.
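Mechanically, the "many poolers on one TCP port" trick mentioned above comes down to the SO_REUSEPORT socket option on Linux. A minimal listener sketch, with error handling trimmed and an arbitrary port number, that several independent processes could run at the same time:

```c
/* Minimal SO_REUSEPORT listener sketch (Linux): several independent processes
 * can bind the same port and the kernel spreads incoming connections among them. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* The key bit: allow other pooler processes to bind the very same port. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(6432);              /* arbitrary pooler port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof addr) != 0) {
        perror("bind");
        return 1;
    }
    listen(fd, 128);
    printf("pid %d accepting on port 6432\n", (int) getpid());

    for (;;) {
        int client = accept(fd, NULL, NULL);  /* the kernel picks which process gets it */
        if (client >= 0)
            close(client);                    /* a real pooler would serve it instead */
    }
}
```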
There is also a small portion of assembler code, which is why we are not 100% C; I will touch on this later. Odyssey contains two main building blocks: one is Machinarium and the other is Kiwi.

What is Machinarium? Machinarium is a cooperative concurrency framework. Why did it appear? If you look at the code of an event-driven pooler like PgBouncer, you see that it's basically a state machine. It remembers the state of the client, the state of the server connection, and has a table of what to do when the next byte arrives. For example, if we don't have a server bound to the client and the next byte of a client connection arrives, we must send this byte to someone, so we acquire a server connection; if there are no free server connections, we open a new one, record in our state that client data has arrived and a server connection is being opened, and then sleep until the next I/O happens, until the next byte from the server arrives. In Machinarium, by contrast, the client flow for a connection is described sequentially. The state does not have to be remembered explicitly and control does not have to be returned until the next I/O event: the code just calls machine sleep or a machine I/O call, which tells Machinarium, when new bytes arrive, wake me up at the same place in the code where I was before. That's why we have the assembler code, which is called a context swap: it basically saves the current stack of the client connection's coroutine and swaps the call stack with the Machinarium system call stack. Machinarium lets a worker thread wake up its coroutines without using system calls: we avoid context switches through syscalls, and we avoid using separate processes to do multitasking.

On top of Machinarium we have the Kiwi library. Kiwi is basically a library to format Postgres protocol messages. For example, if you want to start up a connection, that is a specific bunch of bytes; to compose those bytes from parameters, we use the Kiwi library, which does just that (a small sketch of such a message follows below).

The overall architecture using these libraries looks like this. We have a system thread which governs the work of the other threads. It calls TCP accept, and after accepting a TCP connection, before starting any TLS handshake, it binds the client connection to one of what are called machines, the worker threads. That worker subscribes to I/O events using epoll and, in response, wakes up the execution of the frontend code for each client. The main thread is also responsible for retiring old server connections and for running queries on the console database.

Some words about multitasking. Basically, each operating system thread, which is called a machine in Machinarium, has its own set of file descriptors that it is responsible for. It doesn't sound very clear... okay, let me keep going. We do the accepting in a separate thread. Why? Because we are trying to solve a problem that we observed in PgBouncer: we must distribute new connections evenly among all of our machines. Unfortunately, we don't know how loaded each machine is. So, for example, if you follow a pattern like open four connections, close three, open four, close three, and so on, and you have four machines, all the traffic can still end up going through the same machine. It's not solved yet, but we have not observed a problem with this in production. We also try hard to pack many small packets that arrive one after another into one big packet, because the protocol allows it.
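As an aside on what Kiwi-level message formatting amounts to: every frontend message is a typed, length-prefixed bunch of bytes. Here is a hand-rolled sketch of composing a simple Query message; this is not the actual Kiwi API, just the wire layout.

```c
/* Hand-rolled sketch of composing a Postgres frontend Query ('Q') message,
 * roughly the kind of work a message-formatting library like Kiwi does for
 * every protocol message. Not the actual Kiwi API. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>

/* Layout: 'Q' | int32 length (includes itself and the trailing NUL) | query\0 */
static uint8_t *make_query_message(const char *sql, size_t *out_len)
{
    uint32_t body_len = 4 + (uint32_t) strlen(sql) + 1;
    uint32_t be_len = htonl(body_len);
    uint8_t *msg = malloc(1 + body_len);

    msg[0] = 'Q';
    memcpy(msg + 1, &be_len, 4);
    memcpy(msg + 5, sql, strlen(sql) + 1);   /* copy the terminating NUL too */

    *out_len = 1 + body_len;
    return msg;
}

int main(void)
{
    size_t len;
    uint8_t *msg = make_query_message("SELECT 1;", &len);
    uint32_t body_len;

    memcpy(&body_len, msg + 1, 4);
    printf("composed %zu bytes, type '%c', body length %u\n",
           len, msg[0], ntohl(body_len));
    free(msg);
    return 0;
}
```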
The coalescing created some problems with a driver that did not expect that many messages would arrive together in one packet, but that's not our problem; we are simply adhering to the Postgres protocol. We also try to connect a client back to the same server connection whenever possible; if that server connection is busy, we have to give another server connection to the client.

We also addressed the problem of cancelling queries, and I'm not sure whether cancelling queries is a feature at all, because sometimes you are cancelling queries that are just about to succeed. And a cancel makes Postgres create yet another fork from the postmaster, so if you have to cancel a lot of small queries, you can trigger a whole new wave of forks just to accept cancel requests on the Postgres side. Cancellation sometimes saves you from incidents, but sometimes it creates new incidents, and you never know which one is more disastrous for a workload.

Physical replication and logical replication pass seamlessly through Odyssey. And we are 100% compatible with the administrative tools of PgBouncer, like SHOW SERVERS, SHOW POOLS, SHOW STATS. We show the same things, except that instead of average queries per second we report 99th-percentile query timings and other percentiles.

We also have somewhat enhanced transaction pooling, and essentially it means that we support protocol-level prepared statements. A typical workload executes a query with the following packets: you Parse the query, then you Bind the statement to some parameters, then you ask the server to Describe what the result will look like, then Execute, Close the open portal, and Sync, meaning tell me all the results that you have. Planning happens when you parse the query; there are different kinds of plans, but generic plans happen at parse time. Then you can just send different parameters to the same prepared statement. This saves you the time to parse the query, but it's not only about parsing time: the Parse actually verifies your query against the system catalog. It checks that the operators you use exist, it checks that the table columns are correct as of your snapshot, it checks that the system catalog was not invalidated just before and that indexes were not concurrently dropped. So this Parse is usually something like half of the query's time.

Andrey, you got distorted. You said it's half the workload time and then I missed what you said after that.

That was about Parse. Prepared statements survive transaction boundaries: when the transaction has ended, they still live. That's why prepared statements are not compatible with transaction pooling. If you run pgbench, the default benchmarking tool, against PgBouncer in prepared statement mode, it will just work, because every single connection prepares the same set of prepared statements and calls them by the same names. Real drivers do not do that, and the problem is that different SQL statements can have the same name on different server connections: one client prepares X as SELECT 1, another client prepares X as SELECT 2, and the first client gets the second server connection and gets two instead of one, which is an error. To solve this problem, we have two hash tables. One hash table translates the client's prepared statement name to the SQL text; the other hash table translates the SQL text, by its hash, to the server-side prepared statement.
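A tiny sketch of that double mapping, with plain linear lookups standing in for the real hash tables and made-up names, just to show the idea: the client-visible statement name maps to canonical SQL, and the canonical SQL maps to whatever statement was prepared on the particular server connection.

```c
/* Sketch of the two-level prepared-statement mapping. Plain linear lookups
 * stand in for real hash tables; all names are made up. */
#include <stdio.h>
#include <string.h>

#define MAX 16

typedef struct { char name[32]; char sql[128]; } Stmt;
typedef struct { Stmt stmt[MAX]; int n; } ClientMap;     /* one per client */
typedef struct { char sql[128]; char name[32]; } Prepared;
typedef struct { Prepared p[MAX]; int n; } ServerMap;    /* one per server connection */

/* Client name -> canonical SQL text. */
static const char *lookup_sql(const ClientMap *c, const char *client_name)
{
    for (int i = 0; i < c->n; i++)
        if (strcmp(c->stmt[i].name, client_name) == 0)
            return c->stmt[i].sql;
    return NULL;
}

/* Canonical SQL -> statement name on this server connection. */
static const char *ensure_prepared(ServerMap *s, const char *sql)
{
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->p[i].sql, sql) == 0)
            return s->p[i].name;         /* already prepared here, just reuse it */
    /* This is the point where the pooler would actually send Parse to the server. */
    snprintf(s->p[s->n].sql, sizeof s->p[s->n].sql, "%s", sql);
    snprintf(s->p[s->n].name, sizeof s->p[s->n].name, "stmt_%d", s->n);
    return s->p[s->n++].name;
}

int main(void)
{
    ClientMap a = { .stmt = { { "X", "SELECT 1" } }, .n = 1 };
    ClientMap b = { .stmt = { { "X", "SELECT 2" } }, .n = 1 };  /* same name, other SQL */
    ServerMap srv = { .n = 0 };

    /* Both clients call their statement "X", yet each resolves to its own SQL
     * on the shared server connection. */
    printf("client A: X -> %s\n", ensure_prepared(&srv, lookup_sql(&a, "X")));
    printf("client B: X -> %s\n", ensure_prepared(&srv, lookup_sql(&b, "X")));
    return 0;
}
```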
When the client sends a Parse for some SQL statement, Odyssey just doesn't do anything: we remember that this statement was requested, but we do not actually prepare it anywhere. When the client tells us, okay, it's time to bind the prepared statement to parameters, we take a server connection, and if the statement is not prepared there, we prepare it at that moment. This was a change in the logic of Odyssey, because before this we were mostly just forwarding the same bytes that we received from the client, and now we are rewriting the protocol to squeeze some more performance out of the same machines.

The problems of this approach, which is currently in production for some services, are that server-side prepared statements are allocated in the server connection forever. Fortunately, we don't allow a server connection to live forever: to avoid cache bloat, Odyssey closes server connections after an hour or so, so if you have a flood of prepared statements, they will decay within about an hour. Also, mapping the client statement to a canonical hash and from the canonical hash to the server-side statement consumes memory, and there is not much we can do about that. And the hashing takes some time, but that time is still much less than planning the query.

One of the last features we added to Odyssey is a simple replica-lag watchdog. As you may know, read replicas of Postgres can show you stale data. It's a tricky question, because sometimes they show you data from the future, in the sense that a replica can have some data that, from the client's point of view, hasn't happened yet, but typically they show data from too far in the past. The usual way to decide where you go is to define many addresses in the connection string and specify the level of your connection: you can say any node, or primary node, or read-write node, where primary and read-write are the same thing. But sometimes you have a stale, lagging replica. For example, it was disconnected from the cluster for some time by a network partition, or the virtual machine was rebooted and gone for a while, or you have a RAID array that is not in good shape right now, for example one disk is broken and it's rebuilding the array. A lagging Postgres replica happens now and then, and you don't want to observe too-old data. For most applications it's okay to have something like one second of lag, or a few hundred milliseconds, but having a lag of many seconds is usually very undesirable for an application. So in Odyssey you can specify a watchdog query, and if this watchdog query returns a value that is not acceptable to the user, the replica is not used. For example, here we say that we are okay with 10 seconds of lag and we can wait for 10 seconds for the replica to catch up; if that doesn't work, the client will retry, and the retry will land on another replica. This lets you avoid overly stale data, but it's not a real solution for consistency, as you understand: you still observe some stale data, just not that stale, and for some applications that's just okay.

This is the current production version. We are developing some new functionality, but this part is working properly. The latest version was developed mainly by Kirill; usually he looks more like this, and I'm just trying to keep up with reviewing all the code he implements every day. This lets us move fast despite being a five-year-old project. We still have some questions that we don't know how to solve, and the most important one for us is finding the optimal concurrency.
When you create a cluster in Yandex Cloud, you have to specify some magic number for each user: the number of concurrent connections. And how do you know this number? I have no idea. We try to handle it in a way where we do not allocate all the connections simultaneously. Here is just one part of the problem: should we open a new connection now, or should we wait for some time, because maybe a server connection will become available for the query soon? Say the magic number from the user is 1,000 server connections as the limit. The first half of this, 500, we allocate in parallel: a new query wants a new server connection, we just fire off a new server connection. This allows us to ramp up quickly and get new server connections when they are necessary. But when we pass the middle of this magic number, we start to delay requests for new server connections. And close to the magic number itself, we mostly wait for connections that are freed and become available for queries (a toy sketch of this decision follows below). But it's still all on the user, and the user has to solve this multi-dimensional optimization problem for their microservice.

Another problem that we want to solve in Odyssey one day, but haven't solved yet, is read-after-write consistency. The idea is very simple: when we write to the primary Postgres node, we can get the LSN of our commit, the log sequence number in the write-ahead log at which our transaction was committed. And when we get a query from a client, the client hands us this LSN, and if the standby node has replayed past this LSN, it can serve the query and the client sees its own changes. I would say this is not even work in progress, it's an idea in progress: we had a lot of discussion, but so far we have not implemented it.

Does the Postgres wire protocol you support have read-write tokens?

Why does the Postgres protocol not support it?

I'm asking, does it?

I don't know... the basic problem is that no one has implemented it. It would be just a new type of message, yes? The Postgres wire protocol currently does not support read-write tokens.

Got it.

So far the protocol does not support it. It's quite easy to add to the protocol, but it will take some time to discuss it with the community. You just need one extra message before returning ReadyForQuery. There is already a CommandComplete message, and you could follow CommandComplete with the commit LSN. With the LSN of your transaction in hand, you can pass it back and then read from some possibly stale, but not too stale, node instead of the primary.

Another thing that we simply don't have is protocol compression. Currently the Postgres protocol is not compressed, mainly because we are afraid of CRIME-style attacks. There is currently a thread on the Postgres hackers list with an implementation of protocol compression, and in fact Odyssey already supports this protocol. So if this protocol is ever adopted, we already have code that compresses the data between client and server and also between primary and standby; it cuts the traffic from a client to the server roughly three times. But to actually use this code, which is already implemented and tested, we need to agree with the community on what the protocol is.

There is another attempt, to do sharding via lightweight proxies. If you look at the currently existing sharding solutions for Postgres, you will find that they are based on a custom plan node which decides which shard to route to; for example, Citus works this way.
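Coming back to the ramp-up policy described a moment ago, here is a toy sketch of the open-now, delay, or wait decision. The midpoint threshold and the output labels are placeholders for illustration, not Odyssey's actual tuning.

```c
/* Toy sketch of the "open now, delay, or wait" decision for new server
 * connections. Thresholds and labels are placeholders, not Odyssey's tuning. */
#include <stdio.h>

typedef enum { OPEN_NOW, OPEN_AFTER_DELAY, WAIT_FOR_FREE } PoolDecision;

static PoolDecision decide(int open_connections, int limit)
{
    if (open_connections >= limit)
        return WAIT_FOR_FREE;        /* at the cap: only reuse freed connections */
    if (open_connections < limit / 2)
        return OPEN_NOW;             /* ramp up quickly while far from the cap */
    return OPEN_AFTER_DELAY;         /* past the midpoint: maybe one frees up soon */
}

int main(void)
{
    static const char *names[] = { "open now", "open after delay", "wait for free" };
    int limit = 1000;                /* the user-supplied "magic number" */
    int samples[] = { 10, 499, 500, 900, 1000 };

    for (int i = 0; i < 5; i++)
        printf("%4d/%d -> %s\n", samples[i], limit, names[decide(samples[i], limit)]);
    return 0;
}
```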
For such routing, a substantial part of the time is spent checking the query against the system catalog. For example, in TPC-C benchmarks you create temporary tables, and these temporary tables invalidate the relation caches and catalog caches of other backends. That's why Postgres does not scale well on pure TPC-C if you go through a router that must validate the query against the system catalog. But a router which only routes the query to a single shard can be completely stateless, and in that regard it very much resembles Odyssey. Odyssey is written in low-level C, and eventually we will do sharding in Odyssey too, but right now this is just too much dirty prototyping, and we do not know exactly how we want it to operate. That's why we created a copy of Odyssey in Go, which is called SPQR. Surprisingly, it does not work much slower than the C version of Odyssey. The main difference is that SPQR has a routing part: you can specify many servers associated with different ranges of some sharding key. But today it's only at the prototyping stage, it's not in production anywhere, we are experimenting here and there, and maybe things will change.

There is also one concurrent project which is doing exactly the same thing, but in Rust. If you don't like Go and want a lightweight sharding proxy, you can look at PgCat. It's an absolutely brilliant project and it's being developed really fast by its team. They supported only explicit sharding at first, where you have to say which shard you want the pooler to connect you to, but now, maybe a week ago or so, they implemented automatic sharding. They are really fast; I think they didn't even cut a new release for this feature. But they are mostly focused on the failover part of the proxy.

What is the problem with failover? Usually, when you want to connect to the primary node of a Postgres cluster, you ping each node in order, with a connection timeout, and this is done by every client connection. Using a proxy here can save you the time of repeating the same work over and over: when the proxy sees that one node is down, it does not iterate over all the servers again, it just knows that the server is down for now, and when it comes back, it will return to the pool. But the connection from the client to the proxy is still an ordinary connection, and it can be down too, and if the proxy goes down, clients still have to iterate through the different addresses of the proxy.

The community, not me, ran some benchmarks of PgBouncer, SPQR, and PgCat. For some reason they didn't benchmark Odyssey; I don't know why. The main takeaway from those benchmarks was that this lightweight routing does not add more than one millisecond of latency overhead. So routing can consume up to one millisecond; that's where we lose some performance. For example, if we go back to the Percona benchmark, here it is, we see that a direct Postgres connection can work better than a proxied connection on some workloads, but this is the performance when you have the client and the server basically on the same machine, so you don't lose anything on network latency. Sorry for the slide jumps.

Key takeaways. It's all about performance. PgBouncer, Odyssey, and the other projects exist for one reason: avoiding unexpected performance degradation. It doesn't even show up on TPC-C benchmarks, which tend to avoid waves of homogeneous peaks hitting the database. But most production Postgres databases currently use proxies. The basic idea of a proxy is to pool a large number of client connections onto a small number of server connections.
This problem must be solved in core one day. Today the Postgres community understands this quite well, and many bright people like Andres are working on the problem. But so far, in Postgres core, the problem persists, and I have no idea when it will be fixed. One day we should not need poolers at all; it should just be servers that work, clusters that just work, without any proxies. But for now we have to use them. That was all I wanted to say, and I'd be happy to answer some questions.

Thank you for doing this talk. We have one minute left for one question from the audience. Matt, go for it.

Thanks for the talk, Andrey. Honestly, it's pretty sophisticated, with the user-space coroutines and stuff like that. Did you look at anything like DPDK, like kernel-bypass stuff, or something different like the new io_uring support for sockets? I mean, that's bleeding-edge Linux kernel stuff, but would that help with Odyssey's design, or no?

DPDK? Could you repeat what you said about DPDK? Sorry, I didn't understand the question. Matt, can you repeat the question?

Did you guys look at anything like DPDK with Odyssey, or, more recently, are you looking at io_uring with socket support? But that's pretty bleeding-edge kernel.

Yes. We didn't look into io_uring. We were thinking about moving closer to bypassing the kernel, but our team is just too small. This is how Kirill looked after thinking about DPDK; he follows all the new stuff. But we looked at some benchmarks of kTLS, TLS in the kernel, and understood that it won't be more than twice as fast. The thing is, Odyssey, when it's tuned, is not the bottleneck of the system; the system is mostly capped by the throughput of the database itself. The pooler is just relaying bytes. This is also the reason why PgBouncer is still not multi-threaded: the basic idea is that it's just relaying bytes from here to there, so you should not need too much CPU power, and these context switches should not be the bottleneck. Yet at some point I think we will still have to look into io_uring, or maybe DPDK, but from my point of view DPDK would be a bigger time investment with unclear benefits. Also, when you're doing this kind of thing, you always want to attract ICPC students who want to write some cool code; it should work as a marketing tool to attract open source developers who want to write cool code, and then you ask them to implement some production features. It's somewhat unrelated, but that idea exists.

My last question would be: I'm assuming at Yandex you guys never put the proxy on the same box, right? You could do Unix sockets instead of going over TCP.

Here is the Postgres node of the cluster, and the orange frame is actually the connection pooler. We do run a connection pooler on the same node as the database, and we use it as a fence for high availability too: when we are unsure whether we should accept connections or not, we take the connection pooler down, thus ensuring that no client will reach the database. And it's fast to kill the Odyssey process if for some reason we need to do some maintenance work for high availability. So we usually run it on the same machine as the database, not on the client side.