Good morning everyone. I was hoping for a slightly larger crowd than this, but whatever it is — thank you, thank you very much. So, we will be talking a lot about logical replication in Postgres, multi-master replication, why we need multi-master replication in Postgres, and then some of the replication nuances like conflict resolution and so on. Let us delve into this. So why do we need logical replication? For the folks who know the replication methodologies available in Postgres: there is physical streaming replication, there is logical trigger-based replication which is kind of a legacy approach, and there are other asynchronous replication technologies coming into the mainstream. Native logical replication has been part and parcel of Postgres since PG 10. One of the use cases where a lot of database engineers use logical replication is to achieve zero downtime, or as little downtime as possible, when doing major upgrades between database versions. Say, for example, you are on an unsupported 9.4 and you would like to upgrade to PG 10, 11, or the current PG 15 — you can use native logical replication and do a major upgrade with essentially zero downtime. Logical replication works in a publisher and subscriber model, which means you can have publishers publishing your data sets and data sources to a ton of subscribers; there can be more than one subscriber. One use case we have seen with the publisher/subscriber model is setting up your entire HA system with logical replication. You can also pick and choose which tables you want to be part of logical replication, so you can cover data warehousing use cases, lakehouses, and the modern data stack I was talking about. Comparing logical versus physical replication: physical replication covers the entire instance, whereas logical replication works database by database — you can pick and choose schema by schema, table by table. And with PG 15 there is row-level filtering, and you can publish only specific columns for subscribers to consume (a minimal sketch of a publication and subscription follows below). Under the hood it uses logical decoding; there are some very nice low-level APIs that do that work. Now, let us fast forward to multi-master replication and the history of the multi-master and multi-active systems that have been available in the Postgres ecosystem. There was one based off of 9.4 and 9.6: BDR-1, the first version of bidirectional replication — BDR stands for bidirectional replication. The erstwhile 2ndQuadrant, now the EDB folks, have something called Postgres Distributed, which is also one of the bidirectional / multi-master replication solutions available. And after BDR-1 there were BDR-2 and BDR-3, which were proprietary, closed-source. Now, why are we looking at multi-master? Why are we looking at multi-active solutions? We are talking about edge computing use cases a lot.
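As a concrete illustration of the publisher/subscriber model and the PG 15 row filtering mentioned above, here is a minimal sketch using psycopg2. The hostnames, credentials, and the orders table are hypothetical, and the subscriber must already have a table with a matching schema.

```python
# Minimal sketch of native logical replication with a PG 15 row filter.
# Hostnames, credentials, and the "orders" table are hypothetical.
import psycopg2

# On the publisher: publish only EU rows of one table (row filtering, PG 15+).
with psycopg2.connect("host=pub.example.com dbname=app user=postgres") as pub:
    pub.autocommit = True
    with pub.cursor() as cur:
        cur.execute("""
            CREATE PUBLICATION eu_orders_pub
            FOR TABLE orders WHERE (region = 'EU')
        """)

# On the subscriber: CREATE SUBSCRIPTION copies the existing rows first,
# then keeps streaming changes from the publisher.
with psycopg2.connect("host=sub.example.com dbname=app user=postgres") as sub:
    sub.autocommit = True
    with sub.cursor() as cur:
        cur.execute("""
            CREATE SUBSCRIPTION eu_orders_sub
            CONNECTION 'host=pub.example.com dbname=app user=replicator'
            PUBLICATION eu_orders_pub
        """)
```

Because the initial copy and the change stream happen while the old cluster keeps serving traffic, this is also the mechanism behind the near-zero-downtime major upgrade path described above.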
So, if we look at the stack we are trying to address here: we are getting the user closer to the system, closer to the compute, closer to the edge network. You've got wearables, you've got mobile devices, you've got the entire application stack on the right, and then the smart home systems, the IoT devices and so on. So what are we trying to do? We are getting the compute and the data closer to the system, closer to the users. We are also working with a few CDNs — content delivery networks like Cloudflare and Fastly — and I was also talking to a couple of folks from Varnish Software; they are here at FOSSASIA. All of these folks have static pages made available on the edge. What is missing is the data piece of it. How do we get the database closer to this edge compute network? That is what we are trying to solve with pgEdge — one of the reasons we named the company pgEdge, I would say. Now, what is the take on Postgres? How do we integrate with the Postgres ecosystem altogether? pgEdge has two products. One is the pgEdge Platform. The pgEdge Platform is completely standard Postgres and 100% open. If you take a look at the platform, it has the standard Postgres engine — we start supporting from PG 15, and if there are use cases we will be able to support PG 14 as well. The right-hand side of the slide is the standard Postgres offering. Now, Postgres has this super awesome extensibility model, so we have standard Postgres at the core and then the extensions I was talking about; we have already made 20-plus extensions available on the pgEdge Platform. You can go ahead and download the binaries from pgedge.com — you get the entire offering for free, so you can do your development activities and get your hands dirty with multi-master replication. Two components that are part of the platform are the node control and cluster control CLIs. Both of these are the machinery working under the hood that enables the cluster orchestration, that enables the entire pgEdge Platform. And the extension that makes the Postgres cluster multi-master is Spock, which provides the multi-active replication. Any multi-master replication solution requires super solid conflict resolution and conflict avoidance in place, and Spock brings multi-active and multi-master to the table with conflict resolution, while keeping things as simple as possible. So there are multiple masters writing, multiple sources writing, and under the hood you have strong algorithms and APIs taking care of conflict avoidance and conflict resolution — that is what Spock brings to the Postgres platform. Apart from the pgEdge Platform there is another product, pgEdge Cloud, which we will take a look at. But if you separate what is pgEdge-developed from what is Postgres-community-developed, the pgEdge-developed pieces are just these components: node control and cluster control.
That is the intelligence behind the cluster configuration and the standard configuration that gets set up as soon as you run the pgEdge Platform. So for adoption from an existing PostgreSQL cluster — how do you incorporate pgEdge, how do you adopt Postgres plus pgEdge — these are the components you require; everything else is PostgreSQL community-developed software. It is only these components — Spock, plus the node control and cluster control CLIs — that you attach to your cluster to make it edge-aware, multi-master, and multi-active. So what does the pgEdge Platform bring to the table? All of these features are part and parcel of a fully distributed PostgreSQL, and it is available for you to download, put on development machines, put on production, and so on. We also have enterprise support available — that is where our subscription-based support from pgEdge comes in, with some eminent Postgres contributors being part of that enterprise support team. Every product in the Postgres ecosystem that brings new functionality and feature sets into the system will have a licensing model, so what is the licensing model for pgEdge? It is a community license, along the lines of the Confluent Community License — Kafka is a very good example to look at. You can download Spock, attach the Spock extension to your already running database, and use it in production; there are absolutely no restrictions around using the system. The only caveat, the only catch, is that none of the cloud providers can package our product and host it as part of their cloud offerings like RDS or Azure Database for PostgreSQL. That is why we had the pun: sorry, AWS. And this is the cloud SaaS offering I was talking about — a fully managed DBaaS that you can set up on AWS and Azure at the moment; Google Cloud support is just around the corner, probably in a quarter or two we should be up and running on GCP as well. With the pgEdge Platform you can do an on-prem installation on your own commodity hardware, or use your cloud: use your AWS account, hop into IAM, grab an EC2 instance, and set up the pgEdge Platform code on it. So it is very easy to incorporate and set up the pgEdge Platform. As for access, with both the pgEdge Platform and pgEdge Cloud you have CLIs and web interfaces you can hook into. pgEdge also comes with a solid monitoring setup in place — we have Prometheus plus Grafana as the enterprise monitoring tools integrated with pgEdge Platform and pgEdge Cloud.
We will take a look at some pgEdge Cloud instance demo screenshots. If I summarize the whole pgEdge feature set, the interesting facts and figures: we get low latency because we are closer to the user — the database systems are closer to the user, meaning you are writing to the nearest available node. Your application code just has to be aware of the nearest database node available in the cluster, and that solves a lot of latency problems — a plus or minus 50 to 100 milliseconds that you can shave off. And when we are talking about millions and millions of transactions per second happening across the system, that is a pretty good tradeoff, I would say, if you are shaving 50 to 100 milliseconds off your connection requests. Then there is the ultra high availability model — we specifically call it ultra high availability because you have multiple master nodes and all of the master nodes can have a logical replica supporting them as a fallback instance. Like I said about logical replication, you can set up entire instances as logical replicas, and you can also set up witness nodes across clouds and availability zones and so on. Data residency is another very important use case we are seeing these days — you want to make sure data resides in your own continent, in your own ecosystem. There we implement something called PII-enabled partitions: you can have multiple partitions in place and those data sets can be PII-enabled. That is already baked into the product, into both pgEdge Cloud and the pgEdge Platform. And it is optimized for the network edge: like I mentioned, your data and your application layers are close to the edge, so they can take read and write access against the nodes closest to the network edge. And it is a fully managed DBaaS — pgEdge Cloud — with folks running the show there taking care of the support ecosystem. You can access pgEdge via the web, you can make REST API calls, you can use DBA-friendly clients and CLIs. The pgEdge Platform is more the CLI offering of it, which is completely open. The extension that empowers the whole edge computing story and brings multi-master replication into the system is Spock. It is an extension whose code you can go and see on GitHub. It allows you to do asynchronous replication between all your master nodes. Spock also bakes in predefined monitoring and the data dictionary and catalog — because it is completely based on Postgres, the data dictionary and catalog views are already available.
Like I said, with every multi-master replication solution the conflict resolution and error handling within the system is pretty important. You have master nodes — three or five of them — all writing, so you need conflict avoidance in the first place, and then, if there is a conflict, a way to resolve it. The conflict resolution modes we have available are apply-remote, last-update-wins, and first-update-wins, and you can configure them according to your data needs. Like we say: understand your data, understand the problem, and then decide which conflict resolution you want to use (a simplified sketch of how these strategies behave follows below). Then there is a simple example of how first-update-wins conflict resolution looks. These are two nodes, and we are talking about just one transaction here — but imagine a transaction volume of around 12,000 TPS, 12,000 transactions per second, all of it happening in a streamlined and foolproof manner. So we are looking at node one and node two. Right now pgEdge supports up to five nodes, so you can set up five different nodes in five different regions on AWS and Azure. For conflict avoidance we have not created any CRDTs — CRDT stands for conflict-free replicated data type. We have a simple implementation of grabbing the old value and the updated value, combining the node-two values, computing the exact result, and having it eventually propagated across all of the pgEdge nodes. The functionality we have simply makes sure the table has log_old_value set to true on the column — basically the primary key column — that we want tracked. What it enables is explained on the next slide: first we grab the old value of the column that is captured, then the transaction that is going to update the value comes in, both of them are computed, and you get a source of truth that is propagated and eventually consistent across all of the nodes. Now, a typical deployment of pgEdge: you can see a number of hub nodes here — this is one availability zone, we have one in Europe, one in the Australia region. All of the hub nodes can have logical replicas from a high availability standpoint; that is why we say ultra high availability. Right now, as I said, AWS and Azure are the clouds we can set servers up on and deploy to. When you hop onto pgedge.com you can create your own clusters, see how things work out, and see how multi-region database nodes are set up across pgEdge. These are UI screenshots I grabbed from our pgEdge system. And, like I said, one of the important things already available with pgEdge is the monitoring system.
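To make the apply-remote / last-update-wins / first-update-wins choices above concrete, here is a deliberately simplified model of timestamp-based conflict resolution in Python. This is an illustration of the idea only, not Spock's actual implementation; the field names and the tie-breaking rule are assumptions.

```python
# Simplified model of timestamp-based conflict resolution (illustrative only,
# not Spock's code): each row version carries the commit timestamp and origin
# node it came from, and the resolver picks a deterministic winner.
from dataclasses import dataclass

@dataclass
class RowVersion:
    value: dict
    commit_ts: float   # commit timestamp on the origin node
    origin: str        # node id, used only to break exact ties deterministically

def resolve(local: RowVersion, remote: RowVersion,
            strategy: str = "last_update_wins") -> RowVersion:
    if strategy == "apply_remote":          # always trust the incoming change
        return remote
    newer, older = (
        (local, remote)
        if (local.commit_ts, local.origin) > (remote.commit_ts, remote.origin)
        else (remote, local)
    )
    if strategy == "last_update_wins":
        return newer
    if strategy == "first_update_wins":
        return older
    raise ValueError(f"unknown strategy: {strategy}")

# Example: node1 and node2 update the same row concurrently.
n1 = RowVersion({"balance": 120}, commit_ts=1700000000.125, origin="n1")
n2 = RowVersion({"balance": 90},  commit_ts=1700000000.250, origin="n2")
print(resolve(n1, n2, "last_update_wins").value)   # {'balance': 90}
print(resolve(n1, n2, "first_update_wins").value)  # {'balance': 120}
```

The important property is that every node applies the same deterministic rule, so all nodes converge on the same row version regardless of the order in which changes arrive.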
So you get strong, enterprise-level monitoring with Prometheus and Grafana — that is available on the pgEdge Cloud dashboards, and you can see your database instances being monitored live. As for the security modus operandi of pgEdge, the onus is pretty much on the user: you bring your own cloud accounts, and you take care of the VPC peering, how much cost you want to incur, and all the other things associated with a cloud service provider. You have complete control — we did not want to be resellers of AWS or Azure, so we wanted you to have the pick of which cloud provider you work with; bring your account and we will set things up on those regions and that cloud infrastructure. And every operation you perform on pgEdge Cloud is audited, so you can go and take a look for any compliance checks, which are part and parcel of a lot of cloud product offerings. These are some useful links I have laid out — go hop on GitHub and pgedge.com; there are some cool blogs available on pgedge.com. One blog we have written is on distributed PostgreSQL with Cloudflare and some of the other CDNs we are currently working with. That is something we are pretty excited about and will be working on over the coming weeks. If you have any questions, I am around the whole day at the conference — feel free to touch base with me. Any questions? Yes. So, how does a client figure out which is the nearest hub it should connect to? What is the process? That is a great question. If you take the example of Cloudflare: Cloudflare has Pages, Cloudflare Workers, and then at the end there is the database. We are working with Cloudflare to have specific code in place as part of Workers. The Worker has some code inside it — simple Node.js code — that uses a geo library to pick the closest database node available. That is one of the CDNs we are working with; Fastly has something called a point of presence, the Fastly POP, and so do all these CDNs. Have you designed some sort of rules system for data residency, so you can build the logic about what stays local and what gets replicated? Absolutely, yeah. There is a whole bunch of infrastructure available for you to handle PII data, so you can have your PII partitions residing specifically in one of your regions — say, for example, GDPR plus the EU: you can see which requests are coming from the EU and have all those requests routed to the EU region and so on. Any more questions? Okay, in that case, thank you very much, Hari. Thank you. Let's thank our speaker. Hello, FOSSASIA. And we're going to be talking about databases at scale.
This isn't really just about Redis and Postgres — both are great databases — it's also about how we have to think about databases when we're operating above certain thresholds. We're going to talk about how that changes as our databases get big and busy, and also how Redis and Postgres are very different from each other. Basically, I have a bunch of experience with Postgres and also a bunch of experience with Redis. I'm not going to go into too much here for lack of time, but I've worked with Postgres for 24 years and I've probably worked with Redis for maybe six or seven years. As I say, they're both good databases. I like them both, but they're also very different, and there are a bunch of misunderstandings people have about both of them when it comes to scale. I wanted to thank, in particular, Adjust, where my case study came from. I used to work there, and we actually moved a bunch of Redis stuff to Postgres specifically for scaling reasons. Also, when I used to work at Delivery Hero, we had cases where we were looking at Redis and Postgres together in certain environments and asking when it is appropriate to use each — that was really helpful in developing my ideas on this. And then, of course, I'm representing OrioleDB here. I'm working with them, and this whole slide deck and presentation came out of a collaboration I started with them some months ago, where the question was: what can we do to make Postgres able to replace Redis in some cases? I wouldn't say you're ever going to replace it in all cases, but that gets into the question of what's good and what's bad about each system. So, for our agenda today: we're going to talk a lot about databases and scale. Then we'll have a quick tour through both Redis and Postgres in terms of their basic architecture, and we'll talk about how these are affected by volume and velocity of data, because both of those can impact scalability. We'll look at those carefully, we'll discuss the case study and why, at Adjust, we moved all these Redis systems into Postgres — it'll become very clear as we go through the architecture why that was the case. We'll talk about some common solutions for using one or both together, and finally some general recommendations. So first, databases and scale. There are a large number of things people believe about running databases at scale that tend not to match the experience we actually have when we get there. Of course, different levels of scale are different, and we'll talk about that. But people sometimes say — I've heard this many times — Redis is faster, it scales more easily, et cetera. This may or may not be true depending on what exactly you're doing. Similarly, if you're at a point where you have to scale out Postgres, there may be solutions that help you with that, or you can probably afford to hire people who know what they're doing to make it happen. That's the overall motivation of this talk, and then the question is how we use the tools we have in these cases. The first thing I really want to drill home is that how we use databases changes as they get big and as they get busy.
And I'm willing to bet most of us started the same way: when I started working with databases, I started with toy databases less than a gigabyte in size, and eventually through my career I've worked on Postgres databases that were 170 terabytes. In these different environments you end up thinking about your data and your database very differently, so where we start in our careers and where we can end up as we scale is very, very different. I'm going to go through a few size tiers here, and we'll talk about how, every time we increase the database by a factor of 10, some of our considerations change. First one: a database under one gigabyte. Who cares about indexing? Who cares about how efficient your SQL is? Who cares what the planner does? You write a query and it comes back almost instantly — unless you do something really, really stupid — because all the data fits in memory; you can sequentially scan through a gigabyte of data really fast. So at one gigabyte we typically want to focus on learning how to do things correctly: make sure our data is properly normalized, and all the basic things we would learn from a purely mathematical perspective. We can look at the database here as just a math engine and nothing else. When we get to 10 gigabytes, storage starts to matter a little bit, because not all the data is going to stay in memory — usually, unless you have a lot of memory. You're typically going to want to think about indexes, because even if the data is in memory, sequentially scanning through 10 gigabytes takes a bit longer. And if you're writing a really crazy reporting query that, say, processes 5 gigabytes of data 20 times in order to get one report, that's going to perform badly and you're going to have to start thinking about how on earth to write it better. So query efficiency starts to matter here a little, and indexes start to matter a little, but storage is usually not a big concern yet. When you get to 100 gigabytes, storage starts to matter more. You need to be pretty good at indexes by the time you get up here: you need to understand what an index is, what it's doing, and how to make it work, and you're probably going to have to start tuning your database for performance (a quick way to check whether a query is actually using an index is shown in the sketch below). At 10 gigabytes, what tuning do you need to do? At 100 gigabytes, you're probably going to have to tune some things. One terabyte: by one terabyte, we've typically gotten really good at indexing already; we don't have new indexing problems. Performance is going to be much more about storage and access patterns, and we're going to have to start thinking at these lower levels. We also start to face problems that were a little easier at lower levels. For example, I usually find that if your database is 100 gigabytes, certain approaches to backups work very nicely, but when you get to a terabyte they start becoming really, really difficult, and you may have to switch backup tools or rethink how you're doing backups. So a lot of these administration things become a lot harder when you go from 100 gigabytes to one terabyte.
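To make the "get good at indexes" point concrete, here is a minimal sketch of the kind of check this implies, using psycopg2 and EXPLAIN; the events table and column names are hypothetical.

```python
# Check whether a query actually uses an index once the table no longer fits
# comfortably in memory. Table and column names are hypothetical.
import psycopg2

with psycopg2.connect("dbname=app user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "CREATE INDEX IF NOT EXISTS events_user_id_idx ON events (user_id)"
        )
        cur.execute(
            "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM events WHERE user_id = %s",
            (42,),
        )
        for (line,) in cur.fetchall():
            # Look for "Index Scan using events_user_id_idx" versus "Seq Scan".
            print(line)
```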
So when you go from one terabyte to 10 terabytes, what happens? By the time you get to 10 terabytes, you're typically having to reason about the internals of the database in almost everything you do. The database is no longer a black box; it is a very transparent set of working equipment that you actually have to treat as working equipment. So internals matter a lot and you'll run up against them. And by the time you get to 10 terabytes, there are no longer any new kinds of problems you will face as you grow, because you're up against the limits of your hardware and the limits of the software. From 10 terabytes to 100 terabytes, it's the same set of problems. Basically, early on we think about databases as black boxes: internals don't matter, we don't really have to worry about performance tuning, queries are straightforward, et cetera. But as things grow, we start to have to think very carefully about internals, we have to work around the hardware, we have to think about the hardware — a lot of these other things become really important. So I'm going to go quickly through a case study of why we moved some systems from Redis to Postgres. This was a big ad-tech environment. We had a large number of very big Redis servers — I think it was somewhere around 20, if I remember right — and each Redis server was probably running 10 instances of Redis. It was a really complicated setup with Nutcracker and Sentinel. I guess with newer versions of Redis you have the clustering stuff that might help a little bit, but it was a massive, massive headache. It got to the point where we couldn't work on the servers reasonably well, because we didn't have complete confidence that if you had a Sentinel failover it would fail over all the right things with Nutcracker and so forth. Touching that was terrifying. So the question is how we got there and why we moved it. Another major reason to move, besides the administrative headache, was the fact that Redis hardware is more expensive than Postgres hardware: Redis has to have everything in memory, Postgres puts everything on disk. But the other big issue was that this big Redis infrastructure was brittle. To quickly describe the history: the first minimum proof of concept they had written, way back when they started, was in MongoDB and Node.js. Then they discovered MongoDB didn't scale that way, and neither did Node.js, so they switched to Redis and Golang, and that worked for a while; then they moved some things from Redis to Aerospike and some things from Redis to Postgres. And eventually we got rid of most of the Redis and moved almost everything to Postgres. So I'm going to talk a little bit about Redis's internals so it makes more sense why we did this. Redis is a main-memory database, but it has a very specific character, a very specific architectural decision that makes it go really fast on small workloads: it is a single-threaded event loop. There is no concurrency, and there is no ability to run queries in parallel on the same Redis instance. Newer versions can do disk I/O with threads, but they still cannot serve requests with threads, because that's all one critical section. What this means is that if you saturate that event loop, it just cannot go any faster. Redis goes one speed only, and you cannot throw hardware at it to make it go faster. Okay? Persistence is optional.
That becomes important if you're doing queues and such. And replication is more or less similar to Postgres, in the sense that there's effectively a binary stream of changes that gets distributed and applied. It is possible to write Lua scripts. I have not done this, and I have not evaluated how it affects this particular architectural point — I think the Lua scripts run outside, but I'm not 100% sure. If that's the case, then it would be a little less irritating than if they run in-process and can starve everything else. So, I mentioned persistence. The big thing about replication is that you're replicating writes, and those also have to go through the single-threaded event loop. So if you saturate an instance with writes, you saturate every instance with writes across all of your replicas. Writes compete with reads — a single write will compete with reads on every replica of the same Redis database. So: no parallelism, easy to saturate, things like that. Typically, as things grow with Redis, you set things up with Nutcracker — I guess now they have the clustering stuff, which should make some of this a little easier. The issue is that every replica has to have the same memory allocation, so when you try to scale, you're scaling out within a single box, which means more memory. And typically what I've had to do in the past has been Sentinel plus Nutcracker, and that gets really complicated really quickly. Now, quickly, Postgres. I assume people here mostly know it relatively well. It's multi-process, not multi-threaded, so every query gets its own backend process. This makes it much slower to start up and gives it much higher latency, but you can run many more queries in parallel and you can scale it up easily within the same box. Now, if you have to scale it out, you have to write your own tooling or use systems that may not be built for what you're doing — usually many people start writing their own tooling in those cases. So it scales up; it doesn't scale out so well. Postgres is persistent by default. It is possible to make it sort of non-persistent — you could throw things on a RAM disk, for example; don't do that. Replication is tied to the persistence, so if you get rid of persistence, like with an unlogged table, or try to put things on a RAM disk, it won't replicate. And you have many, many options for replication — there's the logical replication we heard about with Spock — and you can build some really complicated things there. So, compared to Redis, Postgres is slower per core and higher latency, and this is never, ever going to change, because the architectural decisions mandate it. But Postgres scales up much more easily than Redis does, and much more cheaply, and at scale Postgres is just a lot simpler to manage. So in many cases, if you don't specifically need one or the other and you're centralizing everything, you may find it much easier to put things on Postgres. But at massive scale you've still got to write all of your own sharding pieces. Costs are usually less — I've mentioned that. I'm going to skip over the scenario; again, this is the case study: it had massive amounts of Sentinel, and the complexity of dealing with a failover was very, very high. Redis can work very well as a cache (a sketch of the typical pattern follows below).
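As an editorial aside, the cache pattern where Redis shines typically looks something like this cache-aside sketch with a TTL, using redis-py; the key naming and the stubbed database lookup are hypothetical.

```python
# Cache-aside with a TTL, the pattern where Redis shines. Key names and the
# load_user_from_postgres() stub are hypothetical placeholders.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_user_from_postgres(user_id: int) -> dict:
    # Placeholder for the real database lookup.
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: no database round trip
    user = load_user_from_postgres(user_id)  # cache miss: fall back to Postgres
    r.setex(key, 300, json.dumps(user))      # expire automatically after 5 minutes
    return user
```

The expiry does a lot of the invalidation work for you, which is the time-to-live point discussed next.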
There's even a project that will read your logical replication stream from Postgres and put it into Redis as a cache; that gets rid of cache invalidation problems, so in the cases where you need that, it's great. Redis also has time-to-live, which is really, really helpful, so where you can leverage the TTL, Redis can in fact be a really good alternative. Postgres can also work as a key-value store — there are tons of ways of doing this — and you can store a lot of other things as well. Replacing Redis as a queue, though — the list types with pushing and popping — is not as clean a swap: if you're running a queue off of Postgres, you need to write it to run as a queue off of Postgres; you're not just going to lift and shift. So, a couple of things I would say: Redis and Postgres are different, and when you're at scale you need to make sure you have a variety of tools. They both have easily readable code bases, and it's easy to understand how they work. Typically I find Postgres much easier to manage at scale, but there may be many cases where you might still push something out to Redis. One case where I saw it be really helpful in my career was authentication tokens on a website: the time-to-live can do a lot of the work for you, and the tokens are read-mostly, seldom written, and have to be managed in exactly that way. So, I've just gone over my recommendations already, and I'm open to questions. We have questions for Chris. I liked what you mentioned about Postgres and queues — can we mix Postgres with Redis and let Redis manage the queue? To be absolutely clear here, it's not that you can't build a queue on Postgres — I've done it also. The issue is that on Redis you have this list data type where you can just push and pop. That doesn't handle persistence very well, so people usually just keep it in memory. For an in-memory queue, Redis is going to be the easiest thing to do, and I've worked in environments where that's been split off into Redis running even on the same system as Postgres. If you're trying to build something that can store more information, and you don't want to worry about memory limits, or you need persistence, then I probably wouldn't do the queue on Redis; I'd probably do it on Postgres. You mentioned some experience with OrioleDB — could you elaborate a little on some improvements you might see coming? Yeah. What we're building is a flash-optimized storage plugin for Postgres, because the Postgres data table format was built around the idea that sequential reads are a lot cheaper than random reads. If you start to flash-optimize things, random I/O is a little cheaper, and you can start to do things like block-level compression and some other relatively nice things. What I've seen, at least in the benchmarks we've put together, is multiple times the throughput if you're running on flash storage with transactional workloads. It may still have some benefits on spinning disks, but those benefits will not include speed and it will probably be slower. I would also say the compression is really, really helpful, and that by itself is something I'm really excited to see. Where is a good place to start looking at how to do a queue in Postgres? So I started a project on GitHub. There are a couple of implementations of this (the basic pattern they build on is sketched below).
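Before the specific projects named below, it may help to see the basic pattern most Postgres-backed queues are built on — each worker claims one row with SELECT ... FOR UPDATE SKIP LOCKED. This is an editorial sketch under assumed table and column names, not the speaker's implementation.

```python
# The core pattern most Postgres-backed queues use: claim one job with
# FOR UPDATE SKIP LOCKED so concurrent workers never block each other.
# The jobs table and its columns are hypothetical.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    id       bigserial PRIMARY KEY,
    payload  jsonb NOT NULL,
    done_at  timestamptz
)
"""

def claim_and_process(conn) -> bool:
    with conn:                      # one transaction per job
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE done_at IS NULL
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            """)
            row = cur.fetchone()
            if row is None:
                return False        # nothing left for this worker right now
            job_id, payload = row
            print("processing", job_id, payload)
            cur.execute("UPDATE jobs SET done_at = now() WHERE id = %s", (job_id,))
    return True

conn = psycopg2.connect("dbname=app user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
while claim_and_process(conn):
    pass
```

SKIP LOCKED lets concurrent workers pull different jobs without blocking on each other's row locks, which is the main thing a hand-rolled Postgres queue has to get right.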
So there's, of course, PGQ, which was originally, I think, part of the Skytools. That's one possibility. I also wrote a simple implementation called PG Message Queue. It's not for high loads or anything, but it was designed for answering the question of how do I send an email from the database backend: queue it and have something else send it. And then there's one I was working on — I haven't really completed it, but it's on GitHub and you can take a look at it — a project I created called PGTitanides, which was based on work I did on a big life-sciences database for really big, heavy queues and high workloads. That can probably be improved at some point because the code base was built around Postgres 9.4, but those would probably be the three things I would look at. I would probably start with PGQ because it's the most mature. That's all the time we have today for questions for Chris. Thank you very much, Chris. Let's thank our speaker. We have a short break after this. Welcome back, everyone. For those of you who don't know, I'm Yogi; I will be your anchor for the day. And we have Igor from Pythian. He's a principal consultant at Pythian and he is going to take us through the journey of scaling MongoDB. He's come here all the way from Europe, so welcome to Singapore. Let me take you through this session on how to scale MongoDB. My name is Igor and I'm working as a principal consultant at Pythian in the open source database practice. Pythian has more than 25 years of experience in managed database services and 450-plus experts across the globe; we are a premium partner with most of the cloud providers, we recently started supporting SAP and Snowflake, and we have customers across the globe. In this session today we will start with MongoDB high availability and how MongoDB achieves it, discuss a little bit about replica sets, their components and deployment topologies. Then we'll move into scaling — horizontal and vertical — what a sharded cluster in MongoDB is and its components, and at the end we'll see what the sharding strategies are with MongoDB. So let's start with high availability. High availability usually refers to systems that are durable and are likely to operate without failure for a long time. If we look at the systems in this diagram — systems A, B, and C — there is an application connecting to system A, where system A is a single database. If this database fails, the entire system fails, which means system A has no fault tolerance: as soon as the database fails, the whole system is down. System B looks like it has a fault tolerance of one, but that might not always be the case with systems that have a standby secondary meant to take over when the primary fails: in most cases it's necessary that someone else decides whether the other node is fit to take the traffic, or maybe there is just a network partition between the two, so with only two nodes in the system it's not always easy to decide which one will be the new primary. System C, on the other hand, has a fault tolerance of two nodes, because at any given time, if two nodes out of five go down, the rest still have a majority, so there is always a majority of nodes to hold an election and decide the new primary. So MongoDB achieves high availability by replication, and a replica set is just a group of processes that maintain the same dataset.
There is the concept of a primary, which is where all the writes go from the application via the driver, and there are secondary nodes, which provide redundancy for high availability and can also be used for scaling reads. The driver has options for sending reads with a read preference of primary or secondary, and there are other read preferences like primaryPreferred, secondaryPreferred, or nearest. Failover in a MongoDB replica set is automatic, which means that when the primary goes down, the rest of the nodes hold an election and decide what the new primary will be — but for that, there has to be an odd number of voting members. Some of the configuration options for a MongoDB replica set: the limit is 50 nodes in a single replica set, and seven of those can be voting members. There is also a node type called arbiter, which just participates in elections and doesn't hold any data. A priority-zero node is a secondary node that will never be elected primary, but it still has a vote. A hidden node is also a priority-zero node, with the difference that it is not visible to the application from the driver's perspective. And there is the delayed node — a secondary with a configured replication delay, usually used for disaster recovery purposes, taking backups, and things like that. Another feature of MongoDB replica sets is configuring the secondary nodes with tags, and this is good for scaling reads. So if our primary node is, let's say, in Singapore, there are seven voting members in this replica set and the rest are non-voting members. If we configure the members with tags, we can do local reads: if our application is North America based, with tag options we can direct reads to just the nodes that have the North America tag, and we can do the same with Europe or Australia or Japan or whatever we configure in the tags. This way we can scale our reads by adding more members as well as using data locality (the driver-side view of this is sketched below). That works only for reading data — any write that this application from North America does will still have to go to the primary node.
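Here is how the tag-based local reads described above look from the driver's side, as a hedged pymongo sketch; the connection string and the "region" tags are hypothetical.

```python
# Reading from tagged secondaries of a replica set, as described above.
# The connection string and the "region" tag values are hypothetical.
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

# Route reads to secondaries tagged region: "NA".
na_reads = client.get_database(
    "app",
    read_preference=Secondary(tag_sets=[{"region": "NA"}]),
)
print(na_reads.orders.count_documents({"status": "open"}))

# Writes always target the primary, regardless of read preference.
client.app.orders.insert_one({"status": "open", "region": "NA"})
```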
So let's move into scaling. In general, for every process, databases included, when the working set outgrows the memory, it's usually time for scaling. With a database we want the indexes plus the working set to fit in memory, so as the database grows, at some point the working set outgrows it and we have to start thinking about adding more memory, or even more CPU power, to the instance hosting the database. And the first thing we can do with any scaling — applications or databases — is grow the instance. So when we think about vertical scaling: the first machine worked for a while, it served its purpose, and then we need to grow vertically — we need to make the machine bigger in terms of computing power. Here is an example from GCP: if we start with a small instance with, let's say, 8 gigabytes of memory, that will serve for a while; as soon as our database outgrows it and the working set no longer fits, we need to increase the instance size, and we can do that up to a certain limit. Vertical scaling cannot go beyond certain physical limits — in GCP, for example, we cannot go above the m2-megamem-416 instance. Even if we could, it would probably cost a lot more than going with horizontal scaling. Horizontal scaling means that instead of building the machine up, adding memory and CPU, we partition the data and parallelize the workload, so more instances working in parallel do the same job and return the results. In the computing world that means adding more instances in parallel — but adding instances in parallel means more operational work, so it adds complexity: when we query, we need to know whether our data is on A, B, or C, how we return the data, and whether we need to query all of these machines or can send the query to just a single machine. And if we start out with horizontal scaling, it is initially more expensive compared to vertical scaling, but over time, if we keep needing to scale, horizontal scaling costs grow linearly whereas vertical scaling costs grow almost exponentially. So there is a point where we need to decide whether we keep growing vertically or go horizontal. All of these complexities about scaling are solved in MongoDB with sharding. A MongoDB sharded cluster is a cluster environment where we split the data across multiple instances: we shard. It's not just adding a pool of resources — the data is actually partitioned across multiple shards. The components of a MongoDB sharded cluster are: the shards, which is where the data actually lives, partitioned; the metadata on the config servers, which is where the information lives about where each document in the database exists; and the MongoDB router, the mongos, which is the interface between the application and the cluster — the router gets the metadata from the config servers and routes the queries. Each shard holds a subset of the entire data in the cluster; each document exists only once, on a single shard. There is a primary shard for each database in the system, and the primary shard is assigned when the database is first created through mongos. So if we have a database with a collection that does not have sharding enabled, that collection will exist only on the database's primary shard — it does not get distributed across the shards automatically. We need to enable sharding for that collection, decide what the shard key will be, and then it gets distributed across the shards (a short sketch of this follows below). So in this case, collection one has sharding enabled and will be more or less evenly distributed across the shards. Collection number two does not have sharding enabled, so it exists only on shard A, where the primary shard for this database is — and that's important, because if we have many collections that are not sharded, we are potentially making a single shard a bottleneck, since that's where most of the queries for those collections will go. And the config servers hold the metadata of the cluster: when we decide to shard, that is where all the information is stored about which documents live where and how the mongos will route the queries to the shards to get the result.
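As a short sketch of the step just described — enabling sharding for one collection through a mongos router — here is what it looks like via the admin commands pymongo exposes; the database, collection, and shard key are hypothetical.

```python
# Enabling sharding for one collection through a mongos router.
# Database, collection, and shard key are hypothetical.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017")

# A collection is not distributed until sharding is enabled for it.
mongos.admin.command("enableSharding", "app")
mongos.admin.command(
    "shardCollection",
    "app.orders",
    key={"customer_id": "hashed"},   # hashed shard key -> even distribution
)
```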
Starting with MongoDB version 3.4 — which went end of life a long time ago — the balancer runs on the primary node of the config servers, and the config servers also run as a replica set for availability. The config servers also hold the admin database, which stores all the information about users, authentication, and privileges on the system. The balancer — we'll talk about it more later — is a process for which we can define a balancing window in the cluster, so we can, for example, enable balancing during the night and have the balancer stop and not migrate any data during the day; it's flexible. The last piece of the sharded cluster is the mongos, which is just a router and has no persistent state. It keeps all the metadata in memory, gets the metadata from the config servers, refreshes its state, and when a query is sent it decides which shard to send the query to. Since version 4.4, mongos supports hedged reads, which means that if we are reading from a secondary, it will send the query to multiple secondaries in the shard, and whichever returns fastest is the response returned to the client via the driver. There are two options for sharding with MongoDB. The first is range-based sharding, which divides the data into contiguous ranges. With this type of sharding it's more likely that documents with close values will live in the same chunk, and potentially on the same shard, so when we send a range query to the cluster it's more likely that the query will be served by a single shard, and with that we get query isolation. We are eliminating, potentially, the scatter-gather query — a broadcast query that would have to go from shard 1 to shard N, scan all the documents, and then return the result. Range-based sharding, if the shard key is properly chosen, should eliminate that and give better query isolation. If instead we go with hash-based sharding, the cluster applies a hash function to the shard key of every document — the application doesn't need to compute it — and with that hash the data is distributed effectively at random. By doing this we get better data distribution across the shards, at the expense of poorer query isolation when scanning a range of shard keys. The data in the cluster is grouped into logical units called chunks — a contiguous range of documents. Starting with MongoDB version 6, the default chunk size is 128 megabytes; previously it was 64, and it's customizable between 1 megabyte and 1 gigabyte. If we lower the chunk size toward 1 megabyte, we'll have a lot of chunks and the data will be more evenly distributed, at the expense of many chunk migrations across the shards. If we go with a larger chunk size, there will be fewer chunk migrations, but the data might not be as evenly distributed. And also starting with MongoDB version 6, which is the latest, balancing is based on the amount of data instead of the number of chunks: previously the balancer just made sure each shard had the same number of chunks; starting with version 6 it makes sure each shard has the same amount of data. And some tips on choosing a shard key.
We're running out of time, but a good shard key should have very high cardinality and low frequency, and it shouldn't be a monotonically increasing shard key. Again, the same as with replica sets, we have configuration options — zones. We can localize data: say we want to localize data in North America, we can tag the shards as North America, and shard-key ranges assigned to the North America zone will have their documents live only on those shards. This is good, for example, for the GDPR rules — we usually have applications that need to keep their data in Europe, so we can localize the data to live only in Europe. Some summary best practices for deploying and scaling applications on MongoDB: sharding should be a last resort, for when we no longer have vertical scaling options. If we can run the system as a replica set, it's better to keep it a replica set and move to a sharded cluster only when we really need to scale. And that's all. Thank you. Thank you very much, Igor. Do we have questions for Igor? Thank you for the talk. I have a question: when you shard, you always introduce some problems, like data consistency between shards and things like this. I have heard that Mongo added support for distributed transactions. What is the current state of this work in Mongo? Yes, there are distributed transactions with MongoDB. A transaction is basically either committed or, if we want to roll back, rolled back. But that's not the bigger problem when we want to save data on a distributed system. There are other problems with MongoDB even at the replica set level — I don't know if we have enough time, we can discuss it afterwards — but if a document is inserted on a primary node and that primary fails, there is no guarantee it has been replicated to the secondaries, and a secondary that has not replicated the document may be elected as the new primary. So we need to make sure the write concern is set properly. MongoDB has write concerns, which can be 1, 2, or n — the number of nodes — or majority, which means the write is acknowledged once it is replicated to a majority of the nodes; that is what can guarantee our writes survive across the replica set, and that includes the cluster (a driver-level sketch of this follows below). Do we have any other questions? From the sharding and distribution standpoint, is there any built-in tool or some tooling to show the distribution — how do you know if it's equally distributed or not? sh.status(), for example, shows the status of the cluster, and at the collection level there is db.collection.getShardDistribution(), which prints a summary per shard: how many documents, the average document size, and the percentage of the data. No, it's already updated in the metadata. Okay, it's automatically updated, thanks. Thank you for your speech. Are there any recommendations for hash-based sharding instead of range-based sharding, or perhaps some use cases? Hash-based sharding is in general better for more even data distribution and for more write-intensive workloads, while range-based sharding — and even the latest versions of MongoDB have sorted out its issue with uneven data distribution — has the benefit of local, per-shard reads, so we eliminate broadcast operations. A broadcast operation on the cluster means sending the query to all the shards, and with that the latency increases and the query response is slower, instead of just directing the query to a single shard and getting the result back.
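The write-concern point from the Q&A above looks like this from the driver, as a hedged pymongo sketch; the connection string, database, and collection are hypothetical.

```python
# Ask the replica set to acknowledge a write only after a majority of voting
# members have replicated it. Connection string and names are hypothetical.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")
orders = client.app.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

# This insert only returns once a majority of nodes have it, so a failover
# cannot silently lose an acknowledged write.
orders.insert_one({"status": "open"})
```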
That's all the time we have for questions. Thank you very much, Igor. Welcome back everyone, I'm Yogi, and we have Koji-san all the way from Japan sharing with us about ETL made easy with Apache Hop. I'm looking forward to this. Thank you, Koji-san. ETL made easy with Apache Hop — which, alright, no one knows yet, okay? I'm the founder of the Apache Hop user group Japan and of the Neo4j user group Tokyo, part of the Japanese team, and I love the Neo4j graph database, so I'm a graph community MVP and a Neo4j speaker, a Neo4j Ninja. Okay, today's agenda is here: what is Apache Hop, the community, Hop concepts, and Hop versus Kettle — do you know Kettle, the ETL tool? No? Okay — then plugins, database connections, languages, getting started with Apache Hop (how to download it, how to use it), a demo of Apache Hop, conclusion, and Q&A. Okay, what is Apache Hop? Apache Hop is the short name of the Hop Orchestration Platform — HOP. It is an open source data integration platform. It's easy to use, it's free, it's an Apache top-level project, open source under the Apache License version 2, and it is a fork of Kettle. The history is here: in 2019 Hop initially started as a fork of Kettle — who knows Kettle? No? Oh, okay. In 2020 Apache Hop entered the Apache Incubator, and in 2021 it became an Apache top-level project, so the "incubating" label was dropped — it is just Apache Hop. This year Apache Hop 2.4.0 was released. The community is here: this is the user group Japan, the meetup is here, Facebook is here — a little small user group, please join us. Okay, and this is YouTube; you can watch videos about Apache Hop there. Okay, the concept is here: Apache Hop is visual development, easy to use, a drag-and-drop interface — you just make workflows and pipelines. Design once, run anywhere, like Java: it runs on the native Hop runtime, and on Apache Spark, Apache Flink, and Google Dataflow. Apache Hop has life-cycle management: Hop GUI, projects, environments, runtime configurations, and it manages Git versions. Hop 2.0 is here: GUI, conf, encrypt, run, search, server, translator. Hop versus Kettle: Kettle is the base of Hop, so there are quite a lot of differences in user interface and naming. The features are here, and we use very different names — these are the most common, I think. The graphical user interface: Kettle has Spoon, we use Hop GUI. And hop-run — hop-run is easy to use and easy to understand. Easy to understand, don't you think? We support unit testing, and Apache Spark, Flink, and Google Dataflow are supported. Features are here — we added a lot of features over Kettle. Hop has a lightweight server, really light, like Ruby or Python — port 8081, just like that, you can use the lightweight server. This is a pipeline: pipelines, together with workflows, are the main building blocks in Hop. And next, workflows. Workflows are one of the core building blocks in Apache Hop: where pipelines do the heavy data lifting, workflows do the orchestration work, so we can control several pipelines from one workflow. We have a lot of plugins, a lot of servers, a lot of databases. Among the action plugins there are SSH and SFTP, so we can get data over SFTP. MySQL is supported, and Neo4j is also supported. Hop has a lot of database connections available — we use JDBC drivers and native drivers. The databases are here: in 2.4.0 we added DuckDB — you know DuckDB? A lot of databases here: DuckDB, DB2 too, Access, Postgres, Oracle, MySQL — a lot of databases we can read. Okay, the transforms are here: add checksum, add sequence, XML join, output, and we can work with zip files.
Extended plugins are here. We have Dropbox input and output, and Excel output, and Google Analytics and Google Sheets input and output, and a lightweight LDIF input interface, and MQTT input and output here. This is a sample Postgres interface: the JDBC driver, version 42.4.3, and you can just enter it like this — you can use Postgres. And we have a lot of languages. One is simplified Chinese; we added simplified Chinese, and English. Japanese is also a beta version. The languages here — English and Chinese, Italian, French — are not beta versions. The beta versions are here, a lot of languages; the beta versions are Japanese and Korean and so on. Okay. You can download 2.4.0 or the 2.5 snapshot from the website, here. And you can also use Docker. Okay. Can I do the demo? Do you mind? Okay, a short demo. And this is the Docker version of Apache Hop here, and you can access it here. This is the top-level screen of Apache Hop. Okay. And this is a screen, and so please imagine you want to create fake data for your application. Okay. And we make a pipeline. Okay. So, generate... okay, this creates data, just 10 records. Okay, 100. And I can't run the demo with your screen. Okay. Kafka. Yes. Yes. I love Kafka. Okay. I think there is a big update; I just downloaded it. Yeah. Yeah. It doesn't work with the screen presentation. Yeah. I think there is some compatibility issue between the streaming software and this. But yeah, I just tested it on my machine; it does work. Do we have questions for Koji-san? In the whole data pipeline, from gathering to processing data, where does Hop contribute? I mean, I'm just reading a little bit here, and sorry, it's the first time I'm hearing about this. It seems like it's kind of a GUI for you to interact with data, right? So in case you know exactly what your data set is and what you want to transform, and you can transform it by code, what does Hop contribute to that whole data processing? Yeah. So the question is: what does Hop contribute in the actual processing of the data? Is it only the visual design, or can it even run the data pipeline? Yeah. Hop can do both. So it has the visual designer that you are looking at, and you will create the pipeline and run it using the runtime. So it has both. Any other question? Because I have a few questions. How is it related to or different from Spring Cloud Data Flow? Are you familiar with Spring Cloud Data Flow? It's also a Java-based data processing technology, the same concept. It has pipelines, and they also have what they call source, processor and sink components. And they also have a visual designer and everything. I will try it. It's excellent. The one thing that I know from Spring Cloud Data Flow is that for every component you add, the communication between, like, your source and the processor goes through a messaging system, like Kafka or Google Pub/Sub or something like that. Is it the same here as well? Yeah. So under the hood, between the steps, the data would actually move through the messaging system. Yeah. And you can use email or something like that; we can use email as a sink, so if we want to, like, compose emails and all that. That's wonderful. Wonderful. Okay. Thank you very much, Koji-san. Alright. The mic? Sure. Right-click on the tab and you can mute it, otherwise there will be music. Ah, okay. Alright. Oh, I see. Yeah. So hi guys. My name is Omar. I work at TikTok. In this talk I wanted to talk a little bit about InnoDB. I'm curious, curious amongst the audience here: how many of you have heard of InnoDB? Awesome. That's good.
And how many of you have used it? Probably everyone, right. Okay, cool. Pretty significant. Nice. Let me just give you guys a historical background. What is InnoDB? So InnoDB is a storage engine for MySQL and also MariaDB, which is also technically MySQL. And since MySQL 5.5, which was released 13 years ago, InnoDB kind of replaced MyISAM as the default storage engine. But it's interesting to ask yourself: how come MySQL has multiple storage engines? And this is an interesting design decision, because from MySQL's perspective it's designed to work with multiple storage engines. They offer this abstraction, and different storage engines have different applications. So you can build a storage engine that's meant more for OLTP workloads or OLAP workloads — it depends — but you can keep the same common components the same. That's the design philosophy here. So what does a storage engine do exactly? There are actually two big parts to a storage engine. One is that it manages the in-memory structures — basically you need to keep the pages your queries touch in memory as much as possible — as well as managing the on-disk data structures. And InnoDB has a bunch of data structures that it uses to make all this work. It also manages your write-ahead log. So there are a bunch of these things. I won't have time to go through a lot of them because this talk is kind of a lightning talk; I'll just talk a little bit about the high-level picture. So, as you guys probably know, in terms of the storage hierarchy the fastest thing you have is your CPU registers, and probably the slowest thing you have is network storage. Hard drives and SSDs are somewhere in between, and DRAM is fairly fast as well. Non-volatile storage in general is relatively slow; I think everyone knows that. It's also cheap, so if you're storing tons of data you probably want to store it on non-volatile storage, and you're probably okay with higher latencies. This is one of my favorite slides, and there's a website for this as well. I'm a bit of a performance nut, so if you guys are interested in performance, this is something you must know. What is the latency for a cache hit? L1 and L2 caches are in your CPU; they're almost as fast as register lookups, whereas DRAM is around 100 nanoseconds. SSD, you can see, is a few orders of magnitude slower, HDD is several orders of magnitude slower again, and network storage sticks way out. So if you convert this to time scales we can understand: if a cache hit is 1 second, then DRAM is about 100 seconds, fetching from SSD — which is supposedly quite fast — is about 4 hours, and HDD would be about 3 weeks. So this is something you must be aware of when designing any system that needs to deal with a lot of data. Another thing we need to be mindful of is that random access on non-volatile storage will almost always be slower than sequential access, and generally storage engines and DBs will maximize sequential access. So they'll tune their algorithms around this, because this is the fundamental bottleneck. It's not about algorithmic complexity, it's about making sure everything is sequential.
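To make the "rescale everything so a cache hit is one second" comparison concrete, here is a small hedged sketch; the nanosecond figures are the commonly quoted ballpark numbers, not measurements from the talk.

```python
# Hedged sketch: rescale rough, commonly quoted latency figures so that an
# L1 cache hit corresponds to one second. All figures are approximate.
LATENCIES_NS = {
    "L1 cache hit": 1,
    "DRAM access": 100,
    "SSD random read": 16_000,       # ~16 microseconds
    "HDD seek": 2_000_000,           # ~2 milliseconds
    "network round trip (same DC)": 500_000,
}

BASE = LATENCIES_NS["L1 cache hit"]

def human(seconds):
    """Format a duration in the largest convenient unit."""
    for unit, size in [("weeks", 604800), ("days", 86400), ("hours", 3600),
                       ("minutes", 60), ("seconds", 1)]:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

for name, ns in LATENCIES_NS.items():
    scaled = ns / BASE  # seconds, once a cache hit is defined as 1 second
    print(f"{name:32s} -> {human(scaled)}")
```

With these rough inputs, the SSD read lands around 4 hours and the HDD seek around 3 weeks, which is the scale of difference the slide is making.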
So the first thing is, inside InnoDB there's a concept of a page, because when you organize data, you organize it in pages, and the reason for doing this is that when you are writing to non-volatile storage, the API is normally block-based. So you work with blocks of data, and to optimize that further you want to structure the way you save the data into blocks, because if you write one byte you're actually writing the whole block, and normally the hardware page is 4 kilobytes. So you don't want to do byte-by-byte operations, because then you're wasting a lot of throughput. So what databases do is organize storage into pages. The pages can vary in size: database pages will usually be between 512 bytes and 16 kilobytes; MySQL — or rather InnoDB — uses 16 kilobytes. And the way the page layout is organized is that there's a header in front that holds metadata and bookkeeping about the page, and then usually there is this thing called a slot array. So this is the slotted page design: the slots are pointers pointing to where the actual tuples containing your data are kept. And the reason for this design is so that it's easy to support tuples of variable length, because tuples can have variable length; they're not always going to be fixed length. That's why the slotted page design is used. So another thing about InnoDB is that you need to keep these pages I just described — which usually have the same representation on disk and in memory — in memory as much as possible, because like I mentioned, you want to fetch things fast, and between memory and disk there's a huge difference in latency. So normally InnoDB uses an LRU-based algorithm to keep pages in memory, and the larger your buffer pool is, obviously, the bigger the speed-ups you're going to get, which kind of makes sense. That's also why, if you give MySQL some memory, it will never give it back to you: it's going to keep this buffer pool and it wants to use it as much as possible. Just to give you an idea of how the whole flow actually works: normally when you send a query, the query goes to the execution engine. The execution engine will come up with a query plan; it decides how it should execute the query. After it has decided, it talks to InnoDB and says, I need to fetch these pages; these pages might be on disk and not in the buffer pool yet, so they'll be fetched from disk and loaded into the buffer pool, and then the query will operate on that and send the result back to the user. One thing about InnoDB in particular is that it's ACID-compliant. How many of you know what ACID means? That's great. The A in ACID means atomicity, and the key idea here is that when you have transactions, either everything happens or nothing happens. This is quite important for many applications; the NoSQL movement was quite hip 10 years ago, and eventually most NoSQL systems ended up adding transactions again, because transactions are critical for many, many use cases. Especially, you can imagine transactions are extremely critical because you want to make sure that if you have two database operations, either both succeed or neither does.
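As a hedged illustration of that all-or-nothing behaviour — the classic transfer between two rows — here is a small sketch using the mysql-connector-python driver; the credentials and the accounts table are hypothetical, and the autocommit and COMMIT/ROLLBACK mechanics are explained in more detail just below.

```python
# Hedged sketch: explicit COMMIT/ROLLBACK on an InnoDB table using
# mysql-connector-python. Credentials and the table name are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="bank"
)
conn.autocommit = False  # a bare statement would otherwise auto-commit

cur = conn.cursor()
try:
    # Two operations that must both happen or not at all (atomicity).
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()      # both changes become visible and durable together
except mysql.connector.Error:
    conn.rollback()    # neither change is applied
    raise
finally:
    cur.close()
    conn.close()
```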
We need atomicity to support that, so the way atomicity works in InnoDB is: if you just fire a SQL statement without a transaction, it will be auto-committed by default, and you can tune this if you want to; and for transactions there are COMMIT and ROLLBACK statements for you to commit or roll back. Another part, the C in ACID, is consistency. What that usually means is you want to make sure that if you write data to your DB, you get that result back. In InnoDB you can imagine this can be quite tricky, because while you're running you can crash, and if you crash mid-flight you might lose consistency: maybe from the application's point of view you wrote it, but actually it wasn't written to disk yet. So how is this managed? InnoDB maintains this thing called a doublewrite buffer, which is kind of like a backing store: basically pages are written to that buffer on disk first, and if that buffer is lost it's fine. InnoDB also uses a write-ahead log separately, so if you crash, it replays the write-ahead log so that you can recover. I won't go into this in too much depth. Alright, one other very fascinating area that I'm a big fan of is isolation, the I in ACID. So what does isolation mean? It controls the extent to which a transaction is exposed to other concurrent transactions. Because if you think about it, you could have the simplest database possible, which is just single-threaded, and it doesn't need to do anything special: if you have a single-threaded database running on one core and you just give it a query, then basically you don't need any locks, you don't need any protection for concurrency. This model can work, and actually there are some databases that do that: Redis is quite famous for doing that, and there's also VoltDB, which is designed with the same idea in mind. However, the problem is that you also lose a lot of concurrency, and concurrency for databases — especially those that deal with disk — is quite important, because if you're only running a single thread, you'll be spending a lot of time spinning, waiting for the disk. So we want to maximize concurrency, especially on modern hardware where we have a lot of cores. The problem, however, is that when you have concurrency and multiple transactions going on at the same time, all operating on the shared buffers and the shared storage, a lot of interesting problems can happen. For example, your transaction is ongoing and you read something, but someone else writes to the same object you're supposed to read, and you get a dirty read. Or you might get non-repeatable reads, you might get phantom reads. I won't go into that in too much detail because we don't have much time, but we can talk more about this afterwards. But isolation levels — isolation is very, very important, and in InnoDB there are four isolation levels, and it's very important for application developers to understand which one to use and what makes sense, because this decision is left to the application. So the strongest level is called serializable. The guarantee here is essentially that it is as if the transactions were executing one by one. However, in actuality they're not: in actuality InnoDB implements two-phase locking to still allow some concurrency, but InnoDB will guarantee that the results you get are correct.
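As a hedged sketch of how an application actually picks one of these isolation levels per session (the connection details and the query are hypothetical; the SET SESSION TRANSACTION ISOLATION LEVEL statement itself is standard MySQL syntax):

```python
# Hedged sketch: choosing an isolation level per session before starting a
# transaction. Connection details and the query are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="shop"
)
cur = conn.cursor()

# Trade correctness guarantees against locking overhead: SERIALIZABLE avoids
# phantom reads at the cost of more locking; REPEATABLE READ is the default.
cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE")

conn.start_transaction()
cur.execute("SELECT COUNT(*) FROM orders WHERE status = 'pending'")
(pending,) = cur.fetchone()
# ... further statements in the same transaction see a consistent view ...
conn.commit()

cur.close()
conn.close()
```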
The default level in InnoDB is actually repeatable read. Repeatable read does have an issue, in that you might get some phantom reads, but that's the default. The reason for making repeatable read the default is that it's better for performance, because with serializable you basically need to do a lot more locking and your performance is going to suffer. The recommendation is that if you can't live with phantom reads, you should use serializable; by default most people don't — most people use repeatable read. And one of the interesting things about this is that it's not very visible: you probably wouldn't notice it unless you dive into your data and see, oh wait, maybe the data doesn't add up, something went wrong here. Phantoms can show up in a few places, but what I mean by a phantom is: you have a transaction and you selected and are processing some objects, and while your transaction is running, someone else inserts a new object that matches your criteria. That kind of thing can happen. Read committed is another isolation level provided by InnoDB, where the performance is better, but once again it also allows phantoms, and it also allows non-repeatable reads. A non-repeatable read is when you might not get the same result if you read again, because someone else mutated that data in between. Under repeatable read, InnoDB will normally use snapshots, so it copies over and keeps different versions of the data, but under read committed you'll actually be seeing the latest committed data each time. And read uncommitted is, like, basically anything can happen — it's basically MongoDB: you don't get any transaction isolation, so any of these problems can happen, and if you're using this, then, I mean, why are you using MySQL? Alright, then let's briefly talk about durability, because I'm also out of time. So, as I mentioned earlier, InnoDB has a doublewrite buffer; this is one of the things that helps us get durability. The doublewrite buffer is a storage area where InnoDB flushes pages, because — remember — there is a buffer pool: when a page is dirty, InnoDB writes that page to disk, and the doublewrite buffer helps manage that and helps you recover, because we actually write it twice: once to the doublewrite buffer and once to the actual page. But the nice thing is that there are some optimizations to minimize the cost of repeated fsyncs, so multiple fsyncs will not be as expensive, because you can batch everything into a large sequential chunk. With that, I'm out of time. Nice, so do you guys have questions? Thank you for the talk. I'd like to ask: are there any main differences of the InnoDB engine from other engines, or is it just very many small things, more optimizations, that make the InnoDB engine better than others in some way? Are you comparing MySQL engines, or are you comparing to, like, any other storage engine? The same, MySQL engines. Right, there is a huge difference. So especially MyISAM — I believe the philosophy there is quite different; it's not as optimized for OLTP workloads in particular, as is my understanding, whereas InnoDB is very much targeted towards OLTP: small transactions, but lots of them. So there is a huge, huge difference; the code base is completely different, and the design and architecture are also completely different.
And if you compare InnoDB to other database engines — for example, Oracle has their own storage engine, and DB2 from IBM has their own storage engine — do you compare InnoDB to those guys? I mean, they're similar architectures, definitely, but those guys, because they're enterprise, will have much stronger performance guarantees, because they can spend a lot more engineering manpower, whereas InnoDB is open source, so it doesn't get that much love. Yeah, it is. Thank you. So I guess one of the other popular engines in the MySQL world is MyRocks. Yes, yes, that's a good one. So how do you contrast the two? That's a great point. So RocksDB uses LSM trees, which is different from the standard B-tree mechanism used by InnoDB. It's pretty interesting to contrast them both. One of the things I know is that RocksDB is optimized for disk space: basically the tradeoff is you may get slower reads and faster writes, and you need less disk space to store the same amount of data; however, certain types of queries are a lot slower in MyRocks, so you have to weigh the tradeoff a bit carefully. We use MyRocks in situations where we have a lot of data and we want to optimize for disk space; MyRocks is pretty good for that. Hi, I understand that the InnoDB storage engine is optimized for handling transactions, for OLTP workloads, so let's say when there's a huge growing workload and you need to handle, let's say, 100k TPS, and you need to handle, for example, 100 terabytes of data — how do you do that with InnoDB? It depends: if you're bottlenecking on a single machine, you just shard it. Obviously, especially if you have a heavy transactional workload, you probably can't get that much TPS on one machine, so generally the idea would be to shard it. There are solutions for sharding; there are open source solutions, like for example YouTube had one, I forgot the name — yes, Vitess. So Vitess is pretty good for that, and there are internal ones too; I think most companies will do sharding. Well, that's all the time we have now, so I guess this is the most important time of the day: it's lunch. So thank you everyone for joining this morning; we will be reconvening at a later time.