We came all the way from Europe, so welcome to SINEPA. Thank you. Thank you, everyone, for joining my session. And good morning. Am I audible? It looks like there was a lot of fun at the social event last night. So let me take you through this session on how to scale MongoDB.

My name is Igor, and I'm working as a principal consultant at PTM in the open source database practice. PTM has more than 25 years of experience in managed database services, with 450-plus experts across the globe. We are a premium partner with most of the cloud providers, we recently started supporting SAP and Snowflake, and we have customers across the globe.

In this session today, we will start with high availability and how MongoDB achieves it, and discuss a little bit about replica sets, their components, and deployment topologies. Then we'll move into scaling, horizontal and vertical, and what a sharded cluster in MongoDB is and its components. At the end, we'll look at the sharding strategies in MongoDB.

So let's start with high availability. High availability usually refers to systems that are durable and likely to operate without failure for a long time. If we look at the systems in this diagram, A, B, and C: there is an application connecting to system A, where system A represents a database. If this database fails, the entire system fails, which means system A has no fault tolerance.

System B looks like it has a fault tolerance of one, but that might not always be the case. With systems that have a standby secondary to take over when the primary fails, in most cases someone else has to decide whether the standby should take the traffic, and there may be a network partition between the two nodes. So when there are just two nodes in the system, it's not always easy to decide which one will be the new primary.

System C has a fault tolerance of two nodes, because at any given time, if two nodes out of five go down, the remaining nodes still form a majority. So there is always a majority of nodes to hold an election and decide which will be the new primary.

MongoDB achieves high availability through replication: a replica set is a group of processes that maintain the same data set. There is the concept of a primary, and that's where all the writes go from the application through the driver. And there are secondary nodes, which provide redundancy for high availability; the secondaries can also be used for scaling reads. The driver has options for sending reads with a read preference of primary or secondary, and the other read preferences are primaryPreferred, secondaryPreferred, and nearest.

Failover in a MongoDB replica set is automatic: if the primary goes down, the remaining nodes hold an election and decide which will be the new primary. But for that, there has to be an odd number of voting members.

Some configuration options for a MongoDB replica set: the limit is 50 nodes in a single replica set, and at most seven of those can be voting members. There is also the arbiter node type, which just participates in elections and doesn't hold any data. A priority zero node is a secondary node that can never be elected as primary, but it still has a vote. A hidden node is also a priority zero node, with the difference that a hidden node is not visible to the application from the driver's perspective. And there is also the delayed node, a secondary with a configured replication delay, usually used for disaster recovery purposes, taking backups, and things like that.
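As a rough sketch of these member options in mongosh, assuming hypothetical hostnames and the field names used in MongoDB 5.0 and later:

```javascript
// Minimal sketch: one primary-eligible member, one priority-0 member that
// can never become primary but still votes, and one hidden, delayed member
// suitable for backups. Hostnames are hypothetical.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.com:27017", priority: 2 },
    { _id: 1, host: "db2.example.com:27017", priority: 0 },
    { _id: 2, host: "db3.example.com:27017", priority: 0,
      hidden: true, secondaryDelaySecs: 3600 }   // 1 hour behind, invisible to drivers
  ]
});
// Drivers pick where reads go via the read preference, e.g. in the URI:
// mongodb://db1,db2,db3/?replicaSet=rs0&readPreference=secondaryPreferred
```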
Another feature of MongoDB replica sets is configuring the secondary nodes with tags, and this is good for scaling reads. Say we have our primary node in Singapore; there are seven voting members in this replica set, and the rest are non-voting members. If we configure tags on the replica set members, we can do local reads. If our application is North America based, then with tag options we can direct its reads to just the nodes tagged North America. And we can do the same with Europe, or Australia, or Japan, or whatever we configure with the tags. This way we can scale our reads by adding more members, as well as use data locality. That works just for reading data; any write that the North America application does will still have to go to the primary node.
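A sketch of that tag configuration in mongosh, assuming members[1] and members[2] are the secondaries we want to tag; the region names are hypothetical:

```javascript
// Tag secondaries by region so applications can read locally.
let cfg = rs.conf();
cfg.members[1].tags = { region: "NorthAmerica" };
cfg.members[2].tags = { region: "Europe" };
rs.reconfig(cfg);

// A North America based client then targets matching members:
db.getMongo().setReadPref("nearest", [{ region: "NorthAmerica" }]);
db.orders.find({ status: "open" });  // read served locally; writes still go to the primary
```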
So let's move into scaling. In general, for every process, including databases, when the working set outgrows the memory, it's usually time to scale. With a database, we usually want the indexes plus the working set to fit in memory. As the database grows over time, at some point the working set will outgrow the memory, and we have to start thinking about adding more memory, or even more CPU power, to the instance hosting the database.

The first thing we can do with any scaling, for applications or databases, is grow the instance. So when we think about vertical scaling: the first machine worked for a while and served its purpose, and now we need to grow vertically, to make the machine bigger in terms of computing power. Here is an example from GCP: if we start with a small instance with, say, 8 gigabytes of memory, that will serve us for a while. As soon as our database outgrows it and the working set no longer fits, we need to increase the instance size. And we can do that only up to a certain limit: vertical scaling cannot go above some physical limitations. In GCP, for example, we cannot go above the m2-megamem-416 instance. And even before we get there, this will probably cost us a lot more than going with horizontal scaling.

Horizontal scaling is, instead of building the machine up by increasing memory and CPU, partitioning the data and parallelizing the workload, so that more instances working in parallel do the same job and return the results. In the computing world, that means adding more instances in parallel. But adding instances in parallel means more operational work, so it adds some complexity: when we query, we need to know whether our data is on machine A, B, or C, how we return the data, and whether we really need to query all of these machines or can send the query to just a single one. If we start with horizontal scaling from the beginning, it is more expensive than vertical scaling. But over time, the cost of horizontal scaling grows linearly, while the cost of vertical scaling grows almost exponentially. So there comes a point where we need to decide whether to keep growing vertically or to go horizontal.

All these complexities of scaling with MongoDB are solved by MongoDB sharding. A MongoDB sharded cluster is an environment where we split the data across multiple instances: we shard. It's not just adding a pool of resources; the data is actually partitioned across multiple shards. The components of a MongoDB sharded cluster are: the shards, which is where the data actually lives, partitioned; the config servers, which hold the cluster metadata, the information about where each range of documents lives; and the MongoDB router, or mongos, which is the interface between the application and the cluster. The router gets the metadata from the config servers and routes the queries.

Each shard holds a subset of the entire data in the cluster. Each document in the cluster exists only once, on a single shard. There is a primary shard for each database in the system, and the primary shard is chosen when the database is initially created through mongos: the router decides which shard will be the primary shard for that database. So if a database has a collection without sharding enabled, that collection will exist only on the database's primary shard, and it doesn't get distributed across the shards automatically. We need to enable sharding for the collection and decide on the shard key, and then it gets distributed. So in this example, collection one has sharding enabled, and it is more or less immediately distributed across the shards. Collection two doesn't have sharding enabled, so it exists only on shard A, where the primary shard of its database is. And that's important, because if we have many collections that are not sharded, we are potentially making a single shard a bottleneck, since that's where most of the queries for those collections will go.

The config servers hold the metadata of the cluster: when we decide to shard, that's where all the information is stored about which documents live where, and how mongos should route queries to the shards to get the result. Starting with MongoDB version 3.4, which went end of life a long time ago, the balancer runs on the primary node of the config servers, and the config servers also run as a replica set for availability. The config servers additionally hold the admin database, which stores all the information about users, authentication, and privileges on the system. The balancer, and we'll talk about it more later, is a process for which we can define a balancing window in the cluster: we can enable balancing during the night, and during the day the balancer stops and doesn't migrate any data, so it's flexible.

The last piece of the sharded cluster is mongos, which is just a router and doesn't have any persistent state. It keeps all the metadata in memory, refreshing it from the config servers, and when a query comes in, it decides which shard to send the query to.
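As a rough sketch of enabling sharding for a collection, as described above, in mongosh connected to a mongos; the database and collection names here are hypothetical:

```javascript
// Until sh.shardCollection() runs, the collection lives only on the
// database's primary shard.
sh.enableSharding("appdb");                              // appdb gets a primary shard
sh.shardCollection("appdb.orders", { customerId: 1 });   // range-based shard key
sh.status();                                             // inspect shards and chunk distribution
```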
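And a sketch of the balancer window just mentioned, assuming we only want chunk migrations to run overnight:

```javascript
// Allow chunk migrations only between 22:00 and 06:00.
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "22:00", stop: "06:00" } } },
  { upsert: true }
);
sh.getBalancerState();  // verify the balancer is enabled at all
```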
Since version 4.4, mongos supports hedged reads, which means that if we are reading from a secondary, it sends the query to multiple secondary nodes in the shard, and the fastest response is the one returned to the client via the driver.

So there are two options for sharding with MongoDB. The first one is range-based sharding, which divides the data into contiguous ranges. With this type of sharding, documents with close shard key values are likely to live in the same chunk, and potentially on the same shard. So when we send a range query to the cluster, it's likely the query will be served by a single shard, and with that we get query isolation. We are potentially eliminating the scatter-gather query, a broadcast query that has to go from shard 1 to shard n, scan the documents on each, and then merge the results. Range-based sharding, if the shard key is properly chosen, should eliminate that and give us better query isolation.

If instead we go with hash-based sharding, the cluster computes a hash of the shard key for every document; the application doesn't need to compute anything, the cluster applies the hash function. With this, the data is randomly distributed. So we can get better data distribution across the shards, but at the expense of poor query isolation when we are scanning a range of shard key values.

The data in the cluster is organized in logical units called chunks, which are contiguous ranges of shard key values. Starting with MongoDB version 6, the default chunk size is 128 megabytes; previously it was 64. It's customizable: we can configure it between 1 megabyte and 1 gigabyte. If we lower the chunk size toward 1 megabyte, we'll have a lot of chunks and the data will be more evenly distributed, at the expense of many chunk migrations across the shards. If we go with a larger chunk size, there will be fewer chunk migrations in the cluster, but the data might not be as evenly distributed. Also starting with MongoDB version 6, which is the latest, the cluster balances data instead of chunks: previously the balancer just made sure each shard had the same number of chunks, and starting with version 6 it makes sure each shard has the same amount of data.

Some tips on choosing a shard key, since we're running out of time: a good shard key should have very high cardinality and low frequency, and it shouldn't be monotonically changing. And, the same as with replica set tags, we have zones as a configuration option, so we can localize data. Let's say we want to localize data in North America: we can tag the shards as North America, and documents whose shard key falls in the North America zone will live only on those shards. This is useful, for example, with the GDPR rules: we often have applications that need to keep their data in Europe, so we can localize the data to live only in Europe.
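A sketch of the hashed shard key and the chunk size setting described above; the names are hypothetical, and the chunk size value is in megabytes:

```javascript
// A hashed shard key randomizes placement for more even write distribution.
sh.shardCollection("appdb.events", { deviceId: "hashed" });

// Chunk size is tunable between 1 MB and 1 GB (default 128 MB since 6.0).
db.getSiblingDB("config").settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 256 } },   // megabytes
  { upsert: true }
);
```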
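And a sketch of the zone localization just described, assuming "shardEU1" is the name of a shard and appdb.users is sharded on { country: 1, userId: 1 }:

```javascript
// Pin a shard key range to EU shards for data locality.
sh.addShardToZone("shardEU1", "EU");
sh.updateZoneKeyRange(
  "appdb.users",
  { country: "DE", userId: MinKey },   // lower bound of the range
  { country: "DE", userId: MaxKey },   // upper bound of the range
  "EU"
);
```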
To summarize, some best practices for deploying and scaling applications on MongoDB: sharding should be a last resort, when we no longer have vertical scaling options. If we can run the system as a replica set, it's better to keep the replica set rather than a cluster, and move to a sharded cluster only when we really need to scale. And that's all. Thank you.

Thank you very much, Igor. Do we have questions for Igor?

Thank you for the talk. I have a question: when you do sharding, you always introduce some problems, like data consistency between shards, something like this. I have heard that Mongo added support for distributed transactions. What is the current state of this work in Mongo?

Yes, there are distributed transactions in MongoDB. A transaction is basically: when we commit, it's either committed, or if we want to roll back, it's rolled back. But that's not the bigger problem when we want to save data in a distributed system. There are other problems even at the replica set level; I don't know if we have enough time, but we can discuss. If a document is inserted on a primary node and that primary fails, there is no guarantee it was replicated to the secondaries, and a secondary that hasn't replicated the document may be elected as the new primary. So we need to make sure the write concern is set properly. MongoDB has write concerns, which can be 1, 2, up to n, the number of nodes, or majority, which means the write gets replicated to a majority of the nodes before it is acknowledged. That's what guarantees our writes happen across the replica set, and that includes the cluster.

Do we have any other questions for Igor?

From the sharding and distribution standpoint, is there any built-in tool or some tooling to show the distribution? How do you know if the data is equally distributed or not?

Yes, there are commands that can be run on a mongos: sh.status(), for example, shows the status of the cluster, and at the collection level there is db.collection.getShardDistribution(), which prints out a per-shard summary: how many documents, the average document size, and the percentage of data on each shard.

Do you have to run statistics first? No, it's already updated in the metadata. OK, it's automatically updated. Thanks.

Thank you for your talk. Are there any recommendations for hash-based sharding instead of range-based sharding, or perhaps some use cases?

Yes, hash-based sharding is in general better for more even data distribution, and when we have more write-intensive workloads. As for range-based sharding, even though the latest versions of MongoDB have sorted out its issue with uneven data distribution, its benefit is that we get targeted, per-shard reads and we eliminate broadcast operations. A broadcast operation on the cluster means we send the query to all the shards, and with that latency the query response is slower, instead of directing the query to a single shard and getting the result back.

That's all the time we have for questions. Thank you very much, Igor.
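For reference, a short sketch of the write concern and distribution commands discussed in the Q&A; the collection and field names are hypothetical:

```javascript
// "majority" write concern: the insert is acknowledged only once it has
// replicated to a majority of voting members.
db.orders.insertOne(
  { customerId: 42, status: "open" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);

// Checking how data is distributed across shards:
sh.status();                       // cluster-wide overview on a mongos
db.orders.getShardDistribution();  // per-shard doc counts, sizes, percentages
```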