Right. So this is Mark Smith. I don't think Mark needs an introduction. He's working for MongoDB, and he's going to tell us lots of things that you probably don't know about MongoDB. So off you go.
Thank you. Hi, everybody. I'm generally known as judy2k online, for reasons that aren't that interesting. I joined MongoDB last December, so I've been there around seven or eight months. I'm reasonably experienced: in the past I've built systems on top of MySQL and Postgres, and lots of stuff on SQLite, Redis, CouchDB and Solr, but I'd never actually used MongoDB before. So I've been learning a lot over the last seven to eight months, and there were two primary things that I learned. One is that MongoDB is really powerful and kind of fun. The other is that almost everything that anyone says about MongoDB online is wrong. So I'm going to spend the next 20 to 25 minutes trying to bust a few myths and give you an idea of what MongoDB is and how it works.
This talk is primarily aimed at people who don't currently use MongoDB, might be interested in what it is, and would like to work their way through some of the misinformation that's online. If you use MongoDB regularly, you probably know what it does, but it is a big and complex product, so hopefully you might learn something new anyway.
Before I get started on the real myths: there's this video on YouTube, and it involves two dogs having an argument. One uses a lot of technical jargon and a lot of slogans to describe how amazing MongoDB is. The other dog is a bit more down to earth and gets increasingly frustrated with the first dog and how unrealistic it is. When I announced that I was going to work at MongoDB, a friend of mine immediately sent me a link to this video, just in case I hadn't seen it. And the thing is, this video was published in 2010.
It's 10 years old. You might think that no one at MongoDB has seen it, but nothing could be further from the truth. This is my colleague Max and, like me and everyone else at MongoDB, he has seen the video on YouTube. He has also bought the t-shirt: he's wearing his "MongoDB is web scale" t-shirt. If you were to walk around our head office in New York, you would see this sticker on about half the laptops in the office. So the next time you're tempted to send this 10-year-old video to somebody who works for MongoDB, I'd really appreciate it if you could think twice, because it was funny at the time, but it's almost all incorrect now. And I'm going to counteract a lot of the misinformation that it's been so easily spreading for about 10 years.
Now that I've got that out of the way, I'm just going to give a very high-level overview of what MongoDB is, in case you really haven't touched it at all.
MongoDB is a clustered database. The minimum number of machines you can have in a cluster is three. If you want to have high availability and you don't want to lose any data, you need an odd number of machines in your cluster so they can have a little conversation among themselves and elect a primary. The primary is the machine that essentially handles all the connections to the cluster. Once a primary is elected, all the clients, whether it's your Python client or any of the other language clients that can be used to connect to MongoDB, will connect to the primary. It's the primary's responsibility to stream data down to the secondaries. They all store the same data, but the secondaries can be slightly out of date. So this is for redundancy and data resilience rather than scalability, because the primary essentially acts as a bottleneck to the cluster.
You can force your client to connect to one of the secondaries to do, say, analytics queries on a machine that's under slightly less read load. That's kind of an interesting model, but you have to understand that you will be working with stale data if you're talking to a secondary, because the data comes in through the primary first.
I have to keep moving backwards and forwards between my notes and what's on the screen. So what do we store in this database cluster? Well, MongoDB is a document database, so we store collections of documents. A collection is like a table in a relational database, and a document looks a bit like this one on the right. It's a map of keys and values. This document in particular is from our sample movies database. It's for a movie called Blacksmith Scene that was filmed in 1893. It's one minute long, and it involves a blacksmith hammering at an anvil, then taking a break, wiping his brow, opening a beer and passing it around, and then getting back to hammering on the anvil again. It's like a TikTok video from 1893, which I think is kind of cool, given I didn't even realize they had movie cameras in 1893.
I'd like to highlight a couple of things about this piece of data that I think are kind of interesting. The first is that it's hierarchical and kind of multi-dimensional. The cast value here is a sub-array, and you can update individual parts of this document individually, so you don't have to update the entire document each time. I can append items to this array if I want to, or I can insert or delete items from the array. The other thing is that the imdb value is what we call in MongoDB a sub-document, but in Python you would call it a dict, and in JSON you might call it an object.
The other thing is that there are some values here of types that aren't available in JSON, and I'll talk about that a little bit more in a moment.
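As a rough illustration of updating part of a document rather than the whole thing, here is a minimal sketch. It assumes a PyMongo-style collection object (anything with an `update_one(filter, update)` method); the helper names and the placeholder field values are hypothetical and are not taken from the slide.

```python
from datetime import datetime

# A trimmed-down stand-in for the movie document on the slide.
# Cast names and the imdb numbers are placeholders, not real data.
movie = {
    "title": "Blacksmith Scene",
    "year": 1893,
    "runtime": 1,
    "cast": ["Actor A", "Actor B"],          # sub-array
    "imdb": {"rating": 0.0, "votes": 0},     # sub-document
    "released": datetime(1893, 1, 1),        # placeholder date
}


def add_cast_member(collection, title, name):
    """Append one name to the `cast` array without rewriting the document."""
    return collection.update_one({"title": title}, {"$push": {"cast": name}})


def set_imdb_rating(collection, title, rating):
    """Change a single field inside the `imdb` sub-document via dot notation."""
    return collection.update_one({"title": title}, {"$set": {"imdb.rating": rating}})
```

With PyMongo, `collection` would typically be something like `client["sample_mflix"]["movies"]`; that database name is an assumption about which sample dataset is on the slide.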
The _id is an ObjectId, which is a special type we use for generated IDs in the cluster, and released is displayed here as a native Python datetime object. It's not actually stored in the database as a Python datetime object; it is a native MongoDB datetime, and the Python driver converts that into something that's useful for you as a Python developer when it's retrieved from the database. I'll be covering why this is important in a later slide.
So the first myth about MongoDB is that it's on version 2.4. You won't see people promoting that fact online, but if you install a relatively recent version of Debian, Jessie, and run apt install mongodb, you will get version 2.4. The problem is that version 2.4 was released in 2013, so it's getting on for eight years old, and there have been seven or eight major releases since then, each of them fixing bugs, adding features and improving performance. In fact, the entire storage engine has been rewritten since then. So MongoDB 4.4 is almost a completely different database to 2.4. If you're interested in installing an up-to-date version of MongoDB, Google "MongoDB community", follow the first link that goes to the MongoDB website, and follow the instructions for installing it on your favorite Linux distribution, Mac or Windows.
The second myth, which I've already alluded to, is that MongoDB is a JSON database. You will see this quite a lot. It's a very pervasive myth. In fact, at the moment it says on the MongoDB homepage that we store rich JSON documents. But this isn't actually true. MongoDB is a BSON database, and this may sound like a minor technicality, because BSON is a binary (actually, I don't know what the S stands for), anyway, it's a binary object notation. But it's not just a binary version of JSON.
BSON is much more efficient to store and traverse than JSON, which you would kind of expect from the fact that it's binary, but it also includes extra data types like the ones that I showed you before. The ones you'd really care about as a developer are these: it can efficiently store binary blob data; it stores datetimes natively, so you can query against different aspects of a timestamp that's stored in the database; and it includes various numeric types, like decimal numbers that are good for storing currency. This is important because BSON is totally fundamental to MongoDB. It's the format of the protocol that's used to talk to the server. It's not a REST query that you use to talk to MongoDB; there's a binary streaming protocol built on BSON. Database queries are actually BSON structures. Database results are BSON. MongoDB is fundamentally a BSON database.
Another thing you will hear about MongoDB everywhere is that it doesn't support transactions, that it's a BASE database. I forget what BASE stands for: basic availability, soft state, eventual consistency. Two of those have never been true. There's a reason for this myth existing, which is that until two years ago we didn't support transactions. Two years ago we added transactions to MongoDB in version 4.0, and then last year transactions were extended to support sharded clusters. Now whatever type of MongoDB cluster you have will support transactions.
Despite this, it's not as necessary to use transactions in MongoDB as it is in traditional relational databases. Because we have a rich document format that allows you to store nested, related data together in a document, you don't need to do so many joins across collections or tables to update data, and you don't need a transaction to ensure that those updates are done atomically. Updates within a single document are atomic by themselves.
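Going back to the extra data types mentioned above: one quick way to see why plain JSON is not enough is to ask Python's standard `json` module to encode values of those types. This is only a sketch using the standard library; the `record` contents and the helper name are made up for illustration.

```python
import json
from datetime import datetime
from decimal import Decimal


def json_unfriendly_fields(document):
    """Return the keys whose values the standard json module refuses to encode."""
    rejected = []
    for key, value in document.items():
        try:
            json.dumps(value)
        except TypeError:
            rejected.append(key)
    return rejected


# Hypothetical record mixing JSON-friendly and JSON-unfriendly values.
record = {
    "title": "example",
    "released": datetime(1893, 5, 9),   # a date and time
    "thumbnail": b"\x00\x01\x02",       # binary blob data
    "price": Decimal("9.99"),           # exact decimal, e.g. for currency
}

print(json_unfriendly_fields(record))
```

BSON has native date, binary and decimal types for exactly these cases, which is why a driver can hand you back a real `datetime` instead of a string that you have to parse yourself.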
If you have a good database design and you're storing related data together, then you can update all of that data in one atomic operation. Having said that, now that MongoDB supports transactions, for all those times where you do need to do updates across different collections or across different documents, that facility is available too. But if you're doing it too much, you probably need to look at your data design, your model design, and try to factor that out as much as possible.
Another thing that goes hand in hand with the transactions thing is the claim that MongoDB doesn't support relationships: that you can retrieve multiple documents, but you can't join across collections. This hasn't been true for quite some time. We support joins, left outer joins, and have done for quite some time, using a type of query called an aggregation pipeline. Aggregation pipelines have been supported since 2.2.
I'll show you just quickly what an aggregation pipeline actually looks like. It's a set of operations you can conduct on a collection. They're optimized, so they can be reordered or filtered out based on what will work best with the data and the type of query that you're doing. One of these operations is called a lookup operation. Here I'm conducting a single aggregation operation on the orders collection, starting at the top. This is doing a left outer join with the inventory collection where orders.item equals inventory.sku. It will embed the resulting documents in a field called inventory_docs.
This becomes a bit clearer when you can actually see the kind of data that's returned. Here it's looked up some documents matching that relationship in inventory, and then it's embedded them in the resulting order document.
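The lookup described above can be sketched in Python roughly as follows. The collection and field names (orders, inventory, item, sku, inventory_docs) come from the talk; the sample documents and the `emulate_lookup` helper are assumptions, included only to show the shape of the result without needing a running server.

```python
# The aggregation pipeline: a single $lookup stage, run against `orders`.
lookup_pipeline = [
    {
        "$lookup": {
            "from": "inventory",        # collection to join with
            "localField": "item",       # field in the orders documents
            "foreignField": "sku",      # field in the inventory documents
            "as": "inventory_docs",     # array field that receives the matches
        }
    }
]


def emulate_lookup(orders, inventory):
    """Pure-Python stand-in for the join, only to illustrate the result shape."""
    results = []
    for order in orders:
        matches = [doc for doc in inventory if doc.get("sku") == order.get("item")]
        # Left outer join: every order is kept, even when nothing matches.
        results.append({**order, "inventory_docs": matches})
    return results


# Made-up sample data.
orders = [
    {"_id": 1, "item": "almonds", "quantity": 2},
    {"_id": 2, "item": "walnuts", "quantity": 1},
]
inventory = [{"_id": 10, "sku": "almonds", "instock": 120}]

for row in emulate_lookup(orders, inventory):
    print(row)
```

With PyMongo the real query would be along the lines of `db.orders.aggregate(lookup_pipeline)`, where `db` is a database handle you have already opened.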
Here, this happens to only have one embedded document, but because this is a one-to-many relationship, it's an array of documents that comes back, so you could have multiple documents in here. I quite like this, because when you do this kind of query in a relational database, it flattens it down into duplicated rows in the result set, whereas MongoDB takes advantage of the fact that it's a hierarchical document format and embeds the matches in there. So not only does MongoDB support relationships and joins, I think it actually does it quite intuitively.
Another thing people will talk about quite early on, if you're discussing the pros and cons of MongoDB, is sharding, and there's a reason for this. Sharding is a pretty cool feature. Sharding is when you take your entire dataset and you divide it into separate pieces. So you take all of your data, you maybe divide it down the middle, and then you have what are called two shards. They're two separate datasets that are related. You take one of those datasets and put it on a cluster, and then you take the other dataset and put it on another cluster. So you now have two shards, and when you do queries, you ideally send the queries to the cluster that holds the data that you want to get back.
The problem with this is that, as I said, you need a minimum of three machines in a cluster. As soon as you have sharding, you need three machines per shard, so with two shards you're looking at a minimum of six database servers. And because you need to send the queries and the data updates to the correct shards, you need some servers in front of them to negotiate that. So you need a shard server. And because you want redundancy in case one of the machines goes down, you actually need a minimum of two of these machines.
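The machine count being built up here is simple arithmetic, and a small sketch makes it easy to try other shard counts. The function name and defaults are hypothetical, and it counts only the components mentioned in the talk.

```python
def minimum_machines(shards, members_per_shard=3, routers=2):
    """Minimum server count using the tally from the talk.

    An unsharded deployment is just one cluster of `members_per_shard`
    machines. Once there are several shards, each shard is its own cluster
    and a redundant pair of routing servers sits in front of them.
    """
    if shards < 1:
        raise ValueError("need at least one shard")
    if shards == 1:
        return members_per_shard
    return shards * members_per_shard + routers


print(minimum_machines(1))  # a single cluster, no routing layer
print(minimum_machines(2))  # two shards plus two routing servers
```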
So you've gone from a minimum of three machines, which is a significant cost in a production environment, to a minimum of eight machines.
So what we generally recommend, if you're working with large data that isn't currently performant on your cluster or won't fit on your current machine, is that you look at upgrading your machine first. As soon as you start adding shards, you start to limit the types of queries you can do, and you add a huge amount of complexity and operations cost in terms of actually managing the cluster. So, for example, if your data isn't fitting in memory, buy more RAM. It's probably cheaper than buying another five servers. Or buy a faster processor, if that's your bottleneck. Essentially, look at the thing that's limiting your scaling and attempt to upgrade that first. Look at the cost of that first. Having said that, sharding is there for you if those aren't solutions at the current time.
Another really interesting thing that sharding can do is this: if you shard on the location of the users who are accessing the data, you can move that shard's cluster to be geographically closer to those users, so they get less latency when they're doing queries against your database, which I think is kind of cool.
But if you are looking at upgrading your cluster, I recommend you look at MongoDB Atlas, which is the database hosting service that MongoDB runs. They'll run your database for you, and they will handle scaling your cluster up and down as required, so you'll only be paying for the kind of usage you require. It still supports sharding and things like that if you need them at a later date. It will also take away a huge amount of operational spend in terms of doing backups, handling redundancy and so on.
So the reason people talk about sharding is that micro-sharding used to be a big thing.
So back in 2.4, if I remember correctly, MongoDB had a lock in the storage engine that meant that it could only efficiently use one core on a machine. It was a bit like the global interpreter lock in Python in that respect. Actually, I apologize: it was before MongoDB 3.2. So some enterprising DBAs discovered that if they sharded their data and ran multiple shards on a single machine, they could make use of all the cores on that machine, which was quite clever, but a bit of a hack. This got known as micro-sharding. Since 2015, MongoDB has had a non-blocking storage engine called WiredTiger, which doesn't require this anymore, so it makes full use of the cores on your machine. So people used to talk about sharding because it was required to work optimally on multi-core machines, but this is another historical thing.
Oops, that wasn't meant to come up. These are slides that were actually disabled, and somehow they've been exported to the PDF anyway. Could we move to the next one? I'll keep on saying next until we get there. Stop. It was the last slide, please.
So myth number six: MongoDB is insecure. There have certainly been data breaches with MongoDB, and it has developed a bit of a reputation for it, which I think is slightly unfair. The cause of this is that, in most distributions, the older versions of MongoDB automatically bind to the network and automatically start up with no authentication. Hopefully, when you're hosting a production server or anything on the internet, you have a firewall either on your machine or in front of your machine, or both. It's the default on Amazon Web Services to have a firewall that only exposes your SSH port.
Unfortunately, some less experienced developers, I would say, would deploy their services as a separate app server and database server, and when they found that they couldn't connect to the database server, they would log into all the firewalls that were available and essentially open up the MongoDB port, and at no point think about adding authentication to their database. What this means, when you expose that port to the internet, is that a bot will find your instance of MongoDB and essentially steal your data. Or, these days, it encrypts all the data within your database and adds a document telling you where to send Bitcoin to get your data unlocked.
I would argue that if, as a DBA, you put an unsecured database on the internet and somebody steals your data, then it's more your fault than anything else, but modern versions of MongoDB don't follow this behavior anymore anyway. Nowadays, MongoDB won't bind to the network. It can only be accessed from localhost until you've added some authentication. You can use an override flag to override that security feature if you want to. You can still stick unsecured instances of MongoDB on the internet if you really want to, but I strongly recommend that you don't do that. On top of this, MongoDB uses industry-standard security. It uses TLS for connections by default. It uses SCRAM-SHA-256 for authentication. There's no reinventing the wheel here. MongoDB is no less secure than any other database you might use.
Another quite persistent myth about MongoDB is that it loses data, and I will try to explain where this myth comes from. First, I would just say it's difficult to prove a negative. MongoDB is used and trusted by some big banks like Barclays and Morgan Stanley, and I think that would be unlikely if it lost data. It's also used in a bunch of other industries that really care about not losing their data.
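Going back to the authentication point for a moment: a connection string that asks for TLS and SCRAM-SHA-256 explicitly might be built like this. It is a minimal sketch using only the standard library; the user name, password and host are placeholders, and the helper name is hypothetical. `tls` and `authMechanism` are standard MongoDB connection string options.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(username, password, host, port=27017):
    """Build a connection string that requests TLS and SCRAM-SHA-256 auth.

    The user name and password are percent-escaped so that characters such
    as '@' or '/' cannot be mistaken for URI delimiters.
    """
    user = quote_plus(username)
    pwd = quote_plus(password)
    return (
        f"mongodb://{user}:{pwd}@{host}:{port}/"
        "?tls=true&authMechanism=SCRAM-SHA-256"
    )


# Placeholder credentials and host, for illustration only.
print(build_mongodb_uri("app_user", "p@ss/word", "db.example.internal"))
```

The resulting string is what you would hand to a driver, for example `MongoClient(uri)` in PyMongo.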
The reason people lose data in MongoDB is that MongoDB actually allows you to trade off between the robustness of your data and performance, and the default is perhaps not ideal. When you send an update to your MongoDB cluster, it uses a configuration option called a write concern, which by default is set to one. This means that as soon as the data has been accepted by one of the servers in your cluster, which will be the primary, you get an acknowledgement back that the data has been accepted. Unfortunately, this means that if you get your acknowledgement back and then the primary goes down with a catastrophic disk failure, you've lost your data. That's not ideal if you care about not losing your data. So almost every time, unless you really, really care about squeezing out all the performance you can and you don't mind losing some data from time to time, you should use a write concern of majority. In a cluster of three machines, that means the write will be accepted by two of those machines and written to disk before you get an acknowledgement back. It's a little bit slower, but you won't lose your data. Again, this kind of stems from people not quite knowing what MongoDB is or how to use it, but I would say that the default is unhelpful in that case.
So this next one really sums up all of the other myths that I've covered. People will talk about MongoDB being really easy, and I've certainly found it very easy to get started. Storing data in the database is really easy. If you have some JSON data, or just some data structures in Python, you can just start storing them in MongoDB without really knowing how to use the features in there efficiently, how you might lose data, or how you might squeeze some extra performance out of it. You don't even need to know relational theory like you do with a relational database. You can just get started.
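To make the write concern trade-off described above concrete, here is a toy calculation of how many servers must accept a write before it is acknowledged. It is only an illustration of the arithmetic, not MongoDB's implementation, and the function name is hypothetical.

```python
def acks_required(write_concern, cluster_size):
    """How many servers must accept a write before the client hears back.

    `write_concern` is either a positive integer (1 means the primary alone)
    or the string "majority".
    """
    if write_concern == "majority":
        # More than half of the cluster: 2 of 3, 3 of 5, and so on.
        return cluster_size // 2 + 1
    if isinstance(write_concern, int) and 1 <= write_concern <= cluster_size:
        return write_concern
    raise ValueError(f"unsupported write concern: {write_concern!r}")


for concern in (1, "majority"):
    print(concern, "->", acks_required(concern, cluster_size=3))
```

In PyMongo, one way to ask for this on a collection is `collection.with_options(write_concern=WriteConcern(w="majority"))`, with `WriteConcern` imported from `pymongo.write_concern`; another is to put `w=majority` in the connection string.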
And this is kind of a problem overall, I would say, in terms of its image, because you've got a lot of inexperienced users. So its strength is also maybe one of its biggest disadvantages from a marketing perspective. What you really need to do is learn how to use MongoDB properly, from both an operations and a development perspective, and MongoDB themselves provide two paths to this. The documentation is really comprehensive and is always being expanded and revised. But less well known is that MongoDB runs this thing called MongoDB University, which is a bunch of online courses that you can do, in many cases for free, that dive into different topics around MongoDB, such as hosting, indexing or complex aggregation pipeline queries, and really help you become an expert in using the product. So if you do decide to take up MongoDB and use it in production, I recommend doing a few of these courses to make sure that you're really using it properly and that you know all the features that are available to you.
I can't quite remember what time I had until. Is it quarter to?
So we have two minutes left officially, but we have a longer break after this, so maybe you can overrun by a minute or two.
I think I only have two minutes anyway. So while I have an audience, I'd just like to pitch something that I personally have been working on recently. My team is taking the Johns Hopkins University COVID-19 dataset, which is stored as a bunch of CSV files that every so often change format or get updated on GitHub incorrectly, and we've turned it into a queryable MongoDB database cluster. It has a username and password of readonly, readonly. If it sounds interesting, the bit.ly link here will take you to all the documentation on how to connect to it and use it from lots of different platforms.
So if you just want to have a play with MongoDB, with an interesting dataset that might teach you something useful, this is a good place for that. If you decide to build something, either with it or without it, that's based around COVID-19 and is in some way good for humanity, we're offering free credits for anybody doing that. So if that sounds interesting as well, please do get in touch with me on the Discord.
This slide had a lovely build-up while I was talking, but, you know, just go straight to the end. So now, unlike most of the people on the internet, you will hopefully be right about MongoDB at least some of the time. Thank you very much.
Thank you very much for the excellent talk. That was just in time.