Now, let me introduce our speaker for today, Asya Kamsky. Asya is a senior solutions architect with 10gen, helping customers get the most out of their MongoDB deployments. She has over 20 years of industry experience, ranging from big companies like Cisco, GE, DEC, and Lawrence Berkeley Labs to cutting-edge startups like TGV Incorporated, eGreetings, RouteScience, and Elemental Security. Her career has spanned work in database technology, security, software testing, networking, and the web. And with that, I will give the floor to Asya.

Hello and welcome. Thank you very much, Anna. Good morning, good afternoon, everyone. I was going to talk about NoSQL. Oh, I can barely hear you. Today we're going to talk about NoSQL: what it is, how you might know whether you want to use it, and we'll take a close look specifically at MongoDB. Asya, can you move the mic just a little bit closer? We're still having a tough time. That's strange, I'm speaking right into the mic. How's this? Better? Much better, yeah. Well, the audience is saying no. No? Okay, one second. Testing, testing, testing. There. How's this? Much better. Okay. I will speak up. All right.

Okay, take two. Good afternoon, everyone. We're here to talk about NoSQL: what it is, how it works, why you might want to consider it, and we'll take a closer look specifically at MongoDB, one of the leading NoSQL solutions.

So, a little bit of history. Relational databases have been around since the 1970s, and things were very different back then. Storage was expensive, and normalizing data conserved that storage. It also allowed abstracting the data layer from the application, which can sometimes be good, but sometimes not quite so good. Through the 80s, relational databases became incredibly popular. They were commercialized; there were a lot of vendors. We became familiar with the client-server model, where multiple clients could be interacting with the server.
And SQL became the standard through which applications could talk to the server.

Are there still problems with the audio? I'm hearing you fine, but the audio is coming out a little poppy. Yeah, I'm seeing a lot of participants' comments that they can barely hear me. Yeah. Please hold one second. Sure. Thanks, everyone, for your patience. We just want to make sure we get this set up. Testing, testing, testing. That's much better. Audience, please comment. Better, better. Excellent. My apologies. Shall we start over? Sure. Our apologies. Thanks, everyone, for your patience.

Good morning, good afternoon. My name is Asya Kamsky. We're going to talk about NoSQL databases today: what they are, how you might want to use them, and we'll take a closer look specifically at MongoDB, one of the leading non-relational databases.

A little bit of history. In the 1970s, when the relational database was first invented, storage was very expensive. Normalization of data acted both to reduce the redundancy of the data and to abstract the data storage and the data structures away from the application. In the 1980s, we saw relational databases significantly increase in use. The client-server model allowed many different applications to query against the database, and a standard language, the Structured Query Language, was established that you could use to talk to multiple different databases.

Things became quite different in the 90s. First of all, we saw the rise of the three-tier architecture, where a thin client talked to an application server, which talked to the database. The rise of the internet and the web saw scaling needs grow by orders of magnitude. That was the time when we learned to balance web traffic across many, many web servers, which may have been talking to multiple app servers, but all of them were talking to a single database in the back end. In the 2000s, the problem continued to grow.
The rise of social media meant that there was less content that you could cache statically on the edges of the net, because people were constantly tweeting, updating their statuses, and changing their preferences and their friend networks. E-commerce also became a lot more accepted and popular. Hardware became cheaper, and the amount of data we were collecting grew tremendously. However, the database was still a single bottleneck at the back of all of these systems that collected this data, needed to update it, and needed to serve it up very, very fast. So we had a dramatic need to scale. The question is: how do you scale the single bottleneck that sits at the back of everything?

So the database space, over the last decade, essentially broke up into two distinct paths. You have the operational data store. Its strengths were that it could handle complex transactions, and its tabular data was very, very good at serving up ad hoc queries. However, the price was that the objects the programmers viewed their data as did not map to these relational tables very well. That saw the rise of a lot of object-relational mapping layers, things that would map these structures automatically without developers seeing them. It was not super agile: it was generally difficult to change the schema significantly when needed. And there were speed and scale problems when you had to write a lot of data at the same time that you were reading.

For a completely different use case, we started seeing different types of databases aimed mainly at business intelligence and reporting. They still supported ad hoc queries, and in fact SQL was a standard protocol between all of the reporting clients and the server. They scaled horizontally better than the operational databases; however, at massive scale, we still saw quite a few limitations. The schemas are rigid.
So you pretty much had to know what it was you were going to be asking questions about before designing the implementation. And they didn't give you real-time data: they were great at bulk loading a large amount of data and then answering queries about the recent past. So, good for analysis, not so good for interactive use. Not that many issues on the reporting side; BI was mostly satisfying the use cases it was meant for. But on the operational side, there were major problems.

So what were some of the solutions that came out of this? MapReduce, where you chop up the problem, solve many parts of it in parallel, and then merge the answers back together, was a good batch-level solution for reporting and BI. The reply wouldn't necessarily come back right away, but it could handle a very massive amount of computation. On the OLTP side, things were a little harder. There were a lot of layers that could improve performance. For example, you could cache a lot of your data in memory, specifically in the form you needed it, to reduce the number of queries you'd have to run. Some of the data would be stored in flat files so as not to overwhelm the relational storage. You could do some partitioning of the operational data and handle that at the application level.

Meanwhile, the developers saw incredibly wide adoption of a methodology called agile. Now, agile does not actually cause shorter development cycles, nor does it cause the evolution of requirements; it was a response to the fact that requirements were constantly evolving. On internet time, people wanted to push changes to their sites every day or every couple of days. And you have to have flexibility at design time so that it's possible to make incremental changes and have very quick releases in response to changing business needs.
Now, as you might imagine, relational schemas, and I imagine most of you have probably worked with relational databases, are hard to evolve. If you make any changes, you then need to have this massive database, which may hold terabytes of data, go through a migration in order to alter the structure of some of the relational tables. The application changes have to stay exactly in sync, because it's very difficult to maintain an application that understands two different views of the schema. And because of object-relational layers, few developers really understood what data was under the hood. So a lot of times when implementing applications, they were not aware of the performance implications of how their program interacted with the data. All this made it very hard to scale.

So this is roughly how things might go. At the beginning of the project, everything is great. Then the performance starts suffering a little, so maybe the data model gets denormalized, and you stop using joins because you notice there are these queries that run forever that involve 35 different tables. Then you start building custom caching layers so you can run a query only once and save the results for the application. And custom sharding, which is just a different word for partitioning, where you split up the data so that multiple database servers can handle it. The application grows more complex, because it now has to know where to go for which data.

On the DBA side and the IT side, the scaling might have been addressed by simply getting bigger servers. The problem with that is that a server with four times the capacity usually costs a lot more than four times as much as the smaller server. And eventually that sort of thing maxes out, and you cannot, in fact, grow beyond a certain point, because computers simply aren't built big enough to handle it. So the real need is for horizontal scaling. Now, what is horizontal scaling?
That's when you are able to linearly increase the capacity of the system by simply adding more machines. You're using commodity hardware, some sort of Linux server, the average machine you might have on your desk or in your server room. You would now be able to get maybe 10 of those and have 10 times the throughput; 100 would give you another 10x improvement. There are more requirements for real-time queries; in other words, more data needs to be readable from the operational store in real time, rather than waiting for some sort of offline job to run to compute it. I already mentioned how agile, fast-moving business requirements demand faster development time, which usually requires a flexible data model. And, of course, everybody wants low up-front costs and low total cost of ownership.

The other main thing, as far as capacity goes, is that applications simply don't have constant needs. Some sites, maybe during holidays, might have higher demand; news sites might see spikes in demand during extremely important world events. Increasingly, everybody wants high availability for both reads and writes. And if you can handle multi-data-center distribution, it really helps with high availability.

So, in comes NoSQL. And what exactly is it? I kind of put it into the database space circle in its own area. It tries to increase both speed and scale, but there are always trade-offs. Ad hoc query capability may be limited in a NoSQL solution, or might not even exist. It's not very transactional; I'll talk a little more about joins and transactions and how they interact with the ability to scale. There's no standard for NoSQL. As you'll see, not only are there multiple different categories, there are multiple vendors in each category, and each has its own specific thing that it does best. Now, a lot of it fits object-oriented programming and design very, very well. And it tends to be really agile.
So, trading off some things in order to gain others. In general, I'm going to use NoSQL to mean the entire group of non-relational operational data stores, or databases. It encompasses a collection of very different products. Pretty much all they have in common is that they're not relational. They don't use the Structured Query Language for queries. Most of them don't have a predefined schema, or have some sort of flexible schema. And some allow richer data structures than others. In general, they are not tabular.

In relational databases, we have relations: that is, tables. Tables have rows, rows have columns. In general, you would expect all the values in tables to be populated. There are keys that refer to IDs in other tables; that's how they relate, and that's why we call them relational. On the NoSQL side, we have things like key-value pairs, where a key identifies a value or a set of values. A document may be a rich combination of multiple key-value pairs; you can also have XML documents. Graph databases are a very specialized use case; some argue they're not in the NoSQL group, but they're definitely not relational, and they address a different need. And there are column data stores, where a key can have multiple column families defined for it.

Traditionally, relational databases guarantee the ACID properties: atomicity, consistency, isolation, and durability. Some of those are traded off by some of these NoSQL solutions; some are simply addressed differently, or they happen at a different level. For example, in a relational database, you can atomically execute a number of operations: they will either all succeed or they will all fail. On the non-relational side, if you have a complex document with a large number of changes you need to make to it, you can do that atomically. Either all of the changes will land or none of them will land. But that's in the context of a single operation. So that's a very different model.
The other thing is that, in order to be scalable and distributable across a large network and many, many hosts, a lot of them have traded off consistency for what's called eventual consistency. The data will propagate to all the hosts, but not in real time. There's no two-phase commit: the transactions we're talking about are the atomic operations that can only happen at the document level on the NoSQL side. And the thing that actually allows the distribution is the fact that there are no joins on the NoSQL side.

Now, a join is where you have data in two separate tables and a single query relates the data together, maybe updating something in one based on values in the other. When you distribute the data across multiple hosts, at some point you have to give up on the idea that you are willing to wait if the data in two separate locations is not available at the same time. Maybe there's network lag, maybe just bad latency, or maybe some host goes down and you won't hear from it. Holding a lock on many records, invisible to the rest of the system, makes the system completely unscalable. And so there are very different semantics on the NoSQL side: the data is distributed across a shared-nothing architecture, but it does not give ACID guarantees.

So there are a lot of players in NoSQL land. In terms of the roots of the movement: before there were NoSQL products, many companies had extremely large amounts of data that they needed to deal with, as you might imagine. Google, Facebook, Amazon, LinkedIn, Yahoo. Each of them essentially wrote their own solution, and eventually all of them open sourced the code or published papers describing it.
It's great that these solutions have been productized, because if you are a small company hoping to grow, maybe not to the size of Amazon, maybe only one tenth of Amazon, which is still pretty big, you don't have the resources to write your own caching layer or your own custom NoSQL layer. But you do have the ability to take one of these open source programs and start using it, and hopefully, if you're super successful, you will actually need that level of scalability. So these are rough groupings: the key-value stores, most of which are originally based on either the PNUTS paper by Yahoo or the Dynamo paper by Amazon; column stores like Cassandra and HBase; graph databases; and document databases, of which MongoDB is one.

Now, there are a lot of different things you might want to look at to decide whether or not you need NoSQL. If you're currently using a relational database, and you're not having any problems mapping your data to it, and you're not having any scalability problems, I'm not really sure that you need to drop everything and jump on NoSQL just because everybody says it's the coolest, latest thing. However, a lot of people actually have trouble mapping their data to relational tables: the data doesn't appear to be relational, or they very quickly start seeing scalability problems based on the usage patterns they know they will have. So you definitely need to look at the type of reads, the type of writes, the type of queries your application has. You want to look at how easy it's going to be to maintain or to scale the solution that you picked, how easy it will be to use, how long it will take you to come up to speed on it, and then there are all these other things like scalability and cost.

So let's take a closer look at how MongoDB deals with a lot of the issues I mentioned in the general problem. A little bit of MongoDB history. One of the founders of MongoDB was
one of the co-founders of DoubleClick, which I'm sure many of you have heard about. It was serving up lots and lots of ads in the late 90s and early 2000s, until it was bought by Google, and now it serves up lots and lots of ads as part of Google, to the point of having to handle over 400,000 ads per second, which was pretty challenging. That founder started another company, Gilt Groupe, and co-workers from DoubleClick started ShopWiki and a few others, and they constantly ran up against the same problem: it was easy to scale the web traffic and the application, but it was very hard to scale the database. So they set out from the start to build the type of database, the type of data store, that they wished they had had when they were developing these types of applications. What happened was they actually started developing it as part of an app platform, and people weren't that interested in the app platform, but they were very interested in the database. So they open sourced the database. It is available, you can just download it and try it at no cost, and it's been released and in production at some sites for a number of years now.

Its design goals can be summarized like this. Key-value stores that are in memory are very, very fast, but they trade off a lot of the rich functionality that an RDBMS has. They wanted to design MongoDB keeping as many important features of an RDBMS as they could, while at the same time having the scalability and performance be as close as possible to the in-memory key-value stores. So this is what they came up with. It is document-oriented storage: the stored records are JSON documents. They're actually stored as BSON, which is a binary form of JSON that is richer and supports more data types, but you can think of them as JSON documents. The schema is flexible. A lot of times people will say schema-free, which I'm not sure I like as much; the schema is not enforced, so different records, different documents, can have
different fields. If a particular field is not applicable, it doesn't have to be set. If its type differs based on the type of object you're storing, then you would store different types. This makes it quite flexible for the developer. Scalable architecture: there are two things here, and I'll talk about both of them in a lot more detail. One is auto-sharding, which is partitioning done automatically with minimal effort or knowledge from the maintainers, and the other is replication for high availability. Some of the features they wanted to keep from RDBMSes: indexes that can make ad hoc queries fast, a rich and expressive query language, the ability to answer more complex questions through the aggregation framework, and MapReduce jobs. As for JSON documents, a lot of you might be familiar with them because they're quite common, especially in web development. They also map to a lot of different programming languages very flexibly, and I'll show you how they provide better data locality as well.

So let's take a look at an example of something like a blog post. A blog post might be represented relationally with a table of authors, or people. They have posts (there's a missing line there on the slide). Posts can have comments; authors can also have comments. Posts can be tagged based on what they are about, so there can be many-to-many relationships between posts and tags. Now let's take a look at what that would look like as a JSON document, for those of you who are not familiar with them. You can think of it essentially as a set of key-value pairs, where the value is not limited to being a simple value; it can also be an array of values. So, for example, we might add tags as an array, and that array will have a bunch of simple values, which are strings. Now, if we want to find all the posts tagged news, we just say db.posts, that's the name of my collection, .find, and pass it a JSON document that says tags should be equal to news, and it's going to return to me all the
posts that were tagged with news. Now, we might also want to keep track of voting, and since we might want to query and sort by the number of votes, at the same time that we add voters we might also want to atomically update the number of votes, so we don't have to do complex calculations later, in a way pre-calculating that value because we know we're going to need it. And we could embed comments right there in the post itself. So the comments field, you can see, is an array, and each comment is actually a document on its own. Each member of the comments array shows who it's by and the text that it holds; there could be more fields, but for simplicity I'm showing just this. The nice thing is that not only can we have an index on any secondary field, like author or tags, but we can also have an index on fields of embedded documents, so we could search, for example, for all comments by a particular author and use an index to find them very quickly. So this preserves all the expectations that people are used to from working with relational databases, but it allows them to create a much richer, more complex document that contains within it all the data that's associated with it. Now, why might that be advantageous?
When data is stored on disk, a seek to the place where the data lives is quite expensive compared to the actual reading of the data. This is why solid-state drives, for example, are so much more expensive and so much faster than your regular spinning drives, and spinning drives with slow rotation speeds are particularly bad. Here you have three tables: your authors, your posts, your comments. You have to seek to all the different places, find the appropriate records, gather them together, assemble them, join, and then return them. When it's all stored as a document, you find the data and return all of the data that needs to be returned. And when you are saving, when you are writing a post for an author, you do the write only once, rather than entering things into a number of different tables.

So MongoDB is actually meant to be a general-purpose database, with dynamic queries that you can construct, secondary indexes, and various other types of indexes, like multikey indexes on arrays and 2D indexes for spatial queries, and very rich capabilities for updates and upserts. By the way, an upsert is an operation where you say: if this record is already there, then update it; if it's not there, then insert it. It's kind of a cross between insert and update. You can update documents atomically, for example adding a new voter to a record and at the same time incrementing the count of voters, making sure those stay in sync. You can aggregate data for various types of reporting, and you can write fairly arbitrary MapReduce jobs. So it's viable as your primary data store.

Scale is very important. You want to have high availability, you want to be able to scale immediately, you want to be able to do that without downtime, and the other very important thing is that the application should not have to be changed when you change the number of servers you have or how you've configured your cluster. So first I'm
going to address high availability. MongoDB deals with it through what we call replica sets. A replica set is a set of servers of which one is a primary and the rest are all secondaries. It has the property of automatic failover on failure of a primary; it provides data redundancy, with extra copies of the data for disaster recovery; and it does it in a way that's transparent to the application. That also allows you to perform maintenance with no downtime, since you can perform maintenance on one of the secondaries, put it back in the cluster, and eventually switch the primary over to one of the secondaries so the primary can have the same maintenance done on it.

So it looks something like this: a primary and two secondaries. You could have seven secondaries or eleven secondaries; three is generally a good minimum to have. You have asynchronous replication happening from the primary to all the secondaries. Your application talks through its driver, which is a native interface for the application, and the drivers all know how to talk to the replica set. The writes and the reads by default all go to the primary: the writes have to go to the primary, and they will be replicated to the secondaries, and the reads by default will go to the primary. Now, your application can specify a secondary, for example if it's querying for yesterday's data or historical data. You can explicitly specify that you are okay reading from a secondary, but it would never happen without your knowledge.

Now, what happens if a primary fails? As soon as the secondaries notice that the primary is not responding, an automatic election happens, and a new primary is elected. It will be the one that's most up to date, and there are ways in the configuration to influence which one should have preference during the election. The driver, which is keeping track of the replica set configuration, will switch over to writing to and reading from the new primary, with the other secondary continuing to serve
as a secondary. If the original primary comes back online, for example maybe the network had gotten disconnected, or the machine crashed and was brought back up, it now catches up as a member of the replica set, as a secondary. Once it catches up, depending on whether it has a higher priority than the machine currently serving as primary, it could force a new election and become primary again, or the whole replica set could stay in this configuration until something else happens to, say, the new primary. All of this is transparent to the application.

Now, how do you increase capacity? As you could see, you could add more members to the replica set, but that doesn't really increase capacity except in a limited way: it can increase your read capacity, and it certainly ensures more high availability, but what about write scaling? And we want to do it in a way that's transparent to the application, simply for ease of development and ease of administration. So MongoDB uses range-based partitioning: a splitting up of data into partitions and a balancing of them between the servers, or shards as we call them, because this is a sharding system. It's automatic, and it happens something like this. Let's say you have a fairly big server, and you know it's time to shard. I'm going to get more machines provisioned and add them to my sharding cluster, and you have to pick a key. Just for simplicity, let's say some key range, 0 to 100. In reality you would make it something different depending on the needs of the application, because you want to make sure that the writes are distributed evenly among the separate shards, and you want to make sure that the reads are also happening efficiently. So I'm going to use an example of a shard key with a range of 0 to 100. The partitioning happens under the hood, automatically. Let's say you created four shards in the cluster; eventually, when enough data shows up, the key ranges will get split up
approximately evenly, and the data gets balanced between the servers. Now, how is this invisible to the application? By the way, each shard would be a replica set, not a single server, because you want to make sure you have high availability for every partition of your data. The application actually talks to a process which is the sharding router process, which we call mongos, as opposed to mongod, which is the standalone database process. To the application, mongos looks and talks like a regular mongod standalone server, so the application is just sending its queries to mongos. Under the hood, mongos figures out where to send each query or each write, to which shard. If it's something that goes to multiple shards, then mongos gets back multiple results, merges them together, and sends them back to the application. So the application didn't have to be changed at all, but you're essentially quadrupling the capacity this way.

Now, mongos is a very lightweight process, and you would actually have as many of them as you have app servers, maybe, because each app server could have its own. It's just something that knows how to route requests based on knowing where the data is in the cluster. How does each mongos know where the data is? They use config servers, and you would have to have three of them, to make sure you always have availability in case one of them failed, or even two of them failed in the worst case. The config servers are where the metadata lives that maps the location of the data: which chunk and which shard it's on. This is also the database that keeps track of different chunks of data moving from one shard to another. And the reason you can have many, many mongos routing processes is that each of them just checks with the config servers: where's this data, where's this key range that I need, where should this data come from, and so on. So you might imagine that you're running with four shards and successfully doing
really great, and suddenly you have a big spike in demand, maybe because you released some cool new version. All you have to do is provision some additional servers, create shards, add them to your cluster, and automatically the data will get redistributed from those four shards onto the additional shards to spread the load more evenly. So management is really quite straightforward. The goal of the original design was to have as few configuration options as possible, to let the operating system do as much of the RAM and disk space management as possible, to have the right thing mostly happen out of the box, and to make it very easy to deploy and manage.

And of course, for developers it's easier to use, because it maps more naturally onto the data they're dealing with. If you have a contact and it has two email addresses, in a relational database you might start a transaction, enter something into the contact table, then enter a couple of entries into the emails table. In MongoDB, that's a single document that you save once, and that save is atomic, so you don't have to worry about starting and ending a transaction and the possibility that only some of the rows get written. It simplifies things significantly. And because there are native drivers for dozens of languages, users don't actually have to learn any kind of new query language or database language; they simply use the language they're already used to, whether it be Java or C#, Python, PHP, Ruby, etc.

Now, some usage examples from real life. MongoDB, of course, can be used for many different use cases; we have examples of customers using it for content management, digital intelligence, e-commerce, management of user data, and high-volume data feeds. Let me tell you about three different examples. One is Wordnik. They have a massive amount of data, and it might be a lot more now, actually; I'm not sure how up to date this slide is. But they have something like three and a half terabytes of data in 20 billion records,
Their problem was that they were trying to analyze a really staggering amount of data, and adding data too quickly resulted in outages. Because of the way a relational database spreads the data across multiple tables, it really doesn't scale that well when you have a lot of writes coming in at the same time you're trying to do reads. Even though they initially launched fine on MySQL, they quickly hit performance roadblocks. They picked Mongo, and as a result they were able to eliminate their memcached layer; they had been caching a lot of records to avoid hitting the database for reads. They migrated 5 billion records in a day with no downtime, and now MongoDB powers every one of their website requests, something like 20 million API calls per day. A nice side effect, they told us, was that they were able to reduce the total size of their code base significantly compared to MySQL. Part of that is the native drivers and no need for extra layers translating relational data to objects; part of it is that they no longer needed the extra memcached layer. Their fetch time went down from 400 milliseconds to 60 milliseconds, and they were able to sustain an insert speed of up to 8,000 words per second, with frequent bursts as high as 50,000 words per second. This was also a significant cost savings for them, because they could handle the same or larger capacity with fewer servers. And there's a nice quote from their VP of Engineering.

Now, Intuit. A lot of people know Intuit for Quicken and QuickBooks, and I use Quicken myself, but they also host something like half a million websites for small businesses. They wanted to be able to collect and analyze data, but it would take days to process the information to get answers to even simple queries. They decided to use MongoDB's ad hoc queries and map-reduce jobs, expecting better performance than what they currently had, and they felt it would be less effort than deploying a more complex cluster solution. Part of the reason they picked MongoDB was the strength of the community: there's a very large and active community using MongoDB, with a very active mailing list answering questions from users, and the developers are very active on it as well. It turns out they were able to prototype the whole system within one week; they became proficient in MongoDB, built a functioning prototype, and felt they could see exactly how the whole system would work, so they picked it based on how quickly they were able to turn that around. In addition, the result was two and a half times faster than their original MySQL implementation.

Now, Shutterfly. I don't know how many of you use it, but they store a lot of photographs for their users, something like 20 terabytes of data and 6 billion images for millions of customers, partitioned by function. They used to have a homegrown in-memory key-value store on top of an Oracle database, but they were still getting poor performance; it was also hard to manage, and it had high licensing costs and rather high hardware costs as well. They picked MongoDB for a few reasons. One was that the JSON-based data structure was rich and allowed very complex data structures that represented their use case very well. They thought it was an agile, high-performance solution with a low cost, and it worked seamlessly with their existing services-based architecture. They achieved pretty significant cost reductions and performance improvements; maybe even more importantly to a lot of people, they were able to accelerate time to market for a lot of projects, because it became easier for projects to access data and to flexibly adjust the schema as needed. The latency for inserts went down dramatically from 400 milliseconds, which is pretty significant. And there's a quote from them saying that the rich JSON-based data structure offered them a way to be extremely agile in development.

There are thousands of other organizations using MongoDB; maybe you're one of them and already seeing what works, and if not, maybe you're considering looking at it. Hopefully this was useful information for you, and now we're going to see if there are any questions. You can type questions into the public chat, or if you want to send a private note directly to me or to Shannon, the moderator, you can do so as well, and she will relay them to me. I'll take a second now to give you a chance to type in your questions.

Okay, I see a few pretty good questions. One of them asks about best practices for when to start sharding: how do you know if you can make do with a single cluster, or whether you should shard right away? It's definitely true that it's much easier to shard a system that still has a bit of room to grow than to wait until you absolutely need to shard and only then start. The reason might seem pretty obvious if you think about it: when your system is already overloaded and you start the process of partitioning, migrating the data requires reading about half of your data, granted, one chunk at a time, and moving it, competing for resources that are already very stretched. So I would say if you have a system that's loaded to about 50 percent and pretty comfortable, it's probably a really good time to figure out whether you want to shard, because once you've passed about the 70 to 75 percent point, you no longer have as much headroom to painlessly add more. And this applies specifically to going from essentially one shard, or no shards, to two; if you have six or seven shards, adding one or two more is generally not as much of a stress, because the load will generally be spread across the existing shards already.

Now, we have a question about how you know that a NoSQL approach would be better than a relational SQL approach for your application. That's a really good question. If you have an existing application where you already had to denormalize the data, build a custom caching layer, or do some sort of custom partitioning, that's a pretty obvious symptom that you should be considering a different data model. But what if it's a new application and you don't know yet whether you would run into any of those difficulties? I would say, first of all, look at data modeling: look at the type of queries and the type of data you're going to be storing, and how you're going to be querying it, and see if it maps easily and well to a relational schema. If it does, it's probably a really good fit for a relational database. If you have very varied data, with data structures that resemble JSON documents more than tabular data, then you probably want to consider both. And if you can't even figure out how to map your data to a relational structure, then I would guess you're definitely a strong candidate for considering a NoSQL solution.

All right: in what cases would an RDBMS be better than MongoDB? I would say, first of all, when the nature of the data is relational and the application would naturally take advantage of relational database capabilities such as transactions and two-phase commit. If you have two unrelated operations that must happen in a way that assures either both of them happen or neither one happens, that's a very relational thing to do; financial transactions sometimes call for those kinds of semantics. If you used MongoDB in that case, you would essentially have to do all that work yourself at the application level. So just as you don't want to do at the application level what MongoDB can give you for free, you don't want to do at the application level what a relational database can give you for free. Look at the capabilities that fit your application's needs and see where the balance falls.

All right, there are lots and lots of questions. There's a good question about how you migrate if you're currently on a relational database and you want to go to MongoDB. You definitely don't go one table to one collection. In fact, I would say you should approach the migration as if you had a brand new design: look at your application code and see what data structures, what objects in your application code, you would want to store as single objects, and then, after you get a picture from that, see how they are currently stored in the relational database. It may very well be that you'll end up changing the structure so significantly that it would be hard to see a natural mapping. Like in the example I showed, where you store blog posts with the authors, comments, and tags embedded: how would you map the comments table or the tags table?
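A rough sketch of how such a migration might fold those child tables into embedded documents follows. All table and field names here are hypothetical, and the "rows" are plain dictionaries standing in for relational query results; this is only an illustration of the reshaping, not a migration tool.

```python
# Hypothetical relational rows: posts, comments, and tags as separate tables.
posts = [{"id": 1, "title": "Why NoSQL?", "author": "asya"}]
comments = [
    {"post_id": 1, "who": "reader1", "text": "Great post"},
    {"post_id": 1, "who": "reader2", "text": "Thanks!"},
]
tags = [{"post_id": 1, "tag": "mongodb"}, {"post_id": 1, "tag": "nosql"}]

def to_document(post):
    """Fold the child rows into one embedded blog-post document."""
    return {
        "title": post["title"],
        "author": post["author"],
        # Comments become an embedded array instead of a joined table.
        "comments": [
            {"who": c["who"], "text": c["text"]}
            for c in comments if c["post_id"] == post["id"]
        ],
        # Tags become a simple array of strings; the tags table disappears.
        "tags": [t["tag"] for t in tags if t["post_id"] == post["id"]],
    }

doc = to_document(posts[0])
print(doc["tags"])  # → ['mongodb', 'nosql']
```

After this reshaping there is no comments table and no tags table to map at all, which is exactly the point of the answer that follows.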
The tags table wouldn't really exist. A lot of times in relational databases you have tables for concepts that don't actually exist on their own: tags don't exist except as attributes of blog posts, and line items on orders are the same. There's no such thing you can hold in your hand called a line item; it's just a property or an attribute of an order. So it would be an array of these things on your order object, even though they're stored in a separate table in a relational database.

There's a question about data encryption and decryption by the database. Encryption of data at rest would in general be handled by the file system rather than by Mongo itself. Hopefully that helps; if not, follow up and I can give you more information about that.

Okay, here's an interesting question. Somebody has a collection with a key that can only have one of two values, either one or zero; it's indexed, and when they query on that column, the performance is very slow. This is something that's not in any way specific to Mongo or NoSQL databases; you would see pretty much exactly the same thing in a relational database when the key selectivity is very low. I know we didn't have time to get into the details of how indexes are implemented, but MongoDB uses the exact same data structure that pretty much all relational databases use, which is the B-tree, and the more distinct values you're indexing, the better the performance gain when you're selecting a subset of those values. If you have 3 million records, half of them zero and half of them one, and you ask for only the ones where the value is one, you're still looking at one and a half million records; you wouldn't really expect the performance to be that much better than a complete full-table scan. So in general you need to take about the same care when designing your indexes in Mongo, whether they're compound indexes or multi-key indexes, as you do in a relational database: make sure the index is helpful for the type of queries you'll be doing, and don't just arbitrarily throw in lots of indexes, because when you insert new records or update data, the maintenance of those indexes still has to be taken care of.

Here's a question I'm not quite sure how to interpret: how can MongoDB ensure data quality? MongoDB cannot ensure data quality in the sense that if your application writes incorrect data into MongoDB, data that's not accurate to what it was supposed to be, MongoDB wouldn't know anything about that. It does shift a lot of that logic onto the application layer. People still choose to use an object-mapping layer between their code and MongoDB, and it's precisely for that reason: they want a layer where they define the data model, which will enforce it and prevent an application that's the wrong version, or has a bug in it, from pushing incorrect data into the database. Hopefully that's what you meant.

There are a couple of questions about the Foursquare outage, which was quite a while ago. I don't want to go into too much detail about it, because you can google it and find lots and lots of discussions, but I believe one of the root causes was very much related to the question about when to start sharding: when you're in a situation where you're already at nearly maximum capacity, spreading the load to a new machine is going to basically max out the available capacity and cause that machine to be non-responsive both to the migration and to the normal queries.

Is there any documentation on good practices for implementing sharding, so that it's simple and right the first time? There's definitely a lot of documentation. Obviously there's the manual, there are a lot of write-ups in the community about people's experiences, and there's a pretty good book called Scaling MongoDB, which is all about sharding. In general, approach the problem keeping in mind how the scaling will benefit you and what you need in terms of the kinds of writes you will have, specifically thinking about the peak times, because a system that's available most of the time but falls over during your worst peak times is not a very good system; you want to make sure it's the peak you can accommodate.

There's a question about whether a deadlock can occur during a write operation, and the answer is no. Generally, a deadlock can only occur if you have two threads, each holding one of a pair of locks that the other one needs, and MongoDB doesn't have anything that holds one lock while waiting for another. That's actually a situation that's kind of unique to a system like a relational database, where multiple locks can be needed to complete a transaction: if you're writing to two separate tables, you would need a lock on a row in one and a lock on a row in the other, and you may be waiting for one of those while a different transaction is executing a similar update in a different order. Because the lock here only covers a single document, either you acquire it and complete that single update atomically, or you don't; if you don't acquire the lock, you wait for it, and when the process holding it completes and releases it, then you acquire it. That pretty much makes it immune to that kind of distributed multiple-lock problem.

All right, we are pretty much out of time; maybe one or two more quick questions. We're right at the top of the hour, Asya, so actually I'm going to cut you off there. Thank you so much for this fantastic presentation, and thank you, everybody, for your patience with the sound; I'm glad it was coming through. Just to remind everyone, we will get the links to the recording and the slides out within two business days, and Asya mentioned a book that would be helpful, so maybe we can get that in the follow-up email as well and get any information out to everybody. Thank you very much, everyone. Thank you, and thanks, everyone; have a great day. Again, thanks for your patience with the sound; I'm glad we got things working, and we'll get things out to you as soon as possible. Have a great day. Bye-bye.