So, as Ben introduced, a little bit about my background: I started off writing code for the core indexing team on DB2 LUW, on a variety of projects — pureXML, range partitioning for the data partitioning feature — then moved to MarkLogic, and now Couchbase. The database market has changed quite a bit over the last few years, and it's been very exciting to see emerging technologies come up and to be a part of that. I'm very excited to be here and talk about a different approach to data management.

So, a quick overview of the presentation. We'll take a look at why transition at all. Relational databases have been around for a while and have a huge ecosystem around them, so why are people looking at these new technologies? We'll then talk specifically about distributed document databases and how they solve two problems. One is the unstructured data problem — we'll take a look at the data model, comparing relational and document databases — and then we'll look at the scalability model and see how we can handle a big audience.

Now, you'll say: "a big audience — how does that relate to big data? Is it a new term? Why don't I know about it yet?" Well, in some ways it's related to big data. Everyone has their own definition for big data; at the core it is not just about volume, but about the variety and complexity of information. Big audiences are about the ability to support a large number of users. With new applications coming out today — social media applications, a blog application, a quick photography application — you could instantly have millions of users. That's really the big audience problem: trying to serve data to potentially millions and millions of users. We'll take a look at some of the other characteristics, and then, if time permits, I'll introduce Couchbase Server. But this is really not an advertisement for Couchbase; this is to help you understand what document databases are about. In some cases,
I'll use Couchbase as an example, perhaps.

So why transition at all? You have a huge ecosystem around relational databases: database tools, BI products, ETL, MDM (master data management), metadata — there's a whole world out there. Why are people looking at NoSQL as these new technologies emerge? To figure out why, let's go back a little bit, to when these databases were built. If you go back to the 1970s or early 80s, you see that a lot of things were different: the number of users was different, the applications were different, the infrastructure was different.

Back then, an application that served 2,000 users was probably one of the largest applications of its time. Today, we start off with 2,000 users and potentially grow to 200,000, or 20 million. We've seen some applications — some customers that we're working with — grow to 35 million users in a matter of weeks. So how do you support that use case? The dynamic user population is the other aspect: you don't have control over users; they come and go; you don't have predictability. But you still have to plan for it, because if you do have demand, it's a good problem to have, and you need a system that handles it.

Looking at applications: you used to have a lot of automation — applications that converted a manual process or a business process into an automated one. Today it's about innovation: how do you get users to shop differently, socialize, and entertain themselves? It's a completely different kind of application that you're trying to build, particularly in the interactive web space — or you could sort of call it the OLTP space.
This is not on the warehousing side; this is more on the OLTP side of things.

And then, finally, the infrastructure. It was all about mainframes — centralized computing, scaling up all the way. Networking was in its infancy, and memory was incredibly expensive. Today, on the other hand, 1 GigE is the norm and 10 GigE is becoming very common, particularly in managed data centers. The infrastructure has changed, and so you need a different kind of database to handle these different user requirements and application requirements.

So we actually conducted a survey late last year — we got about 1,300 respondents; we advertised in a bunch of different places — and there was something in it that surprised us. If you look at the results, the second biggest driver for NoSQL is the inability to scale out data. Given how user requirements and application requirements have changed, that was not new to us; we were working with a lot of customers where scale was the issue. But it turns out the number one driver was actually lack of flexibility: the inability to iterate on your applications faster, the inability to do rapid application development. That was the big factor, right at the top.

We'll see how document databases try to address both of these issues. In general, most NoSQL databases address these two issues, whether they're key-value stores, document databases, or graph databases. At Couchbase — given that Couchbase is a document database —
we believe it strikes the right balance between schema flexibility and performance, and we'll look at that in more detail.

Before I jump in: how many of you are familiar with the NoSQL space and the different categories out there? All right, so I won't spend too much time on this, but this is roughly what the categories look like today. On one extreme you have key-value stores; then data structure databases, document databases, column databases, graph databases, and so on, and I'll focus on just one of them. I don't want to complicate things, but there are a lot of other options available, and we really believe in the right database for the right application — you need to understand your application requirements and see whether the database you're looking at is the right choice.

Now, Couchbase is a document database. At the core, every NoSQL database is a key-value store. Are people here familiar with key-value stores? Otherwise I'll spend a sentence or two on them. All right: key-value stores are the most basic kind of database. You have a key, which is equivalent to the primary key of a table, and you have a value that it points to, and that value is stored as a blob in the database. It's very basic: to get access to your value you use the key, and that's really the only way to get access to it. You can insert data and you can get data. But it's incredibly fast — because you have the key, you know exactly where the data lives.

Document databases build on that concept.
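As a toy illustration of the key-value model just described, here is a minimal in-memory stand-in (not a real database client — the class and key names are my own): values are opaque blobs, and the key is the only access path.

```python
# A minimal sketch of the key-value model: put/get by key only,
# with the value treated as an opaque blob. Illustrative only.

class KeyValueStore:
    def __init__(self):
        self._data = {}  # key -> opaque value

    def put(self, key, value):
        # The key is the only index; the value is stored as-is.
        self._data[key] = value

    def get(self, key):
        # The only access path is the key itself -- there is no way
        # to query inside the value.
        return self._data.get(key)

store = KeyValueStore()
store.put("user::1001", '{"name": "Alice", "type": "user"}')
print(store.get("user::1001"))
```

Document databases keep this fast key-based access path and layer indexing and querying on top of it.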
They add indexing and querying capabilities, which is what you'd expect out of a database. That's why we feel document databases give you the best balance: schema flexibility, plus the ability to do more with the information you actually have in your database.

Couchbase has auto-sharding, so it scales horizontally. We call it "clone to grow": you can add nodes as you go and build the database out horizontally as your demand increases. Replication is used for high availability: if you have node failures within your cluster, you can automatically fail over to the node that has a replica of the data. And it includes a built-in caching tier, which gives you the consistently high throughput and low latency you would expect for an application with millions of users.

CouchDB, on the other hand, is a single-server database. We are descended from Membase and CouchDB. CouchDB is useful — it's good for a different class of applications. What we took from it is the ability to index and query, and that's what we embed in Couchbase. MongoDB is another database that's viable and widely adopted, like Couchbase, and it's similar in some ways: it supports auto-sharding, a slightly different flavor of replication (it uses master–slave replication), and it also supports ad hoc querying. So if that's something you need for your application, MongoDB could be a good fit.

So let's take a look at distributed document databases. By "distributed" I mean a database that can be spread horizontally across multiple servers but, to the application, looks like one instance, one database. At the core you have a document that describes your record, your data, your object — the data is self-describing. Here you see a couple of attributes — time, server, type, and so on — and each attribute has a value associated with it. Every document in the database could have a completely different schema — a completely different list of attributes. And it can get fairly complex: you could have embedded objects. The "details" attribute here actually has a document embedded within it. So this model lets you represent data very flexibly, and represent different kinds of data. Particularly for unstructured information — blogs, comments, a tagging application — it's a great way to represent your information, because you have so much flexibility in the data model.

Now, every document is pointed to by a unique key: the document ID, equivalent to a primary key in a relational database. The format used to actually store this on disk can differ. In some cases it's JSON — that's what you see here; in some cases it's XML, or derivatives of JSON and XML. At Couchbase we use JSON; MongoDB uses BSON, which is binary JSON.

In a key-value store, if this is what your value looks like, you won't really be able to query it or look into it. But with a document database you get indexing and querying capabilities, so you can create secondary indexes on individual attributes — like you could in your relational database — and use those for querying. You can even have group-bys, reduce, and so on. The implementation of indexing varies from document database to document database. In Couchbase we use incremental map-reduce as the method of building these indexes. In MongoDB, indexes are maintained right when an insert or update happens, whereas in Couchbase they're actually built when the index is queried. So there are behavioral differences, depending on the system you use, that you should be aware of.

Now, auto-sharding is the ability to scale out. Each database implements this slightly differently. With Couchbase,
we use consistent hashing. Consistent hashing takes a key and hashes it to a value, and the client knows exactly which server that value lives on — so you can go directly to that server to access the information. That's why you get much faster speeds and much lower latencies.

So that's a brief overview. Now let's look at what the data models and the scalability models actually look like. This is probably a schema some of you might be familiar with. With relational databases being so mature, a lot of existing applications have very complex schemas, and NoSQL databases are fairly early in maturity from a product perspective. So I'm not here to tell you we can replace that. I'm here to say that for new kinds of applications — maybe simpler ones, where your requirements are very different from what you're used to — NoSQL might be a good option.

So let's take a very simple comparison. Most of you are familiar with tables and table definitions. Here's an example with rows and columns: you define a table with a primary key and perhaps a couple of columns, and every row must conform to that table definition. If the data gets complex, you use third normal form: you normalize, and normalize, and normalize, and in the end you end up with that kind of schema — multiple tables, with foreign keys connecting all of them. On the document side, the same information can be represented where each record is different — each can look different — and you can store all of these documents, or records, in one database, one "table." That's really, at the core, the basic difference between the relational data model and the document data model.

So here's an example: an error log. You're trying to log the errors you encounter across multiple data centers. In the first table we have a primary key, the error text, the time it occurred, and then a foreign-key dependency on the data centers table, which gives you information about the data center — what the phone number was, and so on. To get the complete record for a specific error, you would perform a join across the two tables, matching on the foreign key, and get the information back out.

Now, how would you represent this as a document? The first part is pretty simple: you've taken each column and created an attribute from it. But you've also included the data center, and the phone number of the data center, in that object. So you've basically joined the two tables — you've denormalized — and created a record that represents the complete set of information for a specific error.

What gets easy is schema change. If you had to change the relational schema, for this kind of change you'd probably need to add two columns to your table, and maybe some additional columns in your data center table — it depends on what you're trying to do. But a simple ALTER TABLE could take weeks or months, in some cases, to implement, and so you're unable to rapidly iterate on your application and push things out faster — which is what the market wants. With the document model, this kind of schema change becomes really easy: as new information becomes available, you can simply add new attributes to your document and continue.

So what about modeling these documents? We have so much theory and so many best practices on the relational side; it's fairly early on the NoSQL side. But there are a couple of different options, so let's take a look. It really depends on your application. Do you model these objects as separate objects in your ORM layer? Do these objects get accessed together?
What about your atomicity requirements — do you need all these objects to be updated at the same time, as one atomic operation? Do you have a lot of concurrent users updating these documents? All of these aspects go into your document modeling. But at the highest level, think of it as representing the objects — the logical objects — at the ORM layer, and that makes it a lot easier.

The simplest way is the one I described earlier: all information related to an object is in one document. That makes things really easy, but there are issues with it — you duplicate a lot of content. What about third normal form, normalizing and reducing duplication? You will see your data sizes increase. In the other option, where you have separate documents, you have a different kind of problem. NoSQL databases are pretty early in product maturity, and they don't have a way of implementing joins. So if you keep things as separate documents, you'll have to implement joins at the client level, within your application, and your application can get a little more complex.

Let's take a look at the document ID. We talked a little bit about it: it equates to a primary key on the relational side, and it's what's used to shard documents across multiple servers. If you have a distributed database, you need to know where a key — where a document — lives, so you can go access it directly, without additional hops. That's what's used to shard data across multiple servers, and it's the way to get information extremely fast, because you know exactly where it's located. Usually the document ID is unique across a bucket; a bucket is equivalent to a table — in some systems it's called a collection.

Just a couple of options here; I won't go too much into it.
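The key-to-server routing described above can be sketched in toy form. The server names and partition count below are made up for illustration; Couchbase's real scheme (vBuckets) is more involved, but the idea — hash the document ID to a partition, look the partition up in a map the client holds — is the same.

```python
import hashlib

# Hypothetical cluster layout for illustration.
SERVERS = ["server-a", "server-b", "server-c"]
NUM_PARTITIONS = 64

def partition_for(doc_id):
    # Hash the document ID to a stable partition number.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The partition map (partition -> server) is what a "smart" client holds,
# so a lookup needs no extra network hop to find the right server.
partition_map = {p: SERVERS[p % len(SERVERS)] for p in range(NUM_PARTITIONS)}

def server_for(doc_id):
    return partition_map[partition_for(doc_id)]

print(server_for("blog::hello-world"))
```

Because the hash is deterministic, every client computes the same server for the same document ID, which is what lets the client talk directly to the node that owns the data.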
You have different ways of picking your IDs: you could have date-based or numeric IDs, as you would with relational; in some cases it might make sense to have human-readable IDs to make the application easier. Yeah — question? It's a unique ID, which is auto-generated, typically. Yes — you could think of it as a sequence, like you might use in a relational database.

So let's take an example: maybe a blog application — a new kind of application where you have unstructured data, because blogs typically don't have structure. Your comments could be five pages long or two lines, and you need to be able to handle that. At the core you have a user profile; the profile points to blog entries; you might have other settings, like badge settings, and so on. Then you have the blog posts themselves — a blog post includes the text of the blog and other information about it — and then you have the blog comments. Those are the three core objects we want to represent.

Option one is to have everything in the same document. So here we have a blog post with the title "Hello World."
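A hypothetical shape for that single-document option might look like this — field names and values are illustrative, not a real application's schema:

```python
import json

# Option one: the whole blog post, comments included, in one document.
blog_post = {
    "_id": "blog::hello-world",   # the document ID / "primary key"
    "type": "post",
    "title": "Hello World",
    "body": "My first post.",
    "comments": [                 # all comments embedded in the same document
        {"author": "bob", "text": "Nice post!"},
        {"author": "eve", "text": "Welcome aboard."},
    ],
}

# On disk, a document database would store this as JSON:
print(json.dumps(blog_post, indent=2))
```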
You have a body in there, and so on. Comments is the interesting field, because you have a list of comments in there — you're basically embedding all the comments into this one document. Now, this can get complicated, because you could have thousands of comments on a very popular blog post. You don't want all of that in a single document, because it gets more expensive to access the document if it's really large. On the other hand, your application might need to display only five comments per page and then move to the next page, and the next — and you can't iterate over them that way if everything is together.

That's where option two comes in: we split things into two different kinds of documents. The first is the core blog document, which has the metadata — and the text as well, you could say. Then each comment is an individual document, and the comment has an ID, which you embed into the blog document. In some sense this goes back to the foreign-key concept: the blog document has a "foreign key" that points to your comments, and that's how you get access to them. Remember that NoSQL databases — including Couchbase and MongoDB — don't have the ability to do joins yet, so this is something you might need to compute at the application level. With these new technologies, while you're getting a lot of advantages, there are some things to be careful about when planning your application.

Now, using this concept of breaking up — go ahead, question? Yes. Right — it's pretty early to say how it would be implemented, but at the core, what we're seeing is that you need a way to reference related content. At the most basic level it's one level of indirection, and a way for the database to understand that link. I think it would depend. What we've seen is that a lot of documents might live in a single "table" — a single bucket or collection — which means you need a self-join, because different objects are stored in the same table. In some cases people prefer to have multiple collections or buckets, and then you need joins across buckets. So we might need both. But this is probably a year and a half, two years down the line. We've definitely seen that many applications have a need for it — a lightweight join is probably the first thing that would be implemented.

No, not yet. I think a lot of answers here will be "not yet," because these products have been in the market for two or three years. That's right. That's right. Yes — but again, that's something that comes up, and we're looking at the best way to integrate full text.

So if you take the concept of splitting out, you can go to fairly complex document models, as you have with relational. The advantage of having all the information in one location is that it's easy to get everything together — it's almost like a pre-computed join, in one place — versus having complex application logic in your app.

So given these advantages and disadvantages, how do you know if NoSQL, and document databases, are the right approach for you? What we've seen is that there are several questions that might help answer this. Have you changed the way you're using your relational database — are you just serializing your objects and storing them as keys and values? And then, perhaps:
yes — but you have a lot of sparse tables, where not all the records really have values for all the columns? That's one of the places where we've seen NoSQL help. The other is application change: if your application changes really fast, you might want to consider NoSQL — it's schemaless, and you get the flexibility of these databases so you can iterate faster and push out more changes to your application. And finally, at the extreme: are you just using your relational database as a key-value store? In that case you're not getting the advantages of a relational database, and you're not getting the advantages of NoSQL, so you might want to think about a different approach.

So before moving on to the scaling model — any questions about the data model? That's right; it's one of the key propositions. With MySQL, you typically need a caching tier on top, and that's another thing you're managing. With Couchbase the caching tier is integrated, so you don't really need to manage it: your caches are warmed up when your server starts, and you don't have cold-cache problems when a node goes down, because data is replicated. So there are a lot of management and operations considerations where a solution like Couchbase significantly helps. It makes it incredibly easy to add nodes and rebalance data across them, and to keep your application running. We can actually do upgrades — software upgrades, database upgrades, hardware upgrades — with the application running. You never need to take your application down: you add new servers, rebalance information out, remove servers, rebalance again, and your app just keeps running the whole time.

That's right. With Couchbase, the query tier uses HTTP APIs, so it's very different in some sense.
You would have similar operators — you have equals, you can do range queries, and so on — but it's not SQL. We'll talk a little bit about accessing data and what the issues there are.

Mm-hmm. Right. I think it's easier to think of it in terms of your ORM — you model your documents in terms of your objects. But it's early; there aren't tools out there yet that can help you through that process. As we move further, with more adoption, we'll see more tools that can guide you through data modeling.

So let's talk about scalability. We touched on it with auto-sharding; let's see how it differs between the relational world and document databases. Here you see a bunch of application servers sitting behind a load balancer — this is your typical application. You have your database, which is a shared-everything, shared-cache, or shared-disk kind of system. If your demand grows — if the number of users grows — you can just throw application servers at it and scale the application tier pretty easily, so your costs and your performance are fairly linear. But with the data tier, you tend to scale up quite a bit, which means your costs and your performance are non-linear, and you get to a point where you might not be able to meet your performance needs. With NoSQL — and specifically distributed document databases — the database itself scales out. You can add additional nodes as you go, and your costs as well as your performance tend to stay linear: as your application grows and you have more users, you just add more nodes and distribute the data across them, so you get additional I/O as well as additional memory. That's how you serve your application.

From a scalability perspective, there are a couple of questions as well. Do you constantly keep upgrading your hardware — more CPUs, more disk, more memory? Are you reaching a limit on your read/write throughput? With some of our customers we've seen latencies in the microseconds; with some, throughput can reach millions — you're actually processing a million ops per second, reads and writes, a mix of both. If that's what your application needs, you might want to look at a distributed document database.

I think it depends. In some cases updates are implemented in place, and in that case you might see more disk fragmentation, so you might have to deal with fragmentation issues — though you might have compaction, like a reorg, that goes and cleans up your data. In Couchbase 2.0 we implemented the store as append-only, so writes are not a problem at all — you get very fast drain rates and write rates.

So what are some of the other aspects? The first one is accessing data — and you asked a question about how you actually interact with the system. There aren't standards yet. With relational databases, SQL has been around for a while: a well-defined language, with a lot of extensions. But there isn't a universal way of accessing these databases. At the core you have SDKs — smart client SDKs that understand the cluster topology and give you the performance benefit of auto-sharding. In some cases your queries are executed over HTTP, so you might have an HTTP API to query with. Before you jump in, you'll want to make sure the language you're implementing your application in is supported by these databases. With Couchbase,
we support C, Ruby, PHP, Java, .NET — a wide variety of languages to choose from. But no standards: there's no standard way of accessing all of these in the same way, so your application might be built for a specific database, and you might have to change it if you switch the back end. Couchbase, though, is memcached-compatible — a hundred percent compatible. So if you already have a memcached application, you can just switch memcached out for Couchbase; and memcached is pretty heavily used, across the board, for OLTP applications.

Yes. So CouchDB is a standalone, single-server document database, which already existed, and Membase was a key-value store; if you put those together, you get Couchbase. Couchbase itself is a cluster of distributed, commodity servers — commodity hardware — and that's what it means at the core.

Consistency is the other aspect. With relational you have ACID: you expect multi-document transactions, strong atomicity when multiple records are updated, and so on. At the moment, with Couchbase, we have transactions at the single-document level. That's why, when you have all your related data in a single document for an object, you're atomic: you write to one location, you read from the same location, and so you're strongly consistent. But if you want multiple objects to be updated at the same time, or inserted at the same time, that's not supported at the moment.

Availability — we talked a little bit about this. It's very important, particularly when you're using commodity servers: the quality of the hardware isn't as great, so you need to be able to handle node failures. Within a cluster you have intra-cluster replication, so if a node goes down, you can simply fail over to a replica and keep running the application. You can configure the number of replicas — one, two, or three, depending on the size of your cluster.

It's actually much, much easier. What happens is that the cluster shows up as a single instance, and the client SDKs we support understand the cluster topology. So when you connect using the clients, the client knows exactly where a key is located, via consistent hashing. We have a cluster manager — if we get time, we might go into a little more detail about how that's done — and when the cluster topology changes, maybe when you add nodes, remove nodes, or rebalance data, the cluster map in your clients gets updated as well, and that operation is atomic. So when a node goes down, for example, you start getting "not my data" errors, and the client knows something has changed: it connects to another node, gets the new cluster topology — the new cluster map — and then connects to the right node. That's completely abstracted from the user.

Exactly. That's Couchbase; with Mongo they have another, similar kind of approach: you have auto-sharding, though the data isn't necessarily distributed uniformly, because it's not consistent hashing — but it's similar, and the client SDKs take care of it. Replication is slightly different: replication is for high availability, and that happens irrespective of the cluster map.
So the replication is in the master slave flavor Where in couch base every node is the same every node is equal And so it's you just clone them basically you just add more nodes in Mongo You have a slightly different flavor where you have a master node for the shard and replica nodes And so every server is either a master or a slave and then that's how data is replicated If the master goes down you you the replica will get activated as new master And so the question is is no sequel the right choice for you if you have requirements for security encryption a Kerberos support Complex joins across objects Extreme compression needs and maybe not not yet as I would say for to the gentleman at the back We are fairly early in the product cycle and we hope to add features as as we get demand for them But at the moment there no sequel is a great choice for supporting your interactive web Applications or LTP applications where you need very very high throughput at a low latency at a consistent way So we have about 10 minutes left I had a session a section here about couch base overview, but if there's any questions I'd like to answer those first Would you like to have a brief overview of couch base server, all right, so I'll try to keep this Simple and not make it into an advertisement But couch base is simple fast elastic no sequel it embeds a cache So you don't have a problems with managing a separate caching tier as I mentioned earlier You have online rebalancing online upgrades and maintenance so your application is always running 24-7 and Replication takes care of high availability So if you have no failures, you can you can keep the application going a replica will then take over as the master bunch of Customers that we support Zynga is one of our big users a all advertising as well We have a lot of paid deployments at the moment, but to answer your question about The the couch base architecture and how it how it actually works each node has Two aspects you have the 
data manager and the cluster manager. The cluster manager is always managing the health of the cluster: it's talking to the other nodes to see whether they're healthy, keeping replication going, keeping rebalancing going, and it also manages the UI. The UI is aggregated, so whichever node you go to, it will look exactly the same; it aggregates the results from across the cluster and brings them up for a stats and monitoring perspective. The data manager, on the other hand, manages simple inserts, updates, deletes, querying, and so on. So each node looks exactly the same, and it looks like this. In terms of the deployment, you have your web app at the top, and you have a client library, which is a smart library. The library understands what the cluster map looks like because it connects with the cluster managers and knows where each key lives, so it can connect directly with that server, and the data flows directly to each node. So if you have a key, the client knows exactly where it lives, connects to that node directly, and gets access to it. And this is an example of how we write; your question earlier was about writes, about the speed of writes. It's incredibly fast because we have a tiered system: you have a RAM tier, which is your caching tier, as well as your disk tier. So when a request arrives, "write this key for me," it puts it in RAM. Next it's going to send a response back, because it's just storing it in RAM.
It's really fast; it says, yes, I have written it. Almost immediately it gets replicated to other nodes in the system; this is for durability needs, so you start sending it out to your other replicas. At the same time it's also put into a disk write queue, which then persists it on disk. This way you get high throughput on your writes, you get durability with replication, and you also get persistence by storing it on disk. For the disk, it's CouchDB at the back end, and CouchDB is an append-only system, but we actually persist to it; we have our own persistence layer. Yes, so there is compaction support: you have auto compaction or manual compaction, which triggers every now and then and cleans up old records. That's right. That's right. So we have observe functionality, which means that if you really want your data to be synchronously written out, you have the option of saying: okay, replicate it first, because replication is really fast, right? It's in memory, it's really fast. Replicate it first and then return me the response. So if you don't want it async, if you're creating a new user ID, for example, you want to make sure that it's actually there and it gets persisted, so you have the option to do that. But a lot of our users have data where you don't really need that, and you can have multiple replicas available for durability. That's right, so let's actually step through this really quickly, a couple more minutes, and this is the last slide; I'll go through it and skip to the end. Assume this is where you're starting off, right? You have three servers, you have a bunch of information on each server, and for each document you also have a replica on another node, right?
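The tiered write path just described, acknowledge from RAM, replicate asynchronously, drain a disk write queue, with an observe-style option that waits for replication before acknowledging, might look like this in miniature. The `Node` class and its methods are hypothetical illustrations, not Couchbase internals.

```python
from collections import deque

class Node:
    """Toy model of the tiered write path: RAM first, then replicas and disk."""

    def __init__(self):
        self.ram = {}                      # caching tier: source of the fast ack
        self.replication_queue = deque()   # mutations bound for replica nodes
        self.disk_write_queue = deque()    # drained later by a persistence layer

    def set(self, key, value, sync_replicate=False, replicas=()):
        self.ram[key] = value
        self.replication_queue.append((key, value))
        self.disk_write_queue.append((key, value))
        if sync_replicate:
            # Observe-style durability: replication is in-memory and fast,
            # so wait for replicas to have the mutation before acknowledging.
            self.drain_replication(replicas)
        return "OK"  # acknowledged as soon as the RAM tier has the value

    def drain_replication(self, replicas):
        # Normally runs asynchronously in the background.
        while self.replication_queue:
            key, value = self.replication_queue.popleft()
            for r in replicas:
                r.ram[key] = value

    def drain_disk(self, disk):
        # The persistence layer appends mutations (CouchDB-style append-only).
        while self.disk_write_queue:
            disk.append(self.disk_write_queue.popleft())
```

So the default path trades a small durability window for latency, and the synchronous option closes that window for writes, such as a new user ID, that must not be lost.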
Now, so that's the replica that's sitting on another node. Then you have your application servers, and you might have multiple application servers, and you have your Couchbase client, which is the smart client that includes the cluster map. Here, what happens is you ask for document 5. The client uses the cluster map, hashes the key, and determines that document 5 lives on server one; that's where it needs to connect, and it gets the document back. So that's really how it works at the client-to-server layer. Great question; a great question is one where you actually have it right there. So when you add additional servers, and you can add multiple servers at the same time, you could add five or ten, it doesn't matter, you don't have to do it one at a time. You want to rebalance out because your demand is increasing. What happens with consistent hashing is that the document IDs always hash to the same number, so that doesn't change; what changes is where that virtual bucket lives. So when you rebalance, you start rebalancing specific vBuckets, or virtual buckets as we call them, out to other servers, and the cluster manager knows exactly where those vBuckets live. So the consistent hashing continues; nothing needs to be rehashed. Exactly, exactly, so there is a level of indirection that allows you to do that. And the cluster then tells the clients that the cluster map has changed, so now you're going to go to another node and directly access the data from there. Failover: this is an interesting case where you're connecting to node three here, server three, and something goes wrong and it goes down. What happens is that if you have auto-failover already set up, then your replica documents get promoted to active. So your virtual buckets on your other nodes get pinged and told: hey, you're active now, get promoted. The cluster map gets updated, and you then move to the correct server.
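That level of indirection can be shown in a few lines: the key-to-vBucket hash never changes, and a rebalance only edits the vBucket-to-server assignment. The round-robin map builder and the use of CRC32 here are illustrative assumptions, repeated for self-containment, not the actual Couchbase scheme.

```python
import zlib

NUM_VBUCKETS = 1024

def key_to_vbucket(key: str) -> int:
    # A key always hashes to the same vBucket, whatever the cluster size.
    return zlib.crc32(key.encode()) % NUM_VBUCKETS

def build_map(servers):
    # Illustrative assignment: spread vBuckets round-robin over the servers.
    return [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]

# Rebalance from three servers to four: some vBuckets move to the new
# node, but no document is rehashed, because the key -> vBucket mapping
# is independent of the vBucket -> server mapping.
before = build_map(["s1", "s2", "s3"])
after = build_map(["s1", "s2", "s3", "s4"])
vb = key_to_vbucket("doc5")   # same vBucket before and after the rebalance
```

Failover is the same move in miniature: promoting a replica just rewrites a vBucket's entry in the map to point at the surviving node.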
So that's how failover works. That's right, across the clusters; yeah, it's a gossip-like protocol across the clusters. But with that, thank you very much. I hope you enjoyed the presentation.