Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor of DataVersity. We'd like to thank you for joining this DataVersity webinar: Three Things You Need to Know About Document Data Modeling in NoSQL, sponsored by Couchbase. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We'll be collecting questions via the Q&A panel in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions using the hashtag #DataVersity. If you'd like to chat with us or with each other, feel free to pull down the chat icon in the top right corner of your screen for that panel. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar.

Now let me introduce our speaker for today, Matthew Revell. Matthew is the lead developer advocate at Couchbase in EMEA, where he helps grow the Couchbase community and works with developers to build scalable, low-latency backends for their software projects. And with that, I will turn it over to Matthew to get us started with today's webinar.

Hello, thank you very much. Great to be here. So, document database modeling: it's an interesting subject and one that Couchbase cares a lot about, and I hope I can share some useful things. The first thing it's important to get across today is that we are actually still learning. I'd say the current NoSQL phase probably started around 2005, when CouchDB was launched. Obviously non-relational databases have been around since the first databases, because those were indeed non-relational, and we've had Lotus Notes chugging away in the background.
We've had all sorts of different types of non-relational databases around. But certainly since 2010, things have been growing quite a lot for document-style databases. That's only five years, compared to the forty-odd years we've had to work out what's best to do with relational databases. So when you walk into a bookstore or go on Amazon, you won't see shelves of books like these about data modeling for document databases. There isn't that great wealth of academic study and testing around document database modeling. That's not to say there's none, but it's still early days, and the sort of understanding we're talking about has been gained in the practical application of these databases.

The title of this talk is a little click-baity, three things you need to know, and of course there are more than three things, but it all comes down to an initial question: when do you know the sorts of questions you're going to ask, and where do you want the computation to happen? In the relational world, we always built applications with the idea that the database management system would be responsible for computing the answers to our questions. We write SQL queries and the answers come back with relatively little effort on the part of the application layer. But with non-relational databases, with NoSQL databases, that's changed. Partly that's because the creators of NoSQL databases were focusing on other questions: scalability, uptime, availability, that sort of thing. Query was pushed out of the window for a while, and the answer to a lot of NoSQL query needs has been: well, you handle that in the application layer.
So that's changing now, to an extent, but there's still this question of where you want the computation to happen, and that depends partly on when you know what questions you're going to ask. The query methods here are very Couchbase-specific, but you can take the principles and apply them more generally to other document databases.

So the first criterion is, like I said, when do you know what questions you're going to ask? In the Couchbase world, if you have predictable queries and you're happy to have the computation happen in the application layer, then you can work with the key-value method of using Couchbase. That gives you super-fast response times and strongly consistent answers across a distributed database, but you're asking the application layer to do all the interesting work. In effect, what you're doing is pre-computing the answers to your questions and then storing them in the database. Then, if you have queries that you know at design time you're going to want to ask, but you want to offload some of that computation to the database layer, in the Couchbase world you would use views. If you're familiar with CouchDB, then you'll know what views are: essentially, you're creating secondary indexes on your JSON data using map-reduce queries. There are other things you can do beyond creating secondary indexes, but the primary output is another index on the data that you can query. And then for those queries that you don't know you're going to need up front, the known unknowns, or maybe the unknown unknowns, in the Couchbase world we have N1QL, which is our new SQL-like query language for querying JSON. And that's pretty much, I'd say, the future of querying with Couchbase.
You're going to be able to apply a SQL-like language to huge numbers of JSON documents, and we'll look at that a little more later. But effectively, that gives you the ad hoc query you'd be used to with relational databases, but in a document database.

Okay, so let's start by looking at key-value. When you're storing documents in a document database in a key-value fashion, what do I mean by that? Basically, you have one index, and that is the key you store the document with. So I'm going to go through the idea that instead of treating the database as a great resource of answers that you can piece together however you want, what you do instead is pre-compute answers, store them, and pull them out when you need them. The next thing is very similar: you're storing object states. Then there are two things you need to work out in order to be as optimal as possible, and that is choosing when to embed data in one big document, and when to refer to other documents, in much the same way as we would refer to other rows in other tables in a relational database. And the last thing is that you need to design your keys well.

So let's have a look at pre-computed answers. We're used to asking questions of databases. SQL and the relational model have made us kind of spoiled, really, because you split your data neatly into these normalized tables and rows and so on, which means the database has a very good understanding of the shape of the data and lets you ask almost any question of it. And that's good, that works in many circumstances. But the trade-off, as we know, is that it makes scaling harder, and uptime harder as well, and SQL queries can take quite some time to compute.
So we've come up with ways of ameliorating that, such as introducing caching layers to cache the answers to common queries. The relational world has allowed us to think in terms of: I'll ask whatever question I want and get the answer back. And a lot of the time, the way we design applications, that might not be optimal; we might design our applications in such a way that we're asking the same question over and over again, every time we display something.

The document database way of doing things is more like a library of answers. There are two things there. One is that you acknowledge you might have the same answer in multiple places, but in different contexts. If we think of a traditional library, the theory of evolution, for example, might be described in several different books in the library, and that's fine, because you don't expect there to be one version of it; you expect to find it in different contexts. The other thing is that the context itself is important for how you get to it. If we apply this to databases, then instead of going out and asking the same question, or variations of it, every time you come to need the data, what you do instead is build the answer and then store it as a document.

So what do I mean by that? Well, Martin Fowler is someone who has written a great deal about NoSQL, and on his website he talks about the idea of the aggregate-oriented database, or as I call it here, the answer-oriented database. Here we have an example from his website of an order form. Now, a paper order form in the real world, in a physical transaction, puts all the data for that one transaction together on the same piece of paper.
And then, if we come to enter that into a relational database, what we do is split out all the different components of the order form into rows and tables and store them separately. Whereas in the answer-oriented database, we take all that data from the form and keep it together. We store together the data that we access together, and this principle of storing together the data that we access together is key to how we model our data in an answer-oriented database, or a document database.

Going back to that idea of pre-computing answers, here is a screenshot from Skyscanner. Skyscanner is an aggregation service: you type in your flight requirements, and it goes to multiple different travel agents and so on and gets you the prices. The way they do that is through a mixture of API calls and screen scraping, and that would get very costly if they did the same set of scraping and API calls every single time someone wanted a flight from Manchester to San Francisco, for example. So instead, they run the query once and store the answer as a set of JSON documents, in Couchbase in this case. Then, when someone comes back to do that journey request again, it simply pulls the cached version out of Couchbase. They've already computed the answer, so they can supply it very quickly by pulling the pre-computed answer out of the database. That is one of the key principles: you do the work, probably asynchronously, then store it and present it when it's requested, so that a human being has to wait only a minimal amount of time. Another example would be, say, a social network.
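Before the next example, the Skyscanner-style pattern just described (compute the answer once, store it under a deterministic key, serve the stored answer afterwards) can be sketched roughly like this. This is a stand-in, not Couchbase API code: a plain dict plays the key-value store, and `search_flights` and all its data are invented.

```python
import json

kv_store = {}  # stands in for the key-value document store

def search_flights(origin, dest):
    # Hypothetical expensive step: in reality, many API calls and scrapes.
    return [{"carrier": "ExampleAir", "origin": origin,
             "dest": dest, "price": 199}]

def flights(origin, dest):
    key = f"route::{origin}::{dest}"      # deterministic key for the answer
    if key not in kv_store:
        # Compute once (ideally asynchronously) and store the answer as JSON.
        kv_store[key] = json.dumps(search_flights(origin, dest))
    return json.loads(kv_store[key])      # later requests are a single get
```

In a real deployment, the expensive step would run asynchronously and the stored document would carry an expiry time, so stale prices age out rather than being served forever.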
Instead of building the news feed every time someone views it, by doing all the SQL queries and so on to find out what their contacts are doing or what updates they've posted, when someone posts an update you build, asynchronously, all the different news feeds affected by that update. You store them as JSON documents, and then pull them out with a single get request when one of the followers wants to view their news feed.

Okay, so that brings us on to the idea of embedding and referring to data. Here we've got a very simple e-commerce order represented as a set of tables. We have the order details, which are linked to the master order record. Then we see customer number 40, and there are customer number 40's details. And we're actually ordering product item number one, and there is product item number one up in the top left. So we're quite familiar with that, and obviously that's normalized: we have references, or joins, to each part of the data, we don't duplicate data, and doing so would be a sin.

Whereas in a JSON document database, it's quite okay to have a single document that represents all of that. Here we have the customer's details, all of the product details, and everything we need for that order in one JSON document. So that's a heavily denormalized, embedded-document version of the order, and that's okay; that's one way of doing it. But we could also still split out data and have canonical copies of particular records, rather than duplicating that data across the database. So let's do that. Here we have just the one order document, and because we don't need to normalize, it's okay for us to embed an object in our order data: we could have multiple lines, multiple items, represented here. So we lose one of the joins from the relational version.
But we're still pointing out to other documents. Here we have customer ID 40, so we have a canonical record for this customer that we link to, and then a canonical record for product number one, and we're not duplicating that across the database. Very simple stuff: in one version we're embedding, in the other we're referring.

So when should you embed data? Largely it's a mixture of two questions. Is speed of access the most important thing for you? Because if it is, then, depending on document size, embedding the data in one document is going to be quicker than doing several different lookups to retrieve three or four documents to get the same data back. The other question is: what is the ratio of writes to reads? If you have a very read-heavy workload on that data, then maybe an embedded document is the way to go, because if you're mostly reading the data, you avoid one of the problems of referring to data, and that is the lack of transactions in most non-relational databases. If you want to do an update that affects multiple documents, and it goes wrong halfway through, you've got to handle that in the application layer; most non-relational databases won't help you with that. The other thing is that you should embed data when you're happy with duplicating data (the slide says the opposite; that's an error, sorry), or, as I like to say, when the application layer is capable of keeping all the multiple copies of that data in sync. If the customer's address is copied across all of the order documents representing the orders that person has made, and they then want to update the address on their open orders to change where the goods will go, then the application layer needs to be the one making sure that that update across multiple documents goes well. So when should you refer to data?
Well, basically there are two main cases. In the Couchbase world we suggest that you refer to data as often as possible, and I'll explain why in a moment. One reason is that referring to data gives you consistency. Again, it's not rocket science, but if there's only one copy of the customer record, and it's updated in one place, then that update will be reflected in every place that refers to it. The other case is where your data has large growth potential. If your document database is recording instant messaging conversations, the conversation between two individuals could span many years and become quite enormous. So rather than embed all of that conversation data in one document, you might prefer to paginate by day, and then have a master document that links out to all of the sub-documents that account for each day.

So why do I say we tend to recommend referring to data in the Couchbase world? It's primarily about the nature of Couchbase itself, because Couchbase has an integrated memcached layer. Anything in your working set, your go-to data, is in RAM, so you're talking about sub-millisecond response times, certainly for writes and for reads of any data that's in your cache. If you size your RAM appropriately, you can get some very fast response times from Couchbase. And if it takes one and a half milliseconds to read two documents, or two milliseconds to read three or four documents, that's not really a great cost when you consider that on a spinning disk, just the disk seeks for your SQL query in a relational database could take that much time. So we tend to say: refer where you can, because that will ensure greater consistency of your data. Okay, so that brings us on to key design.
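Before moving on to keys, the embed-versus-refer choice above might look like this, written as plain Python dicts rather than real Couchbase documents; every field name and value here is invented for illustration.

```python
# Embedded: everything needed to display the order lives in one document.
order_embedded = {
    "type": "order",
    "customer": {"id": 40, "name": "Ann Example", "address": "1 High St"},
    "items": [{"id": 1, "name": "Widget", "price": 9.99, "quantity": 2}],
}

# Referring: the order keeps its own line items embedded (bounded per order),
# but points at canonical customer and product documents by key
# instead of copying their contents.
order_referring = {
    "type": "order",
    "customer_id": "customer::40",
    "items": [{"product_id": "product::1", "quantity": 2}],
}

def referenced_keys(order):
    """Collect the keys of the documents this order refers to."""
    return [order["customer_id"]] + [i["product_id"] for i in order["items"]]
```

The embedded version is one get and one self-contained answer; the referring version costs a few extra gets but keeps a single canonical customer record and product record, which is what gives you the consistency discussed above.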
So when we're dealing with key-value data, the key is absolutely the most important thing you have, because if you cannot recreate that key again, you'll have trouble finding your data. Now, with most document databases, Couchbase included, there are other ways to get at it: you could build a view, for example, to find your data, or you could do a N1QL query with Couchbase to find that data again. But if you're sticking to a pure key-value model, then the way you design the key is absolutely crucial. There are three broad types of key that we tend to see. One is something that's deterministic from data you already have. Imagine you're creating a user profile for a website where users log in using their email address. At the point of login, you have their email address, so that's something you know about that person, and you could key their user profile with it: you just look up the email address and then you have a route into all of that person's data. Another type is some kind of random or computer-generated key. And then there are compound keys: these can be a deterministic portion with some semantic loading on it, or a UID with a deterministic portion, and so on. We'll look at those in a moment.

So here we have a user profile modeled as a Java class, and its JSON equivalent, and we key it using the email address. That's great. Like I say, it means that when I log in with my matthew@couchbase.com email address, we can do a simple lookup on that key and get my user profile back. Brilliant. But what happens when I want to change my email address? Well, there are a couple of things we can do. We could create an entirely new document keyed by the new email address and then destroy the old one. But maybe that feels a bit messy. It doesn't feel quite right.
It could leave a trail of orphan documents, perhaps, if the delete didn't happen quite correctly in the application layer. So what else could we do? We could have a lookup document, where matthew@couchbase.com simply points over to the new document, the new user profile. So there are a couple of things we could do, but maybe they're a bit messy, I don't know.

What we could do instead is remove the loading on the key and completely separate the key from the email address. Now we've got some kind of computer-generated key, and the email address itself moves into the document. That's great: I can now change my email address whenever I like. But it also means that if the key is 1001, we have the issue of how we find that person's data with a key lookup, because they log in with an email address, and the email address no longer corresponds to the key. We could make them log in with whatever key we've got; we could say, right, your user ID is now 1001. But that doesn't seem very fair. So instead we could come up with something else.

The way we would do that in Couchbase is to use what is basically a manual secondary index: a lookup document. In this case, the path we'd follow is that when you create a new user profile, you use an atomic counter in Couchbase. That's an increment on a particular key, which gives you back the next number to use as the user ID. That might be 1001, and you add a new document keyed by that number. The number has no meaning; it doesn't matter, it's just a number. So you save your user profile data keyed by that number. But the important next step is that you then create another document that is keyed by the email address, and its value is simply the number that you used to key the user profile.
So if you do a get on the email address, that returns 1001, and then you can do a get on 1001, and that gives you the user profile document back. Again, none of this is rocket science; it's all quite simple stuff that people have been putting into practice, and I'm sure that once there's more academic study of this type of thing, it will look quite naive. But it works, it works in practice. And we see in the wild people with multiple lookup documents. So you might have a single profile document keyed, like I say, by some number, but the person might have multiple email addresses associated with them, so you can have multiple lookup documents keyed by email address, where the value is just the 1001 key of the profile document. You might have some kind of Twitter API ID, or a Netflix ID, or a username they use somewhere; it could be anything you want.

Something to point out here is that in the Couchbase world, certainly, people tend to use key prefixes to denote the content and the type of the data stored in the document. Here we've got u, colon, colon, which means we have user profile data in that document, and fb, colon, colon means that the rest of the key denotes some kind of Facebook API token or similar. So what we're doing is semantically loading the key so that you can find the data. Now, in other databases you might use collections or something similar to group together types of data like this, but in the Couchbase world our nearest equivalent is a bucket, and a bucket is more an allocation of resources than it is a semantic grouping or a namespacing. So we tend to namespace keys in this way. So, yeah: compound keys, and lookup documents with predictable names, just as we saw there with something like fb, colon, colon, and so on.
And this is really putting into practice that idea of referring to data, by using a lookup document as a manual secondary index. Okay, so here we have my user profile again. If this were an e-commerce system, we might be tracking the various things a user has looked at. One way to do that would be to load that into the user profile, but like I said, it's a good idea to avoid having potentially unbounded data loaded into another document, and the list of products I've looked at could get quite large. So what we could do instead is have a products-viewed document that simply contains an array of the various products I've looked at. What we're doing is building up our keys: our key is u, colon, colon, then the user profile ID, which as we said is a random number, then colon, colon, and then products viewed. So we semantically load the key name to tell us what's in the document. Similarly, for product data, p, colon, colon, followed by the product's ID will give us the product data; we do a lookup on that. And then we might decide to put the image URL into another document again, by appending colon, colon, img onto the end of the product's key name. Building up our keys like this means it's easier to reason about what a key name might be when we want to come back later and find that data.

Okay, so that was key-value. I want to talk a little about the other ways in Couchbase of indexing and querying data. This is quite new, not only for Couchbase but for NoSQL in general. Like I say, a lot of NoSQL has put query to one side. Not everyone has, but certainly for those databases that were focusing mostly on scale and uptime, query became a secondary concern. So this is something quite different for document databases, or at least scalable document databases. But it's okay, you'll find yourself quite at home, because we're dealing in fairly familiar concepts.
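Before moving on to querying, the key-value patterns above (the atomic counter, the lookup document, and compound key names) can be sketched like this. Again a stand-in, not SDK code: a dict plays the key-value store, a plain integer plays Couchbase's atomic counter, and all the helper names are invented.

```python
kv = {}                                  # stands in for the key-value store
counters = {"user::counter": 1000}       # stands in for a counter document

def next_user_id():
    # Stands in for an atomic increment on the counter key.
    counters["user::counter"] += 1
    return counters["user::counter"]

def create_user(email, profile):
    uid = next_user_id()
    kv[f"u::{uid}"] = dict(profile, email=email)  # canonical profile document
    kv[f"email::{email}"] = uid                   # lookup document: email -> id
    return uid

def get_user_by_email(email):
    uid = kv[f"email::{email}"]          # first get: the lookup document
    return kv[f"u::{uid}"]               # second get: the profile itself

def products_viewed_key(uid):
    # Compound key: the unbounded products-viewed list gets its own document.
    return f"u::{uid}::products_viewed"
```

Changing an email address now means writing one new lookup document and deleting the old one; the profile's own key, u::1001, never changes.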
So the two ways we have of generating these additional queries in the Couchbase world are, as I said at the beginning, N1QL, which stands for non-first normal form query language. That refers to first normal form, which says that the data you put into a cell of a relational database has to be just one item of data; you can't have nested data. And obviously documents are all about nested data. So we wanted to come up with a query language that was similar to SQL, and in fact basically is SQL, but with additional functionality that lets you deal with the nested nature of JSON documents. And then the other way is, again, as I mentioned earlier, views. Views are something that's been around in the Couch family for quite a long time, starting off with CouchDB and then coming into Couchbase. Those are map-reduce queries that you write in JavaScript. Most people just write the map side, but you can write the reduce side as well, and they allow you to emit indexes based on the content of JSON documents. A very simple one might be: if you have a group of people, you could emit the key of every document from that list of people where the city equals Paris, or something like that. So instead of manual secondary indexes, views give you a way of creating automatically generated secondary indexes.

And now that we have these two ways of doing more interesting queries in Couchbase (I say now: N1QL is in beta right now, and Couchbase 4.0, which will ship with N1QL, is out later this year), it's useful to come up with some ideas for when to use which. As I hinted at the beginning, N1QL is really, we think, the future of Couchbase.
It's really the way you're going to do most of your querying. But that doesn't mean key-value goes out of the window, not by any means. N1QL gives you the ability to do ad hoc querying on JSON data, data that you just can't get at from key-value lookups, at least not easily, or from views. So we'd say that if you're doing ad hoc querying, N1QL is probably the way you want to go. Whereas if you have predictable queries, where you know you want to emit an index on certain data, then a view might be the thing to do, because you're not so much querying data as creating another index, a view, to query.

Then there's bringing in the reduce side of map-reduce. Views are really great when you want to do some number crunching. One of the sample datasets that ships with Couchbase is a set of beers and breweries, and you can very quickly write a function that will let you emit a list of breweries in order of the maximum ABV of their beers, or a reduce that would let you sum up all the ABVs for a brewery and then list the breweries in order of total ABV, something like that. So whereas views are good on the number-crunching side, N1QL is more about dealing with, I suppose, textual data, the sort of stuff you find in JSON, and particularly nested JSON data. You could have an object that has several arrays, which then have other data inside them, going all the way down, and N1QL makes it easy to go into those layers, pull out what you want from them, and mix the results together with other data from other documents.
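Real Couchbase views are map (and optionally reduce) functions written in JavaScript; this Python sketch only imitates the mechanics on a few invented beer documents, to show the shape of the computation described above.

```python
# Invented sample documents, loosely after the beer sample dataset.
docs = [
    {"type": "beer", "brewery": "Hoppy Co", "name": "Pale", "abv": 5.0},
    {"type": "beer", "brewery": "Hoppy Co", "name": "Stout", "abv": 7.5},
    {"type": "beer", "brewery": "Malt Ltd", "name": "Mild", "abv": 3.5},
]

def map_fn(doc):
    # Like emit(doc.brewery, doc.abv) in a JavaScript map function:
    # documents without the fields are simply skipped.
    if doc.get("type") == "beer" and "abv" in doc:
        yield doc["brewery"], doc["abv"]

def reduce_sum(pairs):
    # Like a built-in _sum reduce, grouped by key.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

index = [pair for doc in docs for pair in map_fn(doc)]  # the secondary index
totals = reduce_sum(index)                              # total ABV per brewery
```

The map step runs over every document and emits key/value pairs, which become the secondary index; the reduce step then number-crunches those pairs, here summing ABV per brewery.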
And then just a quick note, more on the ops side than the development side, about the way N1QL and views work. Views are the typical map-reduce thing, where the cluster sends out the work to all of the servers: the index runs where the data exists, and the results are passed back and put together. Whereas with N1QL, if you have very large clusters, you might end up running the indexing and query services on separate servers from the data service. That lets you grow your cluster according to the nature of your usage of it. Anyway, like I said, that's more on the ops side than the development side.

So with N1QL, there are a few things to keep in mind: indexes, stating types manually, and then keyspaces, which basically means joins. Okay, so with indexes: N1QL clearly relies on an index, so you always need at least one index, as you would with traditional SQL. You create your primary index, and then anything else you do is the equivalent of a full table scan until you create secondary indexes. So you can test things out, and once you understand where you're going, you create the additional indexes on the data in the JSON documents that allow you to query things much, much quicker. And perhaps the key difference between N1QL indexes and SQL indexes is that a N1QL index won't cause you trouble if one of the documents doesn't have that particular key in it; it'll just ignore that document. And clearly that's really important with flexible-schema JSON documents. There are two different types of index in Couchbase for N1QL. We see views there; like I say, views are, at the most basic, a way of generating a secondary index.
And then we have a new type of index, global secondary indexes, which, like I say, run on a separate indexing service that can coexist with the data service. So every single server in your cluster can still look the same as every other one in terms of functionality, or you can split them out if you choose. So when do you use GSIs and when do you use views? I think, basically, global secondary indexes will be the main way people create secondary indexes in Couchbase. But views have something that GSIs don't, and that's support for multi-dimensional and geospatial queries. So views are not dead by any means; in fact, we're doing more with them than ever, geospatial and multi-dimensional being part of that. It's probably not that interesting to go into the detail right now of which type of index you would use, but it's something to bear in mind when working with N1QL.

And one of the things I'm particularly fond of with N1QL is that you can do joins across JSON documents. Basically, you include the key name of one document in another document, and then you can write, effectively, SQL that says: give me all of the airlines that fly this particular route, or give me all of the routes from this particular airline, doing a join as you would with SQL. And we work across keyspaces rather than tables, and also within keyspaces: you can do a join across documents inside the same bucket, or keyspace, or across multiple buckets or keyspaces. The reason there's that difference in terminology between keyspaces and buckets is that keyspaces could come to mean something else later on; it's just future-proofing the language.
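As a sketch of what those document-to-document joins might look like, here are some N1QL statements in the style of the 4.0 beta. The bucket name `travel` and all the field names are invented for illustration, so treat the exact syntax as approximate rather than authoritative.

```sql
-- N1QL always needs at least one index; start with the primary index.
CREATE PRIMARY INDEX ON travel;

-- Route documents store the key of their airline document in airlineid;
-- ON KEYS follows that reference, joining document to document.
SELECT r.sourceairport, r.destinationairport, a.name AS airline
FROM travel r JOIN travel a ON KEYS r.airlineid
WHERE r.type = "route";
```

Note that this is a join within a single bucket (keyspace); joining across buckets works the same way, with a different bucket name on each side of the JOIN.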
And something that I mentioned earlier, which it's really important we come back to now: offloading computation to the database layer, rather than constantly asking the application layer to handle it. N1QL allows you to do a lot of the data work at the Couchbase cluster level now. A lot of the hard work that you would have had to do by hand, effectively, in the application layer now happens at the database layer. Okay, so that's pretty much what we've found at Couchbase in terms of data modeling. Those are the basics. If you go to blog.couchbase.com, you'll see some great articles on modeling, like a user profile store and different types of scenarios. But yeah, I'd love to have some questions, and thanks for listening so far. Matthew, thank you so much for this great presentation. We have some questions coming in, certainly. And if you have questions for Matthew, submit them in the Q&A section in the bottom right-hand corner of your screen. And of course, one of the most common questions we get is whether people will get a copy of the slides and the recording. I will send a follow-up email out to everyone with exactly that, and anything else requested throughout the webinar, by end of day Thursday. So let's just jump right into it. Matthew, the first question coming through is: would you recommend papers and books expounding on what you outlined in slides 16 and 17, via number one, the map, and number two, the usage pattern, to help data store design? I'll be quite upfront on that: nothing comes to mind immediately in terms of real formal learning. But there are certainly some great blog posts out there. And if we talk more generally than Couchbase, there are certainly books written for other document databases that go into some of this.
And then there are some new Couchbase books as well that go into modelling data. But really, I think the most interesting stuff is published on blogs right now. Certainly on our own Couchbase blog, but also people using all sorts of document databases are writing interesting stuff and giving interesting talks at conferences about how to do it. And I'm quite sure that there are people working on the more academic end, but I haven't seen anything myself lately. Thank you very much. And you know, we recently sent out a survey on that particular question and topic, and I would have to agree with you, Matthew, that blogs have most certainly been the number one resource for people on this topic. The next question coming in is: is the ability to do joins across buckets something that is only available in Couchbase? For example, can joins across collections be done in other databases as well? Well, joins of the SQL type, using effectively just SQL, are, as far as I know, something that only Couchbase does right now. Certainly there are ways of querying other document databases that might, with some work on the application layer, give you an equivalent result, but probably not with the same level of familiarity, because what we have is effectively SQL with some tweaks to handle JSON. And going back to the blog site you mentioned for Couchbase, it's blog.couchbase.com. Is that what you said? Blog.couchbase.com, yeah. Okay, I'll make sure to put that in the follow-up email as well; there's another request for that reference. Specific to Couchbase: will N1QL come to Couchbase Lite? I hope so. I certainly think that's a longer-term aim. At the moment, with Couchbase Lite, you're primarily doing the MapReduce view type of querying.
I know that on the iOS side of things we've certainly built in a query model where you're able to query the data in that kind of way, and that's something we're working on bringing to the Android and Xamarin versions as well. But certainly our plan is that N1QL should be everywhere that you see Couchbase. And so in time, N1QL will come to Couchbase Lite. But I'm going to have to say that I don't know when that will happen, only that I hope it's soon. Perfect. So another question specific to Couchbase: is this designed to run on commodity hardware? Yes, yeah. I think we'd have to leave our NoSQL membership badge at the door if we said it wasn't. I can't really think of any major deployments of Couchbase on anything other than commodity hardware, or in EC2, or Microsoft's Azure; we have people running it in all sorts of places like that. So, to borrow a phrase, it's like the pets-versus-cattle thing. It's very much a case of treating your Couchbase server nodes almost as cattle, because if one goes away, hey, it's all right, there are others. So yeah, it's designed very much to run on commodity hardware or in unreliable cloud hosting services. Yeah. So, kind of going back to a couple of the other questions, Matthew: how efficient are joins in N1QL? That's a really great question. It remains to be seen. I know that's a bit of a cop-out, but part of what's happening between now and the general availability of N1QL in Couchbase 4 is that we're working on improving the efficiency of the query engine. And we're already pretty happy with how it is.
But the other reason I can't really answer that is because, right now, I think DirecTV are the only people running N1QL in production, and that's using the beta version. That's something that was spoken about at the Couchbase Connect conference last month, where some of the queries for their EPG, that sort of "if you watched this, you might also like this programme" type of query, are happening with N1QL. So it's hard to say how efficient it is, but certainly efficiency is our primary aim, because Couchbase has got a reputation for being really fast, and we don't want to lose that by having an inefficient query engine. So watch this space, I guess, is the answer. That makes sense. I love all these questions coming in specifically about N1QL. For N1QL, do you separately define the indexes before you execute the query? Yes, but you don't have to. Like I said, if you don't define an index, then you're doing the equivalent of a full table scan, which, if you've got millions of documents, probably isn't going to be efficient, but it'll give you a feeling for what the answer is. Well, it'll give you the answer, and then you can build the index for that. So you only ever need the primary index, and with that you can do any query you want; but if you want queries to be efficient, you'll need to build those secondary indexes. All right. And so, again, continuing along those lines: is it better to embed an array of foreign keys within a document to refer to other documents, or is it better to have a foreign key in the other documents that are queried against? I'm thinking of a traditional many-to-many relationship where you might have a special join table used to link tables. So, that sort of question is the sort that we generally answer with: well, would you like to hear about our consulting facility? No, seriously.
Look, it really depends on the use case itself, but generally speaking, I would say that embedding foreign keys within a document is probably okay, because, yeah, it does depend on the use case and the data model. What I would say is you'd probably embed both ways if you want absolute query flexibility. And there's not much penalty for that either, because you're talking about tiny additions to documents as a result, and you're not going to slow down indexing or anything like that if you're doing pure KV. So there isn't really any additional pain, other than that you have to handle it in your application layer. So I'm going to kind of cop out and say that I don't really have the time, or the insight into the questioner's data model, to go into it any more right now. So, another N1QL question: will there be support for that in Couchbase Lite, do you think? I think we covered that earlier, but yeah, I hope so. I hope we do get to that. I mean, the answer is yes, and the reason I say "I hope so" is because we don't have a definite roadmap date for it just now. But clearly the processing power of a mobile phone or a small embedded device is somewhat different. So we're looking right now at the best way to get query, or more sophisticated query, into Couchbase Lite, and N1QL will, I'm sure, be part of that story. Love it. I love all the insight into what's coming. So, where does Couchbase fit into the CAP theorem? Well, we're strongly consistent. We favor CP over AP. So there's a single master, sorry, a single active copy of each record, and then there are replicas waiting in the wings. You do all your writes and reads with that one active copy. You can do replica reads if you want, with the caveat that they might be out of sync. But certainly, for KV, it's strongly consistent within a cluster. And that's the right trade-off for some people.
And for other use cases, availability might be what you're looking for. But generally speaking, availability isn't a problem with Couchbase; we do favor CP, though. Interesting. And back to the modeling questions along that line: are any of your clients actually modeling the documents, the key indexes, and fields ahead of time, or is it more ad hoc? I'd say that a large portion of what we do at Couchbase, working with customers on the development side, is working with them to work out those data models. I'd say that if you're coming up with your key naming and so on in an ad hoc fashion, then you'll pay the price for it later on. So certainly, if you want an efficient application layer accessing the data, if you want to use the database itself efficiently, it pays to spend a few days in front of a whiteboard working out what's going on. Sure, that certainly makes sense. I don't see any other questions coming through; go ahead and type any in if you have more questions. So, Matthew, before we close it up, what's the number one modeling issue that you see from your clients that maybe we haven't addressed yet? So far, I think it's just getting that idea... Well, there are two things. One is the sheer simplicity of it, which also means that it's easy to end up with a situation you have to handle yourself. So there's that. And things like having to manually put a key-value pair in your JSON document that says what type of document it is, because then you can do a query in N1QL, or with a view, that says: emit all of the documents that have type "user profile". And the other thing is schema... schema... what's the word? I've completely forgotten the English word I'm looking for, but it's recording in the document the schema number, or something like that.
So you might start with schema number one as your first... versioning, that's it. So when you create a new schema type, you would have schema version number two. And it's really important to record in your documents what schema version they're using, because there's no other way of telling. Then, once you record that data, you can say: well, give me all of the documents with a schema version less than the current one, and then I can go through some process in the application layer of updating them. Thank you, that's perfect. That actually prompted a couple more questions, which I think we have time for. One question is: what is the speed of the views' pre-computed document results? Specifically, the person is interested in the latency. And then there's also a separate question; well, let's answer that question first, and then I'll get to the other one if we have time. Okay. Well, views in Couchbase are an index that is, by default and deliberately, eventually consistent. So they run at a slight lag behind the key-value view of Couchbase. And the reason for that is that we don't want to hold up the entire cluster by saying, well, before you write this, we need to update the views and the indexing and so on, because then we'd end up in a slightly less available and partition-tolerant database situation. So generally speaking, the default is that the view index runs every five seconds or every 5,000 writes. But that's variable; you can change it. And also, at the time of reading that index, you can say: well, I'm happy to have a stale index, whatever state it was in when it was last run; or, please run the indexer again before giving me the result. I mean, you pay a slight penalty for the latter.
But the actual latency, the actual hit that you're taking, depends very much on the size of your dataset, the CPU power of your servers, and the size of your cluster. So again, it's a bit of a "how long is a piece of string" sort of situation. But it certainly does run at a slight lag behind the key-value side, if you like. Matthew, thank you. We do have one more question, if you have a quick answer: what are your suggestions on a sharding approach for Couchbase? My suggestion is: don't worry about it. Couchbase handles the sharding automatically. It does a CRC32 hash on your bucket and key name, and the cluster itself is a huge hash space; the hash number that comes out at the end of the CRC32 process determines where in the cluster the document lives. So that's one of the beauties of it: you don't have to worry about it. That's a very short answer indeed. That's perfect. Matthew, thank you so much for this presentation today and for the Q&A. Just to remind everyone, we will be posting the recording of the webinar and the slides to dataversity.net within two business days, and I will send an email out to everybody with all of that information. And thanks to Couchbase for sponsoring today's webinar. Always great to have you guys join us. And I hope everyone has a great day. Thanks to our attendees for all the great questions; we just love the engagement, as always. So Matthew, thank you again so much, and especially for joining us from the UK; it's late in the evening for you, I noticed. So I really appreciate your time. It's been fun. Thank you. Thank you, everyone. Have a great day.
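The automatic sharding Matthew describes can be sketched in Python. This is a simplified model of the idea, a hash of the document key mapped onto a fixed hash space; the real Couchbase placement algorithm differs in detail, and the function name here is hypothetical:

```python
import zlib

# Couchbase partitions each bucket's hash space into a fixed number
# of virtual buckets (vBuckets); 1024 is the usual default.
NUM_VBUCKETS = 1024

def vbucket_for_key(key: str) -> int:
    """Simplified sketch of key placement: hash the document key with
    CRC32 and map it onto the vBucket space. The cluster map then
    assigns each vBucket to a server, so the application never has
    to choose a shard itself."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

# The same key always hashes to the same vBucket, which is why
# rebalancing can move vBuckets between nodes without the client
# needing any sharding logic of its own.
print(vbucket_for_key("user::1"))
```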