Hello, and welcome. My name is Shannon Kemp and I'm the executive editor of Data Diversity. We'd like to thank you for joining this Data Diversity webinar, NoSQL Data Modeling Using JSON Documents: A Practical Approach, sponsored today by Couchbase. Just a couple of points to get us started. Due to the large number of people who attend these sessions, you will be muted during the webinar. For questions, we'll be collecting them via the Q&A in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share questions via Twitter using hashtag Data Diversity. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce to you our speaker for today, Dave Siglow. Dave is the director of technical product marketing at Couchbase and a 30-year veteran of the database industry. David previously spent over eight years in senior product management roles at Oracle. Most recently, he was the product lead for the Oracle NoSQL Database, Oracle Berkeley DB, and Oracle Database Mobile Server products. Before that, he was a VP of engineering at Sleepycat, the company behind Berkeley DB, and held senior technical roles at Informix, Elastra, and Firsida. Dave started his career as a developer in the oil and gas industry. And with that, I will turn the floor over to David to get today's webinar started. Well, thank you very much for that introduction, and welcome. Thank you, everyone, for attending today and giving me some of your busy day. Hopefully this will be a useful session for you in terms of thinking about how to approach data modeling for NoSQL and JSON documents. A little bit about myself. As the introduction mentioned, I am the director of technical product marketing for Couchbase. This is actually my first marketing job ever. 
I've usually been on the product side of the house. I describe myself as a database guy. Pretty much, you name the database technology and I've probably worked on it; you name the database company and I've probably worked for them, or I know a lot of the folks that started them. I've mostly been on the product side of the house: engineering, product management, QA, support, et cetera. It's kind of exciting to take a different role at Couchbase and work in the technical product marketing space. My way of looking at technology and databases is that they're only really useful when they're deployed and solving real-world problems. If you build a technology and nobody uses it, it was an interesting academic exercise, but we can all go home now. So today we're going to talk a little bit about what Couchbase is and why people are using NoSQL. Hopefully we'll go through that section fairly quickly and spend most of our time talking about how you identify the proper application that can benefit from JSON and NoSQL, how you approach data modeling for that application and for JSON, how you access that data, and how you migrate data into a NoSQL database. Then we'll leave room at the end for Q&A. So without further ado, let's talk a little bit about what Couchbase is. Clearly, I have to give the company I work for a plug. Couchbase provides a couple of different database products that deliver the functionality that's needed for digital-economy applications. It provides storage, obviously, for JSON data, but it also provides a query interface, a methodology for building indexes on a distributed system, and active-active replication. Using Couchbase Mobile, you can build mobile applications that automatically synchronize either with each other in a peer-to-peer configuration or with Couchbase Server as a back-end database. 
And we've recently announced additions to the product for full-text search and analytics. The interesting thing about this is that it provides a unified programming interface, a unified API, to access all of these different functions, and a unified administration and management console that lets you configure and manage each one of these services within a single cluster. So you can have clusters of different configurations that have different levels of query support, indexing support, and replication; they can support mobile applications, search, and analytics. What many people are doing in the NoSQL and big data space today is building specialized clusters. So I have a cluster for my data, I have a cluster for my queries, I have a cluster for indexing, and over here I have Kafka, which is a cluster for my replication. And I might have a cluster running Elasticsearch and yet another cluster for Spark. What we keep hearing from customers is that while this is great when you need it, most applications don't necessarily need the sophistication and the administrative overhead that each one of these separate clusters brings to the table. So for a lot of customers, the fact that they can get a single platform that supports all of this from a unified API and management perspective vastly simplifies their introduction to NoSQL and to building applications for the digital economy. Very quickly, some of the customers who are using Couchbase (NoSQL generally, but Couchbase specifically) include top companies in e-commerce, the global distribution services for travel, media, gaming, and financial services. And here's the obligatory logo slide. We also see adoption of NoSQL in general, and Couchbase specifically, in areas like healthcare, manufacturing, and utilities, but those five areas are probably where we've seen the biggest adoption for the technology and for the company. 
Healthcare is certainly an interesting space that's growing quickly, but it hasn't quite gotten there yet; I think we'll start to see much more healthcare adoption in 2017. Customers approach NoSQL from different perspectives. If you look at customers like Gannett and Marriott, they both had very large relational database implementations, legacy systems that they built 15 or 20 years ago and have been fine-tuning, trying to get more performance and scalability out of them, over the last five, 10, 15 years. For them, the journey to NoSQL started with: okay, let me identify the places where I can replace or enhance my relational database and gradually introduce NoSQL into my infrastructure. If you look at folks like eBay and cars.com, they had a slightly different problem. They couldn't approach the problem gradually, because they had a scalability problem that was hitting them in the face right now: they were reaching the limits of the performance and scalability they could get out of a relational database. So eBay and cars.com, for example, were customers who said, we need to embrace NoSQL now. This is not a gradual process; we need to look at our service-oriented architecture and implement NoSQL-based applications immediately in order to scale and give our customers the rich, personalized experience that they want. And then finally, Equifax is a good example of a customer who looked at NoSQL and said, you know, I need to build a brand-new application that does something we don't provide today. I could try to fit this application onto my existing data warehouse framework, which has all these years of credit history, or I could build it so that it uses NoSQL and interfaces with the data warehouse, which is exactly what Equifax did. So they enhanced their data warehouse by adding a new capability and a new type of application to their infrastructure. 
For those of you not familiar with what NoSQL is, it's essentially a non-relational database. It's not that it doesn't have SQL, because some NoSQL databases do, and we'll talk about that a little more later in the presentation. It doesn't really mean "not only SQL" either, because what's supported is actually probably a subset of SQL for many NoSQL products. It basically means that data is not being stored using the standard relational assumptions. Part of that affects data modeling, which we'll talk about today, and other parts of that non-relational aspect also affect transactions and how the database actually treats data. Most of the NoSQL systems on the market are distributed: they're built so that you can essentially store your data across a set of nodes in a cluster, and the idea is that instead of buying bigger and bigger machines, you simply add more machines to your cluster. You scale out instead of scaling up. The data itself is partitioned automatically and replicated for high availability and scalability. In other words, NoSQL systems keep multiple copies of the data for high availability and then distribute those copies across the cluster, so that if a given node fails, the data is still available from one of the other nodes where the copies reside. And the NoSQL database manages that automatically. Most NoSQL databases are called schemaless, which means that you can store and retrieve data without having to specify a schema beforehand. Some of them support JSON, including Couchbase; others support other data formats. In the last year, what we've seen in the NoSQL space is that companies have gone from being single-model solutions to multi-model. So now what you'll find in the NoSQL space isn't necessarily a document database or a graph database or a columnar database; you'll see different combinations of data models, depending on what their customers are asking for. 
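To make the partitioning-and-replication idea concrete, here is a minimal sketch of hash-based key placement in Python. The partition count of 1,024 mirrors Couchbase's default vBucket count and CRC32 is the hash it uses for key distribution, but the node names, replica count, and round-robin placement scheme below are simplified illustrations, not the actual implementation.

```python
import zlib

NUM_VBUCKETS = 1024              # Couchbase's default virtual-bucket count
NODES = ["node0", "node1", "node2"]  # hypothetical three-node cluster
NUM_REPLICAS = 1                 # one extra copy of each partition for availability

def vbucket_for(key: str) -> int:
    # Hash the key to a partition; any uniform hash works for the sketch.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def nodes_for(key: str) -> list:
    # The active copy plus each replica land on distinct nodes, so losing
    # one node still leaves a copy of every partition somewhere else.
    vb = vbucket_for(key)
    return [NODES[(vb + i) % len(NODES)] for i in range(NUM_REPLICAS + 1)]

active, replica = nodes_for("user::1234")
```

The point of the sketch: the client can compute where a key lives from the key alone, which is why a lookup touches exactly one node, and why adding nodes (scaling out) just means reshuffling vBuckets rather than re-architecting the data.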
In the case of Couchbase, we are a key-value and document NoSQL database, so you can store data in both types of models, in both types of structures. So why are customers looking at NoSQL? Fundamentally, there are technology drivers in the digital economy that are pushing customers online and pushing for customized, personalized customer experiences. And enterprises are looking at this and saying, okay, I've got more data, I've got more applications sharing data, the internet is connecting all of my applications together, and I'm moving some of my applications to the cloud. These are the things pushing companies to innovate on technology that they might have had for quite a while. So the technical needs are: there's a huge amount of innovation, and therefore I need something that's very flexible and very simple to change and update. And I need to be able to operate at any scale: I need to be able to deploy something that might start out small, but that I can quickly grow in response to customer demand or the number of applications that come in and start sharing and using that data. And from a business perspective, timeliness is crucial. If you look at companies like Uber and Lyft and Viber, a lot of these companies are coming in and disrupting existing companies and existing economies, and they're doing it by innovating first. One of the things that customers have seen quite a bit is that if they don't move quickly, and if they don't reduce their cost of operations and increase their revenue, they will be left behind and become the laggards, even if they owned that space to begin with. One of my favorite examples is Pokemon Go. 
That team had no way of predicting how fast their application was going to need to scale, and one of the things that helped them was the fact that they had implemented using Couchbase, so they could quickly scale their cluster in response to increasing customer demand. This is all being driven by the requirements of the digital economy. So as application developers, architects, and data modelers, the big question that a lot of folks are struggling with is: okay, how do I use NoSQL and relational together, and how do I combine them? Do I replace my relational databases, or do I complement them? Do I extend them? It really depends on what it is you're trying to do. Wherever it's possible for NoSQL to be the database of record, customers are often replacing their relational databases with NoSQL. For example, shopping carts, session data management, anything that is specifically object driven and needs high performance and high scalability is something that customers look at and say, oh, I could use NoSQL for that, that could be my database of record, and for that I'm not going to use a relational database anymore. I might still use relational databases for other things, but wherever NoSQL is my database of record, I'm going to choose to replace my existing technology. There are other customers, and we mentioned Gannett and Marriott earlier, who look at NoSQL as adding performance, scalability, availability, and flexibility to existing infrastructure, and will basically say, I'm going to extend what I do currently with my relational database by adding NoSQL to my infrastructure. Bottom line: almost every single customer I talk to runs both relational and NoSQL databases in their shop. There's no wholesale, flip-a-switch move to NoSQL, because there are things that relational databases, which have been developed over the last several decades, do really well. 
And to add to the confusion, NoSQL vendors, including ourselves, are adding features that you typically find in a relational database: things like security, a query language, and analytics. Features that you traditionally found in a relational database are starting to appear in NoSQL databases to increase the number of use cases they can be applied to. And relational database vendors are starting to introduce NoSQL features like automatic sharding, support for JSON, and some forms of distributed processing. So you need to look at what the problem is you're trying to solve and which technology is the best one to solve it with. Most customers decide to make the migration from relational to NoSQL for those top four reasons: it's easier to scale, you get better performance, there are significant cost savings in terms of infrastructure purchases, and it's highly flexible. What's interesting to me is that two years ago, the number one reason why people were moving from relational to NoSQL was cost. They were looking at their expensive relational database licenses and thinking, I could get that functionality with a NoSQL database and save a huge amount of money in the process. Last year, according to a study from IDC, that changed, and the number one reason is now agility, performance, and scale. Cost has gone to third or fourth place in the decision matrix that customers use when deciding when or why to migrate to NoSQL, which to me is very interesting. It means that customers are starting to identify the needs of the digital economy, the needs for flexibility and scalability, as being even more important than the simple cost savings that motivated a lot of them in 2010, 2011, 2012. So how do you get started with NoSQL? What we're going to talk about today is how to identify the right application, how to model your data, how to access it, and how to migrate it. 
In terms of identifying the right application, the best thing you can do is find applications that are service oriented and compartmentalized, so that you can identify the pieces or services that you want to migrate to NoSQL. One of the mistakes that some customers make is that they look at their legacy system, which is full of spaghetti code, and they try to renovate and re-implement it in a six-to-nine-month project; those kinds of projects are usually doomed to failure. The customers who are most successful are the ones who find those applications, or components of an application, where you can say, I'm going to put NoSQL in right here for this service or this function. Generally speaking, you should look for applications that are going to benefit from the features and characteristics of NoSQL. Applications that need fast innovation, fast scalability, high availability, or global data distribution are the ones that should come to mind as the places where NoSQL could actually bring benefits. You don't really want to take a relational database application that's working just fine and replace it with NoSQL for the sake of using the technology. You want to get benefits from it. 
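One concrete pattern that makes this kind of compartmentalized migration tractable is putting storage behind a narrow interface, so the database underneath a single service can be swapped without touching its callers. Here is a hedged sketch of the idea; the names are invented and both "databases" are in-memory stand-ins, not real drivers:

```python
from typing import Optional, Protocol

class UserStore(Protocol):
    """The service's storage contract; callers never see the database."""
    def get(self, user_id: str) -> Optional[dict]: ...
    def put(self, user_id: str, user: dict) -> None: ...

class InMemoryRelationalStore:
    # Stand-in for the legacy relational DAO (rows keyed by primary key).
    def __init__(self):
        self._rows = {}
    def get(self, user_id):
        return self._rows.get(user_id)
    def put(self, user_id, user):
        self._rows[user_id] = user

class InMemoryDocumentStore:
    # Stand-in for a document store, keyed by a "user::<id>" style key.
    def __init__(self):
        self._docs = {}
    def get(self, user_id):
        return self._docs.get(f"user::{user_id}")
    def put(self, user_id, user):
        self._docs[f"user::{user_id}"] = user

def greeting(store: UserStore, user_id: str) -> str:
    # Application logic depends only on the contract, not the engine.
    user = store.get(user_id)
    return f"Hello, {user['name']}" if user else "Hello, guest"
```

With this shape, "this service is going to start using NoSQL" becomes a one-line change at wiring time rather than an application rewrite.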
What most of our customers have done is look at their infrastructure and identify things like a caching service in front of their relational database, or a specific application that has a very narrow application scope and data management footprint, and they'll identify that as a candidate for NoSQL. Or they'll identify a physical or logical service within a large application where they can say, this service is going to start using NoSQL. And especially if they have an API that allows them to abstract the storage layer underneath, the application may not need to change at all: the application still calls the middle-tier API that manages storage in the database, but underneath it's actually starting to use NoSQL. So if you've picked your application, if you've said, okay, I'm going to start looking at using NoSQL for this application, then the next question is, well, how do I design the schema? How do I model my data? The first thing that helps when we talk about this is to clarify terminology a little bit. All of the NoSQL vendors use slightly different terminology, but we pretty much mean the same thing. We use servers in a cluster, so whenever we talk about a database, we're talking about a cluster of systems, a cluster of nodes, where different services and different amounts of data are stored. And when we talk about a logical database, it's essentially a bucket or a collection of documents or objects, as opposed to a database or a table. So you'll often hear us at Couchbase talk about buckets, and in other products, for example, collections. And each row in a relational table is, in our world, essentially a JSON document; it's a document or an object. When you think about how to approach data modeling in a NoSQL world, the primary difference is that in the NoSQL world you have relaxed normalization; in other words, the database doesn't enforce schema adherence, whereas in the relational world it does. Of course, the advantages in the relational world are ones we're very familiar with: if you've normalized the data, you're not storing the city name in every record; you're storing the city ID and then looking up the city name in some ancillary table. That saves a lot of duplicate data and a lot of storage resources. However, it also means that you have to look in multiple places to get the data that you need. If I want to display an address, which includes a city ID, I have to go to the city table with that city ID, and then I have to go look up a state table to figure out what the name of the state is. In the case of NoSQL, what you can do is essentially optimize objects for the data access pattern that you have, so all the data that you need in order to access an object is stored and co-located together in a single object. I don't have to go anyplace else to get the data. Access in NoSQL is, generally speaking, really, really fast: I'm not looking up data in multiple tables; I'm looking up data in one place, in one object, because that object has all the data that I need. That object can change as the application changes; I don't have to go in and modify the schema in order to change the data that I'm storing, because objects in the NoSQL world are typically self-contained. They have all the data that you typically need in order to look at the object. This also has a built-in kind of architecture to support clustering, because I can distribute those objects across multiple nodes, but as long as I'm accessing a given object, I'm only accessing one node, so access is very, very fast. In a relational world, if I spread my data out across nodes in a cluster, that means I actually always have to go get data from different nodes, and that can be very slow. And then finally, because my data access is a single object and is very, very fast, I have lower server overhead. I'm not doing joins; I go get the data, and it's a very efficient operation, so I basically can 
get more operations per second out of my database storage than I necessarily could out of a relational database, and that's just an inherent architectural difference. JSON is JavaScript Object Notation. It's based on JavaScript, and it's a lightweight data-interchange format; we'll talk more in detail about what's in there. It's generally language independent (lots of languages have support packages for JSON), it's in general less verbose than other storage formats, and it allows you to represent and contain both objects and arrays inside a structure. And the big payoff is that there's no impedance mismatch: there's no real difference between a JSON object and a Java object, so from JSON to Java, or from JSON to programming objects generally, the translation is very lightweight. For those of you that have been using object-relational mapping systems, or ORMs, you know that there's a huge overhead in translating relational tables to application objects; in the JSON NoSQL world, that translation is very lightweight. So let's talk a little bit about how JSON works and how the model works. In the relational world, you define a schema, and that schema contains essentially the name of each column and what format it has, and all of the rows in that table look the same. In JSON, the data itself, the object itself, is self-describing. Each record, each object, contains essentially a pair, an attribute name and a value, and it contains multiples of these. So instead of describing what the schema looks like somewhere, each object itself is self-describing. One of the advantages this provides is that different objects in the database don't need to look the same; in fact, different objects in a JSON document database often do look different, whereas in the relational world, every row in a table looks the same, and if I want to add a new attribute, I have to go modify the schema. In the JSON world, I can have two documents like the one on the left, which is a user document, and the one on the right, which is also a user document, with different structures. On the left we have addresses, which contain a billing address and a shipping address. On the right we have an object which contains addresses, where for billing I actually have an address, but for shipping I've got a totally different value: I don't have a structure, I have something that just says "same", and it's up to the application to interpret those values. So you can have different types, you can have different fields, and you can change them pretty much on demand, whatever the application wants to store. As long as it's self-describing in this format, it's valid JSON, and the database really doesn't care a whole lot about what your JSON structure looks like; it just wants a well-formed JSON object, and it's going to store it alongside a primary key. So when we think about how you change your data model: in the relational world, you have to change the database schema, you have to change the application code, and then you have to modify the interface with the customer. In the document database model, typically all you have to do is modify the interface. If I'm using a web application, all I have to do is start requesting new fields, or add some new fields to my form, store those as part of my JSON object, and I'm done. I made that modification, I deployed it, customers are starting to supply that data, or I'm starting to provide that data to customers or the application, without having to change the underlying schema in the database or change the application itself. The first thing we think about when we think about objects inside the database is the object ID, the primary key. You can certainly create object IDs that are just a sequence of numbers, or you can create them so that they're a UUID value; each object has a unique value associated with it. But our most successful customers have looked at those keys and said, you know what we're going to do? We're going to actually build keys that are strings that are 
human readable and easy to identify. The reason you do this is so that your application can easily construct an object ID dynamically. If I'm looking for information about an author, I know what the primary key or object ID is; I don't need to look for it anyplace else: it's author::chain. If I'm looking for a blog entitled "NoSQL Fueled Hadoop," I don't need to go look up what the ID of that thing is; I know what the ID is: it's blog::NoSQL Fueled Hadoop. So most of our customers today are building human-readable, application-constructable keys that are easier to find and don't require an external lookup. There is a trade-off: the larger the key is, the fewer keys you're going to be able to fit in memory, in the database cache. So if you make huge keys that are 128 bytes long, that will work, but you're going to be able to fit fewer of those object IDs or keys in memory. There's a balance between making a key human readable, parsable, and descriptive, and making it so verbose that you're consuming more memory than you want in storing these keys in the cache. The second thing to think about when you're modeling your data, now that we've defined our object ID and know what that looks like, is defining how objects relate to each other, the relationships between objects. There are essentially two models. There's a bottom-up model, which says that objects belong to other objects, and that determines where you place the foreign keys. So you might have, on the left-hand side, a comment, which in turn points to a blog, and blogs point to authors, and that's where you put your foreign keys. The reason you would do this is if your primary access pattern goes in that direction. If your primary access pattern is the reverse, however, then what you might do is a top-down model, where, for example, your author has multiple blogs, and those could be foreign keys or they could be nested objects, and each blog has a series of comments associated with it, again as separate objects or nested objects. We've kind of broached this subject, so why don't we go ahead and dive in. In the relational world, you would typically have multiple tables in order to express relationships. In the JSON NoSQL world, you get to decide, as a modeler, whether you want to have nested objects or related objects. In the case on the left-hand side, you have a document, user, that's an object, and it points to two other objects, which are addresses for that specific object. So on the left-hand side, we're basically storing three separate objects in the database, and we're building essentially a foreign key into the parent object, pointing at the lower ones. In the other example, on the right-hand side, the nested example, we're only storing a single object, and that object has embedded objects inside it, including an object called addresses, which contains two addresses, and an object called accounts, which is an array of different accounts. So you get to choose which one of these two you use, and the question often becomes: which one do I choose? You gave me this flexibility; now what do I do? Well, here's a simple cheat sheet for thinking about whether I should create related objects or nested objects, and most of it has to do with how you access the data and what the read and write characteristics of your application are. If most of your reads are only on parent fields and you never use the children, then you might store the children as separate documents or separate objects, and the reason for that is that if I'm only retrieving the parent and I don't have to go get the kids, that becomes a very efficient operation. However, if my application always gets both the parent and the kids (every time I get a blog entry, I also get all the comments), then that might mean it's better to store them as a single nested object, or an object with nested objects inside it, again so that I can reduce the 
amount of time that I spend going and getting the object. In the case of writes, you need to think about how you write, whether you're writing just the parent or the parent and the child together, and make the decision of storing separate or nested objects accordingly. And finally, the third dimension on this has to do with concurrency and contention. If you have objects that have a lot of updates occurring to them, then you may want to express those as children, so that you can essentially parallelize access to the child objects rather than serializing access through the parent object. Essentially, if you have an object that's getting thousands or tens of thousands or hundreds of thousands of updates per second, I'm going to try to model it so that I can break out the pieces that are being updated, so that I'm not always contending on the entire object. There's a great recorded presentation called Agile Document Modeling and Data Structures that was presented last week at Couchbase Connect. There's a link in this presentation that gets you to the on-demand recordings, and you can search for Agile Document Modeling. That presentation spends about 45 minutes going through different types of trade-offs in JSON object modeling for Couchbase, and it's a great presentation: it talks about document types, objects versus arrays, and different timestamp options and formats. It also talks about one of the things that's particular to NoSQL and JSON: you have to deal with more than just present-or-absent values. You can have empty values, you can have null values, but you can also have missing values, where the data attribute itself is missing from the JSON structure, and you have to deal with that in your application code. 
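The related-versus-nested trade-off can be sketched with plain Python dictionaries standing in for stored documents. The key scheme (user::1234, address::1234::billing) and the field names here are illustrative choices, not anything the database prescribes:

```python
# Related (normalized) modeling: three documents, and the parent holds the
# child keys, much like foreign keys.
related = {
    "user::1234": {
        "type": "user", "name": "Pat",
        "addresses": ["address::1234::billing", "address::1234::shipping"],
    },
    "address::1234::billing":  {"type": "address", "city": "Oakland"},
    "address::1234::shipping": {"type": "address", "city": "Denver"},
}

# Nested (denormalized) modeling: one self-contained document.
nested = {
    "user::1234": {
        "type": "user", "name": "Pat",
        "addresses": {"billing": {"city": "Oakland"},
                      "shipping": {"city": "Denver"}},
    }
}

def billing_city_related(db):
    # Two reads: fetch the parent, then follow its key to the child.
    parent = db["user::1234"]
    return db[parent["addresses"][0]]["city"]

def billing_city_nested(db):
    # One read: everything is co-located in the parent document.
    return db["user::1234"]["addresses"]["billing"]["city"]
```

Both functions answer the same question; the difference is how many round trips they cost, which is exactly the read/write/contention cheat sheet above expressed in code.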
It also goes into details about how you can specify schema and some amount of normalization, data standards, and data-typing standards in the JSON object itself, and libraries that you can use to enforce them. Couchbase as a database doesn't today enforce any schema adherence: as far as we're concerned, if it's a well-formed JSON document, we'll store it; if it's not a well-formed JSON document, we'll return an error. But we don't try to enforce any particular schema. There are ways to enforce schema, however, in the application. So you've built and modeled your data; the next question becomes, how do you access your data? In Couchbase, and in most of the other NoSQL products, there are typically different ways of accessing the data. You can access the data in Couchbase via a key-value API, which allows you to perform CRUD operations: create, retrieve, update, and delete. There is a query interface that allows you to access views. But the one we're going to focus on today is N1QL, the general query interface that lets you access data through indexes. In the relational world, everybody's familiar with SQL. Couchbase introduced N1QL, let's see, that would have been last year, and it is a SQL for JSON: it has all of the attributes that you'd expect from a SQL language, plus features that allow you to parse and manage objects that are nested within JSON structures, within a SQL statement. So, some examples of N1QL queries; these are going to look very much like the SQL that you're used to seeing. The first one is simply a SELECT with a WHERE clause. The second one is a SELECT with a join (N1QL supports joins), and we have the special clause ON KEYS, which essentially tells the executor what the primary key is for the table that you're joining to. You can even join on objects by specifying an inner join against a nested object, as in the third case, if you're using related objects as opposed to nested objects. So this first one is if I've 
In the nested case, I've stored objects nested within my records, and what I'm doing is actually not a join: I'm using the WHERE ANY clause to say, match this against any value that's in that particular array. So WHERE ANY is one of those extensions N1QL provides to help you access nested data.

Of course, with N1QL, as with SQL, you can also perform CRUD operations: you can do INSERT, UPDATE, and DELETE, and we also have UPSERT and MERGE, so you have multiple ways of inserting, updating, and deleting data directly in N1QL as well.

We've talked about queries; queries require indexes. In Couchbase you have a series of options to create indices, including simple single-column indexes, compound indexes, functional indexes (you can create an index on a function of a column), and partial indexes, which allow you to essentially create an index with a WHERE clause. In addition to those four fundamental types, there are additional indexing options to consider, including array indexing and memory-optimized indexing. And then two things that really aren't different kinds of indexes, but are interesting options to consider as you model your data, are covering indexes and duplicate indexes. With a covering index, the entire query can be resolved within the index; it never has to go look at the data. A duplicate index allows you to create multiple copies of an index, and the system will automatically load balance between them. So, for example, if an index becomes a hotspot, where everybody's accessing the data through that index and it's becoming a contention point, one of the things I can do in Couchbase is create essentially two copies of that index, one on one node and another on another node. The system will automatically recognize those two indexes as equivalent and will load balance between the two, which essentially lets you scale indexes even when an index is highly accessed or has hotspots.
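To make those verbs and index types concrete, here is a hedged N1QL sketch; all bucket, index, and field names are invented for illustration:

```sql
-- UPSERT: insert the document if the key doesn't exist, replace it if it does
UPSERT INTO customers (KEY, VALUE)
VALUES ("customer::42", {"type": "customer", "name": "Ana", "city": "Austin"});

-- Simple single-column index
CREATE INDEX idx_city ON customers(city);

-- Compound index over multiple fields
CREATE INDEX idx_city_name ON customers(city, name);

-- Functional index: index a function of a field
CREATE INDEX idx_email_lower ON customers(LOWER(email));

-- Partial index: an index restricted by a WHERE clause
CREATE INDEX idx_active_signup ON customers(signupDate)
WHERE status = "active";

-- Array index over values nested in an array
CREATE INDEX idx_order_skus ON orders(DISTINCT ARRAY item.sku FOR item IN lineItems END);
```

A query such as `SELECT city, name FROM customers WHERE city = "Austin"` could be covered entirely by `idx_city_name`, since every field it references is in the index.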
Some of the things you need to think about in terms of data modeling when you're building indexes, and some of the things you need to unlearn from your relational days, are around how indexes function. In the relational world, indexes are synchronous: if I update a piece of data in a relational database, all of the indexes immediately get updated. In Couchbase, index updates are asynchronous. Basically, you write the data; that change is put on a queue and transmitted in memory to the indexing service, and the index is updated asynchronously. Inside your application, you get to say whether you want to wait for the index you're reading to be completely updated, or whether you'll take the index in whatever state it's in. Usually the lag behind the data is milliseconds, but some applications want to make sure they're looking at the most recently updated index, or at least that the index reflects the last update the application wrote.

The other thing you've learned after years of working with relational systems is that indexes slow down writes. If I create a table and put one or two indexes on it, that's just fine, but if I create ten indexes, that's going to slow down every write operation to that table. In the Couchbase world, because indexes are asynchronous, it doesn't matter how many indexes you create; you can create two indexes, ten indexes, twenty indexes, and your write of the data is not affected. It just puts the change on a queue to be sent to the indexing service and added to the index. In the relational world, if I want to load balance index reads, I have to do it in the application: if there are multiple relational indices that cover the same set of attributes, in most relational systems I have to use a query hint, an optimizer hint, to tell it to use this index or that index. In the Couchbase world, load balancing is automatic.
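To build intuition for why write latency stays flat no matter how many indexes exist, here is a small pure-Python sketch of the asynchronous pattern; this is my own toy illustration, not Couchbase code. The write path only enqueues a change, and a separate indexer drains the queue later; an application that needs to read its own writes waits for the drain first.

```python
import queue

class AsyncIndexedStore:
    """Toy document store whose secondary index is maintained asynchronously."""

    def __init__(self):
        self.data = {}                # primary key -> document
        self.index = {}               # city -> set of primary keys
        self.changes = queue.Queue()  # mutations awaiting the indexer

    def write(self, key, doc):
        # The write returns as soon as the document is stored; index
        # maintenance is deferred, so extra indexes don't slow writes.
        self.data[key] = doc
        self.changes.put((key, doc))

    def drain_indexer(self):
        # In a real system this runs continuously on the index service;
        # here we drain the queue on demand to simulate "wait for the
        # index to catch up" consistency.
        while not self.changes.empty():
            key, doc = self.changes.get()
            self.index.setdefault(doc["city"], set()).add(key)

store = AsyncIndexedStore()
store.write("user::1", {"name": "Ana", "city": "Austin"})

stale = store.index.get("Austin", set())  # index hasn't caught up yet
store.drain_indexer()                     # wait for index maintenance
fresh = store.index.get("Austin", set())  # now reflects the write
```

Here `stale` is empty while `fresh` contains the written key, mirroring the choice an application makes between reading the index as-is and waiting for it to reflect its last write.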
If we see two indexes in the system with the same signature, the same set of attributes, the system will automatically recognize that it can load balance between those two. In many relational systems, indexes and data conflict and contend with each other in terms of memory usage; in Couchbase, using memory-optimized indexes, you can effectively pin an index in memory, and we actually use a different storage engine that lets us optimize storage of those indexes. And since Couchbase has a parser, planner, and executor for queries, our old friend EXPLAIN from the SQL world is not missing here either: EXPLAIN will tell you exactly what the query plan was and which indexes were used.

So, to summarize very quickly: key-value access in Couchbase is very, very fast, but you're only going after a specific key. MapReduce views are great for certain types of applications, but they're not a general-purpose query language; you use a specific API to go get what is essentially a materialized view. N1QL queries provide the highest flexibility: almost anything you can do in SQL, you can do in N1QL as well, and you can create indexes to support your N1QL queries.

So let's talk quickly about migrating data. One of the challenges of designing a NoSQL system is thinking about how you're going to migrate your data from relational to NoSQL. Some of the things you need to think about: do I have to do ETL? Is it a one-time deal, or do I have to do this iteratively? Is it a batch-style interface or a streaming interface? What we recommend, especially for a first implementation, is to keep it simple. You can certainly do things easily with Couchbase, such as exporting comma-separated values from a relational database and then using N1QL to load those records directly into Couchbase, doing the ETL directly within N1QL.
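As one way to picture that CSV-to-JSON step, here is a minimal Python sketch; the column names and the key scheme are invented, and in practice you might do this transformation with N1QL or a bulk-loading tool instead:

```python
import csv
import io
import json

def rows_to_documents(csv_text, id_column, type_name):
    """Convert CSV rows exported from an RDBMS into keyed JSON documents."""
    docs = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Deterministic document key, e.g. "customer::42", so reruns
        # of an interrupted load overwrite rather than duplicate
        key = f"{type_name}::{row[id_column]}"
        doc = {"type": type_name, **row}
        docs[key] = json.dumps(doc)
    return docs

# A toy CSV export standing in for the relational side's output
csv_export = "custId,name,city\n42,Ana,Austin\n43,Raj,Chennai\n"
docs = rows_to_documents(csv_export, "custId", "customer")
```

Because the keys are deterministic, the job is naturally restartable: reloading the same export is idempotent, which matters when you plan for bad source data and interrupted jobs.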
You could also use SQL in your relational database to perform the ETL and get the data formatted before it goes into Couchbase, then bulk load it into Couchbase as JSON documents. The most important thing is to align your data model with your data access model. Plan for failure, because there's always going to be bad source data, there are going to be nodes that fail, and there are going to be jobs that need to be restarted, so you need to ensure the migration is interruptible and restartable.

One of the questions we get asked a lot by customers is, how do I keep the two systems synchronized? We've seen both of these patterns used by customers; in fact, I saw them presented by several customers at Couchbase Connect 16 last week. Some were using Kafka as a way of streaming data to Couchbase, or streaming data to the RDBMS, as changes occurred. Others, for those of you that use, for example, Oracle or another relational product with a gateway product like GoldenGate, build a handler that automatically takes changes as they occur in the transaction log and moves them to Couchbase, either in batch or in near real time.

So, fundamentally: pick the right application for NoSQL. Think about your data model from a data access perspective. Define what the document looks like, what's nested versus what's referenced, based on how you access your data, and then use indexes to accelerate access to that data. Pick the access method: here we've focused on N1QL as a query language, but you could also pick key-value access for certain aspects of your application, or views for others. And when you're thinking about your proof of concept, pick the application that's going to have a high degree of payback, identify your success criteria, and review the architecture you're going to use with the NoSQL vendor you've chosen.
In the case of Couchbase, we have a team of professional services that can certainly help you review what you're planning to build. And with that, I think we have about ten minutes left for questions, so let's open it up to the audience.

I love that picture, that's great. Actually, I stole that shamelessly from the guy from Equifax. That's great. We do have lots of questions coming in, and just to answer one of the most commonly asked ones: I will be sending a link to the slides and to the recording, along with anything else requested throughout this webinar, by end of day Thursday to all registrants, so keep an eye out for that; it comes from me. So, diving right in, David, this first question says: I believe column stores in relational databases are faster than NoSQL implementations in most cases; do you care to comment?

It depends on the access pattern, again. Column stores essentially group sets of attributes together and co-locate them on disk based on the column families. If your queries only need to access the data that's in a given column family, then it can be very fast, and it's particularly fast when you're doing aggregates. If I want to sum a column, or average the income across all the personnel records, and all of that data is in a single column family, co-located on disk, it's very, very fast to execute. However, for applications that need to access data across column families, it becomes very slow and very expensive, because now you're accessing two different locations on disk, or two different nodes. So it depends on the kind of application you're building. I would absolutely say that if you're doing aggregates and your queries don't span column families, a column-family store can be the way to go; but if you're spanning column families, or doing more generalized access, it's not necessarily going to be a win.
The other problem you're going to run into with column families is that they can become inflexible. I've defined my columns, I've taken my record, my object, and broken it up into column families, and now all of a sudden I realize I need the data from one column family to move over to another column family, and that's both data intensive and operationally intensive to do. With a document store, in that case, it may be just a question of: oh, I need to add this attribute to this object.

Thank you. And what are the implications for data consistency and data quality? It seems like this works great for narrowly defined applications like mobile apps; what about large-scale integrations?

Data governance and data consistency is, I think, the emerging question in the NoSQL space, especially coming from DBAs who are used to managing the data and making sure it's clean. There are a lot of people who say that DBAs have gone away in the NoSQL world; I'm an ex-DBA myself. In general, you have the same kinds of issues with data cleanliness that you have in a relational database, only now, since things are very much application controlled, some of that control, monitoring, and consistency is going to have to reside in the hands of the application developers, and there you have to have certain agreements and standards. There's nothing saying that I can't go into the database and say: give me all the person objects that are missing this attribute, because some bad application programmer didn't store it, and find all the records with that characteristic. So N1QL allows you to build, for example, data cleansing, data governance, and data standardization scripts that you can use to examine, monitor, and audit the data. But in the NoSQL world, it is true that the database is not enforcing consistency or normalization; it's pretty much the application. When multiple applications share data, that may or may not be a problem, depending on how those applications interact.
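An audit query of the kind just described might look like the following N1QL; the bucket name and the audited attribute are hypothetical:

```sql
-- Find person documents where an expected attribute was never stored
SELECT META(p).id
FROM people p
WHERE p.type = "person"
  AND p.email IS MISSING;
```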
One of the advantages of NoSQL is that different applications can look at and manage different attributes, or different objects, within the JSON store without having to coordinate and modify each other's schemas. If I'm application A, I can add an attribute to an object, and maybe I'm the only application that cares about that particular attribute. Application B will get that attribute when it retrieves the JSON object, but it's basically going to ignore it, because it doesn't need it. I didn't have to go to a DBA for application A to start storing and managing that particular attribute within the JSON object. So it does provide a higher degree of flexibility; there still does need to be some coordination between application teams in terms of what they're reading and writing.

This questioner goes on to expand: if everyone gets to define what they mean by XYZ, it seems like integration requirements for widely shared data could get large beyond measure; is there something I'm missing?

No, actually, there's nothing you're missing; that's a good question. What we've seen in our customer communities is that customers will either start to build separate objects for separate applications, or they'll have rules along the lines of: these particular attributes belong to that application, they're stored with the object, and they get retrieved and used by that application. We've actually seen customers go from two or three applications sharing a common data source to literally tens of applications sharing a data source, and in 90% of the cases they're all accessing and managing the same set of attributes or characteristics of the JSON object; in some cases they're adding additional, application-specific ones. I don't think the management becomes untenable, but it is something you have to manage in the process. It's just like in the relational world, where before I can add an attribute, I have to go pay homage to the DBA, and IT tells me, oh yeah, it'll take X weeks or months to add that attribute to the table and then upgrade the table.
In the relational world, you're paying the cost of normalized data management down at the database implementation level. On the NoSQL side, the cost doesn't go away; you still have to coordinate, but you're not paying it down in the database, you're paying it at the application design and coordination level. So I don't think there's a magic bullet in either case; what NoSQL tends to do is make changes easier and more flexible.

Speaking of changes: in the discussion of keys, making the key deterministic was listed as a best practice. How is a change to the key, for example an author changing their name, handled in the model?

If you're changing the primary key, it's the same thing as a delete and an insert: when the primary key, the object ID, changes, the object goes away, gets deleted, and comes back. We do have verbs like MERGE and UPSERT that let you insert a JSON object if it doesn't already exist, or update it if it does, so there are verbs within N1QL that make those things easier to manage. But you can certainly change the primary key if you want; that will essentially result in a delete of the old object and an insert of the new object on the specific node where it belongs, based on the hashing of the object ID.

Perfect. Again, we've gotten a lot of great comments on the presentation, and specifically this questioner says: I'm just wondering how access control is done in NoSQL. I'm sorry, I don't think I understood the question. The question is: I'm just wondering how access control is done in NoSQL. I think you're asking about security. Security varies by NoSQL product; all of the NoSQL vendors are scrambling to implement security, access control lists, and those kinds of features.
In Couchbase today, most customers control access to the data via the application: they identify and authenticate users, and then they manage user access to specific data within the application. In an upcoming release, Couchbase will have role-based access control (RBAC), which will allow customers to implement access controls directly in the database, as opposed to in the application. And you're going to find that this varies depending on the NoSQL vendor you talk to.

We are right at the top of the hour, and we have a ton of additional great questions coming in, so if I can get those over to you afterwards, that would certainly be appreciated. Just very quickly: are there any modeling tools to use alongside Couchbase?

Ah, I knew that would come up. There are a few JSON modeling tools, but they're not terribly sophisticated, because JSON is self-describing, so mostly what you're going to find are JSON editors, which are, you know, fine. There are some relational data modeling tools that, via JDBC and ODBC, can talk to a NoSQL database, but those tools come with the concept of normalized data and flattened structures; they don't typically support nested objects. So although you can use relational modeling tools with some NoSQL databases, including Couchbase, you're essentially hamstringing yourself from the beginning. I have yet to find a JSON modeling tool that's more than a JSON editor and that I really like, but I am on the hunt, if anybody in the audience is building one.

To wrap up real quickly: if you're interested in learning more about Couchbase or about data modeling, here are some resources you can look at. I also have in the slide deck some general resources about Couchbase, if you're interested in downloading and trying it out, as well as some specific resources around data modeling. I included the data modeling and data structures reference from Couchbase Connect 16 in this deck.
There were also several other data modeling related presentations at Couchbase Connect last week that could be very useful for people looking to explore more in depth.

David, thank you so much for this great presentation and for these resources. I will likewise include them in the follow-up email, which will go out to all registrants by end of day Thursday, with links to the slides and the recording as well. Thanks to all our attendees for being so engaged; we just love all the fantastic questions coming in and certainly appreciate them. I hope everyone has a great day, and thanks to Couchbase for doing today's webinar. It was really a great presentation. Thank you very much. Thanks, David.