Hi, everybody. This is Dave Vellante. We're here at Wikibon headquarters. Adam Fuchs is back. He's the CTO and founder of Sqrrl. We've been doing a number of whiteboard sessions around Accumulo, some of the innovations that Accumulo brings, some of the things that Sqrrl has added to Accumulo. You hear a lot about schema-less databases in the NoSQL database world. Well, how do you build structure and add schema to a schema-less environment? And that's what Adam's going to talk about today. So Adam, take it away.

All right, thanks, Dave. Yeah, so I get a lot of folks asking me: once I've got my database up and running, I can store key-value pairs in it, I can do searches, I can do range queries for those, but how do I organize my data? There are basic ways of organizing data, and there are some more complex ones. We're going to touch on a couple of the more complex concepts. One of them is denormalizing data, bringing it together to answer a particular set of queries. The other is secondary indexing. So I'm going to show you a basic diagramming technique that we use to diagram data models inside of Accumulo, and then I'm going to show you how that applies to indexing techniques.

So to start off here, we'll look at a simple organization of data, which we have diagrammed here in what I like to call a hierarchical data decomposition diagram. Obviously, I need to find a better name for that. But what we have is essentially a data set centered around people. People have friends. They're friends with other people. There's some history of those relationships. They have things that they own, and maybe there's a count of those things. So inside of Accumulo, we have a couple of features for storing and retrieving data. We can insert keys in random order. So we can transform this data into a set of key-value pairs, insert them, update them in random order.
But then when we query them, essentially we're limited to range queries inside of that key space. So a range in a sorted key space actually turns into a hierarchy in the row, column family, column qualifier, value format. Every key has elements in it: it's got a row, a column family, a column qualifier. And these are associated with a value or potentially a set of values. So if I have a range, I can select a particular row. I can select a particular column family in a row. I can select column qualifiers under that. And in fact, I can select prefixes of those elements as well. So if I want to query for a set of people and all of the things about them, if I use this hierarchical data decomposition, then I can query for prefixes in this tree structure. So a range would translate into a single person, or a person and their friends, or a person and one particular friend. Any of those hierarchies inside of this hierarchical organization really translates into a very simple query for Accumulo.

So if we take this abstract view and we instantiate it, here we have Alice and Bob, which are two people in our person table. Alice has friends. Alice is friends with Bob and Charlie, and they have some history. Alice also owns a couple of Oldsmobiles. And Bob is over here, has a couple of friends, and owns five houses. Why not? All right, so what we've done here is essentially we've grouped all of the information associated with those people together so that we can query it all at the same time. If you think about a document store in general, where you might have a hierarchical document, you have a number of features that are all grouped together underneath the title of that document. We've done a similar thing here. And in fact, in this instantiated view, I can take a traversal of the tree from root down to leaf, and this actually forms a key-value pair.
So Alice is perhaps the row portion of that key, the fact that Alice owns something is the column family, the thing that she owns is the column qualifier, and a count is the value in that case. So we have a number of concepts that sort of tag along with this mapping of data concept into key portion. In the row portion, what we're doing is really controlling how the data gets partitioned throughout the cluster. Inside of the column family, we're doing column-based partitioning. So these are locality groups, or the vertical orientation of the database. Inside of the column qualifier, that's where we put anything else that has to do with uniqueness. So row, column family, and column qualifier determine the uniqueness of the thing that we're trying to store. And then the value is any extra information that we would want to tag onto that. So that's a very basic concept.

We can also take this basic diagramming technique, and we can make a couple of other abstract versions of it. So some basic table designs that we use for indexing in particular: one of them is a document table with an inverted index table. These are two tables that we would pair together. The document table is organized by having UUIDs, or just IDs, of the documents. Within that, we have fields of the documents. And within that, we have values associated with the documents. So given that I can query ranges on this, and ranges turn into prefixes, that essentially gives me an ability to very quickly retrieve all of the fields in a document, or a particular field in a particular document. But maybe I don't know my document ID. Maybe I don't know exactly which document I'm trying to query, but I know characteristics of it, and I want to search based off of those characteristics. So that's when I start querying on the value itself. And this is our basic secondary indexing.
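To make the row, column family, column qualifier, value mapping concrete, here is a minimal Python sketch of the Alice and Bob example. It is not Accumulo's API: the sorted list, the `scan` helper, the "met ..." history strings, and Bob's friend "dave" are all made up for illustration. It only shows why a prefix of the key hierarchy turns into one contiguous range in a sorted key space.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for a sorted table: a list of
# ((row, column_family, column_qualifier), value) pairs kept in key order.
# The history strings and the friend "dave" are invented sample data.
entries = sorted([
    (("alice", "friend", "bob"), "met 2009"),
    (("alice", "friend", "charlie"), "met 2011"),
    (("alice", "owns", "oldsmobile"), "2"),
    (("bob", "friend", "alice"), "met 2009"),
    (("bob", "friend", "dave"), "met 2012"),
    (("bob", "owns", "house"), "5"),
])
keys = [key for key, _ in entries]


def scan(*prefix):
    """Return every entry whose key starts with the given key elements.

    Because the keys are sorted, all matches sit in one contiguous range,
    so the scan is a binary search followed by a sequential read.
    """
    start = bisect_left(keys, prefix)
    result = []
    for key, value in entries[start:]:
        if key[:len(prefix)] != prefix:
            break  # left the range covered by this prefix
        result.append((key, value))
    return result
```

Calling `scan("alice")` returns everything about one person, `scan("alice", "friend")` returns that person and their friends, and `scan("alice", "friend", "bob")` returns one particular friendship. This sketch only handles prefixes made of whole key elements; prefixes within an element would need a string comparison instead.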
That's the basic concept of secondary indexing that we want to use here. So in order to support that, we'll take parts of the value and create inverted index entries from them. From that value, I generate a set of terms. Each of those terms maps to a set of UUIDs. Those UUIDs are references to this other table, to our document table. And then maybe I keep which field the term was seen in, and perhaps some other information. What that looks like is, when I'm ingesting data, I have a record or a document. That record goes into a single place, everything grouped under the same UUID in my document table. And then I generate a set of index entries. So one place in the document table becomes multiple places in the index table. On the flip side, during query, if I have a particular term, I'm going to find that term at one spot in the index table, but it's going to map back to several UUIDs inside the document table. So that's basically what we would call term-distributed information retrieval. And it's a great technique. We use it all over the place. We extend it to do a whole bunch of things like geographical indexing.

But it has flaws. It's not perfect. And in fact, in the field of information retrieval, there are two dominant approaches. One is term-distributed information retrieval. The other is document-distributed information retrieval. For document-distributed information retrieval, we tend to group things together into partitions, or shards. And this is the diagram associated with that. We take a set of documents, group those together in a partition, generate index entries, and put those index entries for that set of documents into the exact same partition. So in this type of hierarchy, I've grouped my index entries and my document entries together. That gives us a different view here. As we're ingesting data, we'll take a record and put that into one partition in our table.
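Stepping back to the term-distributed design for a moment, the pairing of a document table with an inverted index table can be sketched in a few lines of Python. This is an illustration only, not Sqrrl's or Accumulo's implementation: the two dictionaries, the `tokenize`, `ingest`, and `query` helpers, and the whitespace tokenizer are all assumptions.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two paired tables.
document_table = {}             # (doc_id, field) -> value
index_table = defaultdict(set)  # term -> {(doc_id, field), ...}


def tokenize(value):
    # Assumption: lowercase whitespace splitting; real term generation varies.
    return value.lower().split()


def ingest(record, doc_id=None):
    """Store one record in one place in the document table and write
    one index entry per distinct term and field."""
    doc_id = doc_id or str(uuid.uuid4())
    for field, value in record.items():
        document_table[(doc_id, field)] = value
        for term in tokenize(value):
            index_table[term].add((doc_id, field))
    return doc_id


def query(term):
    """Find the term at one spot in the index, then fetch every
    referenced document from the document table."""
    postings = index_table.get(term.lower(), ())
    doc_ids = sorted({doc_id for doc_id, _ in postings})
    return {
        doc_id: {f: v for (d, f), v in document_table.items() if d == doc_id}
        for doc_id in doc_ids
    }
```

Ingesting a record touches one document ID but as many index entries as it has terms, while a query reads one index entry and then fans back out to several document IDs, which is the asymmetry described above.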
And inside of that partition, it has document portions and index portions, just like our simpler indexing model. On the query side, if we have a single term, we can map that term into each of the partitions and look for the index entries, and the documents associated with that particular term, in each of those partitions. So we parallelize the query across all of our partitions.

So these are a couple of the simple generic table designs that we're using, in particular inside of Sqrrl, inside of our product. We've extended these. We've added a whole bunch of others to do graph organization, to do some other specialized types of indexing, and some modeling associated with schema. But there you have it. That's really how we're adding a little bit of organization to a schema-less world.

Thank you, Adam. So you're seeing the infrastructure for big data becoming hardened and becoming so-called enterprise ready. Everybody talks about that. Accumulo is playing a key part in that, not only in terms of its ability to provide fine-grained levels of security, but also high levels of scalability and performance. So check out sqrrl.com for more information. Some of Adam's work will also be on there. Also, check out youtube.com/siliconangle for these and other videos associated with this topic. Go to wikibon.org for all the research, and check out siliconangle.com for all the blogs. Thanks for watching, everybody. This is Dave Vellante of Wikibon. And this is The Cube. We'll see you next time.
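As a closing note on the document-distributed design discussed in the session, here is a rough Python sketch of the idea. Everything in it is an assumption made for illustration: the fixed shard count, the CRC-based `shard_for` function, and the `ingest` and `query` helpers are not taken from Sqrrl or Accumulo. The point is only that a document and its index entries live in the same partition, and a term query visits every partition.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumption: a small fixed shard count for illustration


def new_shard():
    # Each shard holds both its documents and the index over them.
    return {"docs": {}, "index": defaultdict(set)}


shards = [new_shard() for _ in range(NUM_SHARDS)]


def shard_for(doc_id):
    # Assumption: choose the shard by hashing the document ID.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def ingest(doc_id, record):
    """Write the document and all of its index entries into one shard."""
    shard = shards[shard_for(doc_id)]
    shard["docs"][doc_id] = record
    for value in record.values():
        for term in value.lower().split():
            shard["index"][term].add(doc_id)


def query(term):
    """Ask every shard for the term; each shard resolves its hits
    against its own local documents."""
    hits = {}
    for shard in shards:  # a real system would run these lookups in parallel
        for doc_id in shard["index"].get(term.lower(), ()):
            hits[doc_id] = shard["docs"][doc_id]
    return hits
```

Compared with the term-distributed layout, ingest here touches a single partition, and the cost moves to query time, where every partition has to be consulted for the term.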