Hello and welcome. My name is Shannon Kempen. I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight, sponsored today by Vertica. Today, William will be discussing why analytic platforms should be columnar in orientation.

Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions through the Q&A section, or if you'd like to tweet, we encourage you to share highlights or questions on Twitter using the hashtag #ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. To open the Q&A or chat panels, you will find the icons for those features in the bottom middle of your screen. Just to note, the chat defaults to sending to just the panelists, but you may absolutely change it to chat with everyone to enable chat networking with each other. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout. Now, let me turn it over to Jeff from Vertica for a brief word from our sponsor. Jeff?

Thank you, Shannon, and hello and welcome to all of our attendees here. My name is Jeff Healy and I lead marketing for Vertica. As Shannon mentioned, we're a proud sponsor of this webcast. Hopefully you're seeing my first slide here, which has a gentleman I'm sure many of you are familiar with based on today's subject around columnar architectures, which is probably why you tuned in. And this is, of course, Dr. Michael Stonebraker. Dr. Stonebraker is really the database guru out there in the industry, going back to the 70s when he invented Ingres with one of his colleagues, and since that time, Postgres and all other kinds of databases, but the one behind today's topic is C-Store. He's a Turing Award winner from 2014, which is the equivalent of the Nobel Prize for computer science — an incredible mind around databases. And he's got this book called Making Databases Work.

Well, why am I talking about Dr. Michael Stonebraker? He was the co-founder of Vertica, which is where I'm going to go into a little bit of detail before passing it off to William. What Dr. Stonebraker saw, not too long ago, was a tremendous opportunity in being able to derive insight from very, very large volumes of data that could not be accommodated by row-store databases — particularly the more transactional databases — which is where OLAP and the more analytical databases came around. Well, one of those was Vertica. So many of you may be familiar with Vertica as that column store, that SQL analytical database where you can load and store very large volumes of data, up to petabyte volumes, and perform very, very fast analytics against it. Well, how do you perform those analytics? It has 700-plus in-database analytical functions, everything from time series to pattern matching to geospatial — the list goes on and on — and you can access all of those analytical functions through SQL. And you can do predictive analytics as well, because just a few years ago we took the approach of in-database machine learning.
The whole idea was to be able to do end-to-end machine learning just by using SQL. If you know SQL, you can create, train, and deploy machine learning models. There are algorithms that come out of the box — many of them have been around for some time: linear regression, logistic regression, Naive Bayes — that you can take advantage of, and they take advantage of the full volumes of data. And if you want to score models that are written in other languages like Python, very common programming languages with data scientists, you can do that: using PMML, you can transport the models into the database and score them there. Again, larger volumes of data, no down-sampling — we believe, and it's been proven with our customers, that you get better accuracy in your machine learning models as well. And it's also a query engine. Within this space you'll see many entrants — it's a very competitive space — with some vendors just offering a query engine, being able to analyze that data in place. So that's also really common.

A little bit about our customers. We've got thousands of customers with a variety of analytical use cases, and this is the part that really fascinates me within this database industry. Everything from predictive maintenance: Philips has a very large predictive maintenance initiative with their MRI machines, and Vertica is the analytical data warehouse that powers it. Within ad tech analytics, companies like The Trade Desk and Taboola: The Trade Desk generates 40,000 reports a day based on Vertica — it's one of the largest analytical clusters on AWS, 500 to 600 nodes, lots and lots of users, where concurrency, scale, and performance are super important. And then everything from customer behavior analytics within e-retail — making sure you can do dynamic pricing live, almost to the second, and A/B testing that determines which campaigns are performing better for you — to energy trading, smart metering, and security analytics around intrusion detection. The list goes on and on, particularly around IoT, with those large volumes of data, for use cases like Climate Corporation's. That's a software company that was bought for a billion dollars, and its underlying embedded analytical engine is Vertica; it's agri-tech, helping farmers get better yield from their crops. And then Intuit: when you use TurboTax, the analytics behind it is Vertica.

That's all for me. I just want to give one last call to action — I wouldn't be doing my job as a marketer if I didn't. Everything I mentioned around Vertica is now available as a service called Vertica Accelerator. It's got all the functionality of the core unified analytics platform, but we took away and automated the setup, administration, and management, and it runs within a customer's own AWS account to ensure security is preserved — and you can take advantage of discount pricing through your AWS contract. Go to vertica.com/accelerator. I'm looking forward to the rest of the webcast. Back to you, Shannon, and over to William.

Thank you so much, Jeff, and thanks for this great kickoff, and thanks to Vertica for sponsoring and helping to make these webinars happen. If you have any questions for Jeff, feel free to submit them in the Q&A section of your screen, as he will be joining us in the Q&A at the end of the webinar. Now let me introduce our speaker for the series, William McKnight. William has advised many of the world's best-known organizations.
His strategies form the information management plan for leading companies in numerous industries. He is a prolific author and a popular keynote speaker and trainer. He has performed dozens of benchmarks on leading database, data lake, streaming, and data integration products. And with that, I will give the floor to William to get his presentation started.

Hello. Thank you, Shannon. Thank you, Jeff. I enjoyed that little walk down memory lane back to the origins of the columnar ideas with Michael Stonebraker and so on. I was around for some of that, and I've been with it ever since, so this is a bit of a soapbox of mine: trying to get our clients onto columnar for their analytic databases. I hope you don't mind — I went ahead and went right for the money shot in the title of the webinar, not something I usually do: analytic databases should be columnar. I want you to see that, and then I'll try to prove it to you over the course of the next 45 minutes or so. And if you track me, you know that I'm all about the benchmark, so we do have some benchmark results to share with you at the end of this presentation, to underscore some of the things I'll be sharing with you today about turning databases in a columnar direction. There's a little bit more about me, but I've been introduced, so I will move right along here.

And guess what, my first slide is also about some of the origins of this. Jeff's already talked about it, but there is a paper out there called C-Store that Michael Stonebraker, as you can see there at the top, was an author of. C-Store is a column-oriented database management system developed by a team at various universities, including Michael Stonebraker and a few others, and its latest release was in 2006. So it's not like you're going to go out there and buy C-Store for your column-oriented needs, but Vertica, our sponsor today, is a commercial fork of C-Store and certainly embodies the columnar orientation more than any other database out there.

Now, we are squarely in the realm of relational databases today. I'm not talking so much about cloud storage and other forms of storage. Relational databases have had, I would say, a bit of a resurgence in the past few years — it looked like maybe the world was going in a Hadoop direction there for a while, and then it turned hard right back to all the goodies that the relational database has given us and continues to give us. So hopefully you're tracking that space, and surely you are using some relational databases somewhere in your enterprise, if not exclusively relational databases. And I'm going to talk about this because looking at the internals of a relational database is very important to understanding the value of columnar.

Now, over the years — and I've been with this industry for a long time — relational database design is virtually unchanged. The only thing we've really done differently, and this is probably in the past decade, is parallelism. We're still doing third normal form, we're still doing the star schema, et cetera, and for good reason; we're still doing it in our practice, for sure. Hardware, however: storage capacity has increased tremendously, as has CPU performance, and they've gotten cheaper — but transfer rates and seek times have increased only modestly.
Now, they have increased, to be sure — what was great performance a few years ago may not even be good performance today, I'll admit that. So we do need these things to get better. We have much more data, we have much more complexity in what we're asking about the data, we have much more concurrency, and far less downtime available to us. These things have had to come aboard, if you will, in the past few years in order to keep up, and they certainly have. But do notice that last bullet: transfer rates and seek times have increased only modestly compared to everything else. What that means is that's where the bottleneck continues to be for analytic databases and analytic queries, and that is in the input/output — the IO. So let's build on that a little bit.

Now, a row-wise database is what probably 99% of you are working on today. Some of you are also maybe working on some columnar, which is great. But row-wise databases — and we usually don't even use the term row-wise, we just say a database, but truly that's what they are — store data in rows. So here you have some sample rows of a sample table: one, two, three, four, five, six columns. Most of us, even database professionals of multiple decades, only know that the data is stored something like this. If you do a SELECT * from the table, this is what you get — so this must be how the data is stored. Well, you're kind of right there, and I'm going to show you in more detail how that data is stored in a row-wise database in a few moments. But I just wanted to orient us around how the row-wise database looks.

Most of the time, a row-wise database stores all the columns in the order in which they are defined in the catalog, which is the order you get them back in if you do a SELECT *. So there is some serious ordering importance to the columns in a table. However, some databases will occasionally store like data types together and just work out the ordering on the way out by looking at the catalog. So there are different games, you might say, that databases play with some of the data in a row-wise database, but generally speaking, all the columns of a row are stored consecutively.

Now, the data is down there on storage — multiple forms of storage today — but to get it to the CPU, it usually goes through memory. The CPU interacts with primary storage, or main memory, referring to it for both instructions and data. So I'm bringing your attention to the fact that we have these different caches — L1, L2, other forms of caches — and I call them preprocessors that conserve the CPU. The CPU is still necessary to process the data, but L2 and L1 can do a lot, and they add a lot of value. However, if we're sending too much data up this pyramid from storage towards the CPU, they can get bottlenecked. And when they get bottlenecked, all the goodness they provide is thrown out the window and those levels get skipped. Famously, the L2 cache gets skipped quite a bit in the configurations we have in place today, whether on the cloud or on-prem. So that's something to definitely keep in mind. And Vertica has a concept around their cache called the depot, by the way. So there are multiple different terminologies out there for a lot of this, but definitely for this caching.
So just be aware, different databases will call it different things. But the data has to come up to the CPU, and we can have bottlenecks if we're sending too much in the unit of IO, which is the data block.

So I said before that a relational database stores its data row-wise probably, I would say, 95% of the time. Well, I take that back — that's only if I include operational purposes as well. On the analytic side, we're getting smarter about this, but we're probably still about 80% row-wise out there with our implementations. Anyway, what you see here is one value for every column of the record, and you have three records on the page — with small records like these, you would probably have a lot more. The data block itself can be different sizes, and that's certainly a design consideration: anything from 4K to 64K, 128K, and beyond. It depends on how much you want to gobble up in an IO, and that depends on your query patterns. If your query pattern is one row at a time, that's one thing; if there's a lot of scanning to be done, you want to grab it all in the IO, so you're going to have bigger page sizes.

What I don't show you here is that these records also have a bit of a record header at the beginning of them. I do show you the page header, these things called row IDs at the end of the page, and the page footer, which is primarily used to make sure that everything's in sync within the page. Now, the row IDs are offsets to where the records begin on the page. So if I ask you to go to, say, page three, record two, the database can jump three times the size of the page from the beginning of storage, landing at the beginning of that page, go to the second offset from the end of the page, follow that offset within the page, and there read the second record — the one for Craig Lennox. Okay, great, that's how it works. What's in an index is actually what we call the record IDs: the page number and the record number within that page. So you can start at the index and come straight in to the record that you want, and get the rest of the record, if you will — the part that's not in the index. Okay, that's a lot of words, and I'm sorry for that; hopefully it all made a little bit of sense anyway. So I'm going to move on here.
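To make those page mechanics concrete, here is a minimal Python sketch of that slotted-page idea — page header, records, and an ID map of offsets counted backwards from the end of the page. The entry and footer sizes are illustrative assumptions, not any particular vendor's on-disk format:

```python
import struct

PAGE_SIZE = 4096  # a common lower-end block size; real blocks run up to 128K and beyond

def read_record(page: bytes, slot: int) -> bytes:
    """Follow the ID map at the end of a page to one record.

    Assumed layout: [page header][records...][free space][ID map][footer],
    where each ID map entry is a 2-byte offset counted backwards from the
    footer, and each record starts with a 2-byte length header.
    """
    FOOTER_SIZE = 2  # sync/check bytes keeping the page consistent
    ENTRY_SIZE = 2   # one row ID = one 2-byte offset into the page
    entry_pos = PAGE_SIZE - FOOTER_SIZE - ENTRY_SIZE * (slot + 1)
    (record_offset,) = struct.unpack_from("<H", page, entry_pos)
    (record_len,) = struct.unpack_from("<H", page, record_offset)
    return page[record_offset + 2 : record_offset + 2 + record_len]

def read_by_record_id(storage, page_no: int, slot: int) -> bytes:
    """'Page 3, record 2': jump page_no pages in, then follow the ID map."""
    storage.seek(page_no * PAGE_SIZE)
    return read_record(storage.read(PAGE_SIZE), slot)
```

An index entry holding just (page number, slot number) is all it takes to land directly on Craig Lennox's record without scanning anything else.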
Now, in a columnar orientation — and I think you've all probably picked up on this — what we're saying is that the columns are stored consecutively, with consecutive values within the data blocks, not the entire record. So what you have are independent areas on storage for different columns, or you might throw some columns together into what might be called column families, or containers — we'll get to that in a little bit. But for now, let's just pretend that each column is in its own block. Obviously, if the number of records — and therefore the number of column values — exceeds the size of the block, there will be multiple blocks. There can be many hundreds or thousands of data blocks for a given column, for a given vector, by some terminology. And that's okay, because it's still far smaller than if you had the whole record in there, and that's kind of the point.

So if you have a lot of queries that touch just selective columns, either in the SELECT or in the predicates, then you're going to be very interested, I would say, in columnar. And what I encourage my clients to do is to not mess around with their design trying to approximate what columnar does for you. I've seen this plenty of times, where a client will have kind of gamed their row-wise database to put just some of the columns in one table and the rest of the columns in another table because of this IO problem we're having. I don't like to see designs compromised; I want them to be highly managed and implemented directly from the design into the tables. So I don't like that — I think it affects manageability.

In addition, I want to bring your attention to something here: in this columnar data block, you don't see all that overhead we saw back on the row-wise page. You don't see the row IDs, and we don't see the record headers, which I didn't even show you there, but trust me, they're there. We just see the actual column values — I don't know what column this is, but anyway. The reason for that is that since the values are fixed length, they're all going to be at a predictable offset. So 1120 would be for record one, 1121 for record number two, et cetera. I don't need an ID map in this case, which saves overhead, which means, number one, I can fit more values onto the page, but it also means I'm going to grab a little bit more when I do my IO. And there are many tricks we can do through compression to make this even better — I'm going to get to that.

So, a little summary of this section: in traditional databases, data is stored by row, using data blocks of 4K to 32K — and like I've said, some of them take you all the way up to 128K and beyond. If you're a database professional working on an analytic database, hopefully you know what the block size is for that database; it's pretty important in the grand scheme of things. Now, for smaller databases, you're probably not going to see much difference with suboptimal block sizes, but in larger databases you will, for some or maybe a lot of the queries. So it's clearly worth paying attention to.

So if you want to do something rather simple, like calculate the average sales for the class A stores in New York, you're going to have to go through this table: you're going to select where class equals A, grab all the sales for those rows, average them, and send that back to the user. For queries in a row-wise, traditional database, you might go through an index to get to just those records that have class equals A — if you have an index on class, for example. If you also have sales in that index, you never have to touch the table. But if you don't have sales in that index — it's not a compound index — you're going to come over to the table, grab all the sales associated with those record numbers, aggregate them, and do the division, and that's how you end up getting the result. That's how it works in traditional databases. I'll show you how columnar works in just a bit.
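To put rough numbers on that IO difference, here is a back-of-the-envelope Python sketch for that same query. The column widths and row count are made-up assumptions, and it assumes a full scan with no index — the point is only the ratio between the two layouts:

```python
# Hypothetical store-sales table: column name -> width in bytes (illustrative).
columns = {"store_id": 4, "name": 30, "city": 20, "state": 2, "class": 1, "sales": 8}
row_width = sum(columns.values())   # 65 bytes per row
n_rows = 10_000_000

# SELECT AVG(sales) FROM stores WHERE class = 'A' AND state = 'NY'
# touches only three columns: class, state, and sales.
needed = ("class", "state", "sales")

row_wise_bytes = row_width * n_rows                        # every column rides along
columnar_bytes = sum(columns[c] for c in needed) * n_rows  # only the vectors we need

print(f"row-wise scan : {row_wise_bytes / 1e6:,.0f} MB")
print(f"columnar scan : {columnar_bytes / 1e6:,.0f} MB")
print(f"IO reduction  : {row_wise_bytes / columnar_bytes:.1f}x (before any compression)")
```

With these made-up widths, the columnar scan drags roughly one sixth of the bytes through the IO path — and compression, covered next, widens that gap further.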
But first, this concept of mixing columns. I'm going to use the word containers — one of the databases out there uses this term, but there are many terms for it. Now, there are a lot of possibilities. Let's say you have a five-column table; I'm showing you some of the possibilities in terms of the design. These are all columnar, except you could argue that possibility one really is not, because what you did there is you threw all the columns into one container.

Now, never mind for a moment the fact that I show container one and container two — actually, let me address that. The point is that sometimes your columnar compression routines only work within a certain subset of the records of the table. So the containers might be divided; there might be multiple containers for a given column based on size alone. That's not really a design concept; it's just based on the size of the column and the number of records.

Possibility one is all columns: we threw them all into one container. We probably shouldn't have done that — that just looks like a row-wise database to me. Possibility two is taking the column idea to the extreme, and by the way, this is the default for a lot of the columnar-oriented databases out there. So if you don't take this in hand, what you're going to get is a container, or whatever they call it, per column: one for name, one for city, one for state, one for lifetime value, one for birthday. Okay, great. But there are plenty of possibilities where you could combine commonly used columns into a container because they're going to be accessed together most of the time, like city and state. It depends on your query pattern. Anybody who says, ah, it doesn't really matter about your query pattern, just throw the data in here and we'll take care of the rest — they're wrong. You are suboptimal in that arrangement for a lot, probably most, of your queries. So design time, done by an expert with a lot of good information in hand, goes a long way. I'm not saying belabor the database design to no end, but I am saying that it's still a thing out there. By the way, it's kind of a lost art, because there's so much else we're attending to these days in terms of the cloud. We're getting distracted by a lot of things, and not attending to our database design, and we should be. We should continue to put emphasis there and on the data modeling, yada yada. Okay, possibility four, you have four columns in a container; possibility five, and so on. There are probably 20 possible combinations here with just these five columns. My point is, you still have to take this in hand.

Okay, so what does the columnar orientation do? How does it look different when you're doing a query? Well, here is a columnar orientation with vertical partitioning of the data. That means every column has its own storage area. There's nobody else in there but date; nobody else in there but store number. They're stored independently. It's kind of like when you go to Spotify and you want to play a song: you don't want to play the whole album, you just want to play the song. That's how we're consuming music today, and that's how we're consuming data as well.
So most queries are not going to want all the columns of the table — and hopefully we haven't gamed our design so that they actually do, because that's a faulty design. In the case of the earlier query that needed class and sales, it's not going to IO the date, the store number, the state, the category, or whatever else is in here. And we all know customer tables will probably have dozens to hundreds of columns in them.

So, consistent data — let's talk about compression; we're starting to move into that area. Consistent data types are easy to compress, and these are definitely consistent data types, because within a vector here, it's all values of the same column. The resulting storage size is typically less than 50% of the size of the raw data, but it does depend on the compression routine that you choose, and these are some of the common ones associated with columnar. Now, they're not necessarily exclusive to a columnar orientation, but their value proposition is much higher when you do have a columnar orientation; that's why a lot of these routines are associated with columnar definitions. And like I said about the design, like I said about the size of the data pages, like I said about dividing the columns into their column families — all that stuff where I've said, hey, pay attention to that — I'm going to say pay attention to your compression when it comes to columnar as well. Now, some databases will let you just create the columns and fire away, and the database will learn as it goes and figure out what the best routine is based upon the data it's seeing in the column. My experience with that is mixed results, I'll put it that way. I would rather see you think about it — think about the data pattern, which you pretty much know: you know what the data going into a given column is going to look like when you're creating the column, most people do — and that lends itself to the compression routine you should use. So let's look at some of them.
One of my favorites is run-length encoding. Let's say you have a bunch of records that have this exact same value for quarter, Q1, and they're all stored in sequential order. So this is only going to work on probably your leftmost one, two, or maybe three columns in a table, depending upon how you cluster the table. But let's say you cluster this table by quarter, and I show one, two, three, four, five, six rows here that have Q1 — but it could be hundreds, it could be 500. So instead of storing the Q1 value 500 times, you just store it once in the column vector, and you say Q1 is the value for rows one through 500. So you have a start and you have an end. You don't really have to have a start — it could be assumed to be one plus whatever the prior end was — but virtually, you have a start and an end for where that value shows up in terms of row order. So all of this quarter vector can be summed up in four short rows: Q1, Q2, Q3, Q4, because that's all there is. Now, realistically, you probably would have had the year to the left of the quarter, and that might be a better thing to compress on — this is just an example — but I'm showing you how you can save a lot of space. And again, we're trying to optimize that thing that's the bottleneck, which is the IO: you're going to get a lot less in the IO, but it represents a lot more data, if you follow me.

Dictionary encoding. Rather than storing the full value of a long text — something like England, or United States of America, or Japan, obviously countries here — the original size, if it's a fixed length, is going to be 30; I think that's what we typically do, CHAR(30), for country, although I could be mistaken, there could be some country name that's larger. But anyway, CHAR whatever it is, let's say 30 — that's 30 bytes every time. Now, what if you had a dictionary, if you will, or a thesaurus, as I sometimes like to say, somewhere else that says, hey, this is where all the values of England are? Instead of having to repeat England, you just put in the surrogate value, the compressed value, which in this case is a zero. A one for United States of America, a two for Japan, a three for Argentina, and so on — there's another one for United States of America down at the bottom. So that's just one byte — one byte instead of 30. Yes, you had to store a dictionary off to the side for this, but that's not going to take up very much space compared to the savings you're going to get for a large table with a large CHAR that you want to compress.

Okay, now here's an example of how you can have different compression routines for the different column families in your columnar database. In container one here, for example, you can have dictionary compression — and I say local because, like I said before, if the table is large, the database is going to cut it off at some point and say, okay, I'm going to create a dictionary for this section of the table, then another dictionary for the next section of the table, and so on. Think of that as 10,000 rows or something like that — I'm trying to remember what it is in SQL Server; it goes by number of rows, and I think it's 10,000. But whatever it is, you have a special dictionary for that section, and then you're not storing Smith, Jones, and Johnson all the time, Williams, Jones, and Miller — you're just storing the surrogate.
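Here is a minimal Python sketch of both routines as just described. The row values mirror the Q1–Q4 and country examples from the slides; the encodings are the textbook versions, not any particular database's implementation:

```python
def rle_encode(values):
    """Run-length encode a clustered vector into (value, start_row, end_row) runs."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            runs.append((values[start], start + 1, i))  # 1-based row numbers
            start = i
    return runs

def dict_encode(values):
    """Dictionary-encode a low-cardinality vector: small codes + a lookup table."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return codes, {code: v for v, code in dictionary.items()}

quarters = ["Q1"] * 500 + ["Q2"] * 500 + ["Q3"] * 500 + ["Q4"] * 500
print(rle_encode(quarters))
# [('Q1', 1, 500), ('Q2', 501, 1000), ('Q3', 1001, 1500), ('Q4', 1501, 2000)]
# 2,000 stored values collapse into four short runs.

countries = ["England", "United States of America", "Japan", "Argentina",
             "United States of America"]
codes, lookup = dict_encode(countries)
print(codes)  # [0, 1, 2, 3, 1] -- one small code per row instead of a CHAR(30)
```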
In my example here, the surrogate is three bytes. That's fairly meaty, in my experience, for a dictionary surrogate value, but if I had tens of thousands of different values — that super high cardinality — dictionary encoding probably wouldn't be of a lot of value anyway. So you could have multiple bytes in there for your surrogate; this is really great when you have low cardinality.

Now, the state container here, we have run-length encoding: Wyoming, that's in records one through ten; Texas, that's 11 through 25; Florida, 26 to 34; et cetera. Now, when we cut over to container two in this example, this time Texas is one through six, and Georgia starts at seven. I think you can see, though, that unless you have repeating values from row to row, this probably isn't going to be very much savings to you — so watch out for that.

And finally, one I didn't show you another slide on, but it's called delta compression. Instead of actually storing, for example, the birth date — this works great for dates, but it works for other things as well, like amounts, anything that doesn't have a really wide range of values — it picks a median for the container. Let's say it's July the 4th, 1970. This first person is plus 73 days from that, which means Smith's birthday was 73 days past July 4th — so sometime in mid-September, same year. And the second one, minus 198: subtract 198 days. But do notice that we're only storing the plus or minus value; we're not storing the 10 bytes of the date. There are a lot of obvious IO advantages to this. If everything's in a really wide range and you have to store a big number in there, it's probably not as good for you; and something where every value is unique is probably not too good for it either. But dates, amounts, things like that — it makes a lot of sense. So you can have a mix and match of compression within a columnar database.
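A small Python sketch of delta compression under the same assumptions — a median-ish anchor per container and small signed day offsets. It reproduces the +73/−198 arithmetic from the example:

```python
from datetime import date, timedelta

def delta_encode(dates):
    """Pick a median-ish anchor per container, store small signed day offsets."""
    anchor = sorted(dates)[len(dates) // 2]
    return anchor, [(d - anchor).days for d in dates]

def delta_decode(anchor, deltas):
    return [anchor + timedelta(days=n) for n in deltas]

# Mirrors the example: anchor July 4, 1970; offsets +73 and -198.
birthdays = [date(1970, 9, 15), date(1969, 12, 18), date(1970, 7, 4)]
anchor, deltas = delta_encode(birthdays)
print(anchor, deltas)  # 1970-07-04 [73, -198, 0] -- a couple of bytes vs a 10-byte date
assert delta_decode(anchor, deltas) == birthdays
```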
Now, eventually a database has to bring everything together and show you the results, and that's called projection. There's the selection part, where it's doing everything in terms of the predicates, and the projection part, which is bringing stuff back. In row stores, the engine removes unneeded columns from the result set, meaning it materializes things early — the data is already fully materialized in the rows, and that's what it works on. What you don't want is a column store that also does that early materialization because it doesn't really have a columnar optimizer, just a row optimizer: it has to take the columns, simulate a row, and throw that into its optimizer. That's called early materialization. Most of the columnar databases avoid that, although some of them still do it, or still have some pieces of it left in their optimizer, and that's something to watch out for — it's going to be suboptimal. And I think I'll show you primarily with an example.

So let's look at this example. Here we have two vectors. Product ID: 4, 1, 4 — and hopefully by now you know what that means. That's your run-length encoding, and what it's saying is that the value 4 is in rows one, two, three, and four; they're all 4. Store ID: normal stuff here, the values are 2, 1, 3, 1. We're talking about four rows in this little table. We're selecting customer ID and price from sales — this is the sales table — where product ID equals 4 and store ID equals 1. Pretty simple. Now, in which rows does product ID equal 4? All of them. In which rows does store ID equal 1? Two and four — I'm going to ignore the fact that usually we start counting at zero. So, two and four. Selection occurs: the bitmapping happens on the various vectors, and it highlights with a one those record numbers that meet the respective predicates. And since we said AND, what it's going to do now is AND them together. What that leaves us with is records two and four, right? So then it goes out to the customer ID and the price vectors and says, give me the second and the fourth value from each of those, and let me glue them together — that's my glue right there, a bottle of glue — and then it constructs the result. And notice that it's doing this late: it didn't have to glue everything together early and then do the row thing. So pretend there are a hundred columns in this table, and you can see where the value comes from. And there's your result. Yay.
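Here is that whole selection-then-late-projection flow as a small Python sketch. The customer IDs and prices are made-up stand-ins, since the slide's values weren't read out:

```python
# SELECT customer_id, price FROM sales WHERE product_id = 4 AND store_id = 1
product_id_rle = [(4, 1, 4)]            # run-length encoded: value 4 covers rows 1-4
store_id       = [2, 1, 3, 1]
customer_id    = [101, 102, 103, 104]   # illustrative values
price          = [9.99, 19.99, 4.99, 24.99]

n = 4
# Selection: one bitmap per predicate, built directly on the compressed vectors.
bm_product = [0] * n
for value, start, end in product_id_rle:
    if value == 4:
        for r in range(start - 1, end):   # no per-row decompression needed
            bm_product[r] = 1

bm_store = [1 if v == 1 else 0 for v in store_id]

# AND the bitmaps together: rows two and four survive.
matches = [i for i in range(n) if bm_product[i] and bm_store[i]]

# Projection, done late: only now touch the customer_id and price vectors,
# and only at the matching offsets -- the "glue" step.
result = [(customer_id[i], price[i]) for i in matches]
print(result)  # [(102, 19.99), (104, 24.99)]
```

Notice that the other hypothetical hundred columns of this table never get touched, which is exactly where the value comes from.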
Okay. Now, before I show you the benchmark and give you some real numbers to work with in your designs, do note this trend. Have you heard of operational-analytical, or translytical, databases? It's kind of gray as to who's in that category — I won't go into who's there today — but some databases put data into both structures. That's right, both structures. Now, many years ago, when we started with databases, we didn't have such a thing as an analytical database, remember? We just had the operational database, and some of us still use operational databases for our analytic needs — I'll put that aside for the moment. But we do have this whole set of analytic databases; we all know who they are: Snowflake, Redshift, BigQuery, Vertica, Actian, SQL Server — the analytical SQL Server, that is — et cetera. So some databases will put the data in both, storing the data both ways, and then the optimizer will look at the query and go, hmm, which one is going to be better for this? That's why they call themselves operational-analytical or translytical: they can do both, because they store the data two ways. The optimizer, or the user, determines which one to use, because you can say, okay, use the columnar one, or use the row one, depending on what kind of query it is — if you want to outsmart the optimizer. And by the way, don't think that trying to outsmart your optimizer is a bad thing to do. You still have to do that as a database designer; the optimizers are not good enough for us to turn our brains off and throw everything over to them — not in an enterprise workload at scale.

Okay, so for the benchmark I'm about to show you the results of, we used SQLite for row-oriented — everybody's familiar with SQLite, right? — and we grabbed and went with this thing called DuckDB, which we like to grab and go with when we need a columnar example. I don't think anybody would necessarily take it to production, but it's great for processing and storing tabular data sets, it works with CSV and Parquet files, you can do interactive data analysis with it, like joining and aggregating multiple large tables, and it works with concurrency. That's my little analysis of DuckDB. But we have done much, much bigger benchmarks many times over, and this is why I'm on this soapbox: I've actually done it in practice and seen the changes it can make, not only in a benchmark but all the way into production and beyond for our clients. So I encourage you to think about this, because a lot of us are out there wondering, how can we add some value to our environment? Well, maybe the way you add some value is you reorient your data warehouse. I know you've had it 10 years, you didn't think about columnar 10 years ago, and now it's going to be hard, but it still might be worth it to spec a project, one that you build and run concurrently, to orient your data warehouse and your other analytical stores in a columnar way. It's going to improve the performance — let me show you how much.

Okay, so here is our first test. Now, I'm going to show you some inserts, and I'm going to belabor the inserts, because — let's think about this — columnar is not going to be as good for inserts, right? Columnar is not going to be as good for loads, because every row has to be broken apart into its respective columns or column families and thrown into their respective storage units. It's not one-size-fits-all, it's not take-the-whole-row-and-put-it-in-one-place; it's putting it in 10 places, 20 places, 100 places, so it's going to take longer. In my example here — it's a small example — I have nine columns, it looks like, so nine columns means, in our example where every column has its own storage, nine break-aparts. So here you have the first test: 100,000 rows got inserted in 80,920 milliseconds — that's about 80.9 seconds, or 1.3 minutes, for those of you that care about the conversion. Columnar came in at the equivalent of 110.8 seconds, so a little bit longer. A lot of people think, well, I can't do columnar because it's going to blow out my load window. Now, if you're pushing up against your load windows, if you have problems there, then you don't want to add to them with this; you've got to fix some problems first. But if you have cycles there, and you really want to optimize your selects, I think columnar makes a lot of sense.

Let's look at another example: insert 10,000 low-cardinality items into a smaller table. 10,000 rows were inserted in 737 milliseconds, versus columnar at 793 milliseconds. So it's a little bit longer once again — I don't know the exact percentage, but it looks like it's on the order of less than 10 percent. So yes, you have to pay a price for all the goodies of columnar, and this is where you pay it. One more time, let's pay the price again: create a very small table — three columns this time — and insert a million records. 519 versus 535. It's getting closer, because you don't have as many different storage units with three columns, but it's still more for columnar, no doubt about it.

Now let's move to the happy state — let's talk about the selects. This is where you get the benefit, right? This is where most people care about having the most benefit. So: select last name from customer where state equals — and I did all 50 states, one query per state. How long did those 50 queries take? 7,845 milliseconds in the row store, and in the columnar it took 3,316. That's less than half. So that seems pretty good — that is pretty good. And in the columnar, there's no indexing, because the values are already broken apart, so there's that as well. So that's what it did on just pure selects.
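For anyone who wants to reproduce the flavor of this at home, here is a minimal Python harness in the same spirit — SQLite as the row store, DuckDB as the column store. The schema and generated data are stand-ins, not the benchmark's actual tables, and absolute timings will vary by machine; the point is the relative gap between inserts and scans:

```python
import sqlite3, time
import duckdb  # pip install duckdb

ROWS = 100_000
rows = [(i, f"name{i % 1000}", f"ST{i % 50}", i % 3, i * 0.37) for i in range(ROWS)]

def bench(conn, label):
    conn.execute("CREATE TABLE customer (id INT, last_name TEXT, state TEXT,"
                 " class INT, sales DOUBLE)")
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)
    t1 = time.perf_counter()
    for s in range(50):  # one query per "state", like the 50-state test
        conn.execute("SELECT last_name FROM customer WHERE state = ?",
                     (f"ST{s}",)).fetchall()
    t2 = time.perf_counter()
    conn.execute("SELECT state, AVG(sales) FROM customer WHERE class = 1"
                 " GROUP BY state").fetchall()
    t3 = time.perf_counter()
    print(f"{label}: insert {t1-t0:.3f}s | 50 selects {t2-t1:.3f}s"
          f" | aggregate {t3-t2:.3f}s")

bench(sqlite3.connect(":memory:"), "SQLite (row)")
bench(duckdb.connect(), "DuckDB (columnar)")
```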
Single-table aggregation: now we're going to get a little bit more complicated, and without diving too deep into this, 2,611 milliseconds for row, 704 for columnar. Wow — that's 3.7 times the savings for columnar. And indeed, the more complicated it gets — like this one here, where we have five queries — the difference is nearly eight times. Nearly eight times the performance for columnar versus row: same queries in the two different databases, one a row database, one a column database. So hopefully you can see here the manifestation of some of what I was talking about earlier in the presentation, about how the columns take up their own storage and so on. And by the way, some of you are working in cloud storage today, as are we, and the same principle applies to data stores like Cassandra and HBase — they're not really databases, all right, data stores — that break apart the file into its respective columns. That's partly why those are so great in that world; there's more to it, but partly.

Okay, benchmark conclusions: columnar is a little slower to load, but much faster on queries. How much faster? 2.3x faster on simple single-column scans, 3.7x on simple aggregations, and 7.2x on an analytics query with a three-table join. I wanted you to have some walking-around relative numbers to think about, to go along with this presentation, so that you can understand what the value proposition might be for turning your database to columnar — or, I'll say, implementing columnar in a better way than perhaps you're doing now.

So, in summary of the overall presentation: columnar databases are an alternative to row storage. A columnar database stores each container independently, a container being a column or multiple columns that you throw together. It addresses the idle CPUs and the disk bottleneck — that IO bottleneck that's still with us. It's great for compression; I showed you some of the compression routines. It's best when there's a lot of data, which is going to magnify the differences, and when there are long rows, because with short rows you don't have a lot to break up, and when you can isolate the loads — that's a nice way of saying when you don't mind a little more time spent on the loads or the inserts. And it's great for high-column-selectivity queries. What I mean by that is whenever you are selecting — I'm just going to throw a number out there — 25 percent or less of the bytes in the records that you're touching, I call that a high-column-selectivity query. I have done studies with clients, sizing them up for columnar, and looked at their analytic queries, and I have found multiple times that well over 75 percent of queries that are deemed analytical are actually high-column-selectivity queries. We are not out there doing SELECT *s when it comes to analytical queries — sometimes, but not very often. So there's a lot of greenfield here for columnar still. And as we get all wrapped up in the 10 to 20 things we could be doing to make our databases better for our internal and external clients, this has to rise pretty close to the top of the list in terms of what will make them better. And that's why I say analytic databases should be columnar. Back to you, Shannon, to see if we have any questions.

William, thank you so much for another fantastic presentation. If you have questions for William or for Jeff, feel free to submit them in the Q&A portion of your screen. And just to answer the most commonly asked question, a reminder: I will send a follow-up email by end of day Monday with links to the slides and to the recording of this session, along with anything else requested throughout.
So, diving in here. To both of you: can the columns be variable size?

I'll start. Columns can be variable size; that just introduces an ID-map kind of thing to the data block. Now, what some columnar databases will do is try to avoid that: unless there's real high variability within the column sizes themselves, they'll force it into a fixed length. But it doesn't have to be fixed. Now I'll turn it over to Jeff.

You covered that one, William — that's exactly right, the columns can be variable in size. You covered it for sure.

Okay. So, you spoke about load, William, but what about insert?

Yeah, I mean, an insert is going to face the same thing; it's really just a smaller load. What I showed you here — whoops, it's running away from me — were inserts; I actually showed inserts, not loads. But regardless, you're talking about the same kind of thing: a load is going to be doing this, you know, a million times, whereas what we did is a million inserts. Now, would you do one versus the other? It depends on the nature of the availability of the data you're putting into the table. If it's more of a trickle type of thing, it would certainly be more like an insert. But yeah, we definitely still load — we still do a lot of loads — and in my experience, loads would show some of the same relative differences between row and columnar.

The only other thing I'll add there, William, around data loading with Vertica, is that there are a bunch of different ways you can do it, right? The COPY statement is very popular. We also integrate with Kafka — we've got a Kafka connector there. And the last thing I'll mention is, there's also ETL, of course, in the industry, but a lot of our customers use an ELT approach with Vertica, to use some of those analytical functions for data preparation. So that's just more on data loading.

There's a revolution going on in data loading these days. It's definitely not, as you mentioned, ETL-only anymore. There are really five to ten viable ways to load any table out there in production, and that's another area that we all need to be looking at.

Absolutely. So, for null-value columns, will the position be reserved in the vector with nothing, or is that handled in a different way?

Oh, I like these technical questions — it shows that people were following the detail we were putting out there. Jeff will be able to say how Vertica handles nulls, but I'll say, generally speaking, that instead of the prefix byte, if you will, that normally goes in front of a column, there'd be an actual byte stored alongside the column to indicate whether that column is of null value or not, in which case you should ignore whatever happens to be in the rest of the bytes. And what can also happen there is the columnar database can compress all the other bytes out and say, oh, this is a null, I don't have to store the 10 bytes, the 100 bytes, whatever the actual value is, because it's null. Of course, that does make for a variable-length record, which means you have to add the ID map and that bit of overhead. So these are some of the pros and cons of it.
Yeah, well, you're giving me more credit there than I deserve — I'm going to have to dig, and I'll put the answer in the chat. Sorry about that.

No problem, no problem at all. This is great, I love it. So: it seems that a columnar database de-aggregates a record. How does the columnar database associate different columns back to a single record?

Oh, I'm sorry, Jeff — that's the magic. I mean, that's the glue that I showed before. We can sit here and kind of intellectualize how it does that, but there's a lot of work that goes into doing it. When some of these databases out there that were row-based to begin with have added a columnar element — and I think that's great, by the way; if you're on one of them, you should definitely check that out — it took them a long time to do it. That's one of the hardest things for a database vendor to do; it's years in the making to be able to provide that option to you. But in general, I can think it through here: everything is at an offset. Each container, as I showed you in the example, holds an independent column's values, and each container has its own way of getting at the offset for a certain record number. What the database would do is simply go to that offset in each container — which is going to be different per container — pull it all into memory, connect the values up, and present that to you, and you have your result. Something like that.

I love it. So: what is the difference in modeling approach, to take advantage of columnar's advantages while minimizing load overhead?

Okay, well, modeling approach — this is where I say modeling is about the business; the model should reflect the business. I don't like to do a lot of quote-unquote physical modeling, because then I think I'm straying from the business, straying from the model being a representation of the business. So I don't like to do that, and therefore I'm really against these methods of trying to simulate columnar in a row-wise database by breaking it apart oneself, as opposed to using a columnar database that does it automatically. So I'm going to say that certainly in the logical modeling and the conceptual modeling, there's no difference between modeling for columnar versus row-wise, and I can't really think of a lot that I do differently when it comes to the physical level either. Now, obviously the implementation — the CREATE TABLE, the blah blah blah — when we get there, that's going to be different. But in terms of the model, I can't think of very much.

Now, in terms of the load: for most of my clients, and I would think for most people out there, the load is what the load is, and as soon as the data is available — most of the time, at least today — we're thinking, I've got to get it loaded. So if it's a trickle load, we can do that now, and that's what we do. We don't batch it all up; we don't wait for it to batch up overnight, or in two-hour blocks or something like that, like we used to do, and load that. So I think the load is whatever the load is, and you just hope that the database can handle it.

William, around loading as well: some of our customers that use Kafka with Vertica do what we call micro-batch loading. It's just a term for queuing up streaming data, or what have you, to be able to maintain that kind of performance upon load.
Good. And — this one came in during the previous question — is that the ID map?

It came in as a follow-on? Yeah, it came in when you were talking about offsets.

Yes, the ID map is all about the offsets of where the records sit within the data block, measured from the beginning of the block. The entries are fixed length, and notice how they are counted backwards from the end of the block. So the database manager knows where, let's say, the third ID map entry is, because it's three times however big an ID map entry is — usually two bytes, sometimes one — plus the size of any footer, which is usually one or two bytes, in from the end of the page. That's how it finds the entry, and that entry's offset points to where the record — or the column value, as the case may be — sits within the block. It's a beautiful thing.

Beautiful, I love it. Databases are beautiful.

All right, well, thank you both for this great presentation and this great information. That is all the time we have slated for this webinar. Again, thanks to Vertica for sponsoring today, and just a reminder, I will be sending out a follow-up email by end of day Monday with links to the slides and links to the recording. Thank you all — thanks to our attendees for being so engaged and for such great questions — and again, thanks to Vertica for sponsoring. William, thanks as always.

Thanks, guys.