The next speaker is Chandramali Madhavan. Chandramali leads the site reliability team at Google Bangalore, where he has worked for the past eight years in a variety of engineering roles on products such as Google Transliteration, Google News, and Google AdWords. Let's welcome Chandramali.

Can you hear me? Thank you to the organizers for inviting me here. I'm Chandramali Madhavan. I work in the site reliability engineering team here at Google Bangalore, and I'll be talking about analyzing terabytes of data with Google BigQuery.

Before I start, I just want to get an estimate of what the audience understands by big data. How many of you think a million rows is big data? What about 10 million rows? Do you think that is big data, from the kinds of applications you have looked at? Not many hands; I'm assuming you still think it is not too big. What about 100 million rows? That looks close to one of the sweet spots for big data processing. What about 500 million rows? That sounds like big data, and that is the sweet spot for a product like Google BigQuery: hundreds of millions to billions of rows. We'll see how BigQuery can help us do interactive analytical processing on those billions of rows.

We deal with a large amount of big data at Google. Every minute, we get 60 hours of video uploaded. We have hundreds of millions of gigabytes in our search index. At the last count, and this is an old number, we had 425 million users on Gmail. All of these systems generate a large amount of data, whether it is logs, server data, or data based on user interactions with these products, and we need to analyze it. A lot of the time we really do not know in advance what kind of analysis we want to do; the analysis can be fairly ad hoc. So we need a system which is good at ad hoc analysis and can do it at scale, reliably, and interactively. BigQuery gives you this power: you can store your data with reliability, redundancy, and consistency, and go from data to meaning at scale very quickly.

BigQuery is an externalization of an internal system called Dremel; there is a paper on Dremel for those who are interested. So the tools we have externalized are the same ones we use internally on these large amounts of data. Some of the ways developers are using it: game and social media analytics. Internally at Google it is heavily used for infrastructure monitoring, for example finding servers which take a long time for certain RPCs, or finding the top installed apps from a particular marketplace product. There is also campaign optimization, and sensor data analysis was a demonstration we put out during Google I/O, where we collected a lot of data from sensors and put up a whole system that showed the power of BigQuery.

So the agenda today is to show that power. I'll show you how to load your data, how to run your queries, and the underlying architecture and design. Let's dive in.

This is how BigQuery looks. I hope the network is not flaky. I just want to show how to load our data. I was looking around for a good data set to load, and on data.gov.in I found some water quality data, which I've already loaded, but I want to show how to load it.
A data set is a set of tables, and I'm just going to give a new table name; this table will contain the same data that those other tables contain. The way we load data is through files containing comma-separated values, so I'm going to choose a file for it. The next thing is to specify the schema. The schema is just a set of name:type pairs, matching the format the data is in. So I have the state name, and I've just declared all of these types as string: state, district, block, and so on. Then I want to skip some header rows; I have a single header row, and I'm going to skip it. I can also choose the field delimiter, which in this case is just a comma. And that's pretty much it; a job will be created for importing this data.

In the meantime, I'll show you the power of BigQuery with some sample queries on the pre-imported data. I can do a select on this table to get some data out of it. If you look at the schema of this table, it has the state, the block name, the village name, and a bunch of quality parameters. I have some pre-created queries, and here I'm trying to get the top 10 states. TOP is a statistical function implemented in Dremel, which we have externalized, and it can get the top 10 rows for a query. And I'm doing a regular expression match on a particular column, looking at the quality parameter "fl" for fluoride. It's taking longer than it should, really; the table probably has just 100,000 rows. So there: you can see the list of states where we have water quality problems because of fluoride, and the counts for each.

So we just saw how to load your data. One thing that is different about BigQuery: typically you're advised to keep normalized data so that you don't have redundant data. This is how your table would have looked for a birth record database, where you had IDs for the parents and a separate table for the parents. But that is not how you would typically lay out your data for BigQuery. What we advise is to denormalize your data; we'll see later, when we look at the implementation, why it is good to denormalize. So in this layout, the data for both the father and the mother is loaded as part of the same table that contains the birth record of the child. And as we saw, CSV, comma-separated values, is your format.

The way I showed you of loading data is through the UI, but that's really for small data sizes. For large data sizes, we advise you to use Google Cloud Storage: you store your blobs in Google Cloud Storage, and then instead of the file I showed you, you refer to the object in Google Cloud Storage to be imported into BigQuery. For running your queries, we have a bunch of client libraries available in Java, Python, .NET, PHP, JavaScript, Apps Script, and so on, and there's a REST API, so you can call into BigQuery from your applications.
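For reference, a query along the lines of that fluoride demo would look roughly like this in BigQuery's SQL dialect; the table and column names here are only illustrative, not the exact identifiers in the demo data set:

    -- approximate top 10 states by count of rows whose quality parameter mentions fluoride
    SELECT TOP(state_name, 10), COUNT(*)
    FROM [waterquality.habitations]
    WHERE REGEXP_MATCH(quality_parameter, 'fl')

TOP here is essentially a cheaper, possibly approximate stand-in for grouping by state, ordering by the count, and taking the first ten.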
Now, some intuition about the BigQuery architecture and how to think about BigQuery. A typical relational database architecture looks something like this: you have B-tree nodes, each covering a set of key ranges, and some disk storage where the actual data is present. When you want to look up a particular record, you traverse the B-tree, which is typically quite flat, to locate the actual record. In this architecture, doing a table scan is really expensive. BigQuery, on the other hand, is something that embraces the table scan. The folklore is that if you do a table scan on a one-terabyte table, you're going to have a really bad time; our goal is to perform a one-terabyte table scan in one second. With the current kind of hardware, reading a terabyte per second from disk would require around 10,000 disks, and processing a terabyte per second maybe requires 5,000 processors. This is the kind of scale we have internally at Google, which is externalized through BigQuery.

One other big difference is how we store your data: not in a record-oriented format, but in a column-oriented fashion. The column-oriented layout allows us to load only the columns of interest, so we are not loading data for columns that do not matter. Something like a select star will not be available in BigQuery, but for most analytical queries you will be filtering on a set of columns and selecting a subset of columns. Your tables may have a very large number of columns, but you really do not need to load all of them, so we read exactly the set of columns required to process your query. And since you are charged based on the number of gigabytes processed, loading only the columns of interest is, first, cheaper, and second, from a latency perspective, much faster for an interactive tool than reading all the data. This gives us a big, order-of-magnitude efficiency gain when we look at very large data sets.

The late Jim Gray had a famous quote: bring the computation to the data rather than the data to the computation. The whole BigQuery architecture is about moving the compute as close to the data as possible. So this is one way to picture the BigQuery architecture: there are mixers at various levels and shards at the leaf level, and these are all computation nodes. Coming from a MapReduce world, you can think of each leaf-level shard as a mapper and the higher-level mixers as reducers. Underneath, there is some distributed storage, something like HDFS or the Google File System. If you want to find a value, the difference from the previous slide, where we looked at the traditional database, is this: there, I only traverse a particular path of nodes in the tree. In BigQuery, that's not what happens; every shard reads its own slice of the data, does a table scan, and selects the matching rows, and only a certain set of shards will actually have rows of interest.

Some examples. Some of these queries you can obviously do in traditional databases, and they'll be equally fast: select count, which might be just a metadata operation in a traditional database, or max, where you may just have to walk down the rightmost edge of the B-tree. But consider something like standard deviation. To do this at scale, you use an online algorithm where each leaf keeps track of the size of its set, the sum of its elements, and the sum of the squares of its elements.
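As a sketch of why those three per-leaf quantities are enough (this is just the standard identity, not anything BigQuery-specific):

    Each leaf shard k keeps:   n_k,   S_k = sum of its x_i,   Q_k = sum of its x_i^2
    The mixer combines them:   N = sum of n_k,   S = sum of S_k,   Q = sum of Q_k
    Standard deviation (population form):   sqrt(Q/N - (S/N)^2)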
Using those per-leaf quantities, we can parallelize the computation even though our data is split up amongst the leaves. Another thing you would almost never do in a traditional database is a regex match on each and every row. Here you can do a regular expression match on every row: since we are doing a scan anyway, we are able to run a computation on each row and select the subset of rows that match your interest. Similarly, nested selects are things you can do here, and they will still be very efficient.

We support joins, both large and small. For small joins, as the next figure shows, we actually move the small table to each compute node in the tree, and the join is done at each level of the hierarchy. We allow a small join if the right-hand-side table is less than 8 MB of compressed data. There is also a large join, which works by partitioning the tables and running on all the compute nodes. The syntax is JOIN versus JOIN EACH, so depending on your use case you can use either a small join or a large join.

I'll try to demonstrate some queries here. Can everyone see this? This is a Wikipedia page view data set which we have; it has hundreds of millions of rows. I'm selecting the title and the sum of page views for the English language, where the title matches a particular regex, then grouping by title and ordering by view count. Let's run this query. I had run this query some time earlier, so maybe I can show that result here: it ran in five seconds and processed around 37 gigabytes of data, and you can see the set of results. Hopefully by the time we go back, the new run will have finished; one thing to note is that some of this runs as part of the general free quota, so sometimes it's a bit slow. There, you can see the titles and their view counts.

I also want to discuss some limitations of BigQuery. If you have a very large result, sometimes you will see a "response too large" error. If you really need all the data, you should send it to a destination table; we allow very, very large responses if you send your results to a destination table. You typically would not want terabytes of data back, so be careful to put a limit clause if you just want to examine a few rows. Another thing: when you use a group by clause, some of the intermediate nodes will get a large amount of data. One way of avoiding problems with this is to sample the rows. We have a hash function which is highly collision resistant, so you can use something like HASH(user_id) % 10 = 0 to sample the rows, rather than processing each and every row of the table, and then multiply by the appropriate factor to get an approximation. Another thing you do a lot of the time is to group by a certain key and then order by the count, say if you want to find the top K users who did some particular action. There you can use the TOP function as I stated earlier; it's highly efficient and statistically very accurate, and it will get you very good results at a much cheaper cost than trying to do a group by, which might give errors with a response that is too large. So I think I'll just skip this.
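As a rough sketch, the sampling and TOP patterns just described look something like this in BigQuery's SQL dialect; the table name and the user_id and action columns are illustrative, not from a real data set:

    -- Sample roughly 10% of users via the collision-resistant hash,
    -- then scale the counts back up by the sampling factor (10x here).
    SELECT action, COUNT(*) * 10 AS approx_count
    FROM [mydataset.events]
    WHERE ABS(HASH(user_id)) % 10 = 0
    GROUP BY action

    -- Approximate top 10 users by event count, without a full GROUP BY.
    SELECT TOP(user_id, 10), COUNT(*)
    FROM [mydataset.events]

The first query trades a little accuracy for much less data flowing into the group by; the second leans on TOP's cheap, approximate top-K computation.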
This is one question some of you might have: why not MapReduce? Typically in MapReduce you have a controller and a bunch of workers doing the map, and you shard your data to parallelize the execution. First you do the map, then you do a reduce to combine the data together. But between these there is one big phase which is fairly expensive: the shuffle. The shuffle means you have to move data around, so you produce intermediate output on disk and move it to the correct reducer. In many cases you might also require multiple passes over your data. So it's not very well suited for interactive analysis. For interactive analysis, something like BigQuery will get you your response in a matter of seconds.

So in summary, we saw what big data is, how BigQuery is different from MapReduce, and that BigQuery is really very good for ad hoc processing of large amounts of data; I also gave you a little introduction to how to think about what happens when you execute queries on BigQuery. I'll open it up for questions now.

Are there any numbers you can share for BigQuery versus MapReduce versus a traditional relational database, where you say, OK, for this much data, this is the time taken by BigQuery, and so on and so forth?

I don't have comparative numbers, but if you look at the Wikipedia example, where some of these tables had 400 million rows, around 50 gigabytes of data, we could process it in something like 30 seconds. Obviously, if you do sufficient parallelization you could maybe do it in a reasonably fast time elsewhere, but imagine a system where you have it in a very interactive fashion: you don't have to do any setup, the data is preprocessed, and it is available at these nodes for continuous analysis. With MapReduce, just bringing up your jobs and setting up the whole framework takes time. So if you're ready to wait for 10 minutes, then MapReduce will probably work fine for some of these cases. But some of the use cases are different: typically product managers will have workbooks, and there are very good Excel connectors, and Google Docs spreadsheets have Apps Script. From Apps Script you can query into BigQuery and combine them, and you get very good mashups where these workbooks get updated in near real time. Thank you.

Hello? Yeah. What is the pricing model?

There's a nominal price for storing your data in Google Cloud Storage. And then, I don't have the exact numbers, but you're priced on the amount of data which is read for each query: based on the columns in your select and the columns in your where clauses, basically the gigabytes or terabytes of data which are processed for that query, there's a charge based on that.

Do you support medians or percentiles?

Yes, percentiles; quantiles are there. I think medians should be there as well.

I have a question here. My question is related to your write costs. Do you compress the data like some of the column stores do, and store it that way? And then when you access it, do you have some kind of query optimization inside BigQuery where, based on the where clause and based on what is indexed, you can reduce the amount of work you need to do when you sift through the data?
As stated earlier, we are storing data in a column-oriented fashion. In lots of columns the number of unique values is small, and those columns are highly compressible, so that data really is compressed. And we will only be loading up the columns which are required: first the columns needed to select the rows in your scan and to do the compute on, and then any other columns for those rows which need to be returned to the user as part of the select.

What I meant was, suppose you have a long where clause. Do you have a query optimizer which decides how to execute the where clause, so that the amount of data flowing into the subsequent group bys, and the amount of data you need to fetch from disk, can be reduced? Or do you just do a blind where-clause execution?

I'm not familiar with the implementation details, but we surely have some query optimization; I just can't answer exactly what we do. OK.

Let's imagine I'm going to show a page of the top sellers or some top results, and let's say I use a limit so that the query is faster. Obviously, I can limit to 100 and show the results 10 at a time, but most people don't go beyond the first two pages, so limiting to 100 or 200 is more expensive than I need. Do you have a yield kind of logic where I can fetch 10, then ask again and get the next 10 only if required?

I don't think we have any pagination support right now, which is pretty much what you want; it will be a fresh query. There are cached queries, so you can use cached results, but you will still have to issue the query again.

Do you think it will be part of the roadmap? I'm not aware. Thanks.

Hi. My question is regarding the column store which you are using for storing the data. Are there any performance implications when you increase the number of column families in a single table, for example if you have more than five or six column families?

No; we have customers having tables with thousands of columns. What do you mean by column family here? For example, in HBase we have multiple column families, and each column family has multiple columns for each row. So here, you just have columns. OK, so that's more similar to the Bigtable kind of world. Yes; here it's columns, and if you want to mimic column families, you'll have to store maybe a protocol buffer in that particular column and access it as a homogeneous unit. But you may not get the compression which you expect if you store it like that. OK, thank you.

Hello. Hi, this side. Is BigQuery designed to support real-time data, or is it supposed to work with static data? A great question. We allow about 1,000 appends per day on a table, which is roughly a 40-an-hour kind of thing, so you will have to batch your data. Typically what we have seen customers do is break their near-real-time data into maybe hourly or 15-minute intervals and create an append every so often.
But if you want something sub-second, where you keep appending to a particular table, that won't be possible. So that is not supported? No, that is not supported; you will have to batch it up. So it is near real time, but if you want to send a few log records every second, that is not supported.

Are there any Google products which use BigQuery? So as I said, internally we have something called Dremel, which is used heavily for all our log processing; it's one of the most heavily used tools internally. OK, thank you.

Hello. My question: in the beginning of the presentation you mentioned that it's advisable to denormalize our data and then import it into BigQuery. But is the onus always on the client to denormalize the data, or does the BigQuery import framework support specifying that a table needs to be denormalized? No, you will have to denormalize the data in whatever ETL framework you use and then load it up. OK, and my second question: is it open for trial, for free? What I showed you, bigquery.cloud.google.com, all of you can go and play around with queries there. There is a set of sample data sets; unfortunately it was a bit slow, so I couldn't show you more queries, but you can just go and play around. I think the first 100 megabytes of processing is free, so you can even load your own data sets and try it out. Thank you.

One question maybe here. Yeah, go ahead. What I wanted to know is more related to the strategic vision of Dremel, or BigQuery, if you have an idea. Where are you? I'm sort of lost in the crowd. At the back, straight back. OK, yeah, go ahead. Basically, what I wanted to know is: are joins across big tables a part of Dremel's vision? How is it positioning itself? Is it an alternative to the MPP databases that are out there, or will it continue to stick with what it does well? Large table joins are one example; do you see that happening in Dremel?

So we have customers who join very, very large tables; it's already available in BigQuery. OK, and are the response times as fast as normal? It might take, say, 10 minutes for two very large tables instead of less than a minute, but basically it is still very interactive and something you can depend on.

One last question. Yeah, go ahead. Are there any use cases which we can do using HDFS and MapReduce, but not using BigQuery? Basically, if you are going to do a lot of multi-stage combining, if you are not willing to denormalize and put your data together, say a deeply nested series of joins across 10 tables or so, then maybe a multi-stage MapReduce will work better in that case. So those are some cases where, if you're willing to wait a few hours for your processing, MapReduce would work.

If you have any more questions, please take them offline. Thanks, Chandramali. Thank you very much.