Okay, so I'm Jordan Tigani. I'm a software engineer working on BigQuery. BigQuery is a tool for data analysis, somewhat the way MapReduce is a tool for data processing, and it's based on internal tools that we've built over the years at Google.

Before I get too far into what BigQuery is, I want to talk for a second about what big data is. Since we're at a big data conference, I assume everybody has some idea of what big data is, but one of the things I've found is that everybody has a different opinion of what it means. At Google, we've got Gmail users in the hundreds of millions, plus the metadata about those users, their emails, and so on. That's a lot of data, and we tend to say it's starting to get big when you have a billion rows or a terabyte of data. For other people it means something different: if you're doing machine learning and you've got a million rows, that may already be more data than you can handle. So I think big data is not really about the size of the data you have; it's about whether you have a problem with your data. Do you have a problem scaling? Is your architecture working with the size of data that you have?

So I think of big data in terms of problems. You might have a big data problem if, for example, data is coming in faster than you can process it. You may have only 10 megabytes of data, but if it's coming in so fast that you can't write it out to your data store quickly enough, that's a big data problem. Relatedly, if you're growing 10% a week, you may only have a small amount of data now, but you want to plan for when you're two to three orders of magnitude larger. Maybe you're waiting to sign a big customer who will increase the amount of data you're processing by orders of magnitude; that's still a big data problem. Or sometimes the data architecture itself makes it hard to scale: if you have a single monolithic database that you're relying on for all your analysis, it can be difficult to shard it into multiple pieces, and once you do shard it, you can't do the same joins you had been doing before. So architecture is very closely tied to what big data means.

Sometimes it's just a question of cost. If you've got a MySQL instance or a SQL Server instance and you need twice the capacity, you might have to move away from commodity hardware and buy some big-iron machine, and your costs go up by 10x. That cost could be the cost of physical machines, of developer time, of cloud time, and so on. One of the goals of big data is that cost should be proportional: if you double your data size, you want to roughly double your cost, not go up 10x or have to come up with a different architecture.

Think about the old model, where you had a single server, all your data lived on that server, and you queried on that server: that was your database server. Clearly that has some scaling limitations. The big data era is the era of giant data centers full of commodity machines: when your data size increases, you want to be able to just plug in an additional node, whether you're doing that physically or via some sort of cloud offering. Big data is about scaling out, not scaling up.

Google recognized the need to scale out from very early on, and all of our algorithms and architectures are essentially scale-out architectures. The founders and the initial people at Google didn't want to buy a million-dollar database machine; they wanted to buy $25,000 machines and figure out in software how to make the 20 inexpensive machines act like the million-dollar machine. Then, when you need to scale again, you don't have to buy a $10 million machine; you can just buy more commodity machines. To give you an idea of the scale we're dealing with: YouTube receives 72 hours of video every minute, and as of a couple of years ago the Caffeine search index was a hundred million gigabytes.
That's a lot of data. If you're doing analysis, you're not going to be able to process it on a single machine, no matter how big that machine is. BigQuery, the product I'm here to talk about, is based on an internal tool called Dremel, and it came out of a big data problem: we had a lot of data, and we wanted to be able to ask questions of it. Some of those questions are the same ones we ask every day, like "how are things changing?", but some are questions we've never wanted to ask before, like "is our product strategy kicking ass or not?" The other thing is we want results back quickly. We don't want to wait ten minutes for a MapReduce to run; we want to interactively ask questions about our data. Traditional SQL databases, especially ones that let you do analytics, don't scale out; making normal SQL architectures scale out is essentially an open research problem. As you've seen throughout some of these talks, traditional databases don't do well when you start adding nodes, and with the NoSQL databases, well, the "no SQL" part means you don't get SQL.
You can't ask the same kinds of SQL queries, the same kinds of questions, that you can with a traditional relational database. And once you have big data, doing scans over all of it is going to take a long time, so we want to be able to parallelize.

So Google built a technology called Dremel, and it scales to thousands of nodes. The VP of infrastructure at Google was recently quoted as saying you can do non-trivial queries over a petabyte of data and get results back in on the order of 10 seconds. The particular clusters we have provisioned for BigQuery are slightly smaller than that; if you need that kind of performance, we'd be happy to have you come talk to us and we'll build you that cluster. But just to give you an idea of how fast it is, you don't have to take my word for it. This is a data set we have that covers every page in Wikipedia, and every day it records the number of web hits each page got. What I'm going to do, if you can see this, is run a query to find the pages that got the most hits, that are in Spanish, and that match a funky regular expression. This is the kind of thing you just can't do on a relational database, because you'd need an index of the data, and doing a regular expression over every single entry in the database would be prohibitively expensive. But we can do this.
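As a concrete illustration, here is a sketch of what a query like the demo's might look like, written as a Python string in BigQuery's legacy SQL dialect. The table reference, the field names (`title`, `language`, `views`), and the regular expression are all illustrative assumptions, not the exact query from the demo.

```python
# Hypothetical sketch of the Wikipedia demo query in legacy BigQuery SQL.
# Table reference, field names, and the regex are assumptions.
query = """
SELECT title, SUM(views) AS total_views
FROM [publicdata:samples.wikipedia]
WHERE language = 'es'
  AND REGEXP_MATCH(title, r'^A.*o$')   -- stand-in for the "funky" regex
GROUP BY title
ORDER BY total_views DESC
LIMIT 10
"""
```

Because the engine scans the needed columns in parallel rather than relying on an index, a regular-expression predicate like this is just another filter applied during the scan.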
This is a lot of data, so it's going to take a few seconds. By the way, this is the BigQuery web UI, our external version of Dremel that third-party customers, people like you hopefully, can check out and run over your own data sets. So: 22 seconds, and we just ran this regular expression over 611 gigabytes of data. Hopefully you find that fast; if you don't, then you have a different idea of fast.

A lot of people ask us, why don't you just use MapReduce? MapReduce is a fantastic technology. It's used thousands of times a day at Google; Google is a very heavy user of MapReduce and arguably invented it, or discovered it, for uses in big data. So I want to talk about why MapReduce doesn't work so well for these kinds of things. In a MapReduce, you generally have a master that gets kicked off to run whatever processing you're doing. The master launches some mappers, allocates them, and sometimes may even boot additional machines. The mappers read data from distributed storage, process it, and write it back out. Then there's a step people don't usually talk about in MapReduce, and it's often the kicker because it's the slowest: shuffle. It's really map, shuffle, reduce. Shuffle is how you know which data to send to which reducer, and it's essentially an on-disk sort. Finally, the reducers compute the final outcome. Each of these stages goes through distributed storage, which essentially means writing everything out to disk.
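To make the map/shuffle/reduce phases concrete, here is a toy, in-memory word count in Python. It only mirrors the shape of the three phases: a real MapReduce runs each phase across many machines, reads and writes distributed storage between phases, and implements the shuffle as an on-disk sort, which is why shuffle is often the slowest step.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs -- here, (word, 1) for each word.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    # Route all values for the same key to the same reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big query"])))
```

Running it over `["big data", "big query"]` yields `{"big": 2, "data": 1, "query": 1}`; the point is the pipeline shape, not the word count itself.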
There are a couple of variations; MapR, for instance, has a sort of in-memory, high-performance shuffle stage, but these are essentially tweaks around the edges. MapReduce is just designed for batch processing. The other thing about MapReduce is that many of the interesting questions you want to ask can't be done in a single map and reduce phase; a single map and reduce is pretty limited, so you often have to chain multiple stages of maps and reduces.

To compare the two: MapReduce is essentially for batch processing, while the Dremel/BigQuery architecture is optimized to get answers quickly, with low latency, for SQL-type queries. Let me show a little bit about how that works. Dremel is built as a tree structure. At the bottom you have the same distributed storage layer as for MapReduce, although the data tends to be stored in column format. The column format is key: you may have a very wide table, say a thousand columns, and if you're only querying over two of them, you don't need to read everything. The bottom of the tree is the leaves, which are somewhat like MapReduce workers, except that it's a long-lived serving tree, so you don't have to spin up something new every time. The connections between the leaves and the next-higher nodes, which we call mixers, are very fast network connections, so you don't have to write anything to disk. You're essentially doing the reducing and sorting by pre-assigning work to the leaves and mixers. If that's a little bit hand-wavy, let me go quickly through one example of a Dremel query.
We have a public data set that has every baby born in the United States for 50 years, with a bunch of demographic information about the baby, the parents, and so on. This is a query that computes the years in which the most babies were born to mothers over 30. We see the same Dremel tree here, and we'll start from the bottom, because the query is rewritten, and usually simplified, as it travels down the tree. At the very base level, even though we have a lot of columns, we only need to read two: mother's age and year. Essentially a simple SQL query goes into the leaves. The leaves then compute the count and the GROUP BY themselves, and since we have 50 years of data in this data set, each leaf returns essentially 50 values back up the tree. The mixers continue to aggregate those, and notice they no longer have to do the filtering (where the mother's age is greater than 30), because that's already been done. Finally, the root of the tree applies the LIMIT, and also the sort, because we want the years with the highest number of births. That's essentially how Dremel works. If anybody has questions about the specifics, I can tell you a lot about it, because we published a paper on it; I'd be happy to talk afterwards.

So this is great: BigQuery is something we can use at Google. But hopefully you may want to know: how can you use BigQuery?
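The tree execution just described can be sketched in a few lines of Python. This is a toy simulation with made-up data, not BigQuery code: each leaf reads only the two columns it needs, filters, and partially aggregates; a mixer merges the partial counts; and the root sorts and applies the limit.

```python
from collections import Counter

def leaf(rows):
    # A leaf reads only the two columns it needs (year, mother_age),
    # applies the filter, and returns a partial per-year count.
    return Counter(year for year, mother_age in rows if mother_age > 30)

def mixer(partials):
    # Mixers just merge partial aggregates; the filter is already done.
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

def root(merged, limit=3):
    # The root sorts by count and applies the LIMIT.
    return merged.most_common(limit)

# Two made-up shards of (year, mother_age) rows.
shard_a = [(1970, 28), (1970, 33), (1971, 35)]
shard_b = [(1971, 31), (1971, 40), (1972, 25)]
top = root(mixer([leaf(shard_a), leaf(shard_b)]))
```

With these two shards, `top` comes out as `[(1971, 3), (1970, 1)]`: three qualifying births in 1971, one in 1970, none in 1972.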
It's just a web API, a RESTful API; REST is a buzzword for being able to do simple HTTP operations: an HTTP GET to read a table, an HTTP POST to insert a query job or a load job. The way you generally get your data into BigQuery is this: you have your data in multiple pieces, in CSV or JSON format; you upload it into Google Cloud Storage, which has a very fat pipe from the Internet; and then you run an import job that loads it into BigQuery.

What does that look like? It's a JSON POST request to a BigQuery URL. The first part is just the BigQuery endpoint, then the project. Everything in BigQuery is rooted in projects; you might think of a project as your company, or as a sub-project within the company, say some project you're working on, and your namespace is private within that project. You give it the source URI where the data came from, the name of the table you're creating, the schema, and some minor information, and that's all you need to do. To show what that looks like: this is the Chrome developer console. You probably can't see that too well.
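Here's a sketch of what such a load-job payload might look like, built as a Python dict. The bucket, project, dataset, table, and field names are all made up; the overall shape (a load configuration with source URIs, a destination table, and a schema) follows the REST flow described above.

```python
import json

# Hypothetical load-job body for POSTing to the BigQuery jobs endpoint.
# All names here (bucket, project, dataset, table, fields) are made up.
load_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/data.csv"],  # file staged in Cloud Storage
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
            "schema": {
                "fields": [
                    {"name": "title", "type": "STRING"},
                    {"name": "views", "type": "INTEGER"},
                ]
            },
            # A load can also append to an existing table rather than
            # creating a new one, which suits growing a table over time:
            "writeDisposition": "WRITE_APPEND",
        }
    }
}
body = json.dumps(load_job)
```

The serialized `body` is what would go in the POST request; the client libraries mentioned below build exactly this kind of payload for you.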
Actually, I guess it's not so bad. Everything the BigQuery UI does goes through the public API, so you can use the Chrome developer console to see what actually gets sent. Here we can see the request payload; it basically just sends the same data I mentioned before. This one is creating a table named "foo.bar" with some simple fields that I imported from a CSV file in my Google Cloud Storage bucket.

Okay, so running a query is very similar. A load is one type of job; a query is another type of job. You give it a very similar payload, except that instead of specifying the schema and the Cloud Storage location, you specify the query you actually want to run. If that seems a little confusing or difficult: most people don't need to operate at that wire level. There's a bunch of publicly available client libraries, for virtually any language you might want to code in, that make it easy to connect. I'll show an example in JavaScript: you basically create a query object and then a request object, and you execute it. It's pretty straightforward, or as straightforward as JavaScript can be.

I wanted to do a couple of quick demos to show BigQuery in use. The BigQuery team uses BigQuery a lot, to understand and to debug our own service, and I thought this might be a good way to show you: we're running a production service, so how can BigQuery be useful for doing that?
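In the same spirit, a query job reuses the same jobs endpoint with a different configuration section. A sketch, again with made-up names:

```python
# Hypothetical query-job body: same jobs endpoint as a load job, but a
# "query" configuration instead of a schema and source location.
query_job = {
    "configuration": {
        "query": {
            "query": "SELECT year, COUNT(*) FROM [my_dataset.my_table] GROUP BY year",
        }
    }
}
```

The symmetry is the point: loads and queries are both just jobs, so the same POST-a-JSON-body pattern covers both.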
What we do is load the BigQuery jobs database table into a BigQuery table approximately every hour; we snapshot it every hour. You can see the schema here, and a lot of these schema fields are the exact same fields you saw in those job objects that get POSTed, so there's configuration, query, and so on. It's a nested data structure, and from it we can run any BigQuery queries we want over our own jobs.

One thing we often want to do is debug problems people are having. The BigQuery health forum is on Stack Overflow, and sometimes people ask us questions there. You don't need to read this one; someone was hitting an error, and the error indicated an unexpected error, which translates into an internal error on our side. Often the information people give us is not enough to diagnose the problem easily. Maybe they send us their query, but only part of it, or a reduced version, and we want to see what they actually ran. So I want to run a query over our jobs table for a phrase I picked out of their query, one that's probably intact even if they redacted the rest a little, where the result error was an internal error. And I capture a bunch of fields: the job ID, so I can look through our server logs, and the debug info, which captures our stack trace. I truncate the stack trace because there are some things we don't necessarily want to show; we don't show our whole stack trace. These queries are very fast. We call these "spearfishing" queries: you're looking for one particular thing, and you already know some information about it. We've also ordered the results by time, so the most recent ones show up first, and we can see the debug info says "unexpected Dremel error". If I show the rest of that debug info, it shows a lot more about what actually happened, and we can also see the query that was run; we capture a bunch of interesting things.

In addition to spearfishing, you often just want to ask questions about the health of your data and your service. For example, sometimes we want to know how long these Dremel queries are taking. Average is a bad number to use, because an average is heavily dominated by large values: if you get a hundred people in a room and one of them is Bill Gates, their average net worth is going to be a much higher number than it arguably should be. So BigQuery has a quantiles function, which essentially gives you percentiles; the median is the 50th percentile, and you can also compute, say, the 90th percentile. We just want to see how long these BigQuery queries take: what's the median, what's the 90th percentile. And we can see that about 10% of them finish within 1.5 milliseconds; these numbers are in milliseconds. And I've lost the edge of this. Come on.
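Here are sketches of the two kinds of introspection queries just described, as Python strings: a "spearfishing" lookup for one specific failed job, and a latency-percentile query using a quantiles function in the legacy SQL dialect. The snapshot table and its field names are illustrative, not the team's real internal schema.

```python
# "Spearfishing": find a specific failed job by a phrase from the
# user's reported query text, most recent first. Table and field
# names here are made up.
spearfish = """
SELECT job_id, error_result, debug_info
FROM [ops.jobs_snapshot]
WHERE query CONTAINS 'phrase picked out of the reported query'
  AND error_result = 'internal error'
ORDER BY start_time DESC
"""

# Service health: QUANTILES(expr, 101) in legacy BigQuery SQL yields
# 101 values, i.e. the 0th through 100th percentiles, so the median
# and the 90th percentile can both be read off one result.
latency = """
SELECT QUANTILES(query_duration_ms, 101)
FROM [ops.jobs_snapshot]
"""
```

The spearfishing query narrows to one row by combining a text filter with the error condition; the quantiles query summarizes the whole workload in a single aggregate.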
There we go. If I keep scrolling: the 50th percentile, so half of the queries finished in less than 777 milliseconds, and the 90th percentile is under 10 seconds, so 90% of queries finished in under 10 seconds. There are a lot more interesting things we could do, like counting users, counting jobs, and counting revenue growth, but some of those are sensitive internal things, so I would probably get in trouble if I started showing them to you.

Once you're able to get these answers, the next thing you might want to do is visualize them, put them in a pretty chart you can look at. Google has a charting API, and on the BigQuery team we built a dashboard. This is a stripped-down version of it; it's just a simple App Engine app that calls into BigQuery. Hopefully we can get it to resize correctly. We add queries to this dashboard that are run periodically. Here we can see average import size: how, over time, the number of files people are using, the number of megabytes per import, and the average record counts change. And if you look down here, we can see the actual BigQuery query that was run; you could cut and paste it into the BigQuery browser tool and run it, or tweak it. This chart is the query-time quantiles, the percentiles I just showed you; you can see how non-linear the distribution is. Some people run very complicated queries that are very expensive, whereas most queries finish in under a second. And again, here is the exact query I just ran.

This App Engine app has been released publicly as a sample, so people can try it and tweak it.

There are some other ways to use BigQuery. Via Apps Script, BigQuery is integrated with Google Spreadsheets. I'm not going to show that here, but it's relatively straightforward: you say you want to use the BigQuery API, there's a built-in editor with autocomplete that lets you write BigQuery queries, and then you can use Google Spreadsheets to graph the results. Or perhaps you haven't totally drunk all the Google Kool-Aid and you want to use Microsoft Excel: there's also an Excel connector that lets you run BigQuery queries from within Excel and use Excel's graphing mechanisms to understand the results. In addition, there are some third-party tools. There's a French company called BIME that does some really interesting BigQuery visualizations; another one here is QlikView, which can show some other ways of looking at your data; and finally, we just announced integration with Tableau, another data visualization company, based out of Seattle.

So, to wrap up: BigQuery enables you to get your results back in seconds. It does full table scans over your data; we figured out a way to make a full table scan fast by running it in parallel, so in addition to running aggregate queries, you can ask much more complex questions than you otherwise would be able to. It's an API that's simple to use, and it's scale-invariant. That's the last point I want to leave you with: if you're using BigQuery, you shouldn't need to worry about whether your data is big enough to use BigQuery, or whether your data is too big to use BigQuery. You can run queries over 10-row tables and you can run queries over trillion-row tables, and it's the same architecture. With 10-row tables it's likely going to be less expensive, but it should all just continue to work, and the cost scales proportionally to the size of the data you're using. So thanks very much, and I'm happy to answer any questions people have.

Yes, you in the striped shirt. [Audience question] When you use cloud services, how do you upload the data? Imagine that you have several terabytes of data in your services; how can that be uploaded to Google, at that scale?

That's a great question: how do you get the data in, especially if you have terabytes of it? The thing is, when you have terabytes of data, you usually didn't get it all at once; it's often that you're bringing in smaller pieces of data every day, and your data grows over time. So you can run appends to the same table as the data comes in. Maybe you're doing a gigabyte a day, or 10 gigabytes a day: you can break that up into pieces and upload it to Google. Also, of all the tables we have, the Wikipedia example from earlier is over three terabytes, and that took 37 minutes to import into BigQuery. It can be a pain to upload data into Google Cloud Storage, but you can do it in parallel, via a number of mechanisms, and once you've done that, getting it into BigQuery has been heavily optimized. You can also compress the files, which should give you a bit more effective bandwidth.
That said, for very large data systems it's a tough problem, and it's also one we're working on improving. Google Compute Engine should be available pretty soon; if you're generating your data via, say, a Hadoop job on Google Compute Engine, we'll have fast pipes to get it into BigQuery. And if you're running in Amazon's cloud, you should still be able to post your data to Google Cloud Storage.

Any other questions? Yes. Oh, okay, sure. Since you don't have a microphone, go ahead and ask and I'll repeat your question. [Question about how BigQuery compares to Impala.] I'm actually not that familiar with Impala; I think there are significant differences. There's also an open-source Apache incubator project called Drill that's trying to replicate the Dremel paper, but it's still in its infancy; they don't yet have a working version. I'm sorry, I don't know more about Impala.

Sure, yes, one in the back. [Audience question] What do you think is the best combination for this technology: combining it with a classical business-intelligence approach, or with a big-data Hadoop approach? How does it coexist with a Hadoop project and all these elements? What are the intersections, and what value does each give to the other?

So, that's a good question.
I think if you're using Hadoop and you're storing your data in HDFS, you do have to run an export mechanism to get it into BigQuery, so it's non-trivial to integrate. That said, if you are using Hadoop, you should be able to, in your final reduce step, write your data out to a BigQuery table. So I guess the weaselly answer is that the integration right now is not great, but it's something we're actively working on. We know that data analysis is only one portion of the big data story, and that we need a wider offering; that's all I can tell you right now. Anything else? Okay.

[Question about querying App Engine Datastore data.] Yes, so the question was: is there a way to query over your data that's in the App Engine Datastore? We just worked with App Engine on this; they released a beta backup feature, so via App Engine you can say you want to back up your data to a BigQuery table. That triggers a BigQuery import job, and then you can run queries over it. That all happens internally within Google, and it's very easy to use. If you search for it, I think it's a public beta that you can sign up for if you're interested. Anything else? Thank you.