Hello, everyone, and welcome back to the State of the Web. My guest is Felipe Hoffa, developer advocate on the BigQuery team at Google. And today, we're talking about BigQuery and how we use it to learn about web transparency. Let's get started. Felipe, thanks for being here. So for the uninitiated, how would you describe BigQuery, and how is it different from other databases? So we call BigQuery our serverless enterprise data warehouse, which basically means that you can use it to analyze terabytes of data in seconds. It's based in the cloud. It's completely serverless. There are no servers for you to manage, and you don't need to pay for a license. It's all about how much data you have and how you want to analyze it. So people end up paying per query. Now, with that said, everyone has a free terabyte every month to run all of the queries that they want. No credit card needed. Also, BigQuery allows you to share data. So if you're storing a lot of data and you want other people to be able to query it, you can decide who uses it. And the best part is that, at the end of the day, you can run your SQL queries, and they will run really, really fast. So let's geek out on that for a second. How does it run very fast, especially for terabytes of data? What exactly is running under the hood? Oh, so if we go back in time to before Google existed, people were dealing with that problem by getting bigger and bigger machines. Then Google showed up, Jeff Dean showed up, the MapReduce paper showed up. And people started distributing jobs everywhere. That's where Hadoop was born. But BigQuery is a step forward from that. At Google, you can imagine how many logs we have to process. So we had a team at that time wondering, how do we process terabytes of logs faster than what MapReduce can do?
And thus, the Dremel paper was released, which is basically an idea of, how do we do distributed analysis of many files that have been stored in a columnar shape? So when I think of a database, sometimes the picture that comes to mind is kind of boring, just a table of user IDs or something. But the things actually being done on BigQuery are really interesting. Could you expand on some examples of interesting projects? Oh, yes. There are so many people trying to analyze data. So of course, you have a lot of business use cases, DevOps use cases, people analyzing logs. But then, you can also find people analyzing the world: geographic data, spatial data. One that fascinates me is people analyzing genomics with BigQuery. So basically, you can put the DNA of thousands of people there and find patterns and analyze it in ways that people are really, really enjoying. Like, well, now we can run these things at a different speed. Yeah, that's amazing. That's really cutting-edge stuff. Yeah. So what are some examples of public data sets that people have shared? Well, there are a lot of public data sets. I'm sure you know some of them. But to mention some different ones, you can find, for example, the weather for the whole world, day by day, as published by NOAA. We store it in BigQuery, and we make it available for everyone. There's a full archive, updated daily, of every Hacker News story and comment. So you can go and analyze all of Hacker News, et cetera, et cetera. We have hundreds of data sets. You did an analysis of the World Cup goals, too, right? Oh, yes. Yes, yes, yes. That was really fun. So for example, I took every goal that everyone scored during the World Cup and put it into a dashboard. But what's really, really beautiful about that data set is that we have every kick. Every time someone touched the ball in one of these matches, you can see and analyze how that happened. That's not exactly a public data set.
We license that data. But still, we have this idea that we can even analyze soccer and different sports. We have a huge partnership with the NCAA, for example, to analyze and publish their data so people can go into basketball and beyond. That's great. So as you alluded, for some of the web transparency data sets that I work with, like HTTP Archive and the Chrome UX Report, which we talk about on this show all the time, I'm always pulling out stats about them. And what a lot of people might not realize is that those are powered by BigQuery, for all the reasons that you mentioned about scalability and shareability. And one of the things that I really love about it is that it enables people to just discover insights about the web all on their own. So for example, they might go on the httparchive.org website and see some curated stats and trends. But going deeper into the data, I equate it to a data berg, where you have this block of data. It's like an iceberg. You see a little bit on the surface. But once you start getting in there and querying, you can find some really interesting insights. Yeah. I love what Ilya started doing there, as BigQuery was being released as a product for the whole world. We used Dremel internally for a long time. And then we went into this cloud universe where we are sharing our tools. Ilya Grigorik discovered that BigQuery was available, discovered all these features, and started putting data there. First with GH Archive, with all of the GitHub data, and then with all the web data sets that you know about. So if somebody wants to get started querying this data set on BigQuery, what are the steps? First thing you should know: you don't even need a credit card. You just need to create an account. You probably already have a Gmail account or an enterprise account. And if you don't, just get one. Then go to console.cloud.google.com/bigquery.
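For someone following along, a first sandbox query might look like the sketch below. It runs against the public Hacker News dataset mentioned earlier; the column names assume that table's current public schema.

```sql
#standardSQL
-- A first query to try in the sandbox: top-scoring Hacker News stories.
-- Uses the public dataset mentioned above; columns assume its current schema.
SELECT title, score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
ORDER BY score DESC
LIMIT 10;
```

Pasting this into the web UI and pressing Run should return results in seconds, and the UI reports how many bytes the query processed against the free monthly terabyte.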
You will be shown, step by step, how to create your BigQuery account and be able to start querying. And do they start using the sandbox? And does that have the free terabyte per month? Exactly. So once you are in the BigQuery UI (and BigQuery makes it very, very easy in this way for people to start querying), that's on our web UI, you should see at the top of it that you are in sandbox mode. Sandbox mode means there's no billing associated, so there are no charges. But still, every month, you get a terabyte of queries. And you also get free storage space, if you want to put your own data there. So one of the things I noticed with the GH Archive data set is that we have some tables that are like six terabytes in size. So what is the effect on users then? Are they able to query these six-terabyte tables in the sandbox? Or what are some things that we could do as project owners to make them more accessible to people? Sure. So as we were talking about before, scaling up with BigQuery is really, really easy. Analyzing a 100-megabyte table is as easy as analyzing a 600-terabyte table. And then the question is, how do you make your quota last? First, query only the columns you are interested in. That's classical advice. But we also keep improving BigQuery with ways to shape how you store your data, so that people who share data can make it easier for others to not go over quota. For example, a classic way is just letting people query a sample of the data. Just make a random sample, or a selective sample of what people might be interested in. And now, in the last couple of years, we've added the capacity for BigQuery to partition data, so you can divide it by day, for example, and also to cluster data, storing the data in a clustered way. So instead of analyzing the whole table with a select, you can just analyze a fraction of it, depending on how you cluster your tables. OK. So what are some of the things that users could do to help keep costs down?
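As a rough sketch of the partitioning and clustering just described, a data owner could materialize a table like this. The table and column names here are hypothetical; the PARTITION BY / CLUSTER BY pattern is what's being described.

```sql
#standardSQL
-- Hypothetical example: store crawl data partitioned by day and
-- clustered by domain, so filtered queries scan only a fraction of it.
CREATE TABLE mydataset.pages_clustered
PARTITION BY DATE(crawl_timestamp)
CLUSTER BY domain
AS
SELECT * FROM mydataset.pages_raw;
```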
First, if you don't have a credit card, your cost will always be zero. Then, if you put a credit card in, remember to set up your cost controls. If you set up cost controls, then you will never be charged more than what you decide you want to be charged. And then you can just relax and go crazy. If anything goes wrong, there are no worries. And then, users should ask the data providers to cluster tables in the ways that they are going to use them. Let's say, for example, you have the whole web inside BigQuery, similar to some of our data sets. You might want to cluster it by domain. Then, if you run a query and you filter by the domain, or a prefix of the domain, you will only go through that fraction of the data set. So it's not necessarily true that if I say limit 10 at the end of my query, it's actually only going to cost me for 10 rows, is it? Yeah, so there are a couple of things to observe there. The classic query behavior is, if you write select star limit 10, that will go through the whole table and you will get charged for the whole table. The good news is that instead of doing select star limit 10, you can go to a table preview. There's a free table preview with no cost, so that's much better. Just look at your table instead of running the select star. But now, for example, with a clustered table, those queries are also smarter. So if you cluster your table and someone runs a select star limit 10, they will not be charged for the whole table, just for the first cluster of data, maybe 100 megabytes. But if you do a select star where something, then maybe all of your 10 rows will not be found in the first cluster, and the query will have to keep opening clusters until it finds as many rows as you asked for. Interesting. So your query could cost and process different amounts depending on exactly what the contents of the where clause are. Exactly. And BigQuery will always be able to give you an estimate of how much your query will cost.
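Sketching that domain-clustering idea with a hypothetical table name, a filtered query like this would only read the clusters that can contain the matching rows:

```sql
#standardSQL
-- On a table clustered by domain, this filter prunes to the relevant
-- clusters instead of scanning the whole table. Names are illustrative.
SELECT url, status_code
FROM `myproject.mydataset.pages_clustered`
WHERE domain = 'example.com'
LIMIT 10;
```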
It will tell you, before running the query: this query will cost you 300 gigabytes. So at least you know the upper limit, and you can decide to run your query or make a smaller one. But now, with clustering, that estimate is an upper bound. The query might tell you this query will be 300 gigabytes of analysis, but then it might end up costing only 1% or 2% of the estimate. So there are two flavors of SQL, legacy and standard. What's the difference? For those that started with BigQuery many years ago, we were using a custom SQL that was developed around Dremel and BigQuery, around the idea of, how do we analyze a lot of logs fast? But then some people were not happy about this. They would much rather have a standard SQL. And we all moved in that direction. We unified BigQuery around standard SQL as published by the standards organizations. And now the only problem is that you have to deal with both SQLs. If you have queries written five years ago, you may want to switch them to standard SQL. There's no pressure to; legacy SQL will still be supported. But everyone should choose and use standard SQL from now on. If you're using the old UI, it defaults to legacy, is that correct? Oh, yes. So again, BigQuery has a web UI. But now it has two. It has the old one, the classic one. And now we have a new one that just went GA, so it's the future for BigQuery. It's stable now. And in the new one, you always get standard SQL by default. And if you're in the old UI, you can always use that comment, standard SQL, to force it into the new mode? Oh, yeah. So when you're using either the UI or any API, if the first line of your query says #standardSQL, your query will be read as standard. And you can do the same to tag a query as legacy SQL. But the standard from now on, and what everyone should see from now on, is just standard SQL queries. Are there any new features available in the new UI? Oh, the new UI is pretty cool.
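The first-line tag mentioned here looks like this. It is just a comment on line one, so it works in the classic UI and through the API alike:

```sql
#standardSQL
-- The tag above forces standard SQL even where legacy SQL is the default.
-- (#legacySQL works the same way in the other direction.)
SELECT COUNT(*) AS stories
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story';
```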
I love the direction the team is taking this. And one of the cool integrations is just being able to click Explore in Data Studio, so you can get instant visualizations of the results of your query. Yeah, I love that one. And also BI Engine. BI Engine, that's new also. We have BigQuery, and your audience might already know Data Studio. That's our dashboarding tool, also serverless. And it's really easy to connect both. Now, if you connect Data Studio to BigQuery, then Data Studio starts running queries on BigQuery, and that might end up processing a lot of data. So we have this new layer in between. We call it BI Engine. It makes it very, very, very easy for Data Studio to run quick dashboards that get their data from BigQuery, but without running full queries against BigQuery. With BI Engine, if you set it up, it will sit approximately in the middle and act like an in-memory database. So it's super quick, super cheap, and everything runs transparently between these three layers. So are there any other product improvements or features coming to BigQuery down the road? Well, there's a lot happening. I don't know if you had a chance to play with machine learning inside BigQuery. That's really cool. You want to run a linear regression inside BigQuery? Just say, create model, this is the column that I want to predict, and use all of these other columns to predict it. You can run linear regression, logistic regression, k-means unsupervised clustering. And we keep adding new kinds of models. In the very near future, you will even be able to run TensorFlow models inside BigQuery. So if you develop and train a TensorFlow model outside of BigQuery, and now you want to use it in batch against millions or billions of rows at the same time, BigQuery is able to pull that off, too. That's great. Some really exciting things coming. Oh, yes. So what resources would you recommend for people who want to learn more about BigQuery? Well, there's a lot online.
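A minimal sketch of that create model flow, with hypothetical table and column names:

```sql
#standardSQL
-- BigQuery ML: train a linear regression that predicts label_col
-- from the other selected columns. Table and columns are hypothetical.
CREATE MODEL mydataset.my_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label_col'])
AS
SELECT feature_a, feature_b, label_col
FROM mydataset.training_data;
```

Once trained, ML.PREDICT runs the model over new rows with an ordinary SELECT.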
Of course, there are more videos in our channels. My favorite, and the one that I'm constantly adding to, is the subreddit, reddit.com/r/bigquery. So everything I learn about BigQuery, everything I find, I put it there so people can keep track of what's happening. Is that a good place to ask query questions or feature questions? Well, so we have Reddit for all of this news, all of these announcements. And we also have this awesome community on Stack Overflow. That's where I go when I have problems, but I also spend a lot of time there helping other people. And there are other communities outside of this. For example, if you have questions about the web, there are some pretty cool websites, and you know the URLs better than me. Discuss.httparchive.org is a good place to go. Yes. And we're actually going to have a video coming up soon where Felipe and I are going to be looking at the HTTP Archive and CrUX data sets on BigQuery. So stay tuned for that. Are there any other resources you would recommend? So we covered discuss.httparchive.org, Stack Overflow, Reddit, and, of course, find me on Twitter. Yes, what's your Twitter? Twitter.com/felipehoffa. All right, Felipe, this has been great. Thanks for coming on the show. Thank you. So you can find links to everything we talked about in the description below. Thanks for watching. We'll see you next time.