Can you hear me? Or, oh, this is just for recording, not for... Hello, now you can hear me. Yes, I'm Felipe Hoffa, and I'm super happy to be here. I'm so happy to be here that I'm coming back next month for FOSSASIA with a completely different talk. Today I want to talk about analyzing data with BigQuery. The first example I have is how to analyze 50 billion page views of Wikipedia. How would you do that, if you had to analyze 50 billion page views? What tools would you use? And why would you want to do that? Any answers? Are you awake? OK, at least I got a reaction.

So yes, my name is Felipe Hoffa. I've been working at Google for more than six years now. I started as a software engineer, and four years ago I became a developer advocate. A developer advocate is basically a software engineer with a license to speak. So I go all around the world giving talks and talking to people. And the most important part is not only talking to you, but also listening to you and bringing your feedback back to our teams. Feel free to interrupt me if you want to; if you don't have very long questions, I'm very happy to stop and answer them as we go.

Since we are at a startup space, I wanted to bring up the question of how you create startups with data. What are the three basic elements? I would list them as ideas, data, and technology. If we go back to 1996, everyone thought that search was solved. There were a lot of search engines on the internet. You might remember them; they were all pretty bad. Every time I had to search for something, I had to go from one to the other trying to find an answer. And everyone thought that was it: so many smart people were doing search and not finding better answers that probably there was no other way to do things. Until Google showed up, in 1998. And I remember those days, because I had to remember the website with a lot of O's, because that one was giving me the answers.

How did that happen? Everyone else at the time was using the biggest servers possible. Google instead started doing things in a different way: using small computers, a ton of small computers, and making them work together. But it wasn't only a different idea technology-wise; it was also different algorithm-wise. Looking at the internet, everyone could see the same web pages. The web is one, and everyone was able to download it. Well, not everyone was able to download the full internet, but that's what these companies were doing. They were looking at the same web pages, and they were searching within the same web pages. But Google was the first one to look not only at the content of each page, but at the connections between them. And that's where the PageRank algorithm came from. Just looking at the connections, and looking at the results in a different way, gave Google the edge.

So yes, there was a new idea: PageRank. There was data: we had to collect the web. And we had to create the technology to do so, going from those massive single-server architectures to hundreds of small servers. Google started publishing all of those ideas. And today, instead of only publishing the ideas, Google puts everything at your service. That's Google Cloud, and I loved the two stories I heard today about how Google Cloud is different from the other alternatives you have. But to come back to this: what you need is ideas, data, and technology. We're making it very easy for you to get data, and we are sharing all of our technology with you.
Now it's all about you and your ideas, and how you will make things happen in a different way. And it is possible. So yes, we started sharing our technology in 2002, when we published our first paper, the Google File System; then MapReduce, then Bigtable. Bigtable, for example, is now a cloud product that anyone can use. Then Colossus, Flume, MillWheel — Flume and MillWheel have been combined into Dataflow — et cetera, et cetera. The one I'm going to focus on today started with the Dremel paper, and the external product we call BigQuery. Who here knows BigQuery? Good, good, good, good. It should be everyone, and by the end of this talk, it will be everyone. That's the whole point.

Very quickly, what is BigQuery? BigQuery is an enterprise data warehouse that can analyze terabytes in seconds. It's simple: if you know SQL, you can use it. You don't even need to know SQL, but SQL is a pretty good place to start. It scales to petabytes; it doesn't matter how much data you have, you can put it there. If you like working with R, with Python, with Data Studio, with Tableau, whatever — they can all connect to BigQuery. And you have the ability to share data. You can have private data; it doesn't matter how private it is, it will be stored securely and privately — we have a lot of certifications. But if you want to share data with a partner, or with anyone, or with everyone here, you can do so. And everyone has a free terabyte of query quota every month to write their own queries and play with it. So if you want to open your notebook now and start playing, it's ready for you.

Just to run my first demo: how many page views does Wikipedia have in a month? Any guess? Order of magnitude? How many? 1 billion? 10 billion? 50 billion? Let me show you. I'll jump to my computer — so sorry, cameras, for coming over here. Does this one work? No. This way, I have my hands free to type code. Hello. Yes. So BigQuery has a web UI that makes life pretty simple if you just want to start writing queries over your data. Here I have stored the monthly page views for Wikipedia for many years. Let's pick, I don't know, March 2016. This table has five columns; basically: this page, with this title, at this hour, in this language, had this many requests. So it's not even a raw log — everything is summarized by hour. And it's a simple table; it just happens to have 437 gigabytes of data and 5 billion rows. That's a lot of rows. If we want to know the total number of page views, we just need to add up the requests for those 5 billion rows. You should not try this at home, but BigQuery took three seconds to process 40 gigabytes of data, and the total is 20 billion page views in this month.

Now, I could be cheating — I could have this pre-calculated. So I would love for you to give me any random word. You want to say it? Just give me any word. Which one? "Boat"? "Boat". Let's see what happens with "boat". So I'm going to run a regular expression. Any special type of boat? Let me run a regular expression just to make things harder: where the title has "bo", then anything in the middle, then "at". Maybe "boat", maybe "bo-something-at". And let's add language equals English, just to limit the results. Again, running a regular expression over 5 billion rows should not be something simple or fast to do. But it took six seconds. And then, if I want to know which pages those are, I can add the title, group by the title, order by the first column in descending order, and get the top 10.
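[On paper, those two queries look roughly like this — a minimal sketch; the table name is my own guess modeled on the demo, and the columns follow the description above:]

```sql
-- Total page views in the month: a full scan of a single column.
-- Table and column names are assumptions, not the exact demo table.
SELECT SUM(requests) AS total_views
FROM `fh-bigquery.wikipedia.pagecounts_201603`;

-- Top 10 matching pages: 'bo', anything in the middle, then 'at'.
SELECT title, SUM(requests) AS views
FROM `fh-bigquery.wikipedia.pagecounts_201603`
WHERE language = 'en'
  AND REGEXP_CONTAINS(title, r'(?i)bo.*at')
GROUP BY title
ORDER BY views DESC
LIMIT 10;
```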
And now it's your turn to guess whether the first page is going to be "boat" or something else. The first page was Brown v. Board of Education. That was a very important topic in March 2016. And if we jump to March 2014, we get a different result: there is Borat. And, as I told you, there's also an actual boat: Fresh Off the Boat. And yes, Brown v. Board of Education was also super popular in 2014, and at that time there was also the Boston Marathon. So we can start playing; we can jump to wherever we want to jump. That makes things pretty exciting: the data is available for you to query at any moment, and you don't have to care about the size or anything. That's pretty cool, no?

Now, how does this work? What's the magic behind BigQuery? First, we have all the data in distributed storage. You don't need to put the data on specific servers; it's just available at any time, ready to be loaded by computers from our distributed storage. Then many computers at the same time — as many computers as the amount of data calls for — start reading this data in parallel. They filter the data, group it, and send it to more computers further down the tree. Those computers group the data further, and then we get to Mixer 0, which outputs the final result. That's the basic idea; that's the secret behind what we just did.

So BigQuery is fast: it's pretty cool because it's capable of analyzing terabytes of data in seconds. On the other hand, BigQuery is slow, because it analyzes data in seconds — usually you want things to happen faster than that; you want your queries to run in less than a second. And for that, you should design a system; you could use a normal database, like MySQL. In this example, when we're looking up one record with an index, MySQL will be faster than BigQuery. But the more data you analyze, the more complex your queries get. When you don't have three months to design a special system that will give answers in less than a second — when you just have a lot of data and you want a reply now — BigQuery scales fast, and it's always able to give you those answers. So pay attention to how each tool is different. For queries that you need in less than a second: stay wherever you are, if you're getting them in less than a second. But if your queries are taking hours or days, if you're having meetings with other people to decide which indexes you should use and they will give you a system a week from now — just go to BigQuery and get your answers now.

How do you load data into BigQuery? Let's say you have logs, your own logs; in this case, I'm going to use Wikipedia logs. How do we load them into BigQuery? The first step is finding them. There is a web page, in this case, with a lot of them. And if you download one of these files, you can tell BigQuery to load it. It turns out that each file is a CSV, but instead of being separated with commas, it's separated with spaces; and instead of quotes, it uses no quotes. So there I told it: please put this file in this table — take this file, and these are the columns. BigQuery will ingest that file, and it will be ready for queries in a couple of minutes. That's for one file. Now, if we want years of data, that will take way longer: each file weighs a lot, and you would have to browse and download all of them. You might want to parallelize this. So for that, we have other tools, like Dataflow, which is based on Apache Beam.
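[For that single-file load, here's a sketch in today's SQL — at the time of the talk this was a bq command or the web UI, and the table, columns, and URI here are all my assumptions:]

```sql
-- Loading one space-delimited, quote-less file into a native table.
-- All names are hypothetical; the options mirror what was described.
LOAD DATA INTO mydataset.pagecounts (
  language STRING,
  title STRING,
  requests INT64,
  bytes_served INT64
)
FROM FILES (
  format = 'CSV',
  field_delimiter = ' ',  -- space-separated, not comma-separated
  quote = '',             -- the files use no quote character
  uris = ['gs://my-bucket/pagecounts-20160301-000000.gz']
);
```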
I'm not going to give a talk on Dataflow now, but just to show you a little piece of code: this is how I get one file from Wikipedia. I download it, and then I save it into Google Cloud Storage. A pretty simple function; it doesn't do much other than download and copy to Google Cloud Storage. Now, if I want to run this over every hour of every day, of every month, of every year, I want to scale this. And that I can do with Dataflow. For Dataflow, I can write code like this: create a Beam pipeline, a Dataflow pipeline, and give that function a list of all the URLs where these files can be found. Then I can just tell it: run this. Google will take care of scaling this up to many servers and downloading these files. Everything will be loaded into Cloud Storage in a few minutes, or an hour, maybe — it depends on how much data we want to download and how many servers we allow Dataflow to scale up to. And we will have our files loaded there.

And then I can start creating my BigQuery tables. In this case, I'm telling BigQuery: look at these files that I put in Cloud Storage — just look at them, and allow me to run queries over them. I don't even need to load the data into BigQuery; I just need to point at the hundreds of files I have there. In fact, I don't even need to decompress these files: BigQuery is capable of reading GZIP files. And then, in BigQuery, I have a table, which I left here; I called it views_wikipedia_gcs. Basically, this is my end result. I'm just telling it: look at everything that is in this folder in Google Cloud Storage; these are CSV files, GZIPped; I set a number of allowed bad records, in case there are some bad lines; and I picked a delimiter that won't show up, so it doesn't split the lines. And now I have one row per line. In this case, I didn't even bother to tell BigQuery the schema of my files. I can run something like SELECT * LIMIT 10, and this will look at the files stored in Cloud Storage and decompress them on the fly. Let's see how the internet is doing... yeah, it's pretty fast. And now I'm getting one row per line, which is cool, but I cannot do a lot with this yet, because I still need to parse it. For parsing, I can write a different query: a view. A view is just another table that's looking at that table. My code is here; this is how I'm parsing each line of the logs inside the query. And that gives me the ability to do something like: select star from the view — again, just 10 lines of it, just to look at it. Yeah, I didn't do much to these files other than storing them on Cloud Storage, but thanks to my view, I now have them parsed; I can see each field and I can start running my computations. And what I would really love to do now is to copy these files into BigQuery, so they're stored in our special storage format.

Yeah, so this all started with the Dremel paper — "Dremel: Interactive Analysis of Web-Scale Datasets", a paper from 2010 — and this is where we showed our basic ideas. Another basic idea, other than using a lot of servers, is storing data in columnar format. Instead of storing record by record, line by line, we store the data column by column. Then, when you run your queries, they are super fast, because we only scan the columns you ask for. So even though you can have your files outside BigQuery and BigQuery can look at them, you might want to import the files into BigQuery so everything runs faster.
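[For reference, the federated-table and parsing-view setup I just described looks roughly like this — a sketch, with hypothetical names and bucket:]

```sql
-- A federated table over the compressed files in Cloud Storage.
-- The delimiter trick keeps each log line whole, as one string field.
CREATE EXTERNAL TABLE mydataset.views_wikipedia_gcs (
  line STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/pagecounts/*.gz'],
  compression = 'GZIP',
  field_delimiter = '\t',  -- the logs are space-separated, so tabs never split them
  quote = '',
  max_bad_records = 100
);

-- A view that parses each raw line on the fly.
CREATE VIEW mydataset.views_wikipedia_parsed AS
SELECT
  parts[OFFSET(0)] AS language,
  parts[OFFSET(1)] AS title,
  CAST(parts[OFFSET(2)] AS INT64) AS requests
FROM (
  SELECT SPLIT(line, ' ') AS parts
  FROM mydataset.views_wikipedia_gcs
);
```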
And as I showed you, this is the serving tree, which has also improved a lot since 2010. In 2010, we were not able to do joins. The big limit of this architecture was Mixer 0: we could only output as much data as Mixer 0 could fit. But then we gave the tree the ability to send data back to the distributed storage; we gave it the ability to shuffle data. When you're able to shuffle, you can run massive joins, massive GROUP BYs. So don't be afraid of mixing terabyte tables together. And this is how the serving tree looks now: a dynamic tree that, of course, starts with columnar storage, but is capable of shuffling at really, really fast speeds. Take a super complex query, with joins, over 100 billion rows: BigQuery would take about 20 seconds to run it, even if it had to read four terabytes of data, run 100 billion regular expressions, and shuffle 270 gigabytes of data. That's a normal query for BigQuery, and it just took 20 seconds. If you had to do this outside BigQuery, running sequentially, limited to one computer: reading four terabytes of data from a hard drive would take 12 hours; running 100 billion regular expressions would take 27 hours, if each regular expression took one microsecond; and shuffling that data would take at least 37 minutes. So for us to be able to run all of this in 20 seconds is pretty, pretty cool.

Now, what about indexing? If you know databases, you want to index your data — and if you're using BigQuery, you can forget about that. In BigQuery, we do not index data; each query is a full column scan, and that's why it takes seconds and not milliseconds: it just looks at the full column, and that makes things fast when you want to look at the whole thing.

What about evolving schemas? As I was showing you, we have the capacity to parse full rows on the fly. So, for example, in datasets like the GitHub one — which is the topic I'm going to be talking about at FOSSASIA — you are able to store a column that has raw text as a JSON column, and parse that JSON on the fly. This is how I would do it. And if you want to do more magical stuff — if you know SQL, but you want to start doing things that SQL cannot do — we also support user-defined functions. We can run arbitrary JavaScript code inside a SQL query, and it will run on the cluster. For example, one thing I have tried is computing the Levenshtein edit distance to find similar strings. This is the code I would use. And the secret behind why JavaScript UDFs are able to run fast is that the serving tree calls V8 — the JavaScript engine that started with Chrome, but now we have it inside BigQuery — and we're able to run JavaScript code pretty, pretty fast.
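[To make those two ideas concrete, here's a sketch — the table and column names are hypothetical, and the edit-distance body is my own toy implementation, not the code from the demo slides:]

```sql
-- Parsing a JSON payload on the fly:
SELECT JSON_EXTRACT_SCALAR(payload, '$.repo.name') AS repo
FROM mydataset.github_events
LIMIT 10;

-- A JavaScript UDF running inside the query: Levenshtein edit distance.
CREATE TEMP FUNCTION levenshtein(a STRING, b STRING)
RETURNS INT64
LANGUAGE js AS """
  var d = [];
  for (var i = 0; i <= a.length; i++) d[i] = [i];
  for (var j = 0; j <= b.length; j++) d[0][j] = j;
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,       // deletion
                         d[i][j - 1] + 1,       // insertion
                         d[i - 1][j - 1] + cost // substitution
                        );
    }
  }
  return d[a.length][b.length];
""";

SELECT title, levenshtein(title, 'BigQuery') AS distance
FROM mydataset.some_titles
ORDER BY distance
LIMIT 10;
```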
Now, JavaScript code is cool, but you might want to run C code, and that makes life much harder. But this happened like a month ago: my ex-teammate Francesc, at his new company, was trying to run C code in BigQuery. And it turns out you can compile C code to WebAssembly, wasm; wasm can be run inside JavaScript; and Francesc was able to run it. And that made us all super, super happy. He was just complaining that to simply double a number — to run C code that doubles a number — took three seconds. So for him, that was slow. And my reply at 2 p.m. that day was: oh, that's super cool, but know that three seconds is a measure of latency, not of throughput. Of course, if you try to process one row, it will take three seconds; the question is how long it takes to process 10,000 rows. Seventeen minutes later, I had my answer: going over 10,000 rows didn't take 30,000 seconds — it took only 50 seconds. That shows how we scale. And one hour later, after a couple of optimizations to his query, I was able to process five billion rows in 50 seconds, running his C code inside BigQuery. And at that point, Francesc said that I was just bragging — which is what I'm doing now. But yeah, you can run that kind of magical stuff. If you try it, please share your results; I want to see where we can go.

Connecting ideas: let me go back to Wikipedia. OK, the clicker works now — thank you. So, this is Redash. Redash is open source dashboard software; there are many ways to visualize data, and I want to run this query with Redash. Here I'm looking at August 2014, looking for the Robin Williams Wikipedia page. That was the month when he died. Let's see what happened on Wikipedia during that month. Here I'm able to visualize this curve, which shows that he had around 100 page views per hour until we got the sad news. Then it jumped to 1.5 million page views during that hour on Wikipedia. And then it decayed as the news cycle moved on — but afterwards, instead of 100 page views per hour, the page was getting something like 2,000 to 5,000 page views per hour. This is sad, but at the same time it's pretty interesting to see what happens on Wikipedia when news happens. And the question I have here is: if you have a log like this, a log of page views per hour, how do you find related pages? How do you find pages related to this topic? How do we mine this with a SQL query? Any ideas? What would you do? Anyone? Just analyzing a log of page views per hour.

Well, something I can also do with BigQuery is run correlations. So here I have my query. On one hand, I have one timeline: the timeline of Robin Williams. And I can compare it with the timeline of every other page on Wikipedia during the same month, look at the correlation of those timelines, and get the ones with the top correlation. And these are my results: for Robin Williams, the top correlated pages by page views per hour were his filmography, Good Morning, Vietnam, Zelda Williams, Mork & Mindy. The results make a lot of sense, and it all just showed up: I'm able to build a graph of related content just by looking at the different timelines. So please go ahead and play with this kind of data.
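[The correlation query is, roughly, this — a sketch, with table and column names that are my assumptions modeled on the demo:]

```sql
-- One page's hourly timeline, correlated against every other page's.
WITH robin AS (
  SELECT datehour, requests
  FROM `mydataset.pageviews_201408`   -- hypothetical table
  WHERE language = 'en' AND title = 'Robin_Williams'
)
SELECT a.title, CORR(a.requests, r.requests) AS correlation
FROM `mydataset.pageviews_201408` a
JOIN robin r USING (datehour)
WHERE a.language = 'en' AND a.title != 'Robin_Williams'
GROUP BY a.title
HAVING COUNT(*) > 24 * 7   -- require at least a week of overlapping hours
ORDER BY correlation DESC
LIMIT 10;
```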
Let me get back to the presentation. Any questions before I continue? No? Okay, I will continue. These are my charts and my query. Now, let's talk a little bit about how BigQuery fits in the enterprise — some features that are closer to the enterprise world. Yes, of course, it's awesome to be agile; it's awesome to have an elastic database. It scales. It's cost-efficient. It's compatible with the tools you already use. It's secure — top security, Google security. And it's serverless and fully managed: you don't need to configure or turn on anything; it's just working for you. This is highly available analytics. You put your data here and you are able to run queries. Normally, BigQuery charges per query. That's pretty cool for people who are starting out, for people who want to use the free terabyte every month. But huge corporations have access to flat-rate pricing. We now support standard SQL. We have ODBC and JDBC connectors, DML, and Stackdriver to monitor everything that is happening, so you can audit your costs and your users. And we have strong identity and access management. This is how you would monitor your BigQuery usage with Stackdriver; it's all automated and happens for free. With identity and access management, you can have different kinds of users with different kinds of permissions, et cetera. And we have a lot of partners working with BigQuery, who can help you ingest data, do ETL, analytics, visualization, services, et cetera.

So, to wrap up this talk, I want you to remember that BigQuery is different, because you can analyze terabytes of data in seconds. You don't need any servers. It's backed by Google's very fast internal network; that's how we are able to do things at this speed. You don't need to pre-plan your queries; you don't need a committee to make decisions beforehand. You don't need data locality — the data is stored on the network. You don't need indexes. You don't need partitioning; BigQuery supports partitioning, you just don't need it to get started. It's always on and highly available — of course, sometimes it goes down, as everything does, but it's highly available. It's private, but it lets you share data. You can have unlimited users. And you can stream data in: if you have real live data that you want to bring into BigQuery, you can do it at more than 100,000 rows per second. And the big point is: how do we enable your ideas to flourish, without you having to worry about the technology behind it? And that's what I had. If you want to find me, you can find me on Reddit, on Stack Overflow, on Twitter. If you want to give me feedback, you can go to this URL. I'll be super happy to answer your questions. Thank you very much. I want to take a picture of you — you all look so pretty. Hello. Did we get it?

Hi, Kelvin. "Does that mean that, by knowing how to use SQL, I can literally go to other websites and download or extract the data from there?" So, if I understand the question: can you download data from any website? Yeah. The part that you need to do is be the data engineer: if you have a pipeline that brings data to BigQuery, you will be able to query the data once it's inside. Now, sometimes, especially with public data, you have other people doing that for you. So we have websites like the HTTP Archive, which I love. HTTP Archive — let me bring it up. If you want to look at the top million home pages on the internet, they are all loaded here in BigQuery, and you can start writing queries that analyze what the top million web pages are doing. I have this one here. We were running this experiment: just imagine you have access to the content of all of these web pages, and you want to see what the most popular JavaScript libraries are. You can do that. And if there is any particular JavaScript library that you want to find, you can also find it and count how many pages use it, as in the sketch below.
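[A sketch of that count — the snapshot table name follows the HTTP Archive convention as I remember it, so check the httparchive dataset for the current layout:]

```sql
-- Count crawled pages whose response bodies reference a given library.
SELECT COUNT(DISTINCT page) AS pages_using_it
FROM `httparchive.response_bodies.2018_01_15_desktop`
WHERE REGEXP_CONTAINS(body, r'jquery[.-]');  -- e.g., jquery.min.js
```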
For example, there is a library that lets you mine cryptocurrencies: if someone just lands on your website, it mines cryptocurrency in their browser, which might not be a nice thing. And what we did here — you can find the whole discussion — is that my friends used BigQuery to find all the web pages within this million that are mining cryptocurrency. And they found more than 1,000 websites doing this. How else would you run a query like that, if you didn't have access to the data and the ability to query it immediately? At the time, that included a Brazilian government website that was eating all my CPU; I published this, and the tweet went viral here and in Brazil. So yeah, you can play these kinds of games right now. Thanks, Kelvin.

"So, I mean, I know I can go and take a look, but is there anything built on top to help visualize the query data? If I'm a small org, I don't really have a need for petabytes. My first immediate need would be, after I analyze the data, to visualize it in different ways. The query UI feels like an engineering interface; I'm just curious how you view it today, or what you've tried to use to help with visualization." Excellent question — thank you for that question. So, how do we visualize data with BigQuery? There are many ways. For example, if I have the results of my query here, I can save them to Google Sheets and immediately start visualizing them in a sheet. There's open source software like Redash. And now we've launched Data Studio, a totally free visualization tool that can connect to more than 200 sources, where you can create your dashboards without knowing any SQL. So yes, just load Data Studio and you can start using it. It's really pretty. Let me click on a sample dashboard — in this case it's backed by Google Analytics, but it could be backed by BigQuery or any of your sources. Data Studio. Thank you.

Anyone else? Yes, please — we have a hand here. "Hi, my name is Chong Ming." Hello, Chong Ming. "I'm just curious: over the years that BigQuery has been running, have there been any instances where it crashed for some users, and what were the causes?" So, what are our high-availability stats? Yes, BigQuery has gone down, and we are pretty transparent about it. If you go to the Google Cloud status dashboard — not health care, the health status dashboard — you will be able to see how often BigQuery fails. We are pretty transparent about when it has happened and why it has happened. No system has 100% availability. We do have an SLA of ninety-nine point something — I don't remember the exact number, but it's pretty good. The whole history is there for you to analyze. Thanks.

Great — any other questions from the crowd? What I would add is the difference between running your own servers and having your servers run by Google: when your servers crash, it's up to you to work very hard all night long to bring them back up. When something crashes at Google, you have a full team of Googlers distributed around the world fixing the servers as fast as possible while you eat breakfast, which makes things pretty, pretty nice. All right. Thank you so much. Thank you very much. I'll see you in March. Yes, see you in March.