Hello, everyone, and welcome back to the State of the Web. I'm joined again by Felipe Hoffa, developer advocate for BigQuery. And today we're taking a closer look at the Web Almanac and how we can use it to study critical aspects of the state of the web. Let's get started.

All right, Felipe. So we have two web transparency tools that I'd like to run by you, to see how we can use BigQuery to learn more about the state of the web. First is HTTP Archive. How familiar are you with it? A little bit. I think you have a lot of web pages stored here? Yes, we have about 5 million web pages. We test them monthly, and we save all this metadata about them, like the requests they make and any technologies they're using: is this a WordPress site? What kind of JavaScript is it running? And we save all that information in our BigQuery dataset. So you have a table with the content of each web page? Yes, we actually save the raw HTML and CSS responses.

We do some interesting things here. In some of the reports, you can see the state of JavaScript report: high-level things like how many bytes of JavaScript the median web page loads. What people might not know is that there's this "show query" link where you can go in and actually see the query that was used to power the chart. So looking at this query, is there anything that jumps out at you as a BigQuery anti-pattern? Well, first of all, I like it. It's using standard SQL, which is our new standard, and it's using some advanced functions. But I wonder, with all these millions of web pages, how much data is this query processing? That's a good question. We're querying the summary_pages dataset, and the asterisk here makes it a wildcard query over all tables. We have a table for every monthly release, for both desktop and mobile, going back to 2010, so just do the math on how many tables that is. It could actually be a pretty expensive query to run. Yeah, I've heard some of your queries go through, like, petabytes of data? No, not petabytes. At worst, some of the response bodies tables could be terabytes, but this should be gigabytes at most. And at least this query is not going through the page content, so BigQuery will skip that column and your query is way smaller.

If we copy this query into BigQuery, we should be able to see exactly how many bytes it will process. Oh, look at this, our classic web UI. Yeah, it's my standard go-to; old habits die hard. OK, that's running the query. Now, if you want to see how much data your query will go through before running it, you can click the validator button on the right. So it's only a gigabyte. Yeah, each table might have a terabyte of data, but if you are only reading the column that stores the number of JavaScript bytes, we can go through a lot of tables using only one gigabyte of your free quota. OK, so that's manageable; you could run this query about a thousand times within the free monthly terabyte. So this is the summary_pages dataset, and its schema is pretty simple: mostly just integers and strings, with one row for every URL. Perfect. And this table for 2012, if we look at the details here, is only 26 megabytes of data. Perfect. This is not one of our huge tables.
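For reference, the kind of query behind that JavaScript report might look roughly like this. A minimal sketch, assuming the summary_pages column for JavaScript bytes is named bytesJs; the actual report query may differ:

    -- Approximate median JavaScript bytes per page, per monthly desktop crawl
    SELECT
      _TABLE_SUFFIX AS crawl,  -- e.g. '2019_05_01_desktop'
      APPROX_QUANTILES(bytesJs, 1000)[OFFSET(500)] / 1024 AS median_js_kbytes
    FROM `httparchive.summary_pages.*`
    WHERE _TABLE_SUFFIX LIKE '%desktop'  -- narrowing the wildcard cuts bytes scanned
    GROUP BY crawl
    ORDER BY crawl

Because only the bytesJs column is read, BigQuery bills for that one column across all the matched tables, not for the whole tables.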
No, but I can show you an example of one of the queries that we're hoping to write, and it does require the raw response bodies dataset. I'm curious to see what you think about that.

So, a little intro to the Web Almanac. We're trying to write an annual state-of-the-web report using HTTP Archive data. We try to answer the same kinds of questions you would find on the httparchive.org website, but combine that with expert analysis, so people from the web community who know CSS and JavaScript can say whether a trend we're seeing in the state of CSS or JavaScript is cause for concern, or whether we should really be doing things differently. For example, we have a few metrics here for CSS that we're hoping to answer for the authors of the CSS chapter, Una Kravets and Adam Argyle, two people who have been on the show before, so check out those videos. You can see that they're looking up usage of @import in a stylesheet. So we have the raw CSS, but querying it, I suspect, would be quite challenging because of the terabytes of data. What do you think? OK, show me the table. I would love to see how much actual data you have.

OK, so in response_bodies, oh, this is the one that's already running. So in response_bodies here, let's go to the latest one, 2019. Oh, you can filter everything there. Details: 6.4 terabytes. Six terabytes of data with all of the responses. How many columns does this table have? Four. So the page, the URL, and then the body, which may be huge; that's where all the data is stored. And if a response is more than two megabytes, you truncate it. Yes. Even then, you have six terabytes of data in 140 million rows. 147. 147 million, wow. So I love being able to query this kind of data. You still get fast replies, but if you want to stay under your free terabyte every month, analyzing this table will bust it. Yes. Will it even let you run the query, or will it stop you at one terabyte if you're using the free sandbox? If it lets you run the query, it will run once, and then you're out of quota until it gets replenished. But maybe it will stop you before running it. So I actually have billing enabled for this account, so we could run this query if we want. And it's, what, $5 per terabyte? So this would be about a $30 query? Yeah. For some people, that might seem like a lot. On the other hand, if you're going to analyze the raw content of 147 million responses in one query, in not seconds but maybe a minute, that's a pretty good price.
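As an aside on cost control: you can check how much a query will bill before running anything. A sketch using the bq command-line tool's dry-run mode (the table name is real; adjust the SELECT to whatever you're probing):

    # Reports the bytes that would be processed, without running the query or incurring cost
    bq query --use_legacy_sql=false --dry_run '
    SELECT page, url
    FROM `httparchive.response_bodies.2019_05_01_desktop`'

Note that a LIMIT clause would not reduce the bill; only the columns you reference matter, which is why leaving out body keeps this estimate far below the full 6.4 terabytes.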
So I guess, to get started implementing this query for finding things like @import, let's approach it by taking a subset of the response bodies for only known CSS files, and then we can create a subset table that we can query over and over while we refine our query. Does that sound good? Sure. If someone wants to analyze all the CSS responses, it's much better, instead of scanning the full table each time, to extract all the CSS into a much smaller table. OK, so let's start with that. One thing that comes to mind, if we look in the preview here, is that it's not always obvious what type of resource each row is. This looks like a JavaScript file because it ends in .js, but everyone knows the file name doesn't necessarily tell you whether it's JavaScript or CSS; it could be named anything. What really matters is the MIME type of the response, which is in a different table. So now we need to do a join of that table with the response bodies.

So let's start composing a query. I'll just put this together. Perfect. It defaults to legacy SQL. So what I usually do is... That's because you're using the legacy UI. I know, I know. I usually just check this box that says don't use legacy SQL, hide it again, click Compose Query again, and then it formats it using standard SQL. You always know whether it's standard versus legacy when you see backticks versus brackets. Yeah.

OK, so we're going to need to join this table with, I'll just format it and remove the limit, I'll do a join with the, what table is it? The requests table. Thank you, autocomplete. And we'll keep the same date, the 2019_05_01 desktop table. And we need to join it on something. In both response bodies and requests, we have a page field and a URL field. A specific URL could be loaded by multiple pages, but it's always going to have the same response type, so I think we could join on URL alone, using URL. Does that look good? So far. I would still use the page, to avoid having multiple matching replies; it makes sense, but let's use both, just in case. Play it safe. And I love that you can just say USING when the column has the same name on both sides. I love that too, yeah. The alternative, for people who might not know, is joining with ON: you'd have to alias one table as A and the other as B, and write A.page = B.page. Yeah, it's just so much easier to use USING. Well played.

Get rid of that. So what are we actually selecting here? We need the payloads, and I guess we should keep the pages and URLs. And let's say MIME type. Oh, I just had an idea. In the requests table, the payload here is a JSON object, and every single piece of metadata you can imagine about a request can be found in there. But we also have summary tables, where we extract the high-level pieces of information. Good, that's much better, because instead of processing the whole JSON, it's already pre-processed and you can go through a single column. Exactly. And as you click, it gets inserted into the query. So we will not use the raw requests table; we will use summary_requests instead, same date, the 2019_05_01 desktop table. Thank you, autocomplete; I will be using this one.

Now, does it make sense to limit the summary requests based on MIME type now? Or should we just join the tables, save that to a temporary location, and then query that separately? Well, if more people wanted to use this subset, of course I would make a complete one. But if it's only for us, and we want to see all the CSS, I would limit it now. OK, so mime type. Let's look at an example of what some MIME types might be here. This is one problem I have with the preview: we have some very long responses, and it makes it very hard to find whatever you're looking for. Sometimes the new UI makes that better, so you should check it out. You know what, let's do it in the new UI, because this is a BigQuery episode. Let's do it right. And we can switch between both. Sure. So we're looking at HTTP Archive summary_requests, and I'm going to filter again for 2019_05, desktop. So you're saying if I preview here, this would be a better UI? Sometimes. Yeah. Oh, can I not scroll? Oh yeah, OK. So I'm looking for mime type, and those are blank. So maybe this isn't a good way. We could run a quick query instead. If you click "query table" here and use the APPROX_TOP_COUNT function, you can autocomplete mime type and ask it for, let's see, the top 10. Yeah, you can read my mind. That's a fast way to just go through everything and get approximate counts. So this is what we want: text/css. Yes, sir.
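That quick MIME-type survey could look something like this. A sketch, assuming the summary_requests column is named mimeType:

    -- Approximate top 10 MIME types by request count
    SELECT APPROX_TOP_COUNT(mimeType, 10) AS top_mime_types
    FROM `httparchive.summary_requests.2019_05_01_desktop`

APPROX_TOP_COUNT returns an array of value/count pairs, so one cheap pass over a single column is enough to confirm that text/css is the value to filter on.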
All right, if you don't mind, I'm going to go back to the legacy UI for a second. For sure, let's jump between both. So we're selecting page, URL, MIME type, and payload, joined by page and URL, between response bodies and summary requests. Uh-huh. And this query, save it to a table. Yes, in a scratch space; call it css_responses. Is there anything else you can see that we would need? I see why it's not running; can you click on the red validator button? Page. Uh-huh. So summary_requests doesn't have a page column; it just calls it URL. So just USING URL. Cool. So because it's a summary table, it just gives you the URL. Just wanted to verify. Yes. OK. And we don't have a payload now; from the bodies table, you want the body, not the payload. Oh, true. OK, body it is. And now we have a six-terabyte query. That's OK, because we only incur that cost once; we will get a way smaller table out of this one. Nice. How long do you think that should take? Six terabytes. How long would it take without BigQuery? By hand? On your laptop? Oh, I don't know; it's a MacBook Pro, so... a while. Here, a couple of minutes, yes.

So while the query is executing, we can see this chart of how many parallel workers are reading the underlying tables. We have something like 8,000 different jobs, and half of them are active; some have completed their work, and you can see them going up. OK, so this blue means it's getting closer to finding the result we're looking for? Exactly. But it's only done when nothing is active or pending anymore, right? Yeah. Basically, we have a mix of active and pending, and then there will be a point when there are no more pending or active jobs. OK, oh, we just jumped. Yes. Running a join like this is a little crazy; we're moving a lot of data from two tables through the network, so, yeah, it can get wild. You can also see here where things are happening, where we are spending most of the time. For now, it's this hash stage: joining two giant tables. Is that hashing the USING field, or what is that hash? It's not a hash that you see; it's used for the shuffle, so BigQuery can find which data in this table matches the data in the other table and output it. So yeah, there is a hash happening here, and a lot of data moving around the network.
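To recap what we just kicked off, the extraction query looks roughly like this. A sketch, with a hypothetical scratch dataset name, and assuming the mimeType column name from before:

    -- One-time 6 TB scan: extract only CSS responses into a small reusable table
    CREATE TABLE `yourproject.scratch.css_responses` AS  -- 'yourproject.scratch' is hypothetical
    SELECT page, url, body
    FROM `httparchive.response_bodies.2019_05_01_desktop`
    JOIN `httparchive.summary_requests.2019_05_01_desktop`
    USING (url)  -- summary_requests has no page column, so URL is the join key
    WHERE mimeType = 'text/css';

From the resulting table, the Almanac's @import question becomes a cheap scan, something like SELECT COUNTIF(REGEXP_CONTAINS(body, r'@import')) over css_responses. Note that joining on URL alone can duplicate a body when the same stylesheet URL appears in many requests; the postscript at the end of this transcript sketches one way to tighten that.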
So, as we wait for this to finish: I wanted to also show you a Data Studio dashboard for the Chrome UX Report. Yep. Perfect, let's talk about the Chrome UX Report for a second then. This is a different dataset? It's a different dataset, but it's also hosted on BigQuery, and we have these different visualizations of the data that people may be familiar with from PageSpeed Insights, but this one here is the Data Studio connector. So we can enter something like developers.google.com and connect, allow it. So you just enter any web page on the internet and you are able to see... This is by origin. So the difference being, a web page could have something like /product/123 as a product ID, but this is origin-level, which is only the protocol, like HTTP or HTTPS, any subdomains, and then the domain, like example.com, and we create the report. So when we view this dashboard, the first thing people should see is a bunch of months. We go back from July 2018 to the most recent dataset in April, and we see that fast FCP is 45%. What that means is that 45% of first contentful paints on this origin are considered fast, meaning less than one second.

So all of this data is available in BigQuery as well, and, a quick shout-out to BigQuery, we can also look at the chrome-ux-report project, which I have down here, and the datasets are available in monthly tables, like 2019_04. Let's just look at a preview for a second. Each month is grouped by effective connection type, like 4G, and form factor, like phone, plus histograms, and histograms are repeated records, so they're a little tricky to query. One thing that makes this a lot easier is the materialized dataset, where you can look at the metrics summary, which just says: this origin, for this month, has a fast FCP of whatever percent. I love when you give people a summarized table so they don't have to query the whole thing. Exactly. But people still have access to the raw data if they want it.

So this looks like it's a partitioned table. Good. It has a weird date, so we don't want today's date; we want the April 1st dataset. Fast FCP and origin, formatted so we can see them. So this should be 40-something percent. Yep, OK. So this query is exactly what the Data Studio dashboard is doing under the hood. Pretty cool. So there's a site I'm interested in. I know that Reddit is going through this whole redesign, and what if I want to see the percentage of fast pages they have over time? So is that https://www.reddit.com? Yeah, most probably. So we could remove the WHERE date and group by date instead. Cool. Yes. Fast FCP, so here's an interesting point: we don't actually need to group by date, because each month is already an individual row; but ORDER BY date is still needed. And now we can see what percentage of page loads were fast. So in 2017, 63% of loads from Reddit were fast.
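For the record, here is roughly the query behind that Reddit trend. A sketch, assuming the materialized table's columns are named date, origin, and fast_fcp:

    -- Percent of fast first contentful paints for one origin, month over month
    SELECT date, fast_fcp
    FROM `chrome-ux-report.materialized.metrics_summary`
    WHERE origin = 'https://www.reddit.com'
    ORDER BY date

Because the materialized table has one row per origin per month, no GROUP BY is needed; ORDER BY alone lines the months up for charting.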
So what would you recommend for visualizing the data? We could export to Sheets; the new UI has integration with Data Studio. What do you feel like? Let's do this in the new UI. OK, so we'll try it in the new UI. Paste the query in, run it; it's already in your cache. Now I can explore in Data Studio. This would be good as a time series. A time series might look nice. OK, and we need to make some changes here. So let's roll that up. Instead of looking at record count as the metric, we want to see fast FCP as the metric, and you can remove record count. And now, this is monthly data; do you know the secret? I think so, but tell us anyway. I usually come here to edit the data source and tell Data Studio that this is not daily data but monthly data. Let's see if that works. Aha. I did not know that.

And what we can see here, which is really interesting for the people developing the new Reddit site, is that Reddit is getting slower. So I want to caution you against jumping to that conclusion. Is the website getting slower, or are more slow users using the website? That's an interesting question. It sounds philosophical, but also keep in mind that some websites have different origins depending on their traffic. Maybe mobile traffic on an m-dot subdomain has now been merged into the www origin, so maybe more mobile users are being counted here now. So we cannot quickly jump to the conclusion that Reddit got slower, but we can say the percentage of fast experiences has gone down over time. Yes, from 60% to 40%. Why, we don't know, but there is a degradation.

Well, what you were showing me earlier on developers.google.com was fairly stable. But yeah, this is interesting, and it would be really useful for Reddit to know why things are getting slower. I want to show you one thing that I always do, which is set the y-axis minimum to zero, because when you have a chart like this, it looks like it bottoms out, when it's really 40%, which is still a fairly high percentage. The other thing I would do is change these decimals to percentages. And I think what you did earlier was good: you go in here to the data source and say this is not a number, it's a percent, and now it shows up as 60%, 62%. So this looks good to me. Beautiful. It's less dramatic now, but it's still going down. Nice, I like your chart. Cool, thank you.

So the other queries are still running. Yes, somehow our join turned into a very big network job. So I think we should let those go, but is there anything else that comes to mind with BigQuery that would be good to demo here? Yes, just as a note, what I would suggest is that, if this is important, we should create these tables once, pre-joined, for everyone to use, so they don't need to go through this. And I'm wondering whether we have a suboptimal join here and could have optimized this query beforehand, but for now, let's think of other questions we can ask.

So for the Web Almanac, we are currently in this data analysis phase, and pretty soon, by November, we're going to release the full report. For anybody who's interested in contributing to the Web Almanac, you can go to almanac.httparchive.org, and there's a button there that links to our GitHub repository. If you follow it, you'll be able to see all the different ways you can help out. We have subject matter experts writing about the state of things like JavaScript and CSS, and people peer-reviewing each chapter: other experts from the field ensuring that the results found by the subject matter experts are not only technologically sound, but that the writing style fits the Almanac as a whole. Data analysts are doing these queries, web developers and designers are building the website, and it's even being translated into the top 10 languages. So we're really excited about the Web Almanac. Check it out: almanac.httparchive.org.

Felipe, anything else that comes to mind before we go? I would love to optimize this query, so if we don't have time to do it in this video, I will send you the optimized version of it. And why don't we do this: we have another URL, discuss.httparchive.org, where we share queries about the HTTP Archive dataset, so we could post this query there when we're done. How does that sound? Perfect. We'll have a link so people can find the fast way of running this query. Exactly. Excellent.

All right, so thanks a lot for watching. Check the links in the description. See you next time.
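(A postscript for readers: Felipe's optimized version isn't shown in the episode, but one common way to speed up and de-duplicate a join like this is to shrink the join's key list before joining. A guess at what that might look like, using the same assumed column names as above:)

    -- Deduplicate the CSS URLs first, so each response body matches at most once
    WITH css_urls AS (
      SELECT DISTINCT url
      FROM `httparchive.summary_requests.2019_05_01_desktop`
      WHERE mimeType = 'text/css'
    )
    SELECT page, url, body
    FROM `httparchive.response_bodies.2019_05_01_desktop`
    JOIN css_urls USING (url)

Check discuss.httparchive.org for the version the team planned to post.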