Hi, everyone, and welcome back for another episode of Ask Chrome, where you get your questions about the web answered by members of the Chrome team. I'm Katie Hempenius, a developer programs engineer focused on web performance, and co-author of the Page Weight and Resource Hints chapters of the Web Almanac.

And I'm Rick Viscomi, a developer programs engineer working on web transparency projects like the Chrome UX Report and HTTP Archive. I'm also the creator of the Web Almanac and author of the Performance chapter. In this episode of Ask Chrome, we're answering your questions about the Web Almanac project and the state of the web. Let's go to our first question. But actually, before we do that, why don't we start with: what is the Web Almanac?

So the Web Almanac is kind of like an e-book about the state of the web. We have thousands, if not millions, of data points in the HTTP Archive, and one of the challenges has been surfacing that to people. Anyone who was intrepid enough to mine the data in BigQuery could find all these interesting stats, but exposing them in a consumable format has been a challenge. So the goal of the Web Almanac was to bring together the raw data, stats, and trends of the HTTP Archive with the expertise of the web community. We've got 20 chapters with, I think, 27 different authors, and each chapter focuses on one specific aspect of the web: for example, CSS, JavaScript, CMS, e-commerce, page weight, resource hints, HTTP/2. So lots of different things about the state of the web. And we hope to make this an annual thing, so each year we can track how the web is progressing.

Our first question comes from Shane Jones on Twitter. They ask: this time next year, where do you expect the Almanac results to be in regards to performance and site speed? I feel 2019 has had a big push on this, and 2020 may bring even more.

Yeah, this is a great question. I'm a glass-half-full kind of person, so I feel like the web will be getting better and faster. But the web is made up of millions of websites, so overcoming that inertia is going to take a lot of effort. I am optimistic, though. We're making really great progress on the awareness front: we have tools like PageSpeed Insights and the Speed report in Search Console that are making developers aware of how fast their websites are. I think we're going to see a lot of improvement over the next few years; we're going to be moving that needle in the positive direction.

And personally, I would say performance is something people are always going to talk about. In the 70s, people were complaining about the performance of their text editors, so it's always going to be an issue. But when we typically talk about performance, we're usually talking about the performance of static sites, and we already have everything we need to build fast static sites; we just need to do it. So yes, hopefully we'll see better performance in the coming year.
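As an aside for readers who want to see where their own site stands: here's a minimal sketch of calling the PageSpeed Insights v5 API to check which real-user First Contentful Paint bucket a page falls into. The field names reflect my reading of the public API, and example.com is just a placeholder; this is illustrative, not an official snippet from the show.

```ts
// A minimal sketch (not from the episode) of querying the PageSpeed Insights
// v5 API for a page's real-user First Contentful Paint bucket.
// Assumes a runtime with a global fetch (modern browsers or Node 18+).
const PSI_ENDPOINT = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed';

async function fcpBucket(url: string): Promise<string> {
  const res = await fetch(`${PSI_ENDPOINT}?url=${encodeURIComponent(url)}&strategy=mobile`);
  const data = await res.json();
  // loadingExperience carries field data from the Chrome UX Report;
  // category is one of "FAST", "AVERAGE", or "SLOW".
  return data?.loadingExperience?.metrics?.FIRST_CONTENTFUL_PAINT_MS?.category ?? 'UNKNOWN';
}

fcpBucket('https://example.com').then((bucket) => console.log(`FCP bucket: ${bucket}`));
```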
Next, a viewer asked: do you intend to scan pages other than home pages?

Yes. So this is one of the biggest limitations of the HTTP Archive dataset, which forms the foundation of the Web Almanac. We get our URLs from the Chrome UX Report, a dataset of real user experiences, and there are about five or six million websites in there. When we were planning to expand from the 500,000 websites previously in the HTTP Archive, we had to decide whether to go for the breadth of websites or the depth of websites. We chose breadth, so we're now crawling 5 million websites. We also had to choose which pages of those websites to crawl, and by default we just went with the root, or home page. This is a known limitation that we're trying to fix, because the home page is not necessarily representative of the entire website.

To fix it, we first need to address some of our resource limitations. The code is about ten years old and can't handle many more websites, so we're going to have to scale it up, and it's maintained by community contributors. So it's about getting the infrastructure to scale up, and also determining how we drill down into the depth of each website. Choosing those secondary pages is not as straightforward as you might think: how do you determine which pages are the next most important after the home page? Is it the biggest link on the page? Is it the most linked page on the site? These are quite a lot more ephemeral, right, than a root domain. Right. And there's also the challenge of pages that are authenticated, or that depend on certain search queries to return the right data. So it's a challenge, and it's something we're hoping to solve in the next couple of years.

Next, a viewer asks: in the media chapter, in the bytes-per-pixel numbers for JPEG, PNG, GIF, WebP, and SVG, does this take transfer compression encoding into account? I wouldn't expect any difference for most of these formats, but I'm much less sure about this for SVG.

Yeah, this one has a short answer: yes. All of the page weight and resource byte metrics come from the transfer size metric, which does include compression over the network.
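For readers who want to see that compression effect on their own pages: here's a minimal sketch using the browser's Resource Timing API, which exposes both the compressed and decoded sizes of each resource. Run it (compiled to plain JavaScript) in a page's DevTools console.

```ts
// A minimal sketch (not from the episode): compare compressed and decoded
// sizes for image resources on the current page via Resource Timing.
// Note: cross-origin resources report 0 for these fields unless the server
// sends a Timing-Allow-Origin header.
const entries = performance.getEntriesByType('resource') as PerformanceResourceTiming[];

for (const entry of entries.filter((e) => e.initiatorType === 'img')) {
  // transferSize: total bytes over the network, response headers included;
  // encodedBodySize: the body as received (e.g. gzip/Brotli-compressed);
  // decodedBodySize: the body after decompression.
  console.log(`${entry.name}: ${entry.encodedBodySize} B on the wire, ` +
              `${entry.decodedBodySize} B decoded ` +
              `(${entry.transferSize} B total transfer)`);
}
```

For already-compressed formats like JPEG the encoded and decoded sizes are usually identical, which is why the question singles out text-based SVG, where gzip makes a real difference.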
In response to stats in the CSS chapter, Simon Peters asked: what is using the q unit? Is it Japanese sites, or some common third-party CSS?

So this is another great question, because it gets to the heart of data mining. Once you uncover an interesting piece of information, you're always asking: why? Why is it that way? And you can find yourself going down a rabbit hole trying to answer every little question. So I don't have an answer to this one, but I will say that if people have questions about the dataset, they can ask on the HTTP Archive discussion forum at discuss.httparchive.org, and some of the data analysts who hang out there will be able to answer. Or you can look at the data yourself; the dataset is publicly accessible. And one of the coolest things, in my opinion, about the Almanac project is that all of the queries that were used are posted on the GitHub repo. So you can go in, look at the queries that the analysts used, and then start tweaking and investigating yourself.
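For readers who want to follow that thread: here's a minimal sketch of running a query against the public HTTP Archive dataset from Node, using the @google-cloud/bigquery client. The table and column names below are illustrative assumptions; the exact queries behind each chapter's stats live in the Web Almanac's GitHub repo.

```ts
// A minimal sketch (not from the episode) of querying the public HTTP Archive
// dataset in BigQuery. Table/column names are illustrative; see the Web
// Almanac repo for the real queries behind each chapter.
import {BigQuery} from '@google-cloud/bigquery';

async function medianPageWeight(): Promise<void> {
  const bigquery = new BigQuery(); // uses your local gcloud credentials/project
  const query = `
    SELECT APPROX_QUANTILES(bytesTotal, 100)[OFFSET(50)] AS median_bytes
    FROM \`httparchive.summary_pages.2019_07_01_mobile\`  -- example table
  `;
  const [rows] = await bigquery.query({query});
  console.log('Median page weight (bytes):', rows[0].median_bytes);
}

medianPageWeight();
```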
OK, moving on to our final question. Caleb on Twitter asks: what are the top three things viewers might consider doing to help their organization, using data from the Almanac?

Yeah, this is another good question. The first thing that comes to mind is using the distribution information about the web to see how your individual website compares. For example, if you see that 15% of websites have fast First Contentful Paint experiences, and PageSpeed Insights is telling you that you're in that bucket, then you know where you stand in relation to the rest of the web. Or if it says that you're in the slow bucket, then you know how much more you have to do to catch up with everyone else.

Another thing that I hope Almanac readers take away is simply awareness about the state of the web. One of the most shared stats from the Web Almanac is that 85% of websites include the jQuery library, which surprises some people in both directions: either it seems too high or too low. For me personally, it was eye-opening to see how people react to that stat. Some people are like, I love jQuery, of course it's going to be everywhere. Whereas in some of the circles I communicate with most, they're like, no, I'm always writing bare-bones vanilla JavaScript. So it's a great way to start that discussion.

Going on to the third thing: the accessibility chapter in particular was really sobering for me, because of the metrics about how accessibly websites are being built. The numbers don't look good; a lot of websites aren't built with accessibility in mind, which is really sad to see, because we need to make websites that are accessible and usable for everyone, and the fewer barriers in place, the better.

Next up, I want to bring up a comment that we received about the Web Almanac: it would be more interesting to see this normalized by traffic. If we want to see the real state of the web, traffic matters more than the number of sites. I think that's a good point. Do you have any comments on that?

Yeah, it's a valid point that some parts of the web are visited more than others. If you look at the top 5 million, it's not an even distribution; a site like Google.com will get more visits than ricksshoewarehouse.com. So I do think weighting makes sense for certain metrics. For some questions we have about the web, like how often jQuery is used, I think it makes sense for the denominator to be the number of websites. But for others, it's valid. For example, the performance chapter looked at the state of user experiences, and we would say something like 15% of websites are fast. It would be really nice to be able to say 60% of user experiences on 4G networks are slow, something like that. The only reason we don't do that is because we don't have the data. Right now, the Chrome User Experience Report is not popularity-weighted, and we don't have ranking information either. You might remember that the Alexa dataset of the top 1 million actually ranked websites: Google number 1, whoever was number 2, and so on. With that information, at least, you could weight the data in some way to give more importance to some websites than others. But without that, we have to do the best we can.

That makes sense. So here's another question: how is the HTTP Archive constructed? What's the methodology that you use?

Yeah, the simple answer is that it runs on WebPageTest, the backbone of all of our testing infrastructure, which was created and is maintained by Patrick Meenan. To test all 5 million of these web pages in sync on WebPageTest, we have a custom-built, roughly ten-year-old PHP server that controls all of the test agents, and they run from multiple locations in the United States, I think Washington and California. We run both desktop and mobile; mobile is emulated, not an actual device, due to resource limitations. We also run a few different tools on these WebPageTest agents. The first is Lighthouse, a tool from Google that audits websites for things like performance, accessibility, and SEO. We only run it on the mobile tests, also for resource-consumption reasons.

The other tool that we use that's really helpful is Wappalyzer. This is a tool that detects over 1,000 different technologies, like WordPress or jQuery, and we're able to group these into categories like JS frameworks or CMSs. That enables us to get really interesting breakdowns of insights; for example, the CMS and e-commerce chapters are pretty much entirely based on these categories of websites. One other thing that we used was Patrick Hulce's third-party-web library. With it, we were able to group together third-party domains and identify them, so we can tell which percent of page weight, for example, comes from third parties. Not only that, but he also broke them down by type of third party, whether they serve video or are for analytics or social, giving us more granularity for even better insights.

What would you say makes you most excited about the Web Almanac project?

The thing that excites me the most is the fact that it's an annual report, so we can see how the web is evolving over time, better track things that may be regressions, and catch them and try to stop them from happening. The other thing that really excites me is the fact that we had 85 different contributors to the Web Almanac, both from within Google and external contributors from all around the web community. That's really encouraging for me, because it's an entirely volunteer project, so it was great to see people taking time out of their day and contributing for the good of the web. And also the fact that people come from all of these different backgrounds and areas of expertise. We had, for example, people with accessibility and SEO experience, and not only were they able to contribute to the chapters, but they were also able to help make sure our own website is SEO-friendly and accessible. It was a nice little scheme to get help with the website. But it's also about eating our own dog food: if we're going to preach best practices, we have to actually deliver a website that follows them.

Yeah. And I think what excites me about this project is that it makes the data in the HTTP Archive more accessible, because there are so many interesting insights to be had in there. But it can be intimidating for people to get started with, or you might not even realize that it's out there. So I'm hoping this encourages more people to start looking into the dataset, so they can better understand what the state of the web is.

All right, that does it for this episode of Ask Chrome. A huge thank you to everyone who submitted a question; we couldn't have done this without you. If you found this discussion interesting, you might want to check out Rick's other show, State of the Web, on the Chrome Developers YouTube channel. You can find links to the channel and to previous episodes in the description below. Also, check out the Web Almanac at almanac.httparchive.org. Thank you for watching, and we'll see you next time.