Hi everybody. Thanks for coming. You are in Room 606 for enterprise content inventories. If this is the wrong room, please exit the plane immediately. My name is Greg Dunlap. I am a senior digital strategist at Lullabot, and I've been in the Drupal community for about 12 years and doing software engineering for 25, so I'm old. In my spare time I really like pinball and cats, and you can find me on Twitter at GregD Dunlap. Lullabot is a strategy, design, and development company. We do high-profile websites for large-scale publishers. We spend a lot of time with our clients working to understand goals, identify problems, and really dig into the whys of the projects we work on. Not just what do you need, but why are you asking? Why are you looking for the thing you need done? In addition to development and architecture, we also do content and digital strategy, which is the department I'm in, and design and user experience. I started getting into this field after 25 years of engineering because I was really interested in looking into big, large-scale problems and getting them solved before the coding starts.

Along those lines, last year we won a project for the State of Georgia to transform their Drupal platform, which ran all of their state agencies: to migrate it to Drupal 8 and to do a lot of content strategy work, because their existing Drupal platform was about seven years old. It was running on Drupal 7, and the world has changed since then. It's not very responsive, it had a lot of problems with data availability and a lot of real old-school content authoring experience issues, and they really wanted to focus on making it much easier for their state's citizens to access the data and information they need. So our team at Lullabot came in and did the strategy, some design and UX work, and the build-out and migration.

When we started approaching this project, we were looking at 85 state agency websites across a gigantic network. They were all individual Drupal installations, and we needed to get an idea of what we were getting into. We needed to know what kind of content was out there. What are the commonalities between agencies? Out of all of these agencies, what are the needs they all have to solve, and what are the needs that are unique to each property? What needs to change in terms of their content to meet the new vision of the platform they're trying to create? What do we have to fix up and adjust? And where are the landmines hidden? Where are the places where people have coded an entire new CMS in the body field? Where are the places where they're using 18 embedded tables with spacer GIFs to create page layouts? We needed to figure all of this out.

So that meant it was time for a content inventory, and this was going to be a content inventory on a scale we had never really approached before. We knew that in addition to analyzing the individual sites on their own, we would want to aggregate them to look at the content holistically across the network and see what kind of problems there were system-wide. Usually we would do this pretty simply in Drupal by taking a site, dumping the menu table, and looking at the list of URLs. At the scale we were approaching this, that wasn't really going to be an option.
We needed new tools, and we needed them to be really repeatable and really functional, because we knew we probably weren't going to get everything we wanted the first time. We knew we were going to have to run this repeatedly to get things out of it later that we weren't predicting earlier. We needed computers to do what they're good at, doing the same thing over and over, so that we could get to the real work. So we started researching tools to get this done, and we ended up with a really nice tool set to move forward and do this kind of work in the future. That's what I'm going to talk about, along with some tips we learned along the way.

So what do we need from a content inventory? As I said, we usually do content inventories either by assembling URLs by hand or pulling them out of Drupal. This time we thought it might be more useful to look into web spiders: tools that would automatically and recursively crawl all of the sites and collect the information we wanted. We also thought about building that tool ourselves, or building tools to do content analysis and manipulation and so on. Thankfully we didn't need to, but we did think about it. The other thing we really wanted was a native application, not a cloud solution, because most of the cloud solutions charge on a per-URL basis, and we also wanted more control. If we wanted to set up a set of scripts that run overnight to grab the content, analyze it, and slice and dice it in a bunch of different ways, we wouldn't be able to do that with a cloud system. We wouldn't have the freedom or the access to manage it that way. We wanted to pay once, and we wanted more control.

As we were looking at that, we were trying to think: what do we really need in the end? We need a list of all of the content, obviously. We need a way to run analysis on the individual content, and to run it on specific slices of the data. For instance, if we wanted to run it for one site, we could do that; if we wanted to run it for all of the elected official sites, we could do that. We wanted to be able to arbitrarily decide which slices of data we were going to work with, and to specify internal or external criteria for doing that. Internal criteria, for instance: which of the sites use a content type called Basic Page? Or external: which of the sites fall into a vertical the state has defined, like elected official sites? And we wanted to be able to extract information that would be useful to pass on to our dev team for migration, if possible, so that they would have a head start on understanding what kind of content they were dealing with as they got into the migration. So there are roughly three steps there.

The first is generating the inventory. The first thing we did was look at a lot of tools to spider websites, and this is the tool we ended up with. It's called Screaming Frog. It's spidering software with a super weird name, built by an SEO company from the UK. At its core, it does what you'd expect: you give it a URL, and it goes and follows all the links until there aren't any more, and it gives you a report of what it found. It's highly configurable, it's capable of handling extremely large data sets, and it reports a lot of really interesting information.
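To make that "follow all the links until there aren't any more" idea concrete, here is a toy sketch of what a spider does at its core. This is purely illustrative and not how Screaming Frog actually works; the choice of Python with requests and BeautifulSoup, the starting URL, and the handful of columns collected are all my own assumptions.

```python
# Toy spider: start at one URL, follow same-domain links until there are
# none left (or we hit a cap), and record one row of data per URL.
import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    queue, seen, rows = deque([start_url]), {start_url}, []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        rows.append({
            "url": url,
            "status": resp.status_code,
            "title": soup.title.string.strip() if soup.title and soup.title.string else "",
            "h1_count": len(soup.find_all("h1")),
            "html_bytes": len(resp.content),
        })
        # Queue every same-domain link we haven't seen yet.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return rows

if __name__ == "__main__":
    rows = crawl("https://example.georgia.gov/")  # hypothetical starting URL
    if rows:
        with open("inventory.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```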
Here's a screenshot of just some of the stuff you get for every individual URL it spiders. You can see there's information like the H1, the meta description, H2s, canonical links, text-to-HTML ratio, all sorts of different things that are really cool. It has a lot of features that are very SEO focused. For instance, it has the lengths of all of the H1s and H2s and so on, which is really important for things like generating Google previews. And it has all sorts of data for external links, so you can get a list of all the external places you're linking to in addition to internal, only the HTML, only the images. There's a lot of data and a lot of ways to break it up, which is really cool.

You also get cool charts. This one shows all the types of data we retrieved: HTML, JavaScript, CSS, images, PDFs, et cetera. You can click on each one and get a filtered view of that data list, and all of that kind of thing. And then one other really cool thing is that it integrates with Google Analytics. For every URL that you grab, it will also pull the analytics data for that individual URL, and you can define which analytics data it does or doesn't pull: visits, unique visits, bounce rate, goal completions, all sorts of things. Anything in Analytics that can be defined for an individual URL can be pulled in along with the inventory. That's obviously a very powerful capability when you're looking through your content and trying to determine what is important and what pages are popular, and tying that back to the data within the pages.

All of that is really cool, but there was one thing that really struck us when we were looking into it, and that was its ability to extract data out of the HTML of the pages using XPath. XPath is a query language that allows you to focus in on an individual portion of an HTML page, like, for instance, the H1, or the class of the H1, or the count of the number of H1s on the page. There's a surprising amount of information hidden in the HTML of most standard Drupal themes, and being able to pull it out was really compelling. If you look at this, this is the body tag for a common Drupal theme. Out of this, you can determine the node ID, the content type, and the context for the page you're viewing. Other things you can get out of it are the fields and blocks on a page, the field types, the views, all of that kind of stuff. And when you're doing an inventory, being able to aggregate this kind of architectural data is really interesting. We had actually planned to have to grab these URLs and tie them back to the individual databases in order to get a lot of this information, and we ended up not having to. You can't get everything out of it, but it got us pretty far, and it allowed us to do a lot of things that were really helpful for the migration down the road.

For instance, one of the things we were able to do was detect potentially problematic content, places where the body field had been misused: things with embeds, too many embeds, embedded JavaScript within the body field, inline styles, tables, all of that kind of stuff.
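As a rough illustration of the kind of XPath extraction I'm describing, the sketch below pulls the node ID and content type out of the body classes and flags body-field misuse. We did this with Screaming Frog's custom extraction rather than a script; here I'm using Python and lxml just to show the idea, and the class names (page-node-*, node-type-*) and the field-name-body selector follow common Drupal 7 theme conventions, so treat them as assumptions that will vary from theme to theme.

```python
# Sketch of XPath-based extraction of Drupal architectural data from a page.
import re
import requests
from lxml import html

def inspect(url):
    doc = html.fromstring(requests.get(url, timeout=10).content)

    # Many Drupal 7 themes expose the node ID and content type in body classes.
    body_class = " ".join(doc.xpath("//body/@class"))
    node_id = re.search(r"page-node-(\d+)", body_class)
    node_type = re.search(r"node-type-([\w-]+)", body_class)

    # Flag body-field misuse: inline styles, embedded scripts, layout tables.
    flags = {}
    body_field = doc.xpath("//div[contains(@class, 'field-name-body')]")
    if body_field:
        field = body_field[0]
        flags = {
            "inline_styles": len(field.xpath(".//*[@style]")),
            "embedded_scripts": len(field.xpath(".//script")),
            "tables": len(field.xpath(".//table")),
            "iframes": len(field.xpath(".//iframe")),
        }

    return {
        "url": url,
        "node_id": node_id.group(1) if node_id else None,
        "content_type": node_type.group(1) if node_type else None,
        **flags,
    }

print(inspect("https://example.georgia.gov/some-page"))  # hypothetical URL
```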
Being able to grab those things in an automated way across a network of 85 sites and hand them off to the dev team really started to give them an idea of the problematic content they were going to have to deal with and where they might have to focus their effort. We only really scratched the surface of what we could do with this, but there are a lot of really cool possibilities. It's not perfect and it's not cheap. It's about $250 for a license, and it eats up a lot of RAM. It's really resource intensive, but I would set it off at night and go to bed and have a set of data waiting for me in the morning, which was really cool. So that was the first element of our process.

The second was content analysis. We had originally planned to write a tool that would pass content on to public web APIs to do things like readability scoring, accessibility scanning, and that kind of thing. But we ended up not having to, again, because we found a tool called URL Profiler. What URL Profiler does is take a CSV of URLs and run a wide variety of analyses on them. This is also a tool built by the SEO industry, so a lot of it is SEO focused, and a lot of the integrations are paid services, but several of them are free. The readability analysis by itself is very powerful and gives you a lot of information. You can get sentence, header, and paragraph counts. You can get the approximate amount of time a page would take to read, the top 10 words used on a page, sentiment analysis (is the language on the page generally positive or negative?), and a lot of reading level scores. There are different algorithms for determining, basically, the grade level and complexity of the text on a page. It's smart enough to strip away non-content elements, so you can give it a tag to read between, basically the content section of a page, and it won't analyze all the text in the menus and the sidebar blocks and so on. And it has Screaming Frog integration built in, so you can upload a Screaming Frog spreadsheet and get back a spreadsheet with this analysis combined with the Screaming Frog data, all in one place, which is really cool. It's a subscription-based tool, but it is native, it's not that expensive, and it works for up to a million URLs a month, which is about as much as we would ever need, so from a value perspective it's really good. There may come a day when we want to write our own tool to make this more pluggable and integrate with more things, but right now it's really amazing.
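For a sense of how those reading-level scores work, here is a minimal sketch of a Flesch-Kincaid grade calculation, one of the standard readability formulas. This is not URL Profiler's implementation; the naive syllable counter is just an approximation for illustration, and the sample text is made up.

```python
# Naive Flesch-Kincaid grade level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
import re

def count_syllables(word):
    # Count vowel groups as syllables; drop a common silent trailing 'e'.
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

sample = ("Georgia state agencies publish forms, meeting minutes, "
          "and service information for citizens across the state.")
print(round(flesch_kincaid_grade(sample), 1))
```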
After running this gigantic inventory and running the URLs through all of these tools for analysis, what do we do? (Just keeping an eye on the time.) We had this massive CSV with 100,000 URLs in it, and we were able to run some aggregate queries by bringing it into a SQL database, but we really wanted to be able to set up different kinds of analysis on different slices of the data, and we wanted to generate those chunks pretty much automatically. Excel, while it would do this, was really hard to work with on data sets this big. We really needed to break things down. What we found was a really nerdy command line tool called GoCSV. It's basically a very, very fast command line tool for querying and manipulating CSV files, and it does most of the things you would be able to do with a SQL database.

It has a lot of different commands, and the output of one command can be piped into more commands, so you can put the whole thing into a shell script and really string operations together. Here's a list of all of the things you can do with GoCSV, but the biggest one for us was a command called filter. What it does is let you filter on a column based on any value or set of values. So, for instance, we could say, I want to filter out all of the document files, all of the PDFs, and you'd be able to do that, and then use the result as the basis for a new file. Then if you further wanted to say, now out of that set I want to filter down based on the individual domain it came from, you could do that. You can run these strings of things together. As long as you know the possible values you want, you can script it all out and generate these subsets of data, which was really interesting. We could filter on content type, on response code. You could write shell scripts to pipe filters into filters into filters, and we did exactly that.

For any analysis we ran for the State of Georgia, we had a set of shell scripts that ended up with this group of data: all of the URLs by content type, by domain, by response code, and by individual vertical, because the state had defined seven verticals for elected officials, law enforcement, environmental, et cetera. Then the docs and no-docs directories: docs is only the PDFs and document files, and no-docs is everything except the PDFs and document files, and those we could break up by domain and by vertical and by content type as well. Having the ability to script that out and run it automatically was really, really great. It saved us so much time and let us dive into this data in ways we probably wouldn't have been able to otherwise. All we had to do was run our script, and we'd have all of our slices again. That was really cool.

So we've defined a bunch of tools which are really good for running inventories and generating the data we need. What are we looking for when we look at inventories? We've got this gigantic set of data, and it's like, where do you even begin? I'll say that to some extent this is more art than science, and I have a tendency to just dive in and see what grabs my eye. It also really depends on the domain you're working in and the problems you're trying to solve, but there are a couple of places where I generally start.

The first thing I'll do is examine the edges. I'll take that big spreadsheet, start sorting by different properties, and see what kind of content lives at the extremes. For instance, Screaming Frog reports the size of the HTML at every URL, and seeing what lives at the highs and lows of that can be really interesting. Some of it's really obvious: gigantic tables increase the bloat on a page. But sometimes you find really interesting things. As an example, out of all of the 200,000 URLs we scanned across the State of Georgia sites, this was the one with the largest page size. Does anybody want to guess why? [Audience guess.] But that's not the HTML; that should just be an image tag, because this is only the HTML of the page.
Somebody base64 encoded the image and pasted it into the HTML in the WYSIWYG. Why anybody did that is anybody's guess, but it was interesting that somebody felt the need to do it, and it told us a lot that doesn't necessarily have to do with the CMS. It told us that the users are very crafty, and that when they want to do something, they are going to do it, and knowing that was really useful going forward as we started to think about authoring tools and about what we did and didn't want to allow people to do.

In addition to the large page sizes, the small page sizes are also interesting, the ones that have almost no content on them. For instance, we found that there were a lot of pages that were nothing but a link to a PDF of meeting minutes, with no other content on the page at all. Obviously there are a lot of reasons, sunshine rules and so on, that those things need to be made public, but there's clearly a need for a better document management and aggregation solution there, based on the sheer number of these things we saw. There was a need that wasn't being met. So going to the extremes, while hilarious and a great source of conference stories, is also really, really useful. There's a lot of interesting stuff you can find there.

Another thing I like to do, especially on networks of sites like this, is look at things in aggregate and in slices. My colleague Jeff Eaton likes to call sites like this archipelago sites: they're basically groups of islands collected under a single name, and some information about them is more important in aggregate, while some is more important at the level of the individual islands. You shouldn't ignore either, and looking at them through different lenses can be interesting.

Here's an example of that. When we first ran our analysis in Screaming Frog, we realized that, shocker, government has lots of PDFs. In fact, they had more PDFs than actual HTML pages. One of the reasons we wanted to strip out the documents when we ran things through URL Profiler is that URL Profiler doesn't retrieve anything for the PDFs, and we were just burning 100,000 URLs in the tool for no reason at all, so we wanted to get rid of them. That in and of itself is really interesting information, but we were also interested in how those PDFs were being used and distributed across the network. So we were able to generate this slide, which shows the ratio of PDFs to HTML pages on every individual site. On the left are sites where there are more than double the number of PDFs to HTML pages, the middle is where they're about even, and the right is where there are more than double the number of HTML pages to PDFs. What this told us was that the use cases for PDFs were, for the most part, concentrated in one group of agencies. That meant this wasn't necessarily a problem that needed to be solved network-wide; we could concentrate on the use cases of those individual agencies. For some of them, we realized as we looked that there was nothing to be done about it. It's no surprise that the Department of Revenue has a lot of PDFs; they've got a lot of forms that have to be printed out, and so on.
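We did the docs/no-docs slicing and the PDF-to-HTML ratio with GoCSV filter commands chained together in shell scripts; the sketch below shows the same logic in Python just so it's visible in one place. The column names are assumed to match Screaming Frog's export headers ("Address", "Content Type"), so treat those, and the file name, as assumptions to check against your own export.

```python
# Slice a spider export into docs vs. no-docs, then compute each site's
# PDF-to-HTML ratio to find the PDF-heavy agencies.
import csv
from collections import defaultdict
from urllib.parse import urlparse

DOC_TYPES = ("application/pdf", "application/msword")

with open("inventory.csv", newline="") as fh:
    rows = list(csv.DictReader(fh))

docs = [r for r in rows if r["Content Type"].startswith(DOC_TYPES)]
no_docs = [r for r in rows if not r["Content Type"].startswith(DOC_TYPES)]
print(f"{len(docs)} document URLs, {len(no_docs)} other URLs")

# Count PDFs and HTML pages per domain.
by_domain = defaultdict(lambda: {"pdf": 0, "html": 0})
for r in rows:
    domain = urlparse(r["Address"]).netloc
    if r["Content Type"].startswith("application/pdf"):
        by_domain[domain]["pdf"] += 1
    elif r["Content Type"].startswith("text/html"):
        by_domain[domain]["html"] += 1

# Rank sites by ratio, most PDF-heavy first.
for domain, c in sorted(by_domain.items(),
                        key=lambda kv: kv[1]["pdf"] / max(kv[1]["html"], 1),
                        reverse=True):
    ratio = c["pdf"] / max(c["html"], 1)
    print(f"{domain}: {c['pdf']} PDFs / {c['html']} HTML pages (ratio {ratio:.1f})")
```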
But some of them were completely ridiculous, and that gave the State of Georgia the ability to go to those agencies during the migrations and content audits and say, hey, you have a real PDF problem, we need to figure out how to solve it and what to do about it. Both of those things were interesting for different reasons, and looking at data in aggregate and also in slices is really important.

Beyond that, some other tips. Grab more, not less. Initially we tried to limit the amount of data Screaming Frog would grab. For instance, at first we told it not to grab images, but over time we found uses for that data, and then we had to run the whole thing again to get it in addition to all the rest. So my way of working right now is that when I do an audit for a site, I grab everything. You can filter out the stuff you don't need, but you can't filter out what you don't have, so grabbing everything is better. And grab it all at once, same principle. We had originally thought, let's run this set of agencies, and then we'll come back and run this set of agencies, but when you start putting all of these pieces together, you're better off with one great big pile that you turn into the separate things you want, rather than taking separate things and trying to bring them together when you need them. We had to accept that this was going to be an iterative process, which is one of the reasons we wanted to make it so repeatable. We knew we weren't going to know everything we needed to know upfront, so we wanted to be able to come back and run everything again later. And then, like I was talking about, consider migration. Think about the things you can pull out of the data as you're running it that your dev team could use for migration purposes down the road.

What's next for us? We're thinking about a lot of ways we can take engineering or DevOps practices and bring them into the content strategy and UX realm. One thing we've been thinking about a lot is integrating real content with Sketch. Our design team is always designing in Sketch, but we're always trying to figure out, again, the extremes of the content: how does this design work when the content is eight pages long? How does it work when the content is very short? Et cetera. Sketch and Figma have the ability to do plugins for bringing that kind of content in, so that's something we're playing around with right now. Similarly, I saw a talk at a conference about a year ago where a guy wrote a tool that would take an HTML shell of a layout, take real content, and generate the two together, so again, you could see how pieces of content lay out together. I thought that was really interesting. And whatever else we can dream up, because we're crazy nerds. And I think that's it for me. I have five minutes left if anybody has any questions. Sure, can you come up to the microphone, please?

[Audience question: you're pulling all this data out using XPath, so did you use that for the migration, or did you go into all the hundreds of individual databases?] We did use it to some extent. We didn't grab a lot of detailed information at first, because this was kind of an experiment for us, so when they did the migration, it gave them a broad look at what the problems were.
But when it came down to individual pieces of content, we did end up going into the databases. What they ended up doing on the dev side was aggregating all of the databases together and running something similar using SQL to ferret out all of the data they needed, like where all of the garbage was. I think that going forward, now that we've got one of these under our belts, we can do a lot more with that than we did this time. But this time, that's how it ended up working out.

[Audience question about redirects.] What was that? You mean in terms of how we're handling them in the migration, or how the scanner handles them? Screaming Frog does identify redirects, so we were able to figure out which pages were redirecting. To some extent that was useful because it told us, for instance, where they had set up redirects for marketing purposes and things like that, and we were able to identify those to make sure they got migrated. But most of the work of saying, after the migration, what are we going to do to manage redirects from old data to new data, fell into the hands of the dev teams. We didn't really deal with that very much. Yes?

[Audience question about multilingual content.] No, thank God. There is some Spanish content on the site. To the best of my knowledge, URL Profiler doesn't analyze it. I was actually in a Slack the other day where somebody was looking for a content analysis tool that was multilingual, and there are language-specific ones, but not one that's generalized across languages that I know of. So no, we didn't have to deal with that at all. Yeah?

[Audience question about URL aliases and canonical URLs.] What Screaming Frog does is use the canonical tag in the metadata to identify the canonical URL for everything. Then, when it creates the giant list, you can tell it, I only care about the canonical URL, I don't care about the aliases. So you can either say, I want a cataloging of all of the redirects and all of the final places they redirect to, or you can tell it, I don't want that information at all. Depending on your use case, you can have it in there or not. For me, I only really cared about the canonical list of information, so that's how I used it. Yes?

[Audience question about scanning sites from inside a firewalled network.] URL Profiler does reach out to externally hosted services, but I believe it just uses port 80 for everything. So if you can get from inside your network to the outside world, URL Profiler will work. Now, Screaming Frog, which is the one that reaches in to scan the sites, may be a different matter depending on how your sites are set up. Another thing we ran into is that if you're behind a service like CloudFront, it will often identify what we're doing as a DDoS attempt, and that's no good for anybody, obviously. So we sometimes had to mitigate that by either getting on the VPN and running it from inside the network, or by using special domain names that bypass CloudFront and things like that. Yeah, that's a possibility too. Screaming Frog does have its own identifier as a browser, so you can add that to the firewall, and you can also change it in case you need it to match something that's already in the whitelist.

[Audience question: it's important to bring content over because you want to keep it, but do you have anything to share about the opposite, having the opportunity to remove and cull content for the health of the site?] Oh yeah, that absolutely happened.
So one thing I haven't even talked about here, and this will be the last question: after we broke all of this data out, we gave spreadsheets to the state's digital services team. We worked with the digital services group for the state as a whole; they were our client. We gave them an individual spreadsheet for every site that they could use for content audits with the owners of that site. So as they were doing the migration, they went to the individual sites and worked with them using this spreadsheet, which had all of the data for every URL, plus the analytics data and the reading time and all of that, and used it as criteria with the agencies to go through a ROT analysis, or a keep, kill, combine kind of exercise: this is the content we're keeping, this is the content we're throwing away, and this is the stuff we're merging into one thing. But that was really done on an agency-by-agency basis, because the individual needs of the agencies were so different that coming up with broad rules about what should and shouldn't be kept didn't feel appropriate to us. So yeah, cool. Thanks everybody.