Awesome. So thanks everyone for tuning in. My name is Amanda. I'm excited to share some work that I led and then developed with my colleagues at the Harvard Business Review. Our data and operations are essentially closed source, unfortunately, but for this project we used a lot of the learnings, tools, and tactics from the open source world, so I'm excited to share that.

Some background about me: I have a degree in public relations, basically, which will probably make less sense to you the more I talk. That's all to say my data career started after my formal education ended, which is why I'm really thankful for all the open source resources, open data, and research material that really let you learn on the job. I ended up working in public health research for a while at the NIH, then pivoted to industry doing survey and product research for public health campaigns, then TV and film, and now digital publishing at HBR. I'd love to know more about y'all in the chat or Slack after. I think if you've ever related to Hackjob Panda, this talk might be up your alley, because it's our DIY solution to something that may have other solutions, but this one is ours. And maybe you're here because you're familiar with HBR, in which case that's awesome; I'll have some HBR data trivia for you that you won't find anywhere else.

If you're not familiar with HBR, briefly: we are a print and digital outlet dedicated to improving the practice of management through really expert research content. We have a paywall and subscription business model, we release a print magazine six times a year, we have a pretty big online presence at hbr.org and on most of the social media platforms, and we have several email newsletters and a growing podcast network. Today I'm mostly talking about our articles, in the magazine and on the website. You've probably seen an article, but when I say article, I'm talking about all of these components you see, including the author bio, which is usually at the end. The blue highlighted pieces are things we have readily available in our web analytics platforms, which is what I primarily work with, but things like the summary and the actual body live elsewhere, and we would have to join those in separately.

And here's a snippet of the limited view you get with web data. You're mostly working with the headline and events oriented around the headline: did they view the page? Did they view a paywall while on that page? Did they start a subscription in a visit that included that article? That can be kind of limiting depending on what kind of question you're trying to answer. And as it happens, we started getting more questions from our colleagues, who came to us as a web data team, where it started to feel like we needed to know more about the content of the article, to bring that into the fold, to better answer questions like how people are navigating. The headline doesn't really explain why someone may have left the page, or tell the whole story, like whether they clicked an article with a similar title immediately after. And the content is your bread and butter, so it made sense to want to bring that into the fold. So I was like, great. Let's just go figure out where the raw article content lives, invite it to the party, and then we'll get our join on.
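Just to make that concrete, here's a rough sketch, with completely made-up URLs and field names (our real schemas and platforms differ), of the kind of join between headline-level web events and article content that we couldn't do yet:

    # Hypothetical sketch of joining headline-level web analytics events
    # to article content. Field names and values are invented.
    import pandas as pd

    # Roughly what the web analytics view gives you: events around a headline.
    events = pd.DataFrame([
        {"url": "/2020/01/example-article", "event": "page_view"},
        {"url": "/2020/01/example-article", "event": "paywall_view"},
        {"url": "/2020/01/example-article", "event": "subscribe_start"},
    ])

    # The summary and body live elsewhere and have to be brought in.
    content = pd.DataFrame([
        {"url": "/2020/01/example-article",
         "summary": "A short editor-written summary.",
         "body": "The full article text goes here."},
    ])

    # The join that lets you ask content-aware questions of web data.
    enriched = events.merge(content, on="url", how="left")
    print(enriched[["event", "summary"]])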
As it turned out, it wasn't that simple with our infrastructure and the available resources. In theory, we had access to everything we needed to make that pipeline happen, but the time and resources weren't going to be allocated on a timeline that worked for us. So our team was basically given the blessing to do it ourselves. Other teams like engineering would try to fill in some gaps, but there wasn't going to be a dedicated sprint or commitment to making this data available in a repository in a way that made professional data engineering sense, if that makes sense. So our team ended up resorting to a lot of open source tools, practices, and tutorials to package this full article data and make it useful for us and for others.

Until that point, the ways we knew how to get our article content were: scraping our own site, which is not optimal for a lot of reasons; the web analytics platform, which doesn't have the full content; and then, in our world, every article is also a product, because we can sell a print of it, so it also lives in this product information system, which is just a really gross place: no API, a kludgy UI that you could scrape, but it really wasn't worth a lifetime of effort doing that. So we had to go back to basics and ask ourselves, what even is an article? What is the purest form of the article that we can start from? As it turns out, it was XML; a helpful developer helped us crack that.

I think a lot of people here are probably familiar with XML from working with government documents or whatever. Essentially, it's a lot like HTML: you have tags that have a semantic meaning, like title or author bio. And then you have the schema document, which is an XSD file, which is like the codebook for all of those entries, where you might specify that the title should be a string and not a date, and maybe you want to impose a character limit, and you can spell that out in the schema document.

So we figured that out. Behind the scenes, an editor writes in a text editor, that becomes XML, and then it goes on parallel tracks to our print publishing platform, where it becomes the magazine, and to our content management system for the website. That was great. I planned to do some cleaning and wrangling to make the data tabular: each article would become a row, and all of these XML fields would become columns.

But I also started thinking about the whole data life cycle of our article, because we're going to want to use this more downstream. We've covered this content management silo on the left, which is mostly XML. But once the article goes out into the world, in print, at the store, on the internet, there's a lot that goes on to measure it and then analyze it. There are all kinds of browser data constructs like cookies involved, and analysts are working with different analytics software and maybe doing stuff in R or Python or Excel, so that balloons the formats. And then the end goal, ideally, is that we're taking these insights and making new content or features. Those things probably sit on top of the article, so they never get baked in or hard coded, and the platforms we use for those kinds of features are often serving parts of the article, maybe reformulated or targeted, using JSON.
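To give a flavor of that, here's a toy sketch, using invented tag names rather than our real schema, of an article as XML and of the flattening step where each article becomes a row and each field becomes a column:

    # Toy sketch: an article as XML, flattened into one row.
    # The tag names here are invented for illustration; our real schema
    # (and its XSD constraints, like "title is a string with a max length")
    # is internal.
    import xml.etree.ElementTree as ET

    raw = """
    <article>
      <title>How to Manage Anything</title>
      <summary>A short editor-written summary.</summary>
      <body>The full article text goes here.</body>
      <authorBio>Pat Example is a professor of management.</authorBio>
    </article>
    """

    root = ET.fromstring(raw)
    # One article becomes one row: each XML field becomes a column.
    row = {child.tag: (child.text or "").strip() for child in root}
    print(row["title"])  # How to Manage Anything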
So we have this XML, and we knew we had this end goal of wanting to make more data driven decisions using more of the content than we could before, and we had to bridge that gap. In another universe, or at another company, there's more or less a straight line you could draw that's proper data engineering. But again, that's not the timeline we were in. So we were just trying to come up with something that was reproducible, debuggable, and could be made widely accessible.

In our world, my solution was making a data package that did the work of turning the XML into a table, which people are familiar with. I did a lot of oral history and ethnography to figure out what all the fields meant and which were generated by a platform versus a human, and I built that into the documentation and offered a lot of export formats for people using different tools. And then there was a phase that I call social elbow grease, which is socializing the data and getting it in front of people. Because if you build it, they won't come; you have to do some stuff. In our case, that stuff was having a datathon with our team and then taking that show on the road. Our team calls it a roadshow any time we're presenting, either at a meeting we host or when we show up at other teams' meetings and say, look what we did. That helped get more advocacy and enthusiasm around the data.

To give a sense of the scale of what we were working with, we ended up with an archive that I'm comparing here to the New York Times annotated corpus, which is a common benchmark for text analysis. I should say, our full archive actually goes back to the 1920s, but we haven't digitized all of that, so this mostly reflects what is online. Generally, we cover more time, we have fewer articles, our articles are longer and more complete with summaries and tags, and our vocabulary is a lot smaller than the New York Times', which makes sense because we're more of a niche outlet and not covering world news.

At this point, I had done a lot of data janitor work, and I was honestly kind of exhausted and never wanted to look at the data again, because it took a lot of detective work to figure out what was going on. That's how I knew it was time to bring in the team and get some fresh eyes on it. And so that brought us to our datathon, which used a really compact format compared to what you might be used to; there are full day and full weekend hackathons, but our goal was just to dedicate more time than we usually do to dig deeply into the data, do some QA and work out any kinks that maybe I missed in the data cleaning step, and then maybe make something interesting and try to tackle those questions I showed earlier. And it was great, since people could use whatever tools they wanted: Excel, Python, R. In fact, we had people doing all of that.

One of the interesting products out of this format came from my colleague Abby. She made a Shiny app that was basically Google Trends for our content: you could compare terms or ngrams in the article text and/or the author bio at the bottom and see how those frequencies changed over time. So we got to look at fun, kind of trivial things, like the rise of Uber as a brand in our content versus our use of uber as an adjective, and at how our contributors changed over time in terms of how we describe them.
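Abby's app was built in R and Shiny; as a rough illustration of the underlying idea only, here's a minimal Python sketch of term frequency over time, with a made-up articles list standing in for our archive:

    # Minimal sketch of the "Google Trends for our content" idea.
    # The articles below are invented; the real app ran over our archive
    # and also searched author bios and ngrams, not just single terms.
    import re
    from collections import Counter

    articles = [
        {"year": 2008, "body": "Taxis and town cars still dominate business travel."},
        {"year": 2016, "body": "Uber and other ride platforms are changing expense policies."},
        {"year": 2019, "body": "Uber went public, raising new governance questions."},
    ]

    def term_frequency_by_year(articles, term):
        """Share of words in each year's text that match the term."""
        trend = {}
        for a in articles:
            words = re.findall(r"[a-z]+", a["body"].lower())
            counts = Counter(words)
            bucket = trend.setdefault(a["year"], [0, 0])
            bucket[0] += counts[term.lower()]
            bucket[1] += len(words)
        return {year: hits / total for year, (hits, total) in trend.items()}

    print(term_frequency_by_year(articles, "uber"))
    # {2008: 0.0, 2016: 0.111..., 2019: 0.142...}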
So here it kind of looks like there's a convergence, a balancing, of the professor author: in the past we had fewer people identifying as professors and maybe more independent authors, but now that's converged.

One of the more immediate and interesting use cases came after we presented at one of our roadshows, where Abby's tool really ginned up a lot of excitement. Someone came up to us, our localization product manager, and he was like, hey, I'm working on Spanish machine translation for articles, and I'm finding there are some problematic titles where I have to override what the machine translation says. In these cases, "sucking up" can become kind of vulgar if you let the Spanish translation do its thing automatically, and the word "cool" can end up meaning refrigerated or interesting in different contexts. He was having to override a lot of these and was just catching them as he noticed them, but with a tool like Abby's, he could get ahead of that. And in fact, "cool" and "suck" make up a real percentage of our articles, and "sucking", I guess, is maybe more prevalent in some of our recent stuff. Although that's a little misleading, because 2020 isn't over yet, so that's not a full denominator. But I don't think you need me to tell you that sucking is on the rise this year compared to last year.

What I will tell you is what we learned: open source tools and practices, concepts like a hackathon, committing to documentation, and even the tutorials we used to build the package up to this point helped us rescue what was almost closed data within our own enterprise. We have these other tools at our disposal, but not really. And making space for people to make custom tools allowed us to bring more people to the data table. That was awesome, because I think it brought in people who weren't necessarily data people, or for whom data wasn't a big part of their role on paper, but it got them interested and able to self-serve some questions that would otherwise end up in our backlog, where maybe we wouldn't get to them in a time period that was useful for them.

Another thing I love about being able to use R and tools like that: there are enterprise analytics tools, but those cost money and there are license seats, and I've noticed some people don't even nominate themselves for access to those tools because they think it's not in their purview. But if you're using something free that you can spin up and just send a link to, you eliminate a lot of that friction, and I think it opens doors for people. And we found that this data can be useful for onboarding and just engaging people with what your product, your bread and butter, is. For us, that was articles. We don't expect people to read our whole archive when they start, and that's not a useful way to get to know the product, but a tool like Abby's, again, is a great way to get a sense for how things have changed over time.

And I do want to acknowledge that we weren't totally dead in the water. We got this far because we had experienced developers and helpful, excited colleagues who were able to fill in a lot of gaps in our organization. So I do want to thank those people, specifically my colleagues at HBR, and also the csv,conf organizers and everyone presenting and listening today.
I have an appendix to the slides, which you can see at the link, if you're interested in how we ran our datathon both in person and remotely, and some of the links that led us here. So with that, I'm happy to answer any questions, and thanks everyone for your time.

Yeah, thank you. We had a lot of smiles and laughter and comments on your "sucking is on the rise this year" comment. I think it may be the quote of the conference. Everything seems to be sucking a little more this year.

It's true.

Yeah, so one question that came up in the chat was about the code. I think you said the code is closed. Is that right? Or is there a plan to open it up eventually?

I could probably do something with dummy data, or we have a limited feed that's basically the same XML; you can get 10 or 100 articles a day through that feed. And then the rest of it, the process, could be open sourced. I haven't prepared that, but I could do that. It was kind of well documented elsewhere, and maybe I didn't put all those links in the appendix, but yeah, the process can be open sourced.

Yeah, there was also a comment that HBR is really known for clear, concise case studies, and it's awesome that you just delivered a very clear, concise case study.

Oh, thank you. That didn't even register for me. I think we have a trademark on the case study, so I didn't put it in the title. But yeah, thank you very much.