Good. So as I said before, my name is Peter Leonard, and I run the new Digital Humanities Lab inside Sterling Memorial Library, which is maybe the biggest and most famous of the buildings that make up Yale University Library. What I want to talk about today is the notion of text and data mining on licensed collections. This is interesting to us in the library for a couple of reasons, some of which we were talking about just before the session began, and I want to articulate why text and data mining on collections whose access is extraordinarily complicated is an interesting opportunity for librarians and for libraries. Despite the challenges of working on licensed material, we think there might be some really interesting ways for libraries, and the people who staff them, to be genuinely helpful to scholarly work on these kinds of collections.

I'll say quickly that in the Digital Humanities Lab at Yale University Library, we're trying to help people answer humanities questions. We really keep the focus on humanistic inquiry. That distinction is helpful, especially when we're talking about text mining, because there's an extraordinary amount of expertise (let me turn off the auto-advance here) within disciplines such as computational linguistics and corpus linguistics that is very similar to some of the techniques and tools people use in literary text mining. What differentiates the work we do in the Digital Humanities Lab, and in other digital humanities centers I see around the country, is that we're trying to keep our focus on humanistic questions. You might be running the same software on a corpus as a corpus linguist, but if you're working with an English professor, you're probably going to be asking and answering very different types of questions.

I should also say that our interest in working with copyrighted and licensed material in a text and data mining context actually originates at CNI, in a presentation given in 2012 by Joel Herndon and Molly Tamarkin, who at that point were at Duke University. Their talk was "What to Do with All of Those Hard Drives: Data Mining at Duke." I wasn't in the room when it was given, but because CNI puts the PowerPoints of these presentations online, I was able to go through and look at what was happening at Duke, and I was really inspired by their interest in taking some of those extra hard drives and turning them into a corpus for text and data mining within their social sciences and humanities support center.

That was 2012. What I want to do in today's talk, in 2016, is take us through a roadmap with two paths that I'm going to bounce between. On the left, I want to talk about the contexts for text and data mining: the context of the library's digital strategy, the context of our collections, collection development, digital scholarship, vendor platforms, and technical architecture. On the right is a different path, which will interleave with the things on the left, and that has to do with a very particular corpus we decided to begin with, which is Vogue magazine.
I'm going to talk about what restrictions are present on the Vogue digital archive, what helped us overcome those restrictions, our plan, the tools we used, some of the text mining, some of the metadata, some of the image mining and visual culture computation, some of the lessons we learned, and our strategy going forward.

To begin with, I want to talk about the context of our digital strategy. How does TDM on vendor material fit into our digital strategy? The thing that motivates us in the Digital Humanities Lab at Yale is helping scholars engage with digitally actionable research objects. A digitally actionable research object just means any type of material that's ready, prepared, or amenable for quantitative or algorithmic approaches. And as you can imagine, with a lab that has humanities in its title, we see this as complementary to, but not in any way something that supplants, the tradition of close reading that exists on most of our campuses. We're here to help folks who self-identify as being interested in quantitative and algorithmic approaches, but we're not under any illusion that this is somehow the next stage, or that it takes something away from French or Italian or English. My own background is in literature, so it's really important to me to think about this as complementary.

So the point is, how do we get to digitally actionable research objects, things that actually are amenable to TDM? In a collections context, I'm talking today about licensed electronic resources, but it's important to point out that the notion of digitally actionable research objects is present throughout all of our collections. It's something we think about in terms of our physical collections, almost all of which are library-owned. It's also something we think about in terms of our electronic resources, including those we produce ourselves, things we scan in-house, either because we have a big digital project or because somebody comes to us and says, could you scan all of these manuscripts? We also have, of course, a small fraction of things in the electronic resources space that we truly buy, like a CD-ROM, but most of our electronic resources are licensed.

And that takes me to our collection development context. It will come as no surprise to anybody in this room who works in a large research library that the amount of money we're spending on licensed electronic resources is growing every year, and it is really quite striking. At our library, over 66% of our non-rare-book spend is now on licensed electronic resources. Obviously Yale is a library with a historic strength in physical collections; we aren't primarily a science library, though we do have a medical library and a science library, and there's no gainsaying the 14 million books we have on the shelves. Even despite that, we're spending a lot of money on licensed electronic resources, including a lot in the humanities. And despite all this money going toward licensed electronic resources, and despite the undeniable value they provide to our clientele, I think it's also undeniable that they are, in some ways, far less usable than books sitting on a shelf. There are all sorts of arbitrary restrictions on the use or download of this digital material, especially once it's frozen in a vendor's browsing or searching system.
The interfaces in which these products are embedded are often of variable quality. I think we've all had the experience of licensing a collection, seeing what the search and browse interface looks like, and finding that it's really kind of a B-minus product. But you can't do anything about it, because the vendor controls the engine. Especially in the case of e-books, we see enormous instability in the resources: we get 20,000 e-books, and then the vendor takes away 5,000 and adds 3,000. Just keeping track of those changes and making sure they're expressed in our OPAC is really challenging, and it makes it hard to think of some of these e-book vendors as stable sources I can always rely on. The items themselves, whether they're databases or e-books or anything else, are often not findable. A lot of that is our own fault; it's tough to integrate these things with our catalog. And of course you can't loan licensed electronic resources. There are some beginnings of a move toward facilitating that, especially in the e-book space, but in general we're paying a lot of money for things that are, in a certain sense, less usable.

This takes us to the context of digital scholarship on campus. I think it's easy to claim that there's a sense of moral imperative for libraries and librarians to tackle these problems. As LIBER, the Association of European Research Libraries, has put it, the right to read has to include the right to mine. If you abstract from that, what you're really talking about is making sure that scholars control their own encounter with source material. When a faculty member comes to Sterling Library, takes a nineteenth-century novel off the shelf, and checks it out, the publisher does not follow that faculty member out the doors of Sterling, peek over his or her shoulder, and say: you've read this chapter four times, you should really move on, you can't read this chapter any more times today. Or: you're looking for certain words, but I'm not going to allow you to look for two words at the same time, because my interface doesn't support that. Yet we accept those kinds of limitations on our licensed electronic resources because we don't have a choice.

There are also, of course, enormous technical challenges to meeting this moral imperative to facilitate access to licensed electronic resources. Dealing with the raw data itself brings all sorts of storage challenges; making sure it's backed up is a preservation challenge; making sure the data doesn't leak is a protection problem; and helping scholars make sense of terabytes of information is an analysis and visualization problem. But despite all that, I think there are some interesting opportunities for librarians and for digital scholarship and digital humanities centers in this combination of a moral imperative to fix the problem and significant technical challenges to doing so. There are real opportunities not only for digital scholarship centers, but also for those centers and labs to build and improve working relationships with subject liaisons, who are often the people deciding to license these resources and committing part of their collection development budget toward these archives.
So with that, I want to turn to an actual real-world example of us trying to tackle some of these problems, and the archive I'm going to talk about today is Vogue. Just as a question: does anybody know when Vogue, American Vogue, was first published? Does anybody want to guess what year Vogue started? Vogue started in 1892, and it is a large archive. For many years, Vogue was actually published out of Greenwich, Connecticut, so for us at Yale it's kind of a local story. There have been something like 2,798 issues of Vogue over that time. For a long time, Vogue was published every week; then, when Condé Nast bought it, it became biweekly, and now it's the magazine we all read today as a monthly. All of those issues add up to about 400,000 pages. That's a lot of material. Yale does have two or three full runs of Vogue magazine, one in the arts library and one in our storage warehouse, so you could, if you wanted, check out every single issue of Vogue; we've had faculty who have done this. It's many cubic feet. If you were to scan all this material, which we did not, but ProQuest did, you'd end up with about six terabytes of information.

And that information isn't just one thing, it's many things. It's text: there are articles that have been OCRed. There are images that have been scanned. There are captions to those images, which seek to describe what's going on in them, whether photography or illustration. Interestingly, as you go back in time those captions become more semantically meaningful, because the further back you go, the more primitive the printing reproduction technology was, and the captions had to be more descriptive. So the captions themselves are a very interesting dataset. There is, of course, world-class photography: covers by Annie Leibovitz, illustrations designed by Salvador Dalí for the cover. All of this points to the applicability of this fashion magazine in a lot of interesting disciplines. There's a great gender studies story to tell with 124 years of a women's fashion magazine. There's an art history story to tell, with articles by Nochlin and others. There's a cultural studies story. And that's why we fixated upon Vogue: we thought it might be a compelling corpus to bring to light through methods that went beyond what the vendor could provide us.

Now, despite all these advantages, there are of course significant problems that impede our use of Vogue. One of them is that it's hilariously under copyright. It's still a going concern: you can still buy Vogue on the newsstand, you can get a subscription, people make a lot of money from Vogue. When we pulled the Library of Congress records, we discovered that Condé Nast had extended copyright backwards in time all the way to before 1923. So if you had any thought of using fair use or the 1923 cutoff date as a way to do text mining on this archive, you'd be disappointed, and I think if we do this with other big magazines we may find the same thing: the copyright has been extended all the way back. It's also under license. Although the original items in Vogue may be copyright Condé Nast, there's been significant work by ProQuest, which created this product, the Vogue digital archive, that goes above and beyond what Condé Nast provided. They scanned the whole thing. They did OCR on Vogue. They created metadata: who wrote this article, where does the article jump to, does the article jump from page 70 to page 120?
That's expressed in the XML, so we have semantic units of articles. They also did article segmentation, drawing boxes on each magazine page and capturing whether something was an ad, or a third of a page of an ad, or two articles on the same page. So there's great value that ProQuest has added that, frankly, we would never have been able to add ourselves. If we had scanned our own copies of Vogue, we would never have done all this article segmentation; it would have taken a really long time. And all of that is under copyright too. Then, just when you think it couldn't get any more complicated, there's New York Times v. Tasini, which is of course the famous lawsuit by freelance journalists saying: New York Times, you can't take my articles, put them into LexisNexis, and make money monetizing the back catalog. That applies to freelancers generally, and it applies to many of the writers who wrote in Vogue who weren't on salary. Fascinatingly, it also applies to people like Annie Leibovitz. Annie Leibovitz is not on salary at Vogue; she takes the cover photo as a contractor. The copyright in that cover belongs to Vogue, but you can't just take that image, print it on a t-shirt, and think you're okay. So these are all the problems we had to face.

What actually helped us overcome those problems? If you take one thing away from today's talk, well, a couple of things, but one of them is very important: the notion of a perpetual access license. Yale Library signed a perpetual access license with ProQuest, which put us in a good negotiating position to talk to them about getting access to all of this data. We weren't just renting the data; we had made a substantial financial commitment to the archive. And as the discussions about actually transmitting the electronic data went on, this may well have been the key thing. Imagine the opposite situation: imagine we had just paid for a one-year subscription to the Vogue archive and had been very clear that we were not going to renew it, that we were just going to pay once. What would our negotiating position have been for getting six terabytes of digital data transferred to us? Weak, I think. But because we had purchased a perpetual access license, we were in a much better position to begin those discussions.

As I said before, the other thing that helped us was that ProQuest put a lot of time and effort into the Vogue digital archive product. It's a really high-quality product; they did things we wouldn't have had time to do, like segmenting articles. We had great support from our collection development team, including our world-class copyright and licensing librarian; without her help, this project couldn't have worked. We also had local technical resources, and what's interesting to point out is that these resources were outside of library IT. We wanted to have zero impact on library IT, because we didn't want to burden them with an experiment. Because we had a Digital Humanities Lab, we were able to marshal the resources and handle this outside the normal pipeline of library IT. And finally, we had a willingness to experiment, to try something because we thought the payoff might be interesting.

So here's the plan. It turns out that if you're trying to transfer six terabytes, the fastest way to do it is to put it in the mail: put hard drives in the physical mail.
For whatever reason, these hard drives were shipped from the UK, flown across the Atlantic, and landed in New Haven, and we had six terabytes' worth of hard drives. This was probably faster than trying to transfer the data through some online automated system; atoms were faster than bits. When we opened up the hard drives, we discovered we had 400,000 JPEGs, roughly 3,000 by 2,000 pixels each, and 400,000 XML files. That's all we had. So we first of all had to recreate some of the technical apparatus that powers the search and browse interface on search.proquest.com: we pulled all of the metadata, the facets, out of the XML and built a database, and then our plan was to put the experiments online.

This slide, which nobody has to read, is just the raw XML behind one of the articles, an article from February 2015 about a play, Tom Stoppard's The Hard Problem. It's an example of the amount of interesting cultural material you get in Vogue: it's not just fashion, it's cultural history. We basically pull out the facets: is this an article, is this an ad, who wrote it, when did they write it, who was the editor at the time? We pull all of those things and build a database that looks sort of like this, a very simple non-relational database with some indices, so we have the metadata we need for our text mining and image mining experiments.

Here's our tool strategy. We've got six terabytes of information, and we've transformed it in certain ways. We're going to use a mix of out-of-the-box digital humanities and data mining tools, and then we're going to do some of our own custom development for things nobody has built a tool for. And there will be some experiments that are hybrids, where we're using somebody else's text mining engine but marrying its output to our own custom front end, or extending and gluing together various people's Python scripts.

Another key slide I really want people to remember is our strategy for dealing with this in-copyright, licensed material. How can we express the results of our experiments? How can we show people what we've done, either within Yale or outside Yale's walls, while still respecting copyright and licensing? Our strategy was to store only certain things on our servers: what I call dimensionality reductions of this enormously large dataset. What's a dimensionality reduction? It might be the frequency of the word silk or mink over time, but only the frequency of that word, and only in response to a user's question. It might be the frequency of red pixels in the covers by Irving Penn. None of these can be used to recreate the archive. They are transformations of a six-terabyte dataset into a response to a specific query, and even if you stole everything on my web server, you could never recreate Vogue, nor could you recreate the Vogue digital archive, the ProQuest product. The other thing we store is citations with permalinks: we essentially store deep links back into the commercially licensed archive on search.proquest.com, so that when you see a pattern, when you see the phrase "little black dress" peaking in a certain decade in Vogue, you can then go in and look at the articles for that year or that month. By storing only dimensionality reductions and citations with permalinks, we think we're well on the safe side of fair use.
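(To make the database-building step a bit more concrete: here is a minimal sketch of pulling a few facets out of one of those article XML files into a simple indexed store. The element names and the SQLite table are illustrative assumptions, not ProQuest's actual schema or the lab's actual code; the lab describes its store as a simple non-relational database with indices.)

```python
# Hypothetical sketch: extract a few facets from ProQuest-style article XML
# and store them in an indexed SQLite table. Element names such as
# "RecordID", "Title", "Contributor", "NumericDate", and "ObjectType" are
# placeholders; the real schema would need to be mapped field by field.
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

conn = sqlite3.connect("vogue_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        doc_id      TEXT PRIMARY KEY,   -- later used to build a permalink
        title       TEXT,
        contributor TEXT,
        pub_date    TEXT,               -- issue date as yyyy-mm-dd
        object_type TEXT                -- e.g. article vs. advertisement
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_date ON articles (pub_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_type ON articles (object_type)")

def first_text(root, tag):
    """Return the text of the first element with this tag, or None."""
    el = root.find(f".//{tag}")
    return el.text.strip() if el is not None and el.text else None

for xml_path in Path("vogue_xml").glob("*.xml"):   # 400,000 of these files
    root = ET.parse(xml_path).getroot()
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?)",
        (
            first_text(root, "RecordID"),
            first_text(root, "Title"),
            first_text(root, "Contributor"),
            first_text(root, "NumericDate"),
            first_text(root, "ObjectType"),
        ),
    )

conn.commit()
```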
Beyond those dimensionality reductions and citation permalinks, we don't store anything else on our servers.

So let's take a quick look at something really simple, one of our experiments, and what I want to show you now is a live server. The big lesson from these talks is never do a live demo; so here comes the live demo. What I want to show you is 124 years, 400,000 articles of Vogue magazine, and we're going to look at that as a term frequency count. By default we just get some sample terms, and I can reload the page. This particular engine is Bookworm, designed by Ben Schmidt and others affiliated with the Harvard Cultural Observatory and Rice; Ben is now at Northeastern. This is part of the team that built the Google Books Ngram Viewer in 2010, who then decided to implement a kind of clean-room tool called Bookworm, which is essentially the Ngram Viewer with a bring-your-own-books, BYOB, strategy.

So I've got 1892 on the left and 2015 on the right, and I can go in and say: let's look at two collective words for women. What I'm doing now is pulling this particular pattern out of the corpus, and what you'll see is women in blue and girls in orange. I think there are two patterns you can take from this graph. The thing almost everybody fixates on is that there's this point around 1970 where people stop using the word girls and start using the word women. That coincides with a lot of different things: it coincides with the feminist movement, and it also coincides with Grace Mirabella taking over Vogue. The editor of Vogue has always been a woman, and when Grace Mirabella takes over, she's either reflecting societal change or shifting the tone of the magazine toward referring to adult women as women. The other interesting thing to point out is that there are some decades, mid-century, in the thirties and forties, where women and girls track pretty evenly. But there's also a time, around 1915 or 1920, when it seems like girls only means females under the age of thirteen, when girls doesn't yet seem to be used as a casual term for grown women.

Now, if I have a question about this, if I have this kind of macro pattern, this kind of distant reading of the corpus, and I say, that's interesting, but what I want to do is a close reading of what's going on in 1915, when my thesis would be that they're only referring to girls under the age of thirteen, then I can click through. Now I've entered a ProQuest server, and because I am VPNed into the Yale network, this link will resolve; if I were not VPNed into an academic institution that subscribes to this, the link would fail. What you can see here is an article from 1908, good style frocks for little girls. This does seem to be a use of the word girls for females under the age of thirteen. We're now in the ProQuest search interface; we've left the Yale server and bounced to a new tab, and I can do all of my close reading, I can cite this article, I can do all sorts of things. All I'm storing in those links is metadata, citation permalinks, and the actual colors here are just dimensionality reductions.

I can also look at fashion history as opposed to social history. I could look at, say, fox and mink, and let me add another term: let me look at shearling.
Now we can look at three terms for fur that occur in the corpus. And of course you see all these interesting things, like: gosh, what's going on with mink in the 1950s? It seems like a really popular term, and I can go in and see all of the ads and all the articles that mention mink there. But this actually points to a kind of refinement in how we can think about time, about diachronicity. I might have a suspicion that there's something about mid-century and mink that's really interesting, but it's also an interesting thing to be peaking, because mink is a luxury item and it's also a seasonally bound item.

So what I want to show you now are some advances we've made, again with Ben Schmidt's code, in pushing term frequency counts into a D3-based, data-driven-documents interface. What you're looking at here is a very different visualization from time on the x-axis and frequency of the word mink on the y-axis. Instead, we've got years along the bottom, one aspect of time, and the other aspect of time, January through December, on the other axis: the seasonality of fashion. And what you're seeing is that there is a huge 1950s obsession with mink, undoubtedly, but there's also a seasonality to the advertising of mink. It begins to be advertised essentially in October. And that makes sense, right? You're not buying a mink coat in May. You might get a better deal on one then, but most people are merchandising mink in the fall.

This is also true of other things. One of the things my colleague in the Arts Library at Yale, Lindsay King, was interested in is discussions of women's lives, words like work or college or job or boss. One of the patterns she pulled up was this big focus on the word college in advertisements in Vogue, and what she was interested in is who's advertising to women using the word college, and when is it happening? So we have another visualization of the word college, again with years along the bottom and months on the other axis. What's going on from the thirties through the fifties is a huge number of articles and ads that mention college, and they're occurring in July and August, which is exactly when you'd be trying to sell people something. What would you be trying to sell them? Well, how about a Braemar sweater? "She got a Braemar before she went to college. That's the first step. She knows that's how you start the business of being a freshman." So here you have this really interesting fashion merchandising story, expressed not only in years but in seasons, because we have, again, this excellent metadata from ProQuest: the year, month, and day on which each article was published.
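(A toy illustration of the kind of counting behind these trend lines and the year-by-month heatmap. The real site serves its numbers through Bookworm rather than this code, and the helper load_vogue_docs mentioned in the final comment is a hypothetical stand-in for however the article text and dates come out of the metadata store.)

```python
# Compute two "dimensionality reductions" for a single term: its relative
# frequency per year, and raw hit counts per (year, month) cell for a
# seasonality heatmap. docs is an iterable of ("yyyy-mm-dd", full_text).
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def term_trends(docs, term):
    hits_by_year = Counter()
    words_by_year = Counter()
    hits_by_year_month = Counter()

    for date, text in docs:
        year, month = int(date[:4]), int(date[5:7])
        tokens = tokenize(text)
        words_by_year[year] += len(tokens)
        n = tokens.count(term)
        hits_by_year[year] += n
        hits_by_year_month[(year, month)] += n

    # Normalize per year so 1910 (weekly issues) and 2010 are comparable.
    per_million = {
        y: 1_000_000 * hits_by_year[y] / words_by_year[y]
        for y in words_by_year if words_by_year[y]
    }
    return per_million, dict(hits_by_year_month)

# Only aggregates like these (plus permalinks back to search.proquest.com)
# would ever sit on the public server, e.g.:
# per_million, heatmap = term_trends(load_vogue_docs(), "mink")
```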
Now, what I've shown you so far are some very simple text mining searches, really just n-gram searches, and I want to fixate for a moment on the notion of shearling. Some of you might have been surprised when I typed the word shearling in there; it's not a word I was familiar with before. It turns out to be, I guess, some sort of fur, some sheep thing. The problem with doing n-gram search, the problem with me typing in words like women and girls, is that it presumes I know what to look for. It presumes my knowledge of the corpus is complete. And I could have checked out every single issue of Vogue from the Yale Library, I could have browsed through the ProQuest interface and read every single one of these 400,000 pages, but I'd probably forget what's in there.

So what we need is a way to show interesting patterns that are latent in the data, patterns we don't know about. An obvious way to do that is to apply a kind of machine learning technique like topic modeling. Have people heard of the phrase topic modeling? Great, so a lot of people are doing this. It's just a way of using term frequency and co-occurrence to suggest latent patterns, latent discourses, latent themes in a corpus you have not completely read. That can seem really anti-humanistic: why would we want a robot to go through and count word frequencies? It turns out it works really well, and let me try to show you that live.

What I'm going to do now is go back to the home page and pull up our topic modeling interface. In this website, I've asked the algorithm to surface 20 themes, 20 discourses, from Vogue. The number 20 is somewhat arbitrary; you can get a statistically defensible result by asking for 50 topics or five topics, but we thought 20 themes in Vogue was a good place to start. What's on the screen right now you can just scroll through. But in order to talk about a topic, I want to show you basically three areas. I want to look at the unigrams, the single words that are characteristic of, or predictive of, that topic. On the right, in blue, I want you to look at the bigrams and trigrams, the multi-word phrases that are characteristic of a topic. All of these have had stop words removed, that is, function words like "of" and "the," which is why you're getting odd phrases like "museum art," which is really "museum of art." And along the bottom, we're getting a diachronic visualization of what we're calling the art discourse.

All the robot knows is that words like art and exhibition and museum and paintings tend to co-occur far more often than they should statistically, assuming a random distribution. The computer doesn't know what those words mean; it's a completely naive, purely statistical approach. So humans can go in and say: this, to me, seems like a discourse about art, again, a topic you might not have thought of in connection with Vogue. In order to evaluate this topic, we can pull up an arbitrary point on the line here; we've hit 1962. So let's take a look at a particular year of articles in the art discourse. Here's one, something about pastels. My test is: when I pull up this random page in Vogue, is it actually an article about art history? Let's wait for it to load. Yes: "For many years, the Symbolist Odilon Redon restricted his palette to black and white." It's an art history article. The algorithm has found all of the articles in this corpus that could be thought of as belonging to a latent art history discourse. It's like a phantom subject category that no human ever applied, but that comes organically out of the corpus. Here's an article about Robert Motherwell, a spotlight on art. All of these are articles that are essentially about art history, and you get articles in here by Nochlin, by Barbara Rose; you're getting really interesting art historians publishing on art history in Vogue.
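(For a sense of what the topic-modeling step looks like in code: the project itself used MALLET from UMass Amherst, but a minimal sketch with gensim's LDA implementation has roughly this shape. The tokenization, the filtering thresholds, and the choice of 20 topics are illustrative assumptions rather than the project's actual configuration.)

```python
# Topic modeling sketch: stopword-free token lists in, 20 topics out, each
# topic a ranked list of co-occurring words, plus per-document topic weights
# that can be averaged by year to draw the "saturation over time" curves.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def model_topics(tokenized_docs, num_topics=20):
    """tokenized_docs: list of token lists, one per article."""
    dictionary = Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=20, no_above=0.5)  # drop rare/ubiquitous words
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=1)

    # Top words per topic: the lists a human then labels "art", "dressmaking", ...
    topics = {t: [w for w, _ in lda.show_topic(t, topn=12)]
              for t in range(num_topics)}

    # Per-document topic weights, e.g. to flag articles >20% saturated with a topic.
    doc_topics = [dict(lda.get_document_topics(bow, minimum_probability=0.0))
                  for bow in corpus]
    return topics, doc_topics
```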
There's also a dressmaking theme. Again, we look at the words here and the phrases there, and then at the distribution of those words over time, which is essentially a measurement of the average saturation of articles with this topic. In the early years of Vogue, a huge discourse was how to make your own dresses. This is before mass merchandising, before the Second World War. So this theme of: here's the pattern, go ahead and cut it out, is present before the Second World War.

But the last one I want to show you is a really interesting one. This is a topic that emerged, and when we took a look at it we decided to label it women's health, because the words present are words like women, exercise, body, cancer, health, and the phrases on the right are things like breast cancer and health and fitness. More interesting than the words and phrases themselves, though, is the incredibly distinct distribution, essentially the number of articles more than 20% saturated with this theme, with this discourse, which is this bump right here under Grace Mirabella. If I pull up any articles in this huge bump of a discourse about women's health, I think what you're going to find are these great articles like: are diet pills safe? How to start exercising. Why pot smokers are playing with fire. There are articles about contraception, articles about breast cancer, articles about health and exercise. This is essentially part of, I believe, Grace Mirabella's effort to turn the magazine toward the concerns of real women, the everyday concerns of women: advice on vitamins, who needs them and who doesn't. Grace Mirabella was married to a physician; she tried to remove tobacco advertising from Vogue; this was a real focus of hers. You can go through and find these articles about everyday health questions, and they really emerge under Grace Mirabella and disappear very quickly when Anna Wintour takes over the magazine.

We also have a couple of different ways of visualizing patterns in the actual advertisements in the magazine. One thing you can do, which is really simple, is count who advertises in Vogue. So one of the things we did was look at the advertisements, and let me pull up a visualization of which cosmetics companies are advertising and which stores are advertising, just to compare them. Now of course, at one point in its career Vogue published every week, and nowadays it's a monthly, so we normalize the counts, essentially to ads per month. We can sort these ad frequencies by weighted average or by average year. It turns out Nordstrom has been advertising the most in Vogue recently; Saks has been doing it for a long time, along with Bergdorf Goodman. Arnold Constable and Company, I'm pretty sure they're out of business, but they had a big series of ads in Vogue a long time ago. The same is true of the cosmetics companies: CoverGirl has been advertising quite a bit recently, but it's actually Revlon that has statistically advertised the most over time. That's nothing more than counting the number of ads based on the excellent metadata we got from ProQuest, but it tells an interesting story about these different companies. Virginia Slims shows up in one very specific period of time; you've come a long way.
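(A rough sketch of what that per-month normalization might look like, assuming the ad metadata has been loaded into a pandas DataFrame with company, year, month, and day columns. The column names and the exact weighting are my assumptions, not the site's actual method.)

```python
# Tally ads per company, normalizing for the fact that Vogue was weekly,
# then biweekly, then monthly, so raw counts would overweight early decades.
import pandas as pd

def normalized_ad_counts(ads: pd.DataFrame) -> pd.DataFrame:
    # Approximate issues per (year, month) by counting distinct publication dates.
    issues = (ads.drop_duplicates(["year", "month", "day"])
                 .groupby(["year", "month"]).size()
                 .rename("issues_that_month").reset_index())

    per_month = (ads.groupby(["company", "year", "month"]).size()
                    .rename("ad_count").reset_index()
                    .merge(issues, on=["year", "month"]))

    # Ads per issue, so a weekly 1909 run and a monthly 2009 run are comparable.
    per_month["ads_per_issue"] = per_month["ad_count"] / per_month["issues_that_month"]

    # Also a weighted-average year per company, to sort "recent" vs. "long-running".
    summary = per_month.groupby("company").apply(lambda g: pd.Series({
        "total_ads_per_issue": g["ads_per_issue"].sum(),
        "weighted_avg_year": (g["year"] * g["ads_per_issue"]).sum()
                             / g["ads_per_issue"].sum(),
    }))
    return summary.sort_values("total_ads_per_issue", ascending=False)
```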
Now lastly, before we close for questions, I want to talk a little bit about visual culture, because in all this text mining and metadata mining it's important not to lose sight of the immense visual richness of the Vogue corpus.

One of the things we've been doing with the collection, by hand, is taking issues every ten years. Every ten years we'll take an entire year's worth of issues, and we'll take those covers, because the covers have a kind of indexical relationship with what's behind them: the covers are what greet you on a newsstand to convince you to buy the magazine, and the covers are the first thing you see when the magazine comes through your mail slot. So we thought about what would happen if we took the covers, pretended they were printed on overhead transparencies, and stacked them on top of a projector; we're essentially taking the mean value of each RGB pixel. What you get is this progression. In 1900, the covers of Vogue are characterized by this header, which never changes, while the illustration in the middle does change, so the average doesn't look like anything. But the thing people usually fixate on is 1970 and 1980. Let's go into 1970 and 1980, there they are. This is kind of creepy: it's the same woman, with the same turn of the head, with the same hair. This is every cover from 1970, and this is every cover from 1980. So obviously they're in an aesthetic rut, but their circulation is going up under Grace Mirabella.

Essentially, what you're looking at in this progression is an anti-pattern. When you can see a clear pattern, it means that Vogue is doing all of its covers the same way. When you can't see a pattern, it means every one is a unique work of art. In fact, if you look at 1940 and 1950, which sit just underneath 1970 and 1980, they look like mush, and that's because this is the illustration era: every single cover is designed by hand, by Salvador Dalí, by Horace Torres, by Irving Penn. These are amazing works of art; here are some of the covers from those decades in which you just can't see a pattern. So what we've done with this kind of work in Photoshop is create a kind of picture of repetition versus innovation in the Vogue space.

There are other things we can do that are a little more quantitative. What I want to show you now is some work based on Lev Manovich's tool. Here, on the y-axis going up, we have the mean saturation, how colorful the covers are, and the x-axis is time. And instead of expressing each data point as just a dot, we've used the actual covers themselves, so you can zoom in and see the individual Vogue covers. But it's perhaps most helpful to zoom out and see what's going on in 1970 and 1980, this huge jump: there are basically no covers that are less than about 20% saturated in 1978. It's this massive amount of saturation; the colors are getting very intense, at least as measured by the mean saturation of all the covers, and that corresponds with that very florid era. Here are some of the covers: the women's faces have an enormous amount of color in them. We think this is just an interesting macro pattern, and we've drawn red lines around this huge jump in saturation in the 1970s and 80s. Again, it's hard to capture this when you're looking at each individual cover, but when you plot them as a function of their saturation using Lev Manovich's tool, you can.
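(Two small sketches of these cover experiments, assuming the cover JPEGs are organized one directory per year; the paths and sizes are illustrative. The first function produces the stacked-transparencies image by averaging every pixel position across a year's covers; the second measures a single cover's mean saturation, the quantity plotted on the y-axis above.)

```python
# Cover experiments: per-pixel mean of a year's covers, and mean saturation
# of one cover (the HSV "S" channel averaged over all pixels).
from pathlib import Path

import numpy as np
from PIL import Image

SIZE = (400, 530)   # resample everything to a common size before averaging

def mean_cover(year_dir: str) -> Image.Image:
    """Average all covers in a directory into one 'overhead transparency' image."""
    stack = [np.asarray(Image.open(p).convert("RGB").resize(SIZE), dtype=np.float64)
             for p in sorted(Path(year_dir).glob("*.jpg"))]
    return Image.fromarray(np.mean(stack, axis=0).astype(np.uint8))

def mean_saturation(path: str) -> float:
    """Mean saturation of one cover, scaled to 0-1."""
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=np.float64)
    return float(hsv[..., 1].mean() / 255.0)

# mean_cover("covers/1970").save("vogue_1970_average.png")
# saturations = {p.name: mean_saturation(str(p)) for p in Path("covers/1978").glob("*.jpg")}
```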
Here's the last visual culture thing I want to show you, and this is actually a site that isn't public yet. There are, of course, problems with taking the mean saturation, or hue, or lightness; there are a lot of different color models. The problem with saying, I want the mean hue of a cover, is that if you've got a woman in a yellow dress on a blue background, the mean is going to be green, and that's not how a human experiences the cover. So what we wanted to do instead was use k-means clustering to segment out the colors in a particular space. For example, my blue jeans are not a consistently dark blue, but if you ran k-means clustering on my jeans you would get a dark blue: all of the stitching and so on would essentially average out. We then bucket the resulting colors into the CSS3 palette, which is about 92 colors, and we build a search engine based on color.

What this means is that I can hover over a particular cover; I'm hovering over a kind of green here, and what comes up is all of the covers that have that green, ordered by the amount of green, the percentage of green. So this cover has the most, then less, less, less, all the way down, but there's still some green right there at the bottom. Down here we've built a kind of visualization of where the colors sit: here's the green, there's some sort of skin tone, and there's the blue. So this is a way of thinking about building a search engine based on color. And the algorithm doesn't discriminate semantically about what's there: in this case the yellow is the word Vogue, and here it's the word special, or first, or smile. The algorithm doesn't care. It's just finding that yellow, clustering it toward an average yellow, and then mapping that into the CSS3 palette. We can sort the palette by hue, saturation, or luminosity; frequency is the default. What we're trying to do here is take all of the visual richness of the collection and find different ways to surf through it.
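(A sketch of the color-indexing idea: k-means over one cover's pixels, then snapping each cluster center to the nearest named color. Only a handful of CSS3 names are hard-coded here as a stand-in for the full palette the site buckets into, and the cluster count and image size are illustrative assumptions.)

```python
# Cluster a cover's pixels with k-means, then express the result as a small
# palette of named colors with their share of pixels; sorting covers by
# shares of a given name is the "search by color" index.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

CSS3_SAMPLE = {                      # tiny stand-in for the full named palette
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "gold": (255, 215, 0), "seagreen": (46, 139, 87), "navy": (0, 0, 128),
    "salmon": (250, 128, 114), "saddlebrown": (139, 69, 19),
}

def nearest_named(rgb):
    """Closest palette name by squared distance in RGB space."""
    return min(CSS3_SAMPLE,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(CSS3_SAMPLE[n], rgb)))

def cover_palette(path, k=8):
    """Return {color_name: share_of_pixels} for one cover image."""
    img = Image.open(path).convert("RGB").resize((200, 265))   # shrink for speed
    pixels = np.asarray(img, dtype=np.float64).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(pixels)

    shares = {}
    counts = np.bincount(km.labels_, minlength=k)
    for centre, count in zip(km.cluster_centers_, counts):
        name = nearest_named(tuple(centre))
        shares[name] = shares.get(name, 0.0) + count / len(pixels)
    return shares

# "Which covers have the most of this green?" is then just sorting covers
# by cover_palette(path).get("seagreen", 0.0), highest first.
```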
The final thing I want to talk about before we open up for questions is the notion of giving this data to researchers at Yale. We've done a couple of different things, but one of the most successful has been to transfer six terabytes of this material to a graphics lab on campus run by Holly Rushmeier, a professor of computer science and former chair of the department. She has an ultra-secure computer that's not connected to the internet, and her undergraduates, who are working in math or statistics or applied math and computer science, have done great projects on the Vogue data.

One of them, let's see if I can pull this up, is Cristiana Wong, who took all of the data and ran face detection algorithms on the corpus. Cristiana is a computer science and applied math major, but she had taken a lot of gender studies classes at Yale, and she was really interested in the relationship between the faces and the rest of the image. In her presentation she does the math of detecting a face, which is really easy to do with what are essentially the autofocus algorithms from smartphones, algorithms that have to run on CPU- and battery-constrained devices; it's a doddle to run them on static images. You detect a face, and then you compute what percentage of the larger picture the face takes up. We have the bounds of the larger picture because we have the rectangles drawn by ProQuest to identify each image. So in Cristiana's work, she goes through and applies face detection algorithms to the corpus. She talks about successes and failures: where are we finding faces, where are the left eye and the right eye, and then what patterns can you find over time? What percentage of a picture is the face? Do we tend to get a lot of cases where the face fills the entire frame, or where it's a very small part of the frame? This was actually her senior project, and we're so excited that a computer scientist was able to do her senior project on this ultimate humanities dataset, a fashion magazine, 400,000 pages of data mediated by the library.
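(A minimal sketch of that face-measurement idea, using OpenCV's stock Haar-cascade face detector, the same family of detectors used for camera autofocus. The parameters are illustrative, not the settings from the student project, which also made use of ProQuest's image rectangles.)

```python
# Detect faces in a scanned page or ad image and report each face's area
# as a fraction of the whole image.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_fractions(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # One number per detected face: share of the frame the face occupies.
    return [(fw * fh) / float(w * h) for (x, y, fw, fh) in faces]
```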
So lastly, a couple of final context slides before we close. An obvious question is: what's the relationship between a website like the one we've shown and the vendors' evolving search and browse platforms? ProQuest, Gale, EBSCO, they're all developing TDM portals. These are going to be great products, and we're really excited to see them coming; they're not here yet, though some are in beta. I think the improvements represented by the TDM portals will benefit all of us, and I think there will be a stratification of support. If somebody asks me to come into an intro class and just talk a little bit about text and data mining, I'll be very happy to have something like a ProQuest TDM portal. If somebody says, I'm really interested in this African newspaper, it happens to be in ProQuest, the TDM portal works on it, could we try an n-gram search on top of it, could we try a term frequency count? I would be delighted not to have to build that from scratch. On the other hand, when a faculty member comes to me and says: here's the deal, we're doing this text mining, I need to do part-of-speech tagging, I need to take out all gendered adjectives, and I need to stem the words back to their dictionary forms, then I would rather have total control over the corpus, and for that I need the six terabytes' worth of data. For that I need the raw OCR and XML, so that I can have complete, documentable control over what I'm doing. I think this connects to the keynote speaker's notion of computational reproducibility: we need access to the raw data; we don't want it stuck in black boxes.

Of course, TDM portals are part of a moving target of vendor support for TDM more broadly, and probably everyone in this room knows one or more parts of that spectrum. There's the vendor who says: no, you may not have any data. There's the vendor who says: well, go ahead and scrape the site. Believe it or not, that's what people are now telling us sometimes, just slowly scrape the site with Python, which doesn't seem too smart. There's limited API access: we'll give you n-gram counts, but we won't apply a stop list for you, and we won't tell you whether we're collapsing case; the not-so-great APIs. There's fuller API access, which I think some of the big TDM portals, like ProQuest's, will undoubtedly provide. There's downloadable scrambled data, and the example of that is JSTOR's Data for Research, DfR: you can get articles from in-copyright journals from JSTOR, with the twist that the words come in frequency order, so a human can't read them but an algorithm can still do topic modeling on them. And then there's full data. That's the far end of the spectrum; that's the six terabytes of hard drives in the mail that you can do anything you want with.

So what are the lessons learned from building Robots Reading Vogue? First, a polite but persistent approach to vendor relations. I'm sympathetic here, because my own organization is a large organization and I know how difficult it can be to make sure you're talking to the right person. It's not surprising that a salesperson will never have heard of text and data mining, or, if he or she has, won't be quite up to speed on it. For Vogue, some of the best answers we got came from the people in the machine room at ProQuest; they really knew what they were doing. They were the ones saying: here's the metadata translation table, here's the file that makes this work. Second, we built on a lot of other people's tools. Some of the color extraction you saw here was based on Python scripts written or adapted by Cooper Hewitt Labs. Bookworm is obviously a project of Ben Schmidt et al. The MALLET toolkit comes from UMass Amherst; that's what we used to do the topic modeling. Third, I can't emphasize enough how important it was to have the technical ability to do this inside our own library. We didn't want to burden ProQuest with support for this, because we knew we were on our own, and we didn't want to burden library IT. And finally, the thing that was really cool about this project, from my perspective, was getting the data into students' hands. Students are continuing to work on Vogue, and I think we'll see a lot more results from Holly Rushmeier's lab in the coming years. Thanks very much, and I'd be glad to take any questions.