It is my honor today to introduce Erez Lieberman Aiden and Jean-Baptiste Michel, who will be discussing with us today all the exciting work that they've done in culturomics, an exciting new field that they've helped found. I will keep this short, but one story I would relay is that at the engineering school, these guys are known as the pan-disciplinary guys, because the word interdisciplinary was used so often on all the websites and in all the brochures and materials, as I'm sure many of you know, but there was some concern that no one is really truly interdisciplinary in their work. Erez and JB, as he goes by, completely defy that stereotype. If you look at the New York Times story that broke two months ago with their leading Science paper, they managed to include Google, MIT, HMS, the engineering school, and almost every institute at Harvard in their work. And I think that is exactly why we want them here today and why what they are doing is so exciting: they are truly breaking down barriers and helping, especially the humanities, do exciting research and explain to the other sides of the academic world why the humanities matter. And so with that, I'm going to give it to you guys. One reminder: please turn off all your cell phones, if you can; there are so many electronics in the room, we don't want any interference. And finally, this talk will be recorded, so your questions will be saved for posterity on the internet. Bear that in mind. Amar from the Berkman Center might interject at the end during the Q&A with questions from Twitter. But otherwise, please write down all your questions, and at the end we'll have some time for those. Okay, so thank you very much, guys. Take it away.

Hey, it's a tremendous pleasure to be here. Thank you so much for inviting us.
We're here to talk about culturomics: quantitative analysis of culture using, in this particular case, millions of digitized books. I'm Erez Lieberman Aiden, and this is my collaborator on that work, Jean-Baptiste Michel. We represent all kinds of organizations that tolerate us. We're going to give you a bit of an overview. There are all kinds of things that you can do at the Harvard Library. In fact, there are certain things that undergraduates are required to do at the library that are perhaps a little bit odd. But anyway, the traditional thing, when you're not doing that, is to read a few books very, very carefully. You find some books that you are interested in, and you read them very, very carefully. That's a pretty good method, but you'll never get through the whole library that way, which you might find distressing. Well, here's another approach: you could read all the books, but do it very, very not carefully. Is there some benefit to doing that? That's really the question I'm going to tackle in this talk. More precisely, the question is: can we use computational methods to get a macroscopic view of an entire civilization by reading all books very, very not carefully? All right. So we're going to give you a very quick proof of principle of why that can be an interesting method, because one might well wonder, if you're just skimming books, what are you doing? We're going to start with the English language. There's a very nice thing about English verbs: they're very regular. To express the past, you just add a little particle to the present. So this distinguished professor over there would say: today I learn; yesterday I learned. Very easy. Now, of course, that doesn't hold in the most important of cases. When you're, for instance, chased by a big rock, you'd say, that rock almost got me — got, not getted. So some of the verbs in the English language are not regular; they do not conform to this regular rule.
Those are the irregular verbs. Now, the nice thing about irregular verbs, from our standpoint as mathematicians interested in cultural evolution, is that they tend to become regular with time. For instance, here's a cartoon describing the financial crisis: Remember the money you were saving for a rainy day? It shrunk. So in the future this will presumably read "it shrinked," because that verb is becoming regular. That's a very nice feature for us mathematicians. We wanted to quantify that, to say: let's take this evolving feature of language and look at how it changes over time. So here was our plan. There's lots and lots of stuff that's been written in the English language over time; it looks kind of like this. We would take any irregular verb that we're interested in — for instance, the verb to see. We would take all of the forms of to see, and we would just write down whenever any of them occurs in this text. I mean, of course, we're not going to do this ourselves, right? But if we found a thousand undergrads, then we could deploy essentially all of these undergrads to go through all of these texts and write down all of the irregular verbs. That seemed like a pretty reasonable plan to us. So flyers went up, undergrads were informed, and one showed up. But she's a very, very good undergrad, so she makes up for a thousand. Yeah, she's a wonder grad. Why is she so wonderful? Well, the plan became: look, we can't read all the primary sources, so why don't we read the secondary sources? We'll take 11 grammar textbooks spanning Old English and Middle English, use them to compile lists of irregular verbs in Old English which are still in the language today, and try to figure out what happened to them. It turns out that of these 177 Old English irregular verbs, 145 were still irregular in Middle English, and 98 of them are still irregular today. The rest of them have all regularized.
So now you can arrange these verbs — because we're mathematicians, we like tables, so we put them in a big table. These are the 177 verbs that were irregular 1,200 years ago. The verbs that are most frequent go at the top of the table: to be and to have, to come, to do and to find. The verbs that are very rare go at the bottom of the table: to delve, to bind, to span. Now, we colored these verbs as a function of whether they are still irregular now, or whether they have become regular with time. As you can see, verbs like to be and to have, which are still irregular, are in black. Verbs like to starve or to melt, which now conform to the regular rule, are in red. If you notice, as you go down this table, red becomes more and more common. This exemplifies the fact that when verbs are more rare, they regularize faster. This is a very simple thing: if you are not reminded of things, you forget about them. So here, at the level of a society, if you are not reminded enough that these are exceptions to a rule, you will just use the rule to conjugate them. They will disappear from the exceptions and become regular verbs. And this is what you observe here. Now, there's a very simple mathematical feature, which is that if a verb is 100 times less frequent than another, it regularizes 10 times as fast. That just emerges from the table I've shown you. And we have a nice equation that relates the half-life of an irregular verb to the square root of its usage frequency, which is really neat for us. Whatever model you have of the evolution of the English language, the model must conform to this. This is an empirical feature that shows how this particular grammatical rule has been evolving over 12 centuries. So if you come up with an idea in your mind of how English should evolve, it must reproduce that.
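The scaling just described can be written compactly. This is a sketch in my own notation, not a formula from the talk: if $f$ is a verb's usage frequency and $t_{1/2}$ the half-life of its irregular form, then

```latex
t_{1/2}(f) \;\propto\; \sqrt{f}
```

so a verb that is 100 times less frequent has a half-life $\sqrt{100} = 10$ times shorter — that is, it regularizes 10 times as fast.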
So that's something that could not have been known if you hadn't looked at language the way we did. This is one example of why it's useful to do this type of computational, large-scale analysis of language and of culture at large. In addition, you can actually make very nice pictures with it, which is always nice. So a friend of ours, Jean-Anton Saagosti, when we told him about this result, said: that's really nice, because here's what happened. At the beginning of time, somebody took the English language and put it at the top of an hourglass, right there. All the verbs are irregular: to have, to think, to drive — all irregular. And then, as time goes by, those verbs fall through the opening of the hourglass. In doing so, they pick up the -ed particle, so they become regular. Of course, some verbs are much more used than others, so they're much bigger — like to have and to think — so it takes them much more time to go through the opening of the hourglass. That explains the result. So we sent the paper to Nature, and we were very lucky to have it accepted. We also sent them the picture, and they liked the picture so much that they put it on their cover. That really made our day. Cool. So that's a proof of principle that you can try to do this kind of thing for several years and not immediately end your scientific career. Thus duly encouraged, we set out to do this better. Let's review what we've learned. The best thing you could possibly do, if you wanted to quantify history, if you wanted to get some sense of historical trends, is to read absolutely everything, record it, write it down, make a very, very beautiful table, right? Just get a thousand undergrads, a million undergrads, and write it all down. If there were a Y axis for awesome, this is about the most awesome thing you could do. The problem is that, if there were an X axis for practical, this is not a very practical approach.
So what we did was the far less awesome, but far more practical, approach of using secondary sources. Still, we had always dreamed of getting over there into the upper right quadrant. Now, why is it so impractical to be there? The reason is that we're just two guys, right? So us doing it could take a very long time. But of course, if you could find somebody else to do it, why, that would be very practical indeed. Apparently, Google, since 2004, has been very, very, very rapidly digitizing every book they can possibly get their hands on. So suddenly you have this option, which is both extremely awesome — you've got all the books, all the words, everything ever written — and also practical. So we went to Google and we told them about this idea, and they said, okay, let's give it a shot. What we wanted to do, really, was to make a tool that anybody could use. We weren't really interested in just reproducing our study and publishing it; we were interested in making something that anybody could use — a measurement tool for history, for cultural trends. Now, that means you need to release data, because if you just do everything behind a firewall, nobody knows what's going on, and there's no reproducibility, so it makes no real sense. So we needed to release data. Now, Google has digitized millions of books. The ideal data release would be to take the full text of those millions of books and release it into the wild. On the Y axis for how awesome that is, it's very, very, very high. But we were taught a very simple principle, which is that if you have five million books, you have five million authors, and five million authors means five million plaintiffs. So if you release that, it's going to be a massive lawsuit. And of course, for us young scientists, that's not so great — not so practical. So that solution we did not retain.
Now, of course, as we always do, we went for the more practical thing — still a bit awesome, but not quite as much. We thought: well, okay, we have these millions of books, and we can actually use the full text. What can we do with it? Maybe we can release some statistics that are useful. It will never be as useful as the full text, but it might still be quite useful. The statistics that we decided to release are what we call n-grams. A 1-gram is a word. A 2-gram is two words — like "a table" — and a 5-gram is five words: "the United States of America." An n-gram, by extension, is n words. So we decided to count the number of times n-grams appeared. Here's an example: take the phrase "a gleam of happiness." What we do is we take this phrase and we ask how many times it appears in books that were published in the year 1800. And in books published in the year 1801, 1802, 1803, 1804, and so on, until 2008. That gives us a trajectory of how much this particular n-gram was used. And we do that for all the n-grams that appear in the books that we have been looking at, which amount to five million books total. That gives us a table that is about two billion lines long and about 500 columns wide, and it tells us things about the way phrases, words, and sentences have been evolving over the course of the last few centuries. That's what we released. Now, let me give you a general overview of where all of this data is coming from. For centuries now, since the invention of the printing press, people have been writing books; a few of their faces are depicted here in a row. Once they wrote their books, where did those books go? Some of them were retained in publishing houses, but the vast majority of the ones we still keep track of live in libraries. And Google has been taking those books from the libraries and publishing houses and digitizing them.
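The counting procedure just described — one usage trajectory per n-gram per year — can be sketched in a few lines. This is a minimal toy illustration, not the actual pipeline Google and the authors used; the function names and the two-line "corpus" are mine:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def count_ngrams(books, n):
    """books: iterable of (year, text) pairs.
    Returns {ngram: {year: count}} -- one usage trajectory per n-gram."""
    table = defaultdict(lambda: defaultdict(int))
    for year, text in books:
        for gram in ngrams(text.split(), n):
            table[gram][year] += 1
    return table

# toy corpus standing in for the five million scanned books
corpus = [(1800, "a gleam of happiness in the foreground"),
          (1801, "a gleam of happiness again")]
table = count_ngrams(corpus, 2)  # 2-gram counts, year by year
```

The real released table has one row per n-gram and one column per year from 1800 to 2008, which is where the roughly two-billion-by-500 dimensions come from.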
So we approached Google and we said: hey, you've got all these digitized books; why don't we work together and try to create a corpus — a massive collection of texts — that we think is reliable enough for the types of studies that we want to do. Now, it turns out that of the books that have been scanned — 12 million was the public number when this figure was made; now it's 18 million — not all were suitable for use. There are two major reasons we couldn't use a book. One is that we didn't think the metadata was good enough: for instance, the book claims to have been written in 1703, but we don't necessarily believe it. The other is that the scan isn't good enough, so we don't have a good sense of what the words in that book are. The pictures aren't good enough for the algorithm that recognizes the characters — the OCR, optical character recognition, algorithm — which essentially says: hey, this book doesn't look terribly good. So we had to throw lots and lots of things out. In fact, we threw out about two-thirds of the database. What we were left with was five million books to analyze. And as JB just described, what we did was create this table, using those five million books where we trusted the data and trusted the metadata, showing the frequency of words and phrases over time. Here we're enormously indebted to Yuan Shen, Matt Gray, and Jon Orwant, our collaborators, for making that step possible. So when we got this data, we did a lot of analysis with it, which we're going to show you in a second, and we published it in the journal Science, which is right there. And again, somebody had made a nice picture that we could send them; again, it went on their cover; again, it was a great day. This particular picture is actually an image of a sculpture, by an artist called Matej Krén, who makes sculptures with thousands of books, tens of thousands of books.
And here he built a very big tower with them, which is not as tall as it looks, because there are mirrors inside it, so that when you look into it, you see the reflections go on to infinity. Now, the interesting thing is that if you try to count the books you can see in this particular picture, you get, I think, around 100,000 books. So this is still 50 times smaller than the set of books we could analyze in this study. It's far more than the number of books you could read, but far fewer than we can read when we're not careful. So the point is, with this data, we can start doing very nice things with cultural evolution. Here's the state of the art, as of six months ago, on irregular verbs. If you were interested in knowing about the verb to thrive, you'd go to a leading contemporary scholar, Steven Pinker, with amazing hair, and you'd ask: Steve, how about the verb to thrive? How should I conjugate it? He'd tell you: you know, most people will say thrived; I know a few, however, who would say throve. So they're very few. And then you'd go 200 years back and ask this other distinguished gentleman, with equally fantastic hair, how about the verb to thrive? And he would tell you: you know, in my day, people throve; just a few of them thrived. That's pretty much it — you wouldn't even know exactly where to put the points. You'd say: this must be blue over here, this must be red over there. That's it. And now we're going to show you two lines of this two-billion-line table that we have built: the lines for thrived and for throve. This is real data that should unfold now. This shows you the trajectory, year by year, from the year 1800 to the year 2000, of the word thrived and of the word throve. You can see exactly when the shift occurred and how fast it occurred. You can also see that people don't always thrive: there's less thriving here than there is over there. So it delivers a very quantitative picture of grammatical evolution.
In fact, nobody had ever observed grammatical evolution — the evolution of irregular verbs — that way. Here we can capture the exact moment when the shift occurred and how big it was. We knew that things evolved, but we had never seen it; this is a picture of it. And that's two lines out of the two billion. So the rest — that table is at least a billion times more awesome than this one picture. Now, there are two ways you can convince yourself that what you're doing is not nonsense in a study like this. The first is, you know, garbage in, garbage out: if you include lots of books that have bad data or bad metadata, you're going to get trajectories that make absolutely no sense. So we did our best to make sure that wasn't the case. The other thing is that after you have the database, you can start doing sanity checks. You can say: look, I know that certain things get big at certain times; I should be able to see that in my data. And here are two examples of that, from the English-language corpus. We have 43 heads of state — people who became president of the United States or held some equivalent position in their country. Now, usually becoming head of state is pretty good in terms of getting your name out there. So we asked: okay, is it pretty good at getting people's names out there in our data? We lined everybody up so that their zero year was the year in which they ascended to power, and then we computed their average trajectory. And you can see very, very clearly that there's this very dramatic rise for these leaders at the time they ascended to power. We did a similar thing with treaties. If you take a treaty, people very, very rarely talk about it before the treaty is signed. There are a couple of exceptions, but by and large, people don't really talk about treaties before they're signed.
So, okay, let's take 124 treaties — just a list of treaties that we downloaded from Wikipedia — and let's plot their frequency over time relative to the date of their signing. And again, as you can see, there is a very clear and dramatic rise right after the treaty is signed. So that tells us, in sum, that not only do we think we're putting pretty good data in, but to the extent that there are phenomena where we know something should spike at a certain time, or exhibit a certain feature, we actually see that in our data. So we have some confidence that when we show you these kinds of trajectories, what we're showing you is, to some extent, reflective of culture at the time, in the language that we're interested in. Now, we like irregular verbs. We've been doing all this stuff on irregular verbs going back many years. So it was kind of irresistible to say: okay, we've got this microscope, we can point it at anything we want — let's look at the irregular verbs.

Can you talk a bit about why you like irregular verbs?

Yes. So the reason that we like irregular verbs is the following. First, they worked; everything else didn't. We're going to talk about the things that didn't work. But perhaps a more substantive answer, in terms of why they worked, is the following. Irregular verbs have this unique property, which people have been talking about for some time: they have very, very high frequencies. High-frequency verbs tend to be irregular; low-frequency verbs tend not to be. Why is that? Why do the irregular verbs have such high frequencies? People have talked about this for some time, and we expand on it further in the talk: low-frequency irregular verbs tend to disappear. So what you have is actually a very, very neat situation where, with an irregular verb, on a kind of coarse pass, you have a Boolean property.
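The superposition analysis described for heads of state and treaties — align each series at its own year zero, then average across series — can be sketched like this. A toy illustration; the function and variable names are mine, not from the paper:

```python
def align_and_average(trajectories, zero_years):
    """trajectories: one {year: frequency} dict per leader or treaty.
    zero_years: the alignment year for each series (ascension to
    power, or date of signing). Returns the mean frequency at each
    year measured relative to the event."""
    sums, counts = {}, {}
    for traj, zero in zip(trajectories, zero_years):
        for year, freq in traj.items():
            rel = year - zero                    # years before/after the event
            sums[rel] = sums.get(rel, 0.0) + freq
            counts[rel] = counts.get(rel, 0) + 1
    return {rel: sums[rel] / counts[rel] for rel in sums}

# two toy series: mentions jump in the year after each event
avg = align_and_average(
    [{1860: 1.0, 1861: 5.0}, {1918: 3.0, 1919: 7.0}],
    [1860, 1918])
```

With real data, the averaged curve is what shows the sharp rise at relative year zero for the 43 heads of state and the 124 treaties.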
Is the verb regular, or is it irregular? You can say one or zero. And you also have this other property — its usage frequency — which you can measure precisely, which is continuous, and which is related in some monotonic way to fitness. So you have a neat situation: a fairly measurable phenotype, and a fairly measurable, continuous fitness function. That makes for a good combination if you're trying to do an evolutionary study of some cultural phenomenon. I mean, if you look at tables — what's a table? How has the table changed over time? It's very, very hard to track. How often are people using tables? I don't know. How many tables were made this year? Who knows? So irregular verbs solve a lot of the problems that are really difficult about tracking culture. It turns out that now we have much more power, so we're not necessarily stuck there anymore. But it's not that irregular verbs tell you everything about our culture — it's not about meaning. The interesting thing is that people have been talking about cultural evolution for a long time, and there have been very, very few measurements of what it could be. Irregular verbs are a very nice instance of such a measurement. The other thing that's neat about the irregular verbs is that what you're seeing is really the emergence of a rule. The -ed thing is a rule — a grammatical rule. It says: when you form a past tense, you add -ed. The irregular verbs are holdovers of dead rules. So you have this really nice phenomenon you're studying, which is about the propagation of a rule in a language. So there's a nexus of interesting things which have made irregular verbs objects of study for decades — they have fascinated people for a very, very long time — and which make them a really good object of study here as well. But we shouldn't get too carried away with irregular verbs, because we can talk about them for a long time.
There are many other interesting things I want to show you. But since we are talking about irregular verbs a little longer, let me tell you about one in particular. This is actually largely under-reported in the media, but you should hear about the verb to chide. It's the fastest verb on the planet. It's gone from a usage frequency of about 10% in the regular form — people used to say chid 90% of the time — to people saying chided 90% of the time, in only 200 years. That's incredibly fast for an irregular verb, and it makes us very excited whenever someone says that they chided someone else. Now, of course, we tell you that irregular verbs disappear, but sometimes the opposite happens: some sneak in. The verb to sneak, for instance, was actually a regular verb that has become irregular over the last century. So, as Steven Pinker, I think, pointed out: sneak snuck in. It's on the pattern of the verb to stick — sneak, snuck; stick, stuck — and I think this is probably why it happened. That's one example of a verb that has become irregular. We can also track irregular verbs in different countries. The United States is leading in all sorts of things, and leading in the forces of regularization in particular. So here's the evolution of the verbs ending in -t: burnt, burned; learnt, learned; smelt, smelled. There's a whole class of these, and they have become regular much faster than the others. In the beginning of the 1800s, both the U.S. and the U.K. tended to use these verbs 25% of the time in the regular way. Then the U.S. really took a leap on the U.K.: it became mostly regular very fast, by 1858, and the U.K. only caught up in the late 1950s. The way we did this, by the way, is that we compared the usage frequencies in books published in the U.S. versus books published in the U.K. That's how we're able to tell you something about U.S. versus U.K. language.
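The U.S.-versus-U.K. comparison boils down to one statistic per verb per year: the share of past-tense uses that take the regular form. A minimal sketch, with names of my own choosing:

```python
def regular_share(regular_count, irregular_count):
    """Fraction of past-tense uses taking the regular -ed form,
    e.g. counts of 'burned' vs 'burnt' in one year's books
    for one national corpus."""
    total = regular_count + irregular_count
    return regular_count / total

# toy counts: early-1800s-style usage, about 25% regular
share = regular_share(25, 75)
```

Computing this share year by year, separately for books published in the U.S. and in the U.K., gives the two diverging curves described above.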
Of course, this does not correspond exactly to American language or exactly to British language, but it is definitely enriched for each. So the irregular verbs are something that's very, very measurable, and everyone kind of understands them, so that's cool. But what we wanted to do next was say: okay, we've got this crazy microscope, let's point it at the types of things you couldn't dream of measuring. I mean, if you were a physicalist in the Vienna Circle, you would take a concept like collective memory and say: that doesn't mean anything, because obviously you could never measure anything like that. So we said, okay, let's try to measure something really, really, really vague. Let me tell you the history of the year 1950. Pretty much no one gave a damn about 1950 for most of recorded history. Through the 17th century, the 18th century, the 19th century, no one cared about 1950. In fact, through 1910, 1920, 1930, no one cared about 1950. Only starting in the mid-to-late 40s did people realize: hey, 1950 is really going to happen, and we'd better get ready for it. There started to be a little bit of a buzz about 1950, but nothing made 1950 interesting like 1950. During the year 1950, people were just walking around obsessed. They couldn't talk about anything other than the things that were going to happen in 1950 and the things that had just happened in 1950. They were totally fascinated by 1950. In fact, this persisted for years after the fact: in 1951, 1952, 1953, people were still kind of debriefing — gosh, 1950, what a fascinating year, so much change. And then, as with all things, 1954 rolled around, and people realized that 1950 was kind of passé. They forgot about it, at rates that we can measure. And what's interesting is that when you look at the history of the year 1950, it looks exactly like the history of 1910.
It looks more or less like the history of 1883, or of any other year we have on record. In particular, we see two very striking features. One is that we talk more about time than we used to. The other is that there are actually two phases in this process of forgetting: a very, very rapid forgetting phase, which we might call our collective short-term memory, and a much longer phase, which we might call our collective long-term memory. So it's like: okay, that's great — let's test our memory. Let's figure out the half-life of the collective short-term memory. We plotted that over time, going back about 150 years. And what you can actually see is that this falloff used to have a half-life somewhere in the mid-30s, in years; now it's around 12. So the rate at which we lose interest in the past is getting faster and faster with each passing year. Right. Of course, we looked not only at how we forget about things but also at how we learn new things — for instance, technologies. If you look at the telephone: the telephone was invented in the 1870s, over here, but it started being mentioned in books only 25 to 30 years later. Nobody talked about the telephone, presumably because there weren't really wires yet, and people weren't able to use the telephone, even though it had been invented in principle. Now, the radio did not have the same fate: radio spread much faster after it was invented, close to the year 1900. So there's this idea that for any invention we can probably measure how fast it starts entering culture. Here we took inventions invented in the first 40 years of the 19th century, in the following 40 years, and in the 40 years after that. We take all these inventions, ask when they were invented, and look at their aggregated trajectories.
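Measuring the half-life of the fast forgetting phase amounts to fitting an exponential decay to a year's frequency trajectory after its peak. Here is a sketch under the assumption of clean exponential decay; the real curves are noisier, and this is not the authors' actual fitting code:

```python
import math

def half_life(years, freqs):
    """Fit f(t) = f0 * 2**(-t/h) by least squares on log2(f),
    returning the half-life h in years."""
    t = [y - years[0] for y in years]
    logf = [math.log2(f) for f in freqs]
    n = len(t)
    tbar = sum(t) / n
    lbar = sum(logf) / n
    # slope of log2(f) versus t via ordinary least squares
    slope = (sum((ti - tbar) * (li - lbar) for ti, li in zip(t, logf))
             / sum((ti - tbar) ** 2 for ti in t))
    return -1.0 / slope  # log2(f) drops by 1 every half-life

# synthetic post-peak trajectory decaying with a 12-year half-life
ys = list(range(1950, 1990))
fs = [1e-5 * 2 ** (-(y - 1950) / 12) for y in ys]
```

Repeating this fit for each year's curve, from the 1850s onward, is what yields the trend from a mid-30s half-life down to around 12 years.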
And we see here that the inventions invented in the early 19th century took longer to really increase in the culture than inventions invented closer to the 20th century. You can see that, as a function of time, the trajectories of inventions are just becoming faster: new technologies penetrate culture faster than older technologies did. There can be many reasons for that, and there are many hypotheses. The fact is that this is data about the way technology penetrates culture — something that was pretty hard to measure before this kind of data existed. Now, the next thing, beyond the new and the old, is fame. We're all people; some of us get famous, some of us don't. And there's an interesting question as to how that happens. Who are the most famous people born in any given year, and what do their trajectories to fame look like? So we went back, for every single year in the last two centuries, and asked: who are the most famous people born in that year? The class of 1850, the class of 1855, and so on. Here we're showing you the trajectories of the 50 most famous people born in 1871 — the class of 1871, if you will. And we can learn a fair bit about fame just by looking at their names. Orville Wright: he was involved in planes. Ernest Rutherford won the Nobel Prize for scattering experiments in physics. Marcel Proust wrote books. So there's an incredible array of different ways in which people get famous. But what's interesting is that if you look on average at the class of 1871, you see that it takes a while before anyone notices them. No one really cares about the class of 1871 in 1883, because they're all 12-year-old punks. Yeah, Orville, that's really cute; I'm sure someday you'll fly. But then, eventually, they start to get noticed. And what do we mean by getting noticed?
We mean that their frequency crosses a threshold of about 10 to the minus 9 — one part per billion — which is, by and large, the frequency of the lowest-frequency words in the dictionary. So if you're as commonly used in the language as an infrequent dictionary word, we feel like you should be in the dictionary too: you've made it. But these people, they don't just make it, right? As soon as they pass that threshold, they start an incredibly rapid ascent; then they peak; and then there's this slow decline as we forget about them. And you can measure all of that pretty quantitatively. This story is the story of any class: the class of 1871, the class of 1865, the class of 1920. So, for any given class — this is the median trajectory of the 50 most famous people born in 1865 — there's an age at which they become famous. For this cohort, it's 34 years old. Then there's how fast they rise to fame; it turns out to be an exponential rise. Every four years, in this case, their fame doubles: within four years, we're all talking about them twice as much in books as we used to before. Then you reach a plateau, around the age of 70, and then people start forgetting about you, of course. But they forget about you much more slowly than they learned about you, which is pretty good news — although by the time that matters, you've been gone a long time. But the point is, these parameters, while they capture each and every cohort's trajectory, have changed a little bit over time. The doubling time has become faster: celebrities born in the 1920s became famous much faster — their fame rose much faster — than celebrities born in the 1800s. The doubling time went from eight years to close to two years: every two years, these people became twice as famous. This is a very, very big change. And of course, the age at which you become famous has become smaller.
So celebrities born in the 1920s became famous before they were 30. And this is a study of people who mostly did not benefit from television, for instance. So now, with the current media, the internet, television and so on, one can only wonder how fast this is changing again. When you say mentions doubled, is that as a percentage of all words in a given year? Yeah, right. Everything we're showing you is normalized: the frequency of a word, to us, is the number of times that word was seen in a year divided by the total number of words we saw that year. So everything is normalized for the number of books and the number of words in that year. Now, I may be wrong, but I imagine that technologies linked to the speed of publication do not have an effect that would change the doubling time of fame by a factor of five in these years. Faster publishing technologies would probably change by two or three years how fast information about something that had occurred penetrated culture, but I doubt they would affect the speed at which fame rises afterwards; does that make sense? You're completely right that there is a lag between when something occurs and when it is recorded in the book record: you need to write about it, and you need to publish it. But that lag seems stable, as far as we can tell from the studies we've done here. Let's hold the other questions and discuss at the end. Now, I feel like many of you in the audience are thinking: this is very interesting, but how does it relate to me? There's good news, which is that many of you are actually quite young. So I wanted to give you some advice, a heart-to-heart with data. Look, people are interested in getting famous; you should know what your options are.
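For the quantitatively inclined, the two measurements just described, the normalized frequency and the doubling time of an exponential rise, can be sketched in a few lines. The counts below are invented for illustration; they are not taken from the actual corpus.

```python
import math

# A minimal sketch (invented numbers, not the real corpus data).
# Frequency of a name in a year = (mentions that year) / (total words that
# year), so results are comparable across years with different book volumes.
mentions = {1900: 40, 1904: 160}                      # hypothetical counts
total_words = {1900: 2_000_000_000, 1904: 2_000_000_000}

def frequency(year):
    """Relative frequency of the name in one year of the corpus."""
    return mentions[year] / total_words[year]

# If fame rises exponentially, f(t) = f0 * 2**(t / T), the doubling time T
# can be read off two points on the rising part of the curve.
def doubling_time(y1, y2):
    return (y2 - y1) / math.log2(frequency(y2) / frequency(y1))

print(frequency(1900))            # 2e-08
print(doubling_time(1900, 1904))  # 2.0 -> fame doubles every two years
```

Here fame quadruples over four years, so the doubling time comes out to two years, the kind of value quoted above for 1920s-born celebrities.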
There's a whole bunch of different strategies you can engage in to get famous. So we looked at the 25 most famous people pursuing different strategies: becoming a political figure, becoming an author, becoming an actor, becoming a biologist, et cetera. How does that work out? Well, it turns out that if you want to get famous fast, you should become an actor, because actors become famous in their mid-to-late 20s. They become famous pretty rapidly and then plateau. Now bear in mind that these are people born between 1800 and 1920, so these actors are not benefiting from the television age. Anyway, at the age of about 40 or so they plateau, and that's it. So you're young, you're famous, that's great. But if you want to be rather more famous, you could wait a little longer, until your mid-30s, and be a writer. Writers after their mid-30s get more and more famous, and they can peak at much higher levels than actors and persist for even longer. Still, if you really want to get famous and you're really willing to delay gratification, what you should do is become a political figure. Of our 25 most famous political figures, 11 became presidents of the United States and nine became heads of state of other countries. They had to wait until around 50 or so to become famous, but when they did, they became more famous than any of the other groups. There are certain things you should not do if you would like to become famous. In principle, you could become a biologist, an artist, a physicist or a chemist. That's kind of like being an actor: you'll get equally famous, but you'll have to wait until you're 65 or 70 to do it, which kind of wastes the benefit of being famous while you're alive. But anyway, that's up to you. The thing is, and this is really a rookie mistake, definitely don't do what we did.
Do not, under any circumstances, get tricked into becoming a mathematician. The problem with doing math is that nobody notices. People say, oh, mathematicians do their best work when they're young, which I guess is great for getting all the work out of the way, but still nobody notices you until maybe a tiny bit when you're 80 and not necessarily in a position to do much about it. Now, there are many things we could show you next; here is one about the effect of suppression and censorship on the cultural trajectories of these people. Here's a famous Jewish painter, Marc Chagall. We're looking at his trajectory in English books: it rises at some point and continues rising, with some accidents along the way, like with anybody else. Now let's look at the name of Marc Chagall in books published in German. This is what happens, the red trajectory here. He becomes very famous, and then it starts decreasing. It starts decreasing in English too, that's fine. But in German it drops essentially to zero somewhere between '33 and '45. This is not something that happens with famous people; it's very, very rare. It's surprising, or not that surprising once you realize what happened between '33 and '45: during the Nazi regime, there was an enormous amount of suppression, and we see it very clearly in these trends. In fact, we have been able to find evidence of censorship in many cultures. Take the case of Jesse Owens. Jesse Owens is a remarkable figure in the history of athletics. What did Jesse Owens do? He won four gold medals in track and field at the 1936 Olympics. That made him very, very famous right in 1936; you see that huge spike in 1936 in the English corpus. That's interesting. What's sometimes much, much more interesting is where he did it: at the 1936 Olympics in Berlin.
In Berlin in 1936, Hitler is in charge. There's this ideology of racial supremacy, the Aryan race, the supposed inferiority of other races, and the track and field events are the most prestigious events of the Olympics. So Hitler wants to make a very public example of the superiority of his athletes, right there in Berlin. The trouble is that Jesse Owens wins four gold medals. It's just the most direct and public repudiation of the Nazi ideology you could possibly imagine. So it's interesting to see what the German response was, what would be seen in the German corpus. Basically, you see that same surge of interest in Jesse Owens, but you don't see it until 1945. This gives me some sense of the psychology of what it is like to live under a totalitarian regime that's controlling information. We kind of laugh when the Iraqi Information Minister says everything is completely normal while there are Abrams tanks rolling in the background. But the point is, and in the case of Jesse Owens you can see it very clearly illustrated, that if something happens, no matter how public, that directly repudiates the dominant ideology, you just pretend it didn't. And that can actually work on the scale of an entire civilization. We see this happening under many governments. Here are three heroes of the Russian Revolution: Trotsky, Zinoviev and Kamenev. All three were very famous, until Stalin turned against them. Two of them were executed; one was assassinated. Now, when you're assassinated, your fame typically tends to shoot up. But here it drops and stays down; it remains suppressed for the next 50 years. This type of suppression, again, is really uncommon. You have to wait for perestroika for the memory of these people to be restored in books written in Russian.
We see such examples not only with people but with events too. Take Tiananmen Square, where two big events occurred, one of them in 1976. You can see in English books that we talk a little about Tiananmen Square because of that event, right there; in books written in Chinese, it's talked about a lot. Now, of course, what we remember most today about Tiananmen Square is not what happened in 1976 but what happened in 1989. And indeed, you see a sharp spike in mentions of Tiananmen Square in books written in English. You do not see that in books written in Chinese. Very likely this is the result of suppression of this particular event. And it's not only authoritarian regimes; you see this here in the U.S. too. During the Red Scare in the U.S., these people, Hollywood writers and directors known as the Hollywood Ten, were asked by Congress to come testify about their supposed links with communists. They refused to go. So the movie executives gathered and decided to blacklist them: these men would not work for the studios anymore, and they would not be credited in movies anymore. They were out of the picture, literally. And you can see the trajectories of these people: after 1947, when that occurs, the median trajectory goes down, and it stays down until 1960. It's only in 1960 that Dalton Trumbo was finally credited again, in this case for the movie Exodus. You can see the very heavy weight of decisions in cases like this. Here's a great example. Albert Maltz is one of the Hollywood Ten that JB was just telling you about. It's interesting to compare him to the director Elia Kazan. Raise your hand if you know who Albert Maltz is. Now raise your hand if you've ever heard of Elia Kazan. This is pretty striking.
Now, why is it that you don't know who Albert Maltz is, but you do, by and large, know who Elia Kazan is? Because actually, up until 1947, Elia Kazan was less famous than Albert Maltz; Maltz was doing pretty well. The thing is, both of them are called to testify before the House Un-American Activities Committee. Albert Maltz stands on principle and refuses to name names. Elia Kazan does not. He decides it isn't worth his career; he names names, and he has forever been associated with that fateful decision. But let's see how that affects his life. You can see that Elia Kazan goes on to have a tremendously successful career, while Albert Maltz's career peters out after that decision and suffers enormously during the 13 years in which the blacklist is enforced. So it's really interesting how you can take this kind of data and use it to get a quantitative picture of an ethical decision that individual people made at individual moments in time, and of the consequences those decisions had for their lives. Probably, if Albert Maltz had decided to name names and Elia Kazan had decided not to, all those hand-raises would have been precisely inverted. But half the time Kazan is being mentioned for that fateful decision; his name is there in a negative context. It's true, but half the time he's also being mentioned for Cat on a Hot Tin Roof, you know. Actually, we don't know that. We don't know in this case whether he's being mentioned in a negative way. That's my point: you don't know, you just know his name is in a book. Yeah, it's absolutely true that when you look at these texts, many of them are negative, many are positive, many are neutral. But the thing is, he's still around; he's still someone people know about, and Albert Maltz is not. So, we wanted to take a closer look at one of these phenomena, in particular the Nazi censorship that took place during the Third Reich.
So what we did is start from the blacklists that the Nazis produced. There was in fact a Nazi librarian, Wolfgang Hermann, who created blacklists, and they were very, very systematic about this, as with many of the other horrifying things they did. In his blacklists, Wolfgang Hermann writes down all these names of people who should be pulled out of libraries, and he actually categorizes them: here are the philosophers we don't want, here are the people writing about history we shouldn't have around, here are the people writing about religion. Very, very systematic. In fact, his blacklists formed the basis for the 1933 book burnings in Berlin and throughout Germany. So we took those blacklists and asked how the blacklists affected the mentions of the people named on them. There are four blacklists from Wolfgang Hermann: his politics, literature, history, and philosophy-and-religion blacklists. We also added one more: the names of all the artists in the Degenerate Art Exhibition. The Nazis took artists they did not like and, instead of just getting rid of their work, took some of it and put it in a sort of mock exhibition where they just made fun of it. This Degenerate Art exhibition circulated all around Germany. So we took those five blacklists, and for each of them you can see the size of the decline; we also compared them to a collection of names of Nazis. This lets us measure, in a sense, the effectiveness of these blacklisting and information-control schemes for the different groups. For instance, the artists decline on the whole by about a factor of two.
The people writing about philosophy and religion decline by about a factor of four in terms of mentions, whereas if you were on the history blacklist, the suppression was somehow not as complete; maybe that wasn't as high a priority, and those names declined only about nine percent during the Third Reich. In contrast, for the Nazis themselves, there's a 500 percent increase in mentions. Now, what all that tells us is that these signals are so strong that we should be able to detect censorship and suppression without knowing about it in advance, without having to know who was being suppressed. Case in point: Henri Matisse. If I knew nothing about Henri Matisse during the Nazi years, I would say: I know his fame before, I know his fame after, so his fame during should be somewhere between the two. So I would predict his fame to be up there, but I observe that he's actually talked about only this much. The difference between those two points, or rather the ratio between them, gives me an index of suppression. I can compute that number for all 700,000 names that appear on Wikipedia. And then I can say: if that number is very, very small, maybe it's suppression; if that number is very, very large, maybe it's propaganda; and if it's around one, it's pretty much what I expect. Here's the distribution of suppression indices for 5,000 names in English between '33 and '45. Nothing special happened in terms of censorship of people in English books between '33 and '45, as far as we know, and indeed the distribution of suppression indices is tightly centered around one, which means that what we observe is pretty much what we expect: people are talked about as much as we thought. Now in German books, it's very, very different. The whole distribution is shifted to the left, and note that it's a logarithmic scale here.
So this shift is actually quite substantial: people are talked about much less than we would have thought. And more importantly, the distribution is much wider, which means there are many more people being suppressed than we would have expected. Here, Pablo Picasso is talked about ten times less than we would have predicted. And there are also people at the far right of this distribution, in terms of suppression index, who are talked about more than we would have thought; this is probably due to propaganda. Now, we have actually looked at these names and compared these results with a manual annotation of the same names, and we found really striking agreement between this quantitative method and the judgments of a human annotator. What that tells you is this: if you're a historian thinking about Nazi censorship, and you want to know who the people censored by the Nazis were, how are you going to find out? You could try to read all the books and notice that, as you progress through the Nazi years, there are some names you used to see but don't see anymore, so probably they were censored. That's very, very difficult. Or you can use this method to propose a list of people who probably were censored. You hand all the people at the extremes to a historian, and the historian says: okay, these are candidates for suppression; now I'm going to go and check whether these names were actually suppressed or not. So it's not something that replaces the work of a historian; it's something that complements it. In point of fact, we took many of these names from the extremes and sent them to a historian working at Yad Vashem, saying: we're not telling you which extreme they come from, but which do you think? And their qualitative assessment matched very well with this quantitative algorithm.
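The suppression index just described can be sketched numerically. This is our own simplification with invented frequencies, not the paper's exact procedure: estimate a name's expected frequency during the Nazi years by interpolating between its typical frequency before and after, then compare what was actually observed.

```python
# Rough sketch of a suppression index (simplified, invented numbers).
def suppression_index(freq_before, freq_after, freq_during):
    expected = (freq_before + freq_after) / 2  # simple interpolation
    return freq_during / expected              # ~1: as expected
                                               # <<1: candidate suppression
                                               # >>1: candidate propaganda

# A name at 2e-8 before the Nazi years and 4e-8 after,
# but only 3e-9 during them:
print(suppression_index(2e-8, 4e-8, 3e-9))  # ~0.1, suppressed ~10-fold
```

An index far below one flags a candidate for suppression (like Picasso above), and an index far above one flags candidate propaganda; a historian then checks the candidates by hand.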
In fact, the method surfaces names like Hermann Maas, who sits at the furthest extreme of our distribution. He was a Protestant minister who spoke out against what the Nazis were doing, and he was later recognized by Yad Vashem for his activities during the war. So anyway, we had the ability to do this kind of thing, but it's much less fun if it's just the two of us who can do it. So we said, we've got to make this data available to people. A couple of weeks before the paper came out, we approached our collaborators over at Google and said, hey, we've really got to get you guys to take our prototype and create a web-based version of it. And they did an amazing job, led by Jon Orwant, and put together this tool in two weeks flat. It's really amazing, and it gives you, and the rest of the world, the opportunity to play with this kind of information. So let me tell you what this is. This is the Google Ngram Viewer. It lives at ngrams.googlelabs.com. You type in a word or a phrase, and you see how frequently it was used over time. In this particular case, people decided they were interested in whether the United States is a plural or not. Back in the early 1800s, people used to refer to the United States in the plural: the United States are strong, and their economy is doing very well, or whatever they might want to say. These days, we refer to the United States in the singular: the United States is. People have been debating for decades, actually for over a century, about when that transition happened, when we started thinking about the United States in the singular, and people have said all kinds of things. But they haven't been able to just look. Now you can just look. It turns out that the transition, the 50-50 point, is around 1876.
In fact, if you look at the inaugural addresses of presidents, you can see that before the window around the 1870s, the inaugural addresses refer to the United States in the plural pretty much exclusively. There's about a 30-year window of time where they do both, and after that they refer to the United States entirely in the singular. The interesting thing is that you don't have to take our word for it; you can check this number for yourself, in the sense that if you're interested in seeing examples of "the United States are" in context, you can click around here, and that will lead you to Google Books containing that particular phrase, with highlights: "the United States are subject to the extremes of heat and cold." Still true today in Boston. So what this gives you is a front end for a digital library. You have a library with millions of books; how are you going to browse through it? You can do it the classic way, with the card catalog, or you can do it this way: I know this term is interesting, I'm not necessarily sure which particular use of the phrase is interesting, but I see there's an abundance of the term here, so I can click on it and open the books. And actually, we're working with the Harvard libraries to develop such a browser for them. This is a pretty powerful way to browse digital libraries. We also wanted to make another point: we've gotten very, very used to the old card-catalog method, and especially as there are increasing discussions about things like the DPLA, the Digital Public Library of America, it's going to be important to start thinking beyond the card-catalog types of interfaces that we have now, to new types of interfaces.
Interfaces like the Ngram Viewer, and all the future ways in which we're going to be able to visualize books on a global scale, in order to think about what the interface is really going to be for the digital libraries of the future. So, I think we have to speed up, because people will want to ask questions. But before we do that, I want to present the Ngrammys. An important thing to know is that when we launched this thing, nobody at Google actually expected it to be that useful; they thought maybe a hundred professors would use it. But it turned out that in the first 24 hours, over a million queries were run on the Ngram Viewer. So it became quite the thing. And of course, there are Oscars for good movies and Grammys for good music, so we thought: this is great, we should give Ngrammys for good ngrams. The first one concerns doing your best. You know, today we want to put our best foot forward and do everything as best we can. Well, here's an example of a query that was actually run on the first day. This person looked at "best" and realized that, apparently, in the late 18th century nobody cared about doing their best. They were not interested in putting their best foot forward, doing their best, being the best they could be. What's going on here? There's actually a very, very simple explanation: character recognition errors. There's this thing called the medial s, a different way of writing the letter s when it appears in the middle of a word. This was not incorporated in Google's optical character recognition at the time we did the study.
So the OCR misread the medial s, and that's why "best" looks so rare in the 18th century. We did report this fact in the very, very extensive, hundreds-of-pages documentation all about this paper, but nobody was interested in reading that, because when you have a fun new gadget, you're not going to read the instructions. And actually, that's a really good sign that people were enthusiastic about it. Here's another interesting story. A man goes to his condo association meeting. An older gentleman there uses the word "fortnight" in a sentence; nobody knows what he's talking about, and he gets very upset. But our hero is very resourceful: he boots up the Ngram Viewer and shows that "fortnight" is much less frequent than it was many fortnights ago. The older gentleman understands, and the tension is defused. So, among other things, we believe the Ngram Viewer is bringing peace. We're also helping marketing firms, who want to "incentivize" so they can "strategize" their "synergy," words that became very successful at the end of the 20th century. I think that's a pretty important marketing vision. Now, frustration. This is a study of a very specific kind of frustration, the kind where you bump your foot and go "aargh." This is the two-a aargh, very popular; in 1982 there's a surge of two-a aarghs of frustration. We think this might have had something to do with something, but we can neither confirm nor deny it, formally or informally. Now, it turns out we can study the aargh of frustration all the way up to the eight-a aargh, and you should not look too closely at what's going on there. Note that this is a very different argh from the piratical arrgh; many arghs are different from this one, just making sure you're not confused. There are lots of other things people have actually done with the raw data.
Beyond the viewer, the entire raw data set of some two billion ngram trajectories is available. For instance, one team created a little app that tells you how big your Twitter vocabulary is, and I don't know how good the app is, but it certainly got really, really popular; it actually trended on Twitter, which is cool. Another cool thing was done by a student working with us, together with John Bohannon, a correspondent for Science. Science was rolling out a piece that said, let's do something interesting with this data. So they asked, what should we do with it that's interesting? And they thought: wouldn't it be great if we had a science hall of fame? Now, how do you usually create a hall of fame? Usually you get a whole bunch of older people, and they all sit around in some room and cogitate in unknown ways and decide who is actually important, as with football players. But in science, I guess there was some kind of shortage of older people to cogitate, which is kind of hard to understand, so somehow this never got done. So we said, well, we're scientists, let's do this in a scientific way. They took all of the scientists who are in Wikipedia and ran them algorithmically through this system to figure out who the most famous ones were. In fact, they also defined a unit of fame. Charles Darwin is really, really famous, so we'll call him one Darwin, and then everybody else is usually measured in milli-Darwins, because it's very, very hard to be as famous as Charles Darwin. Just to give you a sense of things, Steven Pinker is at thirty-something milli-Darwins. So it's very, very hard to be as famous as Darwin. And then people thought they could take this gold standard of fame, the milli-Darwin, and ask whether, when you're famous, you get a good Wikipedia article. Wikipedia has an internal measure of how good
an article is: when it's green, it's good; when it's red, it's not. But notice something happens here. If you rank the most famous men by fame (John Dewey is really famous), the quality of the articles is pretty good. If you rank the most famous women by fame, the quality of the articles is not as good. It's not the most perfect way to see it, but you do see that, when you control for fame, women tend to have articles of poorer quality than men. It's already known that Wikipedia is written roughly 80 percent by men, and that seems literally to be reflected in the quality of the articles about men and women. So that's an important question. But we want to close our remarks with culturomics itself. We named it after genomics: culturomics is the application of high-throughput methods to the study of culture. We start with books, knowing full well that books do not contain all of culture. There are many, many other things that make up culture: there are newspapers, there are non-textual sources, there are manuscripts that are really hard to decipher and go way back, there are maps, pictures, paintings, sculptures, canoes, whatever you want. However, what's interesting about today is that these things are getting digitized. The present is more and more digital, but so is the past: Google and many other companies and universities and nations are digitizing these objects of history. They are digitizing the past; the past is becoming more and more digital. And when you have all these things available and can look at them from a computational standpoint, we argue, and have demonstrated in part, that it can be very useful to have methods for looking at these objects that are not as careful as the ones we currently have, methods for looking at these objects in the millions. And the last very interesting thing about this is that we don't need to wait for copyright
laws to change for this to be useful. The books we have studied are mostly under copyright, but by releasing the data about them, we didn't release anything that was itself under copyright; we just released statistics, and these statistics hold some very interesting information. So even if copyright law still holds the research world back today, we can gather millions of these objects that are under copyright and do interesting things with them that don't break copyright law. One thing that's actually really neat is that since the paper came out, only a couple of months ago, people have thought, this is kind of cool, and have built a whole bunch of n-gram viewers of their own. For instance, a Colombian newspaper said, this is a cool way of viewing our data, and put up its own n-gram viewer. Someone with a massive corpus of musical scores said, let's do an n-gram viewer for scores, so you can say, I'm interested in the following note, how frequently was it used over time? So I think what we're seeing is that we need to learn new languages of multiple types. We need to figure out how to digitize all of these repositories, but we also have to figure out clever ways of representing the data, ways that are engaging, so that we have that sort of vocabulary as these repositories come online. Finally, we wanted to thank all the people who were involved in this study, many, many people, in fact hundreds of people on the Google Books team, especially Jon Orwant, Matt Gray, Peter Norvig and Dan Clancy at Google; Adrian Veres and Yuan Kui Shen; my wife Aviva, who came up with the great idea of studying censorship; Martin Nowak; Steve Pinker; and Joe Pickett, executive editor of the American Heritage Dictionary, for input on words. They all worked very hard to get us the insight we needed. Thanks as well to all of the people who funded us.
Thanks very much. So, thank you very much. We want to take as many questions as we can. If you could identify yourself and your affiliation briefly, and then state a question, with a question mark, and if we can couple them, sorry, but it happens sometimes, two or three at a time to batch-process them, that would be fantastic. Who would like to start us off? I have a question about forecasts. Do you think you can use this data to project culture? Because you're projecting fame, you're projecting how long it will take for the decline of, you know, a mathematician, scholar, scientist, et cetera. I think you have to be careful with this, but it's absolutely the case that one ought to be able to make some sort of predictions with it; again, you have to be very careful. At the most basic level, the simplest prediction in the world is that any time you see a linear trend, in any scientific field, the trend should continue, because most things continue to be the way they are, to first derivatives. That's going to apply to a lot of the types of trends we've shown, but it's not going to apply to many others as well, and so one of the things we need to learn is to push our boundaries and find out what our boundaries are in doing things like that. We can get pretty reliable numbers for aggregates: the 50 most famous people, the irregular verbs, things like that; there, we could say with confidence that the trend will continue. But for an individual query, an individual word, an individual person, that's much, much harder. Andy McAfee from MIT: have you come across any examples of small-n historians, as opposed to what you all do, getting something wrong? Well, small-n historians as a collective often have a wide variety of opinions, so usually there's some minority opinion that turned out to be right. We've actually spent a little
bit of time looking for that that much and we have not found this yet but we're hoping that as these programs this can be a tool that the way we think these intervals with history is quite first it can give you certain hypotheses on which to build oh i didn't know about this thing what why is it so it's starting point for discovery and maybe in some cases you can also prove or disprove prove or disprove some hypotheses the key thing is that these hypotheses need to be formulated in quantitative ways yeah this currently like by condition in large scale this is currently not the case in most parts of what i mean there definitely are kind of one off remarks that people say sort of without thinking goes with never checkable where people said oh you know nobody talked about cabbage until the 30s you know when you see you know reports of cabbage and the congressional records or like that you know so there are occasionally running to sort of one off things like that which you can check here and we just know that people are wrong you know sometimes people say things that are dramatic right but did they really mean it right so there's an incredibly inspired piece of you know civil war history it says oh yes you know before the civil war everyone said you know the united states are after the civil war everyone said the united states is you know like so whether you know it was a singular past and was exactly what the war was being fought over and this is a great beautiful interpretation of states rights et cetera it's i was deeply inspired when i read it i mean it's kind of wrong strictly speaking but i'm no less inspired for that reason i don't know if you know the person was writing it you know many decades ago i don't know you know what they what they meant by it to to some extent so it's uh it's definitely something that we're learning but how to map qualitative historical plans on to quantitative plans and make them talk to each other uh George Blakesley Leslie 
University. It'd be interesting, with the presidential campaigns about to kick off, to test the prediction concept with your fame trend analysis tool and watch the trends as they go, to see if you could determine who the winner might be.

Again, I think this will be particularly relevant in the newspaper record, in the Twitter record, in much more instantaneous media. All the things we've shown are really true of books. We think they might apply to some extent to other media, but they're really specific to books.

Shall we take another question? One, two, three. There's somebody in the back.

You focus on text and books. I would like to know about voice data; for example, sound and voice data such as classical music. That is very important data. Have you considered that?

Yeah, we definitely thought about it. There are important issues there: there are pluses about it, and there are also minuses. One of the minuses, for instance: the way copyright law deals with music says that if somebody transcribes a score, they now have a copyright on that transcription. So you all of a sudden have a digital world that is loaded with an incredible array of people who have claims, and navigating that is very, very challenging. Another challenge is that if a single note is off in a score, it's actually much harder to detect than, for instance, a single letter off in a word. If you take a word and change one letter, say make an I into an L, the result seems very obviously wrong, or at least very foreign, and therefore unlikely; that kind of correction is much harder to do in the case of music. So there are a lot of new challenges, and it's obviously an incredibly important area. As we mentioned earlier, there is actually currently a musical n-gram viewer that runs on a relatively small data set, so you can get a sense of what that looks like. We started with text because text is much easier to interpret than music or movies or sculpture. So we started with that, but clearly things are going to move in that direction. And there are sound recordings that go back over a century, so one could do some of this.

I'm Jason Kaufman from the Berkman Center. Do you have metadata to differentiate fiction from non-fiction, and different genres?

We do. There are many answers to that. At Google, for instance, each book has metadata; by metadata I mean who the author was, when it was published, whether it's fiction or non-fiction, and so on. The way they produce this metadata is by aggregating it from many content providers, and there are many, many conflicts; it's a really difficult problem. For instance, we have produced a fiction corpus, but we took all the books where somewhere in the header there was "fiction." This is enriched for what we mean by fiction books, but something like a scholarly study of Shakespeare's plays will count as a fiction book, because it's about fiction. So in the data set that we have produced, it's not that clean. Now, at the Harvard level, in the Harvard libraries, they have much more uniform metadata, so working with them we might be able to have a much better solution.

We have time for three more questions, and I will throw in one, if I may. What are the implications, as people like you and very enterprising companies digitize more and more of the world, for intellectual property and copyright? Something that we think a lot about at the Berkman Center.

Well, I think one of the important things that comes out of this, and this is a point that JB made as well: the approach that's being taken now in many of these digitization projects, especially when they're not happening in the private sector, is that we need to push
for copyright reform in order to make these things possible. I think that's true to an extent, but consider tools like the Ngram Viewer: there are many, many books that Google has digitized but, for copyright reasons, cannot make accessible in any form except through the Ngram Viewer. If we can think of many tools like this, they might give us enough of a case to say we should digitize this stuff anyway, that there's going to be tons of stuff we'll be able to do with it, and then, once we have everything digitized, push for the copyright reform. Otherwise we get into this difficult situation where, without copyright reform, there's not much push to really be digitizing, and without digitizing, there's not as much push to do the copyright reform.

George Malkin from Central Square. Last week I think I heard Ethan Zuckerman mention censorship in China, where people can talk about Egypt, but it doesn't trend; the secondary data, the metadata, doesn't show up, so people don't know that other people are talking about it. I'm wondering about that in terms of what you're doing, with respect to censorship: if you look and see what the topics are on Twitter or Facebook or other things out there in the digital world, you don't know that other people are talking about it, because they suppress the indicator.

Well, I don't think that has any consequence for our work. But one word about it: if you suppress the indication that something is trending, you suppress the possibility for other people to see it and keep it trending up. So from a mathematical perspective there might be something interesting going on, and from a public-policy standpoint as well, because obviously there are extremely strong
feedback loops in these types of things, and if you cut those off, that certainly should have consequences.

So you think that if you looked at the data, you could parse that out?

It might not be impossible: either some features of how it stabilizes, or the nature of the rise. It might in fact be difficult, but it's not impossible. There are all kinds of subtle cues for this sort of stuff. For instance, the folks over at the arXiv, the physics preprint archive: every day they send out all the new papers, in some order, and there have been really, really careful studies showing that the order of those papers actually has a dramatic impact on long-term citation statistics. Something goes wrong on a particular day, something else randomly ends up in that first slot, and it has long-term consequences for all the papers involved. So those types of subtle mechanisms can be really important and really persistent. But that's a very interesting point.

David Weinberger from the Harvard Library Innovation Lab. Does the corpus have enough structural information in it so that you know what's in headlines, subheads, indices, and tables, which are really interesting objects as well?

There is basic stuff that gets cut off: things in the header, which often just contain the title over and over and over again, you don't want that, so it gets cut off; page numbers get cut off. So what you're looking at is really the meat of the text, what the author actually wrote; the front matter and the indices get removed.

Words in chapter titles and captions: is that sort of structure marked up? Because that can be very interesting and valuable to know, as a further level of analysis, which words are used in subheadings.

The big challenge here is that if you don't actually get this stuff, this stuff is
not going to come out through an n-grams-type structure. We could in principle create, say, a caption n-grams corpus, or you can mark it up: instead of being just the word "burn," it's "burn" plus an indication that it was actually in a quotation. It's not impossible for that to end up in an n-gram. The main problem, though, would be to be sure that what you are calling a header really is a header. Consider the problems we had just being sure that a book was actually written in the year it was said to have been written in. Now, if you're talking about quotations and headers, with a volume of five million books that come from many different sources, it's really a very difficult problem. It might be more tractable if it involved a single university library.

Two last questions. On my left here.

I just wanted to ask quickly about how your methods of analysis change whenever you change your time scales. You're looking at time scales on the order of hundreds of years, but if we start looking at data sets like Twitter, where the time scales shift, what is invariant and what has to change?

We're empirical scientists, so we are really driven by what is in the data. In fact, when we started with this data, we had many ideas about what to look at, but the questions we wanted to look at turned out not to be the most interesting at all, and we don't talk about them anymore. We just look at the data, see what is there, and try to understand it. So I suspect that if we had the Twitter data right now, that's what we would be doing: looking at the data and trying to adapt to it. I don't think we would necessarily transport some of these methods there directly. But there are some generic things that one thinks about that probably would lift over, things like: what's the time resolution of your data, and how much depth do you have at a given time resolution? For instance, if we want one-year resolution, we can throw fewer words into a one-year bin; if we want ten-year, decade-by-decade resolution, then we can afford to have a lot more words per bin, so we can look at much rarer stuff at one-decade resolution than at one-year resolution. What's also crucial is that every medium has characteristic lag times. For books, if something happens at time x, there's roughly a two-year lag before people write books about it, because publication just takes a very long time. Twitter is very different: we're giving this talk now, and in principle someone could be tweeting about it at this very moment, so it's essentially zero lag. Questions like that, about the lag of the medium and how it works as an echo chamber, can definitely have an effect on your thinking.

Do you envisage that, going forward, you could analyze the opposite effect, that is, from words to meta-tags? By this I mean, for instance, if you have "Cairo," "sphinx," and "pyramid," obviously that article or that sentence is about Egypt, or about Las Vegas. And that could also be relevant in terms of censorship, because sometimes you can talk about somebody or something without mentioning it: "that artist that was born in 1923," and so on.

Thank you. Michel Foucault actually makes this point about censorship: he said the censor creates many more discourses than it gets rid of. There's a sense in which, when we talk about censorship, we're talking about one very particular discourse, the one that works by mentioning a person's name. As you mentioned, there are many other ways in which that discourse can take place, and those are much, much harder to
track. So that's absolutely an issue that one must recognize as an important component of these types of studies. As for the statistical classification: you're absolutely right that that type of classification, determining the metadata from the data, is possible. We spent some time looking at that, and it's a very, very important direction. This is called making a language model: given a corpus, you make a model of the language, and then, using that model, you could say, oh, this looks like a book from the 1700s, by looking at how well the model explains it.

Great. I think I'm going to say thank you very much for the amazing talk.
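The resolution-versus-depth trade-off described in the answer on time scales, that rarer words need coarser time bins before their counts become reliable, can be sketched in a few lines. The yearly counts and the reliability threshold below are invented for illustration; they are not from the actual n-gram data.

```python
from collections import defaultdict

# Hypothetical yearly occurrence counts for one rare word (illustrative numbers).
yearly_counts = {1900: 2, 1901: 0, 1902: 1, 1903: 3, 1904: 0,
                 1905: 1, 1906: 2, 1907: 0, 1908: 1, 1909: 2,
                 1910: 4, 1911: 1, 1912: 0, 1913: 2, 1914: 3}

def rebin(counts, bin_width):
    """Aggregate yearly counts into bins of the given width (in years)."""
    binned = defaultdict(int)
    for year, n in counts.items():
        binned[year - year % bin_width] += n
    return dict(binned)

def reliable(counts, threshold=10):
    """A crude reliability check: every bin must clear a minimum count."""
    return all(n >= threshold for n in counts.values())

print(reliable(rebin(yearly_counts, 1)))   # False: too sparse at 1-year resolution
print(reliable(rebin(yearly_counts, 10)))  # True: decade bins clear the threshold
```

The same word that is hopelessly noisy year by year becomes usable decade by decade, which is why a fixed corpus supports rarer queries only at coarser time resolution.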
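The language-model idea mentioned at the very end, dating a text by which era's word statistics best explain it, can be sketched with a minimal unigram model. The toy corpora and function names here are hypothetical, a simplified stand-in for what one would do with real era-specific book samples.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Count word frequencies over one era's corpus."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts

def log_likelihood(text, counts, vocab_size):
    """Score a text under an era's unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in text.lower().split())

def date_text(text, era_models):
    """Return the era whose model gives the text the highest likelihood."""
    vocab = set()
    for counts in era_models.values():
        vocab.update(counts)
    v = len(vocab)
    return max(era_models, key=lambda era: log_likelihood(text, era_models[era], v))

# Toy corpora standing in for era-specific book samples (purely illustrative).
era_models = {
    "1700s": train_unigram(["thou art wondrous fair", "whither goest thou sir"]),
    "1900s": train_unigram(["the car drove down the highway",
                            "she called on the telephone"]),
}

print(date_text("whither art thou", era_models))   # prints 1700s
print(date_text("the telephone rang", era_models)) # prints 1900s
```

A real version would use much larger corpora and smarter smoothing, but the principle is the same: era-specific vocabulary makes the date recoverable from the text alone, which is how metadata can be checked, or inferred, from the data itself.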