Let me first say, I'm the setup guy here. Hal is the punchline. And the reason for this is that on this, as on many of the projects that Hal and I work on jointly, and it's about half a dozen at the moment, I wave my arms a lot, and then Hal actually makes things work. One of the reasons that we're presenting Media Cloud is that, to a certain extent, it doesn't work. And one of the reasons that it doesn't work is that we are taking an engineering approach to some extremely deep computer science problems, by which I mean we are looking at solutions that exist out there and linking them together, with the very able help of David Larochelle, who's sitting there in the corner, and duct-taping together existing algorithms and pieces of code to try to solve a very, very difficult and deep problem. We are clearly not going to solve it based on the methodology we have at the moment. And so in part, the reason we're coming and talking to computer scientists is that there are large chunks of this problem where we need deep algorithmic help to move forward on it. But just to give you a sense for how nasty and messy the problem that we're playing with is, I've had sitting behind us a paper from 1965. This is a paper that in the field of media studies is pretty seminal. I don't know to what extent it actually makes its way into computer science departments; I'm guessing not nearly as much as it would in a good communications department. The paper is by Johan Galtung and Mari Ruge, and it was a study of four Norwegian newspapers over the course of about four years, 1961 to 1965. Galtung and Ruge literally took the morning and evening newspapers in Oslo and clipped out every single newspaper story that mentioned Cyprus, Congo, or Cuba. And then, with an enormous stack of about 1,200 stories that they'd accumulated over four years in front of them, they decided to try to figure out what is allowed to become news. This is a very fundamental question: of all the things that happen in the world on any given day, only a certain number end up being considered newsworthy. Only a certain number end up being amplified and getting the sort of attention that comes from showing up in a newspaper. This is a question that has extremely important economic implications. It's got extremely important political implications. There are lots of stories out there of both wonderful and tragic events that essentially end up being invisible because, in one fashion or another, they cease to be news. There's also the possibility of events that get far too much attention and far too much scrutiny by virtue of the fact that they may not be particularly newsworthy, but we're sort of addicted to them in one fashion or another; we can't stop looking at them. You can think of celebrity news. You can think of fascinations with the individual machinations of political campaigns. The question of what gets to become news is an interesting and sort of deep political one, and we think also one that you can start asking some quantitative questions about. The reason this paper was interesting is that in 1965 these two took a crack at answering this question quantitatively. They went through these articles and they figured out some very basic things. When you mention the leader of a country, news involving those high-status individuals is far more likely to get attention than news about ordinary people on the street.
Stories that involve large, powerful nations like the United States or the Soviet Union were far more likely to get mentions than simple interactions between Cyprus and Greece, for instance. If you had the prime minister of the UK visiting, there was a much greater chance of it becoming news. So based on extrapolating from analyzing that text, they were able to put forward 13 rules, which, while they may not actually be methodologically supported by the data, turn out to be a pretty good way of understanding how news gets shaped. And in fact, our understanding of news still owes a lot to this paper. I've spent the last six or seven years sort of screwing around with this paper and trying to find different ways to ask questions about it. I put out a paper in 2003 using a brain-dead stupid methodology that doesn't work very well, which basically involves going to Google News, searching for the name of a country, asking how many stories Google News has about it, and then mapping that. Anyone here who's ever done any algorithmic work can come up with 20 reasons why that shouldn't work. You're right. All 20 of them are good reasons why it doesn't work. You can still make some interesting generalizations from it. What I was able to demonstrate roughly in that paper is that media attention in a lot of US media sources roughly tracks GDP per capita. We are far, far more likely to read news about wealthy countries than about poor countries. The simplest way to understand this is to look at Nigeria versus Japan. Nigeria shows up as a very gentle shade of pink in this map. Japan shows up as a significantly darker shade of red. The darker the red, the more intense the attention; the darker the blue, the lighter the attention. Japan and Nigeria have roughly the same population, both about 130 million. They're both enormously strategically and geopolitically important. On any given day across US media, there's about an 8-to-12-to-1 ratio of news about Japan to news about Nigeria. Now, it's interesting: this actually shows up very, very differently in different media. It gets down to about 2 to 1 in British media, in part because of the long involvement of the British Empire in West Africa. So there are historical factors that tend to shape this. But within US media, you can make some interesting generalizations about how income works in here. If you're interested in this, you shouldn't read my stuff, because it's methodologically problematic. You should read Denis Wu, who's over at Northeastern. He's done a really nice job on newsroom and news-gathering factors, where he looks at where AP and Reuters and all of these other news organizations are deployed, and he's able to demonstrate a very tight correlation between where the newspeople are and how reporting happens. The work that I've done on this was around Google News, around the BBC, around the New York Times. What you're looking at here is Google News. One of the interesting things that I was able to do early on was look at Technorati, which at that point was a pretty good analog for what was happening in the most popular blogs. I saw a very, very similar pattern, in some ways an even more pronounced pattern. There was a lot of cyberutopian celebration around the idea that we were going to get a more diverse view of the world out of blogs. I'm guilty of some of that, since we've been building this network of blogs from the developing world.
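To make the country-attention experiment described above concrete, here is a minimal sketch in Python, assuming you already have per-country story counts from some news index. The counts and GDP figures below are illustrative placeholders, not data from the talk; the idea is simply to compare attention against wealth on a log scale.

```python
# Hypothetical sketch of the 2003-style experiment: count stories per country,
# then compare attention against GDP per capita. All numbers are placeholders.
import math

story_counts = {"Japan": 1200, "Nigeria": 130, "Brazil": 450}       # stories mentioning each country
gdp_per_capita = {"Japan": 33000, "Nigeria": 1500, "Brazil": 8000}  # USD, illustrative only

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

countries = sorted(story_counts)
# Work on log scales, since both attention and income span orders of magnitude.
attention = [math.log(story_counts[c]) for c in countries]
income = [math.log(gdp_per_capita[c]) for c in countries]
print("attention vs. GDP per capita:", round(pearson(attention, income), 2))
```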
When you look at a large set of those blogs, you see a very similar geographic focus and geographic attention. So these are really inept ways of doing this, and as I mentioned, they're methodologically problematic. Where we would like to get, and what Media Cloud is designed to do, is to take what we were doing here and go beyond geography to get into questions of what the topics are and who the individuals you focus on are. Can you make large generalizations about what certain types of media are paying attention to or not paying attention to? What this means, at a certain point, is that what we'd really like to have is the ability to take a media source like the New York Times and give you a nutritional information label for it. Over the past week, here's what the New York Times has focused on. Here's what the New York Times has not focused on. Here's how it compares to some of the other players in the space. We would also like to be able to build this into your web browser. So we'd love to be able to, voluntarily, let you look at what you're looking at on the web and give you an indication that all you've looked at over the last day is recaps of the Packers-49ers game, and you haven't actually found anything out about Latin America. Again, voluntarily. I don't want to mandate this on anyone's Firefox, but it would be nice to have as a plugin that can go in there. What we really want, more than anything else, is a way of understanding the interactions between new, emerging media and existing media: where are new ideas, new memes, new ways of framing a story coming from? I think we have a mental model of how news media works which essentially says we have wire services out there, we've got AP and Reuters, they report, their reporting ends up in mainstream newspapers, which may add some things to it and embellish, and then bloggers comment and fact-check. I don't know that that's true. Actually, I don't know that that's the existing cycle. We're starting to see other cycles come into play. One of the really disturbing ones that I've seen is that people start talking on Twitter, an emerging topic bubbles up on Twitter, and then the newspapers report on it whether they have any facts or not. So the question is, how do we eliminate the question mark out of this and start putting some arrows and weights in there? Who's feeding whom? How does the dynamic work? How do we go back and forth on it? So the ways people have done quantitative media analysis over the last decade or so fit into three general categories. Probably the best work out there is being done by the Project for Excellence in Journalism, at a site called journalism.org. They have something called the News Coverage Index, which looks at a set of 53 US media sources. That includes a couple of websites, some radio stations, some newspapers, a real swath between large and small press. And they literally have a room full of people, about half as many people as there are in this room, and they hand-code the data. What they're able to tell you on any given week is roughly what stories dominated the news hole, how much attention those stories got, and they can often tell you something about the tone. It's a very, very powerful project. If you're at all interested in media attention, you need to be subscribing to their newsletter; they're doing some of the best work out there. But it's agonizingly difficult to do. It requires enormous amounts of money.
There are commercial companies doing this as well. There's a group called Media Tenor, which you can hire to essentially do brand management for you. They will do a combination of automated search for your company or your issue, as well as manual classification of whether it was a positive or a negative mention. They can give you a general sense of how you're being perceived. These methods are highly accurate. They're incredibly flexible. You can ask fairly subtle questions: was the coverage positive or negative? Were people talking about Obama in this particular context? You can do this yourself, and in fact we do a lot of media experiments where you just sit down and hand-code 20, 50, 100 pieces of media. The problem is you get to a point where you just can't do it anymore. It's very, very difficult to get a large, statistically meaningful data set. You can outsource the problem and have lots and lots of different people work on it, but then you start bumping into questions of intercoder reliability. How do you get everybody using the same classification? How do you get rid of your collection errors? One of the biggest problems for us is that this is never real time. PEJ, by investing millions of dollars a year, can tell you what's going on about a week after the fact. And for their really big trends, they're telling you about three months after the fact. So we'd love to get to the point where we can do some of this stuff in real time. There is some gorgeous stuff being done closer to real time, not exactly in real time, and this largely is in the field of link analysis. There was a great paper in 2005 called "Divided They Blog" by Adamic and Glance, which looked at a set of hand-classified US political blogs and tried to get a sense for whether the left and the right in the American blogosphere are talking to one another. The way they did this is they looked at links to one another and links to third-party sites. What they found was a huge cluster of blue blogs linking to one another, a huge cluster of red blogs, and a very, very small bit of overlap between the two. Our friend and colleague Eszter Hargittai went back, repeated some of this research with a slightly different data set, and looked at that intersection where the left talks to the right. And the left usually says very rude things to the right, and vice versa. It's not the golden middle ground of democracy; it's actually the space in which we all yell at one another, when we actually bother to address one another. There's a gorgeous, gorgeous piece of work done by John Kelly and Bruce Etling, who's sitting in the room here, looking at classifying blogospheres based on who they link to. So not how one blog links to another, but linking to third-party sources. These are clusters of Iranian blogs, and what you're seeing here is the Iranian blogosphere clustering in four general topical spaces. You have a reformist group down here, which is the group that we all sort of thought we knew about in the Iranian blogs. We sort of knew there had to be a conservative group, but we didn't know very much about it. We actually were able to find out from this that there was a group called the Twelvers, who believe in the Twelfth Imam, who have their own cluster. And there's a group of mixed networks that are actually sort of hard to define and hard to say what they're all about.
In the top left there's this wonderful cluster that no one would have anticipated, which is Persian poetry, which turns out to be one of the most powerful forces in the Iranian blogosphere: people who are trading love poetry back and forth, some of which actually gets political and occasionally gets censored. If you do this for the US blogosphere, you end up with a knitting cluster, which I tend to think of as having a certain parallel to it: you didn't think it was there, but it turns out to be an enormously powerful cluster that you would only get out of doing a link analysis. What's great about link analysis is you don't hand-code. You set out a crawler, it goes, it figures out who points to whom. A link is a pretty unambiguous signal. You can use really large data sets. You can automate it. There are good tools out there that look at hubs and authorities and networks; you can throw those at your data and come out with some good generalizations. The problem with it is that the link is only one aspect of content. It doesn't help you at all with traditional media, because newspapers, for the most part, don't link. Even the ones that have moved online don't link particularly often. It works really well for blogs; it works really poorly for mainstream media. It only looks at the structure. It doesn't look at the content of the text, which means that you're looking at a small part of what's inherent in that. There's a real danger that you somehow conflate the link structure with an underlying social structure. I don't know that there's necessarily evidence, although I'm hoping Bruce will jump on me if I manage to get that wrong, that these guys necessarily hang out together. One of the other challenges is that after you've built these clusters, you do need someone who speaks Persian or Russian and so on to come in and tell you what these clusters actually are. Because what you're able to say is that these things are related to each other in terms of graph theory, but that doesn't necessarily tell you what they're about. So we're starting to get excited about things that are doing content-based analysis. The projects that are out there are pretty primitive. There are people who are taking Google News and trying to map the way that Google News clusters together stories, to do comparative attention to stories. There's a beautiful art project that actually gives you a sense for what you might do with a huge data set, called We Feel Fine, which is basically a specialized spider that looks for people saying "I feel" and then tries to figure out what people are talking about in terms of their emotions, and how it differs demographically between the people who are speaking. I think probably the best rigorous work being done on content at this point is being done by a group at Cornell called MemeTracker, which is looking at quotes, phrases like "you can put lipstick on a pig." They look at how a quote is used in full and then subdivided, to try to figure out how a story moves through the media. The theory being that when someone uses the full quote, that's a parent, and then you can look for children coming down off of it, and you can watch the diffusion through media.
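For readers who want the mechanics, here is a toy sketch of the parent-and-children idea just described: treat the full quote as the parent, generate shorter word-level fragments as candidate children, and see which sources carry which variants. The real MemeTracker system clusters mutated phrases far more carefully; everything below, including the documents, is illustrative.

```python
# Minimal sketch of MemeTracker-style quote tracking: a full quote and its
# shorter fragments are matched against a set of (source, text) documents.
from collections import defaultdict

def fragments(quote, min_words=4):
    """Yield the quote and its shorter contiguous word-level fragments."""
    words = quote.lower().split()
    for size in range(len(words), min_words - 1, -1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])

def track(quote, documents):
    """documents: list of (source_name, text) pairs; returns fragment -> sources."""
    hits = defaultdict(set)
    for frag in fragments(quote):
        for source, text in documents:
            if frag in text.lower():
                hits[frag].add(source)
    return hits

docs = [
    ("blog_a", "She said you can put lipstick on a pig, and the crowd roared."),
    ("paper_b", "The 'lipstick on a pig' line dominated cable news."),
]
for frag, sources in track("you can put lipstick on a pig", docs).items():
    print(f"{frag!r}: {sorted(sources)}")
```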
And their paper on MemeTracker actually had a very, very interesting finding suggesting that there's a very small subset of blogs that may actually be leading newspapers, which again goes counter to that theory that I posited, which has sort of become the mainstream belief of how blogs and newspapers work together. Content analysis is just becoming possible. What's nice about it is that theoretically you should be able to work with any sort of unstructured text. If you can do it well, you should be able to use large data sets and automate it. You should be able to do pretty visualization around it. There are huge downsides. It's inaccurate. It's never going to be as accurate as hand-coding this stuff, at least not in the near future. There are language constraints. All the systems we have that do this, that look at a text and say here are the entities in this text, here are the topics in this text, are language-specific. So even if you build a lovely system for English, it's a real challenge to get it to Russian, which is in fact the challenge we're working on right now. And now, more than a year into this project, we're discovering that even if what you're doing is duct-taping together components, it's an enormous, enormous programming investment. So we have put out a prototype system called Media Cloud. It's at mediacloud.org. The truth is, what you see on mediacloud.org is disappointing. It's an alpha that we put out a long time ago, and it doesn't give you a very good sense of what our system does. In a moment we're going to hand you over to Hal, who will show you what our system does. The very basic concept behind it is that we have a spider out there. It's not a spider, I'm sorry: we are subscribing to tens of thousands of news feeds out there, RSS feeds. So we're going to newspapers and finding all of the RSS feeds they have. We're going to blogs, we're going to any media outlets we can find. We're subscribing to their RSS and Atom feeds. We're pulling in all those stories. We're actually getting the full stories, which is pretty non-trivial, because you're usually getting only a snippet out of the RSS. We're going to the URL, we're stepping through the pages, retrieving all of it. We're extracting the story text out from all the cruft. So we're pulling out all the formatting data, all the navigation data, and we're actually getting to the story text. We're tossing that towards OpenCalais from Reuters, which is a pretty good entity extraction system. We're also using other systems to try to get term extraction and topics out of it. And then we're dumping it into a database. At this point, that's largely all we're doing. And the reason behind that is that while we have huge research questions, we're trying to be the middleware layer. The middleware that we're trying to be is: we want to provide terabytes of data coming out of media, and tools to pull this in and do some analysis on it. But neither of us is a communication scholar. So we'd really rather let communication scholars figure out how to build experiments and ask questions about this, whether they're journalists, journalism critics, or communication scholars. We're also building a system that has a bunch of little parts in it, some of which don't work very well at this point. And we very much want to work with the computer science community to try to figure out how to make those things work.
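As a rough illustration of the pipeline just described (subscribe to feeds, fetch the full stories, extract the text, tag it, store it), here is a minimal Python sketch. This is not the Media Cloud codebase; extract_story_text and tag_story are placeholders standing in for the extractor and the OpenCalais/term-extraction steps discussed later in the talk, and the feed URL is hypothetical.

```python
# Sketch of a feed-to-database pipeline, under the assumptions noted above.
import sqlite3
import feedparser   # pip install feedparser
import requests     # pip install requests

def extract_story_text(html):
    # Placeholder: the real system strips navigation/ads via HTML-density scoring.
    return html

def tag_story(text):
    # Placeholder: the real system calls OpenCalais and other term extractors.
    return []

def harvest(feed_urls, db_path="mediacloud.sqlite"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS stories
                  (url TEXT PRIMARY KEY, title TEXT, text TEXT, tags TEXT)""")
    for feed_url in feed_urls:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            # The feed usually carries only a snippet, so fetch the full page.
            html = requests.get(entry.link, timeout=30).text
            text = extract_story_text(html)
            tags = ",".join(tag_story(text))
            db.execute("INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?)",
                       (entry.link, entry.get("title", ""), text, tags))
    db.commit()

if __name__ == "__main__":
    harvest(["https://example.com/rss"])  # hypothetical feed URL
```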
But for a sense of how this actually works at this point, I'm going to hand it over to someone who actually works on it. So I'm going to start off by showing you the end results, I hope. What I'm going to show you first is the tail end of what we're getting out of all the work we're doing, and then I'm going to go back and step through how we're generating those results and all of the good and the not-so-good things we're doing. So what we've built is what we're calling our dashboard. The idea here is you can type in any of the news sources we're covering, say the New York Times, and you can get a sense of the kinds of language that the New York Times is using. So this is a little bit interesting. You can see broadly what kinds of things they're talking about. This is for the week of October 12, which is when the playoffs were, so you can see they're doing a lot of talking about baseball. And then, if you're wondering about the specifics of any of these things, if you're wondering who Lidia is, you can dig down in and see the specific sentences that were used to discuss that term. And, how should I put it, you're running this live on your laptop? Yeah, so this is a development system which is running off of my little USB drive here, so there might be waits of a few seconds. So this gives us a sense of who Lidia is. I actually don't even know who he is. He's a manager, okay. And then this links off to the stories, so you can look at the actual stories, and I'll give you an example of how that's helpful in a second. In a large sense, what we're trying to do with this dashboard system is trying one method of going from all the raw data that Ethan described we're collecting to helping people draw some sort of conclusions about the substance of what's in the stories. We can do this not only with individual sources, but also with sets of sources. So we can take the list of the top 10 US mainstream media and get the same sort of thing; we can get a sense of what they're talking about as well. It looks pretty similar to the New York Times. And this over here is just a trend line to see, over that week, what they were talking about and how it changed. And then the other thing we can do is, since we have all these lists of words, we can throw them into a clustering system. What this is basically telling us is how these sources group together according to the kinds of words that they use in their stories. And what we get here is a couple of pretty hard-journalism sources, and then we get two clusters, which are mainly the local newspapers, who write a lot about sports, and the non-local newspapers, who don't write as much about sports. We can also do this for the new media. This is going to take a second to load. This is a list of a thousand blogs. One of the challenges we have is just figuring out which media we want to track and how to make sense of them. I'll talk in a little bit about what that means. One of the questions is: what are the top 10 US mainstream media? What we've used for this list is some data from Google about which of these sites are getting the most traffic. But it's always a difficult question just to identify those lists. For this popular blogs list, what we're using is a list from Bloglines. Bloglines has a list of their most popular feeds. So it's not at all perfect, but it's one pretty good snapshot of the most popular blogs, in the US at least.
And so this is where it starts to get pretty interesting. You see that this set of words looks really quite different from the mainstream media. You have some of the same political stuff, like America and Obama. But then you also have a lot of tech stuff: Google and internet and software and Microsoft and Twitter. And then you have all the blogs that talk a lot about what they love, so love always comes up a lot. And then the photo stuff: you always have big clusters of photo stuff. And then if we cluster this, here's where we start to get some really pretty interesting results. The first thing you find is that the US popular blogs list actually contains about 12 Spanish-language blogs. So that's just a good test case that your clustering is working at a base level. And then we start to get some interesting clusters. You have this Google cluster, which is a cluster of people talking mostly about Google. And you can dig into this stuff and get a sense of, for instance, how Google Sightseeing is talking about Google. You can go to the page for a specific source. I picked this one at random, which I shouldn't have done. The reason it has a crappy little cloud here is that it probably only had one story for the week, so it didn't have much to work on. If we go to, say, TechCrunch, that'll give us more interesting results. And then again, we can dig into individual sources, so we can get a sense of: in what sense is this covering Google? In what sense is this about Google? Is this an artifact or is it real? TechCrunch obviously seems pretty real; it talks about Google a lot. And then we see this sort of expected political cluster here, lots of people talking about Obama. One interesting thing here is that when you're looking at the actual content at this level, you end up with all of the right and the left stuff together, because they're all using broadly the same kinds of language. At least at this level of clustering, the right-versus-left distinction doesn't pop out too much; what pops out is political versus non-political. And then, very interestingly, we get two clusters here, one for quilting and one for knitting. So who'd have known, right, that of the top 1,000 blogs, 50-some-odd of those blogs are about quilting and knitting? And if you dig through these, they do all seem to really be about knitting and quilting. And if you add those together, you get about as many blogs in the quilting and knitting groups as you have in the Obama group, the politics group. So we're already starting to get some pretty interesting results about what the composition of these groups is. We also have a list of the long-tail blogs, which is just a random sample of 1,000 blogs from roughly all of the blogs in the US that we got from the Spinn3r API. Spinn3r is a company that tracks blogs. And another thing that we can do is compare sets of sources to one another. So after my pitiful laptop thinks for a minute, what's going to come up here is a set of comparative word clouds, where on one side we have the words that are more likely to appear in the popular blogs than in the random long-tail blogs, and on the other side we have the words that tend to appear more in the long-tail blogs than in the popular blogs.
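As a hedged illustration of how comparative word clouds like the one about to be shown can be computed: for each word, compare its smoothed relative frequency in one source set against the other, and keep the words that lean most heavily to each side. The counts below are toy values, not the real popular-blog and long-tail data, and the exact scoring Media Cloud uses may differ.

```python
# Sketch of a comparative term ranking between two sets of sources.
from collections import Counter

def comparative_terms(counts_a, counts_b, top=3, smoothing=1.0):
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for w in vocab:
        p_a = (counts_a.get(w, 0) + smoothing) / total_a
        p_b = (counts_b.get(w, 0) + smoothing) / total_b
        scores[w] = p_a / p_b          # >1 means "more typical of set A"
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top], ranked[-top:]

popular = Counter({"google": 40, "twitter": 25, "obama": 30, "love": 5})
longtail = Counter({"god": 30, "teacher": 20, "love": 40, "obama": 10})
more_popular, more_longtail = comparative_terms(popular, longtail)
print("popular-blog words:", more_popular)
print("long-tail words:", more_longtail)
```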
So on the left, you see something that looks pretty similar to what we saw for the raw feed, except the technology stuff really pops out a lot more. So there's a lot of representation of technology in the popular blogs, but not as much if you go out into the long tail. And then you have this really interesting list on the long-tail side. They talk about God a lot. They talk about education, teachers, teaching, and students a lot. They talk about love a lot. And again, there's all this comedy stuff that pops out that's not on the other list. So this, again, is giving us a pretty good snapshot: by going from this raw list of text and words we have, we're actually able to make some interesting conclusions about how the content of these sets of sources differs from one another. So those are the sort of results that we're getting now. I'm going to move through the rest pretty quickly, because we're taking a lot of time, but I'm going to step through each of the components, what we're doing at each stage, and what's working and what's not working. And I hope that you'll pepper me with questions, criticism, and suggestions afterwards. So there are a number of challenges here. One is that there's lots of media in the world, and we are a medium-scale system. We're an order of magnitude above what you can do with hand coding. But as a project of a research center at Harvard, we're never going to be of the Google News size, where we're tracking millions of sources. We're always going to be in the tens of thousands of sources, most likely. So that means we have to make choices about what we're going to cover. And it turns out it's very hard to make choices about, for instance, what set of 1,000 or even 10,000 blogs is going to represent all the blogs in the world, or what set of media you choose to represent mainstream media. Even when you have those lists of sources, it's non-trivial, especially for the mainstream media, just to collect the feeds from the sources. One of the main things that makes this project possible is that all these sources publish RSS feeds, so it's very easy for us to get the list of stories that were published, once you have the RSS feed. But it turns out that for virtually all mainstream media, there's no single feed for all the stories from that source. There's no one New York Times RSS feed; there are 200-some-odd New York Times RSS feeds. So there's some challenge in writing code to go out and capture what those 200 feeds are. And then there are additional hard, substantive questions about what constitutes the New York Times. Is Freakonomics part of the New York Times, or is it just a separate blog? Those are questions we just have to address and struggle with. The other part of our system, which is probably the most straightforward part, is our crawler. The crawler just goes out and gets the RSS feeds from each of these sources every few hours and then downloads all the URLs it finds. The one little bit of extra work it does is that it tries to do paging. So if there's a little next-page link at the bottom of the story, it tries to get the subsequent pages as well, so that you don't just get the first page of the New York Times story, you get all 10 pages.
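For the curious, here is a small sketch of the paging behavior just described: fetch a story URL, look for a "next page" style link, and keep following it. The heuristic of matching anchor text containing "next" is an assumption made for illustration, not the exact rule the crawler uses.

```python
# Sketch of multi-page story fetching, assuming a "next" link heuristic.
from urllib.parse import urljoin
import requests                      # pip install requests
from bs4 import BeautifulSoup        # pip install beautifulsoup4

def fetch_all_pages(url, max_pages=10):
    pages, seen = [], set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = requests.get(url, timeout=30).text
        pages.append(html)
        soup = BeautifulSoup(html, "html.parser")
        # Follow the first anchor whose text mentions "next", if any.
        next_link = next((a for a in soup.find_all("a", href=True)
                          if "next" in a.get_text(strip=True).lower()), None)
        url = urljoin(url, next_link["href"]) if next_link else None
    return pages

# pages = fetch_all_pages("https://example.com/some-long-story")  # hypothetical URL
```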
The next thing we do is, once we have all that content: when you have a New York Times story, you end up with not just the story, but a lot of cruft around it. We think of cruft as navigation and ads and other stuff that's not the text of the story itself. Ideally the full text would be in the RSS feed, and for most blogs it is, but for virtually all mainstream media, all you get in the RSS feed is maybe a sentence or two of description. So we have to somehow figure out how to strip out all those ads, and especially the navigation, from the text of the story. Because otherwise you'll think that two thirds of all New York Times stories for the next four years are about Obama, just because almost every page includes a link mentioning Obama. So we have a system that does this extraction. This is an example of a story that we've downloaded about Iraq and oil investment. If we look at the actual story, it looks like this. You can see it's got all this cruft around it. This is what we downloaded. We ran it through our extractor and we ended up with this: a big block of text. The way that we do this is we have a number of different signals. The core signal we use is the HTML density of each line. What that basically means is that for a sentence like this over here, which is a link that we don't want, there's going to be a lot of HTML around it to format it and make it a link, and not very much text on that line. For a line like this, there's going to be lots of text and not very much formatting. So as a very broad signal, that does a pretty good job of telling us what is the substantive text and what's not. But that alone doesn't do a good enough job, so we end up having to add a bunch of other modifiers. You can actually go to one of these stories and see this, our little extractor page, which tells us what the system is doing for each line. These purple lines here are the ones that we're extracting. The score right here is the raw HTML density, and the score here is the modified density with all of our other modifiers added. Examples of the modifiers: very long lines, in absolute terms, are more likely to be substantive text; lines that are far away from a previous line of substantive text are penalized, so the further away a line gets from something that looked like text, the less likely it is to be real text. And the final thing we do is a similarity test: we look at how similar a line is to the text that we do get for the title and the description in the RSS feed, and if it looks very similar to that text, we discount it. So this is an example of a line that looked like maybe it was not substantive text, because if you look at the text of the line over here, you see it's got some links in it, so it's got a fair amount of HTML. But once we applied all of the modifiers, we scored it correctly. This is also our training system. These checkboxes over here say whether or not a line should be included as substantive text. So what we do is we design the algorithm in a generic sense, and we have a set of about a thousand stories that we've manually gone through and marked up like this. And then every time we make modifications, we re-run against the training set to see how accurate it is.
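Here is a toy version of the density-based extractor just described: score each line by how much of it is plain text versus markup, boost very long lines, and discount lines that look like the RSS title or description. The thresholds and modifiers are illustrative guesses, not the tuned values from the training set mentioned above.

```python
# Sketch of line-by-line HTML-density extraction with a similarity discount.
import re
from difflib import SequenceMatcher

TAG_RE = re.compile(r"<[^>]+>")

def line_score(line, rss_description):
    text = TAG_RE.sub("", line).strip()
    if not text:
        return 0.0
    markup_chars = len(line) - len(text)
    density = len(text) / (len(text) + markup_chars)   # share of the line that is text
    if len(text) > 120:                                # very long lines are probably body text
        density += 0.2
    similarity = SequenceMatcher(None, text.lower(), rss_description.lower()).ratio()
    if similarity > 0.8:                               # looks like the feed snippet, discount it
        density -= 0.3
    return density

def extract(html, rss_description, threshold=0.5):
    kept = [TAG_RE.sub("", line).strip()
            for line in html.splitlines()
            if line_score(line, rss_description) >= threshold]
    return "\n".join(kept)

html = ('<div class="nav"><a href="/">Home</a> | <a href="/sports">Sports</a></div>\n'
        '<p>Iraq signed a series of oil investment deals on Tuesday, officials said, '
        'in a move analysts described as significant for the country.</p>')
print(extract(html, "Iraq signs oil deals"))
```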
The biggest problem with this system now is not that we miss text; we capture almost all of the text we're supposed to capture. The problem is that we're including text that we shouldn't be including. There are a lot of edge cases of text that looks like substantive text but isn't really. So that's been our challenge going forward. After we get all of this text, we have a generic tagging system. We can send the text into any set of tagging modules, and it comes back with a list of tags. One of those tagging modules is Calais. All we did was connect to the Calais web service and send it this text, and it sent us back a bunch of tags that it thinks the story is about. So it thinks it's about New York and the Islamic Republic of Iran and electricity. And then we can also send it through other tagging modules if we want. We have another module, our New York Times Topics module, which is just a simple dictionary matcher: the New York Times publishes a list of three or four thousand news topics that it thinks are interesting to the world at any given point, and we just try to find those terms in the text of the story. You see it ends up being a higher-quality but much smaller list. And it's much better for coverage of domestic news than international news, which again you see here. Yahoo has a similar service to Calais: you ship it some text and it gives you some tags, but in our testing it's not as good; it includes more nonsense, stuff like "vigilant guardian." And then the other thing that we do, which we're actually using more for the analysis we're doing right now, is we create word vectors from all this stuff, which just means that for every story we download, we break it up into individual words and keep a list of how many times each word appears in each story. That's important because there's a great deal of natural language processing work that's based on just having those lists of words that appear in each story. That's, for instance, how we do the clustering: it's based on those word vectors. And then the final part is the clustering, which is a great example of us just plugging in stuff that's already there. We use a toolkit called CLUTO, a very good toolkit that has lots of knobs and switches: half a dozen different algorithms to run the clustering with and different metrics to judge similarity by. So we basically just plug into this CLUTO system. We can do any clustering run by giving it a set of stories and a start and an end date and running it. And part of the nature of clustering is that it's always iterative: you twist knobs, basically, until you get results that look good. That's always the sort of dirty secret of this stuff. So you can see I just spent a lot of time iterating over these lists of clusters to try to get clusters that seem to make sense. One example of the limitations is that with the toolkit we're using, you have to tell it how many clusters you want. You have to say you want 10 clusters or five clusters or three clusters. So again, you just run it, and sometimes it seems like it's got stuff mixed up, or it's putting stuff in arbitrarily, and that's just something you have to live with.
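The talk describes building word vectors and handing them to a clustering toolkit. As an illustration of the same two steps with a more commonly packaged library, here is a sketch using scikit-learn's TF-IDF vectorizer and k-means in place of CLUTO; note that you still have to pick the number of clusters up front, which is exactly the limitation just mentioned. The source texts are toy strings.

```python
# Word vectors plus clustering, sketched with scikit-learn as a CLUTO stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sources = {
    "techblog":  "google android software microsoft internet twitter apps",
    "politics":  "obama senate election health care vote congress",
    "knitting":  "yarn knitting pattern quilt stitch wool sweater",
    "morepol":   "obama white house policy election campaign debate",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(sources.values()))   # one row of word weights per source

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(matrix)
for name, label in zip(sources, kmeans.labels_):
    print(f"cluster {label}: {name}")
```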
I just want to jump in and talk for maybe two or three minutes about what this might mean for people who are using this for media research or political science research. This is just a really early experiment we did with some colleagues over at George Washington University. They were asked by colleagues at the US Institute of Peace what we could do, as far as analyzing media, to understand media coverage in conflict environments. And so what we did is we took a fairly narrow data set. We took a month's worth of data. We looked at seven newspapers and 17 political blogs, a very, very small subset of what we have, and that's just because we wanted to focus on how we transformed and visualized the data. We could easily scale this up and look at a much, much larger set. The goal was to look at three specific conflicts that were taking place during that period of time: two of which were active conflicts, and one which you can think of as a stalled conflict, the ongoing situation in Darfur. So this is a set that included the Iran election protests, it included the end of the Sri Lankan war, and we were also looking at Darfur at the same time. We started looking at this, and one of the first things we realized was that we needed a comparison for what was going on. First of all, in this set, the only lines in which you can actually see distinctions are the Iran coverage. Everything else simply flattens out; it's just orders of magnitude smaller. Even despite the fact that a 30-year-long civil war ended, you simply can't see it within the data. This becomes even more apparent when you bring in another story that was going on at the same time, which is Michael Jackson's death. The paper that we really want to get out of this in the long run is that we think we have definitive proof that Michael Jackson ended the Iranian revolution. We think this might in fact be a major finding. In fact, I would go further and suggest that you could look at Ahmadinejad as bearing some blame for the death of Michael Jackson, because it does appear to be what killed off the Iran election protest story. Again, all other stories end up paling in comparison to the size of these stories. It's interesting: we had expected to see, with Iran, a little bit of blogs leading mainstream media, because we saw a huge amount of information being picked up on Twitter and in blogs, and it was very, very difficult for newspapers to report from the ground in Tehran. We expected to see maybe a little bit more of a leading effect than we did, but we do see a little of it, with that blue line being blog attention compared to the pink line being newspaper attention. It looks like you can make a case that blogs were a little bit ahead of the story in terms of the intensity of it, followed by newspapers holding onto it perhaps a little bit longer as blogger attention fell off. This is normalized; there are problems with the data, but it gives you a sense for where we might want to go with this.
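A small sketch of the normalization behind attention curves like these: express each period's count of topic stories as a share of all stories that source set published in the period, so that a large set (newspapers) and a small set (blogs) can sit on the same chart. All numbers below are placeholders, not the study's data, and the exact normalization used in the experiment may differ.

```python
# Normalized attention series for two source sets, with placeholder counts.
def attention_series(topic_counts, total_counts):
    """Both arguments map a date string to a story count for one source set."""
    return {day: topic_counts.get(day, 0) / total_counts[day]
            for day in sorted(total_counts) if total_counts[day]}

newspaper_iran = {"2009-06-13": 40, "2009-06-20": 95, "2009-06-27": 60}
newspaper_total = {"2009-06-13": 2100, "2009-06-20": 2050, "2009-06-27": 2200}
blog_iran = {"2009-06-13": 30, "2009-06-20": 55, "2009-06-27": 20}
blog_total = {"2009-06-13": 800, "2009-06-20": 780, "2009-06-27": 820}

print("newspapers:", attention_series(newspaper_iran, newspaper_total))
print("blogs:     ", attention_series(blog_iran, blog_total))
```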
We're also able to drill down, again using these graphing techniques and similarities between words, to try to figure out specifically how people are framing stories within this set. And what's interesting is that when we used these tools to look at what words were used in a particular set of stories, we realized one of the interesting limitations of our tools: we were basically amplifying noise. We were putting a microphone specifically on stories that mentioned Iran, and then looking at words that were more common in blogs than in newspapers. What we get out of that are some things that are interesting and some things that are not. It turns out that "neocon" and "grandstanding" are noise: our parser isn't quite working, and the phrase "those grandstanding neocons" appears 40 times in our set. We should be tracking every sentence and only allowing one appearance of it, but instead we've amplified that noise. Other parts are interesting. Charles Krauthammer turns out to be pretty fascinating. You drill down on him and you find out that a comment he made gets mentioned and analyzed in six different blogs; it only makes it into one newspaper. So it's an interesting way to watch how the frame is changing between the two. With Sri Lanka, we actually got a pretty novel result. We expected that what we would see is that "responsible," in quotes, newspapers would cover Sri Lanka and bloggers would ignore it. Actually, what happened was that when peace was declared by the government, whether it is in fact peace or not, there was a pretty concomitant spike in both of them. What's interesting is that the newspapers came back and did a sort of analysis story two weeks later saying, okay, so what now? What's the future of Sri Lanka? We didn't see that picked up at all in the blogs. So we see a secondary peak which, drilling down into it, appears to be about the long-term implications: how did this change things? We don't end up seeing that within the blog data. The Darfur finding made me really happy. My thesis on Darfur was that we would find that bloggers were talking about it but it wasn't actually showing up in the newspapers, because nothing actually happened in Darfur during the period of time we're talking about. What's interesting is that Darfur has now entered the language as a metaphor. Drilling down, if you look at how Darfur gets mentioned, it's mentioned in the context of genocide in general, Congo, refugees, displacement; it's simply invoked. So it's no longer a news story specifically about Darfur; it's a way of bringing it in as an analogy. So look, this is a very superficial paper. It's a small amount of data. But looking at that question of how media might work differently in conflict environments, it gives you a sense for how you might be able to drill down, both in terms of general incidence of coverage and in terms of differences in the language. We have a tradition of ending Media Cloud talks by telling you what's broken. We started making this slide together over breakfast. We started with "everything," and then we started subdividing out from there. This turns out to be a hard problem on about half a dozen different levels. On the technical side of it, extraction is really difficult. Getting clean text out of formatted HTML turns out to be almost impossible. We're probably going to need to go to a sentence-by-sentence model so that we don't end up with repeated sentences all the time, because newspapers tend to repeat the same headline over and over, and if you can't get it correctly out of your extractor, it screws up your data. Clustering is really tough. The algorithms exist. They work. You can tweak them. But they have certain fundamental limitations. If you're clustering into 10 categories and an 11th topic comes up, you've got to restart your clustering, at least with the algorithms that we've been using.
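The sentence-by-sentence fix mentioned just above (so that repeated headlines and syndicated boilerplate like the "grandstanding neocons" phrase stop getting amplified) might look something like this minimal sketch: split stories into sentences, hash each one, and count a sentence only the first time it appears. The sentence splitter here is deliberately crude; a real system would need something smarter.

```python
# Sketch of sentence-level deduplication across a story corpus.
import hashlib
import re

def dedupe_sentences(stories):
    seen, kept = set(), []
    for story in stories:
        for sentence in re.split(r"(?<=[.!?])\s+", story):
            sentence = sentence.strip()
            if not sentence:
                continue
            digest = hashlib.md5(sentence.lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(sentence)
    return kept

stories = [
    "Those grandstanding neocons struck again. Markets fell on the news.",
    "Those grandstanding neocons struck again. Tehran stayed quiet.",
]
print(dedupe_sentences(stories))
```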
The big stuff we'd love to do: we want to be looking at things like meme discovery. That example with Krauthammer is actually a pretty interesting piece of meme discovery; we were able to find that bloggers were talking about Krauthammer in a way that newspapers weren't. You could imagine following the word Krauthammer, and the context around it, over a period of time and looking to see whether it got amplified or not, whether other media picked it up. We actually think we're starting to have a system where you can look for people reframing an issue. We often see in this data set a blogger saying, if you want to understand this, you have to understand the story of x. And then you look for x over time. Sometimes x gets picked up by another blogger or by a newspaper. Ninety-nine times out of 100, x just disappears into the ether, and what you're watching there is a meme die before your eyes, which is sort of exciting. But we're trying to figure out how to do that in a way that's not manual, where you can actually go through it and look at the whole ecosystem. I was joking with Harry before that what we really want help on is getting from that list of tags that we're generating out of these automated tools into some sort of a hierarchy. How can we then say, okay, here are the 40 specific tags, but also here's "oil," which is probably the best way to sum this up? That matters because, if you want to do that media nutritional label we talked about at the beginning, you want to be able to say 20% of your stories were about oil, or at least mentioned oil. There are also non-technical sides of this that get incredibly difficult. We've been able to get as far as we have not just because we have completely brilliant engineers working on this, but because we have brilliant engineers who speak English. And it turns out that if we're going to rebuild this system for Russian, for Arabic, for Chinese, we need a great deal of linguistic data, and we really need to be working with programmers who speak those languages, because there are a thousand knobs that need to be tweaked and twiddled to get there. There are enormous, massive legal concerns associated with this. We are functionally making copies of a thousand newspapers every day, and at some point someone's going to wake up and decide that this might have some copyright issues. We really wanted, when we did this, to release all of our data, and we've been advised by our lawyers that that is probably not a good idea. So we're now moving towards an API plan where we essentially say, come use our data; we're going to monitor what you're doing with the API; if you're republishing the New York Times with your own ads on it, we will cut you off, but we'd love to let you use it for other reasons. The other one that I'll just mention in closing, and then really we will shut up and let you ask some questions, is dark matter. We can track stuff that we can spider and scrape and subscribe to. We can't track Facebook. We can't track links that are forwarded from one person to another via email. These turn out to be, we think, incredibly important vectors to understand in the media ecosystem. So we've got a good start. We have tons of problems to work on. We would welcome your help on it. And that's where we're at. Web Ecology is loosely affiliated with Berkman. It is a set of researchers, some of them in Cambridge, some of them with Harvard affiliations, some of them just fascinating independent researchers working on things. They are very much also interested in these questions of what quantitative data can tell us about what's going on. They put out a really nice paper on Iran and Twitter. From what I understand, they're now working on a question that sounds frivolous and isn't, which is: why did people turn their Twitter icons green, and when did they stop?
And it's a really interesting question about social action, distributed social action. So they are beloved fellow travelers. Is there any sort of formal relationship, or in what way do you collaborate? The word "formal" and the word "Berkman" tend to work poorly together. We know them. We hang out with them. We like them. What are the differences between the two? They don't have a system. They are grabbing data sets and then working away at them. So for instance, they actually use some code that I've written to spider Twitter. Twitter is something we'd love to have in the system in the long run. One of the things I should say is that we made a point of trying to deal with open data within the system. One of the things that we really cared about is that we're out there to build an open resource. So we've released all the code under a free license. You can go rebuild what we're doing by pulling down the code and rebuilding the database. But we also understand that that's bullshit, because it takes you a year to build the database that we have, and it's an enormous amount of computational resources. That's why we're now trying to put out the data via the API. Because of that design decision, we've also tried to steer away from collections of data where people come and say, oh, you're Harvard, we'd love to work with you, let us give you some data. Twitter is one of those cases. You can get all of Twitter's stream data if you either pay them or beg them. We've stayed away from that, because if someone else is going to replicate our work, they may not be able to pay, or they may not be able to beg. Twitter actually just announced its first meaningful revenue, $4 billion from selling information about their system, which includes access to all of their feeds. So given that that now appears to be part of the business model, for now we're probably going to stay at arm's length. But yeah, they used some of my scripts. They've now refined them. They're collecting tweets on specific tags and then doing analysis. We're certainly sharing ideas back and forth. We'd love to get to the point where the Media Cloud system could include Twitter as well. What's tricky about it is that with 140 characters, doing things like clustering or doing textual analysis gets pretty tricky. Thank you. A question about the relationship with the newspapers that you mentioned, and the potential for them to be concerned about it: do they consider you their friends, or do you think you've discovered anything about them that they don't know about themselves already? So what's been interesting is figuring out who's interested in this system. People who study journalism, folks like the Columbia Journalism Review, get fascinated by this, because it's a way to get from anecdote to data and do very different analysis on it. There are some news sites who are fascinated by this. We've had a lot of very good conversations with Public Radio International. It happens that their president is also totally data-driven and geeky, and she loves the idea of being able to monitor their coverage, compare it to other networks out there, and get a sense of what they're strong at and what they're weak at. One of the many missteps of this project is that we launched it at the same time that American journalism is collapsing. So trying to get people interested in covering things better or worse doesn't work real well when everyone's response is, we can't do anything anymore, the sky is falling.
And so it's been very, very difficult to get people to pay attention to this. What people do want to do is ask questions about how journalism is changing as a business and an industry. So for instance, a colleague at USC Annenberg said, could we use Media Cloud to study what happens when journalists start paying attention to how many people look at a story? That's a great question. All these newsrooms are now refocusing on this question of how much web traffic they generate. Does that change anything? Does that change your coverage? If someone's got that in-newsroom data, we should be able to provide the outside data, and we should be able to compare them. Another thing that this potentially lets us do is answer the question of what happens when the Boston Globe is at half its strength. A broad way of describing what we could do with these sorts of comparative media source analyses is to ask, what does the Boston Globe add to the world, right? And the converse of that, which is what happens if the Boston Globe goes away. The truth is, for some of us, what we really want is a strong Globe. One of the things that we're considering doing: the New York Times has actually been very friendly about this and said, do you want to run this on our archives? The open-data side of us makes us say no, because their archives aren't as open as we'd like them to be. The researcher side of us sort of goes, well, hell yeah. One of the questions that everyone I hang out with would like really good data on is: what's the percentage of international to domestic to local coverage over time? We know through Paul Starr's work that in the early 1800s, we were running at 80 to 20 international to domestic. We know that it's shifted, probably to 20 percent or less international. Did that happen all at once? Did it happen gradually? If we could throw that at the New York Times from 1850 to the present, now, we'd have to write huge amounts of stuff to figure out that Prussia was not the US, so there are whole sets of rules for different moments in time, but that's the sort of analysis that we'd love to actually put out. I guess my question really was, why doesn't the Boston Globe want to know what the Boston Globe adds to the world? I mean, don't they have a commercial interest in understanding themselves in the way that you're making it possible to understand? Or do you think that if you say to them, your coverage of foreign countries is in proportion to their per capita GDP, they'd say, of course, we know we do that, that's the only way we get people to read our stuff? I mean, do they know it already? Can they use what they have, perhaps not for ends that you would want, but... So I've been in the business of confronting journalists about the shortcomings of their coverage for a while now. What ends up happening is you sit down with journalists and you say, you know, it's funny, you guys don't write very much about the world. And they say, yeah, I know, my last 10 stories on Mali all got spiked, right? So it's not my fault. And it's not even my editor's fault. It's my publisher's fault. And the publisher, if you're lucky enough to be in a position to talk to them and say, look, it's your fault actually, will say: it's you guys in the audience, and the reinforcement that we're getting is that you want hyper-local news.
You want your kid's name in the paper from the Little League game. You don't give a crap about what's going on in Equatorial Guinea, and every time we run a story on Equatorial Guinea, our circulation goes down. So, you know, don't tell us that we're screwing up; we're giving you what you want. Now, something in that cycle is probably dysfunctional. It doesn't make a ton of sense that in an increasingly globalized and interconnected economy, our access to media gets more and more hyper-local. You know, it would make sense after the petroleum economy crashes and none of us stray more than 10 miles from home. But at this particular moment in time, it seems like something about that reinforcement mechanism isn't working right. I'm hoping we can help answer some of those questions. But those are, you know, enormous questions; we're chipping little pieces off of the beast. Wouldn't that have journalists writing to the way the measurement works, and not necessarily the way the work itself should be done? If I thought that journalists were actually paying attention to my data, I would be more worried about it. But they're mostly paying attention to each other, so some of this behavior gets generated anyway. I actually have a suspicion; I've been looking at the New York Times for about six years now. One of the fascinating things with the New York Times is that you can pick any country, no matter how small, and in any given calendar year they'll run a minimum of one story. So you want to be able to ask that sweeping question and say, you've never written about Cameroon, and the answer is they run something on Cameroon once a year. And it's never like, here's the annual feature on Cameroon; it just sort of comes around. So I actually think they probably are, at the Times, where they have really smart data people looking at things like that. Is it a possibility? Absolutely. Any time you do surveillance, or sousveillance, and you have some sort of analysis of what your patterns are, you can have changes come out of it. But the way that I look at this, the reason that slide about nutritional information is up there, is that I think in general you'd rather have more information than less. If what's happened to the New York Times over the past 10 years is that it's become more and more of a local thing, and this was certainly one of the moves the paper has made, trying to be more competitive with the Post or the Daily News, well, that's an interesting thing to know: is it still the paper of record, still this international newspaper that we think it is, or is it actually shifting? I'd rather have the data out there. Is it going to have second-order effects? Sure. Absolutely. I'm willing to take that on. I'd be happy if we were successful enough that we actually had to worry about it. Let's see. I actually have no idea about the text processing, but as you look at the tools for extracting terms, are there tools out there that go beyond treating the document as a bag of words and terms, and do they seem suitable or not suitable for this? No. And that's one of the reasons we don't go too far with that. There are expensive tools to do that. There are hardly any usable tools to do that. There's a lot of computer science research, and a lot of tools that exist only in papers.
But as far as we have found, there's nothing at the duct-tape level of the tools we use. There's a lot of natural language processing stuff out there, but there isn't much that makes it easy to go from a bag of words to topics. That's been our biggest challenge in this stuff, and we haven't really figured out how to do it. The stuff we've done well: we do a pretty good job of living in language, in the questions of language and narrative, and thinking about what the topics are. And we can do a pretty good job if you already know what the topic is. So if you really want to look at Darfur, we can get a pretty good grasp of what the coverage of Darfur is. But what we haven't figured out how to do, as far as we've gotten, is to take some arbitrary set of stories and say, here are the 12 topics, or here are the 12 themes, in this set of stories.

Yeah, I think there's a lot of that. This is not to say there's not work being done. Again, our model is that we're engineers, and we have less than one full-time programmer working on this. So the work that we can accomplish is mostly to plug in big pieces that do a lot of work for us.

Is it available as a library, and is it open source? It's not open source. It's not open source at all, so there are real sustainability questions in this. Part of the reason for that, just so you understand, is that people have been selling document clustering and entity extraction systems for six to seven figures for years. Back when I was in industry, we bought some of these things. They weren't any better than the open source ones, which is to say they weren't very good. Only at that point we were out six to seven figures, which was fine, because that money was more disposable then. It is a problem for us now.

And what's really interesting about this, from our perspective, just in terms of learning from the failure: the reason we started this project was that we realized that between RSS, so that we didn't have to write spiders that didn't work, and Calais, where we could just take text, feed it to them, and they would come back with entities, we felt like for the first time we could take my geographic work, and even more than that, we could do the topic work, and we could give you that nutritional information very quickly. It turns out we've actually made very little progress on that. It's very, very hard to go from what Calais kicks out, which is too much. Calais also, despite the fact that it's free and relatively open and there are a lot of good things about it, under the load that we've been putting it through, tends to mash entities together and create new ones that don't really exist, and that doesn't work either. They haven't yet released weights on terms, which will be a big help for us once they do, because that would let us pick out the heavily weighted terms. We've actually been much more effective in doing the work backwards.
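To make the "if you already know what the topic is" case above concrete, here is a minimal sketch, not MediaCloud's actual code, of tracking coverage of a known topic such as Darfur by counting stories per week that match a hand-built keyword query. The story format, the field names, and the query terms are illustrative assumptions.

# Sketch: count stories per ISO week that match a hand-built topic query.
# The story dicts and the DARFUR_TERMS list are assumptions for illustration.
from collections import Counter
from datetime import date

DARFUR_TERMS = {"darfur", "janjaweed", "el fasher"}  # hypothetical hand-built query

def weekly_topic_counts(stories, terms=DARFUR_TERMS):
    """Return {(year, week): number of stories mentioning any query term}."""
    counts = Counter()
    for story in stories:
        text = story["text"].lower()
        if any(term in text for term in terms):
            year, week, _ = story["date"].isocalendar()
            counts[(year, week)] += 1
    return counts

if __name__ == "__main__":
    sample = [
        {"date": date(2009, 3, 2), "text": "Aid groups expelled from Darfur."},
        {"date": date(2009, 3, 4), "text": "Markets rallied on Tuesday."},
        {"date": date(2009, 3, 9), "text": "New fighting reported near El Fasher."},
    ]
    print(weekly_topic_counts(sample))  # {(2009, 10): 1, (2009, 11): 1}

The limitation the talk points to is exactly that this only works once a human has named the topic and written the query; it says nothing about discovering the 12 topics in an arbitrary pile of stories.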
For instance, how do the average bloggers talk, as compared to the elite bloggers, as compared to the mainstream media? We've been able to do more with word vectors, in part because that's what the libraries support, and that's what we've been able to do.

The other thing, and I'll be quick about it: all of the tagging that we're doing is entity extraction, and that means it's the easiest kind of tagging. The easiest kind of natural language processing is getting people, places, and organizations; you don't get ideas or subjects. So you can get that there are a lot of stories about Bernanke; you don't get that there are any stories about the financial crisis, except by doing stuff on top of those entity networks. And it turns out we've found it difficult to take that collection of entities, even when it's pretty good, and go back from it to these broader subjects. It's harder than we anticipated. The country stuff is different, because there's an obvious question in there. It's much harder to figure out what the questions and answers are for something like how many stories are about Bernanke.

Stuart, do you know of any open tools for this?

The area you're talking about, which is topic extraction, is extremely hot right now. A lot of people are doing different things, but it's probably not at the stage where it's packaged up into a nice little library that you just run. It's a very fast-moving field. I should say, the kind of people who do it tend to make their code available. So there are, for instance, packages that come to mind that do this, but I'm not sure they're packaged up into a beautiful SourceForge project, you know.

But this is sort of the conversation that we're looking to have, because, again, since we're approaching this as duct-tape engineers, something that's still in the research paper phase but that might solve our problem five or six months down the line is exactly the sort of thing we need to know about and that we're otherwise never going to know about.

So I can give you some of the things people think about. If you treat a document as a bag of words, you can say, well, divide these bags of words into 10 subsets, and call those the topics. The obvious issue is, where do you get the number 10? If the number is wrong, that can screw everything up: you'll get several topics mashed together, or what's really one topic split across 12. So one of the areas people are working on now is trying to do this in a non-parametric way, where you don't say divide it up into 10. You say divide it up into topics, put a prior on the distribution of the number of topics, and then figure out what's right from the data. Or you relax the assumption that every document is about one topic, which is probably not a good assumption. You say, what if there's a bunch of topics, and documents are generated by first selecting some mixture of topics and then selecting words based on those topics? You can get much tighter topics that way, and you don't have to assume that each document is about exactly one topic. Things like that. These are the kinds of things people are working on. You might expect to get better results, but it's not like Calais, where somebody just gives you an API.
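As a rough illustration of the mixed-membership approach described above, where documents are generated from a mixture of topics rather than assigned to exactly one, here is a hedged sketch using scikit-learn's LDA implementation on a toy corpus. This is a stand-in, not anything Media Cloud runs, and it still requires guessing the number of topics; the non-parametric variants mentioned above avoid that guess but are not shown here.

# Sketch: bag-of-words counts plus LDA, so each document gets a topic mixture.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # toy corpus, purely illustrative
    "bernanke fed interest rates bailout banks",
    "darfur sudan aid peacekeepers refugees",
    "fed stimulus banks credit crisis lending",
    "sudan darfur violence displaced camps",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # bag-of-words counts

n_topics = 2                                  # the parametric guess the talk complains about
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(X)              # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")                # most heavily weighted terms per topic
print(doc_topic.round(2))                     # each row is a mixture, not a single label

The point of the sketch is the shape of the output: instead of one label per story, you get a distribution over topics per story and a ranked word list per topic, which is much closer to the "here are the 12 topics in this set of stories" goal than entity extraction alone.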
I mean, one of the reasons we're developing this as a platform is the hope that eventually the people working on those kinds of problems won't have to worry about getting the text and how to shove it into their system. One of our dreams for this is to be able to provide it as a platform and say, here's a place where you can just plug in your clustering code. They can do a much better job than what we're doing, and it's valuable for them to be able to run it against real data.

Last question. Mine is more of a request. The dark matter of what we're seeing here is things like video and audio, and what gets reported is going to look different to people who get their news from video and audio rather than from text.

Absolutely. Once we start getting into video and audio, we're dealing with whole other classes of problems on top of the ones we're already calling hard. We start with the assumption that not only do we have text, but we've got digitized text with feeds. So we're already excluding things that don't have RSS, because spidering by itself is really painful. We're sort of betting on the idea that as more and more material gets transcribed, we can get better data out of that. The truth is, what we're mostly doing for now is comparison. So Alisa Miller, who is both the head of PRI and actually one of the better scholars in this field, is doing work on television attention, because her bugaboo is how little international coverage there is on television. So she does hand coding: she hires some folks, they sit in a room, and they go through a week's worth of television. She produces the data sets and the cartograms, and they're really ugly and scary. So it's still going to be point sampling on that until someone gets a really good algorithm that can listen to the radio, or to the audio of the video, and produce a transcript. As far as I understand, we're still pretty far from that.

They're just announcing that they're working on that.

Yeah, yeah. Well, that'll get us closer as we start getting automatic transcription coming out of it. Now, the trick with automatic transcription is going to be quality, and it's going to be pretty low to start with. That's certainly true of anything that's going to come out of it soon.

Okay, I think we're at the top of the hour. If anybody wants to stick around a little bit, come up to the front. I believe they're looking for people to contribute ideas, questions, and code, whatever you come with. Okay, thanks so much, it was great fun.