We're lucky to have Jure here today. He is at the beginning of what we predict will be a long and brilliant career teaching computer science at Stanford. Prior to coming to Stanford he was a postdoc at Cornell, where he worked on a really interesting tool called MemeTracker, which is a large part of what we're going to talk about today. His work, along with that of his colleagues Jon Kleinberg and Dan Huttenlocher, is some of the most interesting work being done right now on the frontier of quantitative analysis of media — which is to say: now that a great deal of journalistic and other media is available online, what can we do with statistical tools to analyze how the news cycle takes place? Prior to that he did his PhD work at Carnegie Mellon, and before that studied at the University of Ljubljana in his native Slovenia. So we are thrilled to have Jure with us. He's going to give us a talk that will start general and probably get technical until some of us wave our arms and tell him to stop, and then we'll have an open conversation about it.

Okay, thanks a lot. So the plan for today is three things. The motivating question for everything will be: if you can go on the web and practically get all the news that is out there, what kinds of things can you do? I want to show you three things. The first is how you can identify short textual phrases — memes, if you like — so that we can trace how they propagate through the web and ask how this media ecosystem works. That will be the first half of the talk. Then I'll show two more computer science questions. The first: all we see is how people mention information, but we don't see who influences whom. Think of how a disease propagates — you see when people get sick, but you don't see who infected whom. So can you infer the underlying network over which influence, information, and viruses propagate? And once you have the network and you know how things propagate, the last question is: can we identify who is really influential in this network? In the media case, for example: which news sites should I follow so that I'm most up to date, so that I hear about big news before everyone else does? That's the plan for the talk. This work was done partly when I was still a PhD student at Carnegie Mellon, part of it at Cornell, and the last part now at Stanford, with a bunch of co-authors.

Okay, so let's start a bit general. What I want to study is information in the media, which sits at the intersection of news, media technology, and the political process. Why is this interesting? There is a tension between the global effects of the mass media, which pushes information out to people, and the local effects of social structure — how people talk to one another. What I want to ask is: how does information get transmitted between the media and the personal networks that arise from social contacts?
That's the dichotomy: mass media pushing things out, versus the social, interpersonal networks that also play a role in how information spreads and is transmitted. Of course, with the emergence of the internet, blogging, and social media, this difference between global and local influence is somewhat evaporating. Blogs, for example, now fill the space between purely personal networks and pure media-style pushing of information: I can follow whoever I like on Twitter, I can read whichever blog I like, and so on. Another thing going on is that the speed at which the media report, and at which we discuss stories, has intensified. We no longer have a 24-hour publishing cycle where you buy a newspaper in the morning and learn about new things; you can constantly get a feed of what is going on.

All these notions get captured in something we intuitively call the 24-hour news cycle. It is somewhat hard to define, but it's some kind of natural period by which news appears and disappears. To show you how people understand the news cycle, here are two quotes from the New York Times during the 2008 US presidential election campaign. They discuss McCain and Obama, and they say how with every news cycle something is different, and how from one news cycle to the next the political context of the campaign tends to change. So this is how we intuitively think about what a news cycle is.

What I want to ask is the following: is this news cycle a metaphorical construct — something we humans came up with because it gives us an intuitive understanding of how things might work — or is it actually something visible in the data? Is it something I can quantify and measure? And if I can measure it, what are its basic properties? That's what I want to do.

If I want to quantify the news cycle, the first question I need to ask is: what are the basic units of the news cycle that I would like to track? Here are a few candidates — this is more for the computer scientists. One way would be to track how people link to one another, that is, to trace hyperlinks. If I write a blog post and point to some other blog post, there must have been a reason I linked to it; maybe we wrote about related things, or I borrowed information from the post I'm linking to. The problem is that there are very few hyperlinks: news media don't link among themselves, it's mostly bloggers who link to one another, and most often bloggers don't even create these links.
So tracing hyperlinks won't work. Another idea would be topic models — LDA-type topic models. This also won't work, because those topics are too bulky, too big: they are good if you want to model something that changes over tens of years, not news that changes on a daily or hourly scale. Here's another idea: extract named entities — Obama, McCain, Microsoft, Apple, Paris, France, and so on. The problem is that Obama appears in the news every day, maybe with varying intensity, but the same named entity appears every day, so it would be very hard to associate a particular named entity with a particular piece of news. That also wouldn't work well. The next thing I could do is extract common sequences of words. If you do that, the common sequences of words on the web turn out to be things like "made in China", "I love you", "web 2.0" — again not something I would associate with the news cycle. So those are four failed attempts.

What I would like are textual units that correspond to aggregates of articles, so they can summarize articles; something that varies on the order of days, that is very dynamic; and something simple enough that I can handle it on terabytes of data. That's my wish list. The plan is to identify textual fragments — call them phrases, or memes — that travel relatively intact through many articles: short textual pieces that are characteristic of a particular news story and appear pretty much unchanged across a series of articles.

The idea is very simple: let's use quoted phrases — stuff that appears between quotation marks. Basically, a regular expression is what extracts my memes; I'll just be following quotes. Why is this a good idea? First, quotes are an integral part of journalistic practice, so they appear a lot. Second, they follow iterations of a story: if somebody made a particularly strong quote or statement, it will appear in many different versions of the article, in the same news story written by different journalists — this little signature shows up in all of them. And third, with a quote I know exactly who said it, when they said it, and where they said it: I can attribute every little piece of information to a particular point in time and to a particular person. So that's what I'll be doing.
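(For the computer scientists, here is a minimal sketch of what "just following quotes" can look like in practice. The length bounds and the normalization are my illustrative assumptions, not MemeTracker's exact rules.)

```python
import re

# Pull out phrases that appear between quotation marks (straight or curly).
# The 10-300 character bounds are an assumption, not the exact rule used.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]{10,300})["\u201d]')

def extract_quotes(document_text):
    """Return normalized quoted phrases found in one article or post."""
    quotes = []
    for match in QUOTE_RE.finditer(document_text):
        # Lowercase and collapse whitespace so variants line up later.
        phrase = " ".join(match.group(1).lower().split())
        quotes.append(phrase)
    return quotes

print(extract_quotes('He said "our opponent is palling around with terrorists" today.'))
# -> ['our opponent is palling around with terrorists']
```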
So, the data I'll show you here is a relatively small data set that we got from Spinn3r: three months of data leading up to the US presidential election. We have about one million news articles and blog posts per day — basically everything that Google News has, which is about 20,000 news sites, plus 1.6 million blogs. So I'm working with roughly 90 million documents coming from these 1.6 million different websites, covering the period between August 1 and October 31. When I run the quote extraction, I get out 112 million quotes from these 90 million articles.

Now I have the quotes, and the challenge I want to solve is the following: these phrases, these quotes, change a lot. What I'm showing you here is a graph where every node is a quote and an arrow means that one phrase is included in another. This is how a phrase actually changes coming out of Palin's mouth — and this is actually an almost accurate model. Here is what came out of her mouth, about palling around with terrorists: "our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough, that he's palling around with terrorists who target their own country." That was the whole thing, and here you see all the different sub-parts of it and how they relate to one another: "palling around with terrorists who target their own country", "terrorists who target their own country", and so on. So the first challenge is that I get this bunch of quotes and I would like to group them into clusters that are mutational variants of one another.

(A question about what the arrows mean.) An arrow just means approximate inclusion: this phrase is approximately included in that one, that one in this one, and so on — as you go down, things get bigger. Here you can see the second part of the quote, about America being imperfect, and here the first part, about palling around with terrorists, and then these come together: here are the quotes that focus on the second part of the big quote, and here the ones that focus on the first part. (Is that flow related to syntax, or to time?) There's no time here; in some sense this is just edit distance. I'll show how I create this graph on the next slide.

Here is what I want to do. Imagine every letter is a word, and these are my phrases. I want to create edges, where an edge means: this phrase could have evolved from that particular phrase — basically, where could a phrase have evolved from? An arrow just means "I am approximately included in you; I'm roughly your subset." So this short phrase is a subset of that longer one, and also a subset of that other one, and so on. I create such a graph — these are exactly the edges in the previous slide — and I will first put weights on the edges that say how likely each phrase is to have evolved from each of its possible parents.
What I would like to do then is partition this graph so that at the end every node has a unique parent — I want there to be an "Adam and Eve" phrase, and a line of descent for how each phrase came to be. So I want to delete some of the edges such that the total weight of the edges remaining in the graph is as high as possible, with the constraint that if I follow the links from any node I end up at a unique parent. In this example, this would be a solution: I delete here, because this node has two possible parents — this could be a super-parent, or this could be the original parent — while over here everything is fine, because at the end everything evolved from this one phrase. I don't care whether it evolved this way or that way; if these are the edges I delete, everything is okay.

(Where did the weights come from?) Think of them as edit distance: how similar are we, could I have evolved from you, how big a change would it take. It's precomputed, based on domain knowledge of how the language works.

(There was another question: many people could arrive at the same short quote independently, from having seen different versions of a longer quote.) Right — at the end it's something Sarah Palin said. Here can be one part of "palling around with terrorists", and here another part, and when you come down to this short quote I don't really care whether you came this way or that way — it's still this particular quote. That's the motivation for requiring a unique root. (But this implicitly assumes the full quote is available somewhere.) Yes, it assumes that, which tends to be the case: if you get one million news documents a day, the full thing is usually there. (Is that a requirement for the algorithm, or can you simplify the graph without having a source quote?) The reason we formulated the problem this way is that these quotes are so short that you really want to keep this lineage information — otherwise things become very hard. Take "Joe the plumber": most of us think of a particular context, but if you go into this million documents a day, the first time you see "Joe the plumber" is way before the real Joe the plumber appeared, just because somebody really had a plumber named Joe who came to fix something. And then what should you do? That's why you want to approach it this way and say: "Joe the plumber" is this one thing, not your favorite plumber who came to your house.
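(As an aside, here is a minimal sketch of this clustering step. The real formulation in the paper is a harder graph-partitioning problem; this greedy version — attach each phrase to a single best parent and read clusters off the resulting forest — is a simplification, and the similarity scoring is my own illustrative choice.)

```python
from difflib import SequenceMatcher

def approx_included(short, long_, threshold=0.9):
    """Is `short` (approximately) contained in the longer phrase?"""
    s, l = short.split(), long_.split()
    if len(s) >= len(l):
        return False
    best = max(
        SequenceMatcher(None, s, l[i:i + len(s)]).ratio()
        for i in range(len(l) - len(s) + 1)
    )
    return best >= threshold

def cluster_phrases(phrases):
    """Attach each phrase to one best parent (here: the shortest longer
    phrase that approximately includes it), then read clusters off the
    resulting forest. O(n^2) comparisons -- fine for a sketch only."""
    parent = {}
    for p in phrases:
        candidates = [q for q in phrases if approx_included(p, q)]
        if candidates:
            parent[p] = min(candidates, key=len)

    def root(p):  # follow the lineage up to the "Adam and Eve" phrase
        while p in parent:
            p = parent[p]
        return p

    clusters = {}
    for p in phrases:
        clusters.setdefault(root(p), []).append(p)
    return clusters
```

On the Palin example, the short fragments would chain up through longer and longer variants to the full "palling around with terrorists who target their own country" quote as the cluster root.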
(What about spam?) That's a great question. We didn't really have to handle spam. The only thing we have is a stop list of around a hundred quotes that are movie titles and music titles — "Indiana Jones" and so on — which we filter out, and that's all. What we're left with at the end are these very clean political quotes, so we didn't really need to go into spam much. We had a few simple rules: one was that your total number of mentions has to be at most five times the number of different websites you appear on, so you have to appear across some diversity of websites — that kills the spam already. And then there was the short stop list of movie titles, because we weren't interested in them. Surprisingly, we didn't have any trouble with spam, but it's a good question.

Okay, so these are now my quote clusters; let me show you an example. This is "fundamentals of our economy are strong", and these are all the different variants — some of them — and this is the volume, basically how many times each was said. Some are good matches and some not as good, but they all belong to this "fundamentals of our economy are strong" cluster. Here is something bigger, where a person cited it — a quote of a quote. So what I get now are these chunks where I can say: all these phrases are variants of "fundamentals of our economy are strong". So now I have a way to take documents, extract these short phrases, and group all the mutational variants together.

Now let me show you some results on this data. The first question is whether you can see anything interesting at all. What I'm showing you here is time on one axis and the number of phrases I get per hour on the other, and the point I want to make is that it's pretty much constant. There is a weekly periodicity — this is the weekend, this is during the week — but overall there are no global trends in the data: the amount of stuff I see over time is constant. Basically, this says the bandwidth of the online media is constant — how much is produced is about constant over time, apart from the periodicity that naturally corresponds to the five-day working week and two days off.

(Could it be that the number of blogs and news sources is also constant? Were new blogs that appeared a month into the data set excluded?) No, we also get new material coming in. The number of blogs is slowly growing, but blogs are also dying, so all in all it evens out.

So the question is whether there are any interesting temporal variations, and the answer is yes, there are. What I'm showing you here is time from August 1 to the end of October, and these are the 50 largest-volume — the 50 most-mentioned — phrase clusters.
You can nicely see what happened: this was the conflict in Georgia, between Russia and Georgia; then the presidential campaign — we started with the Democratic convention, then the Republican convention; then "lipstick on a pig" comes; here is "fundamentals of our economy are strong", and again the economy; here are the attacks on Obama that didn't really work; "can I call you Joe"; and this is the last presidential debate, where one candidate was "spreading the wealth around" and the other was "I am not President Bush", if you remember; then the one about Sarah Palin; and then it ends. What's interesting is that even though a roughly constant volume of phrases is coming at me overall, I see huge spikes when I look at the most popular ones. I have a zoom-in here showing the most interesting part — how things came and how they went away. What's nice about this picture is that it is completely automatically generated: I didn't do anything by hand. These are exactly the phrases from my clusters; you press a button and this comes out. It's somewhat interesting that you can take 90 million documents and, at the end, say: look, this is what was going on.

(What about slight changes in wording — for example, "I was not President Bush" instead of "I am not President Bush"?) That's accounted for in the clustering. The way I create these edges, I'm using a string edit-distance metric: you can remove a few words, you can swap a few words, it doesn't really matter. There will be many different ways people said "palling around with terrorists" — long versions, short versions — and that's fine, as long as they are approximate subsets of one another. (What about negations — if someone said "not" something?) That I don't account for. Because I have so much data, I can afford to be very simple — and I also need to be able to work at this scale, so I can't spend too much time on every individual item. There are trade-offs: in principle I could be much smarter about how the clustering is done, and I think it could be greatly improved.

All right, so that's the first thing. What I'll show you now is a set of plots about the temporal dynamics. On the x-axis I'll have time and on the y-axis some notion of volume or popularity, and zero will always mean the peak time — the time when the thing reaches peak attention. What I'm showing here is average popularity over, I think, the top thousand phrase clusters. The thing to notice is that long before the phrase was said, there is some background volume; then right around the time it was said, I see almost a delta-function behavior. The blue line here is a fit of an exponential function.
The point I want to make is that an exponential function doesn't increase fast enough to model the peak: I have to model this with something that diverges at zero — it really shoots up, then decays again very quickly, and then it's back to noise. That was on a very long time scale; now let's look at just the period around the peak.

One thing I will do is label websites as news media or blogs, using a very simple categorization: everything that appears on Google News I call news, and everything else I call blogs. So I have 20,000 news sites on one side and 1.6 million blogs on the other, but if I measure the number of articles coming from each, I get about 44% of articles from the 20,000 news sites and 56% from the 1.6 million blogs — so in the end there's quite a good balance between news-media volume and blog volume. Again, I'm just using Google News's classification: whatever they index is news, everything else I call a blog.

What I'm showing you now is time in hours against fraction of volume: this curve is the popularity of phrases in the mainstream media, and this one is their popularity in the blogosphere. What this tells me is that the mainstream media tends to be ahead of the blogosphere by about 2.5 hours — the blogosphere follows what the mainstream media says with a lag of around 2.5 hours; that's the difference between these two peaks. So the picture this suggests is: news is produced by the media, and bloggers feed off it, chew on it for a couple of hours, and then say something.

Actually, I can go into more detail. I can look at every website individually and ask: when do you mention things, relative to the peak time? For this experiment I took the top hundred most popular phrases. This axis is the lag — negative lag means you are ahead of the peak popularity, zero is the peak — this axis is how many of the top hundred phrases the site mentioned, and each point is a website. I just said blogs trail the news media by 2.5 hours, but if you look at what's going on here, the sites that are really well ahead of everyone else are the professional bloggers — Talking Points Memo, Hot Air, Daily Kos — they're something like 15 hours ahead of the peak popularity. Then the mainstream sites come in, around 10 hours before the peak. So what this says is: you have the professional bloggers, then the mainstream media sites, and then millions of other casual bloggers blogging about things. That's what this picture is saying.
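(If you wanted to reproduce this kind of lag measurement, a minimal sketch might look like the following: compute each phrase cluster's peak hour separately for media and blog mentions, then take the median difference. The data layout and function names here are hypothetical.)

```python
from collections import Counter
from statistics import median

def peak_hour(hours):
    """The hour with the most mentions."""
    return Counter(hours).most_common(1)[0][0]

def media_blog_lag(phrase_mentions, news_sites):
    """Median (blog peak - media peak) over phrase clusters; a positive
    value means the blogosphere trails the mainstream media.
    `phrase_mentions`: phrase -> list of (site, hour) mention pairs."""
    lags = []
    for mentions in phrase_mentions.values():
        media = [h for site, h in mentions if site in news_sites]
        blogs = [h for site, h in mentions if site not in news_sites]
        if media and blogs:
            lags.append(peak_hour(blogs) - peak_hour(media))
    return median(lags)
```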
(Is there a distinction at Daily Kos between the front-page bloggers, who were part of the editorial staff, and the individuals who have their own postings there?) For this analysis, no — I was considering only the first mention on a particular site, because I'm interested in when the thing was first mentioned there. So probably what happens is that the diarists who are not on the front page mention things late, but since someone else on Daily Kos already mentioned it, I count that earlier mention.

The last plot of this type is the following. Again time, with zero at peak popularity, and the y-axis now measures what fraction of mentions at each time step come from the blogosphere: high means mostly the blogosphere is mentioning the thing, low means mostly the news media. And what you see is this heartbeat-like pattern. Long before anything happens, there is a background level at 56 percent — as I said, in our data 56 percent of the volume comes from the blogosphere and 44 from the news media. Then, at this point, bloggers take the majority of the mentions — these could be the professional bloggers. Then it dips before the peak, when the mainstream media takes over. Then I get another bounce back toward the blogosphere, about two hours after the peak — these are the normal bloggers — and then it settles back down. But note it was 56 before and about 57 after: the blogosphere tends to keep mentioning things longer. (Is this the real data?) Yes, this is real data, just smoothed with splines.

(In this relationship between bloggers and mainstream news media on quoted phrases, are there certain types of stories or memes that show more variation from this pattern than others?) Let me show my next slide, because it goes in that direction. What I can do is ask queries for phrases with a particular temporal signature. Given three numbers (x, y, t), I can say: give me phrases where between x and y of the total phrase volume occurred on blogs at least t days before the overall peak. So I can ask for phrases that appeared in the blogosphere long before they appeared in the news media. We did a very simple experiment where we set the fraction between 30 and 70 percent and t to seven days, and here are the phrases this simple query pulled out: this one is the global warming phrase that bloggers discovered and the mainstream media reported on later, and this one is the phrase about something being above someone's pay grade, also from the campaign. Using such simple queries, it turns out that around three to four percent of all these popular phrases started in the blogosphere and then made it into the mainstream media. So if you ask how much evidence there is for bloggers coming up with something that then becomes very popular in the mainstream media — it's around four percent.
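(The (x, y, t) query is easy to picture in code; a hedged sketch, with made-up data structures:)

```python
def blog_led_phrases(phrase_mentions, news_sites, peak_day,
                     x=0.30, y=0.70, t_days=7):
    """The (x, y, t) query from the talk, sketched: keep phrase clusters
    where between x and y of total volume occurred on blogs at least
    t_days before the overall peak. Data layout is hypothetical:
    mentions are (site, day) pairs; `peak_day` maps phrase -> peak day."""
    hits = []
    for phrase, mentions in phrase_mentions.items():
        early_blog = sum(
            1 for site, day in mentions
            if site not in news_sites and day <= peak_day[phrase] - t_days
        )
        if x <= early_blog / len(mentions) <= y:
            hits.append(phrase)
    return hits
```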
That's the last thing I'll say about this part. One thing I haven't shown you: we have a simple model of the temporal dynamics of these phrases, with the three ingredients you might expect to govern how popular a phrase gets. One is what we call attractiveness — how interesting this particular piece of news is. The second is age — the older you are, the less likely people are to talk about you. The third is popularity, or imitation — the more people talk about you, the more likely other people are to talk about you. And it actually turns out we can model these temporal signatures quite reliably without the attractiveness ingredient at all: we need popularity and age, but attractiveness doesn't seem to play much of a role in the very simple model we were playing with.
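(To make that concrete, here is a toy rendering of the popularity-plus-age idea — recent mentions drive new mentions, discounted by how old they are. The functional form and the parameters are illustrative assumptions, not the fitted model from the talk.)

```python
def expected_mentions(past, t, gamma=1.5, horizon=48):
    """Toy popularity-plus-age model: expected volume at hour t is a
    recency-weighted sum of earlier volume -- imitation (popularity),
    discounted polynomially by age. Illustrative assumption only.
    `past`: hour -> number of mentions observed at that hour."""
    return sum(past.get(t - d, 0) * d ** -gamma for d in range(1, horizon + 1))
```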
Okay, so this concludes what I wanted to say about the news cycle and how to track it. Now I want to show two more computer science things, if that's fine. I may rush a bit.

If you step back and ask what we are really doing: here is my world, my set of blogs, and all I see is when these blogs mention things — this blog mentioned a phrase, then somebody else mentioned it, and so on, and there's a new phrase that other people mention over time. What I don't see is how the phrase actually propagated: I can't say it started here, went to this blog, then to that one. I see when people mention things, but I don't see the links. So what I'll try to do now is infer this diffusion or influence network. I assume there are nodes in my network and there are edges, but all I observe are the nodes and the times when nodes get "infected". Imagine A posts a particular piece of news, and a bit later C talks about it, then D, then E. I call this a cascade, and all I see is that A said it at time one, C at time two, D at time three, E at time four. Then I see a different cascade that starts at C and somehow spreads through my network; again, all I see is its temporal signature. What I'd like to do is infer the edges of the network from such data. If you think of a disease: I see the times when people get sick, but I don't see who coughed on whom, and I would like to infer the edges over which the infection propagated.

So let me quickly show how you can formulate this problem and solve it. I'm given a cascade: it tells me who got infected and when — but not how. I can ask: for two nodes, how likely is it that node i infected node j? I'll use a very simple model here: the probability of infection depends on the time difference between when the two nodes got infected — the longer the gap, the less likely one infected the other. I could make this arbitrarily complicated, a rich model of how likely one person is to infect another based on time and other characteristics, but for now assume something very simple.

Given that, what I want to compute is how likely a cascade is if it propagates along a particular tree pattern T. If this is my network and the cascade really propagated from A to C, from A to B, and from B to E, then it's easy to compute the likelihood of the cascade under that tree pattern: I just go over the edges of the tree and multiply the probabilities that A infected C, that A infected B, and that B infected E. Of course, that's conditioned on knowing which edges were "guilty" for the infections. Because I don't know those edges, I have to consider all possible trees — all possible patterns in which these nodes could have infected one another. So in principle, to compute how likely a cascade is to occur in my graph, I have to sum over all possible propagation trees: the probability of the cascade under each tree, times the probability of that tree. I'll assume all trees are equally likely, so I can ignore the prior — but the problem remains that I have to sum over all possible propagation trees.

Even assuming I can compute how likely a cascade is in a given graph, I can define the following problem: select a graph with at most K edges such that my whole set of observed cascades is as likely as possible — what is the most likely graph over which these cascades could have propagated? Why the constraint of at most K edges? Because the complete graph is always the best explanation: if anyone can infect anyone else, you just order the nodes by infection time and that's the easiest explanation. If you want a graph that is not the complete graph, you need such a constraint — it's there for technical reasons. And there are two problems. The first is that computing the cascade likelihood is intractable, because you have to consider all possible propagation trees. It turns out there is a beautiful matrix-tree theorem that reduces this super-exponential sum to a cubic-time computation.
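(A hedged sketch of the single-tree term in that computation — one exponential transmission model, and the product of edge probabilities along a fixed propagation tree. The full method sums this quantity over all spanning trees, which is where the matrix-tree theorem comes in; that part is not shown.)

```python
import math

def transmission_prob(dt, alpha=1.0):
    """P(i infected j) given the gap dt between their infection times.
    Exponential decay in dt is one simple choice, as in the talk."""
    return alpha * math.exp(-alpha * dt) if dt > 0 else 0.0

def cascade_log_likelihood(infection_time, tree_edges):
    """Log-likelihood of one cascade under one fixed propagation tree:
    the product (sum of logs) of edge transmission probabilities.
    `tree_edges` is a list of (parent, child) pairs."""
    return sum(
        math.log(transmission_prob(infection_time[c] - infection_time[p]))
        for p, c in tree_edges
    )

# Example: A infected C and B; B infected E.
times = {"A": 1.0, "C": 2.0, "B": 2.5, "E": 4.0}
print(cascade_log_likelihood(times, [("A", "C"), ("A", "B"), ("B", "E")]))
```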
The second problem is that even if you can compute this likelihood, you still have to maximize over it. And here is another piece of magic: you can prove this objective function is submodular, which means it has a diminishing-returns property, which in turn means a greedy algorithm can find a graph that is provably near-optimal. I'm sorry I'm leaving out lots of details, but these are the two magical moments where everything works out.

Here's a small example. There is a true network with some edges, where an edge says this node can infect that node. I use, say, an independent cascade model to simulate a few cascades over this graph, and then I try to reconstruct the graph from the times when nodes got infected. The simple baseline I compare against is to compute, for every edge, its strength: over all the cascades, what was the probability that node u infected node v — that's the weight of the (u, v) edge. If you do that, this is what you get: these are the edges correctly inferred, and the red ones are mistakes. Focus here and look at what happens: a cascade starts somewhere, comes to this node, and then lots of other nodes get infected over time. What confuses the baseline is that if this node gets infected at time two, this one at time three, and this one at time one, then a shortcut edge gets created — you get these transitive edges because the shorter time delta looks more likely than the true two-hop path. But if you use our method, with all the tricks I showed you, you do almost perfectly.

(With the comment that a lot of the graph theory here is, like President Obama's quote, well above our pay grade — what does this mean for actually modeling the real world?) What I can show is a very small part of a network where I see when media sites say particular things, and I try to infer — if one medium is following another — who follows whom. In this small piece I've labeled blogs blue and mainstream media red; and yes, Huffington Post and Salon.com count as mainstream media, because they are indexed by Google News. But you can see nice things. Here is a political cluster — those of you who are more expert on this would probably find more interesting patterns here. This is gossip. And this is technology — Engadget, CNET, and things like that. So you nicely get these topical clusters based on who follows whom, and you also find sites that talk about a bit of everything and act as bridges between the different areas.
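(Pulling the last two slides together: the submodularity claim is what licenses a simple greedy search over candidate edges. A hedged sketch, where `score` stands in for the matrix-tree-based likelihood computation and is assumed given:)

```python
def greedy_infer_edges(candidate_edges, cascades, k, score):
    """Greedy edge selection justified by diminishing returns:
    repeatedly add the edge with the largest marginal gain in total
    cascade likelihood. `score(edges, cascades)` is the expensive
    likelihood computation, treated here as a black box."""
    chosen = []
    for _ in range(k):
        base = score(chosen, cascades)
        gain, best = max(
            (score(chosen + [e], cascades) - base, e)
            for e in candidate_edges
            if e not in chosen
        )
        if gain <= 0:
            break  # no edge improves the explanation any more
        chosen.append(best)
    return chosen
```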
So that's the first of the two things I wanted to show. The second is the following. Again, this is my blogosphere, and now that I have the network I can say: this piece of information appeared here and propagated this way; another piece of information propagated that way; and here's a third. Now I can ask: which things do I want to read to be most up to date? Here's one option: read this blog. What's good about it is that I get to hear about all three different things — red, blue, and yellow. What's bad is that I hear about them very late: they started long before I see them there. Or I could read this other one: I hear about things exactly when they appear, but I never hear about the red thing — I get the blue and the yellow, but not the red. So now I can ask: if I want to follow, say, three blogs, which three should I follow so that I cover the blogosphere as well as I can — follow this blog because it covers the topics appearing in this part of the blogosphere, this one for that part, and so on? Again, this is hard to do exactly, but you can do it.

Here's how the algorithm behaves on another real example, where every dot is a blog, and I'll show you what happens as you read more blogs. The first thing the algorithm says is: read this blog here. With it you'll detect pretty much everything, but when information starts over here, it takes quite a long time to reach you. So the second blog it selects covers this part of the blogosphere; then the third, fourth, fifth, and so on, and each color tells you which selected blog is covering which region.

I can quickly show you how well it works. This axis is the number of blogs I'm reading, and this is the fraction of stories I'm detecting, so higher is better. If I read random blogs, this is what I get. If I read the blogs with the most posts — the hundred highest-volume blogs — this. I could also read blogs by the number of out-links: the more out-links you have, the more likely I am to read you. Or blogs that receive the most in-links — that works better. But if you actually solve the optimization, this is how good our solution is in terms of what you need to read versus how many stories you cover. (And this list of who was most influential is from back in 2006, so it doesn't matter now.)
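(The selection step is the classic greedy max-coverage pattern; for coverage-style objectives like this one, the greedy solution is guaranteed to be within a factor (1 − 1/e) of optimal. A minimal sketch, with a hypothetical data layout:)

```python
def pick_blogs(blog_coverage, budget):
    """Greedy max-coverage: pick the blog covering the most cascades,
    then the one adding the most *new* coverage, and so on.
    `blog_coverage`: blog -> set of cascade ids it mentioned (early
    enough to count, under whichever objective you chose)."""
    covered, reading_list = set(), []
    for _ in range(budget):
        best = max(blog_coverage, key=lambda b: len(blog_coverage[b] - covered))
        if not blog_coverage[best] - covered:
            break  # nothing new left to cover
        reading_list.append(best)
        covered |= blog_coverage[best]
    return reading_list
```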
So, the last slide. What I showed you is a framework for tracking memes as they propagate over the web, for quantifying what's going on, and the nice algorithmic consequences that follow. We have a website called MemeTracker where there are demos, and all the data I showed is there, so you can download the MemeTracker data and play with it. And there are many further questions. One very important one is: what are we missing? We are extracting these quotes, which gives us something, but we are missing lots of other things — what kind of biases are we introducing? The second is how this can help identify, for example, the dynamics of polarization: are there political camps; if there is a particular quote, does it get split up, with one part of the network picking up the first half and the other part the second half? And one thing we're trying to address with these diffusion and influence networks is how these memes actually spread between people. Okay, so I'm done — thank you. (I'm guessing there are more questions in this room, so who wants to jump in first? Go ahead.)

(What would happen if you didn't restrict yourself to quotes? Memes could be found anywhere — if you used sentences as the baseline unit and ran the same algorithm?) It would take a lot longer, because you have a lot more data, but you would end up with clusters of sentences, where, say, this sentence is talking about the Troubled Asset Relief Program because it's using a lot of the same words. (It's clearly a different dynamic, because with quotes you're dealing with people referring to the same event, where something fixed was said. I just wonder how general the clustering approach is — as a topic-modeling kind of approach — when you apply it not just to fixed quotes but to arbitrary sentence segments.) In principle, yes, you could do that; the question is whether you'd retire before the computation was done. But here's what we noticed: these things work really well on about half a year of data. When we tried it on a full year, there was just too much background noise and the clustering — this very simple version — started to break down. So you really need to worry about time: your temporal signature needs to be similar to your parent's, you have to appear at about the same time; and you'd want to be much smarter about how you define the connections in the graph. But in principle you can do this over anything. One thing we're trying now is to do it over tweets: tweets are these 140-character things, so treat each one as a sentence and see if there's copying or mutation going on there.

(Any other questions?) (I was wondering whether you ever came to a final definition of the phrase "news cycle" — one of the things you started out talking about. I'm interested in this because I worked for ten years for the Associated Press in the pre-internet era, when the news cycle was something very real and concrete, and we toiled under it all the time.)
Very specifically, yes. One thing I haven't shown you: you can take the temporal signatures of the phrases and ask what the typical classes of signatures are, and it turns out there are six of them. Three of them describe how quickly a phrase rises and decays: one is very narrow, one is symmetric but very wide, and one rises quickly and decays more slowly. Then there's one with a small spike on the first day and a big spike on the second; one that's the same thing flipped; and a last one with a spike followed by a very slow decay over three or four days. We also labeled websites into about seven categories — professional blogs, normal blogs, news agencies, television, newspapers, and so on — and you find very nice characteristics: if a news agency pushes something out, it's very peaked, and then you see two behaviors, one with a very slow decay and one with a very quick decay, and things like that. So you can find these nice correlations between how things spike and who mentioned them when, and you do see very strong 24-hour signatures.

Another thing I should mention: together with the Pew Project for Excellence in Journalism, we worked on coverage of the economic crisis. It was very interesting — they were really interested in whose phrases, whose quotes, got mentioned with regard to this particular topic: was it President Obama who got cited the most, or some economist, and so on. We got some very interesting results, and the domain experts got very interested in the methodology itself and found it quite useful, which I thought was very nice.

(I'm fascinated by how this shows the spread of information in terms of where you can go get it, but I don't understand your work to be so inclusive as to capture when people actually engage with it. Do you know of anyone working on that — the idea that there may be a news cycle by class of news consumer? Not everyone sits in front of a desk all day with always-on internet; people's lives have their own natural news cycle.) I think you're making a very good point: we only see everyone who's connected. We don't see people who are not connected, who are not blogging. So yes, we are biased in that direction. (No criticism intended — I'm just wondering if there's other work.) (I think there's some good work looking at this more from ethnography, because this is really an analysis of when publication takes place. I'd point you toward something like the AP study that looked at young news consumers and did a detailed ethnographic analysis.)
(What they were able to show is that people are actually looking for information continually through the day: there's a morning feeding, whether that's TV news or a newspaper, but then people keep going back and reloading throughout the day. One thing that would actually be really interesting is to graph news reporting or publication frequency — which is essentially what your data gives you — against news consumption as sketched by these ethnographies, and see whether the two actually parallel each other.) (Or something probably a little simpler: adding traffic or audience data from the sources you're analyzing — for example, click-through data from the media sites, so you can see the number of clicks on an article.) Yes — unfortunately, as we know, that's the holy grail and it's nearly impossible to get, but it would be a very nice way to ask: okay, it got mentioned, but how did it actually get popular?

(A question over here: how unusual is the period you're studying, in terms of the dominance of quotes? Election periods are all about quotes — people making speeches.) That's an excellent question. We also looked at other three-month periods, before and after, and we still found spikes; they were a bit lower and much more diverse — the iPhone release gets quoted a lot, and later there was the Obama inauguration speech, and so on. So yes, this pre-election period is a bit specific, but even later it's not flat: you find these phrases that take over, maybe not in the proportions of "lipstick on a pig", but they're there.

(Along with the popularity of memes, have you been looking at their sentiment — for example, whether people discuss them in a positive or negative tone?) I haven't, but I think that would be a great idea: here's the meme — what's the sentiment around it? (I think you have the data, and I would love to see that.) Yes, we could do that.

(You had the graph that showed the number of blogs under the different methodologies of coverage. Have you looked at how an aggregator like Google News fits in, in terms of coverage?) What I can say is this: I know roughly what this baseline is doing — it's the number of in-links — and it does much worse. The reason is that when I optimize, I also consider overlap: if there are two very popular blogs that cover the same topics, I will pick just one. That's the difference. (Google does that in some sense — it says "there are 576 others with this story". I just didn't know how good their coverage would be; I assume they're using a purely algorithmic approach.) Yes, they're using some very clever clustering approach. I think what you're asking is a good question, and I don't have a good answer, partly because it's not clear how you would measure it.
(It would probably just be Google News as a data point — I'm just curious where they'd end up on that plot.) (Part of what's tricky about this is that what's going on inside Google News is a really nice algorithm that collapses thousands of stories down to a story, and how you do that collapse is itself a somewhat open question. This is one approach to it; whatever is in their black box is another; Media Cloud has a bunch of different approaches to it. Figuring out how you do that collapsing in fact determines how you score something like this, and it's hard because to a certain extent we're competing against a black box.) (No, I wasn't looking for a horizontal-versus-vertical comparison — just a data point.) I would think Google News would be very close to here, because what it gives you at the end is links to the New York Times, Bloomberg, the Wall Street Journal, and so on. And it does things based on individual stories, not on clusters of stories over time — whereas what you'd like is not just to know about a particular story but about all the different stories that are out there. So I think it would land about here, because what gets linked the most is mostly the big media sites.

(On this blog-selection work: how resilient is it as things change over time? You're using a particular snapshot of time — will it be the same ten blogs six months from now, or someone else, because this one happens to be talking about the iPhone?) This is an excellent question. Basically: I use historical data to decide which blogs to read, but what I really care about is the news that happens tomorrow — so how do I generalize to the future? We have ways to do that; if you do it naively, it's problematic. Another subtlety is what you put on this axis: the number of blogs I read, or the number of posts I need to read? If you count blogs, you are biased toward blogs with lots of posts. If you minimize the number of posts read, you end up heavily overfitting: you select all these tiny blogs with three posts that just by chance were early in one particular year. It turns out you can very nicely interpolate between these two extremes, and doing that gives you good generalization. Another heuristic that generalizes well is simply to exclude small blogs — require at least some daily volume. We went into quite a lot of detail on generalization, and there's another formulation of the problem that gives more robust solutions, which I didn't mention. So we can do it, but if you're not careful, you overfit.

(Any other questions?) (I think this is a dumb question — I probably missed it — but what is "our solution" on that plot?) "Our solution" is this: if I only have three blogs to read and I want to know as much as possible, which three should they be?
Any other questions? Yes. This is probably a dumb question and I missed it, but what is your solution?

Our solution is this. If I only have three blogs to read and I want to know as much as possible, the idea is that you select these blogs greedily: you say, okay, this is what this blog covers, and then you ask, which new blog should I add to cover the most new area?

So what blogs specifically are those, or what news sites? What's my data set? Should I be reading Talking Points Memo, the New York Times, and Huffington Post?

If you have three blogs you can read and you want the most information, then, on the 2006 data set at least, you pick these top three, and if you want six, you pick the top six. The way the algorithm works, it first picks the blog that covers the most, and then it picks the next blog that covers the most new area, subject to whatever is already covered; it always picks something that covers the most new stuff. So if you want the top k, you just pick the top k from this list.

And it takes into account when they posted as well?

Exactly, yeah. There are, again, three different formulations of what we really want. One is to say I want to minimize the time between when the stuff was first mentioned and when I get to hear about it. Another is to say I don't really care when I detect it; I just want to know that it happened. And the last is to say I don't care how late I detect it; what I care about is how many people get infected after me. The one we experimented with the most is: when I detect something, I want there to be a lot of mentions afterwards.

So how is this different for the 2008 election?

We haven't run this on the 2008 data. One thing I should say is that the way we started looking at this problem was a completely different application. You have a city water distribution network, pipes and houses, with people drinking the water, and you assume that at some of these junctions a contamination can happen, and the poisonous water spreads through the network. The question is where you should put sensors, monitoring stations, to detect these contaminations. What I'm doing here is the same thing: here is my blog network, I have these information contaminations, and I ask where to place my sensors to detect them. The penalty, or the reward, depends on how long it was between the introduction of the contamination and my detecting it, or on how many people I saved; that objective basically says, after I detect it, how many other people would get infected, and the more people I save from getting infected afterwards, the better. It turns out it's the same problem, and that's how we came to look at this.
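As an editorial illustration of the greedy procedure and the "mentions afterwards" reward, here is a minimal sketch. It assumes each cascade is a time-sorted list of (blog, time) mentions; the function names and data layout are hypothetical, not the actual implementation behind these results.

```python
# Greedy blog selection for the "mentions afterwards" reward: once a
# cascade is detected by some selected blog, the reward is the number
# of mentions that still follow, so early detection of big cascades
# scores highest. Cascades are lists of (blog, time) pairs sorted by
# time; all names here are hypothetical.

def reward(selected, cascade):
    """Mentions occurring after the first selected blog posts."""
    for i, (blog, _time) in enumerate(cascade):
        if blog in selected:
            return len(cascade) - i - 1  # mentions after detection
    return 0                             # cascade never detected

def greedy_select(all_blogs, cascades, k):
    """Pick k blogs, each time adding the one with largest marginal gain."""
    selected = set()
    for _ in range(k):
        base = sum(reward(selected, c) for c in cascades)
        best_blog, best_gain = None, -1.0
        for b in all_blogs - selected:
            gain = sum(reward(selected | {b}, c) for c in cascades) - base
            if gain > best_gain:
                best_blog, best_gain = b, gain
        if best_blog is None:
            break
        selected.add(best_blog)
    return selected
```

Because each cascade's reward is a maximum over the selected blogs, marginal gains can only shrink as the set grows; that diminishing-returns property is what makes the pick-the-top-k reading of the greedy procedure well founded.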
Yes. Have you tracked which variant of the meme, so if you have some sort of quote, I'm guessing it doesn't spread in the same form as it initially appears, have you checked which variant is the one that becomes the most popular?

We looked at this a bit, but not enough, and I think that's another very interesting thing one could do: actually try to somehow understand what the catchy parts are. I have a bunch of future-work things here, and one is this. Here's a graph: the quote is "lipstick on a pig, but it's still a pig," and the width of each line tells you the number of mentions of that part and where it appears. So the most popular part of this quote was, you know, "lipstick on a pig," and this was the second most popular, the third most popular, and so on. This gives you some idea of which subparts get mentioned more than others, and I think it would be very interesting to try to model and understand such things, but we haven't done much with this, so I think that's another interesting idea.

All right, just a technical question: if there is no quotation, does your method still stand? For no-quotation news, for example in Chinese newspapers, there may be a lot of news with no quotations; they just mention that someone said something, with no quotation marks. How would you adjust to that kind of context?

The way we do it right now is that we only take quotations, so we are super dumb about it. Of course, you could start with a small seed set and then expand, searching for similar things outside quotation marks and so on, so there are ways you could get around it. But what I showed you really just starts with things that appear in quotes, and given that we have such a high volume of news, we can afford to do that. So there was a question here, and then, okay.

So you've inferred this distribution tree, the infection tree or whatever you want to call it, with this nice optimization, and all the math worked out beautifully for you, and so you've been able to find these sites that are good sources. But you don't have any empirical data by which you can judge whether these are in fact the paths by which the stuff would flow from one news source to another.

That's a great question. One thing that we did was the following. I know the hyperlinks, right? I know when a particular blog links to another particular blog, and I also see what phrases each blog mentioned. So the experiment we did was that we threw away the hyperlinks, so you only see what people mentioned, and then we asked: can we infer who linked to whom? And we can do that quite well, so that was one way to do the validation on real data. Another thing would be Twitter: on Twitter, because you know who follows whom, you know the links over which information can propagate, so the question again would be, if you see what people talk about, can you infer the Twitter social network? That would be another way to validate. What we also did is lots of experiments on synthetic data, and there, regardless of the network, our break-even point is in the nineties, so it works amazingly well. Of course, on synthetic data there is variance, but if the thing follows your model, everything is good. When we also tried it on the real data, with this hyperlink comparison, it did quite well.
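As a toy illustration of that validation, hiding the hyperlinks and guessing who repeats after whom from mention times alone might look like the sketch below. The window-based co-mention score is a deliberately simplistic stand-in for the likelihood-based network inference the talk refers to; all names are hypothetical.

```python
from collections import defaultdict

# Toy version of the validation experiment: phrase_mentions maps each
# phrase to its time-sorted list of (site, time) mentions. A candidate
# edge (a -> b) is scored by how often b mentions a phrase within
# `window` time units after a did.

def score_edges(phrase_mentions, window=1.0):
    scores = defaultdict(float)
    for mentions in phrase_mentions.values():  # sorted by time
        for i, (a, t_a) in enumerate(mentions):
            for b, t_b in mentions[i + 1:]:
                if t_b - t_a > window:
                    break                      # later mentions are further still
                if a != b:
                    scores[(a, b)] += 1.0      # b echoed a shortly afterwards
    return scores

def predict_links(phrase_mentions, top_k, window=1.0):
    """Rank candidate edges; compare the top ones to held-out hyperlinks."""
    scores = score_edges(phrase_mentions, window)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

Precision and recall of the predicted edges against the held-out hyperlink graph then give a break-even point of the kind quoted above. And once edges carry probabilities, asking for the most likely propagation path reduces to a shortest-path computation over edge weights of minus log probability.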
We have time for one last question, so go ahead. Okay, yeah, I'll do my best. I guess, bouncing off of your comment, it's impossible to find the real empirical or objective links between pages, so you're using maximum likelihood to find, like a hidden Markov model as it were, the abstract connections between these pages; you're using probabilistic links, asking what the most likely path between them is.

Sure, yeah. I think there is sort of no ground truth, and there's also the question of what the interpretation of this network is, in the sense that all I'm saying is that my links mean this particular site tends to repeat after that particular site.

And it's been shown it does that very well.

Yeah, exactly, but there's no ground truth or objective truth, so I agree with that. Places where you could get such ground truth would be, for example, Twitter, but even there, what people found in this network is that information tends to jump a lot. It's not that things start somewhere and then nicely spread; they just sort of pop up at different places, because there is everything else around, the rest of the world that also does a bit of diffusion on its own, and then you just see things pop up. So that's another consideration.

Well, Yuri, thank you very, very much, and thanks for taking all these questions.