So good afternoon. Thank you to all of you who came here. I'd like to welcome Professor Jiawei Han on behalf of all our CSE faculty and students, and it's my great honor to introduce him as our CSE Distinguished Lecture Series speaker. Professor Jiawei Han is the Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. His research covers data mining, information network analysis, database systems, and data warehousing, with over 600 journal and conference publications. He has served on the program committees of most major international conferences in data mining and database systems, as a member and as program chair or co-chair. He also served as the founding editor-in-chief of ACM Transactions on Knowledge Discovery from Data and is serving as the director of the Information Network Academic Research Center supported by the US Army Research Lab. He is a fellow of the ACM and a fellow of the IEEE, and received the 2004 ACM SIGKDD Innovation Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society W. Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. He is the author of the book Data Mining: Concepts and Techniques, which is widely used as a textbook worldwide. Today he's going to talk about mining heterogeneous information networks. Thank you. OK, yeah, thank you. Thanks for the nice introduction. I actually drove down here, but not directly from UIUC, because it's a long drive. I drove down starting this morning from Notre Dame, because when Computer Science at Michigan invited me, I was thinking, first, we no longer have a direct flight to Detroit, and second, if you transfer through Chicago, the whole trip takes at least about seven hours. So instead of seven hours flying and transferring, I just drove; it took maybe six hours.
Plus, on the way I could visit some research friends at Notre Dame; they had been asking me to give a talk for a long time, and I never promised a certain day. So that's how we arranged this Friday; the other talk was Thursday. So this is almost the second talk in two days. As for this building, it's a very nice building. I actually came here, I think in 2006 or so, hosted by some professors mainly in the bioinformatics program. I came to this building, gave a one-hour talk, and only stayed about an hour and a half; that's why this building still feels fresh to me. It's very nice, thank you. So my discussion today is on mining heterogeneous information networks. It's very exciting, and it's also a relatively new field. The contributors are mainly my PhD students. The first one, Yizhou Sun, is graduating this year; probably 80 or 90% of this talk is from her work. Several other students contributed as well: Ming Ji contributed on the classification part, and Xu Wang contributed on the role discovery part. Xiaoxin Yin contributed quite a lot in the past, but he graduated and joined Microsoft Research three and a half or four years ago. He actually received the ACM SIGKDD doctoral dissertation award for the initial work on information networks, but this talk does not include his work; I just acknowledge him because he did the initial part. OK, so let's discuss this. The first thing I want to discuss is why we want to explore the power of structured heterogeneous networks. What's the difference from typical social network analysis? I'm going to discuss all of this, and then I will go over several very interesting methods we developed. The first thing is, everybody knows social networks. I don't need to introduce them; I'll just show you lots of logos. These are Facebook and all the rest; everybody knows them.
So if you look at this, besides Facebook and Foursquare and all these things, you have Twitter; they are all different, with very different flavors, but I'm not going to introduce any of them, though I may use some of them in examples. Overall, I should say social networks opened a new forum for many, many people to jump in, not only from computer science and information science but even from the social sciences; many, many people are jumping into this area. But what we mainly discuss here is what we call heterogeneous information networks, which are different from pure social networks. On Facebook, friends link with friends: every entity is a person, and you link with people because you like them; of course, if you hate them, you won't link with them. So those are friendship links, but the entities are all persons. If you think about the Web, as Google sees it, pages link with pages. This is also homogeneous, because they are all web pages, and a link is just some kind of endorsement. What we discuss instead are multi-typed, structured heterogeneous networks. Let me give you one example: a medical network. You have patients, doctors, treatments, diseases, even DNA, all linked together. But they are different: even when a doctor links with a patient, they play very different roles. Doctors linking with doctors, or patients linking with patients, are semantically different from doctors linking with patients. To that extent, it's interesting. Then if you look at, for example, a bibliographic network like DBLP, which we use all the time: they collect papers, but the papers have authors, venues like conferences and journals, and titles containing keywords, and there are citations, all these things. If you think about it, this is also a multi-typed network. Now, some people say, I'm also working on heterogeneous networks.
For example, WordNet. All the words are linked, and they are different. But in WordNet, in most cases, the words are not typed: there are so many different words, each somewhat different, and you just link them. What we are studying instead are typed heterogeneous networks. Once you type the objects, things become structured or semi-structured, and then it becomes quite powerful. Even for a huge, gigantic network, once it is somewhat organized by typing, the power immediately shows up. Let me give you a simple example. This is a co-author network: people linking with people because they are co-authors. But where do you get this co-author network? Most people say, I got it from DBLP. If you want to study co-authors, you just look at those links. You might say who the popular collaborators are, because the neighborhood is dense or someone has high fan-out. But that doesn't make much sense if what you really want to study is who the prominent researchers are. Some researchers may not collaborate with many people but are still prominent; you cannot decide who the real heroes are just based on link counts, right? Think about where this network comes from: your publications. You have papers, papers have titles, and if you publish in reputed conferences, you are a hero. So this homogeneous projection has thrown away lots of information, and it is important information. If you want to preserve that information, you need a heterogeneous network, not a homogeneous one where authors only link with authors: authors link with conferences, with papers, with keywords, with all those things. So here is the simplest one I can think of.
These are just authors, like Bob or Mary, and these are just conferences. If I publish one paper in a conference, there is one link; if I publish three papers, the link to that conference has weight three, or there are three links. So that's a heterogeneous network, and even this simplest one is only bipartite. What questions, what interesting things, can you mine from this network? Let's take DBLP as the example. DBLP is a computer science bibliography database. It now contains about 1.8 million records, with over half a million authors and around 10,000 conferences. It contains lots of data, and the data is linked: if you publish five papers in a conference, you are linked to all of them. There are questions you cannot answer with plain retrieval, but you can with data mining. For example: how is the computer science research field structured? How many subfields and sub-subfields are there — can you derive the whole hierarchical structure of computer science just from publications? Who are the leading authors on web search, or on probabilistic graphical models? How do those authors collaborate, how do they evolve, do they break up, do they form new groups? Another interesting one — this is Xiaoxin's study — is how many Wei Wangs in DBLP carry exactly the same name but are different people, and who is who: who collaborated with whom, and who published which papers? He did a very interesting thing; he could identify them almost perfectly accurately. And who was Sergey Brin's supervisor, and when? You probably know Sergey Brin, Google's co-founder; he was a PhD student at Stanford, and he did have publications, not only PageRank but other things too. So who was his advisor, and during what time?
Can you find that automatically, just from this data? And can you predict what topics or what papers Christos Faloutsos is going to write in the next several years? Christos asked me that question himself, and we can actually show something can be done; it's very interesting. The point is that once we view DBLP not as a database but as a heterogeneous information network, everything becomes alive. Let me show you a few very interesting algorithms that you will probably find really exciting. We have two methodologies, developed mainly by Yizhou Sun, and I will show you both with a few very interesting algorithms. The first is integrating ranking with clustering and classification in heterogeneous networks. Essentially she developed an algorithm called RankClus, ranking-based clustering, and then quite a few algorithms along with it. Another student, Ming Ji, joined us two years ago; following this philosophy she did ranking-based classification and got a KDD paper on an algorithm called RankClass. We intentionally made the two names sound almost the same, though the spelling is different. The other methodology is called meta-path-based exploration. The idea is to think at the level of types: this type links to that type, and the types form a path. If you follow different paths, you carry different semantics, and you can do a lot of interesting things. We are going to study this as well. Let's first get into the RankClus algorithm. The general idea is that everybody likes ranking, and everybody likes clustering, but very few people think these two can work together. Why do you like ranking? I can give my own personal experience: before PageRank came along, when I went to a conference, I would ask people for their URLs.
And I would get name cards and keep them carefully — here's the reason. For example, when I met Christos Faloutsos, I'd say, give me your URL; of course I know he's from CMU. Why would I ask? Remember, typical IR is an inverted index. If I search the web for Christos Faloutsos, it says 70,000 entries found — actually more than that if you really search. Where's his homepage? If his homepage is buried in the middle of those 70,000 entries, it would take me days to find it. But PageRank has some magic that finally ranks his homepage at the very top. I don't need name cards anymore; in fact, I don't need to remember anything — if I want to find something, I go to Google, that's it. That's the power of ranking. It's not that Google can find 70,000 entries; it's that Google can rank them. The second thing is clustering. Everybody likes clustering, because if you do not group similar things together, you will not be able to get to the right entry: you search for something, the similar things are not there, and you have much more frustration finding it. But can these two things work together? Here is the general philosophy. I probably don't have to introduce clustering — you know it finds clusters. But the general methodology of clustering treats every object as identical, say with weight one, and then tries to partition things into clusters. But in real life, should you treat every object as identical, with the same weight? Think about clustering people, authors. Christos Faloutsos, suppose, has published 500 papers — a famous professor, very influential. And suppose there is a first-year graduate student who has published one paper in a KDD workshop.
So both are in the KDD field. But if I put Christos Faloutsos in the wrong cluster, what happens? Many things may change. If I put the student in the wrong cluster, what happens? Nothing — of course, for him it's not fair, right? But you can see that in clustering, objects should be weighted differently. Why shouldn't they be? This is common sense. So why don't clustering algorithms work this way? The problem people run into is: it's hard to find the right weights in advance. That's the trick she was playing. She basically said: based on the network, somehow we can figure out a ranking. Even if we figure it out in a very rough way, it may help the clustering; and once the clustering is a little better, the ranking gets a little better, and it goes back and forth. It's like PageRank: from just a bunch of pointers, PageRank can finally sort all the pages. It's incremental: you find a tiny difference, you loop, and everything falls into order. That's the trick. So think about this bipartite, simplest network of authors and conferences. At the very beginning, we don't know how to cluster authors or conferences. But there is a general philosophy: if somebody publishes more papers in a conference, they are likely in that field; and if many authors publish a lot in both of two conferences, those two conferences are likely in the same field. That's how you do the clustering. And along the way you also have a way to figure out a rough ranking, because if you publish more papers in a field, you likely rank somewhat higher in it.
So with this idea, you compute an initial ranking, then go back to clustering, and the clustering becomes better; with better clustering you get better ranking, and with better ranking you get better clustering. You go back and forth until it's done, and it works really well. She worked this out essentially like an EM algorithm, but it carries real intelligence. At initialization, as in EM or k-means, you just randomly partition, but you have ranking rules — like, if you publish more papers in a field, you likely rank a little higher in that field. You do the ranking, estimate the memberships, do the partitioning, adjust the clusters, and repeat steps one, two, three until it stabilizes; then it's done. So it's a pretty simple, straightforward algorithm. If you put this network into matrix form: say conferences are type X and authors are type Y. X-to-X means conference linked with conference; initially that is empty — no conference links directly with another — but they link through authors. After many rounds, conferences become more linked with conferences, and authors with authors: even without a co-author relationship, you still get linked because you are closer. With this idea she worked out a very nice algorithm. But here is the interesting part about the ranking: you might disagree with me at the very beginning. If people who publish more papers rank higher, what if I publish a lot in a very junky place — should I rank number one in the world? You need some simple rules to guard against that, because you probably know that even an MIT group used a random generator to produce a paper, submitted it to a conference, and it was accepted.
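The back-and-forth loop just described can be sketched in a few lines of code. To be clear, this is only a toy illustration with made-up data, not the actual RankClus formulation (which fits a mixture model over the rank distributions); the cosine-based reassignment step and all the numbers here are my own simplifications.

```python
import numpy as np

# Toy bipartite network: W[a, c] = papers author a published at conference c.
# Hypothetical data with two hidden areas: authors 0-9 publish at
# conferences 0-2, authors 10-19 at conferences 3-5.
W = np.zeros((20, 6))
W[:10, :3] = 1.0
W[10:, 3:] = 1.0

K = 2
labels = np.array([0, 1, 0, 1, 0, 1])   # deliberately bad initial clusters

for _ in range(10):
    # Ranking step: within each cluster, simple-rank authors by paper count.
    centers = []
    for k in range(K):
        r = W[:, labels == k].sum(axis=1)
        centers.append(r / (r.sum() + 1e-12))
    centers = np.array(centers)          # each row: a cluster's author-rank profile

    # Clustering step: reassign each conference to the cluster whose ranked
    # author profile it matches best (cosine similarity).
    cols = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    sims = (centers @ cols) / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    labels = sims.argmax(axis=0)

print(labels)   # the two areas separate: conferences 0-2 vs. 3-5
```

Even from a bad starting partition, the ranking step concentrates each cluster's author-rank profile on one area, and the reassignment step then pulls each conference toward the right side — the same self-reinforcing loop as in the talk.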
Of course that must be a junk conference, but how can you distinguish them? Actually, it's very easy. Say your paper got rejected and you set up your own workshop so all your papers get accepted — fine; do you think other people will join you? If I were a researcher, I would feel ashamed of it and would not send papers there, right? So it's easy to state a rule like this: highly ranked conferences attract many papers from many highly ranked authors. It simply says that if you are reputable, you would feel ashamed to go down to a junk conference where a random-generator paper sits in the same category as yours. So we coded up these rules, which is very easy to do. Philosophically, we do have a baseline called simple ranking: publish more papers, get a higher rank. Authority ranking instead looks at whether the venues are reputed and whether the authors are reputed, based on these mutually reinforcing rules, including collaboration, which to some extent you can tune. For example, if you publish with, say, Albert Einstein, you are probably quite good at physics — otherwise Einstein would not even want to co-author with you. We coded this into matrix form, and the last rule is elastic: the parameter alpha can be adjusted. If you think collaboration is very important, put more weight on it; if not, reduce the weight. Anyway, we ran our tests on the DBLP dataset, and the convergence is very interesting. At the very beginning, the two colors are all mixed up — the ranking is mixed up and the clustering is mixed up — but after a few iterations of this algorithm, the ranking becomes very well separated, and the clustering does too.
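Those mutually reinforcing rules are easy to write as an alternating iteration. Here is a minimal sketch with hypothetical toy matrices; the authority ranking in the actual work uses properly normalized matrix forms, but the structure is the same: conference rank feeds on author rank, and author rank feeds on conference rank plus an alpha-weighted collaboration term.

```python
import numpy as np

# Toy data: W[a, c] = papers author a published at conference c,
# C[a, a'] = co-author links (both matrices are hypothetical).
W = np.array([[3., 0.],
              [2., 1.],
              [0., 5.]])
C = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])

alpha = 0.8                      # elastic weight: publications vs. collaboration
r_author = np.ones(3) / 3
r_conf = np.ones(2) / 2

for _ in range(50):
    # Rule: highly ranked conferences attract papers from highly ranked authors.
    r_conf = W.T @ r_author
    r_conf /= r_conf.sum()
    # Rule: highly ranked authors publish in highly ranked venues and,
    # with weight (1 - alpha), collaborate with highly ranked authors.
    r_author = alpha * (W @ r_conf) + (1 - alpha) * (C @ r_author)
    r_author /= r_author.sum()

print(r_conf)    # conference 1, fed by the strongest author, ranks higher
print(r_author)
```

Note how the fixed point is not just a paper count: conference 1 rises because its papers come from the highest-ranked author, which is exactly the guard against junk venues described above.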
Of course, in this test we used two fields. One is database and data mining; the other is computer architecture and hardware. We assume people usually do not publish in both fields — of course some people do, but in general they don't. That's why you see such well-separated clusters. If you picked, say, KDD and machine learning, it might be hard to split them. Then we took over 2,000 conferences and 20,000 authors over many years and ran the test to see whether we can cluster conferences well. We set K to 15 and got this result. You can see that overall, for information retrieval, you get SIGIR and ACM Multimedia — of course, with a bigger K, multimedia would separate from SIGIR — but here you can see this is the cluster for information retrieval, this one for theory, this one for AI, this one for networking. We did not put any knowledge inside: the computer knows nothing, but it still produces very good clusters and not-too-bad rankings. You see SIGIR ranked number one here; there is VLDB; and in theory you see SODA, STOC, FOCS, probably the top three theory conferences. So this shows the approach does have some power. Of course, if you want to go deeper — not only partitioning out databases, but within databases finding query processing, data models, data mining, data warehousing — they all submit to the same conferences, so it becomes hard. How do you do it? You need more information. The bipartite network is only powerful enough when the fields are well separated by conference. You need more — for example, terms, like keywords from the title. Actually, we only used titles, not even abstracts; we take the titles and look at their terms.
So we get a star network: the research paper is the center, with author, venue, and term around it as a star. The interesting thing is that when you do the clustering and ranking, it nicely separates this into multiple stars, and every star has exactly the same structure: paper, venue, author, term. It's like DNA: you chop it up, partition it into many small parts, and each still has the same DNA structure, but the actual contents are different — database, hardware, theory are well separated. I'll show you these numbers; you may not want to read them all, but look at the red entries. For example, take IJCAI — everybody knows it's an AI conference — its third column is red, meaning that is the largest component of its distribution. Then look at which rows have the second column red: CVPR, ECML, then ICML, CHI — you can see why we can easily separate them, because after a few rounds the distributions are well separated. And it's not only conferences; the separation goes down to all entities — the terms, the conferences, the authors. If you don't know which field a cluster is, look at the terms: database, systems, queries, management, object-relational — this is the database field, everybody knows. These are the top database conferences, and these are the authors. Of course, I cannot claim this ranking is the most authoritative, but it gives you something pretty good; these are very well-known authors — David DeWitt, Michael Carey, Jagadish, all there. And notice: with all this, we had no training at all. Then another student, Ming Ji, came along and said: we don't have any training.
Why don't we add some training? It's not hard, right? So she added training and tested it, and got another algorithm called RankClass, ranking-based classification. The general philosophy is this: you may label, for example, some data mining objects with blue labels and some database objects with green labels, then do knowledge propagation. Objects that are more strongly linked tend to propagate their label distributions to the others, so based on this propagation, objects you did not label also get labels. Not only that — she worked out quite a few formulas using graph regularization techniques; I won't go into detail, but it was published around ECML/PKDD 2010. Then she used the other student's idea: why should every object be treated identically? Even for classification, if the training set contains a very highly regarded conference or author, I would not give it the same weight as a first-year grad student. If you put Jagadish in the training set — he's a big name in databases — along with a first-year grad student, then misclassifying the student doesn't matter much, but you don't want to misclassify Jagadish, right? So you can see that in RankClass, ranking and classification can also work together, and she did it, with a very small training set. Interestingly, if you give the best possible training set — the top conferences — you almost surely get a very good result. But she didn't do that, because she felt it was a little like cheating. She gave somewhat ambiguous seeds, more like authors and papers. Remember, for authors and papers — even for me — whether I'm a data mining person or a database person depends on who you ask, right?
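The knowledge-propagation step can be sketched with the standard graph-regularization iteration on a toy graph. This is my own simplified illustration — a single-typed graph with one seed per class — whereas RankClass works on the multi-typed network and additionally re-weights objects by their within-class rank.

```python
import numpy as np

# Toy link graph among 6 objects; in the talk these would be typed
# DBLP objects (authors, papers, venues) with type-specific weights.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], float)

# Seed labels: object 0 -> class 0, object 5 -> class 1; rest unlabeled.
Y = np.zeros((6, 2))
Y[0, 0] = 1.0
Y[5, 1] = 1.0

# Symmetrically normalized propagation: spread label mass along links
# while staying anchored to the seed labels.
D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
S = D @ A @ D
beta = 0.8                       # how far labels propagate vs. trust in seeds
F = Y.copy()
for _ in range(200):
    F = beta * (S @ F) + (1 - beta) * Y

print(F.argmax(axis=1))          # -> [0 0 0 1 1 1]
```

With just two labeled objects, every unlabeled object ends up on the side of the seed it is better connected to — the "so few examples" effect the talk describes on DBLP.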
People on database committees say this guy is a database person; people in data mining say, oh, he's a data mining person. It depends, right? It also depends on where you publish your papers. So we compared against a few very good network-based and regularization-based algorithms, and we found that with the network and rank-based classification, you get very high classification accuracy. Look at the number of training examples you really need: we had 14,000 papers and 14,000 authors, and the lowest setting, 0.1%, means you need just about 15 papers and 15 authors as the training set. You label those, and everything else gets trained through the network. With so few examples, we reach around 80 to 83% accuracy compared with the other algorithms. Remember, 25% is essentially random guessing, because there are four fields — we used database, data mining, information retrieval, and AI — so 25% means you are guessing at random. Here the accuracy is pretty high. But the concrete cases will really convince you, because we not only get good classification accuracy but also good ranking. Look at the conferences in these four fields. If you know the AI field, you'd say this is a pretty good ranking of the top AI conferences. For IR, you see SIGIR; you may argue with me about, say, CIKM and WWW — I would rank WWW higher, but WWW is spread over much more than IR, and CIKM is more concentrated, though not as concentrated as ECIR. That's why ECIR actually ranks higher than CIKM. And SIGIR, no question, is always number one.
But look at the keywords — that's automatic, with no human instruction: retrieval, information, web, search, text. Can you find an even better ranking, or five better keywords to replace these five? I thought about it, and I just cannot find any other five words that beat these five for IR. So this simply says: with such a small number of training examples, it finally gives you very nice results. It's elegant. So that's what we call integrating ranking with clustering and classification. Then we go a little further, to what we call meta-path-based exploration. Why meta-paths? The interesting thing is this: for papers and authors, we took the publications as the key parts, but there are other things we did not put in — for example, publication city or publication year. The reason is we think they are unimportant: the city cannot determine the quality of an author; publishing in California or in Hawaii is really the same. So we don't care, we drop it. But in principle, how do you know? We could play it that way because we knew; if you put the wrong information in, you get the wrong results. So we started thinking: what if we explore the meta-paths themselves? To explore meta-paths, the first important thing is that when you follow a meta-path, you get a similarity and you get a semantics. How do you check whether two things are similar? For example, who is most similar to Christos Faloutsos? I actually asked Christos himself. He said: that's hard to say; it depends on how you judge it. If closeness means co-authoring papers, then it's my students. If closeness means publishing in the same conferences, then it's researchers at the same level. That's a very good answer.
It simply says you need to specify the meta-path. One option goes through the co-author path; another goes through the co-conference path. They are different paths, and they carry different semantics. In that spirit, we worked out a path-based similarity measure, and it is very, very easy to compute. For two objects i and j, you count the number of path instances from i to j following the meta-path. Then you also count the path instances from i back to i, and from j back to j, following the same meta-path — remember, an author like Christos can reach other authors along the path, but can also come back to himself along it. The similarity is the i-to-j count divided by the average of the two self-counts. It's a very simple measure, but it turns out to be very powerful and very interesting. People have studied similarity measures in the past: path count, random walk, pairwise random walk, many things, and quite a few measures have been worked out. I won't try to compare all of them; I'll give you two very popular ones. One is called Personalized PageRank; the other is SimRank. Both were developed by Stanford groups — SimRank was done by Jennifer Widom's group. Against these two, we ask: if you want to find who is most similar, following the same meta-path, which measure should you use? Our new one is called PathSim, path-based similarity. For these three measures, we took one real person, AnHai Doan. Why AnHai Doan? First, he was my colleague at UIUC.
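In code, the measure is a one-liner once you have the path counts. For a meta-path like APA (author-paper-author), the count of path instances between two authors is an entry of the commuting matrix W W^T, where W is the author-paper incidence matrix; the toy data below is hypothetical.

```python
import numpy as np

# Hypothetical author-paper incidence matrix: rows = authors, cols = papers.
W_ap = np.array([[1, 1, 0, 0],    # author 0 wrote papers 0, 1
                 [1, 0, 1, 0],    # author 1 wrote papers 0, 2
                 [0, 0, 0, 1]],   # author 2 wrote paper 3
                float)

# Commuting matrix for the meta-path A-P-A (co-authorship):
# M[i, j] = number of path instances from author i to author j.
M = W_ap @ W_ap.T

def pathsim(M, i, j):
    """PathSim: 2 x (paths i->j) / (paths i->i + paths j->j)."""
    return 2.0 * M[i, j] / (M[i, i] + M[j, j])

print(pathsim(M, 0, 1))   # share one paper -> 0.5
print(pathsim(M, 0, 2))   # no shared papers -> 0.0
```

The normalization by the self-counts is what keeps prolific authors from dominating every similarity list; for a longer meta-path like APCPA, the same function applies once the commuting matrix is built from the chain of incidence matrices along the path.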
We were almost in the same group — next door, almost. So I know him very well. But he left; Wisconsin grabbed him and he joined Wisconsin. You probably also know they grabbed Jignesh Patel from Michigan — two young stars, both grabbed by the Wisconsin group. So I want to see who AnHai Doan is most similar to. It's interesting: if you use Personalized PageRank, the most similar turn out to be Philip Yu, Hector Garcia-Molina, Gerhard Weikum. Do you agree? Probably not, because AnHai Doan was an assistant professor, just grabbed by Wisconsin. He cannot compete with Philip Yu, who has something like 700 publications already — probably the most publications in DBLP. How could those two be the most similar? But Personalized PageRank measures authority: whoever you point to that is most authoritative comes out on top. By that logic you would grab Albert Einstein as your most similar author — I dare not say that, right? So it may not be good for comparing similarity. And SimRank grabbed somebody I don't even know, because of the pairwise random walk: you walk far away and grab somebody with some similar patterns, but too far away. But if you look at PathSim, you'll be astonished: AnHai Doan is most similar to Jignesh Patel — both grabbed by Wisconsin. The several other people are also stars at the same level: Jun Yang is a star at Duke, Renée Miller is a young star in Toronto. Remember, we have over half a million authors, and PathSim immediately picked out these few. You think it's magic? That's the power of this path-based similarity. So with this, we really went down to Christos Faloutsos.
You see, who is most similar to Christos Faloutsos? As Christos himself would say, it depends on which path, which semantic path, you use. If we use the APA path (author, paper, author), what is the closest relation? Co-authorship, because you are linked by the same paper. With this path, who is most similar to Christos? Spiros Papadimitriou, who from Christos's point of view is the star student, because they co-authored the most papers. And Jimeng Sun, and Jure Leskovec, who is now at Stanford University. Jure is younger, of course, and has published fewer papers with Christos than Spiros has. On the other hand, the scores are all pretty low, which simply says none of them is really comparable to Christos; but compared to everybody else, they are still the most similar, based on the APA path. Now look at APCPA: author up to paper, paper up to conference, shared conference, back down to paper, back down to author. This is the co-conference path, and based on it, Christos comes out most similar to me. There is nothing magic about this. I thought about it and I think it is right, because Christos publishes mostly in both database conferences and data mining conferences, and I also publish in both; you can hardly find another person doing both. Rakesh Agrawal is on the list too, because he is a database guy and also a data mining guy, and so is Jian Pei, my own student, who does both database and data mining. H. V. Jagadish is also doing work in both areas. So it is actually pretty accurate; a very, very good similarity, and again picked out of half a million authors. It is just magic. Then we took this to Flickr.
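The path dependence is easy to reproduce: compose the commuting matrices for APA and APCPA from toy incidence matrices (all invented here; venues stand in for conferences) and compare the scores for one pair of authors:

```python
import numpy as np

def pathsim(M, i, j):
    denom = M[i, i] + M[j, j]
    return 2.0 * M[i, j] / denom if denom else 0.0

# Hypothetical toy network: 3 authors x 3 papers, 3 papers x 2 venues.
AP = np.array([[1, 1, 0],
               [0, 1, 1],
               [0, 0, 1]])
PV = np.array([[1, 0],
               [0, 1],
               [0, 1]])

M_apa   = AP @ AP.T            # APA: co-authorship path counts
AV      = AP @ PV              # author -> venue path counts
M_apcpa = AV @ AV.T            # APCPA: shared-venue path counts

# Authors 0 and 2 never co-authored, but both publish in venue 1,
# so the two meta-paths rank their similarity very differently.
print(pathsim(M_apa, 0, 2))    # 0.0 under the co-author path
print(pathsim(M_apcpa, 0, 2))  # positive under the co-venue path
```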
What we want to find is which picture is most similar to a given picture; we just take this lotus picture. You can really see the sensitivity to the path. The ITI path is image-tag-image: if you share a common tag, you are more similar. Is that reasonable? Sometimes it is, but people tag things a little randomly; they have biases. What you see is that flowers come out similar to this lotus flower, fine, but is a bird similar to a lotus flower? You would probably argue with that. Of course you can say it is a wrong tag, but how could you judge that? So we go a little further and use not only the tag but also the group. Remember, people not only tag a particular image; the image also belongs to a group. Using both tag and group, look at those pictures: they are all lotus flowers. With tag alone, the number-one most similar image is not even a lotus flower, it is another flower. But with tag and group, they are all lotus flowers; even this one, which picture-wise has nothing red and nothing blossoming, is still very similar, because it is a lotus flower. So you can see that magic things can happen from muddy human semantics: use different meta-paths, use collective knowledge, and you can actually get it. This meta-path idea is so magic that the student, Yizhou, got excited. She said this meta-path could be the key to solving the semantics problem. Then she did one interesting thing: PathPredict. The PathPredict question was actually raised by Christos Faloutsos himself. I gave a keynote speech at PKDD 2010, and he gave a keynote on a different day; he was there, I was there, he listened to me, and he immediately raised a question. He said, your network is magic; can you predict which paper I am going to write next year?
I said, probably not. He said, even I myself cannot predict what paper I am going to write; how could you predict it? That is true, and it is interesting. I brought it back to Yizhou. I said, Christos asked a tough question: can I predict what paper he is going to write next year? I said I cannot. And she said, probably we should think about this twice; maybe we can do it. She thought it over back and forth and said, it is probably too easy to predict roughly what paper he is going to write, because that is just based on his fields or something; of course the exact title cannot be worked out. But predicting who he is going to collaborate with is more challenging. I said, how could you predict that? I cannot even predict which students will follow me in the next five years; how can I predict who I am going to co-author with in the next five years? She said, maybe for the highly ranked, most popular authors in existing groups, you can predict who they are going to collaborate with. That is reasonable. So we started working on this, and we used the magic of meta-paths. We call this relationship prediction, not link prediction. You probably know link prediction: which link, say which Facebook users, will connect. But that is homogeneous, a friend linking with another friend. Here it is heterogeneous: which paper you are going to write, which conference you are going to submit to, or who you are going to co-author with. That is what we call relationship prediction. We drew the meta-paths. You can see paper and paper citation; you have author, who writes; you have the topic, which is keywords; you have the venue, which is conferences and journals. Then you get this graph; we call it the meta-graph for DBLP. We take this meta-graph, and essentially, if you want to predict co-authors, the co-author relation is here.
APA means co-authorship, because you author the same paper. But we can use all the other meta-paths. For example: an author works on a paper; the paper cites (this arrow means cites) another paper; the other paper is authored by another author. These two may later become co-authors, right? Based on this, we worked out the meta-paths of length two, length three, and length four, and we looked at which meta-paths play a bigger role in the prediction. Of course you can ask, what about even longer ones? Once the length becomes very long, the semantics often become very, very weak. Remember the six degrees of separation: somebody in Africa I do not even know is linked to me within a diameter of six. That is sometimes too much. So we stopped at length four, and even length four can be meaningful. For example: an author publishes a paper in a certain venue; another author publishes a paper in the same venue, say both publish in ICML; they could later become co-authors. Then, based on the training data, we worked out the p-values. You probably know that if you do a regression analysis you can get p-values, and a lower p-value means the link is stronger; a very high one is basically noise. Using the training set, we found one with a p-value around e to the minus 174, which is pretty low, right? So we took those several low ones and called them the four-star predictors. Ones like this other example are basically noise; you had better not use them. You will probably agree with me. Why is it noise? Look at it: an author works on a paper; the paper cites another paper; that paper is authored by this person. This person has probably already passed away, right? So how could you co-author with him?
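The pipeline just sketched, path-instance counts per meta-path fed into a regression over "did they later co-author" labels, can be illustrated with toy data. Everything here (the matrices, labels, and the plain gradient-descent logistic fit) is invented for illustration; it is not the actual model from the paper:

```python
import numpy as np

# Hedged sketch of meta-path-based relationship prediction: each row of
# X holds path counts for one author pair under a few meta-paths, e.g.
# A-P->P-A (citation), A-P-V-P-A (shared venue), A-P-T-P-A (shared topic).
rng = np.random.default_rng(0)

n_pairs, n_feats = 200, 3
X = rng.poisson(2.0, size=(n_pairs, n_feats)).astype(float)
true_w = np.array([1.0, 0.5, -0.2])          # toy "ground truth" weights
p = 1.0 / (1.0 + np.exp(-(X @ true_w - 3.0)))
y = (rng.random(n_pairs) < p).astype(float)  # later co-authorship labels

# Plain gradient descent on the logistic loss (no library needed).
w, b = np.zeros(n_feats), 0.0
for _ in range(2000):
    z = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (z - y)) / n_pairs
    b -= 0.5 * np.mean(z - y)

# Larger learned weights mark the stronger "star predictor" meta-paths.
print(np.round(w, 2))
```

In the real setting the significance of each meta-path feature would come out of the regression diagnostics (the p-values mentioned above), with only the significant paths kept as predictors.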
Of course you could, or maybe not, but it is just too far away; that is why it is not a very good predictor. So based on this, she worked out a regression and used it to do the prediction, and some magic things happened. Originally I told her, how could you predict co-authors? I do not even know who my students will be in the next several years. What she actually used as training data is the earlier data, and the prediction target is later data that we already know; you just hide it. You say: in DBLP, you cannot see anything beyond 2002. That means 2003 to 2009 is what you want to predict, but you can see everything in the earlier history. Then I picked one person: who is the best one to try? My previous student Jian Pei, because he got his PhD from me in 2002. That is the best time, because he became a free agent who could co-author with anybody. Of course, even when he was a student I never restricted him from co-authoring with others, but naturally he would co-author with me more than with other researchers. Once he became a free agent, we predicted his top five candidate co-authors for the next several years. Interestingly, four of them really came true. The only one that did not come true happened to be his lab mate; they both got their PhDs from me. I do not know the social or psychological reason; they are both professors, but they never collaborated in the next five or eight years, and I can never encode those social or psychological factors. Anyway, it is a pretty interesting prediction. She also ran the prediction for me, from the 2002 data: my top ten. Of course, I know every one of them. For example, Jagadish, but I never co-authored with him.
So her prediction was not quite right there. Of course, there are many social or other reasons, not psychological ones; I do not really have anything against Jagadish, I am happy with him. But what you can see here is that with people like Hans-Peter Kriegel and Rakesh Agrawal, we really did collaborate. With Bing Liu I did collaborate, but beyond her time window, because I collaborated with him in 2010 and her data finished at 2009, so it cannot count. And with Christos Faloutsos I collaborated to co-edit a book, but that does not count either, because books are not in DBLP. So only two qualified, but those are still pretty good results, right? It is interesting because, remember, there are so many candidates, something like 11,000. Actually, I skipped one very interesting piece: predicting when. Not only whether you are going to do this, but when you are going to do it. Because of time I will not introduce it; the paper was just accepted by WSDM 2012, so if you like it, you can go to my webpage and download it. The last thing I will show on this prediction theme is role discovery. Role discovery was done by another student, Chi Wang. What he did is this: we got a big Army Research Lab grant, and for DBLP data, or any network data, we want to predict hidden roles. For example, the Army wants to do this: you have a very messy communication network, and you want to find who is the chief, like Bin Laden, who are the foot soldiers, and who is at the third level. Can that be predicted? Of course, with Bin Laden it finally happened; he was never really on email or communicating with others, he tried to completely hide himself from the outside. But think about it.
If, just based on communication patterns, you work out a structure like this, then the foot soldiers will likely be the outermost ones, the third level will be somewhere here, and the real chief could be here. Do you think a foot soldier communicates directly with Bin Laden? No way, because then you could capture Bin Laden the second day, right? That is why this structure is very real. Even the Army people agreed, because no foot soldier would dare communicate directly with the general. They communicate within a squad, then the squad reports upward; there are lots of hierarchies. So we based the work on this. What we wanted to get is: from DBLP data, nothing else (you cannot see the web), predict who is whose advisor, and when. Remember, DBLP has no affiliation information. It never says this paper is from the University of Wisconsin or Michigan; from the DBLP data itself you just have author, paper, conference, that is it. Based on this data, he worked out a graph containing author, paper, starting time, and ending time, and then computed a ranking score. The interesting thing is that he mainly relied on two heuristics that everybody will buy. One says that at the advising time, the advisor usually has a longer publication history and more publications than the advisee. Do you believe it? I believe it, because if I had far fewer publications than my student, the student would not follow me, right? The other says that once an advisee becomes an advisor, he or she will not become an advisee again. I deeply believe that, because once I become a professor, I do not want to be a student anymore, right? So that is probably quite right.
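The two heuristics can be sketched on an invented publication table. This is only the idea (compare publication-history length and paper counts at the co-authoring time), not the real ranking algorithm; the names and years are made up, and heuristic 2 (an advisee-turned-advisor never becomes an advisee again) would enter as a constraint when chaining pairs over time:

```python
from collections import defaultdict

# Invented toy publication records: (author, paper_id, year).
pubs = [
    ("prof_x", "p1", 1995), ("prof_x", "p2", 1997), ("prof_x", "p3", 2000),
    ("prof_x", "p4", 2001), ("stud_y", "p4", 2001), ("stud_y", "p5", 2002),
]

history = defaultdict(list)
for author, _, year in pubs:
    history[author].append(year)

def likely_advisor(a, b, year):
    """Heuristic 1: at the advising time, the advisor has the longer
    publication history and more papers than the advisee."""
    def stats(x):
        ys = [y for y in history[x] if y <= year]
        return (year - min(ys) if ys else 0, len(ys))
    return a if stats(a) > stats(b) else b

# Co-authoring pair in 2001: prof_x started in 1995 with 4 papers,
# stud_y has 1 paper, so prof_x is the predicted advisor.
print(likely_advisor("prof_x", "stud_y", 2001))
```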
Since these two rules are agreeable to anyone in any country, whether you got your PhD in England, Japan, or the US, it is probably the same. So we used these two rules, plus some training set, and we got pretty high accuracy on the advisor-advisee relationship. Let me show you a few people; you will probably be convinced, and it is very interesting. One person we took is David Blei. People working in AI or information retrieval all know David Blei; he is now a professor at Princeton. We predicted he has two advisors: one is Michael Jordan, the other is John Lafferty. It happens to be true. Michael Jordan is a Berkeley professor working on machine learning, AI, or however you want to call it. We predicted the years as '01 to '03 for one and '05 to '06 for the other. Of course, '01 to '03 may be a little conservative, because it takes longer to get a PhD; the method does not dare say it started earlier, because they had not published together yet. But the graduation date is not bad, and the second one is actually his postdoc advisor; that time is quite right. Then the student wanted to put my own student Hong Cheng in; she got her PhD in 2008. Who is her advisor? It found two: one is Qiang Yang, the other is me. Qiang Yang was actually her master's advisor, and the time is right; the PhD advisor is me, and that time is also right. Then he said, let me predict for somebody who never got a PhD, and he took Sergey Brin. You know Sergey Brin anyway, because he is a Google co-founder. It found that his advisor is Rajeev Motwani at Stanford, and the predicted time span is probably too short to get a PhD, which is true: he did not get a PhD, but he got Google, which is probably more than a PhD. Anyway, that is also the truth, with one caveat.
We can say this is an advisor-advisee relationship, but we can never guarantee graduation, right? Still, it is very interesting, and people really liked it; this was in last year's KDD conference. Finally, I want to show you one more thing. Once people got excited, they started thinking this information network can do something magic if you import more data. I have another group of students, led by Tim Weninger, my PhD student, doing something like directly mining web structures and integrating them with the DBLP information network. What he was doing is this. He asks: can you take anybody, say one computer science professor from the University of Michigan, and then grab all of the CS professors' homepages, no more and no less? Can you do that? He found you can do it in a very nice, easy way. Assume that no matter which institution, once people set up webpages, they are somewhat structured. For example, the department or the information management people give everybody space; all the computer science pages are here, so in the web structure the pages sit at the same, identical level, and you can grab them as siblings. Based on this, he worked out the following: you start from the HTML, use its structure, go layer by layer, cross to another page and back and forth, and based purely on the structure you can accurately grab the sibling pages. Very interesting. And the most interesting result I can show you is this. He tried it for all the U.S. representatives and senators, Congress and Senate. You probably know that when the Democrats were in power, the webpages only showed the Democratic side of the committees, because the Republicans were not chairing the committees; but actually everything was there in the pages.
He could find both sides and all the committee structures. It simply means that overnight, when the election changed and the committees all became Republican, the webpages did not really change; they just changed the links, and something hidden became visible. He actually found all of them. Very interesting. Another interesting thing: the worst performance was actually on computer science faculty. Why? All the other categories got 100% recall and precision; computer science he could not do. I asked why, and he said computer science faculty are just too tricky. They put their homepages on Facebook or LinkedIn, not in their original department space; how could you find that? Based on structure, for almost everybody else, like football players or Amazon products, you can find everything. He took Illinois (and Illinois football is not exactly famous), put in one Illinois football player, and he actually found over 10,000 football players at the same level, forming all the football teams, because they are all structured the same way. Very interesting. But for computer science you can never reach 100% recall or precision, because computer science people play tricks with the web and other people do not. Anyway, with this he found more structures and is doing the integration with the information network, the DBLP data, with the students and the professors; that is ongoing work, and he is still digging into this part. Finally, what we want is this: some people want to apply somewhere, and they say, I am interested in games, or toys. You go there, click toys, and you can probably find out who is doing research on toys. We want to do that. Of course, we have not achieved it yet.
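The sibling-grabbing idea from a moment ago can be sketched by grouping the links on a page by the tag path under which they appear: links sitting at the identical structural position come out as one sibling set. This uses only Python's built-in html.parser on an invented toy page; the real system is considerably more involved:

```python
from collections import defaultdict
from html.parser import HTMLParser

class SiblingLinkFinder(HTMLParser):
    """Group <a href> targets by the tag path from the root, so links
    at the identical structural position come out together."""
    def __init__(self):
        super().__init__()
        self.stack, self.groups = [], defaultdict(list)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.groups["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

# Toy faculty-listing page (invented): the three professor links share
# one tag path; the unrelated contact link has another.
page = """<html><body>
<ul><li><a href="/alice">Alice</a></li>
    <li><a href="/bob">Bob</a></li>
    <li><a href="/carol">Carol</a></li></ul>
<p><a href="/contact">Contact</a></p>
</body></html>"""

f = SiblingLinkFinder()
f.feed(page)
for path, links in f.groups.items():
    print(path, links)
```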
But it would be very interesting for people applying to PhD programs or similar. Anyway, what you can probably take away is this: a heterogeneous information network simply means you take a very messy network and put some typing information in, which people usually can do. Then things become somewhat semi-structured, and the power of the links can be thoroughly explored. That makes things very interesting, because the more data you put in, the more information you can take out. Instead of becoming messier and harder to handle, things actually become more structured. It sounds like magic, but it is true. With this work, starting from Yizhou's, our lab now has more than ten students working on information networks, because people got excited: if we can do that, why not be the first wave and claim some fruitful results? You can see the papers; most are by this student, Yizhou Sun, and some are conference tutorials. We gave tutorials at SIGMOD, KDD, and ICDE, and I gave keynote speeches at quite a few conferences. I give her the credit, because the majority of this content is from her research. It is really great to work with really good students. This last picture is from a talk I gave in Athens; on top of the Acropolis, you go around and take pictures of relics that must be two or three thousand years old. Really nice. OK, I have finished my talk, and I would love to take questions or discussion. Thank you. Yes? First question: when I saw your RankClass (yes, it is called RankClass), you use a sort of recursive integration. Here you have a class or cluster, you have ranking, and based on the ranking you improve your cluster.
And you said that if your ranking gets better, you can improve your clustering. So my question is: since the clustering algorithm itself is unsupervised, how do you judge whether or not you have a better result? That is a very good question. Remember, clustering is unsupervised, but the rules you provide give a little supervision. Think about it: you say, if people publish more papers in highly ranked conferences, they themselves become more reputable. That is a reasonable rule, and with rules like this going back and forth, you can enhance the result. So we cannot say it is completely unsupervised; a little knowledge is fed in. But the real magic is that the network itself contains inherent structure. Suppose you work on data mining: you publish papers in KDD, in ICDM, in SDM. If you publish in these three conferences, and not only you but many other researchers do the same, then naturally these several conferences will be clustered together into a data mining field. Do you agree? So even if the process is nominally unsupervised, the network structure itself will automatically glue people together into one cluster. Inherently there are knowledge structures; we just do not know them in advance, because publication is not random. It is the same as PageRank: why does PageRank work? Because people do not endorse each other randomly; maybe some individuals do, but collectively it is not random. That is why the structure inherently can tell you a lot. On the other hand, it is a very good question whether you should trust every link, or weight every link the same. If you look at the meta-path work, that is exactly what we are currently working on. Let me show you this one.
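The ranking-clustering back-and-forth being discussed can be caricatured in a few lines: rank objects within each cluster, then re-assign each object to the cluster whose ranking explains it best, and iterate. This is only a cartoon with an invented author-by-conference count matrix, not the published RankClus/RankClass algorithm:

```python
import numpy as np

# Cartoon of rank-based clustering on a toy author x conference count
# matrix (invented data): conferences 0-1 are one field, 2-3 another,
# and authors publish mostly within their own field.
A = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)   # authors x conferences

labels = np.array([0, 1, 1, 1])             # deliberately poor start
for _ in range(10):
    # Ranking step: within each cluster, rank authors by how much
    # they publish in that cluster's conferences.
    rank = np.stack([A[:, labels == k].sum(axis=1) for k in (0, 1)])
    rank = rank / rank.sum(axis=1, keepdims=True)
    # Clustering step: assign each conference to the cluster whose
    # author ranking best explains its column (higher dot product).
    labels = np.argmax(rank @ A, axis=0)

print(labels)  # conferences 0,1 end up together, and 2,3 together
```

Even from the bad initial split, the alternation pulls the two fields apart, which is the "better ranking improves clustering, better clustering improves ranking" loop in miniature.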
Look at this: we partition the work into two parts. One is ranking working together with clustering and classification; the other is meta-paths. We have not really linked these two parts yet, and linking them is exactly what Yizhou is working on currently. Simply put, she wants to use a meta-path as supervision to guide the clustering. For example, suppose I want to cluster not just researchers in general, but current rising stars: rising stars now, not the rising stars of 20 years ago. Then the year becomes very important, right? If I want to cluster rising stars, what do I do? I would use a meta-path in which the year plays an important role. We can even train it. What is a rising star? I give you a bunch of rising stars, and based on them the training finds that the year-related paths get very high weight. Then we take the year as an important factor in the clustering, and it will give you the rising stars instead of the fading stars. You will probably see that some very big names, who have retired by now, will not be included. That is the magic of meta-paths. We have not used it in the rank-based clustering yet; we are integrating the two. Yes? How many path schemes do we have? That is exactly the prediction question. There are many, many potential path schemes. The first important point is that very long paths may not be that interesting. The rest is usually determined by training: you give me hints, and I try to find out which meta-paths are more important. That is exactly this process: out of so many meta-paths, we only pick a few. Generally, these are all the possible paths in the schema, up to length four. Yes?
You showed these great results with very little training data. What happens with a lot of training data; does it still not go up to the high 90s? Yeah, that is a very good question. You saturate at some point. The reason is that the data itself is somewhat noisy. For example, you classify me as a data mining person; some people classify me as a database person. A paper or a person can be ambiguous. Even my ranking could be high because I have a lot of papers, but that high score gets split across two fields. So things become a little muddy and you can never clean them up perfectly; there is a saturation point. Is it not quite there? Not quite; it is reasonably high if you look at it, but it saturates at some point, and we have not found a very smart way to go even higher. If you look here: this is accuracy on authors, this is accuracy on papers, this is accuracy on conferences. Accuracy on conferences gets pretty high for almost any method, because conferences are not so ambiguous. But for papers you see saturation at a certain point, because classifying a paper is sometimes very hard; it relates to many things. So that is a very good question: at some point you stop there, and to do more you have to play new tricks. Yes? You mentioned author names in your introduction. What about the case where the same author's name appears differently, like C. Faloutsos and Christos Faloutsos? That is a very good question. We had a previous study by another student, Xiaoxin Yin. He did the problem the other way around: when several people share the same name entry, he wanted to split that one entry into individuals, to say who authored which papers.
I cannot say it is perfect, because it works for DBLP but does not work for, say, PubMed. Why? In computer science a paper probably has three or four authors. In PubMed, one paper can have, I do not know how many; once my buffer of 50 could not hold all the authors and I had to extend it beyond 50. Another thing with PubMed is that many papers only have the first initial, not the full name, so you get more collisions, and it is very hard to distinguish the people. For example, even NSF does not ask me to put my whole name, just J. Han, and I get lots of very strange papers attributed to J. Han; I look at them and say, that is definitely not my paper. So there are lots of collisions and you need new algorithms, but this work definitely helps. That is a very good question. Yes? It is an iterative process; is it possible that it converges differently, or does it always converge to the same result? Yeah. For any clustering algorithm, you probably know that even k-means may finally converge to a different point; you can call it suboptimal. This one is the same; you have the same problem, and nothing is perfect. What you need is to run it multiple times, evaluate those distances, and see which result is somewhat better. But I could imagine that it might be such that... Yeah, that is a very good point. The key is, for example, if you look at the conferences: the top conferences are very dedicated, so you always get them right. SIGMOD is database, SIGIR is IR; you will never get confused. But once you go down to the lower-ranked ones, with different seeds or different runs, those marginal cases become a little muddy. That is exactly the power of this rank-based clustering: the top-ranked ones will not go wrong.
But the lower-ranked ones depend on your initialization; those can sometimes vary. Yes, that is a very good point, and a very good summary. OK. Thank you. Thank you.