The next speaker is Hanna Wallach. She is a professor at the University of Massachusetts Amherst, in the computer science department, and she's one of the founders of Debian Women, women in Debian, and she's going to be talking about statistical machine learning analysis of Debian mailing lists. Thank you. So I'm going to use this microphone because I don't want to be tied to the other one, and I'm going to be talking about statistical machine learning analysis of Debian mailing lists. I should probably give you a little disclaimer to start with, which is that when I submitted the idea for this talk, I thought it was going to be awesome. I was going to get a ton of analysis done before the conference, it was all going to be fantastic, I was going to have amazing results to present. Yeah, anyway. So I have some preliminary results to present, a lot of exciting things to tell you about, and a bunch of ideas for what I'm going to do next. And that's about it. So, who am I? Well, I'm Hanna, and I've done various things to do with Debian over the years. For a while, I was one of the leaders of Debian Women, until I got busy and sort of dropped off the face of the earth. I co-founded GNOME's Women's Summer Outreach Program, and I organised a workshop for the FLOSSPOLS gender study. And most excitingly, I started in September as an assistant professor at UMass Amherst. And this talk is actually the first talk that I'm going to be giving that unifies my machine learning research with my free software interests. Okay, so what am I going to talk about? I'm going to walk through several things today. I'm going to start off by telling you what my research is about: what my research goal is and how I actually try to achieve that goal, my methodology. Then I'm going to talk a little more specifically about document analysis and statistical topic modelling.
Then I'm going to talk about analysing Debian mailing lists using these techniques. I'll tell you a little bit about what datasets I'm using, some preliminary findings and results, and the kinds of things you should see more of in the next year or so. And then finally, I'm going to talk about future research directions, specifically other kinds of models that I plan on applying to these kinds of data. All right, so what's my research goal and methodology? Well, this is it, so I'm going to read it to you: to develop new statistical models and computational tools for representing and analysing large quantities of complex communication and collaboration data, in order to better enable social scientists and technologists to advance the study of scientific and technological innovation. So this is really what I do. I develop statistical models, I develop computational tools, and I use these models and tools to analyse really large quantities of data. For the most part I work with text data, but I also work with social network data, graph-based data, that kind of stuff. But my primary research focus is on text data, and specifically I'm interested in text data in terms of the role it plays as a communication and collaboration medium. And ultimately, the end goal of what I'm doing is to work with social scientists and other technologists to actually study the process of science and technological innovation. How does innovation happen? How do we actually make new technology, make scientific discoveries, and so on? So why free software development communities? Well, there's been a ton of interest in the past 10 years or so in free software development communities, and there are all kinds of people interested in them, from commercial places, non-commercial places, and of course academia. And there are many reasons why people are interested.
Primarily, it's because free software development has these really complex technological, legal, and social structures, and this is really interesting. You see plenty of companies or other organisations that have technological, legal, and social structures on a much smaller scale, but the free software community is really where this all comes together on a massive scale, and for this reason people are very interested in studying free software development communities. Furthermore, a lot of the collaboration that takes place is geographically distributed. The developers aren't working in the same place; they're not sitting across from each other; they're actually working across the world, and as a result they're communicating with each other via the internet. So despite this considerable interest, the organisational and social processes underlying collaborative free software development are still largely unknown. There's been a lot of research in recent years studying this kind of stuff, but there's still a lot to be found out, and this is a really great area of study for social and computer scientists. Social scientists are interested because of the legal and social structures, and computer scientists because of the technical structures. So when I'm talking about studying collaboration, what am I talking about? Well, this little quote here is actually from an NSF brochure that I found lying around in the library being given away for free, and I picked it up thinking, oh, this is kind of entertaining, maybe it will come in useful, and sure enough it has. It explains what I mean by products of collaboration and the kinds of data that I usually work with, so I'm just going to read it to you.
"Scientific information is both the basic raw material for and one of the principal products of scientific research. Scientists find out what other scientists are accomplishing through journals, books, abstracts and indexes, bibliographies and reviews." And that's great. There's been a huge amount of work in academia over many years studying these kinds of things that scientists produce when they're collaborating: people analysing patent data, people analysing research papers. That's the kind of stuff I've been doing for years. But the thing is, this is actually not what happens within the free software community. We don't write technical articles that are subject to peer review, that take a couple of years to prepare, and that people then cite in other articles. We're not dealing with those kinds of collaboration structures. And so for this reason, studying free software is actually really interesting, because although it is a technological community and there are products of collaboration, those products are quite different to what you find in the academic sciences. So why are FLOSS collaboration data interesting? One of the most interesting things to me is that most of the data are publicly available. We've got mailing lists, IRC channels, commit messages, bug reports, comments and source code documentation, even key-signing records of when one developer met another developer in person. There are huge amounts of data publicly available, and for the most part, particularly with the text-based data, these data aren't being used at all. People aren't actually analysing the content of these collaboration data. They aren't looking at the content of mailing lists, the content of IRC channels, commit messages, bug reports, this kind of stuff.
They're for the most part looking at how people interact with each other via social networks and stuff like that. There's much less emphasis on actually saying, okay, when these people are communicating with each other on a mailing list, what are they talking about, and why are they talking about it? And so my goal is to use these publicly available, primarily textual data to study the organisational and social processes underlying free software development. What are the challenges? Well, there are actually some really big challenges in working with these kinds of data. As I said before, people do a lot of studying of academic papers and patents, and that works very well, because those data are really structured. When you have an academic article, you know it's going to start with an abstract, then an introduction, then a background section, then methods, results, conclusions, and so on. You basically know what's going to be in there. Similarly with patents, you know there are going to be claims, you know there's going to be an abstract, all of this kind of stuff. We basically know the structure of these kinds of documents, and for that reason it's pretty easy to segment them, to do entity resolution, and to work out who the authors are. People typically only publish under one name. It's pretty easy to work with these kinds of data. But that's not what we have in the free software world. Instead we've got much more informal communications, much messier data, and often highly unstructured data. So for instance, developers might use an IRC nick on an IRC channel and then their email address on a mailing list.
And yet in order to work out who's talking to whom, who's collaborating on some project, we have to actually perform some kind of automated entity resolution to map these identities to each other. There's no point in saying, okay, Stargirl's talking about this and Hannah Wallach's talking about that, when in fact Stargirl and Hannah Wallach are the same person. So there are challenges like that, things like entity resolution. Another challenge is the fact that IRC channels typically have multiple interleaved conversations. I do a lot of work on analysis of text documents and analysis of conversations, and that's what I'm going to be talking about later on today, but it's actually not possible to perform that kind of analysis if you don't know what constitutes a conversation. If I see a bunch of interleaved comments, which of them are actually people talking to each other, and which are two unrelated conversations taking place at the same time? And so in order to analyse IRC data you have to solve the problem of conversational thread disentanglement, which has a terrible, ridiculously long name, but which is a complicated natural language processing problem. The other thing is that there can be a mixture of highly technical and off-topic discussion. You have some people talking about, say, packaging software, and some other people talking about what they had for dinner tonight. They're both perfectly valid things to take place on an IRC channel, but when we're analysing the way free software developers communicate, we might want to separate out which things are highly technical discussion actually relating to whatever piece of software they're building, and which things are off-topic. And then finally, conversation style is often very casual.
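As a rough illustration of the entity-resolution step just described — not the method actually used in this work — here is a toy Python sketch that merges known alias pairs into canonical identities using union-find. The alias pairs and addresses are made up; in practice producing those pairs is the hard part.

```python
# Toy entity-resolution sketch: merge developer identities (IRC nicks,
# email addresses, real names) into canonical people via union-find,
# assuming a list of alias pairs from some upstream matching heuristic.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical alias pairs produced by some matcher (made-up examples).
aliases = [
    ("stargirl", "hannah@example.org"),
    ("hannah@example.org", "Hannah Wallach"),
]

uf = UnionFind()
for a, b in aliases:
    uf.union(a, b)

# Now the IRC nick and the real name resolve to the same person.
print(uf.find("stargirl") == uf.find("Hannah Wallach"))  # True
```

With the identities merged, posts and IRC comments by the same person can be counted together before any downstream analysis.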
These data are not formal papers. They're not things that people have spent weeks and weeks writing, really poring over the language; they're just spontaneous comments that people are making on IRC or on mailing lists, not written in the same formal style as an academic paper. And so, in fact, before we can even develop models for answering social science questions about free software communities, we need to perform significant text analysis if we want to make use of these text data. So how do I tackle this kind of stuff? Well, as I said before, the real modelling challenge is in aggregating and representing really large, messy datasets. We're not dealing with clean data, and we're not dealing with small quantities of data; we're dealing with large amounts of really messy data. There's also the challenge of handling data from sources with disparate emphases. For instance, you might have a bug report that's talking about the same thing as some post on a mailing list, but these serve very different purposes and therefore may use very different language. Yet we still want to know that they're talking about the same thing and are part of the same kinds of conversation. And then finally, we want to be able to reason about these data and work out what's going on when people communicate and collaborate, under uncertain information. We don't know for sure what people are talking about; all we know is literally the words that appear in these mailing list posts. We don't know the underlying topics for certain, and yet we need to be able to reason efficiently even under uncertain information. So the framework that I use to do this is Bayesian latent variable models, or Bayesian hidden variable models. This is a really powerful, flexible, and very general modelling framework, and the key thing underlying it is Bayesian statistics.
It's a lot of fun, but I'm not going to talk about the intricate details; if you want to find out more about Bayesian statistics you can talk to me afterwards. In this talk, I'm going to talk about one particular type of Bayesian latent variable model: statistical topic models. So, I've told you a bit about what I'm trying to do, the kinds of data that I'm dealing with, and the methodology that I'm using, namely Bayesian latent variable models. Now I'm going to tell you a little bit about document analysis and statistical topic modelling, the particular modelling framework that I use on a daily basis. Statistical topic modelling is actually a set of models — many different models fall into this framework — but statistical topic models all share three fundamental assumptions, which are as follows. First, documents are assumed to have some kind of latent semantic structure. What I mean by this is the following: when I see a document, that document is about some topic, or some topics, plural, and yet what we actually see when we look at a document is simply a collection of words. We simply see, okay, this word was used, this word was used, this word was used. But as a human, when I read that document I know, okay, this document is about entertainment and sport, or this document is about, I don't know, bug trackers, that kind of thing. All we actually see when we look at documents are collections of words; nonetheless, we can assume that each document has some kind of semantic structure, that it's about some topics. The second assumption of statistical topic modelling is that we can infer topics from word-document co-occurrences. What do I mean by that? I mean that as a human, when I look at a document and I see this collection of words, I know what the document's about.
I can say, okay, because the words Jennifer Aniston and Brad Pitt were used in this document, it's probably about celebrity gossip; similarly, if I'm seeing words like Subversion and SVN and CVS, I know it's probably about version control systems. So if I'm looking at word-document co-occurrences, at which words are used in which documents, I can work out which topics are being used. And finally, the third assumption is that words are related to topics and topics are related to documents. Again, what do I mean by that? I mean that a document is about some particular topics. A document might be about, I don't know, Jennifer Aniston and Brad Pitt getting back together, and the fact that this document is about these topics means that words like relationship, back together, break up, Angelina Jolie are all going to be used in it. So given these fundamental assumptions and given some collection of documents, the goal of statistical topic modelling is to learn which topics are used in that dataset — which topics are represented in this particular collection — and also to learn which topics are used in which document: given any particular document in the dataset, which subset of the topics that characterise the whole collection are used in this one document. And why topic models? Why don't we just literally count the occurrences of words in each document, just look at the number of times each word is used in each document? This is why. When I was a young PhD student I was working on a particular problem in an area of machine learning research called Gaussian processes, and I had this fantastic idea of something that I was going to do. I thought, oh, this is great, I know exactly what this is going to solve, it's going to be awesome. I did some Googling and I couldn't find anything about it, and I thought, well, this is great.
Nobody's done this before. So I spent three weeks working on it, and then I told a colleague of mine about it, and he said, oh, you should totally check out the geostatistics literature, because they actually use Gaussian processes as well, but they call it kriging, and they have this largely different vocabulary, so you might want to look for some of those terms. Okay, great. It turns out that what I was working on had been done before in the geostatistics community, but I simply didn't know, because the overlap in vocabulary was relatively small. So what I did here is I took a document about kriging on the left and a document about Gaussian processes on the right, both introductory documents, and I counted the number of times each word was used in each document. These are ordered from most probable to least probable: kriging is used with the highest probability in the document on the left, then covariance, then mean, then estimate; on the right, we've got Gaussian, then regression, then covariance. There's not actually much overlap, and yet these documents are talking about the same thing. There's a small amount of overlap, namely covariance and matrix, but not much, and the advantage of statistical topic models is that they actually get around this problem. So long as there is some overlap in vocabulary, even if different terms are being used, we will be able to infer that these documents are actually talking about the same thing, and therefore if I were to search for stuff to do with Gaussian processes, some kind of search engine based on statistical topic modelling could also pull up a document about kriging, even though the direct overlap is pretty small. Okay, so I said that topics are related to words. What do I mean by that?
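The word-counting comparison described here can be sketched in a few lines of Python. The two "documents" below are tiny made-up stand-ins for the kriging and Gaussian-process introductions on the slide, not the real texts:

```python
# Count word frequencies in two toy "documents" and compare their most
# probable words -- mimicking the kriging vs. Gaussian-process example.
from collections import Counter

kriging_doc = ("kriging covariance mean estimate variogram spatial "
               "covariance kriging matrix prediction").split()
gp_doc = ("gaussian regression covariance process kernel matrix "
          "gaussian prior likelihood prediction").split()

def top_words(doc, n=5):
    """Empirical word probabilities, ordered most to least probable."""
    total = len(doc)
    return [(w, c / total) for w, c in Counter(doc).most_common(n)]

overlap = set(kriging_doc) & set(gp_doc)
print(top_words(kriging_doc))
print(overlap)  # only a few shared terms despite the shared subject
```

Even in this toy version, only a handful of terms (covariance, matrix, prediction) are shared, which is exactly why raw word counts fail to connect the two documents.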
Basically, what I mean is that a topic is a probability distribution over words. So on this slide I've got four different topics represented in four different colours, four columns, and each of these topics is a probability distribution over all words in some vocabulary. I've ordered the words within each topic from most probable to least probable. In the topic on the left we've got human, genome, DNA, genetic, genes. When we look at that collection of words, we can say, all right, this is probably a topic about genetics. That seems reasonable. Similarly, if we look at the next topic — evolution, evolutionary, species — this is probably about evolution. And computer, models, information, data, computers: this is probably something to do with computing. So in some sense, the fact that the words human, genome, DNA, and genetic have high probability in this column here on the left, this pink topic, means that this topic is probably about genetics. We can infer that from looking at these words. And that's what I mean by a topic: a specialised probability distribution over words. Okay, so how are topics and documents related? Well, the basic idea — I said this before, and I'll repeat it to hopefully make it clearer — is that when you have a document, you can basically assume that that document consists of some topics. So in this case, there's an article here; I'll just read you a little bit of it. "How many genes does an organism need to survive? Last week at the genome meeting here, two genome researchers with radically different approaches presented complementary views of the basic genes needed for life. One research team, using computer analyses to compare known genomes, concluded that" blah, blah, blah.
So when we read this, we say, oh, it sounds like this is about genetics and evolution and computer analysis, and we can tell this from looking at the words in this document. So what I've done over on the right-hand side of the screen is represent the degree to which each of the topics from the previous slide is used in this document. On the previous slide I had four topics: one about genetics (human, genome, DNA, genetic), one about evolution, one about disease, and one about computers. For this particular document, I've got a little graph that shows how much each of those topics is represented. The red topic, the one to do with genetics, occurs a lot: this document really is about genetics. The yellow topic, about evolution, is represented much less. And then we've also got the blue topic, about computational analysis. But this article isn't talking about diseases, so that green topic just doesn't occur at all. I've also coloured various words to indicate which topic I believe they're from. Words like genes, genomes, and sequences are coloured in red because they're probably from the genetics topic, and words like computer, predictions, numbers, computational are coloured in blue because they're from the computational analysis topic. So in this way I'm relating words to topics and topics to documents, by saying that a document has some document-specific distribution over topics, and that distribution over topics then determines which words are used in this particular document. So how does this all fit together?
Well, the way people actually look at these kinds of things in the world of Bayesian latent variable models and statistical topic modelling is that they assume that some statistical process generated the observed data we're dealing with. So in the case of Debian mailing lists, the observed data would be particular mailing list posts. In other words, we have a bunch of documents, they contain particular words, and we want to analyse them and find out what topics are represented in the collection as a whole and in each particular document. In order to do this, we're going to assume that these documents were not generated by us sitting at our computers writing messages; we're going to assume that they were generated according to some statistical process. Then, given that statistical process, we can use the principles of statistical inference to invert the process and actually learn the things we don't know, namely which topics are used in the collection as a whole and in particular documents. So I'm going to talk you through the particular generative process that corresponds to probably the best-known statistical topic model. This generative process basically explains how, if you have a set of documents, those documents could have been generated statistically. Okay, so in order to generate a collection of documents, we first generate a set of topics. There might be a hundred of them, two hundred, a thousand, whatever — I've got four over here, the same four I was looking at before. We're going to generate a set of these specialised probability distributions over words, and these topics basically say what this whole collection of documents is about. So this set of topics in some way describes the collection of documents.
In this case we've got one topic about genetics, one about evolution, one about disease, and one about computers. We're going to generate these topics — in fact you would generate them by drawing from a particular distribution; I'm not going to get into the math behind that, but if you want the exact mathematical details you can ask me afterwards, or not, whatever. Having generated this set of topics, we can then go and generate individual documents. And just to reiterate: real documents aren't generated like this, but we assume this generative process so that we can then invert the whole thing using statistical inference and infer the things we want to know. Okay, so to generate a particular document, we start by generating a distribution over topics for that document. Here we've generated a distribution over topics that says, okay, use the red topic with fairly high probability, a little bit of the yellow topic, and a medium amount of the blue topic. Having done that, we can then run through and generate each word in the document as follows. To generate a word, we start by picking a topic from our distribution over topics. Here we've picked the red topic, which makes sense, because this document assigns a higher probability to the red topic than to the other topics. Then we can go and look at the red topic and say, now let's generate a word from that topic — let's pick a word. And in this way, given a document-specific distribution over topics, we can run through and generate all of the words in the article.
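A minimal sketch of this generative process, assuming a tiny made-up vocabulary and two hand-specified topics. Drawing the document's topic distribution from a symmetric Dirichlet is one standard choice, not a detail given in the talk:

```python
# Sketch of the topic-model generative process: topics are distributions
# over words, each document draws its own distribution over topics, and
# each word is drawn topic-first. Vocabulary and topics are toy stand-ins.
import random

random.seed(0)

# Pre-specified "topics": probability distributions over a small vocabulary.
topics = {
    "genetics":  {"gene": 0.5, "dna": 0.3, "genome": 0.2},
    "computing": {"computer": 0.5, "model": 0.3, "data": 0.2},
}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def dirichlet(alpha, keys):
    """Draw a distribution over `keys` from a symmetric Dirichlet(alpha)."""
    g = {k: random.gammavariate(alpha, 1.0) for k in keys}
    total = sum(g.values())
    return {k: v / total for k, v in g.items()}

def generate_document(n_words, alpha=1.0):
    theta = dirichlet(alpha, topics.keys())   # document's topic distribution
    words = []
    for _ in range(n_words):
        z = sample(theta)                     # pick a topic...
        words.append(sample(topics[z]))       # ...then a word from that topic
    return theta, words

theta, words = generate_document(10)
print(theta, words)
```

Inference then runs this story backwards: given only the words, recover the topics and each document's theta.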
What we end up with at the end is something like this: we've generated a set of documents in this particular way, and I've coloured in some of the words according to which topic they were generated from. This is the underlying generative process that's assumed, which is then inverted using statistical inference. So at inference time, what do we want to do? Well, as I said before, for real data — given a bunch of posts to a Debian mailing list or something — we don't know what topics best characterise the dataset; we haven't got a clue. Furthermore, for each particular post, we don't know which topics were used in that post. So what we have looks like this: we have a collection of documents, we can see what words are in them, and we're going to assume they were generated by some topics and some document-specific topic distributions, but we don't know the contents of those topics or of those document-specific topic distributions. The goal of statistical inference is to say, okay, given this collection of documents, let's learn all of these things. The way this works in practice, when we're doing statistical topic modelling, is as follows. I'm going to run through a really simple algorithm. Although it seems kind of hacky and heuristic, it's actually not at all: there's some very nice mathematics behind it showing that if you do this, you will converge to the right point of the distribution. It's basically a technique that says, given a collection of documents, here are the topics that characterise the collection, and here are the topics used in each document. What we do is start by saying: for every single word in our collection of documents, let's just randomly guess a topic. One of the things about this
algorithm is that it assumes a fixed number of topics. You tell it at the start: okay, use 200 topics, or 150, or 100, whatever — that's the only thing you tell it. There are various ways to choose an appropriate number of topics, which I'm not going to get into, but let's say we're going to use 100 topics. Then what we do is run through every single word in the collection of documents, and for every single word we just randomly guess which topic generated it. In other words, we assign a number between 1 and 100 to every single word. We just guess — who cares, it doesn't matter, we have no information, we have no idea what's going on — we randomly guess which topic generated each word. Then, given this set of guesses, once we've got a guess for every single word, we can estimate how many times each word is used in each topic, simply by counting the number of times each word of each type is assigned to each topic. Similarly, we can guess the document-specific topic distributions by counting the number of times each topic is used in each document. Of course, since we've randomly initialised this thing by guessing at random which topic each word comes from, our initial guesses about these distributions are going to look like junk. Each topic will have a bunch of random, unrelated words thrown into it; each document is going to be about some complete junk set of topics; it's going to look ridiculous. But this is where it gets good. We're then going to repeatedly refine the guess for each word, some number of times. We look at each word in turn, and every time we look at a word, we say: okay, let's pretend we don't know which topic generated this word — that's what we're trying to work out — let's say we don't know what topic generated
this word, and let's pick a topic that's responsible for it. We do that by choosing a topic with probability proportional to two things. The first is the number of times that topic has been used in that particular document: we look at all the other words in the document and ask how often each topic has been used, and a topic that's been used many times in that document gets a higher probability than one that's hardly been used at all. We trade that off against the second thing, the number of times that word has been assigned to that topic: across all documents, we look at all the other occurrences of that word and ask how many of them have been assigned to this topic, so topics that have been used to explain the same word in other documents many times also get higher probability. If we keep doing this — repeatedly saying, okay, pretend we don't know the topic that generated this word, and guess a topic according to this probabilistic scheme — then after enough iterations we converge to a point where the distribution over words for each topic and the distribution over topics for each document are no longer random. They actually start to describe something that makes sense. If we look at the topics, they'll look like this kind of thing: human, genome, DNA, genetic, words like that all together in a single topic; words like computer, models, and information in another topic. Similarly, we'll see that a topic about disease isn't used in this article, because the article doesn't use words to do with diseases. So if we repeatedly run through this very simple algorithm, we go from something that looks like this, where we
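The resampling loop just described — random initialisation followed by repeatedly "forgetting" and re-guessing each word's topic — can be sketched as follows. The toy documents, the number of topics, and the smoothing constants alpha and beta are my own choices for illustration, not values from the talk:

```python
# Bare-bones sketch of the described resampling algorithm (collapsed
# Gibbs sampling for a topic model), on toy data with a tiny vocabulary.
import random

random.seed(1)

docs = [
    "gene dna genome gene dna".split(),
    "computer model data computer model".split(),
    "gene dna computer data genome".split(),
]
K, alpha, beta = 2, 0.1, 0.01           # topic count and smoothing (assumed)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Step 1: randomly guess a topic for every word, keeping running counts.
z = [[random.randrange(K) for _ in d] for d in docs]
n_dk = [[0] * K for _ in docs]                     # topic counts per document
n_kw = [{w: 0 for w in vocab} for _ in range(K)]   # word counts per topic
n_k = [0] * K                                      # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Step 2: repeatedly "forget" each word's topic and re-guess it with
# probability proportional to (topic use in this doc) x (word use in topic).
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][j] + alpha) *
                       (n_kw[j][w] + beta) / (n_k[j] + beta * V)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# After many sweeps, related words tend to cluster into the same topic.
for k in range(K):
    print(sorted(n_kw[k], key=n_kw[k].get, reverse=True)[:3])
```

The counts n_kw and n_dk, once the sampler has settled, are exactly the "distribution over words for each topic" and "distribution over topics for each document" the talk refers to, up to normalisation.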
haven't got a clue, to something that looks like this, where the topics, these specialised probability distributions over words, actually group related words together. So that's really it: documents use particular topics, and documents that use the same topics are talking about similar things. That's how topic modelling works.

People have been doing this stuff for years, for various different types of data; primarily, however, newspaper articles. These things have been used to analyse New York Times articles, to analyse articles from Science, the magazine, and articles from the Neural Information Processing Systems conference. They're typically used to look at really structured, academic kinds of data. So what I'm interested in doing is actually using these models to look at Debian mailing lists. And now what I'm going to do is talk you through some initial experiments that I ran. I'm going to tell you what I'm using in terms of data, and the kinds of topics that I'm finding when I actually run these things.

Okay, so the initial data sets. I took several different mailing lists; in the end I decided only to use Debian Project and Debian Women as my starting point. The reasons why I chose these two: I chose Debian Women because it's probably the mailing list I'm most familiar with, and I chose Debian Project because it's got a huge number of messages on it and covers a really wide range of topics. So I stripped all quoted text, which was painful. I had some horrible regular expressions; it took me an entire evening, there was much swearing, it was not at all a pleasant experience. But nonetheless I stripped almost all quoted text and signatures, because I don't want that to be biasing the analysis. Then I looked at the Debian Project mailing list, which consists of 19,000 messages. I guess I don't have any commas in that number, but it's over a million words, so it's a lot of data: over a million words, and roughly 8,000 words maximum per message. So the longest post, and this is with quoted text and signatures stripped, is 8,000 words. Yeah, there's a lot of data here. In contrast, Debian Women: 4,000 messages total, and 1,500 words maximum per message, and I'd probably be willing to bet that I either wrote that one myself or regret it... anyway. So I thought I'd use these two mailing lists just to see what's going on.

So here are some topics. I ran with 100 topics on both of these data sets separately, and these are just four topics that I picked out pretty much at random. One thing I didn't do is remove spam from these lists, and as a result many of my topics were just spam. That's one thing I'm dealing with at the moment: actually running things through a spam filter before I even start. So, ignoring spam, I just picked some topics randomly from each of these two lists, and these are the kinds of things we see. In Debian Project we see things like package, packages, install, apt, get. These words are all related to each other, and they're all about packaging and apt. Then we see Ubuntu, Debian, patches, derivatives, long-term support; all right, those are related, okay, I can believe that people are talking about that. Then we see this one, which is to do with the new maintainer process: we've got NM, process, applicant, DAM, FD. Then at the end we've got FTP master, queue, packages, upload, team. You know, it makes sense that these are the kinds of topics we're seeing on Debian Project.

If we look at Debian Women, here are four randomly chosen topics from Debian Women, and we see women, men, female, male, man; okay, not surprising. Website, page, site, work, Debian Women: well, this is kind of interesting, because when we started Debian Women, one of the things we put a lot of time into when we first started
out was actually getting the website up and running: what did we actually want on there for people to look at when they searched for Debian Women? And we see this represented in one of the topics the algorithm is discovering. We're also seeing this one over here: post, culture, response, posts, behaviour. Again, not surprising; there was a lot of discussion in the early days about the way people responded to posts, the way people posted and behaved on mailing lists, and stuff like that. And again, we're seeing that represented in the topics being discovered. Then we're also seeing stuff to do with the new maintainer process. Now, there are different words appearing in the top words of this topic, because the emphasis of the mailing list is different: although people are talking about the new maintainer process on both Debian Women and Debian Project, they're talking about it with different emphases, and so we're seeing slightly different words appearing in the top words for each topic. But this is really nice; it's telling us that we can get a sense of what these two lists are about by looking at these topics.

Having done this, we can do some other really cool stuff. I took various different topics and plotted them over time. So here we've got packages, package, Debian maintainers, maintainer, upstream. You see this big peak around 2000, then it sort of dies off, and then there's another big rising peak towards 2010. That's kind of interesting, all right. Here, this one is to do with Debian developers and upload rights: people, DDs, voting, rights are the top words or phrases that appear in this topic. And we see that this topic is pretty constant, and not really used at all, until this massive peak somewhere around 2008. Okay, that's kind of cool too; that's telling us something about what people were discussing on Debian Project in 2008-2009. Then here we've got discussion about the new maintainer process; again, there was a big peak, sort of somewhere around 2007-2008, and then it trails off after that. Debian women: this is a topic that was found not in the Debian Women mailing list but in Debian Project, and it's got a massive peak around 2005, which is when we were doing a lot of Debian Women stuff. The words in this topic are women, men, Debian, Debian women, and so on. So we can simply look at this graph, and yes, it really is highlighting what actually happened in Debian at that point. That's mainly what I wanted to show you.

I've got a few points about future research directions. One thing I'm kind of interested in is cross-language analysis; again, a little picture and quote from this NSF brochure. I'm not going to bother reading the quote to you, but I kind of liked it. So, cross-language analysis: suppose we actually want to do this kind of thing across multiple languages. Suppose we want to say that, rather than a topic being a specialised distribution over words in just one language, a topic is a collection of distributions over words in multiple languages. So here we have space, mission, launch, satellite, NASA and spacecraft, and I don't actually know any of these other languages, but I'm sure some of you do; you can see that the words in the other languages are also related to space and spacecraft and stuff like that. So even though we've got multiple different languages here, multiple different specialised distributions over words, they're still all talking about the same thing, but in different languages. Here's another example with poetry, literature and poems. These are taken from Wikipedia, by the way, as in we ran one of these models on Wikipedia. To get these kinds of polylingual topics you need aligned documents, and there are two ways you can do this: you can either
have documents that are direct translations of each other, and that's great, that works very well; that's known as a fully parallel corpus. But the problem is that these direct translations are really expensive to produce, and they're relatively rare. We actually did some really interesting work where, as well as defining a model that works on these kinds of data, we showed that you don't need that many of these aligned documents: you just need a few parallel "glue" documents, and less than 25% of the entire set is sufficient to obtain these nicely aligned topics across multiple languages. So one thing I've been thinking about is: can we use documentation, things where we do have direct translations, where work has been done to translate stuff into multiple different languages, as glue documents for simultaneously analysing the content of mailing lists that are in different languages? We don't have aligned posts on mailing lists, but we do have documentation which is aligned; can we use that to do this analysis?

Okay, another thing I'm interested in is simultaneously finding groups of people and topics. Without any prior information, can we look at Debian Project and say, this group of people is working on something together, and these topics characterise what they're working on? Can we do this by simultaneously discovering groups of people and groups of topics? The answer is yes, that can be done, and it was done by my research group a few years ago; actually not by me, but by other people in the research group. So one of the things I'm interested in doing is applying those techniques to Debian mailing lists, to say, without any prior knowledge, can we work out who's doing what within Debian, and who they're communicating and collaborating with? This is a little picture from the Enron email; we ran this on the Enron email collection, which is publicly available, and you see really interesting groups of people popping out, and particular things that you otherwise couldn't see at all. Anyway, that's it, I'm done.

I'm interested in the practical applications of this: could this be used for spam filtering? What? Could this be used for spam filtering? Um, maybe: Viagra, enlargement... Sorry, maybe use the microphone. Okay, I get it: the question was, what are the practical applications, and could this, for instance, be used for spam filtering? Maybe, but to be honest, really simple, naive techniques work very well for spam filtering, so... Other applications this could be used for: creating some kind of online browsing system for Debian mailing lists. Suppose I'm interested in browsing Debian mailing lists, and I put in some search keyword or something, and it pops up a bunch of topics that are relevant to it; then what I could do is click on each of those topics, and it would show you which documents across all Debian mailing lists are related to that particular topic. That kind of stuff would be very easy to do, and when I say very easy, I mean I'll tell somebody else how to do it.

Okay, I had seen a similar talk a couple of years ago at the Python conference, and the guy who presented it was actually analysing the dynamics of threads in the mailing lists, and it came out with quite interesting results, actually. For example, that when the benevolent dictator in the Python community intervened in a thread, the thread would stop; and it was able to recognise the most authoritative people in the community. I wonder if you could do something... Absolutely, and one thing I'm really interested in as well is whether people's tone, the language they use, differs depending on who they're speaking to. That's something that has been well studied: yes, people use different language if they're talking with somebody who they perceive
to be more authoritative, as opposed to somebody who they perceive as being less authoritative. So yes, it's absolutely something you could study. And particularly on mailing lists where there is flaming and other such stuff, I think it would be great. Yeah, good point.

So I had a question, but first I'm going to say something about the browsing, or practical applications: I thought, well, maybe your mailer could send most of the mails in a list to a folder somewhere where you won't look at them, and just put in your inbox the topics that you're interested in. That's not a question. My question was: you had some topics there, and then you had this one, men-something, and I noticed that in your word collection there was "men, women", "women, men"... So what I'm doing is, you know, I said that the algorithm works by running through every word and guessing a topic for it, then moving on to the next word and guessing a topic for that word, and repeatedly iterating through this. As a post-processing step, when I was generating these, I'm saying: if two adjacent words have the same topic assigned to them, then treat that as a phrase. And because I'm removing punctuation, something like "men/women" becomes the two adjacent words "men women", that kind of thing. That's all; this is just post-processing.

Hello, we want to ask two questions. Is this an open project, a free software project whose source we can look at somewhere? So, a couple of points about that: I'm going to be here all week, and I'm perfectly happy to teach people how to use this stuff; I have a really simple technique. The whole thing is available as part of a package, which I guess I will be packaging for Debian some time this week. And what is the maths behind all of this? So, we saw the Dirichlet prior; yeah, I mean, it's Bayesian statistics, specifically using Dirichlet priors, Dirichlet-multinomial distributions, Gibbs sampling, Markov chain Monte Carlo, that kind of stuff.

And the question is about the pre-processing you do, or whether you should do more: you remove the quoted text, but shouldn't you also conflate the plural and singular of the same word? Sometimes you have the plural and the singular... Here's why not. I have a really good example of why you don't want to conflate the plural and singular of the same word: if I have an article that's about "apples" and another article that's about "apple", one of these is probably about fruit and one of these is probably about computers. So the plural and singular can have very different semantics, and as a result you don't want to do word stemming, because otherwise you're not going to see any of these patterns appearing; it's going to lump articles about computers together with articles about fruit. That's why you don't want to do word stemming. In terms of other pre-processing: yes, you want to remove stop words, words like "the" and "and" and "of". In fact, with a particular variant of this stuff that I work with, you don't need to do that, because they did some cool maths showing that if you do the right thing with Bayesian priors, it handles stop words automatically. But yeah, you can remove stop words, you can remove punctuation, all that kind of stuff. The amount of pre-processing you do is kind of up to you; it depends on how lazy you are.

So I've got a question about how you come up with the set of topics. I'm sort of assuming that there may be a prior distribution over topics that are interesting, and I'm wondering, with one of the corpora, how sensitive the clustering is to the number of topics, and in particular, when you're looking at multiple corpora, especially IRC versus email and so on, whether there is a way that you can sort of get to an
automatic clustering of topics that are interesting. Yes. Obviously, in this particular variant of the model you specify the number of topics a priori. If you specify too many topics, some just won't really be used: if you specify 500 and the model only actually needs 100 to account for what's going on in the data set, it's only really going to use 100. That's one point. Another point, in terms of which topics are interesting: I have another variant of this model, called the cluster-based topic model, that simultaneously clusters topics as well as learning them. So as well as learning which topics are used in which document, and which words belong to which topics, it also clusters the topics into a number of clusters that it automatically determines. And so you can see that topics about women and men and that kind of thing, and about women and websites and stuff like that, would probably be clustered together in the same cluster.

Hello? You seem to learn a fixed number of topics; how can your method handle a new topic which appears? Right, so there, I mean, okay, there are two variants of this kind of model. You can use this version, which has a fixed number of topics and uses a Dirichlet distribution, or you could use the non-parametric version, which uses a Dirichlet process prior and would automatically learn the number of topics. It's just slower, which is why I'm using a fixed number of topics here, but yeah, you could use a DP version and just automatically generate new topics.

I think we maybe only have time for one more question. Could you use this to build a non-parametric flame war detector, or some kind of idiot detector, and then, when the alarm goes off... So, here's the thing: this is a pretty small number of topics. It's few enough topics that I can go through and say, this topic is really about something of content, this topic is not. Because there are only like 100 topics or whatever, that's amazing, I can say that all of these topics are about complete junk, stuff that is just absolutely ridiculous. Then I can say which documents use those particular junk topics with a really high probability, and then I can look at the authors of those. So yes, we could do something. What I was thinking was that you could say "you've set off our alarm" in a private email from a machine, so that you couldn't really take offence. And at the end of the year you could send a wooden spoon to the person, so they might want to do something about it. So this is actually, I mean, I know we're all kind of joking, but this is actually a reasonable problem that I'm really interested in: automatically detecting flame wars taking place, doing this in a completely automated fashion. It's super interesting, and it's a really tough problem. But yeah, cool. Thank you very much.
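[Editor's note: the sampling procedure described in the talk (randomly assign a topic to every word, then repeatedly re-guess each word's topic in proportion to how often each topic is used in that document and how often that word is assigned to that topic overall) can be sketched as a toy collapsed Gibbs sampler. This is an illustrative reconstruction, not the speaker's code; the `alpha`/`beta` smoothing parameters and the tiny example documents are my own assumptions.]

```python
import random
from collections import defaultdict

def gibbs_lda(docs, num_topics, num_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for a fixed-topic-count topic model.

    docs: list of documents, each a list of word tokens.
    Returns (assignments, topic_word): a topic index for every token,
    and the per-topic word counts used to read off the learned topics.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size

    # Step 1: randomly guess a topic for every single word.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count how often each word is assigned to each topic, and how often
    # each topic is used in each document.
    topic_word = defaultdict(int)   # (topic, word) -> count
    doc_topic = defaultdict(int)    # (doc index, topic) -> count
    topic_total = defaultdict(int)  # topic -> total tokens assigned
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            topic_word[t, w] += 1
            doc_topic[d, t] += 1
            topic_total[t] += 1

    # Step 2: repeatedly revisit every word, pretend we don't know its
    # topic, and re-guess it with probability proportional to
    # (topic use in this document) x (this word's assignments to the topic).
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove this token's current assignment from the counts.
                topic_word[t, w] -= 1
                doc_topic[d, t] -= 1
                topic_total[t] -= 1
                # alpha and beta are Dirichlet smoothing parameters.
                weights = [
                    (doc_topic[d, k] + alpha)
                    * (topic_word[k, w] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                topic_word[t, w] += 1
                doc_topic[d, t] += 1
                topic_total[t] += 1

    return assignments, topic_word

# Tiny artificial corpus: some "packaging" posts and some "website" posts.
docs = [["package", "upload", "package", "apt"],
        ["apt", "package", "upload"],
        ["website", "page", "site", "website"],
        ["page", "website", "site"]] * 5
assignments, topic_word = gibbs_lda(docs, num_topics=2)
```

On separable toy data like this, the two topics typically end up grouping the packaging words together and the website words together, mirroring the package/apt and website/page topics found on the real lists.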