I'm going to talk about cooperation and social status in free and open source software projects. I did an empirical study with data gathered from the Debian project, so this talk is about what I did for my master's thesis in sociology. As I said, it's based on empirical data from the Debian project, and I will try to focus on things that I think might be interesting to you as Debian developers. The title of the talk is the English translation of the title of my thesis, but probably a better title for this presentation would have been something like "Size and Evolution of Debian". I will mostly talk about how Debian grew in the number of people contributing, and what size it is now. And because this beamer, as we call it in German, is probably going to break down soon, I also put the presentation up on people.debian.org. I posted the link to IRC just before the talk, so if you want, you can follow along on your own screen if it's hard to read.

The first part is titled "Data Mining Debian". I will give a short overview of how I gathered all this data and how I put it into a format that makes analysis possible. Then I will briefly present a categorization of Debian contributors: from the data, I derived four categories of contributors. After that I will present some graphs about the size and evolution of Debian and about developer activity: when most developers are active and when activity goes down, during the day, during freezes, and what happened after Ubuntu was created, for example. And in the last part, which is the biggest part of the thesis, I will talk about cooperation in free software: I will briefly present a theoretical framework and some data that backs it up.

My data sources were mailing list archives and package uploads. I gathered package uploads from the debian-changes and related mailing lists.
At least in the beginning of Debian, there were some uploads that are only recorded on port-specific changes mailing lists. I gathered all bug logs from bugs.debian.org, but I'm quite sure that quite a few old bugs are missing there: probably when they migrated to the current system, they only migrated bugs that were still open, and the bugs that had already been fixed by then are not on bugs.debian.org. I tried to use popcon statistics, but in the end I didn't use them for any analysis, because none of the results I tried to get were significant. All in all, it's about 38 gigabytes of data that I used. The data set contains data from April 1995 until November 2007; in November 2007 I cut it off and did not update anymore, because it's hard to work on a constantly moving data set.

So I downloaded the raw data, and then I wrote specific Python scripts which extracted the things I was interested in into a big relational database. What was collected? From mails: the person who wrote it; what time it was, and in which time zone, if that information is available in the mail headers; which list, or to which bug, the mail was sent; how many words are in it; and references to other mails. Then, for uploads and bugs, some more data was collected. For uploads: the uploader and the signer; the other contributors, if the changelog lists more than one; the number of changelog entries per person; and the package name. For bugs, I collected who submitted it and who fixed it, which package it was against, what the severity of the bug was in the end, how it was closed, whether by a new upload or by sending a mail, and how many mails are in the bug log.

There were of course some problems doing this. The biggest problem was that there were duplicated entries for the same person. So I had a table in my database listing all the persons.
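Before I come to the deduplication problems, here is a minimal sketch of what the extraction step just described might look like. The table layout, field names, and the sample mail are my own invention for illustration, not the actual schema or scripts from the thesis:

```python
import email
import sqlite3
from email.utils import parseaddr, parsedate_tz

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE mail (
    author TEXT, sent_at TEXT, tz_offset INTEGER,
    target TEXT, word_count INTEGER, in_reply_to TEXT)""")

def ingest(raw_mail: str, target: str) -> None:
    """Extract the fields of interest from one raw mail and store them."""
    msg = email.message_from_string(raw_mail)
    _, addr = parseaddr(msg.get("From", ""))
    parsed = parsedate_tz(msg.get("Date", ""))
    # parsedate_tz gives the UTC offset in seconds; convert to hours.
    tz = parsed[9] // 3600 if parsed and parsed[9] is not None else None
    body = msg.get_payload() if isinstance(msg.get_payload(), str) else ""
    conn.execute("INSERT INTO mail VALUES (?,?,?,?,?,?)",
                 (addr, msg.get("Date"), tz, target,
                  len(body.split()), msg.get("In-Reply-To")))

raw = """From: Jane Doe <jane@example.org>
Date: Mon, 12 Mar 2007 10:15:00 +0200
In-Reply-To: <xyz@example.org>

I can reproduce this bug on sid."""
ingest(raw, "bug#412345")
row = conn.execute("SELECT author, tz_offset, word_count FROM mail").fetchone()
print(row)  # ('jane@example.org', 2, 7)
```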
And I had to find out which entries really belong to the same person, because people were using different email addresses but were the same person. Then there were some contributions done under role addresses, which I had to skip, because there's no single real person behind a role address, or the person behind it can change over time. And there were some unparsable mails, but those were mostly spam.

So I tried to deduplicate the entries as much as possible. I used GPG UIDs: if two email addresses belong to UIDs on the same key, then I said, OK, this is apparently the same person. I think that's quite a safe method, but it only reduces the duplicates by a very small amount, so it was not that useful. Then I used some open source software written for deduplication, called Febrl. It helped me to normalize the data, that is, to split first names, last names, and nicknames, and then deduplicate based on, for example, common real names and common nicknames. I also used parts of the message ID: if you have the same host in the message ID, it's a hint that two entries could belong to the same person; likewise domain names, and whether they wrote to the same list or contributed to the same packages. But I reviewed most of them manually, so it was a lot of work.

And that's one important thing: I only did this for those who submitted at least one bug or contributed to at least one package. So I did not deduplicate the huge number of persons who only wrote mails to mailing lists, and all the further analysis is restricted in the same way. Yeah? Yes?

[Audience question about developer addresses] Actually, the developers were not really the problem, because they mostly use one email address, or an easier-to-recognize scheme. For example, you have this plus in the email address, where you can easily say, OK, this plus is a special thing, I can just cut it off.
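The key-based merging step mentioned above can be sketched as a small union-find over addresses that share a GPG key. The key IDs and addresses here are hypothetical; the real input was keyring data:

```python
# Map each GPG key to the email addresses found in its UIDs
# (hypothetical sample data; the real input was the Debian keyring).
key_uids = {
    "0xAAAA": ["joe@example.org", "joe@debian.org"],
    "0xBBBB": ["jane@example.net"],
}

# Simple union-find: addresses on UIDs of the same key are one person.
parent: dict[str, str] = {}

def find(a: str) -> str:
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path halving
        a = parent[a]
    return a

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

for addrs in key_uids.values():
    for other in addrs[1:]:
        union(addrs[0], other)

# Both of Joe's addresses now resolve to the same canonical person...
print(find("joe@example.org") == find("joe@debian.org"))   # True
# ...while Jane stays separate.
print(find("joe@example.org") == find("jane@example.net")) # False
```

The same structure extends to the fuzzier hints (shared message-ID hosts, common names) by adding more `union` calls, which is why the manual review afterwards was still necessary.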
The bigger problem is bug submitters, who are not so used to all this technology, who use a nickname one time and a different address another time, and things like that. Yeah, but that's true. I didn't include the carnivores stuff because I did not know how it's accessible.

OK, and then I aggregated the data: I summarized it by month and person. So you have a table which lists the month, the person, what they contributed to Debian in that month, and the total amount of what they contributed. This was processed with Python scripts. And because the first data gathering step put everything into a relational database, where I had all these relations between persons and bugs and packages, there were some quite complex SQL queries to be run over the whole set. Running the whole aggregation for all people, and I also did it for the mailing list people, took about 20 hours on 40 nodes of a computer cluster at the University of Bern that I could use for this.

Then I also created a second data set about the bug reports, with one entry per bug report, containing information about the submitter and the fixer, what they contributed to Debian, some additional information about the bug, and about the interaction between submitter and fixer: whether they interacted before or not, and on what.

So that's just a short overview of the final data set structure for the data set about project activity: we have the person, then the period, when it was last updated, then bugs submitted, bugs fixed, bug activity, how many mails, how many words in mails, and so on. I won't go into too much detail about that. So that was the data mining part. Next is just one slide about the categorization of Debian contributors.
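As an aside, the by-month-and-person aggregation described above could be expressed with a query along these lines. The table names, columns, and sample rows are invented for the sketch, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE upload (person TEXT, month TEXT);
CREATE TABLE bug    (submitter TEXT, month TEXT);
INSERT INTO upload VALUES ('alice', '2007-01'), ('alice', '2007-01'),
                          ('bob',   '2007-02');
INSERT INTO bug    VALUES ('alice', '2007-02'), ('carol', '2007-01');
""")

# One row per (person, month) with their counts of each activity,
# analogous to the aggregated activity table described in the talk.
rows = conn.execute("""
    SELECT person, month,
           SUM(n_uploads) AS uploads, SUM(n_bugs) AS bugs
    FROM (SELECT person, month, 1 AS n_uploads, 0 AS n_bugs FROM upload
          UNION ALL
          SELECT submitter, month, 0, 1 FROM bug)
    GROUP BY person, month
    ORDER BY person, month
""").fetchall()
for r in rows:
    print(r)
# ('alice', '2007-01', 2, 0)
# ('alice', '2007-02', 0, 1)
# ('bob', '2007-02', 1, 0)
# ('carol', '2007-01', 0, 1)
```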
So when I'm speaking of Debian contributors in this context, I mean everyone who has contributed to at least one bug report or one package. You have to draw a line somewhere: it was too much to also include all people writing on mailing lists, and you could argue you should then also include forums, or IRC, or whatever. So I did it that way.

Then I made four categories, and at any point each person belongs to only one category. They build on one another; each one means a bit more involvement than the one below. The lowest category, or status, is that you're simply a contributor. That means you have at least one entry on bugs.debian.org. Mostly those are bug submitters who submit one, two, or perhaps three bugs, and they are mostly unknown to other project members. The next step is that you become what I call a simple developer. That means you contributed to at least one package: you have a record in some package changelog. Those contributors are probably known to some people in the project, for example to their sponsors, or if they're working in a team, to other team members; but most members of the project don't know them. Then there are official developers, what we call DDs. They can sign and upload packages, and I think most interaction partners will recognize that they are dealing with an official developer. That's not true for all of them: you don't know unless they are using the debian.org address, it's easier to recognize them if they are known, and some are more well known than others. Then there is a fourth category I call core developers: those 20% of developers that have contributed most to Debian in terms of package work. Together, these core developers do about 80% of the packaging work. Yes?

[Audience] So that says that 20% of our developers are core developers.
No, that 20% are core developers; that's the definition of what a core developer is. And it turns out that the classic 20–80 relationship actually holds for Debian.

[Audience] OK, so you just defined core developers as 20% of all developers. You didn't go through and say, you're a core developer, you aren't, you are, you aren't, and then in the end divide it and come up with something close to 20. You just said 20% is approximately right.

Thank you, yes. And I think most of those core developers are actual DDs, but there are some examples of core developers that are not DDs. And most of them are well known among other project contributors; I think that's quite true for most of them.

[Audience] Sorry, one second. I'm wondering on which time span this 20–80 distinction is based. Because I guess that from release to release, who is a core developer and did 80% of the work might change.

Yeah. You could do that, to have this time dimension in it, but I actually didn't. It's just for categorization, and I tried to keep it as simple as possible, so I didn't do it over time. For those analyses where it matters, I only check whether they have already reached this stage. So once you're a core developer, you somehow stay a core developer until the end. From the sociological point of view, this also makes some sense: if you have gained the reputation of being a core developer, it will not decrease that much even if you don't actually do that much work anymore.

The next part is some graphs and tables about the actual data. That's how big the data set actually is: the whole data set, including mailing list posters, is 236,000 persons. Or rather, we don't know how many there really are, because that part isn't deduplicated. But I think if you were to deduplicate it, and the ratio it shrinks by is about the same as for the smaller data set, you can subtract at maximum 10%, I think.
So the deduplicated data set is 32,000 persons. Of those, 7,200 were still active in November 2007. Activity here is defined as: they did not have a period of inactivity longer than three months, so they were active in the last three months. And of those 32,000, 30,000, which is 92%, are contributors, and 2,500, which is 8%, are developers. Those developers are split again into 1,000 simple developers, 827 official developers, and 500 core developers. So this is the 20% number we talked about, which is 20% by definition. What you see here, and what you will see on the next slide in graphical form, is that the distribution of work is quite skewed: there's a huge number of persons contributing very little, submitting bug reports and adding things to bug reports, and there's a relatively small number, less than 2% of the deduplicated data set, who do a huge amount of the work. Yeah, Peter?

[Peter] I'm a little confused, looking at this and trying to think back to your definitions of the different developer categories, because it looks like simple, official, and core developers are disjoint sets here. I can't read the projector completely; are they actually disjoint?

Yes, they are disjoint. If you're an official developer, you're not counted in the simple developer category.

[Peter] Are core developers a subset, or another disjoint set?

Another disjoint set. So if you want the approximate number of actual DDs, in the sense we define it in the project, you have to add up these two numbers, and then you would have to subtract those few core developers that are not actual DDs. OK, thank you very much.

These are so-called Lorenz curves. I apologize: I took the graphs from my thesis, so the legend is still in German, but I will try to explain it. You can see, for different kinds of work, how the work is distributed.
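Such a Lorenz curve is straightforward to compute: sort people by how much they contributed and accumulate the shares. A toy sketch with invented numbers, not the thesis data, which also shows the 20–80 split from the categorization:

```python
from itertools import accumulate

# Invented contribution counts, sorted ascending as a Lorenz curve needs.
contributions = sorted([1, 1, 1, 2, 2, 3, 5, 8, 90, 120])
total = sum(contributions)

# Cumulative share of all work after each person, i.e. the Lorenz curve.
lorenz = [c / total for c in accumulate(contributions)]

# Share of the work done by the "bottom" 80% of people...
n80 = int(len(contributions) * 0.8)
print(round(lorenz[n80 - 1], 2))      # 0.1 -> bottom 80% do ~10% here
# ...so the top 20% do the remaining ~90% in this toy example.
print(round(1 - lorenz[n80 - 1], 2))  # 0.9
```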
The black line is development work, package work. You can see that these 80% of the persons are contributing those 20% of the work, and the last 20% of persons are contributing the whole rest. The skew of the distribution is a bit different for different kinds of work, but you see that all work is quite skewed.

That's another interesting graph: the temporal evolution. First the one on the left, which runs from the beginning until 2008. You see, for each month, how many people were active in a certain kind of activity. The black line is all persons that were active, whether bug reporting, contributing to packages, or whatever else; it goes up until here. The others go up as well. And this line here is the first release of Ubuntu. What you can see is that the number of active people is not a flat line, it goes up and down and up and down, but there's a trend of going down since then. What you can also see, and that's the line down here, is the number of people actually contributing to packages. It has become quite stable, and there's no downward trend visible here, or at least not as strong as up there. But there is a downward trend here, for the bug reporters. So the conclusion could be that we're mostly losing bug reporters, which means we're probably losing users because of Ubuntu, but we are probably not losing developers, or not many, because of Ubuntu.

The second graph shows, again, people contributing in each month, but only packaging work, and those lines are releases of Debian. You can see that before each release, the number of people contributing to Debian goes down. So we have a bottleneck before releases, where not so many people can contribute anymore. It's the count of people that did some kind of work on a package in that month.
So it's not the count of uploads, but the count of different people doing uploads. Is it clear what I mean? No, if one person does 10 uploads, it would not increase by 10; as it's only one person, it only increases by one. Yes.

[Audience] For instance, I did two uploads this morning, so that would count me as one.

Yes, it would just set the flag: Anibal was active this month.

[Audience] That also means that if a given number of people do NMUs just before a release, you count each of them only as one, right?

Everyone doing an NMU, I count as one, and if a single person does 10 NMUs, he's only counted once. Because I was interested in how many people there are, and not in how much work it is.

Then, as I had this time zone information, I tried to figure out the regional distribution of developers. We have some data about this from surveys, where people say, OK, I live in the US, or I'm in Switzerland, or whatever, but I tried to see if I can reproduce this purely from process data. And you can see that there are some peaks. There is a huge peak here around zero, so UTC, to plus two; this is mostly Europe. This is mostly America, and this here might be Japan and Australia and New Zealand. This is for mailing lists, and this is for packaging work. If you summarize it, you see that on lists, about 34% of the people are probably from America, about 55% from Western Europe, and about 5% from Australia and Japan. And for packages, it's even more Europeans and more Australians, and a bit fewer Americans.

And yes, FLOSS-POLS and FLOSS-US, these are the two big FOSS surveys, free and open source surveys, that were done: FLOSS-POLS is the one done in Europe by a European research group, and FLOSS-US is the one done by a US research group. You can see that with the US one, more people from the US filled it out, and the other one was more popular in Western Europe. The problem with those is that people could just go to the website and fill it in.
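Coming back to the time zone peaks for a moment: mapping the UTC offsets from mail headers onto rough regions might look like the sketch below. The boundaries are my own rough guesses for illustration, not the ones used in the thesis:

```python
from collections import Counter

# Rough bucketing of mail time zone offsets (in hours) into regions.
def region(utc_offset: int) -> str:
    if -9 <= utc_offset <= -3:
        return "Americas"
    if -1 <= utc_offset <= 3:
        return "Europe/Africa"
    if 8 <= utc_offset <= 12:
        return "East Asia/Oceania"
    return "other"

# Hypothetical offsets extracted from Date: headers.
offsets = [-8, -5, 0, 1, 1, 2, 9, 10]
counts = Counter(region(o) for o in offsets)
print(counts.most_common())
```

As the speaker notes, summer time shifts these offsets, so a single boundary per region is only an approximation.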
So you had no control over the distribution of people doing the surveys: they were not randomly selected among the developers, so it might be quite biased. But also, don't give too much weight to my numbers either, because there are problems like summer time and daylight saving time and things like that, and one could also argue that some of those here actually live in South Africa, or wherever.

So now to developer activity. There's another thing you can do with this time zone information: you can plot, over the week, on which weekdays and which hours of the day what amount of work is done. I think this one is Monday, so it goes from Sunday midnight to Monday midnight, and here is the weekend. You see activity going up during the daytime, during working hours, going down in the evening, and going down a lot at night, of course, and then up again. You also see, I think, that the peak is in the morning, and then the little dip is where people go for lunch. And there's a lot less activity on weekends; it's about equal to, or sometimes even less than, the work done on weekday evenings. You also see that Friday evening is not so popular.

The dotted line is package work, and you see that for package work it's a bit different: there are more peaks in the evenings, and there's much less decrease on weekends, so it's more evenly distributed. But you also see that quite a lot of package work is done during working hours. So there must be quite a lot of developers who are either students, who don't care about working hours, or who can do Debian work as paid work; they are somehow paid, because they are able to do it during working hours.

Then I tried to plot what I call developer careers, for each of those four categories. You can see how long they stay in the project, and how they reach the category, the status, they have in the end.
So most contributors stay for one or two months: they have one or two months of activity, and they disappear afterwards. Simple developers stay quite a lot longer, about 30 months, so more than two years; the median simple developer stays a bit more than two and a half years. When someone becomes an official developer, it increases a lot: they stay for more than 60 months in the median. And core developers stay even longer. You can also see that the point at which they reach their final status is between one year and about two years in.

What this graph does not account for, and I will show this in the next table, is that most people are not constantly active: they have periods of activity, followed by periods of inactivity, and then periods of activity again. So it's quite hard to say when a career is actually finished. There are even contributors who have periods of inactivity of, say, two, three, four, five years, and then they appear again and submit another bug. Yeah, Joey?

[Joey] Am I reading this right, that the steps show, for example, that a simple developer starts out as a contributor and stays a contributor for quite a long time before they actually develop something, whereas most core developers are contributors for a very short period of time before they start contributing?

Yes, I think you're reading that right. I think this little step here, which doesn't exist for the simple developers, is the time when they only contribute on mailing lists. But this is, yeah, I hope it's somewhat correct. If you have a lot of developers and you take a mean or a median over them, every career is somehow individual, and it's hard to fit into a general model. So probably this table is more informative: you can see how many periods they have in the mean.
So for all persons, the mean is three periods of activity in their career; the median length of a period is one month, and the median length of the inactivity between periods is two months. The number of periods increases up to the official developers, but core developers have fewer periods, and they are the only ones where the median length of the active periods is actually longer than the median length of the inactive periods. And the percentage here means: if it's 86, then in 86% of the months they were on the project, they were actually active. And this column is if you only count development activity.

OK, what follows now is a short overview of the theoretical framework I used for open source development. It's somewhat inspired by economic theory, and I have to say, if you don't know anything about economics, it's probably quite hard to understand, and I don't have the time to explain it in great detail now. It's also not the only framework you can use; there are other approaches that may be equally valid for describing open source development.

Some of you probably know this game called the Prisoner's Dilemma; this is economic game theory. What you see here is a game where people can either choose to cooperate with another person or to defect, that is, to not cooperate. In the sense of open source or free software development: if you are together on a project, you can either decide, oh, I won't do anything and let the others do the work, or you can choose to contribute to the project, to develop something, to help, and thereby cooperate with the project. If both choose to cooperate, more software gets developed, so both of them have a payoff of three. They actually have to invest something, so the time they invest is already subtracted.
And if both don't cooperate, they both have a payoff of one: they don't invest any time, so they have some free time to do something else and have fun. And if one cooperates and one does not, then the one who does not cooperate has the higher payoff, because he does not invest any time but profits from the software that gets developed; and the one who does cooperate has a lower payoff, because he made his investment but cannot profit from the work the other would have done. And if you assume rational actors who know this payoff structure, then you can show that they won't cooperate, so there's no free software. You can see that it doesn't matter what the other chooses; it's always better not to cooperate. Compare the payoffs: if the other cooperates, you get three for cooperating and five for defecting, so it goes up. And if the other does not cooperate, your payoff also goes up if you don't cooperate. So you will end up in the red square.

So what can we do to solve this? One way is to say, OK, we play again, and people know that this game is constantly played again. Free software is not a one-shot thing; you stay on the project for a longer time. And this can lead to cooperation. There's one strategy called tit for tat: the players start with cooperation, and in each following round, they do what the other one did the round before. So if one tries to gain the additional payoff from not contributing, he will be punished by the other one: there won't be any cooperation in the next round. And if you sum this up, you can show that there's a higher payoff in the end if you constantly cooperate. This strategy does instant punishment, but it can also forgive: if the other one starts to cooperate again, you will start to cooperate again as well.
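The iterated game and the tit-for-tat strategy just described can be simulated in a few lines, using the payoff matrix from the slide (3/3 for mutual cooperation, 1/1 for mutual defection, 5/0 otherwise):

```python
# Payoffs (row player, column player) for Cooperate / Defect.
PAYOFF = {("C", "C"): (3, 3), ("D", "D"): (1, 1),
          ("C", "D"): (0, 5), ("D", "C"): (5, 0)}

def tit_for_tat(opponent_history):
    """Start cooperating, then copy the opponent's last move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa; score_b += pb
        hist_a.append(a); hist_b.append(b)
    return score_a, score_b

# Two tit-for-tat players cooperate throughout: 3 points each per round.
print(play(tit_for_tat, tit_for_tat))    # (30, 30)
# Against a defector, tit for tat is exploited once, then punishes:
# round 1 gives 0/5, the remaining 9 rounds give 1/1 each.
print(play(tit_for_tat, always_defect))  # (9, 14)
```

Note that over enough rounds, mutual cooperation (30 each) beats the defector's total (14), which is the point the speaker makes about repeated play.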
But this does not scale to free and open source development as it is, because it only works if you interact with the exact same person again. On most projects, there are more than two persons, so you will interact with others. There are extensions of this game to more than two players, but the more players there are, the more fragile establishing cooperation becomes, because it only takes a small group of non-cooperators who can free-ride and gain a higher payoff. And if you stay with the assumption of rational actors, everyone will try to free-ride and nobody contributes anymore.

So how can we stabilize this cooperation? The main point is that developers need additional information about their interaction partners, and one kind of such information is reputation. If you know that this other person, whom you did not interact with before, was cooperating with many other developers before, you assume that this is a cooperating person, and you will cooperate with them as well. But reputation is not directly observable: you would have to ask every developer about the reputation he thinks others have. So my assumption was that developers build up reputation by contributing to the project, and that the work done for the project is a proxy for reputation.

There are some quotes I use to show that reputation really plays an important role in free and open source software development. The first quote is from Eric Raymond: the utility function Linux hackers are maximizing is not classically economic, but is the intangible of their own ego satisfaction and reputation among other hackers. The first part, the ego satisfaction, I think is also very important: people doing free software work are quite self-motivated, and they simply enjoy contributing.
That's one factor, but it was one factor I was not interested in, or rather, I think there is already a lot of work done on that part, so I focused on the other part. Another quote: statements from people who do things always carry more weight with me than statements from people whose primary activity is making statements. This says directly what I want to show: you have to actually do things and contribute, not just be there and do nothing. And that's the economist's version of the whole thing; it's a bit more theoretical, and I'm a bit short on time and it's not that important, so I'd rather show some results.

So the question, or more precisely the question applied to the Debian project and my data set, was: does reputation have an influence on bug fixing? And the hypothesis is: the higher the status of the bug submitter, the faster a bug will get fixed. That's one of the hypotheses in my thesis, and the one I will show results about now.

This table shows bug fix time depending on submitter reputation. It's quite simple; there are no control variables or anything. What you see is that there's a complete data set and a reduced data set. The reduced data set only contains what I call simple bugs: I excluded all bugs that changed severity somewhere in between, or that were reassigned to different packages, because the assumption is that if a bug has a very complicated history, the influence of the actual difficulty of fixing the bug is much higher, and the social component will not show up as much. And you can see that the effect I wanted to show is present in both data sets.

The first column is the number of bugs in this category. The second column is the number of events; events are bugs that were actually fixed. So I used a statistical method which accounts for the fact that some bugs are not yet fixed; it's a kind of survival analysis, as it's called.
It has many different names depending on which field you're from; medicine calls it something different than we do in sociology, and event data models is another name. The third column is the median time in days until the bug gets fixed, and the last column is the 95% confidence interval: with a probability of 95%, the true value is somewhere between those two numbers. So the median over all bugs is 63 days. But if the submitter is a contributor, it goes up to 88 days, and it comes down consistently with higher reputation: bugs submitted by core developers are fixed after 35 days in the median. And if you only look at the reduced data set, the effect becomes even stronger.

[Audience] Does that account for the fact that the core contributors are more likely to submit better bug reports as well? With patches and things that will actually enable them to be fixed more quickly?

This one does not, but I did other analyses which try to account for this. I used whether the bug has a patch as a control variable, and I tried to figure out if there's a good way to recognize better bug reports by how many words are in them and things like that, but that was quite difficult to do. I also tried to account for the fact that people staying longer in the project probably have more experience and will submit better bug reports.

[Audience] Did you look at the tags at all, like wontfix or patch?

Yes. So, first, these are so-called survival curves: you see, after how many days, which percentage of the bugs is still unfixed. It goes down and slows a bit, and in the end, 12% of the bugs submitted by contributors will probably stay unfixed, while for bugs submitted by core developers it's 8%. And if you only look at the simple bug reports, the effect is even stronger, and there's a higher difference between the different categories of contributors.
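The survival-analysis idea, a median fix time that accounts for bugs never fixed (censored observations), can be illustrated with a toy Kaplan–Meier estimator. The durations here are invented, and the thesis may well have used different estimators; this only shows the principle:

```python
# Tiny Kaplan-Meier sketch: durations in days; event=True means the bug
# was fixed, False means it was still open at the end of observation.
def km_median(durations, events):
    obs = sorted(zip(durations, events))   # order observations by duration
    at_risk = len(obs)
    surv = 1.0
    for t, fixed in obs:
        if fixed:
            # Survival probability only drops at actual fix events;
            # censored bugs just leave the risk set.
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
        if surv <= 0.5:
            return t      # first time survival falls to 50% or below
    return None           # median never reached (too much censoring)

durations = [5, 10, 20, 40, 60, 90, 120, 400]
events    = [True, True, False, True, True, False, True, False]
print(km_median(durations, events))  # 60
```

A naive median over only the fixed bugs would be biased low, which is exactly why the censoring-aware method is used.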
And because it's quite a huge table, I just summarized the results from the multivariate survival analysis; that's what you asked about. I included some control variables, like bug severity, who the bug fixer is (if it's fixed), the experience of the submitter measured by the time he has been contributing to the project, whether a pseudo-package is involved, and the way the bug was fixed, by email or with a new package. And there are significant results for most reputation variables. One interesting finding is that prior interaction between the bug submitter and the bug fixer has quite a strong influence: if you have interacted before, it's more likely that your bug will be fixed faster. And again, the effects are more significant and larger in the reduced data set. But I have to acknowledge that the overall variance in bug fix times that I can explain with my data is quite low. There are, for sure, and that's not really surprising, other effects that influence the time until a bug gets fixed a lot more. But if we assume that those are randomly distributed over all the bug reports, and not correlated with the status of the developer, which is quite a strong assumption, then that doesn't matter.

And that's another thing: if you only look at pseudo-packages, you can see that this line up here is the pseudo-package general, and about 50% of the bugs against general won't get fixed at all. That's until November 2007; Holger has fixed a lot of them since then, so it would look quite different now. And pseudo-packages for teams, like installation-reports, where there's a team responsible in theory, rather than general, where bugs just go to debian-devel, still have some percentage of bugs that don't get fixed. But if you look only at the first 15 days, you can see that bugs against general actually get fixed quite fast in the first period, faster than bugs against normal packages.
So there are several possible explanations for that. One explanation is that those are just not bugs, and someone just closed them. Another explanation, which takes the social aspects more into account, is that there are a lot of people on debian-devel, and if one of them chooses to take care of a bug right now, chances are high that they will fix it quite fast; but if it just gets ignored, it gets ignored forever.

I think we have to wait a bit for the next slide, because it's quite a complex PDF, and my PowerBook is a bit slow. And Evince is not the fastest. No, some other problem.

OK, at the end of my presentation, an outlook: with the data I gathered, much more could be done. You could, for example, do network analysis: draw networks of people cooperating and working together, and analyze that; look at whether there are central players, or who is at the border of the whole network. And the data is available online; you can use it. If you want the raw data, because it's quite big, you just have to ask me, and I will send it to you or upload it somewhere. It's just that I don't have the bandwidth for people to download files of 500 megs just for random fun. But if you're really interested in it, just ask.

So thanks for listening. The full text of the thesis, in German, is available, and all the scripts I used, the Python scripts, are available as well at this URL; it's also linked in Penta. And I have two copies of the printed version of the thesis here, if you want to take a look.

Are there any questions or comments? What? German. But, yeah, it's German. There are images and tables, though.