Okay, so I think we're going to go ahead and start. This talk is about evaluating team performance within Debian. The speakers are Andreas Tille and Sukhbir Singh, so I'm going to turn it over to them. If you have questions, raise your hand and I can get the mic to you. Here you go.

Thank you. Hello everybody, nice to see you here. It's a special pleasure for me to present this Google Summer of Code project, because it's the first project I'm mentoring and I have a really great student, so I think this talk will be just fun.

What was the motivation for this? I think it was three years ago when I wondered: who are the members of our team? It was actually the Debian Med team, and I did not find a really good answer. So, as a first means, I looked at who is regularly posting on the mailing list. I did some graphing and was astonished that I could really see who is inside the team and also who left the team, which is interesting. You wonder why these people are leaving the team, whether there are any problems, and who the members of our team really are. All these questions somehow showed up, and I thought, well, this is an interesting thing.

For instance, this was my first graph, updated now. You can see that at first I was quite alone for three years, and then some members came. The high peak, which is just me, is actually not so important; what is important is that there are other people who are also there. So I think the run-over-by-a-bus factor is quite good in this team; we have a solid base. We actually lost somebody who is not active in our team anymore, but he is still in Debian. Yes, it's the Debian Med team. This is the initial team I started with. Once, I presented a graph like this in Argentina, and people told me: hey, cool, can I see this graph for my team? So I thought, well, why not?
But as it is when you are doing a presentation, it was just a quick hack. I was just browsing the web archive for names, and it was not good code at all. Finally, we also said: well, on the mailing list you are just chatting. Who are the people committing code? Who are the people uploading packages? All this should be covered as well, and we should also try to do a fair evaluation. There might be people who are just quoting text and saying "yes, I agree", and this is not really work. Sukhbir had some interesting ideas on how we can do it better, which we will present later. So we now aim for more flexibility in the evaluation, and we made some technical enhancements over this quick hack.

So what have we done in addition to this mailing list analysis? We are also analyzing VCS commits, currently in SVN and Git, and which packages people in the team have uploaded on behalf of the team. If you have any further ideas, we can discuss them later. Maybe some teams communicate more on IRC, which is not the case in those teams, but there are probably opportunities and ideas we could discuss later or now. The good thing about Summer of Code, of course, is that you have somebody who implements the stuff, and I can sit and wait. And now I'm waiting for Sukhbir's introduction. So have fun.

Hi. So Andreas had already done most of the work, but we decided to start from scratch because it was just a series of hacks. We needed something that would be easier to maintain, so we did everything from scratch. It's very fast compared to the previous code, because for most operations we SSH into Alioth and then perform them locally. For the Debian Med repository, for example, it initially took us 12 to 14 hours, but with this approach it takes just two to three hours.
So it's fast, and you don't have to do anything. This is what Andreas started: every project has a mailing list, and we measure who the most active contributors are. Quantity is not the only metric, because the quality of communication also matters: if you're talking too much and it's not substantial, it doesn't matter much. We also handle spam, so you don't need to worry about it; in the final result we automatically remove the spam, so we just get a list of the top contributors.

Since I'm a Summer of Code student: we have this SoC coordination mailing list, and this is the graph from its data. As you can see, Obey Arthur is at the top, but you can also see how it varies: in 2010 we have Stefano contributing the most, and in 2011 we have Ana contributing the most. So this is what we are aiming at. This is very simple data; we are measuring many more things. The first is the frequency of posting, which is just the number of postings. Then we have something new which we added: we take into account the raw length of the message body, and the length of the body excluding blank lines and quotes. And finally, what I think is the most concrete metric, because we remove every possible clutter: we remove the blank lines, we remove the quotes, and we do the same for the signature. So we have something which tells us concretely who actually talks better. It's not just about quantity, it's about quality too.

So, is there anything you would like to contribute to this? Because this is all we have been able to come up with.
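As a rough illustration of the three length metrics just described (raw body length, length without blank lines and quotes, and length without the signature as well), here is a minimal Python sketch. It is not the project's actual code. The function names are hypothetical, and it assumes that quoted lines start with `>` and that the signature begins at a line consisting of `--`, optionally followed by a space.

```python
def is_clutter(line):
    """Blank lines and quoted lines (assumed to start with '>') count as clutter."""
    stripped = line.strip()
    return not stripped or stripped.startswith(">")


def body_metrics(body):
    """Return three length metrics for a plain-text message body.

    Hypothetical helper: the signature is assumed to start at the first
    line that is exactly '--' once trailing whitespace is removed.
    """
    lines = body.splitlines()
    sig_start = next(
        (i for i, line in enumerate(lines) if line.rstrip() == "--"),
        len(lines),
    )
    without_quotes = [l for l in lines if not is_clutter(l)]
    without_signature = [l for l in lines[:sig_start] if not is_clutter(l)]
    return {
        "raw_length": len(body),
        "length_no_quotes": sum(len(l) for l in without_quotes),
        "net_length": sum(len(l) for l in without_signature),
    }


if __name__ == "__main__":
    example = "> quoted text\n\nI agree.\n-- \nJane\n"
    print(body_metrics(example))
```

A message that only quotes earlier text and adds "I agree." ends up with a small `net_length` even when its raw length is large, which is the effect the metric is after.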
I just want to say there are a couple of different readability metrics that you could use, if you're interested in seeing who's posting things that are comprehensible as opposed to lengthy. I feel like this would give people more points if they post very long flame-war-type things or impenetrable technical jargon. I don't know what the metrics would be, but it might be interesting to try to apply some of those to the filtered text, without blank lines, without quotes, without signatures.

Yeah, but that is the only problem: how do we actually measure who says the most substantial things on a mailing list? Because that's impossible. Would anyone else like to contribute anything to this? Because we need more. Yeah?

For every email that is sent: of the number of emails I've sent, what percentage have actually been followed up by someone else? If I keep sending emails that are not relevant, they'll be ignored.

Yeah, that's a good idea.

How about attachments? People often send patches to the mailing list because they are not a real member of the team yet. They post one or two patches, and then someone says: yeah, you're a good guy, join us. Do you count the attachments or not?

We measure commit stats, so I think that should somehow be taken care of by that. But right now, no, there is nothing specific about patches. Can you elaborate some more on what you are trying to get?
Well, not really. If you send in a Git patch and someone applies it to the repository, you are still recorded as the one who wrote it. But when you use SVN or something like that for your team, then the committer is the person who really committed it. So the contribution isn't attributed to the person who sent the patch, and he won't show up in your metrics at all, or only once he really joins the team, so his history is missing.

I think that this might happen, but if you want to have the global picture, it will not happen so often that somebody sends 50 commits via email and then becomes a member and really commits. So this might happen, but it's probably hard to measure. The idea is fine, but I think it's more effort than it's worth. Okay.

So now we come to this: we have started including VCS commits as well, because we don't have many metrics and we just have to make use of the ones we already have. Again, quantity is not equal to quality, so we also measure the number of lines added and the number of lines deleted. But because we have a very limited number of metrics, we have to make use of the ones we have, no matter how poor they are. Lines of code is hardly a good metric: suppose you commit a binary file to a Git repository, say an image file; you will see that it counts as something like 500 lines committed. So it's a poor metric.
We know, but there's nothing we can do about it, because these are the only metrics we can measure.

Okay, now we come to the challenges, the problems we faced. This is the first problem, and we have no solution to it other than...

You do have an mbox archive of lists.debian.org on master.

Okay, you can... what? Yeah, sure.

I can elaborate on this. As I said, I started with the web archive because the mboxes are not publicly available. That was really crap, so then I used a Python module which can parse NNTP. There actually are mboxes on master, and I can parse them, but he can't, and the public can't, and we want to publish this data. I kept on asking listmaster: can you please publish these mboxes? Because it's plain stupid: on Alioth we have mboxes, he has code for mboxes, but you can't use the mboxes on lists.debian.org. So if anybody knows listmaster and could help us get them published in some way, I would be really, really happy. Right now it needs extra code: we fetch the messages through NNTP, then we create an mbox, and then we parse it. It was refused because they said there were privacy issues with this.

So I wonder if you should talk to Enrico about measuring contributions, because he has this nifty script for mining changelogs, which he says is very fast and apparently also works well. You say lines of code is not a good metric, but team member contributions to bug logs are probably quite interesting, and team members being mentioned in changelogs is also quite interesting.

Right. So, the way we measure, or kind of measure, performance in the NM process: first, it cannot be done automatically. Besides, there's an EU directive saying that you cannot automatically evaluate the performance of somebody you want to hire, and we abide by that. But no, I mean, jokes apart.
It's not just grepping the changelogs that gives you results. It's not numbers, because then you go and read, and maybe you only find things that did not really require work. In the changelogs that rarely happens, but you'd like to see that there are substantial changelog entries. And even the way we grep changelogs cannot be automated. I cannot make an automatic thing where you click on an applicant and it fishes up a changelog, because names can be ambiguous; it needs intelligence to build up a query that gives you only the changelog of that person. Unfortunately, we have two people called Luca Bruno in Debian, and we can't use minechangelogs on them, because in that case it's ambiguous. And again, as has been mentioned before for mailing lists: if we look at mailing list activity, we just call up all messages of some person via the DD portfolio and go through them, and if we want to look at commit activity or BTS activity, we kind of do the same.

If you count the lines of code and all somebody does is reformatting source files, then they'll have loads of lines added and removed. I wanted to suggest counting lines removed, because often they're the most useful, but again, when you reformat pieces of code, it's the same. So I can be asked about manually putting intelligence into evaluating people's performance and doing it efficiently, but if you want to compute a karma number or something, I am not the expert in the room for that.

Actually, we do not want to measure karma. We want to know whether there are 10 or 20 people working on the same project. We are not proud of being high on top of this list; what is important is to have several people and not only one.

In that case, I think it's enough to say there is activity over time. You don't even need to measure the number of lines committed; you
just measure the number of commits and see that there are commits in a day, and on how many days in a month there are commits. You set a threshold and you say that person is active in the team in that month. Or messages, and so on: there is activity in the month from that person in that team. That would probably be enough to say that person is active.

As I said, it's not a competition. What we want to do is just see whether there is a problem in the team. Are we losing people or are we winning people? This is the point, and it really helped me a lot to understand a little where there are problems in the teams. When I made my analysis, I saw teams consisting of one person. Such a team has a problem, and what are the reasons, and what can we learn from this? I really like the idea of checking for bugs which people in the team might have fixed; I think I have a good idea how to do it with UDD.

And about names, you will be astonished. We have a single Debian developer who is using six or seven different strings for spelling his name, and this is the maximum; then there are several with a few, down to only one. I found a way to derive a table from UDD which unifies the names, so for those I have one name per person, though probably in a couple of cases I have two persons mixed up. So the details of making sure that you really get the right person are quite hard. Some people change their email address; on Alioth they are name-guest, and then they become name, or even a different name. This makes the stuff under the hood a little bit hard, but I think in the end you see a picture of the team. It's not a question to discuss whether this person is just reformatting or not. I can't believe that one single person has fun doing a thousand reformats of code and becomes top committer in a team that way. I can't imagine this.

So, any further questions or comments?
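The threshold idea suggested from the audience (count the days per month on which a person has commits or messages, then call that person active if the count reaches a threshold) could look roughly like the following Python sketch. The function name, the event format, and the default threshold of two days are assumptions made for illustration only; they do not come from the project's code.

```python
from collections import defaultdict
from datetime import date


def active_members(events, min_days=2):
    """Map (year, month) to the set of people active in that month.

    `events` is an iterable of (person, date) pairs, one per commit or
    message. A person counts as active in a month when they have events
    on at least `min_days` distinct days (hypothetical threshold).
    """
    days_seen = defaultdict(set)
    for person, day in events:
        days_seen[(person, day.year, day.month)].add(day)

    active = defaultdict(set)
    for (person, year, month), days in days_seen.items():
        if len(days) >= min_days:
            active[(year, month)].add(person)
    return dict(active)


if __name__ == "__main__":
    log = [
        ("alice", date(2011, 7, 1)),
        ("alice", date(2011, 7, 1)),   # same day, counted once
        ("alice", date(2011, 7, 15)),
        ("bob", date(2011, 7, 3)),     # only one active day
    ]
    print(active_members(log))
```

Counting distinct days rather than raw commits or lines keeps a single burst of reformatting commits from dominating the picture, which matches the point that the aim is to see whether several people are active, not to rank them.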
Just a comment on the reformatting: if a team decides that its code base needs to be changed to a new set of coding conventions, the person who does the grunt work of actually doing the transition is a valuable member of the team, even if it's not significant intellectual labor.

And to second your point: I think when we have thousands of commits and a huge load of data, these things will be just noise. And if it's real work, then it will be there. I'm a little bit wondering why nobody raises the problem of privacy, which we should talk about. Can you show one of these graphs, please, maybe the last one? Because when I did the first graphs I had full names there, and people said we are not allowed to do this. Okay, I agreed, and now we cut off the last name, which somehow works, but for the busy people, everybody in the room knows who this is. Is there anybody who sees some problem in this?

No, I'd like to interject, because this information is public anyway. If you are contributing to a public mailing list, you're specifically told it's going to be public. You're not being forced to participate. If it's private, you're told it's private. So I would say yes, there are problems with it being public, but the problems are not with the work you're doing; the problems are with the mailing list. If we're discouraging contributors because they feel that they can't participate because of the publicity, then that is potentially a problem. However, the openness is the trade-off there, and I don't know that we can make any other.

Well, if there's a privacy issue, Ohloh should close, because Ohloh is doing exactly the same thing. I don't know if you know about it.
It's this website that basically does that. And as far as I'm concerned, the problem with Ohloh is probably not that they do that. But I emailed them saying that they were misrepresenting me and my work, and that I would like them to take my name down from the site. I just do not want to appear there, because I do not want to spend time fixing my information there and adding to their value. [unclear] I do not want to put work into it, and I do not like the way you represent me, so please take my name down. And they basically told me to fuck off. I think it's at that last point that you have a privacy issue in such a thing. Yes, the stuff is publicly accessible and you mine it; fine. But the moment somebody tells you, please take me off...

Yeah, so we will do that if someone asks. That's no problem.

Okay. So you recommend that we have some sort of form: if you want to take your name out of this, you can contact us.

Okay, I perfectly agree with this. It was also my argument. But when we were discussing this, I learned that if you aggregate public data, it becomes somehow new data, because this aggregation process is something not everybody can do. So this is actually new data. I have no problem with it; I would not be here if I thought it was wrong. But I'm just asking. For instance, what could happen if somebody's employer said: what have you done in 2009? Oh, you were chatting all the time. This could be a problem. I think for our stuff it's not a problem, but I just raise the question whether this is spying. I'm not thinking that it's true, but I just want to hear your opinion.

Given that the goal of this particular project is the evaluation of teams, it seems like you could present the information with no names attached at all, to simply say whether the team is healthy. Look at the number of contributors.
Look at how the contributors change over time. You have a consistent color per person, and maybe we don't need to know the name, unless it's a team that only has one color showing up, right? If everything you see there is a red bar and then nothing, nothing, nothing, then maybe you want to know who that person is so you can help them out. But if that's already the case and you know the name of the team, you go to the team's mailing list and you can probably find that person.

This is the second problem we have been having, because it doesn't seem to work. We need someone who's good with Git and knows how to do it: we want to remove the upstream contributors. In some repositories there are lots of them, so much so that they overshadow the other members of the team.

For the actual cases we have: formerly, I just investigated the commits via the mailing list, where you see the commit mail sent by the daemon to the commit mailing list after each commit. Now Sukhbir is investigating the real commits in the Git logs and SVN logs. And then you see that we have some upstream developers who are really, really active upstream, who commit to our Git repository and make two or three changes in the debian directory. We have a very specific case of this: he is a top committer in our team even though he has committed nearly nothing to the Debian packaging. The solution is coming up, and I just wanted to explain why this is on the slide. It's a little bit technical from the Git point of view.

So I guess I just wanted to say: if the script I sent you doesn't work, please tell me and I'll fix it.

This is thanks to David. Basically he has given us code in Perl which should be helpful to solve this. These slides are a little bit older.
He wanted to show the problems he had.

Same answer?

Yeah, but you can try, because anything new is welcome here.

I wanted to suggest: you usually have the upstream branch and the Debian branch, and you can measure the delta between the branches, because the upstream branch will usually have only the upstream commits.

Yeah, but I think I tried and it didn't work out, so maybe there's something I'm missing.

So anybody who, for better or worse, followed my advice probably wouldn't have Debian-only commits on a separate branch. But unfortunately people don't follow your advice. Your problem would be because of them following my advice. So I think that in general it's not useful, or at least I find it not useful. I see how it would be useful for you, but yeah.

But I think this problem is basically solved, as the code is there and should be okay.

I have been requesting this from the start, because we are not fully convinced, and we have time left, so it's not about time. Any new way of measuring performance is very much welcome, because we need them. This is the maximum we could come up with, because we tried everything. Next to you.

So I think one very interesting metric would be commits from team members upstream. Then you have your problem again, but I think that this is a huge contribution: if people on your team make fixes and do the work of getting them upstream, I think that's something that's really important for Debian.

Well, I think it would only work for people who are using Git; with others you can't see it. We have teams which are using mixed repositories, and you would get wrong tendencies if you observe only the Git people and others don't use it.

Maybe I've missed it.
Something like new DDs coming from a team within a certain time frame would also be interesting information. Advocations, applications, names of DMs, and so on.

So you mean that if we learn that some DMs or DDs come from this team, because they are members of the team, this would be a good sign in this respect? Just to make sure I understand.

How do I put this: members of the team who get advocated by other team members to enter the NM process.

I think we have a lot of them in Debian Med, and perhaps in different teams. I think we have eight or even ten people gathered for Debian, so perhaps we can find this somehow.

I guess you can measure wiki page edits, but there will be a problem with mapping the wiki author to an email or something.

Yeah, it's possible. I actually read the change logs of those [unclear], but it's hard to find a way to tell which wiki page is for which team; there is no clear mapping from team to wiki page.

In the Python team, the wiki pages related to our team start with Python/something.

Okay, so you're suggesting to start with the team name and then... Okay, yeah, we might consider that. Anybody else?

So this is what we have done till now. It's been, I guess, two months. We have the mailing lists on Alioth, which work very well because pipermail already has all the archives, so we just fetch them and parse them. For lists.debian.org we had lots of discussions, and finally we fetch the messages over NNTP, then we create mboxes, and then we throw them over to the Alioth parser.
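To make the last step of that pipeline concrete, here is a minimal Python sketch of parsing an mbox file and counting postings per author and year, using only the standard library's `mailbox` and `email.utils` modules. It covers only the parsing step, not the NNTP fetch, and it is not the project's parser: the function name is hypothetical, and identifying a person by the display name in the `From` header is a simplifying assumption (the talk itself notes that one person may appear under several spellings).

```python
import mailbox
from collections import Counter
from email.utils import parseaddr, parsedate_to_datetime


def posts_per_author_year(mbox_path):
    """Count messages in an mbox file per (author, year).

    Hypothetical helper: the author is taken from the display name of the
    From header, falling back to the address. Messages without a usable
    Date header are skipped.
    """
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            name, address = parseaddr(message.get("From", ""))
            try:
                year = parsedate_to_datetime(message.get("Date", "")).year
            except (TypeError, ValueError):
                continue  # missing or malformed Date header
            counts[(name or address, year)] += 1
    finally:
        box.close()
    return counts


if __name__ == "__main__":
    import sys

    for (author, year), n in sorted(posts_per_author_year(sys.argv[1]).items()):
        print(year, author, n)
```

The per-year counts produced this way correspond to the simplest metric mentioned earlier, the frequency of posting.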
So that's how it works. Git and SVN repositories are complete. What is incomplete, and I think this won't take much time, is fetching package upload data from the Ultimate Debian Database, plus the ideas which you presented. And finally we'll have something that makes it easier for you to fetch the information, like popcon does, for example, so you can easily check who was the most active contributor in, say, August 2011. Right, I think that's it. This is our website, and we have a public mailing list, so if you have any thoughts you can get in touch with us.

I have a suggestion, or rather a problem I face in the Debian Games team. We have over 170 members. Well, I took the list on Alioth; I counted members as logins in the group, so without grouping together the old Alioth login and the new DD login. According to my poor Python scripts, about 80 people have not committed and have not mailed since January 2010. It would be really good to have a possibility to ping all those people, most probably via your interface because it has all the data already, to say: hey, are you still interested in being in the team, or should we just remove you now, and you can rejoin later when you have the time? Because when we do meetings, we are 20 people in the channel, and we don't know: are we everyone who is interested in working at the moment, or should we wait for more? Do we need more opinions, or can we just decide on our own because we are the majority currently online?

So you actually want to use the data and ping the members to ask them whether they still want to contribute or not?

Yes.

Yes, that's a good idea, because that would actually be making use of the data which we have gathered.

Yeah, it was similar for me on Alioth.
There are 90 or 100 people, and I think 30% of them never did a single commit. They just subscribed because they like the idea of the project and thought this is something good.

So I think another way to publicize this that would be nice for the project is to just send an email, maybe to debian-devel-announce or something, highlighting the teams that you found that do seem to have an active and healthy community in them, just so people can acknowledge that one of our values in the project is to actually have active and healthy communities. So not to highlight the teams that aren't actually teams; don't highlight the teams that are one person. But highlight the teams that are actually functional teams and say: we want to acknowledge that, based on our rough heuristics, these folks seem to be doing a good job.

Yeah, that's a really good idea.

There's probably no reliable way of doing this, but it might be useful for certain teams to be able to distinguish between active and passive members. For instance, I have a few packages in the Python team. I don't consider myself an active member of the team; I just version-control my stuff there so that I make it easier for other people to make changes. So if you can say there are X active members in the Python team, I think it would be useful to know how many of those are actually contributing to the whole package set and how many just have their own packages.

So what do you suggest on how we do that? I guess we agree that that would be a useful thing to know. But then, do you want others vouching for you, or will you vouch for yourself? That is what I want to know. How do you want this to happen?

I guess you could see if certain people tend to consistently commit only to packages in which they are either maintainer or uploader.

Okay.

Yeah, I guess you could look at commit messages or changelogs.
I don't know.

I think you can't tell whether people consider themselves members of the team. If they upload on behalf of the team, they are, by our definition, members of the team. This is what we call a member of the team. You should not count the membership; it's just what you have done that counts, even if you say: well, I'm not so close to this team. What you have done is work, and it counts. And if you post to the mailing list but don't show up frequently, you are not on this graph.

Yeah, because I think we'll just take the top X contributors or something, not everyone. Okay, so I guess that's it. If you have any questions or suggestions, please...

We also tried it: if you take everybody, you also get the last spammer who slipped past our spam protection. So it makes sense to have the most active 10 or 20, or in large teams 30, but not more. If you are committing only very little, you are not visible in the statistics. This was actually also the idea of mine: to get rid of everything which is not so important for the team, and this specifically includes spam.

Hi. Could this be used to identify possibly orphaned packages within the team, as in packages that are under the team but that no one is actively working on?

Maybe, yeah. Cool. We have not touched that part yet; I'll do that once I get back. Anybody else?

Do you have any overall statistics to show about the teams in Debian? Are there some results we can see?

Yeah, I think we have. Actually, for testing we have lots of data.
So I think we have that right now. At this address, blends.debian.net/liststats, you find text files, which is important for accessibility, because I got a hint that even the accessibility team wants to see their committers and they can't parse the PNG files; and there are the PNG files. This is updated once a month. If you want to see some results: by chance I have this graph, which I showed after the DPL talk, because you see Stefano Zacchiroli. He was the most active poster, and he dropped his activity when he became DPL. So you actually see something in these graphs.

There are other examples. This is the accessibility list. You see Samuel Thibault is very active, and Mario Lang, and then the activity drops a bit, but I think this is actually a good team, because there are quite a few people, they don't lose many people, and they're doing something. amd64: I think this list is just quiet because it's not used anymore; amd64 is completely normal now and there's not much to discuss. For ARM it happened a little bit later, and I think it's also quite okay. Blends commits: this is because I do all the work on Blends. You see, I would love to have some more people here, so this is not such a well-working team. The Blends mailing list: there was a lot of discussion in the beginning. This was when we renamed CDD to Blends; there was a lot of discussion. The debian-boot list: this is a little bit sad, when I look at this list and see that we lost the famous one. The Technical Committee: it is curious who is chatting the most there [unclear], but it's not used so much anymore by the active people. Debian Blends commits: I'm the most frequent chatter, but I'm not the most frequent uploader of code. So you can read this from these graphs.

We have five minutes left. I also have some graphs of the upload activity. You see, these graphs are all about uploaders. Debian Live team... Yes, Debian uploads.
This is about the uploads of packages in the Debian GIS team here, for example. You see there are not so many people. I would really love to make this a stronger team, because if all the OpenStreetMap guys joined this team, they would form a really strong team. They could be a real strike force to make Debian a solid solution for geographic information systems. But they don't believe me; they don't do it. The Debian Live team: this is basically Daniel Baumann. He does a good job, but somebody could support him. The Debian Med team: okay, here you can see I was losing in the number of commits; obviously I have larger commits and Charles had more tiny commits, because I'm winning again if you count the number of uploads, whatever this means. As I said, we do not make it a competition. What is important is that we have these people here; this is the most important thing, not the high numbers. And actually I think 15 to 20 people are active here. That's just okay, I must say, for quite a niche. Debian Med is quite a niche topic and there's not so much use of it, but even so we got a nice, larger team here. We tried to measure the top 20 uploaders to see what makes sense, and here you get the committers with only one commit, and this is basically a QA upload or so. So it's not so interesting anymore. It might be interesting up to 15 or so, depending on the size of the team. For Debian Science there are more people, so it's more interesting. Yeah, this is Debian Science; it also looks quite good here. We have the top 35 uploaders; that also doesn't make so much sense. And the DebiChem team. Well, these are examples.

Okay, any more questions? That's all. Thank you.