Thank you. So welcome, everyone, to this talk, which is a bit of an update on the community analytics we've been working on over the past few years. But first we'll introduce ourselves.

So my name is Christelle. I'm a computer engineering student and I'm majoring in data science. I'm currently doing a data science internship at enioka Consulting, and this is my second internship at enioka. My first one was in 2019, when I was working on ComDaAn with Kevin, and that's how I got to know KDE and ComDaAn as well.

As for me, I'm one of the old-timers in KDE, that's what they say. I started using it in the 90s and only started to contribute a few years later. Then I fell in love with the community, and I've been helping around with kdelibs and then KDE Frameworks, and also community stuff, which includes helping set up the KDE Manifesto and this effort around community data analytics that we are going to talk about today. As Christelle mentioned, I'm part of the enioka Haute Couture family, where we are doing services around software, and I live in Toulouse still. I'm going to move from there.

All right, so let's start with that particular talk. What happened previously? You're used to that: always a bit of history with me. So let's start with that guy and his ridiculous dog. He took over an idea from Adriaan de Groot, which was named the green blobs back then, brought in something named gitviz, and made the blobs blue. He did a few interesting talks for us; that's our dear Paul Adams in our community. He kind of retired in 2017, and now he prefers to have fun with his dog, apparently.

So that's when I came into the picture. I basically forked gitviz. Well, I asked him first if he was fine with that and so on; he gave me the code and I expanded it quite a bit. I reused most of the gitviz code, but swapped most of the dependencies for something else.
So I changed to pandas for the data processing, NetworkX for the graph analysis, and Bokeh for the actual output, which gives us, as we'll see in this talk, better visualizations. I also added some ways to clean up the data with rules, which is important for identifying people, for instance when people are not necessarily consistent in the way they commit or the name they use. Then a bunch more visualizations, and the big advantage of Bokeh is that we have more interactivity in the visualizations. Unfortunately that's a PDF, so we won't be able to see that today, but otherwise you're able to zoom in and out and pan and everything, which makes it much easier to actually explore the results.

So, a few examples of what we do; you might have seen that on my blog, for instance. That one is one of the simplest ones to understand: it's basically, week by week, the commit count and the number of unique committers, which gives you an approximation of the team size each week in the community. You can see that evolve over time and then get a trend.
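To make the weekly plot concrete, here is a minimal sketch of how the commit count and the number of unique committers per week could be computed, including a small name-cleanup rule. It is an illustration only: it uses the Python standard library rather than pandas, and the `ALIASES` table, the function names and the sample data are all hypothetical, not ComDaAn's actual API.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical cleanup rule: map inconsistent author names to one canonical name.
ALIASES = {"Montel Laurent": "Laurent Montel"}


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_stats(commits):
    """Aggregate (author_name, commit_date) pairs week by week.

    Returns {week_start: (commit_count, unique_committer_count)},
    ordered by week.
    """
    counts = defaultdict(int)
    authors = defaultdict(set)
    for name, day in commits:
        canonical = ALIASES.get(name, name)  # apply the cleanup rule
        week = week_start(day)
        counts[week] += 1
        authors[week].add(canonical)
    return {week: (counts[week], len(authors[week])) for week in sorted(counts)}


# Made-up sample data, for illustration only.
SAMPLE = [
    ("Laurent Montel", date(2021, 6, 14)),
    ("Montel Laurent", date(2021, 6, 15)),
    ("Alice", date(2021, 6, 16)),
    ("Alice", date(2021, 6, 22)),
]
STATS = weekly_stats(SAMPLE)

for week, (commit_count, team_size) in STATS.items():
    print(week, commit_count, team_size)
```

The unique committer count per week is what approximates the team size; without the alias rule, the first week of the sample would wrongly report three people instead of two.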
Then we've got the activity plot, which is the one truly inherited from Paul, and now it's colorful because we have more thresholds for the activity. Each line is a contributor, each column is a week, and you get a color depending on how active you've been. The last thing is that all the contributors are sorted by first commit date, which is interesting because this way you see people coming into the picture. So you get that envelope there, which gives you an idea of how much you're recruiting (the envelope I'm talking about is that red dot), and inside this area, the denser it is, the more retention you get, because that means more of the people who came in actually contributed for a long while. Otherwise you get, like here, a one-shot and then the person disappears.

Then we also have the contributor network. For that one we basically plot a network to get an idea of who collaborates with whom. When we're looking at the commits, that's based on the assumption that if two people touch the same file, there's a chance of some communication happening at some point so that they don't step on each other's toes, and so we basically draw a line between those two persons.

And that one is the hardest one to read: that's the centrality for a particular person. As soon as you can do plots like this, you can tell whether a person is very central in the graph or not; we color-coded the nodes, so that one is fairly central, for instance. But that's a picture at a given point in time. Now if you rinse and repeat that for, I don't know, every week or every month, you can see that value evolve for a particular person, and that's what we are plotting there. The blue line here is basically the centrality of that particular person, who happens to be Volker Krause, who is in the other room right now, over time, and so you can see that he can be fairly central
and then a bit less. Correlated to this, we also show the activity, that is, the number of commits. Everything is normalized though, which is why it goes between zero and one. One thing we have realized is that if the team is small and you have a lot of activity, then you get fairly central quickly. So we also show the team size, because you can draw conclusions from that blue curve only for periods of time where the team size is somewhat constant. Around here we can draw a conclusion, but in there it's kind of shrinking, so we cannot draw any real conclusion about that spike, or that one; but there we can start to draw some conclusions.

So that's a quick recap of things I've been talking about in the past. And then there's kind of an open question: is that just a fun puzzle, Kevin? Why do you do this? Well, it's definitely a fun puzzle, I won't lie, but not only. It's also a way for us to show whether a community is healthy: because of the activity level, as I mentioned, we see if we recruit people, if people stay around or not, if we have a large or small bus factor for particular teams. We can also see the team structure, and we can see teams splitting, which I didn't quite explain here, but I wrote a blog post about that.

There are also professional uses. I've been using that, for instance, for framing technical projects at customers when they wonder what they should pick as a dependency; some of the reasons for picking something are technical, some of them are more about the community. Also, interestingly, you can audit the actual customer code with this kind of tool, and you find some funny facts sometimes. So that's what I said, pretty much: especially for code auditing, it makes it easier to explore the project history, and also to evaluate the developer turnover inside a team at a customer. It also helps you find out how the team is structured and who owns
which part of the code, for instance, and who works with whom. That's the kind of answer you can get, and it can be fairly important for customers. You can also identify the key people on a particular project; if you look at those graphs for a few hours or a few days, you basically know everyone on the project. All right, and with that we're going to expand the view, and I'm going to let Christelle continue.

Okay, so what's new? What's new is that we've added data sources, so it's not only about the code commits anymore. We found that code commits don't capture everything about a project, so we thought about adding new sources like merge requests, issues and mailing lists, and that's pretty much what I did during my internship. ComDaAn was a bunch of Python scripts: a couple of scripts for centrality, activity and all that. My work on it was essentially to make it more user-friendly and easier to use, so we turned it into a pipeline, which makes it easier to handle and to reproduce results. It works like this: you have ComDaAn, which is the library, and you can use it to do all sorts of things. You can parse repositories, and then you get a pandas data frame that you can then use in different ways and with pretty much any options. So you don't have to use it just with activity; you can use it with network too. You parse once, and then you can display the result with different characteristics.

So what are the new data sources? We have mailing lists. Mailing lists are a pretty common way to discuss things, bugs, reports and all of that in teams, and they were pretty widely used in developer communities a couple of years ago. Now there are different ways to discuss things, but it was pretty common, so we thought about adding that. It's not very common anymore, so we might not use it much in practice, but
it's pretty interesting to look at the history. We also have GitHub and GitLab discussions, which are issues and merge requests. We thought that'd be interesting because this is where the discussion around code happens, and it allows us to look at things we didn't look at before: how central people are when contributing differently, not just with code. This leads to a new metric, responsiveness: how responsive a team is. Eventually this will potentially allow for aggregated views; for example, we could combine commit graphs and merge request graphs to get a fuller picture of the community. I also forgot to say, I'm sorry, that it's not easy to map commit authors to GitLab users. That's something we still hope to work on, but we eventually want to get to a point where we can combine all sorts of views for different data sources.

So what about our data set? We're going to talk about what's happening with the community, so we have a data set to analyze. What is it? It's made up of all the KDE repositories, all 950-plus of them, even the unmaintained ones, because we think they offer pretty interesting insight. We're also using the feature in ComDaAn that allows us to program some rules and apply them to our data while processing it; that's the rule set. This allows us to handle the big offenders who commit under different names, like Laurent Montel, for example, who commits as both "Laurent Montel" and "Montel Laurent". With rule sets we can make sure this is treated as the same person committing, and this helps us reduce bias in our statistics. So when it comes to KDE as a whole,
well, this is an update: this is the all-time activity for commits only. A couple of points pretty much stand out. There's Laurent Montel, who's been active ever since 1999. He's right here; we zoomed in previously and saw his name right here. He's got orange colors right here, so that's more active than the average. Then we have inflection points, like one right here and one right here. These inflection points point to faster recruiting after 2010, right here, and faster recruiting in the last couple of years. But we have less retention: the yellow here is less dense, so people are not staying, they're not always contributing. It's getting better, though, and here it's denser, so the people who joined in the last couple of years are still here.

This is the zoomed-in graph, which starts from 2017. We thought there are some people who have been active contributors and we'd like to mention them. I'm sorry if I'm going to butcher their names, I don't intend to, but here we go: there's Agata Cacko, Ahmad Samir, Alexander Lohnau, Alexander Stippich, Camilo Higuita, Carl Schwan, David Redondo, Devin Lin, Han Young, Jan Blackquill, Jonah Brüchert, Méven Car, Nate Graham, Nicolas Fella, Noah Davis, Sharaf Zaman, Vlad Zahorodnii and Waqar Ahmed. You guys are pretty active contributors, and thank you for your work.

So what about merge requests? This is the activity for merge requests, and it's obviously more recent. We have less history here; we start at around 2018. KDE moved to GitLab pretty recently and we don't have GitLab merge requests from before that, so that's pretty obvious. But there are some things we can learn. Here there's an inflection point, so people have been more active on merge requests. We see people appear like Xaver Hugl and
Michael Johnson; they're pretty active on merge requests. There's an interesting question here: we didn't spot them in the commit activity, but we're spotting them here. That could be related to the fact that they're more active discussing code than necessarily contributing it. That's another position in the open source community, where you don't necessarily contribute a lot of code but guide the development of features and things like that.

When it comes to the team size, here we have the entry count in blue and the team size in orange. This peak is the Nokia peak, in 2010; I think this is when Nokia bought Qt. Then we have a bit of stabilization right here, and then it picks up again. In 2019 and 2020 the entry count clearly picks up, and we have a pretty positive trend for the number of active developers, the number of contributors. For merge requests we clearly have a growing trend, but we don't have enough data to get a more precise or intelligent insight. So what we really get from this graph is that we need more data.

As for the new metric for merge requests, this is the responsiveness. In orange we have the response time in hours. Here it's a lot of hours of response time, and then it converges to not much, so people are becoming more reactive to merge requests, which is a pretty good sign. It took some time, but we're there. Then we have the blue shape right here: these are bars that represent the stock of unanswered merge requests. We can see that it's growing, and that could be due to the fact that they're recent, from 2021, but this is the stock of unanswered merge requests. Now it's important to know that the history for merge requests goes a bit
before the instance existed, which is counterintuitive: before the instance existed, how can we have things like that? Well, this is because Kaidan got imported with all of its history, and it's been excluded to produce this plot.

Okay, this is the network for commits. It shows the centrality of users, as Kevin previously explained. We have some users who are very central, the top being Alexander Lohnau, Nicolas Fella, Carl Schwan and Heiko Becker. They're pretty central; their node is black or very dark purple. For merge requests we also see people who are very central, and you see the same people come back, like Nicolas Fella. We also see Adam Alexander, Ahmad Samir and Aleix Pol. Something that's interesting is that Laurent is nowhere to be seen. We see that he's very central in the commits network, but not very central when it comes to GitLab merge requests, which shows a couple of things. First, he's contributing code but not necessarily discussing the features he's contributing with a lot of people, so he could be developing things in his corner, for example changing the coding style or changing certain loops, writing them another way. The other thing we can see is that commits on their own aren't enough to really tell whether someone is central to the community.

But then let's zoom out a bit. This is the subnetwork for Krita; this is where Krita enters the scene. We have right here a person who's very central, and this person is Halla Rempt, who answers a lot of merge requests and is very active within Krita. And on to Kevin.

All right, so let's continue with this. Obviously what we've been seeing with Christelle is the community as a whole, but sometimes you also need to look at sub-parts to make sure that you're
not mistaking, you know, the forest for the tree. So we looked at two other examples to try to see whether there were other phenomena we could spot in there. One of them is KDE Frameworks, so basically we did everything, then a focus on Frameworks, and then we'll have a focus on Krita, because we've seen that interesting fact with the subnetwork there.

So, the all-time activity. I'm sorry, because with these colors this one is unfortunately slightly harder to read than the previous one; there's a bit less density in yellow. But if you look closely, you see that the profile is very similar to the whole of KDE. There are two inflection points, though, which don't match the ones we've seen for the whole community. There's one before 2010, which is likely the KDE 3 to KDE 4 transition; it's actually fairly sharp, and a lot of people jumped in to help with that transition in kdelibs. We also see another one around 2014, which is the year we released the first KDE Frameworks 5.

For the activity we can also zoom in like we did before. I did a zoom there where we are basically focusing on the main recruits around that period, which I actually managed to get on the same graph, one at the top and one at the bottom, no meaning to that. That's pretty much Ahmad Samir getting very active, and Alexander Lohnau as well, so from 2018 and 2020.

Then we have the all-time activity, but for the merge requests. Towards the top you might get the feeling of a thicker line, but that's not a thicker line; it's actually two lines which are very active around the same time. If you look at who that is, that's pretty much David Faure and Ahmad Samir. So you can see that they're talking to each other quite a lot, because they seem to send messages around the same time. Then for the all-time team size,
we see a very clear ramp-up for Frameworks 5, both in activity and team size. We also see an exhaustion phase around here: we worked a lot on KDE Frameworks 5, then it's released, then there's a dip in activity while everyone tries to recover, and then it ramps up again. And clearly it's been picking up some pace over the last couple of years. There are open questions there. It could be the development model of KDE Frameworks 5 kind of paying off (before that it was kdelibs, then we changed the development model). That ramp-up could also be the preparations for KF6, which are ongoing right now. And it could actually be both, so then you get a compound effect. We cannot say much; clearly for this kind of stuff we need more data at this point.

For the responsiveness, it's a similar profile to the whole community. We see it's very slow at the beginning and then it picks up. It kind of converges, but responsiveness is slower on average: it's three days to a week, while on average for the whole community it's more around two or three days.

For the network, we get the centrality top five again. For the commits: Ahmad Samir, Friedrich Kossebau, Alexander Lohnau, Laurent Montel, Nicolas Fella. And no David Faure here; I'm surprised, because you see David Faure more in the top five for the merge requests. We have the same effect around Laurent Montel again. That's because Laurent commits in plenty of files, so with the assumption that touching the same file means you collaborate, he shows up in the previous one. But he's not very active on the merge requests, actually discussing stuff. He's doing a lot of cleanups, which are very important, but it's not really part of the decision making, which happens in the merge requests, so he doesn't show up there. But then we see David Faure showing up
again, so his role is very much managerial now.

All right, and now quickly, because we don't have that much time left, a zoom on Krita itself. So we did the community, then a sub-part, and now just the team for a particular piece of software. That's interesting again if you look at this one. The profile is kind of similar, but it seems to recruit faster on average than the whole of KDE. That being said, it's hard to read, because here we don't have that much density. So it seems to have faster recruiting on average than the whole of KDE, but at the same time the price of that is maybe lower retention on average than the rest of KDE. If we focus on the recruits from the past couple of years, that's mostly Eoin, Agata Cacko, Sharaf Zaman and Deif Lou. For the merge requests, clearly Dmitry Kazakov and Halla Rempt are at the top of the game, so they're clearly the decision makers among that team.

Again we see the Nokia peak, because that was very much a big thing for Krita back then, when it was part of Calligra. Since then there's the peak, then a stabilization phase, but clearly it's growing, and it started to grow a bit earlier than KDE as a whole after the Nokia peak. The team growth is a bit slower, though; there's more of an increase in the count of commits. More data, please, right? So we'll do that again in a couple more years, and that might be more interesting.

What's interesting with Krita, though, is the all-time responsiveness on the merge requests. It's similar to the KDE community: we see it's slow in the beginning and then it converges, which is normal; it's the adaptation to a new tool being introduced. One thing they do super well, though, is that they're very tidy with the unanswered stock; there is virtually no stock. What's registered there is just the last
weeks of the history, so it's normal that you have stuff which is unanswered if it's recent. One thing we spotted though, which is surprising to us and for which we have no good explanation, is this one: we see that it starts to converge, but then there's this peak in 2020. It's unclear to us why, and if someone has an insight about that, it's very welcome; I would actually like to know why there could have been this phenomenon.

For the network, the centrality top five: Dmitry Kazakov, Ivan Yossi, Halla Rempt, Sharaf Zaman and Eoin O'Neill, names that we heard before. On the merge requests, though, we see Wolthera appearing, who is not in the commits top five, so apparently a bit more in the conversations and a bit less in the volume of commits.

All right, so we're almost at the end. I had to go a bit quickly on that part to leave some time for this and then the conclusion. There's a name that we heard quite a bit, except in the part on Krita: Ahmad Samir, who's been fairly active. He registers at the whole community level and registers on KDE Frameworks. So we thought that could be a good case, even though the plot is not super interesting in a way, to try out the centrality metric I was talking about earlier. As a reminder, you should basically trust only these plateaus. The data before that, because the team size is still growing, you cannot really draw any conclusion from. That plot has been produced with the KDE Frameworks merge request dataset, and only that. With the orange line, which is the centrality level, we can see that Ahmad is just shooting right through that ramp-up of the team, and whatever happens with the team and the activity, it's just growing and growing in the
centrality. So I think that's actually a name we should count on. If he keeps at it like that, he's hell-bent on being very, very central to KDE Frameworks in particular. All right, and with that I leave you with Christelle again for the conclusion.

So when it comes to the metrics, we actually think that it's interesting to have the conversation during code reviews, not just the code contributions but also the merge requests and all of that, because it allows us to capture a bit of how people behave in general and what their role is, especially for senior developers, and how they evolve. We see that David Faure has assumed a more managerial position, whereas Laurent Montel is contributing code and making cleanups and all of that. So these are two different approaches for very senior developers, and it's something that we've only been able to see because we've combined two views. As I said, commits on their own probably aren't enough to really evaluate someone's centrality, and with a single metric there's too much bias, so it's important to look more generally at how the community behaves.

As for the community, it has a complex history with a couple of decompression phases. They're visible, but it's healthy overall: recruiting is going well, and retention has picked up in the last couple of years. As for the projects, they have their own trends, but they're pretty close to the whole, at least where we've looked. As a whole, the community is very healthy. And finally, GitLab seems to be very big for KDE: MRs are picking up, and we look forward to a couple of years from now, when we'd have more data and could maybe look at more insights.

And finally, what's in the cards for us for ComDaAn? We believe there is some work that still needs to be done on the responsiveness plot, and we'd like to find ways to match GitLab
accounts to commit authors, to be able to combine different views for commits and merge requests; that would be interesting, as I said before. And finally, maybe even have a way to customize weights on activities and network edges, so that maybe contributing a whole new feature would have a bigger weight than contributing something very minor. So that's it for us. Thank you for listening, and any questions?

Hi, thank you very much for this talk. We have five minutes for questions, and we have questions coming in right on time; I can read those to you. David Redondo asks in the chat: we thought that Heiko Becker was maybe in the top because he does the version increases. Shouldn't such commits be excluded, since they somewhat distort the picture?

Do you want to give it a shot, or should I go with that? Did you say I should give it a shot? Yeah, do you want to answer that, or do I go for it? Oh, you can go for it, that's okay.

All right. So yes, that's definitely something where you want to actually exclude those. It was easier before, because we had stuff like this, but it used to be under different accounts, or via the account of a bot doing this. I remember for instance David Faure doing this kind of release work, and you don't see the commits really coming from him. Those have a very well-known name, which is a bot name, and for those bots we actually have exclude rules. In his case it's actually harder, because you don't want to remove him completely from the picture either, so you cannot just say, well, that's him, so we throw his commits out and don't display them. So definitely yes, that skews the picture, and it makes things a bit harder because you cannot separate these two activities very easily.

Okay, thank you. We have another question, again from David Redondo: for the GitLab stats, I think it would make sense to keep in mind that
only some projects moved to it in the beginning, and the bulk only started using it later.

Yes, that's what we tried to show. When you see that big break in the trend, that's actually what we were trying to point out. You can see that in the beginning it was just ramping up, and then suddenly you see everyone coming in. So yeah, that definitely shows.

Okay, and the last one, also from David Redondo by the way: do you think it's possible to extract certain roles from the data, for example querying for developers who do more of a managerial role, like David Faure?

I'm not sure I got the question correctly. So I think: just looking at the data, can you have a query that segregates people into managerial roles and other, different roles that you didn't even think about yet?

I think it's doable to make scripts which actually detect that, based on some profile. In that particular case it's in a way easy, because there's a strong disconnect between the centrality when you look at the commits and the centrality when you look at the merge requests. So if you compare those over time and you see them diverge in some direction, then you can say clearly that the role of that person is changing. As long as the trends stay parallel, you cannot say much.
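The divergence check described in that last answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the talk does not say which centrality measure is used, so plain degree centrality stands in for it; the network is built by linking people who touched the same item (a file for commits, a thread for merge requests); and the function names, the threshold value and the sample data are all hypothetical.

```python
from itertools import combinations


def degree_centrality(groups):
    """Compute degree centrality from co-participation groups.

    `groups` is an iterable of sets of people who touched the same item
    (a file for commits, a thread for merge requests). Two people in the
    same group get an edge. Returns {person: degree / (n - 1)}.
    """
    neighbours = {}
    for group in groups:
        for person in group:
            neighbours.setdefault(person, set())
        for a, b in combinations(sorted(group), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {person: 0.0 for person in neighbours}
    return {person: len(adj) / (n - 1) for person, adj in neighbours.items()}


def centrality_gap(person, commit_periods, mr_periods):
    """Per period: merge request centrality minus commit centrality."""
    gaps = []
    for commit_groups, mr_groups in zip(commit_periods, mr_periods):
        commit_c = degree_centrality(commit_groups).get(person, 0.0)
        mr_c = degree_centrality(mr_groups).get(person, 0.0)
        gaps.append(mr_c - commit_c)
    return gaps


def is_diverging(gaps, threshold=0.25):
    """True if the gap moved by more than `threshold` (an arbitrary,
    assumed value) between the first and the last period."""
    return len(gaps) >= 2 and abs(gaps[-1] - gaps[0]) > threshold


# Made-up data for two periods; names are placeholders.
COMMIT_PERIODS = [
    [{"a", "b"}, {"a", "c"}, {"a", "d"}],  # period 1: "a" shares files with all
    [{"a", "b"}, {"c", "d"}],              # period 2: "a" shares files with "b" only
]
MR_PERIODS = [
    [{"a", "b"}, {"c", "d"}],              # period 1: "a" discusses with "b" only
    [{"a", "b"}, {"a", "c"}, {"a", "d"}],  # period 2: "a" discusses with everyone
]

GAPS_A = centrality_gap("a", COMMIT_PERIODS, MR_PERIODS)
GAPS_B = centrality_gap("b", COMMIT_PERIODS, MR_PERIODS)
print("a diverging:", is_diverging(GAPS_A))
print("b diverging:", is_diverging(GAPS_B))
```

In the sample, "a" moves from being central in the commit network to being central in the merge request network, so the gap changes sign and the check flags a role change, while "b" keeps parallel trends and is not flagged. As noted in the talk, such a comparison only makes sense over periods where the team size is roughly constant.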