Thank you, and welcome everyone. As mentioned, we'll talk about community data analysis and bringing it back to KDE, because it turns out that's something we used to do and then stopped, and I picked up the torch. My name is Kevin Ottens. I'm working for KDAB to pay the bills, doing Qt consulting, trainings and development as a trade, and in my spare time I'm working for the KDE community. I've been doing that for a long while. Lately I'm not churning out much code in KDE products, but it turns out that through this topic I'm churning out a bit of code about the community itself. So let's go back in time first. Maybe some of you know or remember that guy, right? Look at him, he's dashing. What a teddy bear, you want to hug him, right? We've been hugging him for a long time, and you probably know him not for the code he produced, because he didn't really write any code for the KDE community, but he had a couple of scripts, which he didn't publish until fairly recently as far as I know, and he was mostly communicating about the results of those scripts. First, what he called the blue blobs, which you might remember. We had one line per contributor, the scale from left to right was time, and you would get a small blue blob in front of a contributor if that contributor was active that particular week. The more active the contributor, the darker the blue blob would be. It turned out, when I looked at the code (we'll get back to that later), that there were basically four levels of activity: you committed once, twice, three times, or four times and more, and everything was equal after that, which is fairly low resolution. But it was an interesting one still, right?
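The four-level bucketing described above can be sketched roughly like this. This is a minimal reconstruction of the idea, not Paul's actual code, and the sample commit data is made up:

```python
# Rough sketch of the "blue blobs" bucketing: weekly commit counts per
# contributor, capped at four activity levels (1, 2, 3, 4-or-more commits).
# The sample data is invented for illustration.
from collections import Counter

# (author, week) pairs extracted from a commit history
commits = [
    ("alice", "2017-W01"), ("alice", "2017-W01"), ("alice", "2017-W01"),
    ("alice", "2017-W01"), ("alice", "2017-W01"),
    ("bob", "2017-W01"),
    ("bob", "2017-W02"), ("bob", "2017-W02"),
]

counts = Counter(commits)

# Cap at 4: beyond four commits a week every contributor looks the same,
# which is exactly the low resolution complained about above.
levels = {key: min(count, 4) for key, count in counts.items()}

print(levels[("alice", "2017-W01")])  # alice made 5 commits, capped at level 4
print(levels[("bob", "2017-W02")])    # bob made 2 commits, level 2
```

Mapping the level to a shade of blue then gives one colored blob per contributor per week.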
Because from that we could see if new people were joining a particular team or the whole community: if the graph gives you a diagonal, with the blue blobs starting further and further right, that means new people are coming in on each line, right? It also allowed us to see when someone was stopping, or when someone was taking a pause and then coming back. So we could see these kinds of things pretty easily. The other type of diagram he did was this one: contributor networks. You would have one node per contributor, and an edge between two contributors if those contributors collaborated. He determined that through commits; that's what his scripts did: if in one commit someone touched a file, and in another commit another person touched the same file, then you would have an edge between those two. That's a bit of an approximation of collaboration, but if you're looking at the data produced by the community, which in the case of a software product is mainly commits, that's the best thing you can come up with, at least at the time. And then 2014 came. 2014 is basically the last time we've seen Paul at an Akademy, and he gave a talk where he shared something which looked like this. That was a diagram plotting over time a measure of the cohesion of the community: the higher the dot, the more cohesive the network. There was some noise in there, but there was a clear trend climbing all the way up to 2010, and then it dropped again. What's interesting about cohesion is that it's not related to the amount of commits in a particular week; it's related to how well connected the network was, how much collaboration was happening between the contributors. So it was interesting to see it climbing up until the point where it started declining.
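The edge criterion just described, two contributors are linked when they touched a common file, can be sketched like this. The real pipeline builds a NetworkX graph; plain sets are enough to show the idea, and the commit data is invented for illustration:

```python
# Sketch of the contributor-network edge criterion: an edge between two
# contributors when they touched at least one common file in the period
# considered. The data is made up for illustration.
import itertools

# contributor -> set of files they touched that week
touched = {
    "alice": {"kernel.cpp", "gui.cpp"},
    "bob": {"gui.cpp", "docs.txt"},
    "carol": {"tests.cpp"},
}

edges = {
    frozenset((a, b))
    for a, b in itertools.combinations(touched, 2)
    if touched[a] & touched[b]  # at least one file in common
}

print(frozenset(("alice", "bob")) in edges)    # they share gui.cpp
print(frozenset(("alice", "carol")) in edges)  # nothing in common
```

As the talk notes, this is only an approximation of collaboration: two people editing the same file may never have spoken, but with commits as the only data source it is a workable proxy.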
The reasons why, we're still debating. But what I found sad is that he basically started the debate and then nothing happened, right? We didn't really start reflecting on it; it kind of went away and we stopped looking at it. Recently I thought, let's see how he did it, maybe we can pick it up, produce those metrics again, and maybe get new insight, because since 2014 we moved on, right? New things happened. So I looked at what he did, and it was basically Python code using git commands from within the Python scripts and spitting out Graphviz files. That was nice. I mean, it's all handmade code, made with love and the amount of skill that Paul Adams has, and if you ask him he will tell you he's a crappy programmer. I can't say I loved all the code I've seen there, but there were good things; some of the stuff I'm using now still comes from there. There were a couple of problems with his scripts, though. They were very slow, and hard to maintain and extend, because it was all really handmade. And the fact that he was using Graphviz for the output made things fairly static: if you remember, he ended up with a PNG which was 10,000 pixels wide, which is not necessarily easy to comprehend and navigate. All right, so that's pretty much the situation up to 2014. And then 2018 comes, and here I am, thinking, okay, let's revive this and see what I can do. In 2018, you might have heard, everybody knows about data science, what it is, and how every problem in the world gets solved with data science, right? Well, yeah. If I were you, sitting in that room, I would probably stand up and shout bingo at that point. There's a lot of buzzword bingo in the data science field in my opinion, but they developed some interesting tools, in Python, which is nice because that's what I was starting with.
I mainly used three of them: pandas, NetworkX and Bokeh. pandas gives you a vectorized view of your data, so you can process it in bunches instead of writing your own manual loops and doing everything by hand. NetworkX is nice for dealing with graphs. And Bokeh is nice for spitting out visualizations as dynamic HTML, so you can actually navigate in there. There is also a nice tool named Jupyter which allows you to turn around your code faster, because you can prototype and re-run parts of it and see how it works. So that's what I've been looking at. And so we have a very cool demo... sorry, that's that. Jupyter looks something like this: it's kind of an IDE inside your browser. You get views on your code. I won't spend too much time commenting on the code, otherwise I won't fit in the 20 minutes I have. But the idea is that you can split your code into snippets and then run each snippet individually. So I run the first code snippet. That one is longer; that's the part which processes the Git history. In that snippet I'm looking at the whole Git history of KDE PIM for the year 2017. And then we can start to fiddle with the data. What's nice with Jupyter as well is that if you output a variable, any variable, it makes a best effort to display something meaningful, which, when you have plenty of data like this, is nice for checking that you have a table which looks like what you expected, because you will get things wrong. Then some more number crunching produces the actual network, some more tables. And at the end we finally get the Bokeh rendering, these kinds of graphs. Unlike what we had with Graphviz, that's actually interactive, so we can look in there, zoom, pan, and figure out who that guy is over there.
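The history-processing step can be sketched roughly like this. The real notebook loads the result into a pandas DataFrame; here we parse a captured `git log` output into plain records, and the log text is made up for illustration:

```python
# Sketch of turning `git log` output into per-commit records. In practice
# you would capture something like `git log --pretty=format:"%h|%an|%as"`;
# here the log text is hardcoded for illustration.
import datetime

log = """\
abc123|Alice|2017-03-04
def456|Bob|2017-03-05
"""

commits = []
for line in log.strip().splitlines():
    sha, author, date = line.split("|")
    day = datetime.date.fromisoformat(date)
    # Bucket by ISO week number so activity can be sampled weekly.
    commits.append({"sha": sha, "author": author, "week": day.isocalendar()[1]})

print(commits[0])
```

From records like these, grouping by author and week gives the activity tables the notebook fiddles with.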
And he's basically connected to almost everyone out there, so his centrality value is fairly high because of that. Or look at that one, which also has very high centrality, and so on. We can do these kinds of things fairly easily. And if we look at the whole script, which I went through quickly, it's fairly linear: data in, data out, and you just do operations on it, so it's fairly maintainable in the end. And it's fairly easy to change things. If I wanted, for some reason that made sense, to flip my centrality values, I could just change that part to one minus the value, run just that bit, and now I have my centrality flipped; re-run just the rendering, and I have my color scale backward. So: change a line, experiment, see how it looks. A very good tool for getting there. And from that tool you can make scripts and start to produce stuff like this. That one is commit count versus team size. By team size I mean the number of people active in a particular week, so it's actually lower than the real team size, normally. Each dot corresponds to one sample for a particular week, and the line plots are the same data but with the high-frequency features filtered out, to remove the noise and get an idea of the trend. And interestingly, when I ran that one on all the history we have for KDE, which is nice because we did the work of keeping the history in the KDE repositories all the way back to the beginning, we didn't lose anything as far as I know, so we have history going back to 1996. We can see both the commit count and the team size growing very consistently, until 2010 when it drops. Interestingly, that's the same time when Paul found the drop in cohesion.
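The trend lines just described, the same weekly samples with the high-frequency features filtered out, can be sketched with a simple centered moving average. The real pipeline does its smoothing on pandas series; the window size and the sample series here are made up:

```python
# Sketch of the trend line: the dots are noisy weekly samples, the line
# is the same data smoothed to reveal the trend. A centered moving
# average is one simple low-pass filter; the sample series is made up.
def smooth(samples, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

weekly_commits = [10, 50, 12, 48, 11, 52, 13]
print(smooth(weekly_commits))  # the wild 10/50 oscillation flattens out
```

The filtered line is what makes the 2010 tipping point readable at all; on the raw dots it drowns in weekly noise.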
I'm currently trying to reproduce that particular graph and failing so far, so I'm not sure what's going on there, but it was interesting to find the same tipping point, and then we see it declining. We can see very interesting things happening there. For instance, here we have a phenomenon where the team size is kind of constant; we have a plateau on the team size, and during that same period we actually see the activity climbing up crazily. If you look back at the history, that's basically the time of preparing KDE 4: everyone was busy breaking everything in kdelibs, and obviously that was generating a lot of commits and so on. It was the same team, right? So you can have great variation in activity with a fixed team size. The point which is inconclusive, and the great question for us, is: is it still declining or not? That's a tough question, and when I'm exploring those graphs, that's where most of the risk is. Because look at that: if I look at it that way, to me it's still declining, right? But the thing is, the way you present data actually matters. If we remove that big drop we had before, which is skewing our view, then it's not looking like something which is declining that much anymore. So when you get something like this, and because Lydia was bashing me on the head about that particular one, I actually resampled the data starting at a different data point, and that's the view which is more relevant: it's not completely stabilized yet, but very clearly it's almost stabilized at that point. So it looks like, yeah, we had that big decline, but now we're almost at a plateau, and seeing the number of new faces I see this year, maybe we can hope things will climb up some more from there.
So that's, in my opinion, actually good news. Interestingly, when I published that graph on my blog, it's the one which got me the most comments, because everyone was outraged about the 2010 dip again. But it's not the most interesting stuff you can come up with, in my opinion. When I do stuff like that, no one comments on it. Lazy bastards. But the thing is, we can do this now: I've revisited the blue blobs, and remember I complained that the resolution was fairly low in the number of colors we had, but now we can have more colors more easily than before. And so we can see that Laurent Montel truly is a beast, outdoing everyone, every single week of the year. It's just crazy. But then you can spot things like: something happened to Volker on that particular week. He had been lazily churning out a couple of commits here and there, and then suddenly 130-something. That's actually the Randa effect: that week he was at Randa, and at Randa you trade code for cheese, right? So he was there, doing stuff. We can also look at this: that's KDE PIM again, the contributor network from last year, the one I showed with the Jupyter stuff. We find David, that guy is everywhere; Daniel, okay; Volker, still at it. And then there's this guy, Allen Winter. If you look at him, most of you haven't seen him, never heard about him, or think he left the community. No, no, he's still doing stuff in KDE PIM, and he's still very central, among the top five or six most central people. And what's also interesting are the kinds of nodes you can find, like Christian Mollekopf, sorry, who is basically acting as a bridge.
He's the bridge to three persons who are connected to no one else. If you want details about that, go to my blog, but that's the kind of thing which gives you an insight: okay, something interesting might be happening there, so let's look at it. Which prompted me to do this one, which is even harder to read: three normalized plots, where for one contributor you get his centrality and his activity over time, and the green one is the normalized team size. I normalize everything there, so it's all between zero and one, because what I'm interested in is trends, not absolute numbers. And the reason why we have the normalized team size is that we cannot really compare centrality versus activity if the team size changes. In that area, if we consider the team size somewhat stable, then I can look at the variations of activity compared to centrality, that's fine. But if I try to compare centrality there with centrality over there, it doesn't mean anything, because the team shrunk at that point, so it doesn't make sense. And the graph for Volker is actually interesting: to get his centrality up at that time, he had to be fairly, fairly active, a crazy level of activity from him. But then you get a huge spike in centrality here with a very small spike in activity. That's because the team size changed. So those are the kinds of things you can look at. More interesting: Christian, again. You see his activity climbing while his centrality is dropping, in an area where the team size is kind of constant. And that, ladies and gentlemen, is a sign of a fork in the community: we have someone who is working on something that no one else is working on; he started to work in his own corner. I'm not judging the reasons or whatever.
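The normalization used for those three plots can be sketched in a few lines: each series (centrality, activity, team size) is rescaled to [0, 1] so only the trends are compared, never the absolute numbers. Min-max scaling is one straightforward way to do it, and the sample data here is made up:

```python
# Sketch of the per-series normalization: rescale every series to [0, 1]
# so that trends can be compared across series with wildly different
# magnitudes. Sample data is invented for illustration.
def normalize(series):
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]  # a flat series carries no trend
    return [(x - lo) / (hi - lo) for x in series]

activity = [2, 5, 9, 4]        # e.g. weekly commits of one contributor
team_size = [120, 150, 90, 60] # e.g. weekly active contributors

print(normalize(activity))   # [0.0, ~0.43, 1.0, ~0.29]
print(normalize(team_size))
```

As the talk stresses, the normalized team size has to be plotted alongside the other two: a centrality swing only means something over a stretch where the team size curve stays roughly flat.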
I'm just saying that's a pattern, and there might be more of those. That was a quick tour of the demo part, and I'm running out of time, so quickly: what's coming next? We have a few metrics and so on that we've been looking at; obviously we could look at more data sources. That's one of the requests: yeah, look at the reviews. I'd love to look at the reviews, there's plenty of information there, but the Phabricator API is just a bitch to get to, to actually extract the information. It would require a very big chunk of time to get there, and I never managed to, but I would love to have that. So there are more sources we would want to look at in the same way, but that's not that easy: you need the complete history and so on. Those are the obvious next steps: more metrics, more data sources. But if we do only that, we're always looking at history; that's pretty much what I do there. When I look at history, I might find patterns like the ones I showed you before, and maybe we can start to do stuff with those patterns. One of the things would be: can we try to predict when someone is slowly on the way out? Or can we try to detect patterns like the situation with Christian, which would make us, or at least the Community Working Group, proactive in figuring out that maybe something bad is happening inside a team? Because right now the Community Working Group finds out the facts when it's kind of too late, sometimes when harsh words have been uttered already. If we could get early warnings, they could be proactive, and before the relationship breaks we could try to salvage it. That's one of the things which would be interesting to do, which I would love to do at some point, but again, it requires quite some time, like everything. The other one I have is more of a funny idea: having our own recommendation system.
You contributed to kmail.git for the past four weeks; maybe you would be interested in checking out messagelib as well, right? For onboarding, that would actually be something interesting to do, because often contributors get in because, well, there's a small thing in KMail I don't like, I will fix it. And then, you know what? We have a hard time with proper retention afterwards. We might more easily push them toward other tasks we have prepared. Maybe at the end of the day KMail is not the software product you would like most, and you would like to look at something else. So if we have a way to provide interesting recommendations for them, that could be valuable. It's not a joke at all. That's pretty much what I'm looking at for the future, for now, and thanks for your attention. If you've got any questions, we still have time for that, because I'm two minutes early, so we're kind of fine. Is there a break after that? Oh, we have a break after that. Yeah, bring questions. Oh, before the first question, one thing I forgot to mention: there will be a BoF on Monday morning, if I remember correctly, about that topic. So if you want to get in, if you want to give a hand, you can come to that BoF. If you have ideas of things to measure, we can see whether that would be feasible or not, and have more discussions about it. But, questions?
[Audience question, mostly inaudible, about separating the data from its interpretation.] Could be, yeah, that's the hard part. That's why I actually split that one into two separate blog posts. Because obviously I fail sometimes, I'm still human, so sometimes I slip something in where I shouldn't. But I try to have one post where I'm actually looking at the data and trying to figure out what's going on, a kind of cold way of looking at things, and then give my own opinion and interpretation at a distance. That's what I tried to do there, and I failed only slightly. So, the second blog post is not only about the data. The data tells me that the drop happened at the time of the Git switch, and that we have a hard time recovering from it. That means there are two periods in there. First, the switch to Git itself: the only reason we would have dropped like this with the switch to Git is that Git was hard to use at the time. If you remember 2010, Git was a real pain to use for almost everyone, even seasoned developers. Then the second period, where Git improved, GitHub appeared, everyone knows Git, and still we have a hard time recovering. For that particular area, which is the one we're in, we obviously have to speculate a bit. I have an opinion about it, which is not necessarily true, because I have nothing to really back it up apart from my own personal experience at the university: when I put students in front of Phabricator, they're like, oh, I have to create an account? Can't I identify myself with Facebook? I mean, that's kind of crap, I hate that. That's the kind of thing I have to hear, right?
And yeah, if those people think that way, obviously, when you put them in front of something like this... In that case it might not even be Phabricator specifically; it's the identity thing. They have to create an identity, and for them it's: why would I do this? I don't want to do this. I want to identify with GitHub or with whatever, and we don't have that option. So maybe staying with Phabricator and just fixing that would be enough. Who knows, right? My personal opinion, from the interactions I've had with them, is that there's probably more to it, because I can force them to create an identity and then see how it goes, and they generally have more problems. [Audience question, partly inaudible, about comparing against other projects to validate the analysis.] Right. That's why I started to also look at other communities, because at one point the problem I figured out is that I'm too close to the data. In some way that's normal: I'm doing this because I want this community to improve. But if I want to find patterns and so on, I have to try to find them in other places too, which forces me to get to know other things. One of the communities I think I'll look at is Nextcloud, because they had a fork, right? The fact that they had a fork makes it interesting for me. So I will probably pull more of the data from Nextcloud, try to find the period where things were happening, and see how it goes. That's one of the things I have on my mind. But yeah, definitely I need to look at more. That's also why the last one I did was on Rust, because I've been playing with Rust lately, to have different things to look at.
[Audience question, partly inaudible: what about young people? A 16-year-old today grows up with phones, not PCs, and only touches a keyboard because school requires it, yet the urge to build something is incredible. Have we lost a generation of people who want to do something but won't come to the platforms we use today?] Not necessarily, I mean, I don't necessarily have an objection to that, because yeah, clearly it's more hip to do stuff on Android, right? They use that daily, so they want to play with it. And so an option for us is also to go more aggressively onto those kinds of platforms. We have Kirigami and so on, so we are already taking action toward that, and that's a good thing to push. So I have no particular objection there, but clearly there are other factors as well. If you take a teen playing with their phone, then obviously, Android or not, they probably won't contribute. If you take a student in software engineering, then yes, getting onto these kinds of platforms could be interesting, but Linux is still interesting in some universities as well. So for those, there are more factors at play, and it's more those factors that I'm looking at with the history that we have. [Audience question, partly inaudible, about the side effects publishing these analyses could have on the community.] By that, you mean the impact it could have on the community when I publish it? Yes and no, in the sense that it never got beyond the point of maybe kicking our asses into doing something, about onboarding for instance. Which is great, because you're here, so that's a good thing in that regard.
If the graph about team size and commits had been really, really, really bad, I can't tell whether I would have published it or not. The thing is, it landed in territory where it's kind of inconclusive, and that's good in a way, because if it's inconclusive, it gives people the message that it's in your hands now. When Paul did this in 2014, we were still very much inside the decline, so when people were confronted with that, and I think that's why we shied away from the debate at the time, it was: oh my God, we're gonna die. And that's basically people caught in the headlights of a car coming quickly: they freeze and don't do anything about it. But we're out of that now. It's inconclusive, it's kind of stabilized, so we can go in any direction we want. It's not like some force of nature is forcing us to shrink or anything. [Audience question, partly inaudible, about what it means for two contributors to have modified the same file.] Right, so it's the same file in the same time period I'm looking at: if I'm looking at a time period of a week, two contributors are linked if they touched the same file or set of files that week. One thing I didn't mention: the link has a weight as well, so if the overlap in the files they've been touching that week is very large, then the weight of that particular edge is higher. That then plays into the centrality computation and into the layout of the graph. Right, we have the BoF on Monday morning, 10:30, room 14 if I remember correctly, something like that; it's on the wiki. And now we have a break, so you can go to the info desk and grab drinks if you want; we have a half-hour break and then we'll be back here.
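The weighted edge just described can be sketched like this. The talk doesn't name the exact formula, so the Jaccard-style overlap below is an assumption, one natural way to make the weight grow with the file overlap; the file sets are made up:

```python
# Sketch of the weighted edge: within one week, two contributors are
# linked if they touched common files, and the larger the overlap, the
# heavier the edge. The Jaccard-style ratio is an assumed formula (the
# talk only says "larger overlap means higher weight"); data is made up.
def edge_weight(files_a, files_b):
    common = files_a & files_b
    if not common:
        return 0.0  # no shared files that week, no edge
    return len(common) / len(files_a | files_b)

alice = {"mail.cpp", "imap.cpp", "gui.cpp"}
bob = {"imap.cpp", "gui.cpp", "docs.txt"}

print(edge_weight(alice, bob))  # 2 shared files out of 4 distinct -> 0.5
```

Feeding such weights into the graph is what lets the centrality computation and the layout distinguish a tight collaboration from a single incidental shared file.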