Okay. Good. Excellent. So, hello and welcome to the next session in the main hall of DebConf Taiwan 2018. I'm here with Professor Dr. Wolfgang Mauerer from Regensburg, a city in the south of Germany. You're working with the University of Regensburg, but also with the Siemens corporation, and I just learned that Siemens has actually had an open source competence center for more than 10 years, which is pretty cool. And of course you have a lot of products with Linux and embedded Linux, and that's your personal background. Today you will give us more information about the research that you are doing, and especially analyze Debian. The title of the talk is: are any big brothers watching you, and if yes, what can they tell about Debian? Of course, we want to know that. So, a big round of applause for Professor Dr. Wolfgang Mauerer.

Yeah, thank you very much for the introduction. Actually, that's, I think, the longest title I've ever given to a talk that I submitted somewhere. Are any big brothers watching you? The answer is of course yes, because if not, the second question wouldn't have made any sense. As you already said, I'm kind of a split personality: I work in academia with some fraction of my time, but I'm also involved with Siemens corporate research. And yes, we're doing lots of projects based on Linux. We've had an open source competence center for more than 10 years; it goes even further back than I've been working for Siemens, and quite many Siemens products actually use Linux in its various forms. For instance (that's a bad angle for the laser pointer, I realize), the leftmost device is a magnetic resonance imaging tomograph that runs a special combination of Linux plus a real-time operating system. And the next one, if that rings a bell for you, is a device that does communication.
Sorry, the next device doesn't do communication; it's a SIMATIC, an industrial controller. Then comes a mobile X-ray device, again running Linux. Then comes one of our communication platforms. And the last image is supposed to represent the more fundamental research that we're doing at Siemens Corporate Technology. I've also done a fair share of documenting and writing about the Linux kernel, some of it in languages that I cannot even read. I've been told that what's on the last two images in the bottom row does seem to make sense, but personally I cannot judge that. That's me on the cover, 10 years younger, taken by a professional photographer and Photoshopped. But it's really me; I was involved in producing this picture.

Now, why am I speaking here at DebConf? This is actually my first time addressing you at DebConf; I'm typically found more at Linux Foundation events, the Embedded Linux Conference, and so on. Part of the reason is that Siemens is sponsoring this year's DebConf because of our involvement in the Civil Infrastructure Platform, and the Civil Infrastructure Platform is also part of the reason why I'm telling you the things I'm about to tell.

Coming to Debian: coming here, I was wondering when I actually started using Debian. I've never been a Debian developer; I've only been involved with Debian as a user, but that goes back quite a while. I do remember one of the questions I thought about: should I be using Debian or Ultrasill? Back then, Ultrasill won because it came on a CD-ROM, while I would have had to install Debian from three-and-a-half-inch floppies, which meant carrying quite a stack of floppies home. But I've since run Debian on quite a few architectures: x86, ARM, SPARC, PowerPC, you name it, Alpha, Itanium, PA-RISC.
When I first compiled the list, I got a bit depressed because I realized how many of the architectures I've run Debian on are not even available on the market anymore; I guess that just means one is getting old. C'est la vie.

Anyway, before I start discussing what I want to discuss, I'd like to know a little more about you. Here are three target-audience check questions. Who of you has done academic software engineering research lately, or is an academic? Okay, that's kind of a minority. Who of you has read about software engineering research lately? Okay, that's a few more, but also not too many. And who of you has attended a software engineering conference lately, an academic software engineering conference, I should say? Okay, also a few, which is pretty good, because I feel when I talk to researchers and when I talk to developers, there's a certain disconnect between research and the people who build the actual software. There are, of course, certain exceptions to this rule; one is sitting right here, doing research and presenting research at Debian conferences. But as I've seen from my quick survey, that's obviously not the case too often, so it may make sense to discuss this topic. What I want to discuss today is what kind of research is done that either targets Debian, uses Debian as a foundation for research, or can bring positive influence to Debian. So in the first part of the talk, I'll try to sum up a little of the research by other people that has been done in the last 10 years or so that somehow relates to Debian. And in the second part of the talk, I'm going to delve a little more into the details of one field of research that's currently becoming very productive, namely the socio-technical analysis of large-scale development undertakings.
Number three, if I have time for it, is meant as an interactive part of the session, because I also have some questions for you and would like to know your opinion on certain things.

So, what's going on in software engineering research currently? As a big fat disclaimer up front: the software engineering community has about as easy a time finding a common opinion as Linux developers have finding a distribution that suits everyone. So what I'm saying is, of course, my personal, subjective opinion; I'm just not going to prefix every sentence with that. In my personal, subjective opinion, some current trends in software engineering research are: one, migrating the field to empirical, quantitative, and evidence-based methodologies. We used to have a lot of opinions and a lot of guesswork in the earlier years of software engineering, but people are transitioning to make the field quantitative and really turn it back into an engineering discipline again. And there's also a lot of interest lately in automated software engineering and construction, because we've come to realize that doing software is maybe too hard for humans, so we should get all the help from computers we can get.

Why does it make sense for scientists to investigate Debian, to do research on Debian? Of course, I don't need to tell you that. It's one of the largest collective engineering undertakings of mankind, which makes it interesting in itself, but the main advantage from the scientific point of view is that there's a lot of publicly accessible data behind it. You have mailing lists, you have bug trackers, you have all kinds of freely accessible data sources that simply do not exist in commercial projects of a similar magnitude, and that can be analyzed with various methods. And that can then help science to understand the really important questions of software engineering quantitatively. How should we do development?
How should integration happen? How do processes work optimally in software development, and so on? As for the Civil Infrastructure Platform, which I've already mentioned, and for Siemens and other companies, there's one solid reason why research about the softer aspects of software engineering, like processes and methodologies, is becoming more important: as I said in the beginning, Linux is used in very many non-traditional products. Your average nuclear magnetic resonance machine is quite different from your laptop, and trust me, if things go wrong in such a machine, it's also quite different from your mobile phone rebooting. These are systems that need to satisfy very strict safety requirements. On the other hand, they need to provide more and more functionality, and that means people are really getting into using Linux in these very demanding safety-critical fields, which has its dangers, and which requires certifying software that comes from the Linux domain and has not been written with safety in mind. Actually, there are three different ways to get software safety-certified. One is to start from scratch, development from zero with standard-compliant processes; that's obviously out of the question for Linux, and nobody wants to start the Linux kernel over again. But you can also argue with proven-in-use arguments: you could say, I've been using Debian on my SPARCstation for the last 25 years and not a single fault ever happened, so let's put it on a magnetic resonance tomography machine (simplifying things, of course). And you could also do so-called compliant non-compliant development, which requires you to prove certain aspects of the development processes, and of course that's only possible if you analyze the processes, if you can make quantitative statements about them. And that's the general goal of analyzing the softer elements of software engineering.
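A proven-in-use argument ultimately rests on a quantitative claim about failure rates observed in the field. As a toy illustration of the kind of arithmetic involved (my own simplified sketch, not a certification procedure and not something from the talk): under a Poisson failure model, observing zero failures over T cumulative operating hours yields a one-sided upper confidence bound on the failure rate of -ln(1 - c)/T at confidence level c.

```python
import math

def failure_rate_upper_bound(hours: float, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the failure rate (per hour)
    for a Poisson failure process with ZERO observed failures over
    `hours` of cumulative operation.  Since P(0 failures) = exp(-lam*T),
    the bound solves exp(-lam*T) = 1 - confidence."""
    alpha = 1.0 - confidence
    return -math.log(alpha) / hours

# e.g. 10 million device-hours in the field without a single failure:
bound = failure_rate_upper_bound(1e7, confidence=0.95)
print(f"lambda <= {bound:.2e} failures/hour at 95% confidence")
# prints: lambda <= 3.00e-07 failures/hour at 95% confidence
```

Real standards demand far more than such a point calculation (known operating profile, unchanged configuration, complete failure reporting), but it shows why large amounts of documented field history are the currency of this argument.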
I've already mentioned the Civil Infrastructure Platform; that's one of the reasons why I'm interested in this kind of research. You may have seen our booth, and you know that we are sponsoring the conference. The Civil Infrastructure Platform is dedicated to bringing Linux into such areas, not so much from the safety point of view as of now, but from the super-long-term maintainability point of view. You may replace your mobile phone every year, every two years, maybe every three years, but you certainly don't want to replace your nuclear power station every two years. You don't want to replace your magnetic resonance tomography machine every two years, and you don't want to replace the industrial control in your big industrial plant every two years; these devices are supposed to last for 10 years, 20 years, even longer. It's the goal of the Civil Infrastructure Platform initiative of the Linux Foundation to provide such systems, which of course creates strong interest in many things: for instance, in long-term support of the distribution itself, so we're supporting Debian LTS; strong interest in reproducible builds, for various reasons; but also interest in the research topics I mentioned, namely automated software engineering and especially quantifying processes, identifying when processes work well and when they don't. So, coming back to analyzing Debian: process analysis and related things are just one particular subfield that people have been working on by analyzing Debian, and I promised in the submission of this talk to outline the research that has been done about Debian, or using Debian, and so on.
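To make the idea of quantifying a process slightly more concrete, here is a toy metric that also comes up later in the talk: the truck factor, roughly the smallest set of developers whose loss would orphan the majority of the code. This is my own illustrative sketch, assuming the history has already been reduced to a map from each file to its principal author (real analyses are considerably more nuanced):

```python
from collections import Counter

def truck_factor(file_owners, threshold=0.5):
    """Smallest number of top authors who together 'own' more than
    `threshold` of all files; losing them would orphan the majority
    of the code.  `file_owners` maps file path -> principal author."""
    counts = Counter(file_owners.values())
    owned, factor = 0, 0
    for _author, n in counts.most_common():
        owned += n
        factor += 1
        if owned > threshold * len(file_owners):
            break
    return factor

# toy example: one dominant author and two occasional contributors
owners = {"a.c": "alice", "b.c": "alice", "c.c": "alice",
          "d.c": "bob", "e.c": "carol"}
print(truck_factor(owners))  # alice alone owns 3 of 5 files (>50%) -> 1
```

A low truck factor is exactly the kind of quantitative red flag such process analysis is after: the project works today, but its sustainability hinges on very few people.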
Turns out it's pretty hard to find objectively representative examples. I already mentioned this problem: it's always hard to talk about other people's work, and in this first part I'm trying to focus on other people's work. That's hard for two reasons. First, because a lot of software engineering research is published each year; hundreds of papers is probably totally underestimating the amount. And people often use Debian but don't mention it explicitly; when they analyze Debian it's typically mentioned, but still, any choice I would make is subjective and unfair. So I decided to go with a perhaps more objective and fair method: go to Web of Science, a scientific publication database, search for the keyword Debian, select the papers I get, and then maybe add work that cites me or work that builds on this. Of course that still gives a very subjective and unfair choice, but at least superficially I can claim that I have made an effort to do this objectively. As a result, I'm going to talk about a number of papers, about 30 of them. Apologies for not being able to cover them in detail; that's quite impossible in a one-hour session, but I'll try to cover them as well as possible. For the next slide I apologize in advance... no, sorry, for that one I don't apologize yet; I'm going to apologize for the one after.
These 33 papers I've classified into broadly five categories of why people are interested in dealing with Debian. That is, of course, improving software quality, the foremost goal of software engineering; analyzing communities and cooperation, where we get into a more non-standard topic; testing and analyzing code at large scale, which is again a fairly obvious thing to do; understanding licensing and code sharing, which of course relates very much to open source software and is very close to the heart of many Debian developers; and then, finally, as the last category, research that aims at improving Debian itself. Now, that's the slide I'm apologizing for, and also for the four that follow. I really tried to present the papers I'm mentioning in a more graphical or entertaining way, but it turns out that's simply not possible, so I'm just listing the authors, approximately the time when the research appeared, and what it was about, and I will comment a little so that you get an impression of what people are actually trying to do with the help of Debian.

One of the earlier papers dealing with improving software quality in general with the help of Debian was by Chen and Wagner in 2007: a large-scale analysis of format string vulnerabilities in Debian Linux. That's of course a very interesting goal for every software developer, especially if you're dealing with C and C++. They could analyze a very good fraction of all C packages in Debian; they used Debian as a data source to get their hands on as much C code as they could, analyzed about 66%, roughly 70%, of all the C files in the Debian distribution back then, which was Debian 3.1, and found a whopping 1,500 format string vulnerabilities that could possibly be exploited. Okay, they estimated that only about 85% of them are true positives, but still, that's a quite scary number. They also managed to get
rid of quite a few of them in the course of this research, so that is quite an impressive work. Moving forward a little more to the present: Adams and colleagues, and especially the name Daniel German, whom we will hear quite often on this list, published an empirical study of integration activities in distributions of open source software, one of them of course being Debian. They were able to identify integration patterns that people actively use in everyday development, and the nice thing about it is that they did not just do a theoretical analysis but talked to actual maintainers in Debian to confirm or refute their findings, and then documented these for everyone to benefit from the wisdom that comes from Debian. The next paper, source file set search for clone and reuse analysis, is a typical example of research that builds on the large data sources provided by Debian. Developers do like to copy and paste on occasion, but the question is: how often does that happen, and how does it influence software quality? They came up with an academic method to evaluate these questions, to find source code duplicates in the Debian ecosystem, and to at least roughly quantify how often that happens. Good. And finally, the last paper, the Debsources dataset, should surely ring a bell with many of you, because one of the authors used to be a Debian project leader. That turns out to be a quite recurring pattern; many Debian project leaders tend to publish academic work, so in that respect there is already quite some connection between Debian and the scientific world, at least when it comes to the leaders. And this dataset, I guess you've all heard of it, maybe you're all using it; it comes from academic research but builds up a
faithful representation of all the data that they can get their hands on, and by sharing it they're making a huge amount of data available to other researchers, data that can be analyzed and used to benefit from Debian.

Communities and cooperation: that's a quite different topic, one that has traditionally not been so much in the focus of core computer science research, of core software engineering. And again, the first papers I'd like to mention come from one of the previous Debian project leaders. I think I spelled his name right; he lacks some vowels. That was one of the early studies on how people actually cooperate in large, distributed software projects. Back in 2003, doing something like Debian, building a volunteer-based large system, was a bit of a shock to industry, because they were only used to traditional, top-down, hierarchically organized development undertakings. What these authors did was not yet formalizing or making quantitative statements, but really capturing the essence of the Debian development process and the approaches that have been found useful. That led to a number of subsequent papers that I'd not like to discuss in detail. What's interesting is that a couple of years later, the knowledge that there are actually alternatives to traditional management styles even spread outside the engineering domains: the second paper, from the Information Economics and Policy journal, is not really related to computer science, but they also used insights from the Debian distribution to teach the world different management styles than the traditional ones. Yeah, the paper by Wang is an example of a large number of papers that started to look more quantitatively at the problem of how people cooperate, how communities are formed, and so on. The subject that Wang is considering is a very special one:
so they're looking at email archives from Debian and then try to infer: if someone writes an email to any of the mailing lists, how likely is it to get responses, how likely is this or that developer to respond, and so on. And it turns out you can do that pretty reliably with machine learning and statistical techniques. Okay, the question of course is what this is good for, but that's not really the essence of research; it's already amazing that you can tackle such very social problems in a quantitative manner. And the last three papers are a very good example of the Debian community being made the guinea pig of actual research, maybe knowingly, maybe unknowingly. Many of you have perhaps participated in one of the key signing events at Debian conferences, and then you surely have heard the name Wolf. If you didn't know that you've contributed to actual academic research by doing proper key management in your function within Debian, then these are the papers you may want to look at.

Testing and analyzing code, at large scale I should mention, because testing small code is of course not so much of an issue; testing 50,000 packages produced by thousands and thousands of developers is a completely different matter. That's the subject of the next category. The first paper I'd like to mention seems like a very straightforward one: on the distribution of source code file sizes, done in 2011 by Herraiz and colleagues. This study was of course performed on the Debian distribution because it contains very many files, so it seems like a good place to go if you want to measure file size distributions. And what they surprisingly came up with is that many of the previous assumptions on how file sizes are distributed in large projects are essentially wrong, so they need to be described by different distributions. But if people have
been using wrong estimates for that previously, that means that many of the models we use to compute the economic value of software, to predict build times, to predict bugs, and so on, are essentially flawed. Good. Going on: mining security vulnerabilities from Linux distribution metadata is a work that goes more into the testing portion of this category. The authors were interested in how security vulnerabilities within Debian evolve. They are of course getting fixed, as far as that's possible, but how do they track across releases, how long does it take, and are we getting better at that or not? They could only do that because all this historical data is available in the bug trackers and the open forums; that's again an example of research that's simply not possible if you don't have open data sets, if you only rely on proprietary, company-internal code. The next paper, by König and Townsend, research done in 2015, is in my opinion particularly interesting, because it raised one problem that's known to many developers in the real world but not to many researchers in the academic world. What they did in this paper is actually very simple: they took some existing research tools and tried to apply them to the Debian universe. Of course, the result is that the research tools broke in most of the cases, because such a tool is typically optimized for some very special cases; it's not optimized for all the bells and whistles you find in real-life source code, and it's not optimized for the volume of data you're dealing with. But this paper turned out to be quite a win-win situation for both research and Debian, because on the one hand, by applying the tools to Debian, they managed to improve their tools, fix bugs in them, fix corner cases, and so on, but they
also contributed lots of bug reports back to Debian, 700 in this case. If any one of you had been bored before 2015, you would have had lots to do after this work was done. Good. The last paper, let me just briefly skip over it, is a classical example of applying testing approaches to large collections of software: the authors could prove very specific properties, namely lock freedom, for C programs in an impressive number of cases, 292. The community, I guess, doesn't learn too much from that immediately, but it's at least reassuring to hear that it is possible to detect many common bugs in your software repositories.

Good, moving on to licensing and code sharing, a topic that's obviously very specific to open source software. I guess I'm progressing further in time than I would like; anyway, I'm going to just quickly summarize what people did here without getting into the details. It's basically two types of approaches. One is using source code analysis to detect incompatibilities in the licenses that are combined in Debian packages, or in the distribution at large: finding spots where legal changes are needed to make sharing of source code legally okay, to comply with the open source licenses. Another aspect I'd like to mention is that Debian can still bring surprising input to other fields of science. If you look at the third paper from the top, the 2013 paper, that's actually from physicists. I'm a physicist by education, so I can fully understand the problem. Here's a quote from a physics conference: when it comes to software, everyone thinks they're Moses. So a physicist, of course, would never listen to any computer scientist on how to write their software, how to package their software, and so on. And that's why it was possible
in 2013 or 2014 to come up with a paper that basically said: so, we're packaging software; astonishingly, there have been people who have done that before, and maybe we shouldn't do this our own way but use the same tools, formats, and so on that Debian uses. So if you're in physics, you can reinvent a Linux distribution and still get that published. But that's again one of the nice effects that distributions like Debian have.

Finally, coming to the last category, research about improving Debian as such, I'd like to highlight two approaches in particular. The first one is again from one of the former Debian project leaders; we've had three or four of them already. What he did was, superficially, a very simple task that usually wouldn't be regarded by science as interesting, if you don't care for the details. If you start caring for the details, you of course realize that the task is very hard, and the task was to rebuild the whole Debian distribution as quickly as possible. In 2009, if I recall correctly, there were about ten or twelve thousand packages, already quite a lot, and it took about a week to build the whole system, which is naturally bad if you want to automate testing, if you want to test reproducible builds, if you want to roll out releases, and so on. The goal of this paper was a really practical one, namely to use one of the largest French distributed computers, the Grid'5000 machine, to rebuild Debian. The paper itself is very interesting because there are so many problems that you wouldn't expect, ranging from packages not being able to build in parallel to vastly differing build times, so a lot can be learned from that work. And it's also a very nice example of research that directly benefits Debian, because by using these large shared resources, which are standard these days if you think of cloud computing, he could
really bring down the build time of Debian from about a week to, I think it was, eight hours, a very small amount of time. The other papers basically apply research to Debian and then, in some way or another, contribute the results back. To not spend too much more time on that, I just invite you to look at these papers; the links are all in the PDF set I will be distributing. They are all very good examples of doing interesting research based on Debian and immediately feeding improvements back into Debian itself. As I said, I already apologized for listing five pages of references, but I really couldn't find a better way to introduce at least a representative sample of research about Debian.

So let me come to the second part, which will be more pictorial: the socio-technical analysis part. This is, as a shameless plug, going to focus a little on my own research. It's not so much about the actual papers; of course, lots of people have contributed to this research field, and I don't want to highlight my own research too much. I'd rather discuss the fact that software development, to a large extent, can be reduced to two problems, and I guess many of you know these two problems. Problem number one: software is about technology, and technology is hard. This is a pictorial representation of the Tower of Babel; you may be more likely to be aware of it if you come from a European background than a non-European one. That was an attempt by people in the early ages to build a tower that reaches up to heaven, up to God, a very massive engineering undertaking, and of course it failed: the tower broke down, people spread all across the world, spoke different languages, and so on. People in Taiwan, if I think of Taipei 101, of course do better these days,
but still, it's quite a way to go up to the sky and to heaven. So, problem number one: software is about technology. Problem number two: software is about people. And this image, I guess, even fewer of you will know. It's from a very famous black-and-white movie in Germany, basically about two people, a Catholic priest and the leader of a socialist party in post-war Italy, who want exactly the same thing, for the people to be happy and be nice to each other, but totally cannot agree on how to achieve these goals. I'm pretty sure you've never heard of this problem in Debian: you want the same thing, but there are multiple ways to reach it, and then you discuss ad infinitum how to get there. So, the second problem: it's about people.

And that's where the research field of socio-technical analysis comes in. The idea is to combine knowledge about both aspects, the social aspects and the technical aspects of software development, to arrive at a better software development methodology. And of course, nothing is better suited than open source software to get information on the social and the technical aspects of software development, because the information very often already comes in combined form. Social information is contained in many of the artifacts that we are creating in software development. As an example I'm taking a commit to a software project, but you could find similar things in many other artifacts that appear when you create distributions, when you do infrastructure work, and so on. The commit, as you see in the bottom part... you are all very familiar with commits, so we don't need to spend much time on that. The commit does not just contain the actual patch, the actual diff to the project that makes the technical change; it also contains lots of explicit and implicit social information. For instance, in Git you have an author of a commit and you have a committer of a commit, which creates a social relation between two
persons. In some projects you have these developer certificates of origin, for instance in the Linux kernel, that tell you who reviewed a patch, who acknowledged a patch, maybe even who was against a patch; these create effective social relations between persons. And the social relations come directly with the technical change the commit brings. So that's of course the ideal data source for considering socio-technical aspects of software development. The question is how we leverage these data to construct collaboration networks. Which of the social connections mentioned do we best use to come up with appropriate representations of the social structure of projects? How do we determine which developers are influential, which are central to the network and which are not? That's not meant in the sense that this should be used for finger pointing or for giving out badges to people, but in the sense of how stable our development networks are, how well structured they are, and so on. And how do we identify communities; can we arrive at conclusions on whether the community structures we find are good or bad for software development? Constructing these networks is a topic of its own. I'm not going to discuss any of these methods in detail; just observe that there are a lot of different possibilities to construct these networks, both for relations between people and for relations between software development artifacts, because of course not only people but also artifacts stand in some kind of relation to one another. Detecting communities, analyzing network properties, and so on is also quite a standard problem of science, and what surprised me most when I first did this is that there are quite some mathematical aspects to sociology: sociologists really came up with many approaches to find communities, to quantify these networks, to assign properties to these networks, and so on, that can be
readily used in the socio-technical analysis of software development.

So here's an example. If you just take the data that you have and try to find a collaboration network, for instance for the Linux kernel, the result is maybe not the easiest one to interpret. You need to bring that down to a more digestible form, a form that can be better comprehended, and that of course involves finding communities, finding sub-communities, in this large network of people. Unfortunately I don't have a picture ready for the Debian community; it could easily be created, but essentially it would look very much the same, and as such it carries no information. You need to boil it down to easier-to-digest units, and those are sub-graphs, clusters, sub-communities in the developer network that you can find with many, many existing tools.

Okay, just to give you an example, a little rationale that these approaches really work: QEMU. Many of you may know QEMU, the virtualization and system-level simulation software. That's a very nice example where you can really see how the mode of cooperation, how the interaction between developers, changes over time. QEMU 0.11 is a very ancient version; back then it was mostly run as a hobbyist project, created by Fabrice Bellard in the way he usually does things, creating them just out of boredom and to show that things can be done, like putting a Linux kernel in a JavaScript simulator in a browser, or writing a universal system simulation software that can support any hardware that you like, and so on. And you actually see that people got interested in the project. This is not the earliest phase of the project, but in this graph, where nodes represent developers and the size of a node represents the relative influence of a developer on the project, you see one developer with a really large node that basically
makes up the project, with some very small nodes attached to him. These are the contributors who give him patches, but at the end of the day it's the main developer who does the actual work. That of course is not a good structure if you want to rely on a project; consider the truck factor: what happens if the largest node is hit by a bus? Then you very likely will run into coordination and sustainability problems.

Moving on through QEMU's version history: at some point QEMU became one of the core components of virtualization systems like KVM and of commercial solutions, and then people really started caring about what would happen to the project if things went wrong, and so the development structure changed. QEMU 0.13 already had a more apt structure, and with QEMU 1.5.0, still an old version by today's standards but from the height of the cloud and virtualization era, you see that QEMU really evolved into a social collaboration structure that every textbook manager would like: you have people who are responsible for each sub-community, there is a manageable number of sub-communities, the communities are large but still not infinite, and so on. So what we do actually does make sense and can be objectified.

You have a question?

I'm trying to make sense of this graph visualization. Do the connections and arrows represent some relation of the nodes?

Yes, the nodes represent developers and the arrows represent cooperation relationships. For now just forget about the direction of the arrows: if there's an edge between two nodes, that means these two developers interact in one form or another.

How do you define the interaction?

As I've said, there are multiple ways to define interactions; in this case we're defining interaction as people who have contributed to a common software development
artifact, and the artifact in this case is either a function or a class. But there are other ways, like the commit-author relationship, or reviewing one another's patches, and so on; these essentially lead to the same results.

So the question was whether we used this or that theory that I don't know, and the simple answer was no. Could you repeat the name of the theory you mentioned?

It's Ganter and Wille; they have published, with Springer, a formal concept theory. You have a number of concepts and a number of properties, and you can connect these together and filter them according to a certain algorithm, so that you have a source and a sink, and then you can split things up into possible good choices, the best choices there are according to a certain filter. It's used for files and for many things; you can find it in Springer.

Okay, that's something we should look at, but the short answer is that we haven't used it for this work.

Having read so many of these papers, what is your sense of how often the data collected and analyzed in these projects is being used to support the Debian project?

That is a question that brings me to the end of the talk; I wanted to skip a few slides anyway, so that's a good opportunity. You can tell a lot about the networks that you find, about the properties of different types of networks, and about what these properties mean for the communities and for progressing with the communities. There's also lots of work being done to ensure that these networks are accurate. But given all that, the question, if I understood you correctly, is how the Debian community or other communities could benefit from this kind of work, and that's precisely one of the points why I'm presenting this here. As I said, we will need more of this information; we will need to be able to quantify the social aspects of a project when it comes
to safety-certifying or certifying other aspects of software projects.

I would like to disagree that you need more data to support Debian. What you need is to support Debian: if you have anything that can support Debian, you can use it to support Debian; you do not need more data to do that. So if these papers are providing information that helps us understand how this project could improve, it would be nice if there were an emphasis among academics that these are applied research papers, not just basic research, and that, yes, there is something to learn from Debian in terms of applying it elsewhere, but Debian needs that help.

Okay, I see. So basically what you're asking is why we don't give the data back to Debian to support Debian, and how to encourage that.

I am encouraging you to encourage others.

Okay. Actually, these are things we're trying very actively. Firstly, let me point out that all the analysis that I've presented here (I had to skip over a lot of it, of course) is based on open source software: we're publishing all our data and all our methodologies; it's on GitHub. There's no Debian package for it, because this whole thing is basically unpackageable, but that's just a detail problem. And that brings me to the thing that I would like to ask you at the end of the talk, as an outcome of it (as I said, I had to skip a lot of methodology): to learn what the actual questions are that you think make sense to be analyzed in this way. Are there any spots where you're unsure how to proceed best, where there are different approaches you could take, and where you could get answers by doing this type of analysis? But there's also a thing that we would need from Debian for that. As I said, you can argue that the relations you obtain from these results
are correct in a certain sense, but this certain sense is only statistical. Without going into the details, there are statistical techniques that you can use to validate the networks that you infer, but at the end of the day you need to check them against the knowledge of real people; you need to validate them against the knowledge of real people to ensure that you're not just analyzing noise. Because the problem we're dealing with here is that from any data you can come up with arbitrary connection graphs, arbitrary collaboration graphs. You can say committer-to-author is the social relation to go for; you can say analyzing cooperation at the file level is the way to go; you can say analyzing evolutionary dependencies is the way to go. Each will give you a graph, and on this graph you will run some clustering algorithm that will give you communities. But the problem is that a clustering algorithm will always give you communities, regardless of whether they represent reality or not, and that's the point where we need the input of actual developers, of actual people knowledgeable about the system.

Relating to the previous topic, I have one question that might be useful to us. I see that you, or your community, have analyzed several projects, like QEMU and LLVM, and I bet you've analyzed some big projects, small projects, successful and failed ones. Is it possible to predict, say, ten years from now, how Debian will look, given its present clustering or structure? Will it fail, or will it be a big success, or will it split, or something like that?

Well, these are of course very detailed questions, so it's only possible to predict the future in a limited way, but that's a field that we've been considering. There are actually (I didn't cover that slide) some patterns that you find in successful open source projects, in how they grow and how their network properties change over time, and that is to a certain extent an
indicator of how a project will do in the future, or of whether any corrections are needed to the way a project is run. For instance, LLVM, as you already mentioned, would be an example of a healthy and successful project that follows a typical pattern; other projects that may have slightly more trouble, like Node.js, don't follow this pattern, unlike projects such as the Linux kernel or LLVM. So yes, if you want to predict such things, then looking at these kinds of data would be the appropriate thing to do. And of course, feel free to come to me afterwards and tell me more about the specific aspects that you find worthy of analysis.

Hey, thanks. Following on from the theme of the previous questions about prediction and getting more data: as you said yourself, given any constant, static set of data, it's possible to analyze it to whatever degree you want and come up with whatever conclusions you want, by arbitrarily picking specific properties to look at. So I think asking for more data would basically repeat that same mistake of coming up with whatever conclusions you want, based on more and more data. For me, the scientific method really is to aim at being able to make predictions or prescriptions, to test the future, or to try to adapt or do something different in the future, and that's what actually tests whether a theory is true or not. You can't be sure whether a theory is true, even if it's a very sophisticated or convincing-sounding theory, unless you make predictions, or unless you make prescriptions for how we can improve ourselves, and have those predictions or prescriptions be justified in the future using unknown future data. So yeah, I guess I'm echoing the previous opinions that more data isn't necessary at this stage; what's needed is to actually convert the existing data that we have into actions or predictions.

So actually, I don't think I
was really asking for even more data, in the sense that we need more projects, more source files, or more communication. What we have right now in terms of communication data, projects, and so on is already very much, and quite sufficient for our purpose; if I said otherwise, then I said it wrong. What we actually need is verification, verification by experts, because the arbitrariness I was referring to comes from, say, clustering. Clustering will always produce clusters, and you can use statistics to verify that the clustering you found is either highly improbable, so likely not right, or highly probable; again, without going into details. But the real proof point, when you analyze people and their social interactions, is the actual people: if they say, okay, that's about how I perceive it, then your method at least is likely to be right.

But then this is subject to the biases that the experts have as well, and I guess by experts you mean people who are well embedded in the communities. I'm somewhat well embedded in the communities, so maybe I fit your definition of expert, but I wouldn't be confident in my own self-analysis of what reality is like. It would only be if I could make predictions, or if other people could give predictions to me, that I would be confident in these things.

Of course; I totally agree. Just taking time into account, I'd like to cut this off now, but rest assured that we have a whole lot of psychologists who take care of exactly these questions that you raised, and they are very good questions. At the end of the day, let me just close: I hope I've sufficiently raised your interest, or raised you to disagree with me, so that we will come, tomorrow, on Sunday, after this talk, whenever, to some discussions about the topics I've shown you. But with that, thank you very much, and I don't want to keep you from rockets any longer.
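To make the network-construction idea from the talk concrete, here is a minimal sketch of one of the approaches described: projecting a developer-artifact relation (who modified which function or class) onto a weighted developer-developer collaboration graph, and then using total edge weight as a crude proxy for the "node size" (relative influence) shown in the plots. All developer names and data below are hypothetical, and this is only an illustration under the talk's stated assumptions, not the actual research tooling (which is published separately on GitHub):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record of which developer touched which artifact
# (a function or class, following the definition of "interaction"
# given in the talk).
touched = {
    "cpu_exec": {"alice", "bob"},
    "tb_flush": {"alice", "bob", "carol"},
    "net_init": {"dave", "erin"},
}

def build_graph(artifact_devs):
    """Project the developer-artifact relation onto a weighted
    developer-developer graph: two developers are connected if they
    modified at least one common artifact; the edge weight counts
    how many artifacts they share."""
    weights = defaultdict(int)
    for devs in artifact_devs.values():
        for a, b in combinations(sorted(devs), 2):
            weights[(a, b)] += 1
    return dict(weights)

def degree_centrality(weights):
    """A crude proxy for a developer's relative influence:
    the total weight of all edges incident to that developer."""
    degree = defaultdict(int)
    for (a, b), w in weights.items():
        degree[a] += w
        degree[b] += w
    return dict(degree)

graph = build_graph(touched)
influence = degree_centrality(graph)
print(graph)      # e.g. ('alice', 'bob') has weight 2: two shared artifacts
print(influence)  # alice and bob come out as the most connected developers
```

As the talk stresses, the choice of relation (shared artifacts, committer-to-author, review tags) is one of several equally defensible options, and any clustering run on the resulting graph will always produce communities; validation against the perception of actual project members is what separates a meaningful structure from an artifact of the method.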