The research group at the university (I'm also a member of it, but I'm not in the picture, it was taken before I joined) does research on libre software engineering: they try to study how libre software is developed and how that can be applied to other things that are not libre software. I'm mostly looking at how volunteer participation in open source projects works: how they manage to get things done, what the results come out like, and how that is evolving. There are also European-funded projects to do this kind of quantitative analysis of open source software, and one of the things the group did, which they became known for a few years ago, was a study counting Debian releases, that is, getting statistics about Debian releases. So this talk is about that. Basically, some of the things I'm going to talk about were not initially in the presentation. It's about some lies and some statistics, and there are also some nice graphs in the presentation. Actually, people that went to DebConf 5 (I don't know how many of you here went to that talk) may remember the people at the university gave the same talk about sarge. Yes, I think it was sarge. So it's kind of the same talk, updated for etch, with some trends and comparisons between sarge and etch, but with fresh data from etch. It tries to give an overview of how Debian has grown into etch, and maybe drop some questions about whether that is sustainable or not. Let me introduce myself, for those that don't know me. My name is Fernández-Sanguino, so it's a very long name; it's right there next to the other long name, this one. The people at the university are assistant professors, and they asked me to give this talk; I've been around for over eight years now. So, what's the summary of this talk? I'm going to show you some information they've gathered about Debian etch, the latest release: lines of code and other figures.
I'm also going to introduce how they do this stuff and how they produce these numbers, because there are some things that have to be taken into account with the methodology they use. It has some caveats, so you have to know them before you start telling me the numbers are lies, even though I already said they're going to be lies. So: some information about the release; lines of code in the release; the distribution of programming language use in the release, some of which might be new even to people that have been working with Debian for quite a long time; and our packages, that is, which packages are the biggest ones in the distribution. Also, there's some very nice information based on the COCOMO model, which tries to estimate, given the source lines of code, how much time somebody would take to do that project if they were to do it from scratch. We'll see that too. We'll see some comparisons with other operating systems and with previous releases of Debian, and we'll also see what this research group is going to do next in order to keep researching how Debian has evolved through time. So, to this first slide. This is intended for a more general audience; you all know what Debian is. One of the main things is that even though they're analyzing the release, there is also code that has not been released yet but is available in source form, including new kernels that might be released later. But for the purposes of this research, even though we work with the four different distributions (stable, testing, unstable and experimental), they only focus (we only focus; I don't know whether to say "they" or "we" in this presentation) on the stable release. So there's no data on how testing and unstable are going forward.
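To make the COCOMO estimate mentioned above concrete, here is a minimal sketch of the basic "organic mode" COCOMO equations. The exponents and coefficients are Boehm's classic constants, and the default salary and overhead figures are the ones David Wheeler's SLOCCount ships with; this is an illustration, not necessarily the exact parameters the group used:

```python
def cocomo_organic(sloc, salary=56286, overhead=2.4):
    """Basic 'organic mode' COCOMO estimate from physical SLOC.

    Returns effort (person-months), schedule (months), average team
    size, and a rough dollar cost, using Boehm's classic constants.
    """
    ksloc = sloc / 1000.0
    effort = 2.4 * ksloc ** 1.05           # person-months
    schedule = 2.5 * effort ** 0.38        # calendar months
    developers = effort / schedule         # average team size
    cost = effort / 12.0 * salary * overhead
    return effort, schedule, developers, cost

# a hypothetical 10,000-line package
effort, schedule, devs, cost = cocomo_organic(10_000)
```

Dividing effort by schedule is what gives the "how many people would you need" figure that comes up later in the talk.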
Actually, Enrico Zini was asking me if we could get numbers for the next release so we could add that data, and query information about the packages, so somebody could say "give me the packages that are over 800,000 lines of code" and see which those are, and run comparisons between the big packages, or between those that are mainly in Java and those mainly in C or C++. So we might do some analysis later for testing; not for this presentation, but it might be available in the future. The methodology the group uses for this analysis is basically to take the whole archive, unpack it, and run the analysis over it, with some additional work that has to be done in order to get meaningful results, like trying to distinguish packages that provide the same thing but maybe slightly different, such as different versions of the same program. Take GCC, of which we distribute versions 2.95, 3.3, 3.4, 4.1 and 4.2: all those different versions of essentially the same software should not get counted four or five times, which would make Debian look bigger than it actually is, because they are essentially the same software. So there's some work done beyond just downloading and running tools: after the tools are run, they try to distinguish which packages provide the same software and count only one of them, not all of them.
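The deduplication step described above could be sketched like this. This is a deliberately naive illustration with made-up SLOC figures; the group's actual heuristics involve manual review and file-level comparison:

```python
import re
from collections import defaultdict

def deduplicate(packages):
    """Collapse multiple versions of the same program (gcc-2.95,
    gcc-3.3, ...) into one representative before summing SLOC.

    `packages` maps source package name -> SLOC count.  This naive
    heuristic strips a trailing version from the name and keeps only
    the largest variant in each group.
    """
    groups = defaultdict(list)
    for name, sloc in packages.items():
        base = re.sub(r'[-.]?\d[\d.]*$', '', name)  # "gcc-3.3" -> "gcc"
        groups[base].append(sloc)
    return {base: max(slocs) for base, slocs in groups.items()}

# illustrative numbers only, not real counts
counts = {'gcc-2.95': 2_100_000, 'gcc-3.3': 3_400_000,
          'gcc-4.1': 4_000_000, 'perl': 600_000}
dedup = deduplicate(counts)
```

After deduplication only one GCC variant survives, so the total is no longer inflated by near-identical copies.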
Once it's been run through the whole archive, you get a result, and the result is the physical lines of code for all the software in the distribution and all the different packages in a given version. It's physical and not logical lines of code, because the tool used, which is SLOCCount, cannot count logical lines of code; it can only count physical lines. So if somebody writing C has put what is logically ten different lines of code on one single physical line, it only gets counted as one; that's a limitation of the tool being used. SLOCCount is also used to identify the different programming languages per package, to get information about which packages use which programming languages, maybe an overview of how many programming languages are used on average in the different packages, and also, as I said before, to identify files that might be the same across packages; SLOCCount provides that too. And this is also used to get some basic information on what Debian, or the software distributed in Debian, would cost if you developed it using traditional techniques. There's an issue with this approach, and it's basically that in Debian we don't all package software the same way, so people tend to develop their own tools to package software, and you get things like yada and CDBS, which kind of hinder this automatic analysis. I even had issues with them when I did the security audits with some other tools; I talked about this, I think, last year in Mexico. If you're using automatic tools and some packages are not easy to unpack down to the real source code, those packages are not going to be properly analyzed. How this is solved in this count (and you'll see there are some issues here) is by trying to find binary files within the source package and uncompressing them unconditionally. So: if you get a Debian source package that has a tar.gz file within it, then uncompress that and keep analyzing the data inside. From there on there's the issue, which has to be handled manually based on the data obtained, that some packages are actually the same software in different versions, and some software is reused in many packages; maybe instead of depending on an external library, the library has been embedded in the packages. I think this happens rather commonly, and I remember the case of the media decoders, libraries like ffmpeg, which are included in many different video players instead of being shared as one library. So that source code is replicated, duplicated actually, in many different places, and if you don't strip that out it's going to be counted four or five different times even though it's the same code. There's also some messier counting stuff, and you'll see why, because of the way we distribute packages; some of us are not free of blame, and I actually have one package like that myself, which I believe is scheduled to be killed for the next release, though CDBS might remain. When you have packages within packages it's difficult to determine how to get the real lines of code, and since we don't do it in a single uniform way, they had to develop different bits of code for all this. So, you'll see, this is the size of the data; those are the final numbers based on all the packages. That's 260 million source lines of code, taking into account only the upstream source code. If you also take into account the Debian source code, that adds 23 million source lines of code on top of the upstream figure. That's not only the packaging introduced by Debian: it also takes into account software that is only distributed by Debian, the source code of packages that originated in Debian and are only available in Debian, like maybe dpkg or apt.
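The "uncompress embedded archives unconditionally" step described above could be sketched as follows. This is a minimal, depth-limited illustration (archive names, the `.unpacked` destination convention, and the depth limit are my assumptions, not the group's actual tooling):

```python
import os
import tarfile

def unpack_nested(path, depth=0, max_depth=3):
    """Recursively extract tarballs found inside an unpacked source
    tree, so code shipped as an embedded .tar.gz still gets counted
    (depth-limited to avoid pathological nesting)."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(('.tar.gz', '.tgz', '.tar.bz2')):
                archive = os.path.join(root, name)
                dest = archive + '.unpacked'
                with tarfile.open(archive) as tar:
                    tar.extractall(dest)   # extract next to the archive
                if depth < max_depth:
                    unpack_nested(dest, depth + 1, max_depth)
```

After running this over an unpacked source package, a line counter such as SLOCCount would see the embedded sources as ordinary files.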
So, included in that count is the debian directory, and the counter for those source lines of code gives some 17 million source lines. It has not been investigated yet why this has increased: whether it is due to patches (maybe some source code has been heavily patched, and that adds a lot of source lines; I remember one package I maintain where there are a hundred revisions made by the Debian packaging on top of upstream, and that's a lot of lines of code) or whether it's because of the packages that are only Debian-specific. That hasn't been dug into yet. So, one question for the audience, because this has changed from sarge to etch: what do you believe are the five most used programming languages in Debian etch? We get C, that's easy. C++. Perl. No, that one is not in the top five. Shell. And one big surprise. What big surprise? Java. Those are counted across the different programming languages per package, and that's the distribution of languages in Debian. It's a very interesting thing that has changed a lot from sarge: Java, which was way down the list, has duplicated its number of lines of code from sarge to etch, and it's now way over Perl and over Python. And Python is one that has seen a lot of programming recently; many people have dropped Perl and started writing code just in Python. But the one that has jumped is Java. Are there conclusions you would like to draw from this about the verbosity of writing in each language? One interesting measurement would be the mean source lines of code per package, to determine whether that's higher in Java programs than in C or C++ programs; that could be done later, but it hasn't been yet. And I don't know if it's only three or four packages, maybe actually one package, that contributes a lot of that; I think OpenOffice.org, which has a lot of Java stuff. How much does it cost?
I don't know; we have all the data there if you want to see what you can find. About assembly: I can't give you the full list of packages off-hand, but at least John the Ripper, the password cracker, has assembly inside, and there are a lot of libraries that carry optimized routines in assembly for particular architectures; I don't remember which ones. PHP, for example, is not that high up in the list, and Tcl has been dropping for a while. One of the meaningful things here is that if you take this figure, the distribution of programming languages for etch, you can pair it with the one for sarge. OK, so: C++ is going up, C is going down, Java has jumped way up and has displaced Python and Perl, and Lisp has gone way down; actually it is no longer among the five most used programming languages in packages. If you want more data, all of this is available, so you can play with all of it. So Lisp not only dropped out of the top five: it was above position four before, and now it's actually number six, so it dropped from four to six, and from two or three percent down to about 0.7 percent. Now, the total number of lines of code has grown, so that doesn't mean there is less Lisp code than there was in sarge; it might be about the same amount, but there's much more code coming into etch in Java and C++. And there's a bias in some of these statistics, which you will see in a moment. So, as I said: C is being used less, C++ and Java are being used much more, and Java specifically has increased a lot. So I believe the security team should start learning Java, because if Java code hasn't reached them much yet, it probably will soon; I believe we didn't release etch with an official free Java, and once that is solved, even more Java stuff will get into the archive. Good. So, what do you believe are
the largest packages in Debian? And the second one? KDE? That's too many guesses. No? I thought it was the kernel. Yeah, that's what everyone says; I think you would say the kernel first. (It takes longer to build, for the next couple of months anyway.) So, yeah, you see, there it is over there, and that one is one of the things in this presentation that, as I've noticed, probably biases the results: ia32-libs. This has been counted, and (to someone in the audience) you're responsible for this. Of course, not anymore, but you're still in the Uploaders field. I was really surprised when I saw that package while reviewing the presentation; it's one of the things that biases some of the data, so we'll probably have to redo the numbers without it. Juan José, who produced this data, realized that this package includes a lot of libraries, among other things GCC with all its libraries. The tools took that package, unpacked it, unpacked all the sources inside, and counted it as one unique package, when it's actually a rehash of many different packages. So that 230-megabyte package probably has to be split out and removed, or at least partly discounted. In the list, the different GCC versions have supposedly been removed so as to count GCC only once, but actually it's being counted twice, because there's a copy inside number three; actually maybe four or five times, because I think that package includes gcc 3.3, gcc 4.1 and gcc 4.2, I swear. But the thing is, the top ten packages (and if you take the whole list, or even the top 100, it's basically the same) are either development tools or end-user software. Mozilla, Iceweasel, Icedove and Evolution are in the top 100 packages; those are the biggest pieces of software we have. K3b was on the list, but ia32-libs pushed it down; it's number
11, so it's actually number 10, and it should be up there on the list too. Which one? I really don't know, but it's really big; it was in the top ten. Isn't that the X sources? Question: why isn't the whole KDE desktop environment on the list? They also have big packages. Well, actually, one of the things that is not on this list but was on the sarge list is X, and that's because it has now been split up a lot; it was always on the list before, and now it isn't, purely because of the packaging. KDE is a number of discrete units distributed as separate tarballs upstream, so none of those makes the list individually; if you summed up the whole environment it might well get on the list, but the individual pieces don't. It's not listed as one product, it's listed as the packages upstream distributes. And if you look at some of the things on this list: VNC4 is on the list because it includes a copy of the X server; if it were broken out so it didn't need to ship that source, it would not be on the list. Eclipse includes a full copy of SWT/GTK that it distributes as part of the Eclipse source; same thing if that didn't need to be distributed as part of Eclipse. So one of the things under further work is trying to measure more precisely what the code duplication is, which could show that some of these packages are exactly that. But what is interesting here is the weight of the top 100 packages: in 3.0 the top 100 used to be about 65% of the source lines of the distribution, and now, even with these packages that are being counted twice or three times, the top 100 packages are only 34%, and that share keeps going down. That's because most of the archive is composed of very small packages: we get more packages in the release, they're very small, and they increase the overall lines of code. The top 10 remains the same, but they're not contributing that
much to the overall count; we'll see that later on. This is the logarithmic scale for all the package sizes: we have few packages over on this side, with over a million lines of code, and a lot of packages down at the bottom of the scale. That's the distribution, and the average is over here: the average size of a package in the archive is 28,000 source lines of code. This has been computed for all the releases and it has not changed much between them; the average has stayed around 20,000 to 30,000 lines of code from potato to etch. So there are very big packages, but there are also a lot of small packages. So, my next question, and you cannot get the answer right for this one... Sorry? (Something about monkeys in top packages: monkeys writing code, or copying it?) It's basically applying the standard COCOMO model. That's a standard model used in classical software engineering, and that's the reference the papers use: we take these data, and the papers say this is what Debian is worth. So what we tell the media is that Debian 4.0 would be worth 6.7 billion US dollars, or 5,000 million euros, based on the source lines of code. And if you used a traditional development model, it would take 9 years to develop. Not us, obviously; that does not include all the things we've actually done in Debian. If you tried to develop the whole operating system with those traditional methods, it would take 9 years and cost you that much. That's like one and a half times the current market capitalization of some well-known companies. (I like the schedule: it's nearly 9 years. How long has Debian been around? So we're about 3 years ahead of schedule.) How many people would have to be involved in developing the operating system? Well, you divide the effort by those 9 years of schedule to get the number of people you would need. How many people would be
hired for 9 years? You divide, and it comes to a few thousand; that's the number of people you would need to develop it. That's a very nice figure that might sound pleasant. It's also a way to show that these cost models are not necessarily a good way to determine the cost of software, but they are the classic ones in use. Is this really a lot? I mean, I think Google or Microsoft, with their cash, could recreate the whole thing in Debian six times over, just because they have the cash. It would still take 9 years, and be extremely expensive, to sell a crappy product. To give you a sense of scale, that's about 7% of HP's annual revenue last year. And again, that's development only: you'd still have marketing, sales, deployment, all the things that come on top, as you can imagine, and usually revenue won't come in during years 1 to 5; it depends on the size of what you do in a company. Also, this is interesting: this is a comparison of Debian with other operating systems. The source lines of code for the non-free operating systems are obviously not really known; they are estimates, somebody just says 30 million, or maybe somebody within Microsoft says it's this or that. But the good comparison is with other free operating systems, like Fedora, which you see over there, or OpenSolaris; the same kind of analysis has been done on those. You can see that, obviously, compared to the non-free systems Debian is much bigger, but it's also much bigger than the other free software projects. And are these the adjusted numbers, with the duplicated code excluded?
No, but even if you divide by two you still get higher numbers. If you say this is a gross estimate and divide it by two, you are still at two times Fedora Core, and Fedora Core has been counted using the same methodology. (Fedora Core only ships one copy of GCC, as far as I know; I don't think it's a feature of Debian that we ship four copies of GCC.) Those copies have been removed from the numbers, though not when they're inside ia32-libs. I mean, those numbers do try to remove duplicated packages: if you actually go to the list on the website, you will not find gcc 3.3 on it; those have been removed so as not to add to the total. (So actually there are more lines of code than that, and the duplicates have been subtracted from the whole? Yeah, that's the same thing. This is the number I want to see, because it gives a better idea of useful lines of code as opposed to raw lines.) I think that's very subjective. So, one thing you can see here, and I can show you here, is the comparison with other Debian releases. Even taking into account that this is a gross estimate, and that there are things that have been counted twice (and they haven't been counted twice in all releases), you see that the growth of etch versus hamm is something like 11 times, versus woody it's almost 3 times, and versus sarge it's only 1.2 times. So the last step is not that big; the very big change, and that was in the transition from 3.0 to 3.1, was from woody to sarge, which actually duplicated the source lines of code. And there's a question here: does the number of Debian developers grow as fast as the number of source lines of code? One of the nice graphs I had shows the growth of source lines of code and packages: you'll see the source packages and the source lines of code actually follow the same trend, and that's because most of the packages are very simple packages of around 20,000 to 30,000 source lines of code, so they grow together. There's actually a big change from woody to sarge,
which is OpenOffice.org, I believe. Was it introduced in sarge, or was it already in woody? (OpenOffice.org was first in sarge.) First in sarge, I believe; you can see that there. And you can see that even where some of the top-10 elements weren't in yet at that point, there's growth anyway. But you can see the number of developers since that release is not growing as fast, so we have a lot of code and not many developers. That's based on the data of the Debian project leader elections and on db.debian.org; the maintainer counts are not exactly the same thing. (There are two things that cause the shape of that line to change that I think are interesting. One is that it was after woody that we first started culling inactive maintainers, so what's interesting is that a horizontal line means that we have maintained or slightly increased the number of reasonably active developers, whereas prior to that the slope of the line was just how many people had accrued. The second interesting thing is that it's after woody that the concept of sponsored uploads really got started, so there's more work being done by people that actually gets into the archive without those people ever keeping a key.) One of the things they're working on a paper for is an analysis of volunteer work on Debian based on the changelogs and maintainer data, which might be useful to determine not only who is uploading but who is contributing, who is actually the maintainer of the package; the numbers for maintainers are different. And one of the nice things there is that they can analyze the time it takes for somebody to enter Debian and to leave Debian, which is on average something like six and a half years before they get burned out and they go. (Does that measure from when they enter the NM queue?) (I want to see those numbers adjusted for how long the person spent in NM; I want to argue for shortening the NM queue time based on whether they're
making uploads while in the NM queue; that would include them.) At the moment it's based on data up to sarge, I believe; I don't think etch has been included yet, and it's still being developed, but one of the things they're going to do is more analysis on the maintainer side. (The other thing I'd like to see on this graph: another thing that can be mined out of that is the frequency of uploads per package. I would like to see how that compares with this increasing trend in the number of packages in the archive. Where do we fall between these two lines in terms of how many uploads we see per package, as a metric of how well individual packages are being maintained? Is it keeping pace, or do we have lots of packages that aren't good because we're not cleaning up, because those packages aren't maintained?) Something like that was done; there's a paper in the references (you will see the reference to where the papers are) with an interesting analysis done between potato and sarge of the versions of the packages in the archive: which packages in sarge were exactly the same upstream version and package as before, which ones had changed, and which ones had been removed from the archive. That's something that can be redone in any case. So, just to summarize: one of the things that was done for sarge, and that we would like to do for etch, is an analysis of the authorship and the licenses of the archive. There's a tool already written for that, so you can run it over the archive and get how many files are GPL-licensed, how many are BSD-licensed, and who the copyright owner of each file is. Who is it? Is it the Free Software Foundation? Is it Sun? Is it HP?
Or is it an individual developer? That was done for sarge, and it's going to be updated for etch, and we'll probably do a more in-depth analysis. It was done from potato to sarge; it will probably be updated for etch, to analyze how the distribution has evolved and how it is really growing, and to couple that with information on volunteer activity. Because one of the things the research group has been doing is taking information from mailing lists, package uploads and CVS/SVN repositories in order, for example, to determine who the project leader of a project is, based on mail activity and the types of uploads, who is working on which specific parts of the code, and whether a time-based release model is appropriate or not (part of that was the study done for GNOME): whether, when you have time-based releases, work happens on the archive all the time and then you make a release, or whether all the work gets done at the end of the release cycle. That has been done for other projects, BSD and GNOME, and we'll probably try to do it for Debian, with the caveat that in Debian we don't have a single CVS or SVN we can point people at to analyze, because it's all distributed, and all we have is the package upload information, which is going to be used. But that's not full information: it doesn't show you developer activity in between uploads, whether somebody makes a lot of changes in one day, doesn't upload, forgets about the package for a week or a month, then comes back, makes a lot of changes and uploads, or whether he is working constantly on the package and then making a given upload at some point. So that is going to be done too, but it's not done yet. So, go to the debian-counting website (which will very likely move somewhere else), which has all the data for all the releases up to etch. I think the only thing missing there is the graphs, which are not being generated
right now for some of the releases, but it has all the information in this presentation and in previous presentations, for all the releases: package information, version information. Actually, I can show it to you here. This is that summary I gave you, and you have all the statistics per package, so you can go to the package information and see, for OpenOffice.org, how many lines of code it has and in which languages; you see that Java is 7% of OpenOffice.org over there, and maybe other parts have a lot of Java too. So you've got all the information for etch and for all the other releases; if you want to play with this data, it's all there. OK. So, to wrap up: that's the number of source lines of code in Debian etch, those are the most used languages, and Debian is still, compared to other distributions like Fedora Core or the BSDs, even with this gross counting methodology, the biggest of the free software distributions. If you want more information, these are the tools that have been used: SLOCCount, basically, by David Wheeler, and that's the one that is going to be used for the copyright analysis. I think the DPL mentioned this morning (oh no, that was on the mailing list) that he wanted some analysis of which copyrights and licenses are being used, GPL version 3 versus version 2 or later. There's a tool already that tries to get that information, so that could be used, and there are some other tools over at that site. All the papers are also available at that site, so you can get all the papers behind the charts, and when new papers are published with more information they will be available there too. So that's all from my side, if you have any questions. (I have a question; I don't know if it's been covered or not. Are you also trying to analyze the number of users, or is it just...?) No, just the software in the archive. There are other studies being done in
order to try to get some usage information from the packages, but even that doesn't represent all the users; actually, there's no information about the users that you can get right now. So this is basically the software in the archive, at the level of the archive: how much of it there is and how it is distributed. OK. (One thing: as things stand, there's no new machine installed that doesn't include security updates, so every machine out there is by default checking security.debian.org. Now, all of those machines download a full copy of the Release file each time, and they cache it locally. So, with a little bit of magic to try and dispel the intermittent squid proxies sitting between end machines and our security.debian.org machines, of which there are three, all we need to do is count the number of full downloads of that file between updates of that file to get the number of installed machines. So AJ and I had a look at this last night, and for the first two weeks of etch being out there were no updates to the Release file. From one machine that was in the pool of three machines, we estimated the total installs of etch during the first two weeks at around 100,000 systems. The longer the window between updates of the Release file, the more accurate that becomes. Also, we don't currently serve the magic headers you need to dispel squid or any other proxies, or NAT or whatever. Well, NAT you can get around: I'm not looking at source IP addresses, I'm looking at full downloads of that file, so multiple machines behind a NAT will each download a full copy of the Release file. But squid proxies will deflate the figure as they currently behave, so it's a minimum threshold. By adding certain headers, which I'm trying to see if I can convince them to do, we should be able to tell squid not to cache that file, or to force clients to re-fetch it even through squid. Given that the file is only 100 or 200 bytes, that's not really going to be a big
overhead for us, but we should be able to improve the accuracy of that number. I haven't seen today's etch numbers yet; it was only last night, and I don't have access to the security logs.) (Could you do the same for older releases?) (We were just looking at etch at that stage, but the main thing is that you've got to have a file which is being downloaded and not updated, forcing those clients to re-download that file; otherwise you start counting from zero again. And that was just one machine; I tripled it to get the 100,000 for the first two weeks, as a rough heuristic.) That's the only caveat I can think of; also, of course, some machines are installed without network access at all. And this is kind of like what we see: we have 100,000 downloads, and then somebody redistributes it further, like the many copies of KDE installed in Brazil. The thing with any of this counting-at-home is that something is always wrong; it's the same here. You have other ways of having centralized repositories, as schools do in large installations, where everything goes through one place; Extremadura is probably one of those examples, with their own pool, and maybe whole school buildings behind it. So it's a low number, it's true, given people bringing in their own pools, but everybody is told not to mirror security, so hopefully they don't. OK, any more questions? There were a lot of questions. Next topic.
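The install-counting heuristic discussed above (count complete downloads of the Release file between two updates of that file, skipping partial fetches and cache revalidations) could be sketched like this. The log format, path and byte sizes here are hypothetical, assumed for illustration:

```python
import re

# Hypothetical Apache-style access-log lines from a
# security.debian.org host (paths and sizes are made up).
LOG_RE = re.compile(
    r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d+) (?P<size>\d+|-)')

def count_full_downloads(lines, release_path, release_size):
    """Count complete downloads of the Release file: status 200 with
    the full byte size, so partial fetches and 304 revalidations are
    skipped.  Between two updates of the file, each full download
    roughly corresponds to one installed machine fetching it once."""
    count = 0
    for line in lines:
        m = LOG_RE.search(line)
        if (m and m.group('path') == release_path
                and m.group('status') == '200'
                and m.group('size') == str(release_size)):
            count += 1
    return count

sample = [
    '1.2.3.4 - - [15/Apr/2007:10:00:00 +0000] '
    '"GET /dists/etch/updates/Release HTTP/1.1" 200 22941',
    '5.6.7.8 - - [15/Apr/2007:10:00:05 +0000] '
    '"GET /dists/etch/updates/Release HTTP/1.1" 304 -',       # cache hit
    '9.9.9.9 - - [15/Apr/2007:10:00:09 +0000] '
    '"GET /dists/etch/updates/Release.gpg HTTP/1.1" 200 189',  # other file
    '1.2.3.4 - - [15/Apr/2007:10:01:00 +0000] '
    '"GET /dists/etch/updates/Release HTTP/1.1" 200 1024',     # partial
]
installs = count_full_downloads(sample, '/dists/etch/updates/Release', 22941)
```

As noted in the discussion, proxies that serve the file from cache deflate this count, so it is a lower bound rather than an exact number of machines.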