Good morning and thank you for being here. My name is Francisco Rivas. I work for a research group. We did an academic study on the Debian release. We have been doing this study since Hamm. So let's get started.
The schedule is: I'm going to do an introduction, a little review of the Debian project and especially of the release. I'm going to talk about the study: the specific goals, and some issues that we had to face with some packages and some things that were hard to study in some way. I'm going to describe the methodology and, of course, the results. Then some conclusions as a review of the presentation, and the future work. I'm going to describe the tools and, of course, there is the questions section.
As we are a group of academic researchers, we are very interested in understanding the evolution of free/libre/open source software. We have been doing some empirical and quantitative studies of free software, specifically using data mining techniques. We are involving some projects to study software at a larger scale. We want to understand how the free software communities are evolving. We think that Debian is a good reference. It's a good base to start our studies, to use it as a large compilation of free software, of course. As I said before, this is an update of a study that we have been doing since Hamm. We are also interested in understanding how Debian is growing and, of course, its community. One important thing is to understand how the project is coping with the increase in complexity and the large scale of the software. You can find some information about the group and some other research that we have done.
As a little overview, Debian supports several architectures. This number is under discussion: some people say some of them are architectures, others say not, so it's a reference number.
Debian is independent of any company, with about 1,500 dedicated and very good developers and maintainers. That number is an approximation because, depending on the study and on the attributes of each developer that you get from the Debian database, it can increase or decrease, for example if you take into account whether the developer's account is active or disabled. We think Debian is the only project that has a real commitment, through the Debian Social Contract and the Debian Free Software Guidelines. It's a kind of commitment to its users. There is the Debian Policy, of course, and a special interest in including major applications; that's one of the reasons that Debian is late to release some of its software. It uses the Linux kernel and supports other ones. The BSD kernel is important because later we are going to see how the community is working hard on the kFreeBSD kernel. Of course, the target of Debian is to build a complete release of free software according to the Debian Free Software Guidelines.
Those are the releases that we have been studying. About Lenny: it was released in February, and it has 12,000 source packages and 23,000 binary packages. It was released after 22 months of hard work, development and improvement. It includes OpenJDK 6 and has several images, one of them a Blu-ray image. It includes most of the stable and major software available. It's FHS 2.3 compliant and supports LSB software.
The specific goal of our study is to estimate the size of Debian from the point of view of source lines of code. It's a good measure to understand how Debian has been growing since the first releases. As I said, we measure the size of the distribution in physical source lines of code. For us, a physical line of code is a single line, no matter whether it contains several statements separated by commas or something like that.
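To make the idea of a physical source line concrete, here is a minimal sketch in Python. It is not the counter used in the study: it assumes a language whose comments start with `#`, and the function name `count_physical_sloc` is made up for this illustration. It only shows that a line holding several statements still counts as one.

```python
def count_physical_sloc(text: str, comment_prefix: str = "#") -> int:
    """Count physical source lines: lines that are neither blank nor comment-only.

    Simplifying assumption: comments are whole lines starting with
    `comment_prefix`. Real counters handle many languages and block comments.
    """
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(comment_prefix):
            continue  # comment-only line
        count += 1  # one physical line, however many statements it holds
    return count


# Hypothetical sample: the second line holds three statements but counts once.
SAMPLE = """# setup
a = 1; b = 2; c = 3

print(a, b, c)
"""

if __name__ == "__main__":
    print(count_physical_sloc(SAMPLE))
```

The sample has one comment line, one blank line and two code lines, so only the two code lines are counted.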
We identify the largest packages of the distribution. We identify the most used languages in those packages. We measure the modifications to the upstream packages. Actually, we measure the size of the source code, the size of the package with the patch, the developer's patch, applied, and the size without the debian directory. We are going to see that later. We do a comparison with previous releases and with other studies. Finally, we do a cost and effort estimation, applying a classical model called COCOMO. I'm going to explain that later.
Some issues that we had to face to do the study, because there are a lot of sources of imprecision and mistakes: duplicated code across packages, different releases of the same package, forked versions, of GCC for example. About patches, we took into account the fact that, for example, GNAT is a lot of patches over GCC, but in the end it's a totally different piece of software from GCC. So we made a list of purged software that we cannot take into account for the study because it duplicates code or something like that. We choose the most representative software for each of those duplicated packages.
About the methodology, in summary: we download the packages, we unpack them, we do the count (I'm going to explain later what the count means), and we do the sums and get the results of the study. The results are the physical source lines of code. This is the methodology explained in more detail. We download the Sources.gz and select the packages to be analyzed. There is the list of purged packages that are not going to be taken into account for the study. For each package, we download it, we extract the source directory, and we apply SLOCCount. Let me explain SLOCCount: it's a tool to analyze the source code of a package, giving information such as which languages were used to build the package, the number of lines of code, and an estimation of cost and effort.
After that first application of SLOCCount, we apply the patch, to understand the influence of the developer on the source code. We apply SLOCCount again. After that, we delete the debian directory to get the measure of the final package, and we apply SLOCCount once more. So we get three measures, one in each step. We store the information in some temporary files. Finally, we get the results, statistics and that stuff.
Speaking about the results: the upstream packages are 280 million source lines of code. The Debian source packages including the directory, the result of applying the patch, are 323 million source lines of code. That's the real measure of the release. The Debian source packages without that directory are 300 million source lines of code. Comparing it with the previous versions, Lenny is 12.78 times the size of the second release, which had 25 million source lines of code. It's interesting to see how Debian has been growing through the years. The difference between Etch, released two years ago, and Lenny is nearly 40 million lines of code.
This is the graph that shows the most used languages in the whole release. We can see that C is still the most used language to develop the software in Debian. In second place we have C++, and it's still growing. In cases such as, I don't know, Evans, they changed from C to C++. In third place we have Shell. One of the things that surprised us was the constant growth of Java.
This table shows each release of Debian, so I have the other ones. We can see how C is in first place in every single release of Debian. This number is the rank of C in each release, then the amount of source lines of code and its proportion with respect to the whole release. As you can see, although its source lines of code are increasing, its share is decreasing constantly. In Potato, in Woody, it's less each time. In Lenny, its lines of code are increasing but its share is decreasing, because other languages are increasing.
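The three measurements described in the methodology (count the upstream source, apply the Debian patch and count again, delete the debian directory and count a third time) can be sketched as follows. This is only an illustration under stated assumptions, not the scripts used in the study: `naive_sloc` is a stand-in that simply counts non-blank lines instead of calling SLOCCount, the patch step is passed in as a callable, and the package `hello-1.0` in `demo` is invented.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def naive_sloc(tree: Path) -> int:
    """Stand-in for SLOCCount (assumption): count non-blank lines in all files."""
    total = 0
    for path in sorted(tree.rglob("*")):
        if path.is_file():
            text = path.read_text(errors="replace")
            total += sum(1 for line in text.splitlines() if line.strip())
    return total


def measure_package(
    tree: Path,
    apply_debian_patch: Callable[[Path], None],
    count: Callable[[Path], int] = naive_sloc,
) -> Dict[str, int]:
    """Take the three measurements for one unpacked source package."""
    results = {"upstream": count(tree)}             # 1. pristine upstream source
    apply_debian_patch(tree)                        # apply the Debian patch
    results["patched_with_debian_dir"] = count(tree)     # 2. patched, debian/ present
    shutil.rmtree(tree / "debian", ignore_errors=True)   # delete the debian directory
    results["patched_without_debian_dir"] = count(tree)  # 3. patched, debian/ removed
    return results


def demo() -> Dict[str, int]:
    """Run the procedure on a tiny made-up package in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = Path(tmp) / "hello-1.0"
        tree.mkdir()
        (tree / "hello.c").write_text("int main(void)\n{\n    return 0;\n}\n")

        def fake_patch(t: Path) -> None:
            # Hypothetical patch: touches one upstream file and adds debian/.
            with (t / "hello.c").open("a") as fh:
                fh.write("/* patched */\n")
            (t / "debian").mkdir()
            (t / "debian" / "rules").write_text("#!/usr/bin/make -f\n%:\n\tdh $@\n")

        return measure_package(tree, fake_patch)


if __name__ == "__main__":
    print(demo())
```

Passing the counter and the patch step in as callables keeps the sketch runnable without any external tools; in a real run they would wrap the actual counting tool and the actual patch application.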
As you can see, Java starts in fifth place and it rises with each release. As you can see, it's now in fourth place. It's interesting. And finally, it has 15 million source lines of code in Lenny. Actually, one thing that surprised us was that it's above Python and Perl. Actually, above Lisp. If you look at the previous releases, you can see Lisp in fourth place. It's very interesting.
This is another graph that shows the decrease in the use of C and the increase of C++. Some comments about that: the use of C and Lisp is going down. It's interesting because although C is always in first place, its use is decreasing, maybe because other languages are used more or something like that. Java, Python and Perl: their use is increasing constantly. And almost six million source lines of code are written in Python and Perl.
About the packages, this is very interesting because if you look at the previous study, speaking about Etch, the OpenOffice package had more source lines of code, about four million more than the Linux kernel. But between 2.6.18 and this version of the kernel, the kernel added ten million lines of code. Now, that version of Linux kernel has 59 source lines of code. And the difference between the kernel and OpenOffice is about three million. As I said before, the kFreeBSD kernel is rising in the ranking. The community is working hard on it.
This graph shows, on a logarithmic scale, the distribution of the packages. On the X axis, we ordered the packages by size. The first one is the package with the largest number of lines of code. It's similar to the previous studies, but with fewer packages here. Some comments about the packages: kFreeBSD is moving up. As I said, in Etch it was in 11th position. As you can see, it's now in 9th position. The source lines of code per package decreased in Lenny from Etch, which had about 28,000 lines of code per package.
One interesting thing is, if you look at this table, most of the software in the top ten is end-user software or development software. It's interesting because it's the community effort. Actually, this position of OpenJDK can explain in some way the rise of Java in the ranking of the most used languages. Another very interesting thing is that the share of the top 100 packages is low: the contribution of the top 100 packages to the release is decreasing, from 64% in Hamm to 32% in Lenny. That means the community is working on other packages, not only on the same ones as in previous releases, so the contribution is homogeneous, I guess.
Speaking about the cost and effort estimation, we apply the COCOMO model. It's used in classical software engineering to estimate the cost and effort to develop proprietary software. It's hard to find a model to estimate effort and cost in free software. From this point of view, we can understand that if a team or an enterprise or a software factory decided to create a Linux release, the Debian 5 release, today, they would need an effort of nearly 84,000 person-years, and they would need nine years to develop the release from scratch, every single package. Actually, this number is interesting because it's more or less the time that Debian has had since the first releases. And it would cost approximately 6,000 million euros.
Comparing Debian with other operating systems, we can see that this release has a larger number of source lines of code than the others, which is very interesting because the difference is 4 millions.
Some conclusions, as a review of the slides. Lenny was released in February. It has 323 million source lines of code, an estimated cost of 6,000 million euros, and an estimated effort to develop it from scratch of nine years and 84,000 person-years. C, C++, Shell, Java, and Python are the most used languages. Actually, Debian represents the largest compilation of free software so far.
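As a rough illustration of the kind of estimate described above, the sketch below uses the Basic COCOMO equations in organic mode (effort in person-months = 2.4 × KSLOC^1.05, schedule in months = 2.5 × effort^0.38). The talk does not say which COCOMO variant or coefficients were applied, so these are an assumption, and the package sizes in the example are invented. It is not expected to reproduce the 84,000 person-year figure.

```python
from typing import Iterable, Tuple


def cocomo_basic_organic(sloc: int) -> Tuple[float, float]:
    """Basic COCOMO, organic mode (assumed variant).

    Returns (effort in person-months, schedule in months).
    """
    ksloc = sloc / 1000.0
    effort_pm = 2.4 * ksloc ** 1.05            # effort grows slightly faster than size
    schedule_months = 2.5 * effort_pm ** 0.38  # calendar time grows much more slowly
    return effort_pm, schedule_months


def total_effort_person_years(package_slocs: Iterable[int]) -> float:
    """Sum per-package effort estimates and convert person-months to person-years."""
    total_pm = sum(cocomo_basic_organic(sloc)[0] for sloc in package_slocs)
    return total_pm / 12.0


if __name__ == "__main__":
    # Hypothetical package sizes in source lines of code, not data from the study.
    sizes = [10_000, 50_000, 200_000]
    print(f"summed per package: {total_effort_person_years(sizes):.1f} person-years")
    whole_pm, _ = cocomo_basic_organic(sum(sizes))
    print(f"as one project:     {whole_pm / 12.0:.1f} person-years")
```

Because the exponent is greater than one, estimating each package separately and summing gives a smaller total than treating the whole distribution as a single project, so the way packages are grouped matters for the final figure.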
Speaking about the future work, we want to study in detail the evolution of the Debian project. We want to study the ownership and the licenses. It is a very interesting field. We want to study and analyze the volunteer activity, to understand why the number of developers is not growing constantly, or as we would expect with that big community around the project. We want to study the source code management systems, the mailing lists, the bug tracking systems. Actually, we have three tools, one for each of them. You can find more detailed information about the study at that address. You can find statistics for each package, a cost estimation for each package, and the previous studies.
Speaking about the tools that we used to do the study, we collected the data using a set of Python scripts that download the Sources.gz, extract it, select the packages, unpack them and apply SLOCCount in each step. You can get SLOCCount. The copyright analysis will be done, we hope soon, with a tool that we are developing now called Pyternity. We have tools to analyze other rich sources of information about the community. Actually, we have been doing some other studies on GNOME and KDE using those tools. This is our research group, or at least part of us. That's it. Please, questions.
Hi there. I think it's valuable that you're counting stuff up. For me, SLOC count is indicative of bloat. I was wondering, have you done stats on the Debian minimal system to see how it has grown over time? You know what I mean? One thing I participate in is another community called suckless.org. The aim of this community is to make the code better by making the lines of code fewer. I hope you concentrate on that sort of SLOC count angle, that more code isn't necessarily a good thing. It's usually a bad thing, actually.
Yes. It depends on the point of view.
You can see more people use development software for the distribution, but one of the things that we are very interested in is understanding whether that is good or bad or just different.
I can tell you lines of code is a bad thing. More lines of code is a bad thing. So it would be really good if you tracked minimal Debian, just to see over time how things have bloated, and targeted those sorts of things to make them better, hopefully by getting developers interested in cleaning up bloated software.
Yes. One of the things that I forgot to say is that these are preliminary results, and that is one of the reasons that we want to analyze that information in detail, deeply. Actually, one of the most important things is the sources of mistakes, imprecision and errors that we have to deal with. So that's totally correct.
I think merely saying that bloat is bad is slightly tricky with Debian, because we're adding more bits of software and you're always going to add lines of code for that. The number of lines of code in the minimal system is relevant. What I'm actually wondering, and what might be quite interesting to know: I saw you were taking the lines of code in the package just with upstream, and then with Debian, and then with the debian directory removed, which gives you the amount of patch, the extra lines of code we're providing. What might be interesting to see is how that changes over time, particularly as a ratio of the total amount of code that's in there. So obviously as we get more software we'll get more local patches, but does that grow or shrink as a percentage of the total amount of software? I think that's quite interesting.
Yes. We are taking that into account.
I'm interested in the social background and I've done some estimation of how big the community is, and it would be nice to see the numbers again. How many developers? Well, let's say 1,500. How many packages?
And then it's interesting to see the average number of packages a Debian developer maintains. And then I looked around a little bit, and I see that for every package it is estimated there is a group around it, not one Debian developer but 10 people working on the software and upstream. And then, to estimate how many people are involved in this complete process, it would be nice to have new numbers. Perhaps you can estimate it from your numbers. So it's interesting to compare it with what a company has to do to find all these testers, all these persons. This also says something about the quality you can deliver.
Yes, that's very interesting. So, excuse me, when you say the people involved in the whole process, what process are you talking about?
I would say this triangle, from all the people involved, then going down to packages, to maintainers. So how many people are involved, because this has something to do with, well, I'm a sociologist and I'm interested in getting free software into schools. This is my main purpose. During the project leading in Germany. It's important to see what political impact this has for the freedom of knowledge, freedom of access to knowledge, and all that is behind it, sometimes not spoken openly, that motivates people to do this job and do this work. This is of interest to me for political, ideological reasons.
One thing that could be said about the communities is that the licenses are not only a legal document but also a declaration of which community you are part of. You have the GPL community, the BSD community, the Perl community. Do you have any numbers on the relative importance of the different licenses in Debian?
No.
Whether they are growing or decreasing?
No, precisely that is part of the future work. Actually, we are writing a tool to do that: to extract the license and get some statistics on which kinds of license are most used, and that stuff. Yes, it's one of the most important things that we are doing now. Any other questions?
Okay, then it's time to speak again. Thank you.