Good, shall I start? So, hi everyone. Don't worry, there will be no free software politics in this talk; that was for the opening talk. This will be a pretty standard technical talk on a tool and a piece of Debian infrastructure I've been working on, which is called Debsources (that's the software that implements it) and whose main instance is at sources.debian.net. If you've never used it, or if you want to check the things I'm saying live during the talk, I encourage you to just go to sources.debian.net and look at the stuff I'm talking about.

In this talk I will give you an overview of why I created Debsources and why it exists, a bit of a feature tour (a kind of cheat sheet of how to use it), some technical details for the people who might be interested in hacking on it and helping me maintain the tool, and the roadmap of what is coming up.

Overview: in a nutshell, Debsources is essentially a web app that allows you to browse all of Debian's source code via the web. It's a source code browser and highlighter in which you can pinpoint every single file contained in a Debian source package and see what's in it. The main instance is the one I mentioned. The idea is pretty simple, but it seems to be very useful. I think it's useful for us Debian people: when chatting about a specific bug and specific lines of a source file in a package, it's nice to have a URL you can point to, and it's easier than saying "you need to download this package, open this file", and so on and so forth. I think it's also useful for the broader free software ecosystem, to be able to check what's actually being built in Debian: what's in there, what the Debian people are doing. It poses some system-level challenges to get right, because it's a fairly big amount of data; I'll get into that in a minute. And, in my opinion, it is possibly the highest abstraction layer at which we can show the source code contained in Debian. If you look at what developers use to develop packages, for instance version control systems, it's fairly difficult to find a common workflow or even a common branch structure (it's something we've been discussing these very days on debian-devel), so it's fairly difficult to offer a common view over different source packages. I think this is the best we can do right now to expose the Debian source code to people around the world.

Before going further, a few acknowledgments. I need to thank my employer, IRILL, which sponsored the initial development of the web UI, done as an internship by Matthieu Caneill, which is sponsoring the hardware and hosting for the current infrastructure, and which is also quite happy to have me working on this in my famous spare time. There are other contributors, whom you will find in the contributors file, and I hope that by the end of this talk you will be interested in adding your name to that list.

Now, the research motivation; that was the initial reason why I developed Debsources. Essentially, what we wanted to do is run some static analysis on all the packages contained in the Debian archive.
We wanted to see what the trends are in the bugs that can be found by static analysis tools, and then analyze them. There are some tools we are developing ourselves, like Coccinelle, which you might have heard of: it's essentially a sort of grep and sed that understands the semantics of the C language, and it can be used to describe bug patterns and even automatically generate patches for specific bugs. It's being used by the Linux kernel and it's very cool stuff. There are other tools built on the LLVM toolchain, and in general there are a lot of static analysis tools you can run on a huge amount of available source code. Such a platform should keep up with Debian uploads, so that as soon as there is a new upload all the checks are re-run. It should be well integrated with the rest of the Debian infrastructure, and ideally it should have some layer of community review: when you run static analysis tools you will find plenty of false positives and false negatives, and you shouldn't automatically submit bugs to the people who maintain the software, because they won't be happy about the false positives. What you really want is a community of people who go through the issues and say "this one looks like a real issue, let's submit it", or "no, this is not a real issue, let's ignore it and do something to avoid it popping up again in the future".

So that was our initial motivation, and to do it we did not want to implement one single thing that would do all of this, but rather have some sort of Unix-like architecture, with small parts that each do a single thing and do it well. What we wanted to have is a network (essentially a rebuild or static analysis network that keeps up with Debian uploads), a web app to browse the results, and a web app to browse specific source code, pointing at the lines affected by the issues. What we ended up having is exactly that, and Debsources is just the last part of it: it is only the source code browser, with the ability to attach specific messages to specific lines. Then there are the other pieces: one, developed by Matthieu during his internship, is a web app to show the results of static analysis tools, based on a format called Firehose; and then there is Debile, which has been developed by Paul and other people, and which is the actual build network that keeps up with uploads. I guess I'm at fault for all three names, so if you don't like them feel free to blame me, but this is the kind of architecture we ended up with, and Debsources is just the last part.

So, what can you do with Debsources? This is the main instance, which you might have used already; it offers you various ways to navigate through all the source code that is in there. Feel free to play with it during this talk. The main thing you have is package browsing: you can browse packages by letter prefix, the usual stuff. Note that the names are, by default, source package names; so if you're looking for a binary package and you don't find it, that's normal: you need to use the source package name, or alternatively submit a patch for doing the redirection automatically. When you have found the source package you're interested in, you choose a version among all the available versions and look at what's inside.
When you arrive at a specific version of a package, what you find is essentially the content of the source package as obtained with dpkg-source -x. That means there is some lack of uniformity: for instance, patches may be applied or not, depending on the patch system the package uses. With recent packages that use the 3.0 (quilt) source format, what you end up seeing is the source tree with the patches already applied; if you're looking at older packages, or at packages that apply patches at build time, you will find the patches not applied. This kind of non-uniformity is difficult to get around, so it is exposed to you as it is.

What you get is HTML syntax highlighting: depending on the source file you're looking at, Debsources will recognize the language and try to do syntax highlighting in your browser. It's all client-side: everything is delivered to your browser, and a JavaScript toolkit does the highlighting there. There is file type detection based on both the extension of the file and the shebang line, if one exists, meaning that we will recognize Perl files even if the file name does not end in .pl. This is thanks to the Geany people: Geany is an IDE, they shared their detection rules with us, we adopted them, and for the most part they work very well.

Beyond that, we have various kinds of searches. There is package name search with substring matching, which is pretty handy if you don't remember the exact package name. We have the SHA-256 checksums of all the source files we index, so if you know the checksum of a file you want to look up, you can do that. This is also used for duplicate detection: whenever you are looking at a single file, Debsources will tell you the number of duplicates of that file that exist in the whole corpus. And finally, we run ctags on every file we have in the database, so you can do searches like "tell me which files define a function called printf"; that would be a pretty bad query, actually, because for something like print you will find many, many files defining a function with that name, so you end up with a lot of packages.

Those are the searches we have integrated in-house, and then there is a very cool integration with a service which is not provided by us: codesearch.debian.net, maintained by Michael Stapelberg. What he is doing is full-text search over source code: he does full-text indexing of huge amounts of source code and lets you search it. It's the tool that has been used to find out how many packages in Debian contain the "do no evil" license clause, and to file the corresponding RC bugs. There is a caveat here, because Debian Code Search is not capable of indexing all the source code we have in Debsources: it essentially only indexes unstable, and it's updated, I think, once per week, so there is no guarantee that everything which is in Debsources can be found via Debian Code Search. But it's pretty cool, and the integration is nice as well: there is a search form for source code on sources.debian.net which uses codesearch.debian.net to do the search, and the results you get back from codesearch.debian.net point back to Debsources to show them in the usual interface.
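To illustrate the extension-plus-shebang idea, here is a minimal sketch in Python; the function name and the mapping tables are made up for illustration, and the real Debsources code relies on the Geany rules mentioned above rather than on hand-written tables like these.

    import os
    import re

    # Illustrative tables only, not Debsources' actual rules: map file
    # extensions and interpreter names to a language label.
    EXTENSION_MAP = {".c": "c", ".py": "python", ".pl": "perl", ".sh": "shell"}
    INTERPRETER_MAP = {"python": "python", "python3": "python",
                       "perl": "perl", "sh": "shell", "bash": "shell"}

    def detect_language(path):
        """Guess a file's language from its extension, falling back to the
        shebang line when the extension is not recognized (e.g. a Perl
        script whose name does not end in .pl)."""
        ext = os.path.splitext(path)[1].lower()
        if ext in EXTENSION_MAP:
            return EXTENSION_MAP[ext]
        try:
            with open(path, "rb") as f:
                first = f.readline(200).decode("utf-8", errors="replace")
        except OSError:
            return None
        m = re.match(r"^#!\s*(\S+)(?:\s+(\S+))?", first)
        if not m:
            return None
        interp = os.path.basename(m.group(1))
        if interp == "env" and m.group(2):
            # handle "#!/usr/bin/env perl"
            interp = os.path.basename(m.group(2))
        return INTERPRETER_MAP.get(interp)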
Given that Debsources was meant to be used in collaboration with other tools, there is some sort of API for working with others. A first kind of API is a URL scheme which is guaranteed to be predictable: the URLs will always be produced this way, so you can point to a specific file in a specific version of a specific package, and this is the URL scheme we are using now. You can point to a specific line with a #L37 anchor. You can highlight line ranges, so you can add a parameter to the URL and it will apply highlighting to individual lines if you want to talk about them, and you can add pop-up messages: you can pass a parameter saying, roughly, "message for line 22: blah blah blah", and it will show a pop-up message in your browser at that specific line. This is what we intend to use for errors: a static analysis tool returns specific errors associated with specific lines, and the idea is that, to point into the sources, it will use something like this. There is also a specific URL for iframe embedding, so if you are developing another web application in which you want some highlighted source code that is in Debian, there is a URL (which I'm not showing here, not sure why) that will give you content you can embed as an iframe in your application. Everything is documented here, so I won't go into more detail.

There is also a JSON API: essentially everything you can do as a user while browsing, you can also do with the JSON API, so you can check which versions of a package are available, retrieve the source code of a file, and so on and so forth. If you need to integrate this into something which is not a web thingy, you can use the JSON API to do the same. This too is documented at the URL on this slide.
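As a concrete, hedged illustration of both mechanisms, here is a small Python sketch. It assumes the /src/<package>/<version>/<path> URL layout described above and an /api/src/<package>/ JSON endpoint on the main instance; check the documented API for the authoritative paths, and for the exact names of the highlighting and message query parameters, which are deliberately not hard-coded here.

    import json
    import urllib.request

    BASE = "https://sources.debian.net"  # the main instance

    def file_url(package, version, path, line=None):
        """Build a predictable URL pointing at a file (and optionally a
        line) in a given version of a source package, following the
        /src/... scheme described in the talk."""
        url = "{}/src/{}/{}/{}/".format(BASE, package, version, path)
        if line is not None:
            url += "#L{}".format(line)   # jump to a specific line
        return url

    def api_get(path):
        """Fetch a JSON document from the Debsources JSON API (endpoint
        layout assumed here; see the API documentation)."""
        with urllib.request.urlopen(BASE + path) as resp:
            return json.load(resp)

    # Example: a URL pointing at line 37 of a file.
    print(file_url("hello", "2.9-1", "src/hello.c", line=37))

    # Example: list the available versions of a source package (assumed
    # endpoint; uncomment to try it against the live instance).
    # print(api_get("/api/src/hello/"))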
In terms of coverage, we essentially have two parts. There is a live archive, the part keeping up with what is live in Debian, meaning what is shipped by the usual mirror network, and what we cover right now is essentially everything on the official mirror network: that means you have everything from oldstable to experimental. Given that in the last few years we have migrated some stuff in and out of the main archive, there are some weird glitches. For instance, with backports: wheezy-backports is in, because it's the first version of the backports archive which is shipped by the main mirror network, but squeeze-backports is not there, because it was a separate archive. Also, we only follow a single mirror network for now, so you will not find security.debian.org in there, and you will not find any derivatives; that is on the roadmap to be fixed.

There is garbage collection: we cannot afford, for now (this might change soon), to keep all versions of all packages that ever existed in Debian, so we are not as complete as snapshot.debian.org could be. What happens is that when a package expires, meaning when it disappears from the Debian mirror, we wait for 14 days, to avoid creating stale URLs immediately, and then we remove it from sources.debian.net. So the URLs I've shown before are not guaranteed to be stable, nor guaranteed to exist forever: when a package disappears from the archive, at some point it will disappear from sources.debian.net as well. Hopefully the delay allows you to catch up with whatever indexing of the Debian archive you are doing yourself.

Updates are pushed, coming from a tier-1 Debian mirror, which means the lag is minimized: as soon as there is a push of an update to the Debian archive, Debsources is informed. A usual update run takes about 30 minutes; those are the good ones. The bad ones, when we have to index the Linux kernel, Chromium, LibreOffice and so on, can take up to a few hours, but usually we are well able to complete before the next push run.
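To make the garbage-collection behaviour concrete, here is a minimal sketch of the 14-day grace period logic just described; the function, the data shapes and the constant name are hypothetical, made up for illustration rather than taken from the actual updater code.

    import datetime

    GRACE_PERIOD = datetime.timedelta(days=14)  # as described in the talk

    def packages_to_remove(db_packages, mirror_packages, now=None):
        """db_packages maps (name, version) pairs known to the database to
        the time they were last seen on the mirror; mirror_packages is the
        set of (name, version) pairs currently shipped by the mirror
        network.  Return the pairs that disappeared more than GRACE_PERIOD
        ago and can therefore be garbage-collected."""
        now = now or datetime.datetime.utcnow()
        expired = []
        for (name, version), last_seen in db_packages.items():
            if (name, version) in mirror_packages:
                continue  # still shipped, keep it
            if now - last_seen > GRACE_PERIOD:
                expired.append((name, version))
        return expired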
So that was the live archive; then there is the historical archive. What I did, I think last summer (yes, I think it was last summer), was to work on injecting all the historical releases of the Debian archive. I went to archive.debian.org, which, in case you are not familiar with it, is a kind of mirror that keeps all the old Debian releases which are no longer on the regular Debian mirrors, and injected all of them into Debsources. That's a kind of toy for statistics geeks: if you want to do things like monitoring what the most popular programming language in Debian has been from the end of the '90s up to today, you can do that kind of investigation. For instance (I'm not sure exactly what I'm showing here), this is the number of lines of code of C, C++, Java, XML, Shell and Python, which are the top languages over time, and here you should see how they evolve. More interesting is the relative distribution of languages in Debian over time: here you see that at the beginning we had C at about 70% of the archive, then it declined, and it might be going up again now; the language growing here is actually C++, which kept increasing. You don't see much in this plot (maybe I should have used a log scale), but all this data is available, you can check it on the website under "stats", and it's kind of fun to play with.

Adoption: given that the original motivation for me was research, I'm happy to say it has started to be adopted in the research world. With Matthieu we've published a paper at a quite important conference on software engineering and measurement, and the idea is that we are offering both Debsources the software and the dataset we obtained with it as a kind of toolkit for people interested in looking at the evolution of free software over a long period of time, and in extracting the data and doing whatever they want with it. We've also been able to essentially replicate one of the major studies in the area using this dataset, confirming most of their results and finding some quirks in what they did at the time; the paper is available on my home page in case you're interested in looking at it.

Adoption in Debian: I've already talked about the integration with Code Search, and I was really happy about that. It has also been integrated in the PTS, thanks to Paul Wise: you might have noticed that, associated with every package in the PTS, you now have a "browse source code" link which brings you to the source code of that package, and also a source search form in which you can search within that package; so if you want to find a specific snippet of code in a specific package, from the PTS you now have a way to do that directly. Of course, if you want to integrate other Debian services with it, you are more than welcome to come and talk to me; I'd be happy to help you do it.

Adoption in general: the reception seems to have been quite good. LWN talked about it, there have been posts on the official Debian blog (I was really happy about that), and traffic is slowly increasing; we are at about 3,000 requests per day. My feeling is that people were generally quite annoyed at having to do apt-get source, or maybe debcheckout, before being able to look at a single line of code in a specific Debian package, so even if the idea is essentially straightforward, this service fills something that had been missing in our infrastructure for quite a while.
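As an aside on the historical language statistics just mentioned: since the per-language line counts are in the database, plots like the ones on this slide boil down to a simple aggregate query. Here is a sketch; the table and column names ("sloccount", "release", "language", "sloc") are assumptions for illustration, not the actual Debsources schema.

    import sqlite3  # stand-in for illustration; the real instance uses PostgreSQL

    def language_shares(conn):
        """Return (release, language, share) triples, i.e. the relative
        share of each language per Debian release, from sloccount-style
        data (hypothetical table layout)."""
        rows = conn.execute(
            """SELECT release, language, SUM(sloc) AS sloc
                 FROM sloccount GROUP BY release, language"""
        ).fetchall()
        totals = {}
        for release, _language, sloc in rows:
            totals[release] = totals.get(release, 0) + sloc
        return [(release, language, sloc / totals[release])
                for release, language, sloc in rows]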
Now for the technical bits, for people who might be interested in contributing. This is the architecture; I had to draw it for the paper we submitted, which is why I spent some time making such a big picture. As the backend we use the Debian mirror network and archive.debian.org. We mirror them locally, in different ways, because you can mirror the main archive using debmirror, but you cannot mirror the historical archive that way, given that the archive format evolved over time; that part is plain old rsync. So locally we have a big local mirror. We have two kinds of triggers that start an update run: either cron, if you have not connected your Debsources instance to a Debian mirror, or, if you have, the mirror itself triggers the update. The Debsources updater then goes on: first it updates the mirrors, then it extracts the packages and their metadata, filling a database, and then it runs several plugins that are in charge of doing all sorts of indexing on the source code which has just been extracted. On top of this we have several interfaces. The main one, which we have seen, is the Debsources web application, which you can use via HTML or JavaScript, and in addition to that you have various kinds of APIs: of course the JSON API I've told you about, and in theory we could also open up SQL queries. Right now those are not open to the public, because with SQL it's too easy to DoS the system, but if you are interested in running specific queries, I can give you access and we can work something out.

We have various plugins. This is an excerpt of the database schema, and here are some numbers, if you are a geek for that kind of stuff: we have 16 Debian releases over time, about 30,000 package names, and 83,000 source packages, which means 83,000 versions of source packages over time. These are the plugins we run on every single package that gets extracted: a disk usage plugin, which is essentially the hello-world plugin showing how to create one; sloccount, to count the lines of code for every language in the package; checksums, the plugin that computes the SHA-256 of every single source file; and ctags. These are the numbers of rows we have; the biggest table is the ctags one, with about 360 million rows. For database geeks that might not be that big (I've spoken with the PostgreSQL maintainer in Debian, and this is the kind of database they use for testing), but for me that was a pretty substantial database.

Disk usage, in case you wonder how much it takes to host all of this infrastructure: right now we have several hundred gigabytes of unpacked sources (there is no deduplication on this, I will come to that in a minute); the PostgreSQL database is 100 gigabytes, and I think more than half of it is actually indexes to make queries feasible, so the real data there is about 40 gigabytes; and the source mirror is, in the end, the smallest thing we have, at 70 gigabytes. In total we are well under a terabyte of data, which is not that big by today's requirements for hosting a significant service. This is the evolution of disk usage over time, and the peak here is when I injected all the historical releases; note that the axis does not start at zero, it starts at 3.5, so it's not like we went up tenfold when the historical releases were injected.
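To give a flavour of the plugin mechanism just described (every plugin is run on each extracted source package and its results end up in the database), here is a minimal hypothetical plugin in the spirit of the disk usage one; the hook name, its signature and the returned shape are made up for illustration, and the real plugin interface is the one in the Debsources source tree.

    import os

    def add_package(pkg_dir, pkg_metadata):
        """Hypothetical plugin hook: receive the directory of an unpacked
        source package plus a metadata dict, and return the values to be
        stored in the database (here: total on-disk size of the unpacked
        sources, as a stand-in for the real disk usage plugin)."""
        total = 0
        for root, _dirs, files in os.walk(pkg_dir):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # e.g. dangling symlinks
        return {"package": pkg_metadata["name"],
                "version": pkg_metadata["version"],
                "disk_usage_bytes": total}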
In case you want to have a look at the code, the technology we use is pretty straightforward. We use debmirror for the mirroring part. The database is PostgreSQL; version 9.1 or later is required (I don't remember exactly why), and it works fine on a stable machine. Python is the implementation language of choice, and the main technology used on top of Python for the infrastructure part is SQLAlchemy, so there are nice models you can use to work with the database, and it's working pretty well for us. At every update run, as I already went through, after updating the source mirror we unpack all the new packages, we garbage-collect all the packages that have disappeared, and we update the stats, which essentially means running all the plugins. Possibly the trickiest part is that we run fairly large and nested SQLAlchemy transactions, which, surprisingly, works pretty well. That is about the most difficult logic you will find in the code: those nested transactions are mostly hidden from you, there is no explicit "begin transaction", but the SQLAlchemy code can become a bit tricky.

The web app is straightforward as well: it's Python, with Flask as the toolkit, and the other big component we use is highlight.js, to do the syntax highlighting in the user's browser. We do not use highlight.js's automatic language detection, which is supposed to be one of its major features, because highlight.js was really meant for including code snippets in blog posts: there the part on which you have to do automatic detection is fairly small, and it works pretty well, but it turns out that when you show entire source code files on the web, automatic language detection does not work that well. So what we do is use the Geany conventions I mentioned before. If you are missing source code highlighting for your favourite language, the first thing to do is to go and add support for it to highlight.js; I've asked people in the past to do that for specific languages, like Scilab (I don't remember what else it was), and that's the place where you need to add support in order to have Debsources use it.

What else? The roadmap. There is a bugs file; I still have to migrate all the bugs to qa.debian.org, and I plan to do that during the conference. There are some low-hanging fruits, which are the parts you might want to start looking at in case you want to contribute. One is that we've made all sorts of fancy stats for the paper we published, using matplotlib, but they are not live: we produced them for the paper, but they are not shown in the web interface, so it would be nice to have the same kind of stats generated live and updated at every update run. It would be nice to have file name search; that's not implemented yet, but it's straightforward, because we already have the table with all the file names, and there is some interesting PostgreSQL work to be done here: there are specific kinds of indexes which work well on file names, so if you are a Postgres geek this might be something interesting for you to play with. As I mentioned before, we do not yet have any kind of redirection from binary packages to source packages; that would be nice to have, because many other Debian services have it. Essentially we still lack the injection of binary package names into the database, but the database structure already supports that, so again this should be a fairly easy hack.
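Since the nested-transaction remark above is a bit abstract, here is a minimal sketch of the pattern, assuming a SQLAlchemy session and a hypothetical per-package processing function; it illustrates savepoint-style nesting, it is not the actual updater code.

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine("postgresql:///debsources")  # assumed DSN
    Session = sessionmaker(bind=engine)

    def update_run(new_packages, process_package):
        """Process each new package inside a nested transaction (a
        SAVEPOINT), so that one broken package can be rolled back without
        aborting the whole update run; the outer transaction is committed
        once at the end."""
        session = Session()
        try:
            for pkg in new_packages:
                try:
                    with session.begin_nested():   # SAVEPOINT per package
                        process_package(session, pkg)
                except Exception as exc:
                    # the savepoint has been rolled back; log and move on
                    print("skipping {}: {}".format(pkg, exc))
            session.commit()                       # outer transaction
        finally:
            session.close()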
We do not support tarball-in-tarball: you know, those horrible Debian packages where you do the dpkg-source extraction and all you find is a debian directory and a tar.gz. That's fairly annoying to deal with. To some extent I really don't want to add support for it (if I can add another incentive for tarball-in-tarball packages to die, I'm happy to do so), but to some other extent, given that we have all the historical releases, it would be nice to be able to point to a source code file in, say, an old version of bash, and that would require it. So it's not too difficult to add, but it requires some work. We have a test suite, which is quite an interesting thing to work on in this case, because wrapping all the real work that happens into some virtual environment for testing can be challenging, and we do not have 100% test coverage, so if you are a testing geek who likes doing this kind of stuff in Python, you're welcome to give a hand.

Something more substantial: we do need multi-archive support, mainly for security, because it's really annoying not to be able to see the kind of software we are shipping to users when we have released security updates; that would be interesting to have. And file-level deduplication: I was curious about how much disk space we could save if we actually deduplicated at the file level. Right now there is no deduplication at all, so different versions of the same package occupy essentially twice the space, whereas in most cases a subsequent version of a package shares a lot of files with the previous one. Given that we have the checksums, we already have the information needed to estimate this. This figure is over the whole historical archive; I would have expected the saving to be higher, I guess, but this is what we get: if we did the deduplication right now, we would essentially need only the amount of space shown here for the unpacked source part.

From the audience: could you index snapshot.debian.org? Yes, I think so. And even more interestingly (I think pabs is not here, but) pabs has proposed to actually inject derivatives into sources.debian.net: taking Ubuntu, and all the other derivatives we have in the derivatives census, and injecting them as well. I think that would be feasible, because the amount of overlap that derivatives have with Debian is very, very high; Ubuntu is probably the one that has drifted the most, and I think it has 10% or 15% of packages which are really different, while the others are pretty much based on the Debian ones.

Tom asks about diffs between package versions: yes, that is a feature we have on the to-do list, but the point is that what we can easily do is diffs on demand, that is, an interface in which you choose one version of a package, then another version, and ask "show me the debdiff between these two versions"; that is feasible. What I think is not feasible is pre-computing all the diffs, because you have all the pairs of versions and that explodes fairly quickly. So yes, that's something we want to do, and it's absolutely feasible. Regarding derivatives, they are already doing diffs with respect to Debian: if you look at the derivatives census, I think they are already periodically computing diffs and keeping an index of all the diffs between Debian and the derivatives. So that exists already, but it would be a nice feature to have here as well.
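Coming back to the file-level deduplication estimate: since the checksums plugin already stores a SHA-256 per file, a rough estimate of the potential savings is a single aggregate query. Here is a sketch; the table and column names ("checksums", "sha256", "size") and the connection string are assumptions made for this illustration, not the actual Debsources schema.

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql:///debsources")  # assumed DSN

    with engine.connect() as conn:
        # bytes currently stored, counting every copy of every file
        total = conn.execute(text("SELECT SUM(size) FROM checksums")).scalar()
        # bytes needed if each distinct content (same SHA-256) were stored once
        deduped = conn.execute(text(
            "SELECT SUM(sz) FROM "
            "(SELECT MIN(size) AS sz FROM checksums GROUP BY sha256) AS d"
        )).scalar()

    print("without dedup: {} bytes, with dedup: {} bytes, saving: {:.0%}"
          .format(total, deduped, 1 - deduped / total))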
Another kind of crazy idea, which I like quite a lot: I don't know if you have ever seen the Linux Cross-Reference; it's a website that lets you browse through the kernel's source code and hyperlinks each function usage to the function's definition. Imagine being able to do this kind of thing across the whole Debian archive, so that when you have a package using a library and you browse the source code of that package, where it calls into the library you click on the function name and you end up in the other package, at the definition of that function. That would be very cool. It's not that easy, because as long as it's a single project like the Linux kernel, even if it's big, you don't have that much ambiguity, whereas if you do this across the archive you will have a lot of symbols that have nothing to do with your package but happen to share names with its own definitions. There are strategies you can use for disambiguation, for instance looking at the dependencies of the package, or requiring the same language, and so on and so forth, but it would be a fun exercise.

So, if you are interested and if I managed to get your attention on this project: we try to follow the Debian best practices for advertising development information. In my opinion, every single Debian service with a web interface should have a footer pointing you to the source code of the service, to where to report bugs, and to how to contact the maintainers of the service. So just look at the footer of sources.debian.net and you will find all the development information: the pointer to the git repository containing the code (which, by the way, is AGPL licensed, so the source code should always be available no matter who deploys the system), a pointer to the list of bugs, and there is an "about" page with more information. The place to discuss development is the debian-qa mailing list, so just show up there, or on the Debian QA IRC channel; feel free to highlight me, I'm zack on the channel, if you have specific questions about Debsources. To summarize: Debsources seems to be a very simple idea, yet it is very useful, and there are quite a few fun development tasks in case you want to participate. Thanks.

Question from the audience: we have a derivative, and one of the things we have to do is pull all the sources, do analysis, and find out what all the licenses associated with the pieces are, so that we know the legal ramifications of that. Is there a vehicle or mechanism in your analysis and filters to extract, perhaps, all the licenses, an index of licenses?

So, in the service offered on the web, not yet. What I'm working on, because it's a topic I'm very much interested in, is using the same dataset to expose on the web, and via some kind of API, the content of the debian/copyright files, as long as those files are machine-parsable. You might be aware that we have two different formats for debian/copyright: the historical one, which is not machine-parsable (you can do some sort of heuristics on top of it, but not much more than that), and a newer format, which is the machine-readable debian/copyright format. I've been monitoring the usage of that format using this corpus, and it's now at about 60 percent: if you look at unstable, about 60 percent of the licensing information that Debian provides is encoded in that machine-parsable format. What I want to have on top of this is a service in which you give the name of the package and the version of the package, and it gives you back the information contained in debian/copyright.
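For reference, the machine-readable debian/copyright format is an RFC-822-style file, so extracting per-file license information from it is straightforward with python-debian; here is a minimal sketch (the file path in the usage comment is just an example).

    from debian import deb822  # python-debian

    def licenses(copyright_path):
        """Yield (files-pattern, license-short-name) pairs from a
        machine-readable debian/copyright file; paragraphs without both a
        Files and a License field (the header paragraph, stand-alone
        License paragraphs) are skipped."""
        with open(copyright_path, encoding="utf-8") as f:
            for para in deb822.Deb822.iter_paragraphs(f):
                if "Files" in para and "License" in para:
                    yield para["Files"], para["License"].splitlines()[0]

    # Example usage:
    # for files, lic in licenses("debian/copyright"):
    #     print(files, "->", lic)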
So if you want to do the analysis yourself, maybe because you have a derivative with other information, there are essentially two possibilities: either we make the idea of injecting derivatives in here real, and then you will get the service for free, or you can deploy Debsources yourself and use the machine-readable data.

Right, that was the second part of my question: is this something we could set up and run a copy of ourselves, or is it fairly complex?

No, no: the source code of the platform itself is available, you can deploy it wherever you want. I'm not aware of other deployments yet, but I've documented how to do it, and I've been in touch with a couple of people interested in deploying it for their own internal distro or whatever, so it's absolutely feasible. In the paper I've also documented the full process you would need in case you want to re-inject all the data we have here; probably you are not interested in the historical data of Debian, but maybe you have historical data of your own.

Actually, there is quite a bit of interest in the historical side, and we would probably use your instance in that space, because code volatility, bug volatility, all of those things are important in understanding the stability of a package that comes from those sources.

OK, great, so it seems there is some margin for collaboration. Thanks. OK, Michael.

Michael: you mentioned that there's a problem with mapping source and binary packages; I guess that information is in dak, right, and Don also mentioned it in his talk. So I wondered, given that newer versions of Postgres can query remote tables as if they were local, did you talk to people about just using that table directly? I don't know whether it would be a performance problem, but maybe that might be a solution, rather than everybody redoing the whole thing all the time.

I agree. As maintainer of the main instance I would be skeptical about adding a runtime dependency on another service, but as maintainer of the code base, yes, I would love not to have to redo all the injection myself. What I think would be very cool is to agree on some sort of model, for SQLAlchemy or whatever, for a subset of the package information, and have shared code that takes that information from the packages and injects it into the database; at least then not everyone would be rewriting the injection code. Maybe Paul has some comment about that. OK, Paul's comment is "we will talk". One more question, just behind you.

A few years ago I was working on a legacy project, and I needed to be able to build something that required specific versions of libraries, and I ended up installing woody. Oh my god, it was already like ten years old at that point. So I can see some usefulness in the historical stuff: I could figure out which versions of things I could find, and where, because otherwise this stuff wouldn't even compile.

Yes, I agree; that kind of use case, long-term preservation of source code, is very dear to me. I don't think Debsources should become the platform you get the code from, but it's absolutely reasonable to expect to use it to find out which version you need and then go directly to the source.
archive.debian.org is the proof that the Debian project in general cares very much about not losing old releases, so maybe you could just use Debsources to find a specific version, and then go to archive.debian.org and retrieve the version of the package you want. Maybe I could help make that easier by associating, with any single version of a package, the origin of that specific package; maybe I could provide direct pointers to the .dsc that exists in the archive, or something such. If that would help, let me know and I can totally work on it.

Hopefully I only have to do this once ever in my life. So far! Felipe.

Felipe: Hey Zack. Following up on Michael's suggestion: there are a few things you said you want to implement that align with packages.debian.org and package QA stuff, like finding file names and finding binaries, and the tracker also has an entire dump of things. How are you planning on avoiding duplication, or on integrating? Do you have anything planned with those: the tracker, packages, packages QA?

So, regarding packages.debian.org, I didn't think about getting in touch with them, also because it seems to me the code base might be kind of old these days, so I wanted to do something more maintainable and, more importantly, something that could be more appealing for new contributors to contribute to. I don't know what we can share with them; with tracker.debian.org that might be a different matter.

Anyone else? OK, thanks a lot for your attention.