I'm going to share this talk with Matthieu; let me start with the first part. The main thing you can do is navigate to an individual source code file which is part of a past or present Debian release, for instance one from the cowsay utility, and you will see that source code file. You can do some fancy stuff like adding messages, as I will show you later, but by default you will just see the content of the file, syntax highlighted, in your browser; so you get a pretty good rendering of the source without having to obtain it on your own disk. Having a web app to do that is actually quite useful, because it means we can standardize and give people access to the source code of Debian independently of the machine they are using to go through a specific piece of source code. So, as an alternative, you can just go to sources.debian.net with whatever browser you want. We care quite a bit about graceful degradation, so it also works with text-only web browsers, and you can use it on a wide range of machines.

I will briefly go through the features you will find on Debsources. The basic feature I have already mentioned is syntax highlighting, so you might wonder which languages are supported. We use a framework which does the highlighting client side, in the browser, called highlight.js; it supports around a hundred and ten languages, and contributions of additional languages are actually welcome. All the well-known languages are in there, and we have managed to add support for things which were not supported, like Scilab source files, thanks to patches contributed by Debian users. This is what you get by default. Which parts of Debian are included? Quite a bit, actually: you will find all the suites that we call "live" suites.
So basically all the Debian releases you can find on the Debian mirror network are included, from oldstable to experimental. Everything which is on the Debian mirrors will be in there, and every time the mirror network is updated we receive a push notification, so the content of Debsources is updated immediately. But we have more than that: we also have all the archived releases, going back to hamm. The reason we don't have the releases before hamm is that there are some differences in the format of source packages back then, so we would need to do some fiddling with dpkg-source to be able to extract them, and that isn't working yet. But all releases from hamm to experimental are included, which is quite a bit of source code.

In addition to browsing, file by file and version by version, to the specific file you want to see, you can also search through all this amount of source code, and you can search in various ways. You can search by package name (and we do support incomplete matches, if you don't know exactly the name of the package you are looking for), you can search by file, and you can search the content of files. How do we search the content of files? We index them in various ways.
The native indexing we do ourselves is based on ctags. Every single time we add a source code file to Debsources, we run ctags on that file, which means that all the symbols defined by developers, such as variable names, constant names, type names, and so on, can be used as keywords to search for content in Debsources. Additionally, we have an integration with codesearch.debian.net, which allows you to do regular expression searches on the actual content of source code files, and you can use codesearch.debian.net from the sources.debian.net interface to search the content of source code. Just a caveat: codesearch.debian.net does not index all the source code we have in Debsources. It only indexes unstable, I think, and it's not updated on every mirror push, so the amount of source code you search through codesearch.debian.net is a bit smaller than what you find in Debsources; but it's usually what you want to search through for bugs and that kind of thing. The search, of course, is not done on the fly.
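To make the ctags-based indexing described above concrete, here is a minimal sketch of how such a symbol index can be built from ctags cross-reference output (the `ctags -x` format). The field layout shown and the in-memory list are illustrative assumptions only; the real Debsources updater stores these records in its PostgreSQL database.

```python
# Sketch of a ctags-backed symbol index (hypothetical, simplified).
# Input lines follow the `ctags -x` cross-reference layout:
#   name  kind  line-number  file  source-line...

def parse_ctags_x(output):
    """Parse `ctags -x` lines into symbol records."""
    records = []
    for line in output.splitlines():
        parts = line.split(None, 4)  # name, kind, line, file, source text
        if len(parts) < 4:
            continue
        records.append({
            "symbol": parts[0],
            "kind": parts[1],
            "line": int(parts[2]),
            "path": parts[3],
        })
    return records

def lookup(records, symbol):
    """Keyword search: every place where a given symbol is defined."""
    return [r for r in records if r["symbol"] == symbol]

sample = """\
main             function     12 hello.c          int main(void)
GREETING         macro         3 hello.c          #define GREETING "hi"
main             function      7 bye.c            int main(void)
"""
index = parse_ctags_x(sample)
print(lookup(index, "main"))
```

A lookup like this is what backs the symbol search: any developer-defined name found by ctags becomes a search keyword.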
It's all precomputed: we have pretty big indexes (I'm going to show you some numbers later), and it's actually pretty fast. This is the user interface you can use for searching, and it allows me to point out another kind of search that I haven't mentioned: we do SHA-256 indexing of all the source files in Debsources. So if you want to check whether a specific version of a specific source file is included in Debsources, you can do that by just entering the hash of the file.

We support not only displaying the content of specific source files; we also offer some additional features which serve different use cases. For instance, you might be a developer who wants to point other developers to a specific line of a source code file and add some annotation like "hey, I found a bug here"; or you might be a user who stumbled upon some source-code-related error while using your Debian machine; or you might be using a tool that does static analysis of files and want to associate specific lines of code with the errors generated by that tool. We support that with a syntax which allows you to link into Debsources while adding a specific message in the URL. In this example, you see a URL to which we've added the message "Debian rocks", and we associated it with a specific file and a specific line of that file. You can do that sort of operation, and you can also ask Debsources to highlight those lines for you. So you can use Debsources as a static rendering of the Debian code base and point into it using links from other web applications. Here are other examples: you can add a message associated with line 17, and another example here is a static analysis case, in which you might have found an error and can attach it to a specific line. Everything is documented: there is a specific URL scheme you can use, and it's actually pretty simple, just GET parameters you add to the URL.

Thanks to the fact that we compute checksums for all the source code files in Debsources, we can also find duplicates. When you go to the rendering of a specific file in Debsources, you will find (maybe with this light you cannot see it very well) that here it says "duplicates: 4309". That's because this is a fairly common file: the textual version of the GPLv3 license. Debsources knows that, among all the files it indexes, there are about 4,000 identical copies of this file throughout all the source code indexed by Debsources. It's an example of a very large number of duplicates, but it can be interesting to find out that specific files are actually shipped by many different packages, and we have the information to do that.

You can use Debsources from other applications in the Debian infrastructure, and it has actually already been integrated into a number of other services in the Debian ecosystem. I've already mentioned codesearch, and the way the integration works is that you have a search form on Debsources: when you enter a search query there, you are redirected to codesearch, which recognizes that you are coming from Debsources.
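As a rough sketch, the annotation URLs described above can be generated mechanically. The `/src/<package>/<version>/<path>/` layout matches the examples from the talk, but the query-parameter names (`msg` for a line-anchored message, `hl` for highlighted lines) and the package version string are assumptions here; the URL scheme documented on sources.debian.net is authoritative.

```python
# Build a Debsources link that attaches a message to a line and highlights it.
# NOTE: the "msg"/"hl" parameter names and the version "3.03-9.2" below are
# hypothetical illustrations, not the documented scheme.
from urllib.parse import urlencode

def annotated_url(package, version, path, line, message, highlight=None):
    base = f"https://sources.debian.net/src/{package}/{version}/{path}/"
    params = {"msg": f"{line}:{message}"}   # message anchored to a line
    if highlight is not None:
        params["hl"] = highlight            # e.g. "17", or a range
    return base + "?" + urlencode(params)

url = annotated_url("cowsay", "3.03-9.2", "cowsay", 17, "Debian rocks",
                    highlight="17")
print(url)
```

A static-analysis tool could emit one such link per finding, so that a bug report points straight at the offending line.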
So when you find a result on codesearch.debian.net, you are redirected back to sources.debian.net for rendering the result you have found. This is actually pretty nice, because it avoids reinventing the wheel of having many different applications displaying Debian source code with syntax highlighting: you can just talk to us and we can integrate your service with Debsources. It is already integrated into the package tracking system and into the new package tracker, tracker.debian.org. There you will see a nice link, "browse source code", and when you click on it you will be redirected to the specific version of the package you are visiting; actually, sorry, in the PTS you will be redirected to the latest version of the source code of the package which is available in Debsources. You can also embed specific windows of Debsources in your web application: we have an API where you can use an iframe and have the syntax-highlighted rendering of a specific source code file embedded in your application.
This works, but it might be a bit clumsy; not every web developer likes using iframes. So if you want something better, talk to us and we can see if we can find better forms of integration.

We also have statistics, both specific to individual versions of individual packages and aggregated over all the data in Debsources. For instance, when you are within a specific version of a package on Debsources, you will find what we call an infobox: something floating on the right while you're browsing the source code, containing all the information we have extracted from that specific package. You will find general Debian information, like links to the PTS or the version control system; you will find information about the archive area (this package is in main); you will find which suites this package is available from (jessie and sid in this case); and you will find information we have computed ourselves after ingesting the package into Debsources. For instance, here you can see that Chromium is actually two gigabytes of source code when you extract it on your machine, and that it contains about two million developer-defined symbols: names of functions, names of constants, names of variables, and so on and so forth. And you will find the result of sloccount: we compute how many lines of code of each language are in this specific package.
You will find that the topmost language in Chromium is C++, with about 10 million lines, then about three million lines of ANSI C, then Python, and so on in descending order. Those are the statistics specific to a given package; then you have statistics aggregated over the whole content of Debsources. You can go to sources.debian.net/stats/, and there you will find metrics like how much space it takes to have all the source code of Debian sid on your machine, or the number of lines of code currently in sid. As an example, these are the stats for sid as of yesterday, I think. In sid we currently have about 11 million source code files and 23,000 source packages; it would take you 228 gigabytes of disk to have all that source code on your machine; there are about 127 million developer-defined symbols, and a bit more than a billion lines of code. This is what we ship when we ship sid to our users. If you're curious, the most popular language in sid is C, then C++, then Python, and it wasn't the case a few years back; so we also have nice trends showing the evolution of programming languages over time. And we have statistics which are not only aggregated over all releases but also dedicated to specific releases. We do some fancy graphs (we're not that good at graphing, but we try). This, for instance, is the evolution of the most popular programming languages over time: you see squeeze here, you see wheezy, jessie, sid, and in between the smaller suites, like the update suites or the LTS suites.

The last thing before passing the mic to Matthieu: everything I have discussed is available via the user interface, something you use in your web browser, built on top of HTML and JSON basically. But all the information we expose in the user interface is also exposed to developers via an API. There is a JSON-based API which you can use to extract any kind of information you might want from Debsources and use it programmatically. For instance, if you are interested in the statistics we have in the infobox of a specific package, like how many lines of C source code are part of Chromium, you can use the API to extract that information. And with that, I'm passing the mic to Matthieu for the upcoming news.

[Matthieu] Thank you, Zack. Okay, I'm going to present the new features which have been integrated into Debsources since last DebConf. There have been a lot: we had first an Outreachy student, Jingjie Jiang, and we mentored the work of two GSoC students, Clément Schreiner and Orestis Ioannou. We also had many new external contributors; you can see the list of their patches here. One new feature enables multiple pop-up messages to be included in the code. This is useful, for instance, with compilers: if they find many errors, they can point the user to many places in the code. This was not possible before. Something new that needed a lot of refactoring, done by our Outreachy student, is the blueprints: we use Flask internally, in Python, for the web app, and now we have several blueprints which are plugged together, which makes it possible to build new applications on the code base and on all the code we archive in Debsources. More on this later. Another nice, fancy feature is the detailed listing of files.
You can see the permissions, the file names, and their sizes, just like with `ls -l`. File editing: through a Chromium or Firefox plugin, you can now edit files directly in your browser with some JavaScript magic; you just click a button and a patch is generated, which you can send to the Debian maintainer of the package. That means the entire Debian archive is editable through a JavaScript-capable browser. Thanks to Raphaël Geissert for this.

Okay, about GSoC now. We have a new blueprint application, plugged into all the code archived in Debsources, which is about the copyright files. A part of the archive uses machine-readable debian/copyright files, so why not parse them and display this information in the web interface, compute statistics about the usage of licenses, generate SPDX files (a license exchange format developed by the Linux Foundation), and offer an API to get this same license information? Through the web interface it looks like this: you can click on the files and on the licenses, and they will point you to the right places. A second external blueprint application is the patch tracker.
Maybe you've heard of the old patch tracker, which is not online anymore. As a new blueprint made by our GSoC student Orestis, it does more than the same things, with a couple of new features; for now it only supports the quilt format. It syntax highlights the patches just like the source code in Debsources, and you can also query it via an API to get the list of patches and their content. It looks like this: you can view and download the patches directly, and get the list of their descriptions and their file deltas, aggregated on one page.

Our second GSoC student, Clément Schreiner, worked on an asynchronous updater, because until now we had a synchronous updater, which needed a rerun of all the stages of the updater: adding packages and computing stats, or garbage collection, to give just some examples. We are using the Celery Python package to spawn independent, asynchronous tasks. If you want more information about these features, the students presented them in the GSoC session this afternoon; you can watch the video, and the features will be available really soon on sources.debian.net, we are merging them currently.

Other new features: a bit of refactoring, so that there is now a proper top-level Python module (this is intended to ease packaging Debsources, which will happen one day), and a new configuration layout. We are now flake8 compliant, which means better Python code, for sure. Test coverage: today we have 85 percent coverage; I don't remember the number from last year, but that's way better. You can search for package names case-insensitively, a small new feature. We have better statistics charts, thanks to our GSoC students. And Python 3 support.
I think it's almost there, modulo dependencies. Now the roadmap: the new features we would like to have in the coming months. The big picture is to use Debsources as a platform, or the display platform, of something bigger, called Debile, which runs automatic static analysis tools. Debsources could be used to display the results of these tools, along with the messages they produce; we could embed those messages in the pop-up feature we saw before. We could also gather statistics about the evolution of bugs and how they disappear from release to release; you can check the Debile project for that. We would like to have more live statistics (stats are nice) about licenses, patches, and their evolution. File name search, which is currently not possible. The link between binary packages and source packages: for now you can only look for source packages, and it would be nice to have an automatic link if you don't know the name of the source package you're looking for. And tarball-in-tarball support, which is not trivial: for instance, what do we do with tarballs in tarballs in tarballs? That's never easy.
Oh well, and 100% test coverage; this will happen one day as well. And file-level deduplication: we have a lot of similar files (the GPLv3 text we mentioned before was an example), and you can see that the deduplicated core of the code archived by Debsources is about 45 percent. This is mainly due to files copied across different versions of a package, so if we could deduplicate at the file system level, we would save a lot of disk space.

Okay, about the technologies behind the box. Almost everything is in Python: the web application uses the Flask framework with blueprints, Jinja2 templates, and all the web-related technologies like HTML, CSS, and JavaScript. PostgreSQL is behind it, to store the data; we use the Apache web server, and SQLAlchemy as an ORM to talk to PostgreSQL. That's a picture of the schema behind Debsources. You can see that, as a user, you have access to the Debsources web app; as a miner, you can access the web app through the API to get JSON results; and you can also run SQL directly on the DB, if you talk to us, if you have some kind of special request for some interesting statistics, for instance. We have a local mirror and a copy of this mirror with all the extracted packages (this takes a lot of space, as we'll see), and we handle all the details of updating from the mirror and getting the information from archive.debian.org. Here are a couple of tables from our PostgreSQL database.
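To illustrate the kind of ad-hoc SQL query mentioned above, here is a toy example using an in-memory SQLite database. The real Debsources backend is PostgreSQL and its schema is richer; the table and column names below are simplified assumptions for illustration only.

```python
# Toy version of an ad-hoc statistics query against a Debsources-like
# database. Hypothetical simplified schema; the real one is PostgreSQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ctags (symbol TEXT, package TEXT, version TEXT)")
db.executemany(
    "INSERT INTO ctags VALUES (?, ?, ?)",
    [("main", "cowsay", "3.03"), ("main", "hello", "2.10"),
     ("greet", "hello", "2.10")],
)

# Example question: how many distinct packages define a given symbol?
row = db.execute(
    "SELECT COUNT(DISTINCT package) FROM ctags WHERE symbol = ?", ("main",)
).fetchone()
print(row[0])
```

On the real database, the same shape of query, run over hundreds of millions of ctags rows, is the kind of "special request" statistic the team offers to compute.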
You can see there is a lot of information in there: for instance, we have almost 400 million different ctags entries; that's the biggest table. About disk usage: the extracted sources currently take more than 800 gigabytes, PostgreSQL almost 150 gigabytes, and the source mirror 135 gigabytes. In total Debsources needs 1.1 terabytes, and it's increasing: it was 800 gigabytes one year ago. That's the evolution of the disk usage on our machine; the peak is due to the injection of all the suites from archive.debian.org, so that's normal.

Okay, one last point: we also use Debsources as a research platform, especially for the statistics we gather on all the content. Debsources is a huge software collection. It's homogeneous, because all the software in there respects Debian's packaging format; it is up to date; and we have 20 years of source code evolution in there, plus plugins to compute stats on all this source code. So we can draw nice charts, for example of what the trending programming languages are and how they have changed over the last 20 years. These are all the plugins we have, so you can see the evolution of the number of packages, the number of files in the archive, the number of lines of code, of ctags entries, and the disk usage. About the ctags, there is something odd with squeeze: it's a bug, it's not supposed to be there, and we currently don't know why it looks like that. Apart from that it's mostly increasing, no surprise there. This chart is the file size per language.
It's actually quite interesting to see that shell scripts were really big, and became bigger and bigger, before becoming smaller in the recent releases. That's the absolute evolution of lines of code per language: almost every language is increasing, and you can see notably C and C++ getting bigger and bigger. On the other hand, we have also computed the relative evolution of these languages, and C is decreasing: there are more and more C files, but not compared to the rest, notably because there is more and more C++, Java, and Python. It's not getting bigger for jessie: at the time we computed this, jessie was still testing, and there were things like two versions of the Chromium package and of the Linux package in there, which explains this small peak on the right.

If you're interested in this, you can read the research articles Stefano and I have written. The first one is about the charts I just showed you; the second one is about the Debsources data set and how you can use all the data in Debsources in a reliable way. The PDFs are online.

Finally, one last point: if you want to hack on Debsources, first you can clone the repository, which is hosted under the Debian QA team; everything is written in the HACKING file. Or you can use Docker to set up a Debsources instance on your machine with a couple of lines to execute, the Debsources Docker build and run: you will get a container which contains Debsources with test data included, something like 10 or 15 different packages, so you can begin writing code on Debsources without setting up the entire thing, which takes 1.1 terabytes. Then you can contribute: fix the bugs listed there, write plugins following the plugin examples which are in the repository, add new features, etc. We would be glad to see that. Thank you very much. Any questions?
Q: Thanks for the presentation. Do you have any statistics beyond the copyright files, about debian/rules for instance, like which build systems are used, or anything which is quite specific to Debian, like Debian patches and that kind of thing?

A: For the patches, we are currently merging the work of our GSoC student to compute them. For debian/rules, no, we don't, but it's quite easy to write a plugin to gather those statistics, because the plugin will be executed for every new package which comes into the archive. So that looks like an interesting thing to do.

Q: Is Debsources in the Debian archive?

A: No, it's not packaged yet. It's on my to-do list. Actually, the biggest part of the work was the refactoring into a proper top-level Python module, which is now done, so I guess it should be pretty easy to do now.

Q: And then we will be able to browse the sources of Debsources in Debsources; that would be a great contribution, right?

A: I didn't think of that. We bootstrap!

Q: Hi. Joachim Breitner wrote an email today to debian-devel with an idea to make Debian source packages available as git repositories: each source package would be one tagged commit in such a git repository, and he wrote in his email that he would target snapshot.debian.org as a source. Now I am asking myself (he's not here, unfortunately) whether Debsources would be a better source to get reliable data about what was in the source tarball at a specific version in a specific suite, make tagged commits out of it, and provide this as git repositories, either pre-generated or dynamically on request.

A: I can take that. Actually, no: it would be better to do that with snapshot.debian.org, because our history here is much more coarse-grained than what there is on snapshot.debian.org. On snapshot.debian.org you have every version that has ever existed in a given Debian release at a given time; here, as I like to say, we only have the version of a package which was available when a release was made. So we have one package version per suite, one for jessie, one for hamm, while on snapshot you also have all the versions that were uploaded during the development period of a given release. So we have many fewer versions of any individual package than what is available on snapshot.debian.org.

Q: Does snapshot also go that far back in time?

A: No, that's a different problem, you're right. We go further back in time, because I think snapshot.debian.org only has data since 2006 or so; but what is not on snapshot.debian.org is on archive.debian.org. So for old stuff, yes, it would amount to the same to use what we have, or archive.debian.org. Thank you.

Q: Just a somewhat silly question: now that you have accumulated this many files, did you do stats, like how many files collide on the MD5 sum but are distinct files, and then how many on SHA...?

A: Well, a longer checksum would be different; we only have SHA-256 right now. Right, we haven't done that yet, but it would actually be fun to do. I don't think there would be any collision, by the way, not even with MD5.

Q: Hello, would you be interested in a contribution which imported...?
A: Yes. And I said yes before the end of the question, because we are always interested in contributions. There is an open bug about integrating into Debsources source code from other derivatives, and that's actually something which would be really cool. But before tackling that, we need to implement file-level deduplication, otherwise disk usage would just explode. After that, I think it would actually be pretty easy: essentially, the key we are using is the suite name, so as long as we have a clear naming scheme which differentiates packages coming from other derivatives, there shouldn't be a problem.

Q: I'm curious about uses of your service by external people, like: were there any publications by academics outside of your team? And also, for instance for people looking at copyright litigation or whatever: is the SPDX stuff useful? Do you have any evidence that it is really used?

A: The publications we've shown you are still pretty young; the oldest one is from last year. So no, I'm not aware of other publications except the ones we have made. I have been contacted by researchers interested in using the data set, but I don't know if they have published anything yet. Regarding the licensing information: one of the use cases which motivated our work on the Debian copyright tracker is that we have both the license information and the file checksums, so we can provide information about licenses and copyright, as they are seen in Debian, to other people, and we have an API that allows you to do that. It doesn't mean our information is correct, because we might always have bugs in debian/copyright files, and actually we don't have that many machine-readable debian/copyright files yet, even though in sid it's more than 50% of the archive these days. But that's one of the use cases, and actually a very interesting one. We also have ctags, so you can do fancy stuff like: take a binary, look with nm at the symbols which are in this binary, see whether in Debsources there is a matching file which has all these symbols, and then see what the license of that file is according to Debian. So yes, those use cases are one of the reasons why we're doing this work. And we have been contacted by the SPDX working group about working together, but we don't have anything publicly available yet. Oh well, you never know.

Any other questions? Okay, I'll say that's it. Thanks a lot. Thank you very much. Thanks.