Hello. For those who don't know me, I'm Enrico from Italy, and I'm going to talk about a sort of prototype system I made to publish Debian information, called Debian Data Export.

The point of this is that Debian is a bit of a data hell. We have a lot, a lot of different kinds of data in Debian. Packages files for binary packages, which are in an RFC 822-like format, split by distribution and then by architecture; for source packages it's the same. The mapping between maintainers and source packages is available in a few places, in different formats. Debtags information has its own format as well: the tags themselves, the debtags vocabulary of tags, extra sources of debtags information. Popcon rankings, which can be useful to have so that packages can be sorted by popularity. Bug information: you can access the bug tracking system in several ways, all with different formats. Package changelogs. Desktop files for packages that are not installed, which is another thing people may need, for features like an "add new application" menu. Information about what's in the NEW queue, package screenshots, apt-file information, statistics about the archive, license information, localization information, piuparts status, build logs, size and lines of code of packages, Debian weather reports. Yes, you know the Debian weather reports: they tell you how installable unstable is today, so whether you can upgrade or whether you're going to end up in a dependency mess. Information specific to Debian Pure Blends, like which flavor of Debian a package is used in. And UDD, which has, well, not all of those, but is yet another data source, accessed in yet another way. (Someone in the audience asks about the weather data. I have no idea offhand; I'd need to look for it.)

It's a data hell. You need to dig for all of these things and find out where they are, what their format is, and what the protocol to download them is. And you can probably think of more: Joey has been showing graphs of how debhelper is used, and that's yet more information. Most of us generate statistics about something in Debian, put them somewhere, and then we're happy with it. And then no one else uses them, except sometimes someone mentions them somewhere and people go: "I've been looking for that for years." It's a bit of a shame, because some of these things could potentially be used, say, by package managers, to tell you: this package has had RC bugs open for more than a month; that package has fewer RC bugs, but it's not as popular. You could make use of all this information in an interface.

And then, besides the data itself, there are the formats. We've got RFC 822-like files used for several things, and they have sub-formats: package descriptions have a sub-format where an empty line is a line containing a single dot, and there's a kind of wiki-like syntax to make a bullet list; tags have a sub-format in the Packages file; dependencies have a sub-format too. You've got SOAP interfaces, LDAP interfaces, SQL interfaces, plenty of ad hoc formats. There's HTML scraping going on: I think someone made an HTML-scraping interface to the BTS, or to something like it.
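To make that RFC 822-like format concrete, here is a minimal sketch of reading a Packages file with the python-debian library; the file path is just an example:

```python
# Minimal sketch: iterate over the stanzas of an RFC 822-like Packages
# file using python-debian's deb822 module.
from debian import deb822

with open("Packages") as f:
    for pkg in deb822.Packages.iter_paragraphs(f):
        # The Description field has its own sub-format: the first line
        # is a short synopsis; in the long part, a line with a single
        # dot stands for an empty line.
        synopsis = pkg.get("Description", "").split("\n")[0]
        print(pkg["Package"], "-", synopsis)
```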
And then, where do you find the data? Sometimes it's on the mirrors. Sometimes it's in a directory on people.debian.org. Sometimes you need to log into the right machine, which has it in its file system. Sometimes it's elsewhere, outside the Debian network. Sometimes you can only access it from specific machines: if I'm setting up a package search interface on my own server, I cannot access UDD, for example; you can only reach it from the right servers.

Some things you can actually compute on the user's system, to relieve the servers of computational work. Obviously, if you want popularity information, you need to get it from a server. But if you want to know whether a package is installable, you need to compute that on the user's machine to be really precise, because they may have a different sources.list. And some things can only be computed by cooperation between the user's machine and a server, such as package recommendations: to compute those, you need to index all the popularity contest submissions and then query that index with information about what packages are installed on the user's system. So we've got a bit of all of that.

Now, I wanted to start writing a replacement website for the debtags tag submission page. There, as in the existing one, I have the problem that I need to collect a lot of data, such as information about all packages on every architecture in specific releases. So I had the problem of how to get all the data I need, and I got bored of tracking down all the places I should take it from. Then I realized that the information I was collecting for myself was being used by other people, and I had a piece of code ready from other projects, and I thought: it all fits together. So I made this prototype.

My goal is that producing data should be easy: the major task should be computing the data, and all the rest should be easy to do. Finding data should be easy as well. And downloading the data should also be easy, both in terms of protocols and in terms of formats. That's the idea I'm attacking here. I want to use it in debtags.debian.net, where I would like to allow people to tag all packages in Debian, Ubuntu, Pure Blends, possibly other derivatives, on all architectures. I would also like to offer completion in web form fields: it's only some lines of JavaScript, but the problem is getting the JSON information. And I would like to build a machine-readable interface to all the data I produce, like debtags information: at the moment you need to know the URL and fetch it, and I would like a more convenient interface.

So, the prototype, which I'm going to demo. I will, if possible, increase the font size and hope that it works. OK, like this. The prototype I made is DDE, Debian Data Export, which is kind of like a file system. Well, it's like a tree, where plugins generate or access data, and I publish it in a file-system kind of way, as branches of a tree. You can query every node of the tree, and it will give you data. For example, I made a plugin to export the Polygen grammars installed on the system: I can list them, and then I just say "get this grammar", and it will automatically run Polygen and give you the output. That's the kind of simple thing it does.
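As a sketch of what such a plugin could look like: DDE's real plugin API isn't shown in this talk, so the class and method names below are hypothetical. The point, as explained next, is that a plugin just returns a plain Python structure and never worries about output formats.

```python
# Hypothetical sketch of a DDE-style plugin: one node of the tree that
# runs Polygen on demand and returns a Python data structure.
import subprocess

class PolygenGrammar:
    def __init__(self, grammar):
        self.grammar = grammar

    def doc(self):
        # Every node can describe itself, which is what makes the whole
        # tree self-documenting when you browse it.
        return "Random text generated from the %s Polygen grammar" % self.grammar

    def get(self):
        # Computed dynamically at query time; the front end then encodes
        # the returned structure as JSON, YAML, pickle, ... as requested.
        out = subprocess.run(["polygen", self.grammar],
                             capture_output=True, text=True)
        return {"grammar": self.grammar, "text": out.stdout.strip()}
```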
Or you can ask it: give me the same output in JSON format, and it will do that. Or give me the same output in YAML format, and it will do that. The plugins (it's written in Python) produce a Python structure with the data to export, and the front end encodes it in whatever format the user requests. That mostly solves the format issue: when you query it, you just say "give me the data in this format", and it will do that.

So you can list what's available, you can get a value, and you can also ask for documentation: the plugins let you get documentation about every single piece of the tree. It's all self-documenting: you can browse the tree and know what's around and what you're getting.

Now, this is the command line, and it's a bit awkward to use. You can also just run it with --server... and it blows up, because I'm already running it with --server somewhere else. OK, can I have a browser? When you run it as a server, you get an interface that you can browse to see what data is around. It tells you about the available output types: at the moment there are output formats for CSV, JSON, self-documenting HTML pages, Python pickle, YAML, and a text-based representation just to see what's going on. Every piece of data that you see in this tree, you can query in any of these formats.

And when it runs as a web server, it's RESTful: the tree is directly mapped onto URLs. The way it works is you wander around the tree: oh, there are nice Polygen grammars; what are these grammars? The description of each grammar is itself generated using Polygen, so everything the plugins do is completely dynamic. Then you find the information you want, you add "?type=json", and you get the JSON to download. Basically, you browse around the tree until you can say: this is exactly the data I'm looking for. Then you take that URL, add the data format you want, and use it in whatever software you have. It exports JSON, which means you can use it from a web page, from JavaScript; you can make mashups and whatnot just by fetching the data.

Another example is the apt package information. At this level, you get information about all packages, and you can go down and get information about a single package. So depending on how high in the tree you query, you get more or less data: if you need the whole package list, you query at a higher level; if you want information about only one package, you query at a lower level. And when you browse around, you get an example of what you'd receive, so when you're building your JavaScript thing, you can look around first. Or there's completion for binary package names: I didn't add the documentation here yet, I just added it to the plugin and it works. I can put in a prefix, and it will give me all the package names that have that prefix.
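Using any of this from a program is just URL plus format parameter. A minimal sketch, where the exact path is made up for illustration; in practice you'd browse the tree first to find the node you want:

```python
# Fetch one node of a DDE tree as JSON. The path below is an example;
# the pattern is always: URL of the node + "?type=json".
import json
import urllib.request

url = "http://dde.debian.net/dde/some/branch/of/the/tree?type=json"
with urllib.request.urlopen(url) as res:
    data = json.load(res)
print(data)
```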
There are several plugins written already. For example, I made a plugin for the apt-xapian-index. I don't know if you've heard of it; I blogged about it heavily some time ago. It's a big index of package information, tags, and things like that, and it's extremely versatile: you can do a lot with it. You go here, and it tells you what's around. You can put a query with keywords, say "image editor", in the URL, and you get back a data structure, very fast, with a tag cloud related to those keywords, possible completions, corrections, a list of packages that match, suggestions for extra keywords to add, and lots of things like that, all nicely in JSON. If you want to implement a dynamic HTML page using it, it's easy to do. There you have it: it starts with a tag cloud; as I type, everything changes; I get possible completions and suggestions and packages. And you can see it's pretty fast, because the back end is very fast; and this is a static HTML page.

Or, for example, you can use it to complete package names in pretty much every web form. There you go, pretty much on the fly. The source of this is a static HTML page; it could be put on the Debian website. It's basically an HTML input: you give it a class, and you tell it which DDE server to get information from, and it works. It also needs a little bit of JavaScript, which is included. That could already be put on every Debian web page, and you automatically get completion, with a proper fallback for browsers that don't have JavaScript. And if a package manager wants to download popcon information, one can make a popcon export, and it's very easy to query.

The data space that is exported does not try to cover every single thing. It does not try to be a super-generic, big relational database of everything; we have UDD for that. It just tries to export views corresponding to common use cases. The idea is: you have a special need, so you craft an SQL query for UDD, or a hand-crafted query against something else, or whatever. And if it becomes a general need, that query can be turned into a plugin for DDE, and then it's readily accessible, in whatever format, to pretty much everyone, in a self-documenting way.

Current users of DDE: I've put the apt-xapian-index version online on debtags.debian.net. screenshots.debian.net uses DDE to get package information for its database. It can be used to make completion in web forms. And it can be used to emulate apt-file without having local file information: for example, this is a little script that just queries DDE for any package that contains "bash" and gives you the result right away, without downloading all the file information. It's not a drop-in replacement for apt-file, because it doesn't support regular expressions, due to the way I index the data to make it super fast. But the apt-file author is here at DebConf, and I need to catch up with him. The idea is not to replace apt-file, but: if there's no local database, apt-file could tell you "there's no local database: run apt-file update, or, if you're in a hurry and don't need regular expressions, run the remote query instead" and get your results.
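The remote apt-file script is essentially that one-URL pattern again. A sketch of the idea, where the endpoint path is an assumption for illustration:

```python
#!/usr/bin/python3
# Sketch of a remote apt-file: ask a DDE instance which packages ship a
# given file, instead of keeping a local Contents database. The endpoint
# path used here is hypothetical.
import json
import sys
import urllib.parse
import urllib.request

def remote_apt_file(filename, server="http://dde.debian.net/dde"):
    # No regular expressions: the server-side index only does exact
    # lookups, which is part of what makes it so fast.
    url = "%s/aptfile/%s?type=json" % (server, urllib.parse.quote(filename))
    with urllib.request.urlopen(url) as res:
        return json.load(res)

if __name__ == "__main__":
    for pkg in remote_apt_file(sys.argv[1]):
        print(pkg)
```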
Obviously, this sort of thing can be used to add extra features to package managers: apt could get a fetcher that pulls data out of it, although that may not be a good idea, because the data is dynamically generated, and if lots of apts start hammering it, I don't know if it copes with the load. It can be used to feed external websites that show Debian data: for example, if someone makes a corporate distribution and has a website showing the list of packages, even on an intranet, and they want to feed that package database with information coming from Debian, they can pull it out of DDE without problems with parsing and so on. Or different flavors of Debian can put a DDE of their own online, which shows a personalized view of the package database; people can set up DDE instances to provide different views, and if they keep the same tree layout, switching the software that uses the data is just a matter of pointing it at a different URL.

In terms of deployment, it's a Python WSGI application, which in theory you can deploy any way you like, because WSGI is an extremely thin, small, and simple interface: you can run it as a CGI, you can put it into a web server, you can put it into pretty much anything. In practice: if you deploy it as a CGI, it does not scale. If you deploy it as a standalone web server, using the Python modules that turn a WSGI application into a web server, you'll have all sorts of problems, because none of them really has a clue about streaming data. There is one that does, which is CherryPy 3, but it doesn't have a clue about dependencies, because it conflicts with CherryPy 2, which is needed by pretty much everything else, since they did a completely incompatible rewrite without changing the module name. If any of them are listening today: that may explain why things like TurboGears are switching away from it, because it's hell to live with. Python Paste, if you ask it to reload the web server, will kill all the current streaming connections, which is not nice. FastCGI needs careful tuning, because it tends to say "this CGI process has been running for more than five minutes, I'll kill it", and maybe you are streaming the whole Debian package information to a slow machine, so a DDE query may take 20 minutes. Because DDE doesn't need to generate the data set locally first, it can stream the information, and on the other end you can have something that processes the data as it streams. And mod_wsgi runs in the Apache process space, which does not sound too good either. So, ideas about deploying something like this, which is basically a RESTful dynamic interface to all sorts of data, would be appreciated.

In terms of scalability, it's read-only, and there is no state. So it's a wonderland for caching, which is excellent: you can put Varnish in front of it and be happy. You can use aggressive cache headers, like "everything I give you is valid for six hours", because most of the information only changes when there's an archive run. It can be replicated, and it can use DNS round robin, because there's no state. But obviously, if all web forms make lots of small queries, and lots of people use that feature, it may not scale. It all remains to be seen: if apt starts getting data out of DDE for some nice use case, maybe all the package managers together will kill the thing. And since it's dynamic, generated on the fly every time, you cannot simply use the mirror network for it.
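For the curious, the deployment shape is roughly this. A minimal sketch, not DDE's actual code, showing the two properties just discussed: streaming output and aggressive cache headers.

```python
# Minimal WSGI sketch: stream the response and mark it cacheable.
def application(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        # Stateless, and the data changes at most once per archive run,
        # so telling caches "valid for six hours" is safe.
        ("Cache-Control", "max-age=21600"),
    ])
    def body():
        # Returning a generator streams the body: the whole data set
        # never has to exist in memory or on disk at once.
        yield b"["
        # ... yield one JSON record at a time here ...
        yield b"]"
    return body()

if __name__ == "__main__":
    # Quick local test with the standard library's WSGI server.
    from wsgiref.simple_server import make_server
    make_server("localhost", 8080, application).serve_forever()
```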
Some nice things that could be done: JavaScript mashups that get data from DDE instances running in several places, or pages served from one server that get data from a DDE running on a different server. This is currently an issue, because the security model of JavaScript does not allow you to query information from a server other than the one the page came from, and that's very annoying. Firefox 3.1 should have support for a new cross-site request mechanism that allows you to do that, but that's new, and it's only Firefox. DDE supports JSONP, which is a kind of standardized attack against the JavaScript security model to let you get information from elsewhere, and it's absolutely painful. I could even add JSONPP, which is an even worse attack against the JavaScript security model, but that road makes me sick.

In terms of future development, it's mostly at a "works for me" stage at the moment. I will scratch a couple of itches of mine, but I won't do all the work, because I don't have the time. I thought it was cool to provide completion for package names in web forms, and I made it. Oh, sorry. No, no, I'm just being stupid, don't worry; it's all right. If you want to add extra stuff to it, please write a plugin; I'll be glad to show you how.

The existing DDE is at dde.debian.net/dde; it runs on Debian machines. I guess if you want it, I can paste the link on IRC, as soon as I find IRC. OK. It currently exports apt-file information (that index I showed you before, which feeds data to the remote apt-file), and it exports data out of UDD: there are different views on the big, big package list, and you can get package information, say, all the packages on ARM for every distribution and every part of the archive, or you can, again, go down. At this level, you get basically every package in Debian or Ubuntu, in any version, on every architecture, anything; and as you go down, you can limit things to one architecture, or one distribution, or only main. Ah, a full-screen terminal shows up. That's DDE, which is slow to respond, unfortunately. But you could do that.

And I made, well, not quite a sandbox; I made an experiment with publishing static information. I need to find a blog post of mine. Basically, if in your own home directory on people.debian.org or on merkel.debian.org you create a .dde directory and put stuff into it, it will be published in all the possible data formats and so on. For example, I put some in mine. OK, this explains how to do it; it's not that long ago, but I don't remember it anymore. Then again, it's kind of self-documenting... except the page does not exist. Why? Ah, it's now called "static data". OK. So: in this directory on merkel, you put a .yaml or .json file, and it will show up.

So here's a very simple and really nice thing you can do right now: if you are generating any sort of daily statistics about the archive, write them out as Python pickle, or .yaml, or .json, put the file in that directory, and it will automatically be exported as part of dde.debian.net. That works, and it sorts out the problem of how to publish data; this is a call for everyone who's generating any statistics to just do that. I'll put this on IRC as well. It supports .yaml, .json, and pickle: you just use the different file name extensions, and it will do the right thing for you. You can build your own hierarchies. There is also a way to add documentation to whatever you put online: you just add a .doc.yaml file, or whatever, that describes the data, and again, when people browse the tree, it will tell them what it is. It's all documented here.
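So the whole publishing step can be as small as this. A sketch, with made-up statistics and a made-up file name, of what such a daily job could do:

```python
# Sketch: dump some daily archive statistics where DDE's static data
# exporter will pick them up. Numbers and file name are made up.
import json
import os

stats = {
    "date": "2009-07-27",
    "packages_total": 25000,
    "packages_orphaned": 400,
}

dde_dir = os.path.expanduser("~/.dde")
os.makedirs(dde_dir, exist_ok=True)
with open(os.path.join(dde_dir, "archive-stats.json"), "w") as f:
    json.dump(stats, f)
```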
There is a sandbox directory if you want to play with it. Everything you put there is mounted inside the DDE tree, and if two people put files with the same name, since they are mounted in the same namespace, the first person gets it and the second person is ignored. So if you have something to publish under a fantastically common name, hurry up.

And then we have another DDE at debtags.debian.net, which has the apt-xapian index of all Debian packages and also Ubuntu packages. So if you want to do something like this demo, you feed it with debtags.debian.net/dde, and you can already use it to generate tag clouds and whatnot. And of course the debtags.debian.net one has the Polygen plugin, which is extremely important. Again, the interface is self-documenting: it tells you how to use it from JavaScript, so I don't need to explain it; you just go and say, OK, I will use JSON, and it tells you how. So those are the two public instances that can already be used today. I think I've shown it all, and I can take questions.

Q: I guess my question is... yeah, it's on, I'm just not talking very loudly. I wonder how many things in DDE actually require dynamically queried data at runtime, versus how many don't. I mean, all the data that you're having people upload to the sandbox is just static data. So I think it's an interesting design choice that you chose to let it pull runtime data: do you need that for UDD and things like that, or could you do it entirely statically for most things?

A: I'd welcome ideas on how to do it entirely statically, because if you have, say, a five-level hierarchy and you render it statically, you end up generating a lot of stuff, with all the different views on things. That could be kind of fixed by limiting the views we provide to only those that are useful, which is a perfectly fine approach. But then you need to render each one once for JSON, once for YAML, once for pickle. So if you look at the data space that is dynamically generated, it's pretty big. And package name completions, unfortunately, cannot be rendered statically; they can only be cached in memory with prefix trees and so on. So yeah, it's a bit tricky. Also, the static information that you publish is read into memory, because you can slice it when you query: you can publish a YAML dictionary and then query only one entry of it, or if you publish a hierarchical JSON structure, you can query it at any point. One interesting thing could be looking into new stuff like MongoDB: maybe something can be done there, because they have ways of indexing JSON documents and then querying them very, very fast. I don't know if that would help; I haven't looked much into it yet. But yeah, the data space is huge. I'd like it to be more static, but then it would probably be less useful, so I'm in two minds about it. Certainly, for example, those very slow UDD queries could be improved: at least each query could be made once and then cached somehow. That can definitely be done. Or actually, instead of using UDD, it could be interesting to generate a MongoDB of the whole package information and then query it with their own fast machinery.
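To illustrate the slicing mentioned in that answer: conceptually, the rest of the URL path just selects one branch of the in-memory structure. A sketch, with a made-up data layout:

```python
# Sketch: slice an in-memory hierarchy (loaded from YAML/JSON) using the
# remaining URL path, so clients can fetch only the entry they need.
def slice_tree(tree, path):
    """Walk a nested dict following path segments like 'a/b/c'."""
    node = tree
    for segment in path.strip("/").split("/"):
        if segment:
            node = node[segment]  # a KeyError here would become a 404
    return node

data = {"stable": {"main": {"amd64": {"dpkg": {"version": "1.14.25"}}}}}
print(slice_tree(data, "stable/main/amd64/dpkg"))  # {'version': '1.14.25'}
```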
Q: Would it be possible to feed images into this? Say I'm generating some data that I've collected, and I would just like to put it in too, so I don't have to store it somewhere else.

A: You could, if you can encode them in JSON or whatever, but I guess the easiest way is to generate a static place online with the images and then publish URLs to them, which is what can be done for screenshots: you have a view where you can ask "where's the screenshot of this package?" and it gives you a URL, which you can then use from JavaScript, package managers, whatever.

Q: I have a question. In Edinburgh we had a talk about data-mining popcon, you remember? Is that data stored anywhere, and can I get access to it with DDE?

A: No, you can't, because DDE has no access control whatsoever, and the popcon data-mining thing works on a sort of view over the popcon information that we can't really share, because it may let you learn something about what a specific person has installed.

Q: No, I mean the results they got; I just posted the link, because not everybody here knows it. This data mining was done, and they got results like "users who installed package foo have also installed package bar".

A: That can definitely be public; there can definitely be a plugin for it, yes, absolutely. Is there anybody here who knows where that data is? I didn't do that work.

Q: It was someone else; it was Alan Schröder. I just posted the link to that talk, but I haven't heard anything about it for two years, and for me it's quite interesting data.

A: It's interesting, but I didn't chase it. If we manage to chase him and get the data out, and if the data is still being computed, it's definitely something that can be published in DDE: you put it in some index somewhere, then you just offer a RESTful interface to it, and done.

Q: I think it would be very interesting, especially if you then had a plugin for it in Synaptic as well. You put it there, and then you know where to find it.

A: Yeah, OK. Thank you. Anybody else? OK, if there are no more questions: there are a few links posted on IRC, these slides will be available, and if you couldn't see over there, you can watch it again on the streaming; just fast-forward to where you see this slide. OK, thank you very much.