Hello, sorry about the delay. If any of you has Gobby installed, you can help me with the slides with this procedure. Gobby is like IRC with a text editor. You can connect to the text editor and take notes all together, which is fun. I'm asking because this talk is divided in two. The first half is me talking about what I've done. The second half is me asking you how to do things, because I reached a point where for a couple of things I have no clue. But it's fun. They are very interesting technical problems, for a computer scientist's definition of interesting. I don't see people connected. Maybe it doesn't work, maybe no one cares. You can take notes disconnected. It's just less fun.
The problem I'm trying to solve here is to build smart interfaces to browse the large Debian archive. I don't buy the "Debian is too large, we should stop getting packages into it". I prefer to say Debian is too large for the current tools we have to browse the archive. Let's make new ones and put even more packages into it. So that's the long-term goal.
Now, the first problem I think needs solving along this path is to have some index that can actually support browsing and searching on a massive amount of data, and that scales. Actually, that's the second problem I had. The first one was to have categories. That is sort of okay right now: we now have debtags. We now have categories. They work decently. So the second step is to integrate everything into a nice, fast, cool index that can be the engine for something smart to browse and look for packages.
APT is an index, and it's a good one, but not for searching packages. apt-cache does a substring search over every single package: one by one, all packages, look at the description, substring match. It's not ideal, and it's not fast. That's not what it's made for.
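The difference between scanning every description and looking a word up in an index can be shown with a minimal sketch. Everything here is an assumption made for illustration: the three packages, their descriptions and the function names are made up, and neither apt-cache nor Xapian is implemented this way internally.

```python
from collections import defaultdict

# Toy package data, made up for illustration.
PACKAGES = {
    "gimp": "GNU Image Manipulation Program",
    "mc": "Midnight Commander, a powerful file manager",
    "imagemagick": "image manipulation programs",
}

def substring_search(needle):
    """Linear scan: look at every description, one package at a time."""
    needle = needle.lower()
    return sorted(name for name, desc in PACKAGES.items()
                  if needle in desc.lower())

def build_index(packages):
    """Inverted index: map each lowercased word to the packages using it."""
    index = defaultdict(set)
    for name, desc in packages.items():
        for word in desc.lower().replace(",", " ").split():
            index[word].add(name)
    return index

INDEX = build_index(PACKAGES)

def indexed_search(word):
    """One dictionary lookup instead of a scan over all descriptions."""
    return sorted(INDEX.get(word.lower(), set()))
```

Note that the indexed lookup matches whole words only: `indexed_search("manip")` finds nothing, while `substring_search("manip")` finds both image packages. That is the same trade-off the talk comes back to later with GTK and libgtk.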
So I'm not asking for APT to have a full-text search index, because APT should solve package dependencies fast and figure out what needs to be installed on a machine. I don't want to bloat APT. So I created another index. A different kind, an optional index if you want. So if you don't need to search the package archive in crazy ways, you just don't install it.
What the new index should have is fast full-text searches. apt-cache is slow. You don't get results right away. Never mind search as you type. So, okay. Faster tag searches. Well, I created debtags, well, not just me, but we have debtags and we want to use it. We don't want to wait ages for things to happen with it. debtags is not supposed to be used by grepping the Tag field with grep-available or something like that. There should be something like "give me all X11 programs" or something. And I would really like the index to be extensible. I've already gone through the pain of inventing a new piece of metadata for the Debian archive. If someone else has an idea for a new piece of metadata, ideally with this index they have a place to put the data instead of having to invent a new index or whatever.
So, that is the problem: this index. The solution is apt-xapian-index. It's a really ugly name, but "apt" is because everything having to do with packages has apt in the name, "xapian" is the technology behind it, and "index" is because it's an index. So, once you explain it, that's what it is. It's an index that sits in /var/lib/apt-xapian-index/index. Mistake in the slides. It's based on Xapian technology. If you don't know Xapian, it's a full-text search engine like Lucene, that sort of thing. It has Xapian in the name because the API to search the index is the Xapian API. So, if you change the tool, you change everything. If you want to make it based on Lucene, then you will make apt-lucene-index, which is a different index that will have something, I don't know.
I basically decided to go with Xapian because it does everything. It can index text but also numbers and dates, and it understands numbers as numbers and dates as dates. So you can query all packages with size less than this amount, or sort the results by package size. Or if you had dates, you could take dates into account as well. It has bindings in all sorts of languages, and they're decent. I've never had a problem with the Xapian API. And it's fantastically abusable. I was trying to do things without it, and then I went to the Xapian IRC channel and I was like, yeah, I'd like to search things, so I'm making the Xapian query like this, but then I want to post-process the data. And usually they were like, yeah, if you don't post-process the data and just tweak the search like that, you actually get what you want. That has happened rather often in my experience. So I was like, well, you got me.
And it's self-documenting. There's a README file in the index that will tell you what you find in it. Indexing is done by update-apt-xapian-index, because every tool to update something in Debian starts with update-something. It can be run interactively, and in that case it gives a nice percentage indicator of how far along the indexing is, or you can run it from a weekly cron job.
An interesting thing about indexing is that if you need to inject new data into the index, you can add a plugin in this directory, and the plugin can feed data during the indexing process. For example, apt-xapian-index now indexes package descriptions and debtags by default. If we created a package able to download popularity contest information from the popularity contest page into the local system, that package could install a plugin over here to add the popularity contest data to the index. Easy, done.
To search the index, you just need the plain Xapian API.
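The plugin idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `PopconPlugin` class, its `doc` and `index` methods, the `popcon_votes` value name, the dictionary-based documents and the vote numbers are all hypothetical, and the real plugin interface of apt-xapian-index may look quite different.

```python
class PopconPlugin:
    """Hypothetical indexing plugin: adds popularity data to each document."""

    def __init__(self, votes):
        # votes: package name -> vote count (made-up numbers in this sketch)
        self.votes = votes

    def doc(self):
        """Text the indexer would add to the index README for this source."""
        return "popcon: value 'popcon_votes' holds the vote count as a number."

    def index(self, document, pkgname):
        """Called once per package while the index is being built."""
        document["values"]["popcon_votes"] = self.votes.get(pkgname, 0)


def build_index(packages, plugins):
    """Run every plugin over every package; return documents and README."""
    docs = {}
    for name, description in packages.items():
        document = {
            "data": name,                          # document data: package name
            "terms": description.lower().split(),  # plain description words
            "values": {},                          # numeric values from plugins
        }
        for plugin in plugins:
            plugin.index(document, name)
        docs[name] = document
    readme = "\n".join(plugin.doc() for plugin in plugins)
    return docs, readme


packages = {"gimp": "GNU Image Manipulation Program", "mc": "Midnight Commander"}
docs, readme = build_index(packages, [PopconPlugin({"gimp": 1200})])
```

Storing the count as a number rather than as text is what would allow range queries and sorting of the kind mentioned for package sizes.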
I found out in Debian that no matter the effort I put into creating a fantastic new library to search packages, no one would ever use it. So I decided that this time I won't. I point people at a library that already exists, so no one has to learn a new random piece of technology that I pulled out of my pockets. You just use another, existing technology. If I fail, if you don't think it's interesting for indexing Debian packages, you've still learned a nice piece of technology you can use for other things. So this time I kept it as simple as possible. And again, this README file documents the index layout.
To give you an example... where did you go? Stay with me. Okay, that's what the index looks like. How was it, the ls colors? Okay, that's the index, well, almost. Here we have the README file. It tells you what is in the index. Now the funny thing: every plugin for indexing can add a bit to the README file. So when the popularity contest package installs a plugin to add popularity contest information, it will also add a piece of documentation to the index telling how to use that information. So it says, well, we have these sources for indexing and that's the information there. And for every source it says, well, debtags, you can use it with this prefix and blah, blah, blah. Which solves the problem of "I don't know what's in the index, the index is not documented" and all that sort of thing.
The index is inside the directory "index", which is a symlink, because that makes index updates atomic. Inside it are the Xapian things, which we don't care too much about. The interesting thing is that the index updates are atomic. If you ever wrote anything that has to do with the apt index: apt index updates are not atomic. You have to lock everyone else out by opening the index for writing, because it doesn't support read locks.
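The symlink-based atomic update can be sketched as follows. This is a minimal sketch, assuming a POSIX filesystem where renaming over an existing symlink replaces it in one step; the function name, the `index.` directory prefix and the `.tmp` suffix are made up for illustration and are not taken from update-apt-xapian-index.

```python
import os
import tempfile

def publish_index(base_dir, build):
    """Build a fresh index directory, then atomically repoint 'index' at it."""
    new_dir = tempfile.mkdtemp(prefix="index.", dir=base_dir)
    build(new_dir)  # write the complete new index before anyone can see it

    link = os.path.join(base_dir, "index")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # leftover from an interrupted earlier run
    # Relative target, so the whole tree can be moved as one unit.
    os.symlink(os.path.basename(new_dir), tmp_link)
    # Renaming over the old symlink swaps it in a single step: a reader
    # sees either the old index or the new one, never a half-written one.
    os.replace(tmp_link, link)
    return new_dir
```

The old directory is deliberately not deleted here, since a reader that opened it before the flip may still be using it; when to clean it up is a separate decision.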
And so if you have a tool to search packages and someone else does apt-get upgrade, your tool will crash. Or you run it as root and they can't do an apt-get update. In this case, it's just: create a new index, flip the symlink, which is guaranteed to be atomic. So no one is bothered if the index is updated while you are using it. Question, yeah?
My IP, you guys, what's my IP at the moment? Yeah, good, for the Gobby. Yeah, there's nothing on the right here, guys. Yeah, interesting. 192.168.42.170. Okay, it's firewalled. Never mind, Gobby. Okay, well, firewall between wired and wireless.
So that is what the index looks like. Which one is the terminal? What do you think? This one. I'll show you a run of update-apt-xapian-index. It may take a while, so it's not done in the postinst. You install the package, you don't get the index right away, because otherwise someone would have murdered me, painfully. So I didn't. There's instead a cron job that drains your laptop battery once a week. But at least you don't notice it. It's just updated once a week; since it's not fundamental information like package dependencies, even if it's a week out of date, maybe it's not a big deal. Done. It runs quiet from cron. And it even has command-line help, which is always welcome.
Now, who is using apt-xapian-index at the moment? We have two or many things, depending on how you look at it. Tools that use it: one is GoPlay. Have you ever heard of GoPlay? GoPlay is nice. Sorry, maybe the font isn't big enough. GoPlay is the package manager of the Debian Games Team. You run it... there's something wrong here, but let's run it from another place. You run it and you see games. That's the popularity. Miriam is actually the user interface. And the maintainer of everything. You then get to choose which kind of game you want to play. Which one? Not in Debian. sex in Debian was a simple editor for X.
But it's not in the distribution anymore, as far as I know. So you get, I don't know, sport games. There aren't that many. Let's get something we have a lot of, like board games. And then you choose the interface. Three-dimensional board games. Or daemon board games, which is a Monopoly game network server. Text-based interactive, whatever. An interface. And you also have, for some, a screenshot. Package description, tags, and everything. Interestingly, if we go to something like arcade... did I... okay, I messed it up. It also supports, via debtags, an external tag source for ratings. These are taken from outside the archive because ratings are subjective, so we don't distribute them with the Packages file because people would flame us to death. So anyone can make a rating tag source, debtags is able to pull it in, and apt-xapian-index will happily index it. So it supports plugging in new categories, say for custom Debian distributions, whatever.
So that's GoPlay. Which can actually become, at your option, GoLearn, a new experimental feature, and you can choose what you want to learn. Art, chemistry, geography; which user interface, World Wide Web, interactive; mathematics, server... mathematics for geography probably makes sense somehow. GoAdmin: backup, hardware support, login, and so on. It's going to become skinnable, so it can be used by other task-specific things. So it's really quite interesting.
Oh, another thing of GoPlay that I didn't show is search. So this is GoAdmin, actually. I want to search for configuration. I type "config" and it searches as I type. Automatically, this list of things will shrink according to what is displayed. Now, most of admin is configuration, so it doesn't really make sense. "Kill user": obviously we have many of that as well. Kill user on the command line. There you go. Is there user management? There you go. And that's all the things matching "kill user" with user management.
Yes, what it's doing is actually giving you the best match at the top. Indeed cpu, congratulations to whoever gave it that name, is a replacement for useradd, usermod, userdel. So it is indeed for removing users from the system, although the name really doesn't suggest it. Should have guessed it. No, okay. You get the best results first, because Xapian does scoring by relevance. So first you get the ones that match both terms, and then you degrade into ones that match only one. It's even more fun than that, but I really need to hurry up with the presentation.
Code examples: what you can actually do with it. So let's get a terminal. There we go. And let's go and fetch the code examples. I blogged about all of these, but I'm sure no one has read it, so now I force you to read them. Who has read my blog posts about apt-xapian-index, including the code attached? Three people, okay. Right, three and a half. Okay. I edited the configuration file and Vim realized it and updated itself right away. It's one of those happy moments. Okay. Simple query, two programs full screen. I don't wanna know. I just pray.
What you do is access the database. That's Python. The examples have been ported to Ruby, although the person who did it just told me he did but hasn't given me the results yet, because he wants to clean them up a bit before publishing or something, but Ruby is going to come. You get English stemming, because Xapian also does stemming. You can look for "used" or "using" and it will work. It doesn't matter: the ending will be fudged appropriately. So here: if it looks like a debtags tag, prepend the tag prefix. Tags are indexed with the XT prefix. Words are indexed lowercase to make the search case-insensitive. Then, okay, make this list of terms and OR them all together. And that's the query. You create an Enquire object that will give you access to the results. Oh, sorry. Let's get the simple query.
I forgot where I should have started from, but it's actually the same. Open the database, the stemmer, same; create the term list. Everything is OR'd together, because Xapian will give the best matches first. The best matches are the ones that match everything. So it acts like an AND that degrades into an OR as you go. Then get the first 20 results and print them out. Every document in Xapian has data associated with it. Here the data is the package name. So I fetch the package name for every match, look it up in the apt cache and print it out with the short description. And that is: look for "image editor", and there we go.
Never mind that it doesn't show the GIMP. It's not that apt would show it to you either, because the GIMP is "image manipulation". If you want to see the GIMP, I think it's there, but very much lower. It got cut off; that was the first 20. Where was it? Well, there's a trick in one of these: once I get the results, I take the tags of the top matching results and feed the tags back into the query, and then the categories of the top results will pull in other packages with similar categories but different descriptions. That brings in the GIMP nicely.
Just other things you can do. Xapian can suggest terms that could improve the search. So: this tag, "works with raster image", "works with image", the word "images", the word "editor", okay. digiKam itself, because it's got lots of plugins, so if you look for image editors, you get lots of descriptions containing "digikam" inside. And so on. Okay, and those are tags that would improve the search.
You can do search as you type. Sorry for the black on white, whatever. Fast enough. Note that it does not do substring search. You can actually look for Midnight Commander by typing "mc". The fact that it doesn't show is just a bug. Yeah, packaging... so now if I search for "mc", you get it over here, over here. All these actually contain "mc".
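The query-building steps from the demo (XT prefix for tags, lowercase words, everything OR'd, best matches first) can be sketched without Xapian. This is a simplified stand-in under stated assumptions: the scoring just counts matched terms, which is much cruder than Xapian's relevance weighting, stemming is left out, and the three toy documents and both function names are made up.

```python
def build_terms(args):
    """Turn query words into index terms, following the rules from the talk."""
    terms = []
    for arg in args:
        if "::" in arg:
            # Looks like a debtags tag (facet::tag): use the XT prefix.
            terms.append("XT" + arg)
        else:
            # Plain word: lowercase for case-insensitive matching.
            # (Stemming is omitted in this sketch.)
            terms.append(arg.lower())
    return terms

def rank(documents, terms, limit=20):
    """OR all terms together; documents matching more terms come first."""
    scored = []
    for name, doc_terms in documents.items():
        hits = sum(1 for term in terms if term in doc_terms)
        if hits:
            scored.append((-hits, name))
    return [name for _, name in sorted(scored)[:limit]]

# Made-up toy documents: package name -> set of index terms.
DOCS = {
    "gimp": {"image", "manipulation", "XTuse::editing"},
    "gpaint": {"image", "editor", "XTuse::editing"},
    "vim": {"text", "editor", "XTuse::editing"},
}
```

With these toy documents, searching for "Image editor" puts gpaint first because it matches both words, then the packages that match only one. That is the "AND that degrades into an OR" behaviour, and it also shows why the GIMP, described as "image manipulation", ends up lower in the list.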
Probably stemming comes into play and makes it weirder, but well, the code is actually simple. I would have liked to force you to read it more, but I don't have time. I need to move on to the most important part, which is help. It's written up over there, tiny and small, because I'm shy.
So, one of the things is that it's extensible and you can index all sorts of other data. The problem is pulling the data into the system. For example: popularity contest information, statistics about the bug tracking system, ratings provided by external websites like iterating.org. It's been at least two years since I promised them I would find a way to put their data into Debian, because they kindly made a Debian-only view of their data on purpose. But it's a bit tricky. We already have apt-get update. Then we have debtags update, and we're going to have a popcon update, an iterating update, a BTS package info update, and it all becomes a bit boring.
So my idea would be to have one package for every data set we want to get, like one package for popcon, one package for BTS statistics, because in that case you can choose which ones you want to have, and every package can have a cron job or whatever update policy it prefers. The package would have a copy of the data set inside, so you can install from CD and get a starting set of data to work from. Then either a tool that can be run to fetch the data from the internet, or a plugin system to fetch the data using a single update-everything tool that looks at what data sources are installed and calls their plugins. That is something to be decided. Possibly a cron job, the data written somewhere in /var so it can be accessed raw instead of just through the index, and then an apt-xapian-index plugin to index it. So that would be my sort of vision of how new data could come into the system, but you can understand it's sort of tricky. If someone has ideas, now...
I'm not really sure if you've got a question for me.
But I think, if you're saying that you have a lot of packages with all that random data which they need to get, they would probably be duplicating a lot of work, and they would all do it the wrong way. So I think it's a good idea to design some sort of API to say, well, these are some commands you should be able to support, and maybe some code with which they can do their specific thing to actually get their data, without duplicating all the things that all the others do as well.
Yeah, so like a common fetching system.
Yeah, and it can also do the cron job as well.
Which kind of sounds like apt could support sources that don't start with deb, so you could have other sources for apt, and then apt-get update would pull them into the system. I was looking for Michael Vogt, who I don't see at the moment. I'll beat him up later for not being... ah, thank you for being here.
And Michael has implemented hooks after apt-get update.
And run your database update now if you want, after apt-get update.
Just write a stamp file and look for it in your cron job. So we create a cron job that fires like every hour and just looks at whether the stamp file was updated. And then goes on. Like this.
I think the only concern is with the design of this, and I think the whole idea of having a package system to download additional data on apt-get update is just the right thing to do. We may consider some stuff as really heavyweight, like screenshots, for example; that's data you don't want to pull for every package.
Yeah.
Like in one big chunk. But what kind of data, I guess.
Screenshots could be just packages, because they don't change often.
Yeah, exactly. Same for icons, I think. You should just put them in a package.
Right, so I guess we could even talk about it, because now that there's the hook system, if apt-get update downloads something else, the hook can tell some packages to have a look and process it.
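The stamp-file suggestion from the discussion can be sketched like this. It is only an illustration of the idea: the function names are made up, the paths are passed in by the caller, and nothing here reflects how the actual apt hook or any existing cron job is written.

```python
import os

def index_is_stale(stamp_path, index_path):
    """True if the stamp written after an update is newer than the index."""
    if not os.path.exists(stamp_path):
        return False   # no update has happened yet: nothing to do
    if not os.path.exists(index_path):
        return True    # never indexed: build it
    return os.path.getmtime(stamp_path) > os.path.getmtime(index_path)

def cron_run(stamp_path, index_path, rebuild):
    """Meant to be called often (for example hourly); rebuilds only if needed."""
    if index_is_stale(stamp_path, index_path):
        rebuild()
        return True
    return False
```

The hook run after the update only has to touch the stamp file, which is cheap; the expensive reindexing happens later, from cron, and at most once per update.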
Which could mean that if we can make, instead of deb, something else that downloads a file and just puts it in a directory, that could be good enough for many things. That could simplify things a lot.
We could have a special one, like downloads, minus, popcon, and then just an HTML trigger to write. And you put it in, like, /var/lib/apt stuff. And then the hook will call a script that looks whether there is popcon in the /var/lib/apt stuff. And does it.
Let's talk about that later. Okay, solved. Yeah.
Um, names. Exactly. There's a games-thumbnails package, so that's solved too. GoPlay depends on games-thumbnails, which is okay because they don't change very often. So that can go slowly, no problem. And then GoLearn will depend on learn software thumbnails, except it's the same package because it's a symlink. But we could make a package that only contains a symlink and be done with it. Depend on GoPlay and screenshots. I don't know. It's okay.
Other problem: Debian-specific stemming. As I said, apt-xapian-index does not do substring search. The problem is that computer people like to stick words together. So when you look for GTK, you won't match libgtk, because that would be a substring search. Which means we are not speaking proper English; technically we are speaking Debian English, which has different stemming rules. Now, different stemming rules are easy to implement at indexing time, because it's just another plugin: it looks at the package description data, creates new terms, feeds them in. It's tricky because the same thing has to be done when you search. When you search, you need to do stemming the same way you did when indexing. I can make a library with the Debian-specific stemming, but then that would be a library written by myself that no one else will want to use. So I'm a bit... I don't know what to do here.
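A rough sketch of what "Debian English" term expansion might look like, and of why the same function has to be used on both the indexing and the searching side. The prefix rules, the exception list and all function names are hypothetical, chosen only to illustrate the libgtk and GTK case; this is not an existing library.

```python
# Hypothetical rules: package-name prefix -> the word it stands for.
PREFIXES = {"lib": "library", "deb": "debian"}
# Words that must not be split (illustrative exception list).
EXCEPTIONS = {"debian", "library"}

def debian_terms(word):
    """Expand a package-style word into the terms to index or search for."""
    word = word.lower()
    terms = [word]
    if word in EXCEPTIONS:
        return terms
    for prefix, meaning in PREFIXES.items():
        rest = word[len(prefix):]
        if word.startswith(prefix) and len(rest) >= 2:
            terms += [meaning, rest]   # e.g. libgtk -> library, gtk
            break
    return terms

def index_terms(description):
    """Indexing side: expand every word of a description."""
    return {t for w in description.split() for t in debian_terms(w)}

def query_terms(query):
    """Search side: must apply exactly the same expansion as indexing."""
    return {t for w in query.split() for t in debian_terms(w)}
```

Rules this naive misfire easily: "liberal" would be split into "library" and "eral". That is the exception problem in miniature, and a real rule set would need a much longer exception list or a dictionary check.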
I tried to ask the Xapian people for suggestions, but they don't have... I mean, you can't plug your own stemmers into Xapian, so that wouldn't work. And I'm a bit at a loss. Some interesting stemming problems: libfoo becomes both "library" and "foo". debfoo becomes "Debian" and "foo", so it's not just splitting, but also completing some parts. cvsdelta, cvsgraph, gnomecatalog and so on. GNU-something: usually you index the something, but then GNUstep, you don't look for it when you look for "step", so there are exceptions. But it's actually a general problem with composite words, like Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz, which is apparently the longest German word. It's the name of a law about tracking beef in some part of Germany, but it's a nickname of the law.
Another way to fix this problem is to actually write these words in descriptions: PostgreSQL could put in its description that it is the PostgreSQL database, for example, or libgtk can mention GTK in its description. So that is a call for package maintainers to put in the description all the keywords they would expect to be used when looking for the package. That would actually solve a lot; possibly the first four points are pointless as long as the fifth point is done. Yeah.
Is there a keyword list here in Berlin?
Well, hopefully, if that is the keyword you would use to look for that package, it is a very meaningful word to describe that package. So hopefully, yeah. There's debtags, which helps a lot, because the trick of using the tags of the top matches to pull in other packages that don't match exactly but have the same tags also solves a big deal of the problem. So again, the GIMP is not an "image editor", but through tags we can actually recognize it as such. In fact, I forgot to show this. Where are you... ah, you get mouse scroll wheel, you bastard. Let's get the browser.
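The tag trick mentioned here (take the tags of the top text matches, then pull in other packages that share them) can be sketched on toy data. The four packages, their description words and tag assignments are made up for illustration, and the real implementation feeds tags back into a Xapian query rather than filtering sets as done below.

```python
from collections import Counter

# Made-up toy data: package -> (description words, debtags tags).
PACKAGES = {
    "gpaint": ({"image", "editor"},
               {"use::editing", "works-with::image:raster"}),
    "kolourpaint": ({"image", "editor"},
                    {"use::editing", "works-with::image:raster"}),
    "gimp": ({"image", "manipulation"},
             {"use::editing", "works-with::image:raster"}),
    "vim": ({"text", "editor"},
            {"use::editing", "works-with::text"}),
}

def text_search(words):
    """Packages whose description contains every query word."""
    return sorted(n for n, (desc, _) in PACKAGES.items() if words <= desc)

def frequent_tags(names, count=2):
    """The tags that occur most often among the given packages."""
    tally = Counter(tag for n in names for tag in sorted(PACKAGES[n][1]))
    return {tag for tag, _ in tally.most_common(count)}

def expand_with_tags(words):
    """Text matches first, then packages sharing the top tags of those matches."""
    matches = text_search(words)
    tags = frequent_tags(matches)
    extra = sorted(n for n, (_, pkg_tags) in PACKAGES.items()
                   if n not in matches and tags <= pkg_tags)
    return matches + extra
```

On this toy data, a search for "image editor" matches gpaint and kolourpaint by text, and their shared tags then pull in gimp even though its description says "manipulation". `frequent_tags` on its own corresponds to using descriptions purely as a way to find tags.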
debtags.debian.net is a new thingy that I started to play with a couple of weeks ago. The server and bandwidth are kindly sponsored by one of the companies I work for. And if I look for image editor here... actually, I didn't do any graphical layout or anything. I just stitched the thing together. It uses apt-xapian-index as the backend. That's another thing that uses the same backend. It's just an index built with a custom package database that merges package information from all architectures.
So what I'm doing here is I look for "image editor". This is a search on package descriptions, but I only take the frequent tags of the packages that result from the descriptions. So it basically uses the package descriptions to search for tags. It's very clever. So with image editors, what I get is the list of tags that are usually associated with image editors. Let's see if this works. It does. And then I just click on tags. Yes, I want it for X11. Yeah, that's another way of seeing it. Yes, I want to do editing, and I'm down to 28 results. Yes, it's a program... is the GIMP in there? No, it's not. Yes. Okay, it's over here. That's another way to look at package descriptions: just as a tool to look for tags. It works incredibly well. It's very smart. If you go to the online tag editor, one of the toys with which you can look for tags is that you just type what the package roughly is, and it will tell you lots of tags you can actually use, and they make a lot of sense for tagging that package.
What else to index? Those are my three on the list. There could be more interesting data to fetch. It's a simple question, so if you have an idea later, pop it up.
I was just thinking about indexing the file sources so that we can find files with some patterns, but that wouldn't work because of the substring searches, so maybe nothing.
So indexing source code is interesting, but you're looking at a several-gigabyte index.
I was thinking about the file list.
Oh, okay, file list, yes. An index like apt-file. Yes, that's another thing to index. Can someone send me an email about it? It's firewalled because I'm wired and the two... you have, what? Okay, there's another Gobby session on what IP? Found automatically with fancy things. Okay, if you see a Gobby session, join it and write that down. 192.168. Okay, shame it's five minutes before the end. Now, this one is, that's this slide.
What languages are... right? Another one, right. What languages this package is translated into? Good one.
Another one that kind of popped up on the mailing list is licensing information. People asked me to do tags for licensing information, and I tell them, well, give me a list, a not too long list, of tags that describe 90% of the packages in the archive and I'll have no problem doing it. No one has managed it yet. Well, yes, tags would be a very good way to solve it, but as keywords it may work better. I know there's a new format for debian/control, sorry, for debian/copyright, which may have a summary inside, which then could sort of be searched. That would be another idea, but it's yet to come, because this new format isn't actually finalized yet.
Doesn't work. We have way weirder licenses than anyone could possibly think of, like: yes, this is GPL, but if you link it with that specific software, it's LGPL, but I have a friend who likes to use it differently and I only allow him to do it when he's kind to me. Now, we have had that sort of license in the archive.
It's really, most of that is the same: GPL, BSD, X, and then you have 40% mess, which is already too much. But then it's sort of a pointless search result.
Right. Yeah, well, could be all, I mean, could be tried, but...
But don't turn camera.
Someone... and then, okay, I'm done. It's the last question. Index update: can it be improved?
You notice that on a Core 2 Duo, whatever, 1.6 gigahertz with a gig of RAM and a SATA disk, it takes about, what was it, one minute to update the index, which could be a lot. So one idea is to do it incrementally, especially now that we can hook in after update, but then we need to see which descriptions and other data actually changed. It could be possible, unless in looking at this you figure out that it takes more time to see what changed than to actually reindex the rest. I don't know, can APT tell me "only these packages have had any changes"? No. Okay. Also, incremental updates increase the size of the index. The reason I recreate the index every time is that it then comes out compact. It suffers from fragmentation a bit if you do updates. It's got B-trees, and they have some leaves left around and whatnot. So the index is about 30 megabytes, and if we start doing incremental updates it may get to 60 or something, which worries me a bit. So that is another tricky one.
Updated packages... I think there is some sort of thing that would be able to get you that difference.
Right, but then it's just marked upgradable. It's not updated.
Yes, but if it's upgradable, it is updated.
Right. Because the package doesn't change.
So you only index the ones for which you have new versions available. The interesting information is that you have new descriptions, right? I mean, that's what you're interested in. But you won't get a new description without a new package.
Okay, we're just about out of time. Thank you, Enrico.
You're welcome.
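The incremental-update question from the closing discussion comes down to finding out which packages changed between two runs. A minimal sketch, assuming a snapshot of package name to (version, description) was saved at the previous indexing run; the function name and the snapshot format are made up, and APT offers no such call according to the talk.

```python
def changed_packages(old, new):
    """Compare two snapshots mapping package name -> (version, description)."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    updated = sorted(name for name in set(old) & set(new)
                     if old[name] != new[name])
    return added, removed, updated
```

Only the added and updated packages would need reindexing and the removed ones deleting. Whether that beats a full one-minute rebuild depends on how expensive producing the two snapshots is, which is exactly the doubt raised above, and it does nothing about the index growing through fragmentation.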