Welcome everyone. It's my pleasure to introduce Olly Betts to you. He is the long-term maintainer of the search engine Xapian. He's both an upstream developer and a Debian packager, and apparently he's still searching.

Hello everyone, thanks for coming along. So yes, I've been a Debian developer for almost six years now; it'll be six years next month. This is my first DebConf, because about seven and a half years ago I moved to New Zealand, so it's quite a long way to come. I'm self-employed, and I work as a consultant doing work on Xapian, so bespoke development and helping people set it up, and I've been involved in the project since the very beginning.

So, a bit of an audience survey to start us off. Who has heard of Xapian before? Okay, so that's most of you. Who's searched using Xapian? Okay, and who's written code using Xapian? So, one and two halves. I suspect that in fact most of the people who think they haven't used Xapian actually have, because it's installed on 95% of machines reporting to the Debian popcon (the popularity contest), it's used by a lot of the Debian infrastructure, and it's also used by other services; for example, gmane.org uses Xapian for its search.

The project started about 16 years ago. It was originally a commercial closed-source replacement for an engine at a company I was working for, but very early on in development they decided to open it up and release it under the GPL. The plan at that time was to dual-license it, so people could use it under the GPL if they wanted, and if they wanted to commercialise it they could pay for a licence instead, and we'd have had to deal with getting copyright assignment for patches. However, the company went out of business quite soon after that, and I just kept working on it. So I'm the longest-term contributor, because I've been there since the start.

We're using it on quite a lot of the Debian infrastructure now. The numbers on the right are the numbers of documents, like emails or pages, so you can see from this that we write a lot more email than anything else. Just to run through them: lists is, obviously, the mailing list search; search is over the website; packages is a package search, so you can search package descriptions with it; the wiki is wiki pages; and debtags is Enrico's tags for Debian packages.

There's also a lot of packaged software that uses it. This is just the top of the list, ranked roughly by popularity contest. There's quite a lot of package management and similar things; that's what's behind Synaptic and similar package managers. Zeitgeist is a thing in GNOME which tracks what activities you've been doing so it can suggest recently used things. Because the package management stuff tends to get installed by a lot of people, that's towards the top of the list. There's a long tail down below which I didn't include on the slide.

I was intending to talk a lot about this part, but I had to shorten the talk to fit into the time available, so I'm only going to touch on it briefly. There are some usability issues with some of these searches. As an example, the layout of the list search results is not very helpful; I'll just show you. This is the search on the lists site. It has actually got a bit worse in the last week, because I fixed the list search feature, which had been disabled some time ago. But if I search for something, say... you can see most of the results page is just the form at the top.
In fact, if you have a small screen you don't even see any results, and you think it just hasn't matched anything. On my netbook, which is only 600 pixels high, you don't get any of this part on the screen. So clearly we need to do a bit of work on rearranging the form to take up less space. Ideally that needs input from someone with some web design skill, but even simple tweaking of things would help here.

Here's a slightly embarrassing example. If you go to search.debian.org and search for DFSG, the social contract page, which is almost certainly the page you wanted, is the 157th hit. The problem is that the DFSG is only one part of the social contract page, the page doesn't actually mention "DFSG" very much, and the ranking at the moment is based only on keywords on the page itself. This page is actually a bit of an issue for other search engines too: even the mainstream engines, which have a much richer database to work with (link analysis, click-throughs and so forth), only get it into the middle of the top ten.

There are several ways we could improve this sort of issue. These cases aren't very common, but they're very annoying when you hit one, because you just can't find the page you want. The DFSG one is a particularly embarrassing example, because it's clearly a key part of our culture, our ethos.

One thing we could do is tweak problematic pages by adding more keywords to them, which people have tried with this one in the past; it has helped a little, but not really solved it. We could try to maintain some sort of list of golden answers to particular problematic queries. That doesn't scale all that well, because you have to do it manually: someone has to find the queries to do it for, and then you have to make sure it keeps working. We could look at the search logs and the website logs and do log analysis. Generally, the results people click on a lot are the results they want, and likewise the pages people visit a lot on the website, even when they don't come through the search, are pages people want. But there are privacy issues in analysing the logs, and there are also potential abuse issues, with people hammering away at the website to try to promote pages, though that's probably not such an issue for us in the way it is for the big search engines, which search all sorts of websites.

Probably the simplest thing we could do is to also index pages by the text of links pointing at them (there's a rough sketch of this idea below). There are almost certainly a lot of pages on the Debian website that link to the social contract page with "DFSG" in the link text or near it, so we could use that to decide that "DFSG" is important to that page. We don't currently do that. It wouldn't work so well for the list search, though, because the lists aren't a linked structure.

So the thing I'd like to talk about in particular is building a combined search system. Currently, if you want to search for something in Debian, you need to know where the answer is before you start: you have to go and search the lists, or search the website, or search the wiki, and so forth. In some cases you do know where the answer is; if you're looking for a particular package, the package search is the place to look. But many searches could usefully be fulfilled by a result from any of these sources. So why don't we try to provide a unified search UI?
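As a rough illustration of the link-text idea mentioned above, here is a minimal sketch using Xapian's Python bindings. The function, the database path, and the boost factor are all assumptions, not anything from the actual Debian setup; the point is just that anchor text gathered while crawling can be indexed into the target page's document with a higher within-document frequency:

```python
import xapian

# Hypothetical sketch: when indexing a page, also index the text of
# links pointing *at* it, gathered while crawling the site.
def index_page(db, url, body_text, incoming_link_texts):
    doc = xapian.Document()
    doc.set_data(url)
    doc.add_boolean_term("U" + url)     # unique ID term for the page
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))
    tg.set_document(doc)
    tg.index_text(body_text)
    # Treat each piece of anchor text as extra evidence about the page,
    # with a wdf boost so it counts for more than ordinary body text.
    for anchor in incoming_link_texts:
        tg.increase_termpos()           # gap so phrases can't span fields
        tg.index_text(anchor, 3)        # 3 = within-document frequency boost
    db.replace_document("U" + url, doc)

db = xapian.WritableDatabase("web-index", xapian.DB_CREATE_OR_OPEN)
index_page(db, "https://www.debian.org/social_contract",
           "...page body...", ["DFSG", "Debian Free Software Guidelines"])
```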
These searches all use Xapian, and Xapian provides a way to transparently search over multiple databases at the same time. So I had a look at doing that at the end of last week, and I'm presenting Sauron: Searching All Useful Resources Over the Network. It's now time for a demo.

Let's search for "code of conduct". Because we have nearly 50 times as many emails as anything else, you'll notice these results are dominated by email results. All of these are email apart from this one here, which is a page on the website about the vote for the code of conduct.

This is a bit of a duct-tape-and-string demo; there are lots of things in it which don't quite line up properly. The different searches use different formats to store their data, so I've had to bodge together ways to handle that. In particular, for the wiki results, the wiki stores its information in a Python pickle, so I'm running a little external server, which I've patched, that I call out to to depickle results and return them as something I can handle. There aren't actually any wiki results in this one... what was the wiki query? There we go. The wiki results are a bit brief, because there's no snippet of the document stored there; you have to go and fetch the document, which is just how MoinMoin works. But it does at least find them.

You can actually try this at home, apart from the wiki part; the wiki machine is restricted access, and I requested access to it specifically to set this up. But any developer can SSH to the lists machine or the machine which runs the search front end. You need to be running Jessie, because that's the version of Xapian those machines are running, and there was a minor change in the remote protocol to fix a bug, so if you use the version from unstable it won't work: the protocol versions aren't the same.

You create a small file, which in Xapian is called a stub database. Each line in it names a database to search. All this is telling us is that we're searching a remote database: we search it by running the command ssh with the machine name, and on the remote machine we run xapian-progsrv, which is a version of the remote protocol server that just talks on standard input and standard output. The thing at the end of the line is the path of the database on the remote server. (A sketch of such a stub file is below.) If you create that file and point Omega at it, Omega being the CGI front end I'm using here, which is part of Xapian, it just works.

The other part you need is the bit that deals with the different formats of the two databases. It's a sort of undocumented feature that you can do this, and I was quite pleased to discover it. There are two different ways in Omega that you can store the fields of a document. You can store them as name and value each time; or, because that stores the names over and over again, there's another way where you store just the values, one per line, and you configure the names of the fields once. The lists search does it one way and the web search does it the other, simply because the people who set them up did it differently. But you can actually switch dynamically, on a per-hit basis. When you search two databases, Xapian interleaves the numeric document IDs. So what this is saying, the $eq{$mod{$id,2},1} part, is basically: is the document ID even or odd? If it's odd, we set the field names; if it isn't, we set the field names to empty, which means Omega looks for the field names in the document data itself. (That fragment is sketched below too.)
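To make the "try it at home" recipe concrete, here is a minimal sketch of such a stub database file. The host names and database paths are placeholders rather than the real ones; each line tells Xapian to run the given ssh command and speak the remote protocol to xapian-progsrv on the far end:

```
remote :ssh lists.example.org xapian-progsrv /srv/search/lists-db
remote :ssh www.example.org xapian-progsrv /srv/search/web-db
```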
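And here is a rough reconstruction of the OmegaScript fragment being described. $if, $eq, $mod, $set and $split are real OmegaScript commands, and fieldnames is the Omega option that switches between the two storage formats; the particular field names are assumptions for the sake of the sketch:

```
$if{$eq{$mod{$id,2},1},
    $set{fieldnames,$split{caption sample url}},
    $set{fieldnames,}}
```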
And with those two pieces you basically get the search I showed, but without the wiki part.

There are some fairly obvious improvements we can make to this; it really is just a prototype at this point. We can align the data formats better: there's no good reason to do it differently in the two databases. And there are various fields which are present in some sources but not others, yet are meaningful in all of them. Things like timestamps and languages aren't stored in the same way everywhere, and that means I can't do a sorted search over this: to sort by date I need the same date format in both databases, and they use different ones. It would be nice to have a UI that lets you select which data sources you want, because for some searches you don't want mailing list results at all, and results from the mailing lists would just be a distraction. It would also be good to add some more data sources to this. I did start to look at adding the packages search, but then I realised I'd have to sort out more Python unpickling, which had already eaten most of yesterday afternoon.

To make this into a system which is more than just a demo, we'd really want these databases on a common server rather than searching remotely. If we search remotely, we get into issues like what happens when one server is down and the other server can't talk to it, and how fallback works. We're already replicating the databases for the web search, so it wouldn't be very hard to set that up to replicate the others too. And it would be good to have some diversification of results: we basically write too much email, so everything tends to get a bit swamped by results from email, and it would generally be better if the first page offered you some email results, some wiki results, and so forth.

If anyone's interested in helping out with this, you're very welcome; various skills are needed, and you don't necessarily need to know Xapian well to help. And that's really it, apart from questions. I'd like to thank Sledge and the DSA team for sorting me out with access to the wiki machine, without which part of the demo wouldn't have been possible. There's a Xapian BoF this afternoon at 3:30 in Amsterdam, which is slightly obscure to find: if you walk down the hallway there are some vending machines, and there's a doorway just before them with a little sign saying Amsterdam, and you go up the stairs. You can find out more on the Xapian website, xapian.org; there's an IRC channel on Freenode; there are mailing lists; and I'll be here all week, until a week on Sunday. So, if anyone has any questions?

Q (David Bremner): Hi. Olly knows who I am, but for those of you who don't, I'm David Bremner. I'm one of the notmuch developers, so thanks for adding us to the long tail there. You probably did that just because I was here.

A: That was actually roughly where you were in the popcon ranking.

Q: Right, okay. So I wondered about using one of the libraries built on top of Xapian that provide threaded search results, and whether that would be a useful addition. In terms of presenting the results of a mail search, there are tools, mu and notmuch, both in Debian, which can give you a nice threaded view. Then of course you have HTML issues about how to present those threads, but still.
A: Yeah, I think that would be interesting. It's also another way to deal with the avalanche of mail results you get (sorry, I've got a bit of a cold at the moment): a lot of them tend to be in the same threads, and really you don't want every message in the thread, you want to collapse it down to show that here is a thread which is discussing this. (There's a sketch of thread collapsing below.) Any other questions?

Q: For the problem you've got with the DFSG search, would it help if you added a synonym for "Debian Free Software Guidelines" to the list of synonyms when it's indexed?

A: There's a synonym feature, which works at search time rather than index time, but yes, it probably would help; I'll make a note of that. (A synonym sketch is below too.)

Q: Is the source code for this demo available?

A: It isn't currently, but it's a very small patch, which I'll make available. You've got the source to the "try it at home" bit anyway; it was on the slide.

Q: Would adding weights help? Could you weight the results from the website heavier than the lists? Say, give the lists a 0.5 weighting, so that you tend to get results more from debian.org?

A: Yes, you could. The trick is how to pick those weights. There's quite a popular area in academic research called learning to rank, where you basically put all your weighting factors into machine learning, feed it a training set, and it rejigs all your factors. So the way to adjust the weights would probably be to get a machine learning system to decide them for you. The problem with picking one arbitrarily is that it'll look great for the case you're looking at, and all the other cases will just get worse. (A weighting sketch is below as well.)

Q: And are you thinking of adding search facets? So you could have a list of different facets: this is the number of results we got from the lists, and this is the number from the wiki.

A: I hadn't thought about it, but we could; there are features in Xapian which allow you to do that.

Q (Enrico): So, I maintain apt-xapian-index and the query tool for it, and I've noticed that writing the software is reasonably easy, but tuning it is an entirely different thing, and I have a bit of a hard time staying on top of it. I'd like to have some people who occasionally try out these query tools and figure out "oh, this is actually useless because you can't find things, but it could easily be fixed by tuning the indexing this or that way, or the searching this or that way."

A: We don't get very much feedback about bad searches, from what I've seen, and I'm not sure how we can encourage people to give it. And unfortunately the average user wouldn't know how to tune things, because there's a fair bit of knowledge involved.

Q: I would like to have a group of people in Debian who read up, or get a little bit of training, on how to tell good results from bad results, and how to look at the indexing and propose things. I don't know how to put that together, but I can see that 90% of the work is writing the code, the remaining 90% is tuning it, and that remaining 90% that makes it useful tends not to be done. I feel like I write lots of nice things that are nowhere near as useful as they could be. That's currently my concern; I don't know how to address it, but I wanted to raise it.

A: It's probably quite an involved topic; probably worth discussing afterwards.
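On the thread-collapsing point: Xapian can collapse a result set on a document value, so if the indexer stored each message's thread ID in a value slot, something like this sketch would give one hit per thread. The slot number and database path are assumptions:

```python
import xapian

THREAD_SLOT = 0  # hypothetical value slot holding a thread ID per message

db = xapian.Database("lists-index")
enquire = xapian.Enquire(db)
enquire.set_query(xapian.QueryParser().parse_query("code of conduct"))
enquire.set_collapse_key(THREAD_SLOT)   # keep one result per distinct thread
for match in enquire.get_mset(0, 10):
    # collapse_count reports how many further matches the thread had
    print(match.docid, match.collapse_count, match.document.get_data())
```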
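On the synonym suggestion: the synonyms are stored in the database but applied when the query is parsed. A minimal sketch, with the database path and the particular expansions chosen purely for illustration:

```python
import xapian

db = xapian.WritableDatabase("web-index", xapian.DB_CREATE_OR_OPEN)
# Store some search-time expansions for the term "dfsg"...
db.add_synonym("dfsg", "guidelines")
db.add_synonym("dfsg", "free")
db.commit()

# ...and enable synonym expansion when parsing the user's query.
qp = xapian.QueryParser()
qp.set_database(db)
query = qp.parse_query("dfsg",
                       xapian.QueryParser.FLAG_DEFAULT
                       | xapian.QueryParser.FLAG_SYNONYM)
print(query)  # "dfsg" now also matches its stored synonyms
```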
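And on the weighting question: one way to try a fixed 0.5 factor by hand is Xapian's OP_SCALE_WEIGHT. This sketch assumes a combined database in which every document carries a boolean term naming its source (the XSOURCE* terms here are invented for the example); as the answer above says, a factor picked by eye rather than learned is liable to help one query and hurt the rest:

```python
import xapian

db = xapian.Database("combined-index")          # hypothetical combined index
q = xapian.QueryParser().parse_query("code of conduct")

# Restrict the query to each source, then scale the list results down.
lists_q = xapian.Query(xapian.Query.OP_SCALE_WEIGHT,
                       xapian.Query(xapian.Query.OP_FILTER, q,
                                    xapian.Query("XSOURCElists")),
                       0.5)                     # halve mailing list weights
web_q = xapian.Query(xapian.Query.OP_FILTER, q, xapian.Query("XSOURCEweb"))

enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query(xapian.Query.OP_OR, lists_q, web_q))
for match in enquire.get_mset(0, 10):
    print(match.rank, match.document.get_data())
```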
Q (Bdale): Hi, I'm Bdale. So I'm sort of curious. I mean, this is totally cool technology, and I couldn't live any more without notmuch, thanks David et al. But I can't help wondering, in the context of some of these searches, to what extent relying fundamentally on a keyword-oriented indexing mechanism prevents us from being able to do other things that at least seem like they would be helpful. Maybe it's just that I don't understand the layering of what happens at index build time versus what extra intelligence is laid on at query and result-integration time. But it seems to me, for example, that in the DFSG case, a document whose title is "Debian Free Software Guidelines" ought to carry more weight in search results than a document that merely makes reference to it.

A: The document in question doesn't have that title, though. The Debian Free Software Guidelines were originally an appendix to the social contract, so "Debian Free Software Guidelines" is a section title on that page, not the page title.

Q: Okay, so it's not the page title. But this is the question I'm asking: is there something strangely limiting about the underlying indexing technology being essentially just a text-indexing thing, or am I being stupid in thinking there's a bunch of obvious optimisations here you'd want that might go against that model?

A: I think we should make more use of the information that we have. For example, we have web logs that tell us what pages people look at, and we have links in the pages that give us information about what other pages are about. I think we should make use of those.

Q: And technically, is that something that gets layered somehow into the search interface we're using to access the underlying Xapian index, or does it involve putting some kind of extra information into the index itself?

A: It's probably mostly an indexing-side thing. The DFSG case is quite frustrating. We do give extra weight to document titles, but we don't give extra weight to section headings within them, so that could help. (There's a sketch of this kind of boosting below.) I've done some work in the past where the indexer would pay more attention to words near the beginning of the page and in section headings, and care less and less as it got further down the page; that works quite nicely, especially if you then quite aggressively cut off which words you care about. For that particular case, essentially what we have is two documents in one file. From the search side, things would be happier if they were separate pages, but I realise we shouldn't be changing the structure of our constitution for the search engine's benefit.

Q: Yeah, right. It does seem to me, though, that some of those sorts of tricks, particularly for web or wiki pages, which have an intentional structure where hopefully some thought was put into how the information is organised, ought to weight those pages higher than the randomness of email. I too am sometimes frustrated when search results return a sea of emails that might or might not be terribly relevant, and the thing you're really looking for is buried 100 hits down. So yeah, I don't have great answers; this is not a space I've spent a lot of time thinking about. I just can't help wondering.

A: I think weighting emails down a bit, just because in general they're probably not what you wanted, is a good starting point for a combined search. But the threading would be quite useful too.
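As a footnote to the titles-and-headings point: the usual way to give titles extra weight in Xapian is at index time, by indexing title and heading text with a larger within-document-frequency increment. A minimal sketch; the factors of 10 and 5 are invented for illustration:

```python
import xapian

def index_document(db, title, headings, body):
    doc = xapian.Document()
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))
    tg.set_document(doc)
    tg.index_text(title, 10)        # title terms count 10x
    for heading in headings:
        tg.increase_termpos()
        tg.index_text(heading, 5)   # section-heading terms count 5x
    tg.increase_termpos()
    tg.index_text(body)             # body terms at normal weight
    db.add_document(doc)

db = xapian.WritableDatabase("web-index", xapian.DB_CREATE_OR_OPEN)
index_document(db, "Debian Social Contract",
               ["The Debian Free Software Guidelines (DFSG)"],
               "...body text...")
```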
Are we at time?