Vishesh has been doing amazing work turning the Nepomuk universe into something very practical, which is Baloo. So I'm curious to see what we get to learn about Baloo today. Thank you.

Alright, so I'm not actually going to be talking that much about Baloo, but more about file search in general and how different people approach it. Previously, during the KDE 4 era, we had Nepomuk, which was kind of a unique snowflake. It did all these different things, so you couldn't exactly compare it with other systems, because it was unique. But nowadays we're basically just trying to do file search. There's no more RDF, no more fancy stuff. It's just trying to get the basics done, which is where the Plasma world is also going: get it done. So I'm mostly going to be looking at how Microsoft and OS X, and maybe some other applications on Linux, do it, how we compare, and what we can learn from them. I've been researching all these other options quite a bit while designing Baloo.

Okay, so the basics of searching, just to get you started in case any of you don't know. You have a file. It has some words: "They all run funny. He jumps funny. He also tries to be funny." It has a URL. Simple enough. You take the words, you remove all the duplicates, you remove the punctuation marks and so on (actually you do a lot more, but I'm simplifying), and then you put them into a kind of hash map: "be", "funny", "jumps", and so on. Each word maps to the URLs of the files containing it. So when you need to look up any word, say "funny", you check the map and it gives you the URLs.

The first problem we come across is that URLs are long and they change, and you don't really want to store all the URLs. So, problem number one: you need a unique way to identify a file, one that can be stored efficiently instead of a really, really long string and that is resilient to changes. And that is kind of a hard problem.
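The hash map described above is an inverted index. A minimal sketch in Python, using made-up documents and URLs for illustration:

```python
# Minimal inverted index: tokenize each document, drop duplicates and
# punctuation, and map every word to the set of documents containing it.
# The documents and URLs here are invented for the demo.
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, strip punctuation, deduplicate.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """docs: {url: text} -> {word: set of urls}"""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

docs = {
    "file:///home/user/a.txt": "They all run funny. He jumps funny.",
    "file:///home/user/b.txt": "He also tries to be funny.",
}
index = build_index(docs)
# Looking up a word returns every file that contains it.
print(sorted(index["funny"]))
```

Real engines do far more (stemming, positions, ranking), but the word-to-URL map is the core, and the long URLs stored as values are exactly the problem raised next.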
It's a problem we aren't the only ones facing, because it isn't unique to the file indexing world. It's there whenever you write any application with a database that references files. KActivities has this problem. Any kind of music application has this problem; they need to do rescans regularly.

So, the way it's typically done: on Windows there's NTFS. Windows more or less supports only one file system, or at least NTFS is the de facto standard, and on NTFS every file has a unique identifier. They use that. It's a one-way mapping: you have the URL, and you can get the identifier. The other direction also exists on Windows, but it's slightly complex and only used internally, so applications don't get to use it. I don't understand why they won't expose it, but that's them. Their search system does use it in a way, and it's quite efficient. OS X has something similar. Earlier on they did expose it, but now they don't; they use a combination of a one-way mapping and a really kick-ass file system monitor which informs them regularly, so they can keep a two-way mapping consistent. The two-way mapping is only in Spotlight, but it works out for them.

And then we have Linux. It's complicated. What we used to do during the Nepomuk era was use unique identifiers: a really, really long randomly generated string used to identify every single file. This wasn't great, because we had to use two-way mappings. You had one map from unique identifiers to file URLs and another going the other way, and you needed to do the mapping both ways, always. You need two big tables, and those tables need to be kept up to date, which means monitoring the entire file system. When you rename a folder, you need to rename all of its children, and if you implement the table in a slightly naive way, you get lots and lots of CPU operations when you're doing a move. So it's slightly problematic.
During early Baloo development, when I was just starting out, we used integers, which is basically the same thing. But now what we're doing is using the file system itself: we're basically using the inode number and the device number. That gives us a more or less unique enough identifier for every single file, and we only have to store a one-way mapping, because we ask the file system to resolve the URL and give us the inode. It works out well. It doesn't work across multiple block devices, because then the device number is different and you have to do a lot of work, but for the most common use case it works out really well. The end result is that you can kind of use Baloo as a `find` replacement when you're searching for a file based on the inode number, because we actually have it. We're not quite at the level of the Windows world, but given our limitations, we're doing a decent job. And nowadays in Baloo we don't really store a flat table; we store a file-system-tree kind of structure, so doing updates is really, really fast. It's typically an atomic operation, because only that one folder needs to be changed.

Alright, second problem: initial indexing. You need to go through all your files and index them. This sucks. It can't be avoided. Everyone complains about it, no matter which system you're on. You can search the web for Windows search taking up all your CPU, and the same for OS X, Tracker, Recoll, Baloo, Nepomuk. It's there. Can't help it. You have to go through this. Different systems mitigate it better or worse. Windows actually has a lot of research into analyzing which processes are running, how much CPU is being used, how much of the hard disk is being used, and then throttling accordingly. They have great throttling mechanisms, so they really do decrease the CPU given to that process and so on. We don't have any of those things yet, but we do okay.
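The one-way identifier described above can be sketched with a plain `stat()` call; the file names here are just demo temporaries:

```python
# Sketch of the (device number, inode number) file identifier: the pair
# uniquely identifies a file on one running system, and the inode stays
# the same when the file is renamed within a single device.
import os
import tempfile

def file_id(path):
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old.txt")
    open(old, "w").close()
    before = file_id(old)
    new = os.path.join(d, "new.txt")
    os.rename(old, new)
    stable = (file_id(new) == before)

print(stable)  # True: the identifier survives the rename
```

This is why only the one-way URL-to-identifier direction needs storing: the kernel resolves paths to inodes for free, while the reverse lookup is what the index itself provides. As noted above, the pair stops being stable once files cross block devices.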
Once you've actually done the whole initial indexing, when you start up again you don't know whether files have changed. You kind of need to do a scan again, except a full file system scan is slow and will invalidate your entire inode cache, because the entire file system is being traversed. This is hard to do well. NTFS — you don't use it, but it's actually kind of an awesome file system, specifically in the way they've designed it for Windows. It has something called the USN Journal, which is a higher-level journal of every single file operation, covering the metadata. Any application can just record the last position it reached in the journal, and when it starts up it can say, "hey, give me all the records since this point", and it gets all the records of which files have moved and which files have been renamed. This is essentially what any backup application uses to track what's happened since the last backup, and it's what the Windows search system uses to get updates since its last run, and it works out really well for them.

In Linux we finally have a slightly modern, quite modern, file system: Btrfs. So I tried to emulate this with Btrfs. It's complicated. Btrfs has snapshots, so you can technically take a snapshot, and later on, when you want to know what changed, take another snapshot and do a diff. But you get the combination of both the data and the metadata, which is not what you want. People have been in talks with the Btrfs developers about getting only the metadata, so maybe we'll see this in the future, but it still feels like abusing a system instead of really integrating with it to get the things we require by default. It is what it is.

Indexing plugins. This is the simple part. Everyone seems to be doing really similar stuff.
Both OS X and Windows have a single plugin per file type: you have a list of MIME types, or some unique way of identifying a file type, and one plugin for each. We have multiple plugins per file type. Maybe that was a bad decision, maybe it was good; we won't know until a couple of years down the line when I start upgrading it, but it works out well. It's fairly simple: you open the file, you have a specialized reader for it, you get the metadata out, and it gets saved.

Windows does this really well. Windows has two separate indexing mechanisms. It has something called protocol handlers, which are responsible for getting the text out of a file, only the text, and a completely separate property handler system, which is for things like the artist of a file. They're completely independent, and the property handlers actually go both ways: they're for reading and writing. So theoretically you can just right-click on a file and modify a PDF's title, an image's tags, and so on. It's really well done, and you could expose this from an application point of view, except that they don't. So they have awesome technology, but something went wrong in the latter part. It also works out really well that every single extractor runs in a completely isolated environment, where the only access given is to the file's information, so there's a whole security model as well.

We have a KDE framework called KFileMetaData. It has a list of properties, and extractors take files and spew out those properties, so both the text extraction and the property extraction are combined, and it's only one way. The text extraction side is nearly the same, just in different languages. They have an extensible property system where people can define custom things; we just assume people will contribute upstream. It's fairly similar, and it works out.
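The dispatch model described above, one or more extractors registered per MIME type, can be sketched as follows. The registry and the two extractors are made up for illustration; real frameworks like KFileMetaData are far more elaborate:

```python
# Sketch of MIME-type-based extractor dispatch. Multiple extractors may
# register for the same MIME type, as Baloo allows; their results are
# merged into one property map.
import os
import tempfile

extractors = {}

def register(mimetype):
    def wrap(fn):
        extractors.setdefault(mimetype, []).append(fn)
        return fn
    return wrap

@register("text/plain")
def plain_text(path):
    # "Protocol handler" role: pull the raw text out of the file.
    with open(path, encoding="utf-8") as f:
        return {"text": f.read()}

@register("text/plain")
def line_count(path):
    # A second plugin for the same type, contributing a property.
    with open(path, encoding="utf-8") as f:
        return {"lineCount": len(f.readlines())}

def extract(path, mimetype):
    result = {}
    for fn in extractors.get(mimetype, []):
        result.update(fn(path))
    return result

# Tiny demo on a temporary file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello\nworld\n")
    demo_path = f.name
meta = extract(demo_path, "text/plain")
os.unlink(demo_path)
print(meta["lineCount"])  # 2
```

The single-plugin-per-type model on Windows and OS X corresponds to each list here having exactly one entry.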
Hopefully in the future — I have plans for making KFileMetaData more of an output system as well — maybe we'll get property handlers too, and then finally we'll be able to modify properties properly. It's doable.

Next problem: file monitoring. I'll talk about this a little bit. Windows has a really nice file monitoring API, since they've had search for quite a while now: you can say, "I want to monitor this folder, give me all the events, and I want it done recursively". OS X has the exact same thing. Actually, OS X is even better. OS X has their own version of a file alteration monitor, something like the FAM we used to have, except everything goes through Spotlight. Spotlight has special privileged access to the events, and then there's another special database which applications can use to get information about which files have moved, similar to what NTFS does with the USN Journal, except they do it at the user-space level rather than the file-system level. But it works out for them.

We have inotify. I could really rant to you about inotify. Recently — as in six months ago, or maybe a year — someone actually submitted a patch on the kernel mailing list titled "things I wish I knew about inotify" and added that to the documentation. It's hilarious in a really sad way. But still, we have inotify, and it works, kind of. The problem is it doesn't do recursive notifications, so you need to iterate through the entire file system tree and install a watch per folder. Each watch takes approximately a kilobyte of kernel memory, so when you have 50 to 100 thousand folders, you get on the order of a hundred megabytes of kernel memory being used just to maintain the watches, non-swappable. Your RAM is getting used up. It's not perfect, and then you also have the mapping on the user side, so it gets messy. But this is the best thing we have, and everyone uses it.
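The cost math above can be sketched directly: one watch per directory, at roughly a kilobyte each. The 1 KB per-watch figure is the approximation from the talk, not an exact kernel number:

```python
# Estimate the kernel memory an inotify-based recursive watch would cost
# for a directory tree: one watch per folder, ~1 KB of non-swappable
# kernel memory per watch (approximate figure).
import os

WATCH_BYTES = 1024  # rough kernel memory per inotify watch

def watch_cost(root):
    # One watch per directory, including the root itself.
    dirs = 1 + sum(len(dirnames) for _, dirnames, _ in os.walk(root))
    return dirs, dirs * WATCH_BYTES

# 100,000 folders at ~1 KB each is on the order of 100 MB:
print(100_000 * WATCH_BYTES / 2**20)  # ≈ 97.7 MB
```

This also shows the other half of the pain: `os.walk` here is exactly the full-tree traversal a file indexer must do at startup just to install the watches, which is the disk-thrashing problem mentioned again in the Q&A.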
We clearly don't have a real focus from the kernel side on file system notifications for the purposes of file indexing, but that's okay. That's how it works: if someone needs it, they have to add it. We have something cooler called fanotify, which was added a couple of years ago, seemingly for the sole purpose of virus scanners. It's good: you get global notifications, you can say what you don't want to monitor, it's recursive and it works. But it doesn't have file move notifications. It doesn't give us everything, but it's still usable for us, because as long as you have a unique identifier for a file, if you do somewhat periodic checks you can figure out which file moved where. We've experimented with that, though it's working around a particular system rather than really integrating with it. It also requires root access, but I've been trying to change that. Maybe it will go into the kernel, maybe not.

All right, two interesting parts. First, how you actually store the index. Windows uses some kind of SQL store. It runs on every single system and is exposed as a table called SystemIndex, and that table can actually be shared across your workgroup. So when you're connecting to a network, Windows lets you choose how you want to share things, and if it's a private network with certain settings, the entire SystemIndex is exported, so you can run SQL queries against another machine's SystemIndex by giving the machine name. Useful for distributed searching, but they don't really expose it in a UI. So again: cool technology, but not promoted in a perfect way. They store all the text there, and they store all the properties separately, so you can actually build fairly complex applications dealing with properties. And they do quite a bit to promote distributed searching.
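The "periodic checks with a stable identifier" workaround mentioned above can be sketched like this: remember each file's (device, inode) pair, and when a later scan finds the same identity at a different path, the file was moved. The paths are demo temporaries, and this is a simplification of what an indexer would actually do:

```python
# Detect file moves without move notifications: compare two scans keyed
# by the stable (device, inode) identifier rather than by path.
import os
import tempfile

def scan(root):
    """Map (device, inode) -> path for every file under root."""
    ids = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            ids[(st.st_dev, st.st_ino)] = path
    return ids

def detect_moves(before, after):
    """Same identity, different path: the file was moved or renamed."""
    return [(before[k], after[k])
            for k in before.keys() & after.keys()
            if before[k] != after[k]]

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()
    first = scan(d)
    os.rename(os.path.join(d, "a.txt"), os.path.join(d, "b.txt"))
    moves = detect_moves(first, scan(d))

print(moves)  # one entry: .../a.txt moved to .../b.txt
```

A notification API like fanotify tells you *when* to rescan; the identifier tells you *what* actually happened.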
They have the OpenSearch protocol, which is used by web browsers and so on, and they support that as well. So hypothetically, if Linux supported this protocol, you could search it through Windows Explorer from another machine. They're trying, and it's impressive.

OS X has a C-based full text indexing library called SearchKit, which is also used for indexing their mail and in a few other places. It's purely for full text indexing: that initial image I showed you, where you build a table and do the tokenizing, is basically all it does. It's not a full relational store, so doing property comparisons and range lookups is slower and harder, but that's not the common use case. The most common use case is word lookups, it's built specifically for that, and it's blazingly fast. There wasn't much information available about SearchKit's implementation, but trying to reverse engineer it, it seems straightforward. It's used across a lot of places, they promote using it in other applications as well, and it has special handling for files: it's not only a full text indexing engine, it also has a way to store file URLs specially. So it does quite a good job.

Then we have Baloo. We used to use something called Xapian, which is one of the more popular full text indexing solutions on Linux. It's no Lucene, which is the de facto standard today, but Lucene is Java-based, and then you have clones of Lucene in different languages, and then you have Xapian, which is a pure C++ implementation. It works really well, and the first couple of versions of Baloo used it. We had problems with it, as I typically end up having when exploring different things, and it wasn't as resilient to corruption as I would have liked. With software designed for the server, I find the attitude is more: if it gets corrupted, someone will go in and fix it. We can't really do that with users.
Plus, we had quite a host of other problems with it, so a couple of months back I decided to stop using Xapian, and we basically wrote our own full text indexing solution. It's actually much smaller than Xapian, because when you're writing a general full text indexing solution you need all kinds of priorities: you need a way to say this term is important and this one is not, and to develop scoring and relevance ranking, and those are really complicated probabilistic algorithms which rank every single document. Google has the same thing with the whole PageRank idea. But when you're doing search on the desktop, it's not really the same case. You aren't really searching based on relevance that much; you're searching based on filtering, and then the relevance is typically "when did I last modify this". So if you have your use case very well defined, it's easier to target the problem and build a customized solution for it. It's more brittle, more specialized, and less adaptable, but I don't really think these requirements are going to change that much. So it works out for us really well. It's part of Baloo and doesn't have a separate name; it's just an internal engine we're using. Hopefully in the future we'll export it as a really lightweight Qt-based full text indexing engine, and maybe other people can use it. There are plans for Akonadi to use it as well. It's built on top of a really cool key-value store called LMDB, which memory-maps the entire file, so we're always using the kernel's caches instead of having any caches at the user level. So it's really, really blazingly fast for searches and lookups, and it's cross-process: it doesn't matter if the process has crashed or whatever, the caches are still valid.
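To make the "much smaller engine" concrete: the core data a full text engine stores per term is a posting list of document ids, and a standard way to keep it compact is delta-plus-varint encoding. This sketch illustrates that general technique; it is not Baloo's actual on-disk format:

```python
# Compact posting-list storage: sort the document ids, store the gaps
# between consecutive ids, and encode each gap as a variable-length
# integer (7 data bits per byte, high bit = "more bytes follow").
def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_postings(doc_ids):
    out = bytearray()
    prev = 0
    for d in sorted(doc_ids):
        out += encode_varint(d - prev)
        prev = d
    return bytes(out)

def decode_postings(data):
    ids, cur, shift, acc = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            cur += acc
            ids.append(cur)
            acc, shift = 0, 0
    return ids

postings = [3, 1_000_000, 1_000_007, 42]
blob = encode_postings(postings)
print(decode_postings(blob))  # [3, 42, 1000000, 1000007]
```

Here four ids that would take 16 bytes as raw 32-bit integers fit in 6, because nearby ids produce tiny gaps. A key-value store like LMDB then just maps each term to such a blob.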
So when you start doing searches in KRunner, it really works out well because of the caches. Even without the caches being populated, we typically get a latency of about 10 milliseconds per search, and that's really fast.

Then we have presentation: instead of going into the internals, how you actually present all the data. Windows does an okay job, but there's scope for improvement. They do okay, but they're not these guys, and that's understandable. OS X really, really works on the presentation layer. So while Windows actually has the more superior backend technology — they've put a lot of effort into indexing stuff quietly, and their exposed APIs and documentation and so on are superb — on the presentation front it's not as glamorous, even though they have the technology to make it so. OS X, meanwhile, is using a smaller C-based full text indexing engine, and their APIs and whole framework aren't that elaborate, but they display a really, really nice UI, and it works out.

In KDE I've done something similar, so you can clearly see a similarity. It wasn't completely intentional, but it kind of was. So we look good, kind of, and the plus point is that Spotlight no longer really looks like this, so we're doing well. Plus, since the whole QML revolution, we've really separated the outer layers, so hopefully in future versions you'll have different UIs for KRunner, maybe even something similar to the newer Spotlight. We do have custom UIs in Plasma which aren't yet exposed, so for now you get the old KRunner UI or the new one. You know, we're all about customizability, so it's going to get better. We have a lot of work to do in terms of searching within a file browser. OS X does that really well, the whole Finder thing with allowing people to save custom searches, and Windows does a decent job as well.
We have scope to improve, but that's always the case with us, and we'll get there. And that's mostly it. Microsoft has great backend technology, but they aren't really promoting it as a unified product. OS X really markets Spotlight a lot: they have it as a separate product, they give it a lot of media attention, so people have a catchy brand name and will say, "oh, I love Spotlight, it's the most amazing thing". I think this is going to change in the future. Microsoft is really promoting something called Cortana. If you don't know what Cortana is: there was a Halo game which was apparently really popular, and there was an AI in it, so she was put on the desktop, and she's now something akin to Siri, but on the desktop, which is kind of unique. They're really promoting that for Windows 10, so the whole desktop file search is becoming part of the Cortana thing. OS X is slowly doing this too, but on the desktop they're still using the Spotlight brand, while on mobile they have Siri — and file searching typically isn't that relevant on mobile. And on Linux, well, you can draw your own conclusions about how we're doing; you all use it, so I don't really need to go there. And that's it. Questions?

Q: Thank you. How different is the API of the new engine from the Xapian-based one, and how much work would it be for us to switch to this new thing?

A: About a day, maybe, of hacking. Less, actually; maybe we could do it today.
But the thing is, we'd already started replacing large parts of Xapian because it wasn't doing what we wanted. That's why we created a Xapian Qt library in which we reimplemented large parts, and all of that went into the new engine. So we're already halfway there; it's going to be trivial.

Q: [inaudible, about the kernel notification APIs]

A: They're aware of it, and there have been discussions on the kernel mailing list, but it's typically of the form "scratch your own itch, provide patches, and we'll take it". I've been trying to get more involved in that, but it's not really my forte, and understandably — in the open source world we typically can't expect other people to do the work. Either we hire someone or we do it ourselves; that's typically how it works. I do know they're opposed to having a fourth file system notification API: there's dnotify, inotify, and fanotify, so when I proposed adding a new one, they said explicitly, please just modify fanotify to what you need, let's not have a fourth one. I have some patches on fanotify which are kind of working, but kernel development is hard for me. Maybe we'll see the situation improve, and the moment it does, it improves for everyone, because Tracker essentially has the same problems. This is one of the longest-standing bugs from Nepomuk: that we thrash the disk on startup installing watches. [inaudible] It's there in that list.

Q: Okay, one more question. You mentioned that with inotify you have to watch every single folder. [inaudible]
A: Okay, so during the Nepomuk days we used to monitor every single thing. The virtual file system handling is slightly strange — I don't understand all of it; I can show you the code, and it looks strange, but I don't get all of it.

[Host] I think we actually still have time — two minutes and 14 seconds.

A: So yeah, when we started, we hit all these problems with inotify. This has been there for nine years; it's just hard to solve, and I completely understand that it's hard to fix this in the kernel. We need kernel support for it; it's not something we can fix ourselves. If somebody comes along and fixes that, it will be awesome.

Q: Another question about your identifiers. I was wondering whether you could also use checksums as identifiers. Many systems use them as unique identifiers, and they give you a very solid identifier. Have you considered that?
A: I actually hacked on this, and naively shipped it, during the Nepomuk days. I ended up enabling a SHA-1 analyzer on every single file and trying to use the generated hash as a unique identifier for the file. The disk usage blows up, because then you really need to read every single part of the file, and right now, even for full text indexing, we don't need to read every part of a file — especially for video files, we just go to the metadata and pull it out. And the moment the file changes, the checksum breaks, and files do change. Plus, I really think this is not the job of userland. Btrfs does a great job of identifying which blocks are common, and that's cheap for Btrfs because it's a copy-on-write file system, so it actually only stores multiple copies of a file once. I really think this belongs in the file system layer; userland can work around it, but we're trying to do file search, and it's better to just focus on that and do it well. I don't really see this in scope. Plus files change, so it really didn't work out for us.

[Host] Thank you for your work on file search.

A: One last thing: I'm still around here. Some people find this a dry topic, but I've clearly been working on it for a very long time, so come talk to me.