Now, with just a small delay, I welcome you (sorry, I can't pronounce your name) Nukubi-san, who will tell us a bit about full-text search engines in Debian, especially when using them with Chinese, Japanese and Korean characters, which might be quite interesting for us Europeans.

First, I apologize: I failed to update my presentation, so it is almost the same as the one from the Debian mini-conference about three months ago, although some of the status information has been brought up to date. The presentation contains some Chinese characters, but the machine I am using now cannot display them, so I will switch machines later in my talk. My presentation is focused on search engines, that is, search engine implementations in Debian. There are many such programs, and they use different algorithms, so I will try to describe those algorithms and their characteristic differences. By the way, I am Japanese, so for me the most important issue is handling Japanese characters, and the issues are very similar across the Asian languages.
So I will talk about these language issues, then I will introduce some packages in Debian, and finally I will talk about a related topic, documentation in Debian. Those are the things I will cover in this session.

First, the algorithms. There are three major algorithms: the first is the inverted index, the second is the N-gram, and the third is the suffix array.

The inverted index is the most popular method, and it is very easy to describe. In this example there are three text files: file A contains the phrase "search engine in Debian", file B contains "there is a steam engine", and file C contains "Debian social contract" plus the word "search". In this case the word "search" is contained in file A and file C, and the word "engine" is contained in file A and file B. The inverted index is a dictionary-based index: for each word, the index records the pair of the word and the files containing it, so it is easy to find which files contain a given word, and it is very easy to implement.

But it has some problems. Latin-script languages usually separate words with spaces or something similar, but some Asian languages do not, so in that situation we need to split the text into words by some other method. Another problem is how to match a part of a word. For example, if a person wants to find all words containing "sea", the search engine needs to scan the full word list for entries containing "sea". That has some cost, though it is not impossible. Another problem is how to match a phrase. It is not so hard, but the index needs to record the position of each word: for example, that the word "search" appears in file A at the first position, and so on. Then, when someone searches for a phrase, the engine can check that the words appear in consecutive positions.

Next I will describe the N-gram. N can be any number, but usually it is a 2-gram, which is also called a bigram.
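The positional inverted index just described can be sketched in a few lines of Python. This is a toy illustration of the idea, using the three-file example from the talk; the function names are my own, not from any particular engine.

```python
from collections import defaultdict

def build_index(files):
    """Map each word to {file_name: [positions]}, so that both
    word lookup and phrase matching are possible."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][name].append(pos)
    return index

def phrase_search(index, phrase):
    """Return the files where the words of `phrase` occur at
    consecutive positions, using the recorded positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    result = set()
    for name, positions in index.get(words[0], {}).items():
        for p in positions:
            if all(p + i in index.get(w, {}).get(name, [])
                   for i, w in enumerate(words[1:], 1)):
                result.add(name)
                break
    return result

files = {
    "A": "search engine in Debian",
    "B": "there is a steam engine",
    "C": "Debian social contract search",
}
index = build_index(files)
```

With this index, `index["search"]` lists files A and C, `index["engine"]` lists A and B, and `phrase_search(index, "search engine")` returns only A, because only there do the two words sit side by side.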
So I will describe the bigram with an example. Take the word "debian". Every two consecutive characters are recorded in the index: "de", "eb", "bi", "ia" and "an". Since every part of the word is recorded this way, the index can match any part of any word. For example, if someone wants to search for "bia", the engine checks "bi" and "ia", and if they appear at consecutive positions the match is correct, and phrases are covered the same way. So the bigram has no missed matches; a word-based inverted index can miss some matches, but the N-gram does not.

However, it has some problems. First, it requires a lot of computation, because every phrase is separated into two-letter units, so the engine has to check all of them, and that costs a lot. It also causes too many matches. For example, someone wants to search for "sea". With an inverted index, "search" is recorded as one word and "sea" is recorded as another, but with an N-gram both are recorded through the same bigrams "se" and "ea", so the engine will find both "sea" and "search" and return both, which causes a problem.

The third algorithm is the suffix array. Here the index is recorded as pointers into the full file. The example is a file containing the word "debian". The suffix array treats the data as semi-infinite strings: "debian", "ebian", "bian", "ian", "an" and "n". These strings are then sorted in alphabetical order, so the index data comes out in the order "an", "bian", "debian", "ebian", "ian", "n". The index data itself is just pointers, usually recorded as the starting byte of each suffix, so the index is usually several times the size of the original file (four times, with four-byte pointers).
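Both the bigram index and the suffix array can be sketched briefly in Python. Again these are toy versions of the algorithms described above, with invented helper names, not code from any of the packages discussed later.

```python
def bigrams(text):
    """All consecutive two-character pieces of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bigram_match(text, query):
    """True if the bigrams of `query` occur at consecutive positions
    in `text`, i.e. `query` is a substring. No misses, but note that
    "sea" also hits "search": the over-matching problem."""
    positions = {}
    for i, bg in enumerate(bigrams(text)):
        positions.setdefault(bg, []).append(i)
    return any(
        all(p + i in positions.get(bg, [])
            for i, bg in enumerate(bigrams(query)))
        for p in positions.get(query[:2], [])
    )

def suffix_array(text):
    """Starting positions of every suffix, sorted alphabetically.
    Real implementations store these as 4-byte pointers, which is
    where the roughly fourfold index size comes from."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_contains(text, sa, query):
    """Binary search over the suffix array. Every comparison goes
    back to the original text, which is why the original source
    must be kept around."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query
```

For "debian", `bigrams` yields exactly the five pairs above, and `suffix_array("debian")` yields `[4, 2, 0, 1, 3, 5]`, the pointers to "an", "bian", "debian", "ebian", "ian", "n" in sorted order.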
That index is really huge. The advantages are almost the same as the N-gram's, and the disadvantages are also the same, but another characteristic is that the algorithm needs to keep the original source around: searching is usually done by binary search, and the index records only pointers, so the engine has to go back to the original source to compare. That is very expensive. About indexing: this example is very small, but usually the target text is very large, so the semi-infinite strings become very long, and sorting them has a very high computation cost.

Then I will discuss the Asian language issues. As I said about the inverted index, it usually needs each word segmented, but Asian languages are usually not separated by spaces, so you need a segmentation process. Some languages, like Japanese, also have many word-suffix variants, and they also have many kinds of encodings.

This slide contains Japanese, but it cannot be displayed on this machine. The sentence has no spaces between the Japanese characters; it splits into about four parts, and there are three Japanese words in it.

There are several methods for segmentation. The easiest way is to use KAKASI. It is dictionary-based software: when a substring matches the dictionary it is split off, and the longest matching word is preferred. Another way is morphological analysis. This method is based on natural language processing: words are categorized as nouns, verbs and so on, and these units are called morphemes. This method is also dictionary-based, but it also evaluates Japanese syntax, so it usually segments more accurately, following many grammar rules.

There are analyzer implementations; two major ones exist in open source. The first is ChaSen, and the second is MeCab. ChaSen only supports Japanese, but MeCab is designed for many languages.
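The longest-match dictionary segmentation that KAKASI performs can be illustrated with a toy greedy segmenter in Python. The dictionary here is invented for the example (Latin words standing in for Japanese ones), and real analyzers such as ChaSen and MeCab add grammar rules and statistics on top of the dictionary.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation: at each position, split off
    the longest dictionary word that matches; if nothing matches,
    emit a single character and move on."""
    words = []
    i = 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1)
             if text[i:j] in dictionary),
            text[i],  # unknown character stands alone
        )
        words.append(match)
        i += len(match)
    return words

# Toy dictionary standing in for a Japanese word list.
dictionary = {"full", "text", "fulltext", "search", "engine"}
```

The longest match wins: `segment("fulltextsearch", dictionary)` returns `["fulltext", "search"]` rather than `["full", "text", "search"]`, which is exactly the "longest word preferred" behavior described above.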
ChaSen has the Japanese syntax rules built in, but MeCab is simpler, so it could be usable for other languages; however, it only has a Japanese dictionary, so currently it supports only Japanese in practice.

There are also many Japanese word variants. Japanese verbs have very many forms. It looks like stemming in English, where the base "move" has the variants "moves", "moved" and "moving". There are similar things in Japanese, but it is really complex. OK, now I can display the Japanese presentation; please wait a minute (usually I use Windows for this). Yes, here is a Japanese word as a suffix example: "ugoku", which means the same as "move"; this is the base form. And these are variants: "ugokanai" means "does not move", and "ugokimasu" is the polite form of "move". There are very many such variants, and they are very hard to handle.

Another issue is character encodings. An encoding is a mapping between a letter and an identifying code. Most people use ASCII: ASCII maps the capital letter "A" to hex 41; the details are described in the ascii man page (section 5 on some systems). ASCII is very simple, but the Japanese character set is very large. Here is an example of Asian character encodings. In this example Chinese and Korean each show only one encoding, but there are actually many kinds of encodings, and Japanese usually uses three types. The first is ISO-2022-JP, which is usually used in email. The second is EUC-JP; EUC means Extended Unix Code. The third is Shift_JIS, and Shift_JIS itself has many variants, a Windows style, a Mac OS style and so on, which causes very many problems. So search engines usually need to convert everything into one encoding; however, the encodings do not all cover the same character sets.
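These encoding points can be demonstrated with Python's built-in codecs; this is a minimal sketch, where the codec names are Python's standard ones and the byte values are whatever the codecs produce, not hand-picked.

```python
word = "検索"  # "search" in Japanese

# The same two characters get different byte representations in each
# of the three common Japanese encodings, so an engine must convert
# everything to one encoding before indexing.
encoded = {enc: word.encode(enc)
           for enc in ("iso2022_jp", "euc_jp", "shift_jis")}
assert len(set(encoded.values())) == 3                       # three distinct byte strings
assert all(b.decode(e) == word for e, b in encoded.items())  # lossless for these characters

# Detection is ambiguous: the EUC family shares one byte layout, so
# valid EUC-JP bytes usually decode as EUC-KR as well, just as
# different text.
hiragana_a = "あ".encode("euc_jp")
assert hiragana_a.decode("euc_kr") != "あ"  # decodes, but to another letter
```

The last assertion is the statistical-detection problem in miniature: the bytes alone do not say which EUC variant, and therefore which language, they are.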
For example, a character may exist in encoding A but not in encoding B; there are problems like that. Encoding detection is possible, but not perfect, because in some cases the encoding can only be classified statistically. The EUC variants all use almost the same encoding method, so the language cannot be classified from the bytes: EUC-CN, EUC-JP and EUC-KR all look almost the same.

Unicode also has an issue for Asian people, because some distinct letters were unified into the same code point. Such characters usually have different shapes, so Asian people consider them different and usually want them mapped to different codes, but they have been unified in Unicode.

Now I will describe some search engine implementations in Debian. ht://Dig is very popular for ASCII or single-byte encodings. It is based on an inverted index. It can handle local files only, so remote files over HTTP or FTP or the like cannot be handled. It is very famous and has been developed since 1995.

The second one is mnoGoSearch. It is also based on an inverted index, and it too can handle only local files. It supports several kinds of back-end databases, including MySQL and PostgreSQL. It also supports internationalization, but its Japanese support requires ChaSen. A Chinese segmentation module is included, so among these packages, Chinese characters can be indexed by this software only.

The third one is Swish++. It is also based on an inverted index, and it supports various formats: HTML, MP3, text and so on. The MP3 support is based on ID3 tags. It supports UTF-8; however, its indexer expects words separated by spaces, so it is not usable for Asian languages.

The next one is Sary; it is based on a suffix array. Because of the algorithm's limitations it can only handle a single file. It supports two encodings, UTF-8 and EUC-JP, but it could probably also be used for EUC-CN or EUC-KR. There are language bindings for Ruby, Perl and Python.

And then Namazu.
Namazu is also based on an inverted index, but it supports English and Japanese, and it also supports various formats, like Swish++ does. The difference is that its format support is plugin-based, so it is easy to add support for another format.

Estraier supports two algorithms: the inverted index and the unigram. Unigram means 1-gram: earlier I described the bigram, and the unigram simply splits the text one character at a time. The upstream author of Estraier now develops Hyper Estraier, the successor of that software. It supports the bigram, and it is also already packaged in Debian.

This next part was planned as a separate session, but I will say something about it here. There are some documentation systems in Debian. dwww is the basic Debian documentation system. It supports HTML, GNU info and man pages, and it works with Swish++, so it can do full-text search, but only for English. dhelp is almost the same kind of software, but it supports only files under /usr/share/doc. It also works with Swish++. Both are very useful, but internationalization was not considered, so they are not good for people using Asian or other non-English languages.

So consider what is wrong. There are many files in the documentation directories, but there seems to be no policy on their encodings, so if someone wants to build a search index, they need to detect the language and encoding of every file, which is very hard. Info has the same problem: GNU info originally supported only ASCII characters; some files are written in Japanese encodings, but different files use different encodings. Man pages are in a better situation, because multiple languages are already supported there, so for man pages it is easier to build indexes for several languages.
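The per-file guessing that this situation forces on an indexer can be sketched as a naive trial-decode heuristic in Python. This is not what dwww or any Debian tool actually does; it is just an illustration of why the lack of an encoding policy is painful.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "euc_jp", "shift_jis")):
    """Return the first candidate encoding that decodes `data`
    without error, or None. Trial decoding is unreliable: many byte
    strings are valid in several encodings at once, which is why a
    documented per-file policy would beat per-file guessing."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

For plain English text this returns "ascii", but for Japanese bytes the answer can depend on the candidate order, which is exactly the unreliability being complained about here.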
So I think we need an internationalization policy for documentation. I am considering some methods. It would be good to fix the language in filenames; Apache already supports such a scheme, so that would be useful. Also, dwww and dhelp only consider Swish++, so I think we need a wrapper around the indexing and searching software to handle other languages. I am still thinking about these issues, and I would like some people to join me, if anyone is interested.

In the future, we need a more integrated documentation system in Debian, something like the desktop search systems. I think Beagle looks good for that situation, and it is also good for personal use. dwww has the issues I mentioned, and its development has already stopped, so we need other software. So I am trying to make an Apache module with Hyper Estraier; it could be useful for searching through the web server.

GNOME already has Beagle, but it is implemented in Mono, and it currently does not support Asian languages; however, it has the capability to handle other languages, so it could be useful for this purpose. KDE does not have such software yet, but they have announced plans to implement it in KDE 4 for personal use.

I am collecting the requirements: internationalization, support for many file formats, and a user-friendly front end. Offline search is already supported with ht://Dig and Namazu, so they could be useful, but their development has already finished. Namazu also supports some other index formats, so it could also be useful. However, some programs, like the big web browsers, have their own help systems, and those cannot be covered by this approach.

In this session I talked about the search software packaged in Debian and its implementations, I talked about some internationalization issues, and I argued that we need documentation
internationalization. So, sorry, I am not familiar with English, so it was probably hard to hear me. Anyway, the slides are finished, so if someone has a question, please ask me.

Q: What about issues with the languages of the Middle East, Arabic, or other languages such as Hindi? How are they served by such engines?

A: I have not researched those languages, but as far as I know they have almost similar problems, like word segmentation. Solving this requires knowledge of those languages, so I could not do it myself, but there are researchers, in Japan as well, who could solve this problem. If you want to try, there is the m17n library. It is multilingual text-processing software, so it supports many languages, including languages like those; I think it could solve such problems.

Q: My question was: you've gone through most of the search engines. With Beagle, are you looking at including documentation or file-format searching inside of that? Because it already handles most of those, like the emails and the messages. Which one are you particularly focused on improving for Debian?

A: At first I considered the documentation files in packages, and then I considered spreading out to personal user data. However, personal data is already being considered by the normal GNOME and KDE desktop efforts.

Q: The other question I had is: since we already have the package, we can already create the index files for searching purposes. We could have the index file as part of the package; you drop it in a location, and rather than having something like locate run every night on a cron job, once you install the package you already have the index file built in. Do any of the search engines here have the ability to search multiple files, or merge multiple index files into one large search-engine system? So in the Debian package, for example, the how-tos might come with an index already pre-built.
So there might be a dh_search_engine_index or something, you know. Is that something you considered using?

A: That should be good, I think. But some index file formats are in host byte order, so it could be hard to use them on another architecture.

Q: You also have a size issue, Joe? And, you know, we've already sort of had a history of thinking about splitting documentation in general into separate packages, so this could probably come down on either side of that. But one of the changes that dpkg is supposed to have in the future is the ability to say: anything installed under /usr/share/doc, throw it away, I'm not interested in it. So I kind of think that the size issue would not be such a problem in the future.

Q: It still addresses media size and download times. Gee, you're from Australia, aren't you? Yeah. Aren't you the guys who wanted to do that? My point is these all just have to be balanced.

Q: I agree, because to me, the greatest use of this kind of search technology is personal search, and I think we don't have a very well coordinated solution for that right now. I think some of the things that the GNOME and KDE folks are looking at are good steps in that direction, but until I have a good, easy way to find things in the 40-ish gigabytes of my email archive, and to be able to do it as, you know, something I can carry around with me, I'm not happy. Yeah. Got a ways to go. Maybe solved by Google. Yeah.

Thank you very much.