 Thank you, York for the introduction and thanks to everyone for coming here for the first talk in the morning so So I'll be talking about new implementation for the spell utility that I have been working on for about one year so a Bit of an introduction about myself as York said I work About five years back for net BST in the Google summer of code program and So I worked on writing a new implementation of the apropos utility which had a full text search feature and And I got the net BST commit bit about last one year one year back and So this project started about at the same time. So I'll be talking about this So these are the five major topics. I would like to cover today. So I'll briefly talk about what are the shortcomings in the Current spell implementation that is there in the BSTs So from now onwards, I would like to call this the old spell implementation and what I have been working on I'll call that as a new implementation So based on the shortcomings I have come up with the set of features or requirements for that a modern spell implementation should have I'll talk about the Some implementation details that I have done in this project I'll compare the performance of some of the major in spell implementations and With the this implementation and I'll show one small demo of integration with other Shell utilities of how to use spells spell check so This idea for a new spell checker came to me when I was working on the Google summer for code project about five years back But at that time I realized that writing a spell checker is not trivial and I could not do it in the summer of code And I dropped the idea at that point of time and about two or three years back. David Holland Decided I created a bug that we should just remove the spell utility from net BST because It just doesn't work. So he gave cited several examples of some invalid spellings and spell will just Never complain that those are incorrect words So At that point of time I volunteered that and I said that I would like to work on this and I finally water on and I started working on this last year So I'll talk about some of the shortcomings that spell has so it's a very old implementation As per Wikipedia It was created around Unix version 7 from AT&T. So Douglas McClroy Created this implementation spell. He's also credited with creation of I think diff sort Tr Join and some of those utilities. He also created this spell implementation So the problem with this implementation is that it Uses to to decide whether a word is a correct spelling or not. It uses a set of rules called Infection rules. So essentially these rules are So when looking at it at a word if it it will first check whether this word exists as it is in the dictionary or not if it doesn't then it will It has a list of prefixes and suffixes So it will remove if it it will check that whether this word consists of any of these prefixes or not If it does it will remove all those prefixes Similarly, it will check for all the suffixes if it has any suffixes it will remove those and Finally, whatever is left it will check whether that word is in the dictionary or not So as it happens if you saw that So these are all popular suffixes. So fragment Fragmental frog wood if you add any suffix to a word Spell will think it's a valid word because if you remove the suffix, whatever he means is also a valid word So while It's a handy trick to do but It's quite inaccurate and another major drawback is that all these rules are strictly applied to the English language So you probably will not have all these prefixes and suffixes of being useful in other languages So if you want to use the same spell implementation with a different dictionary it will probably not work unless the language is English or English like and Another drawback that I consider missing feature is that it does not give spelling correction So it just checks whether this is correct or not, but it will not say what should be the correct spelling and Finally it lacks a library interface. So for example, when I was working on apropos and I wanted to integrate a spell checking functionality, but Even though there's a spell checker I could not use it because it doesn't have any library interface that I could just call some API's and do it so I would have to end up implementing the whole spell checking interface myself So based on these Shortcomings I have come up with these set of requirements so Just apart from doing a spell check it should also do spelling suggestions So it should come up with some possible correct spellings for the misspelled word It should also use algorithms which are not strictly tied to the English language So for example those inflection rules will not work with other languages. So if if we can come up with algorithms Which don't use such kind of rules that will be good So we can use the same implementation with a different language dictionary and it should work And it would also be nice to have a library interface So other applications for example the shell or apropos or similar Applications could also easily hook up and do spell checks So as part of this project These are the four major things that I've done so I've created a bigger dictionary because the existing dictionary that is there in net VST that is user shared dick words It's it doesn't have all the words. So I worked on in expanding it and then I Worked on writing a new spell implementation using some of these algorithms the edit distance double metaphone and that tries I compared the performance against some of the Other open source spell implementation such as a spell I spell and hunts spell and I Also tried to integrate with other for example with with the shell to see how easy it is to integrate so the existing dictionary that is there in the BST is that is the user shared dick words it is the Merriam Webster's dictionary, but the problem is that it does not have all the What forms so for example if there is a verb it will not have it's it's past and Other work forms or if there is a noun it will not have the plural forms. So That's why the oldest spell implementation use those rules because it could use those rules And even though dictionary doesn't have all those possible words it can remove the prefix and suffix it can still figure out If it is a correct spelling or not, but if you want to get rid of those rules We need to have the full dictionary with all the possible word forms so I expanded the dictionary and So this is the difference that I have currently so the old dictionary had around 235,000 words and the size was about 2.5 megabytes and The expanded dictionary has around 421,000 words and the size is around 4.5 megabytes now talking about the spell checking Issue so when we want to do spell checking there are two major kinds of Spell checking errors that can be made So one is a non-word error. So for example if you if use misspell Applied as Apple. So if you mean if you miss the I then it becomes Apple and Apple is not a dictionary word So that is not a real word. So it's a non-word error. We can easily figure this out and Then there are real word errors for example if you want to spell desert but if you add an extra s and becomes again desert but that's also a word if you want to write there But you exchange R and E it becomes 3 that is also a word if you want to write peace You use a different spelling that is also a word These are much harder to figure out. So how do we handle real word errors? It's a much harder problem we because all of these are actual real real word words So we can't just look up the dictionary and say this is not a correct spelling. So We can use Bigrams or trigrams to figure out based on the surrounding context like for example if the word is I'll be there as soon as possible But okay soon is misspelled But and now there are two possible corrections to that Sun and soon but Based on the pattern we know that as soon is much more likely to be there as compared to as Sun So Based on that based on these kind of heuristics We can figure out. Okay, this is not a correctly spelled word and we should replace it with this word but this is much expensive to do because if you want to do this you have to Scan the every word because you cannot just say you cannot depend on Just Doing a dictionary lookup you have to scan every consecutive pair of words in the file and check if This is a likely Bigram or not So this is a much harder problem and it has a performance impact. So I did not work on this during this project But I have worked on handling the non-word problem, which is a much simpler problem to start with So these are much simpler to detect You just scan through all the words in the file and you check the dictionary if it is not there in the dictionary it is most probably a spelling error and If you use an expanded dictionary, we don't need those inflection rules So how do we represent a dictionary in the memory? So the oldest spell implementation used a nice trick. So it used to M-map the whole file whole whole dictionary into the memory and it would do a binary search. So It would just whatever word is there it will do a binary search and figure out if it is there in the file or not Although it works and it's quite it doesn't require a lot of memory, but I believe I Did not use this. So my first instinct was that if I if I want to expose this as a library interface and If an application wants to do a lot of spell checks then Going through a file like that memory map file might not be Good point to do so I decided to read the whole dictionary into the memory Although I have not compared a performance of M-map version versus reading the whole dictionary into the memory but I have tried some of these things so We need a fast lookup data structure to represent this dictionary into the memory so We could use a hash table, but it does not guarantee a worst-case performance. So if the dictionary is huge is huge It might have performance impact. So even though in normally a hash table guarantees Overn performance, but in the worst case it might just become a linear But we could use a red-black tree also which guarantees order log-in performance, but So if you have a huge dictionary and you want to See whether a word is in the dictionary or not and you have it represented as RB tree you have to do string comparison with at every node in the tree You'll end up doing a lot of string comparisons Tries are a specialized with data structure Which avoid this problem of doing that many string comparisons. These are like binary search trees, but every node has Single character and you just traverse the tree based on matching the characters So you don't end up calling strCmp a lot of times So these I've used these ternary search tries So there's a much like a binary search tree, but instead of having two nodes every It has at every level it has three nodes and Every node stores one character of the word and In the next node if the next character is smaller than this character We go to the left and if the next character is greater than this character We go to the right and if it is the if it matches with the current character. We go to the middle node so That's how we travels so and This also has order log in performance and another Cherry on the top is that we can do prefix Matches so for example if I want to see how many words are there with a certain prefix I can easily do that with this kind of data structure in the login time So this is an example of a turn research try in memory. So for example if I want to Insert a word cat. So I'll start with C and then This is the first first word. We'll just go down the middle node If I want to insert another node in another word, let's say bug I'll go since be smaller than C will come to the left and since there is nothing here will go down the middle If I want to insert cup so C is already there you is greater than C So we'll come down to the right and then we'll come again to the middle and This way you can do again a binary search kind of look up and Insert and get the words in login time So for doing a spell correction, so that was about how do we represent the dictionary? So once we have the dictionary we can look up the dictionary and we can figure out if it is a correctly spelled word or not But now to do a spell correction. There are different set of algorithms So one of the popular techniques is the added distance techniques. So we figure out how many Edit distances this word is from another word and then there are sound based algorithms such as the metaphor and We use and gram models to Figure out which is the most correct Spelling of the all of all these possible corrections. So the added distance is basically If we have two words, so how many edits are needed in this word to convert it to into the other word So the possible edits are there The insertion so you insert you can insert any character in the word or you can delete any character in the word Or you can replace characters with another character and it has been Found that majority of spelling correction are just fun added distance away from the correct spelling Yeah, transposition is also there This is an example of for example if the word is I misspelled eyes So then these are the possible This is not a complete list. I just Shortened it but so for example if you delete one character at a time So if you remove the T it will become E H and if you remove the E then it becomes T H Remove the E then become H becomes then becomes T and so on So these are the possible Words from ta at one edit distance once we have this list again We can look up all these and see how many of these are there in the dictionary and then those are the possible corrections of a misspell word So these are all added distance one if you want to go to edit distance to we take all these words and from these words We go to the next how many words are there at edit distance one from here And those are edit distance to but as we go on the the list of words Increases very rapidly. So looking up dictionary becomes quite expensive Another popular technique is the metaphone algorithm. So this is a phonetic algorithm There was a very popular algorithm called sound x which was developed in the 50s of the 60s and So essentially based on how a word is pronounced or how a word sounds it generates a code so every word gets a code and Based on the misspell word we can generate its metaphone and we can see how many words are there Which have the same Metaphone as similar to this and then Those are the possible Corrections for this particular word. So the advantage of sound x or advantage of metaphone over sound x is that It not it covers not only English, but also Many other languages. So if you want to use it it for other languages, this works quite well and There have been three Metaphone algorithms for the first one was developed in 1990 and but it had some bucks and then the next version called double metaphone came in 2000 and The final version came in around I'm not sure but that is a Commercial version so we cannot use it, but the double metaphone algorithm is available as open so we can easily use it and Finally once we have these possible corrections for a misspelled word We have to figure out out of all these possible corrections, which are the most possible ones So as I described previously, we can use biograms or trigrams to see the surrounding words and see how many of these are I mean how many of these are the most possible corrections here So this is an example if I have a sentence like this I am not feeling very well and I misspelled very as W er y Now I'll have suggestion. I have I will have corrections such as Very W e a r y that is also valid word and I'll I'll have Correction as very V er y both are possible corrections, but if I compare biograms then V er y very well is much more likely than W er y very well So this is a technique to filter down this list of possible Corrections and once you have all these algorithms, we have to figure out how to combine all of these to Get the get the best possible accuracy There are various choices here So you can just take all the words at a distance one and just stop there So if a match is found at a distance one just take that but although That gets good accuracy around 70 to 80 percent, but it does not cover all possible spelling errors Another possibility is once you you go another distance. So you go to edit distance to but edit distance to gets Because a number of words at edit distance to is much higher The number of comparisons you have to see in the dictionary is much higher also. So it becomes lower So what I found is that? These three steps. So first Check all the words at a distance one if a match is found stop there if there is no match found We'll try to find all the words having the same metaphor or a metaphor at distance one or two And out of those we'll see if a match is found in the dictionary or not If no match is found and we'll go to edit distance to so this is an increasing odd Increasing order of runtime Complexity because this is a very fast. This is also fast, but this is very slow So with this you only want to do in the worst case then when we are not able to find anything and then in the research literature there have been some Tricks which have been found so it has been found that There is a much less chance that you someone will misspell the first character of a word So if when we are generating the possible correct spellings if The first word is being changed and we'll give it a lower weight because it is very much less likely that this is Person will make this kind of a mistake and Similarly when we have a set of corrections and Out of those the words which sound similar that is they have a similar metaphor code we can give those words a higher weight because Because they are sounding similar for most probably that could be the correct spelling so based on these heuristics I have done some comparison with the other major open source spell implementations So there is an openly available data set which a spell Uses to publish its benchmarks. So I have used the same data set and compare the performances So the a spell so there are two variations in a spell. There is a normal mode and there is a slow mode So in the slow mode and then there is So this is like if a match is Founded in the first place that is if a space generating just one suggestion that is the correct match and If a space generating five corrections and a match is found in any of them This is one to ten that is a match is found in any of the top ten corrections and this is the same as for the top 25 So for the first position a spell gets around 73 or 74 percent accuracy But if we go to top five a spell gets much better So it jumps to ninety six point one and in the top ten it goes to ninety seven and then ninety eight The slow version is slightly better. It has seventy four and then ninety six point six hunts spell Is better in the if we are doing the comparing the first match in the first position, so then it will have around eighty point five accuracy and Top 25 goes to ninety seven I spell is the oldest of all these all so It has Accuracy around seventy seven percent and in the top twenty five it goes to eighty five So I also have done two variations of comparison. So there is a slower mode and there is a fast mode so in the slow mode it gets around ninety one percent accuracy and then in the top five it goes to ninety five and then The improvement is very marginal. So it almost stops in ninety five in the fast version It is charged at eighty eighty eight point seven and goes to ninety three and ninety three point four so the difference between the fast and the slow mode is that this test data set contains about four thousand words and The slow mode takes around two minutes to compare to come up With spelling corrections for all those four thousand words the fast mode Takes around Around eight to ten seconds to do spell corrections for four thousand words So I've also created a small Demo to show how this works if you want to integrate in a shell So I wanted to integrate this into actual into an actual shell, but I didn't have the time So I just created a shell myself and Integrated it on the spell checking functionality there. So For example, if I type anything wrong It will figure out okay, I spelled the command wrong I mean this is an existing functionality that most of the shells already have but I just wanted to show how simple it is to do this then since I'm using a try data structure and It can do prefix searches. So it is very simple to do Auto completions as well. So for example, if I want to see This is also an existing functionality But since we already have all the words in the memory and we can do this kind of stuff also And another thing is if you want to do Context sensitive auto completion so for example if I'm doing I want to look up a man page and I type man and then I Type I do tab and then it was auto completed to The possible man page name Yeah, and similarly if Another example of context sensitive auto completion is is it visible? So another example of context sensitive look Auto completion is for example if I want to do install a package and I type package add and then I want to see all I want to install some go package So I just type go and it will show me all the packages that are there for go So I built individual dictionaries for man pages package names commands and I'm based on the Context that is if I have typed man. I'm doing use looking up the dictionary of man pages if I am Trying to install package then I'm looking up the dictionary of packages. I'm doing auto completion based on that so All of this is quite simple to do With a library like a interface So I would like to come conclude with saying that although it does not meet a spell or unspell in Their top 25 corrections, but it still is able to match in the in the first order top five comparisons and There's still much room for improving the accuracy and the performance So because I'm using an in-memory dictionary. There is an issue about using too much memory So for example if the dictionary is too big it will consume a lot of memory. So that is something to look into But it's still it would be nice to have a BST licensed Spell checker plus a library so that we can use in our own applications and all this code is currently available on GitHub so it is not part of night BST as of you know, and I'm not sure if it will be but I'll try to post it on the mailing list I I'm still working on this. So there might be a lot of bugs still any questions if you have So Are you looking at what keyboard type people are using so you cannot say what characters are possibly the replacement for a typo? Not yet. So that is a I guess that is a different set of algorithm probably I'll have to see because For example the edit distance one, so I'm using the normal ask I order But if I want to use the keyboard order, then that's a different kind of I need to figure out How to get that ordering of I mean how the keys are placed in the keyboard that is a possibility. Yes So in your demo in your in your demo you showed your special shell, which was expanding context wise That's one approach. The other approach would be to maybe modify package info manual to use your infrastructure Yeah, what would you propose because both of them are kind of intrusive? And somehow it has to be a model which you would use for your deployment of this spelling addition So for example, you're saying for example if I want to modify package add and that should do the auto completion or The question is would you build it into the shell or would you build it into each and every command? it is simpler to do in the shell because that is one place and We can modularize it so we can provide hooks into the shell that shell should call Should use this dictionary to look up when someone is typing this command Modifying each and every command will be much more time-taking. I can answer that because I've written this thing before for DC show So there are certain common Completions so depending position if you're a command or you're a file and or if you're expecting a directory or not a directory So all of these are common so you can put it in the language of the shell Putting it each command would require executing the command every time you have to ask it what the completions are that can be inefficient But one example where I would like to modify the command is for example in apropos So if someone types a query and that does not get any results But apropos can see oh you misspelled this thing as something else so I it can say did you mean this so someone Can know okay. I misspelled this thing. I need to add a query That is one place you I can modify the command itself for auto completion. It makes sense to modify the shell