All right, hello everyone. My name is Richard Schneeman, or @schneems. If you came to the lightning talk, you know the most important part about me: I'm incredibly, incredibly committed to Ruby. Sadly, she could not come to the conference. And you will also know that Ruby is in fact a Python programmer. It's okay, I've gotten over it.

I work for a small company based out of San Francisco that you may have heard of. It's called Heroku, where they give me some free time and allow me to work on things such as CodeTriage.com, which allows you to get an open source issue in your inbox, one every single day. I also recently announced docsdoctor.org, which is basically the same thing but with methods: you can get a documented method or an undocumented method in your inbox.

A little bit more about me. I am in the top 50 of the Rails contributors, so that kind of makes me a big deal. I do not have any cats in this presentation. This is a photo of my dog, enjoy.

You may have seen a couple of people wearing Keep Ruby Weird shirts. This is a conference that we threw for the first time ever in Austin, Texas. Austin's slogan is Keep Austin Weird. And you know what, Ruby is amazing and dynamic and powerful, and in the spirit of _why, we wanted to preserve that. We got together and we were like, okay, we kept Ruby weird in Austin, but at RubyConf we should also try to keep it a little bit weird. What can we do to keep Ruby weird? So we were brainstorming (and I needed a lot longer segue), and we thought, perhaps we could do the CanCan. You ready? Da, na, na, na, na, na, na, na, na...

So that was the fun part. Stay here, you're not done. Thank you very much, that was wonderful. Can I get a round of applause? They knowingly volunteered to do this.
That's amazing. What you do not know is that you have also volunteered to do the conference CanCan. So can I get you all to stand up? You have to stay weird, everyone. The conference CanCan is similar to the actual CanCan, except you can't do this, so it's more like pretending you're a really shy person doing the CanCan. And we have some actual music that is perhaps a little bit better than mine. So, you ready? You do a sort of knee, step, kick. Yeah, it's like knee, knee, knee. If you're friendly, grab the person next to you. Thank you very much, but that's not required. All right, you ready for this? This is amazing. By the way, this song goes on for an entire minute. Okay, thank you. Thank you very much.

All right, I hope you are fully weirded out now. Now that I have your attention, now that you're awake, I'd like to talk about algorithms. But before I talk about algorithms, I want to talk about my background. I went to Georgia Tech and I studied mechanical engineering. I don't have a CS degree, and I've been teaching myself programming for about the last eight years, which has been super fun and super amazing. But computer science is, like, crazy boring. I don't know if you know this. If you have a CS degree, it's all right. I find programming really, really interesting because I like building things.
I like actually making things, putting my hands on parts of systems and seeing how they move. But the CS students might know something that the mechanical engineers of the world haven't quite touched on, which is that algorithms are absolutely beautiful. Just unbelievably beautiful. And it's not because they come from a book or because you have to learn them to pass a test. It's because algorithms solve problems. They solve real-world problems that someone else has invested time into learning.

So I have a really big problem, and that is spelling. I cannot spell at all, which is why I'm a programmer: if I misspell my variable name, as long as I consistently misspell it, it's okay. I'm only half joking on that one. And spelling becomes even more difficult when you're tired or distracted. All the time I'll type something along the lines of a misspelled git commit, and Git says, yeah, that's not a real command. But Git also does something really cool: it says, hey, what you actually probably meant was commit. That was amazing, and one day I was like, hey, how does Git know?

The method through which Git determines this is something called edit distance. Edit distance is the cost to change one word into another. The less it costs to change one word into another, the more likely it is that the second word is what you meant. If there's zero cost, then it is the exact same word. A quick example: from "zat" to "bat" would be a cost of one, and "zzat" to "bat" would be a cost of two. So that kind of makes sense. But how would we go about coding it?
When I first found out about this, I was like, oh, hey, that's probably really simple, I can do that. I sat down at my editor and banged out, I don't know, five lines: distance between two strings. When I ran it with this really simple use case, I got that the cost was one. Basically, all we're doing is iterating over each character in one of the strings and checking to see if it matches the character at the exact same spot in the other string. It's super simple. That made sense to me, and I was like, oh, this is great, I'm done, and the talk is over.

But you may have guessed the talk is not over. It wasn't perfect. If I put in something a little bit more complicated, maybe strings that didn't have the same length, then we would get a really high cost, and this is not correct. It does not take seven edits to change the word Saturday into Sunday.

It turns out that in my naivety I accidentally recreated a thing that actually exists and is actually useful. It's called Hamming distance. Hamming distance is not edit distance. It's also known as signal distance, and it measures the errors introduced in a string. It's actually only valid for strings of the same length. It is really useful, just not for spelling. It's horrible for spelling. Don't use it for spelling. If you happen to be in the telecommunications industry and you're dealing with what are essentially strings of ones and zeros, then if you can detect that there was an error, you can say, okay, this is the likely match for this set of strings. But it's really bad for misspelled words, and the reason for that is it only includes substitution.
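The position-by-position comparison described here can be sketched in a few lines of Ruby. This is a minimal reconstruction, not the code from the slides; the method name dirty_distance is borrowed from the nickname used later in the talk, and treating a missing character as a mismatch is an assumption made so that strings of different lengths still produce a number.

```ruby
# Naive "dirty distance": compare the characters at each position.
# For equal-length strings this is the Hamming distance.
# Assumption: a position past the end of str2 counts as a mismatch.
def dirty_distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    cost += 1 if char != str2[index] # str2[index] is nil past the end
  end
  cost
end

puts dirty_distance("zat", "bat")         # => 1
puts dirty_distance("Saturday", "Sunday") # => 7
```

Only the first character lines up between Saturday and Sunday, so every other position is counted as an error, which is where the cost of seven comes from.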
It does not include inserting a character or deleting a character. So I would like to introduce another algorithm, called Levenshtein distance. In order to calculate Levenshtein distance, we have to figure out how to account for those extra operations, deletion and insertion.

If we look at two strings that are really similar, we can see here that "schneems" and "zschneems" are almost a match except for that first character. So basically, if everything except for the first character of our second string is a match, then that's a really good candidate for deletion. That's saying, yes, we should delete that. And this is what that would look like in actual Ruby code: if those match, then we delete.

For insertion we can do a similar thought experiment, where we have just one character missing. So "schneems" and, I don't think the other word is actually pronounceable, "chneems". Okay, yeah, that was good. If we knock out the first character of the first word, then, oh, okay, we can solve this via an insertion. And here's the code for how we do that.

With substitution, we already looked at that. That's basically saying every single character matches except for the one that you're currently on, so we're only looking at the first character in that string.

Whenever we put all of these rules together, we can calculate distance. If you pretend that we already have a distance measurement between different strings, we can calculate the distance between strings of different lengths. If you have an empty string and a string, the cost to change between those two is going to be the length of that string. That intrinsically makes sense: nothing to "foo" costs three characters, and "foo" to nothing also costs three characters. We can represent that in code by just returning the length of the other string if one string is empty.

Once we have this piece, we can calculate the distance between every single substring, and do it using the same logic we did before,
where we're going to call that distance measurement. We're either going to knock off the first character of one of them, knock off the first character of the other one, or knock off the first character of both, and these represent deletion, insertion, and then substitution. We're going to go ahead and calculate all of those, and we're going to take the minimum cost, because we only want to make the minimum number of edits. And then we're going to add one. So why did we add one? In this case, the cheapest operation just tells us which one we should be picking, but adding one to that represents the actual cost of the operation.

Okay, that covers all of our cases except for when the characters match, and that's fairly straightforward again: you just look at that individual character, and if it matches you move on without adding any cost.

When we combine all of these things together, we form, of course... not Voltron. We form recursive Levenshtein distance, which is amazing and really useful. It's incredibly simple, and it looks a little something like this. It's actually not that much longer than the other one. Each of the individual rules, if you look back at what I previously talked about, vaguely kind of makes sense if you just sort of pretend that the distance measurement method already works.

So here we go. When I looked into this and found this, I was like, okay, hey, we're done. I put in Saturday and Sunday, and it turns out that it only takes three edits to change Saturday into Sunday. That's much better than seven. Then I was like, hey, what does this actually look like when it runs? And so I kind of hooked it all up together.
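Putting the rules from this part of the talk together (empty-string base case, match, then the minimum of deletion, insertion, and substitution plus one), a recursive version might look like the sketch below. It is an illustration under assumptions rather than the exact code shown on the slides; the method name distance and the variable names are made up for this example.

```ruby
# Recursive Levenshtein distance, built from the rules described in the talk.
def distance(str1, str2)
  # Base case: changing to or from an empty string costs the other length.
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  # Match: identical first characters cost nothing, so compare the rest.
  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]

  deletion     = distance(str1[1..-1], str2)         # drop first char of str1
  insertion    = distance(str1, str2[1..-1])         # drop first char of str2
  substitution = distance(str1[1..-1], str2[1..-1])  # drop first char of both

  # The cheapest option picks the operation; the + 1 pays for performing it.
  [deletion, insertion, substitution].min + 1
end

puts distance("", "foo")            # => 3
puts distance("Saturday", "Sunday") # => 3
```

Each call can branch into three more calls, and many of them compare the same pair of substrings again, which is why this version does far more work than the naive comparison.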
I don't know if you can see this. We're still going... We're still going... Okay, so that took 1647 comparisons, and then you can see that the distance was three. So we got the correct distance, but it took a really long time to run. By the way, all of these scripts are on GitHub, at schneems slash going the distance. I also have a bit.ly URL that might be a little easier to remember; I'll have that later.

So if we look at the measurement from before, the Hamming distance (which I didn't know was called Hamming, I just called it dirty distance), that took eight comparisons. Recursive took 1647. So which would you rather have: fast and wrong, or really slow and correct? That's not something you can really reason about.

It turns out that there is a better way to do this; we can do slightly better. One of the keys is that if you watch the recursive algorithm closely enough, you'll notice that a lot of the strings it was comparing were the same. So we could cache those in some sort of way. Basically, if we have the distance of all of the substrings in a string, then we can add those up and get the distance of the entire string. So first of all, I would like to invite you to join the club.
I don't know if you know this, but there's a very prestigious club, for members only. And when you're in the Members Only club, you can also learn how Levenshtein distance calculated with a matrix works, which is what we're going to go into right now.

All right. We're going to stick with the same example. Going from an empty string over to Saturday, we can build a matrix that looks like this. We have our empty string on the up-and-down axis, and we have Saturday along the top. If we just convert the empty string into "s", it costs one; "sa" is two; "sat" is three; then four, five, six, seven, eight. So that hopefully makes a little bit of sense.

We can look at it the other way and ask, what does it cost to turn Sunday into an empty string? We form the other side of our axis, and we have one, two, three, four, five, six. So it costs six edits to turn Sunday into an empty string: we have to delete six characters. That would be the distance.

Once we have this, we have the starting point of our matrix. We don't really need to calculate this; it's kind of like the zeroth law of edit distance.

So we're going to go back to those rules that we looked at previously and break it down. How much does it cost to change an "s" to an "s"? Well, intrinsically we can say it's the same thing, so there's no edit distance and it's a cost of zero. With insertion, we can say that obviously, if we want to change "s" to "sa", we're adding the "a", so that's going to cost one extra character, and we add one into our matrix. We need to be able to do this programmatically, though. It's relatively simple for people to do, but we're going to go back to those rules. So again, we're going to be knocking off the first character of one of our strings, and the way we can do that in a matrix
is actually by using the row index and the column index as a pointer to the part of the string that we're looking at. So we're looking at "s", and then we're looking at "a", but we want to knock off that last character to see if they match. We look up the cost stored at that earlier position, and then of course we add one to it, and that represents the cost of doing an insertion right here. So it would be zero plus one, which happens to be one, which is exactly what we thought previously. It's always really good when you're programming an algorithm and your algorithm produces the results that you expect, generally.

Okay, so we're going to keep going. In this case there are no extra characters, it's all just going to be insertion, so we go over and over again, and we end up just adding one, two, three, four, five, six, seven onto the end of our matrix.

Our next step is going to be changing "su" into "s". Can anybody guess the action for this? Changing "su" into "s"? Deletion. Okay, good. Again, this is something we can intuit, but we have to be able to figure out how to tell our machine. Previously, we looked at some code where we knock off the first character of the second string, and again we're going to look in our matrix. Our column index this time is going to be "s", and the row index is going to be "u". We're going to bump the row up by one, and that gives us our initial cost, which we can then add one to. So it's going to be a cost of one. This covers insertion; this covers deletion.
We also have to cover substitution. You probably see where this is going; you've already seen these equations. We go to our matrix, we find our row, we find our column, and in this case we go back by one on each of them. Substituting "a" for "u" is exactly the same as asking, did our previous set of strings match? Because if they matched, then we should substitute these characters. So that hopefully makes a little bit of sense. We add one, and that covers all of our processes for determining insertion, deletion, and substitution.

I did kind of gloss over match. Match is basically just minus one, minus one, where we look at the previous cell on the diagonal. It's similar to substitution, except we don't add one, because there is no cost to change a string to itself. So essentially, changing an "s" to an "s" is the same as changing nothing to nothing, which would be a cost of zero.

Once we have all of these things in place, we can put them together, and we end up with something essentially like this.
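As a rough illustration of the matrix approach walked through above, here is a small Ruby sketch. It is not the code from the slides: it uses nested arrays for the matrix, and the method name matrix_distance is an assumption made for this example. Rows follow the first string and columns follow the second.

```ruby
# Matrix-based Levenshtein distance. Returns the whole matrix so that the
# cost for any pair of prefixes can be read back out afterwards.
def matrix_distance(str1, str2)
  rows = str1.length
  cols = str2.length
  matrix = Array.new(rows + 1) { Array.new(cols + 1, 0) }

  # Starting point: the cost of converting to or from the empty string.
  (0..rows).each { |i| matrix[i][0] = i }
  (0..cols).each { |j| matrix[0][j] = j }

  (1..rows).each do |i|
    (1..cols).each do |j|
      if str1[i - 1] == str2[j - 1]
        # Match: copy the diagonal cell, nothing to add.
        matrix[i][j] = matrix[i - 1][j - 1]
      else
        deletion     = matrix[i - 1][j]     # row minus one
        insertion    = matrix[i][j - 1]     # column minus one
        substitution = matrix[i - 1][j - 1] # minus one, minus one
        matrix[i][j] = [deletion, insertion, substitution].min + 1
      end
    end
  end

  matrix
end

matrix = matrix_distance("Sunday", "Saturday")
puts matrix.last.last # => 3, the distance between the full strings
puts matrix[3][3]     # => 2, the distance between "Sun" and "Sat"
```

The bottom-right cell holds the distance for the full strings, and every other cell holds the distance for the corresponding pair of prefixes.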
We're going to iterate over all of the characters in one of our strings, and we are going to store the values in a matrix. You can see here, highlighted in yellow, all of our different cases. Granted, if you were going to code this for performance, you probably wouldn't want to allocate a hash, but that's a different story. When we run it, you get something that sort of looks like this. It just goes... There you go. Okay, so you can see the final cost of changing Saturday into Sunday was three.

One of the coolest things about this method is that we also get the cost of all of our substrings. We've already calculated them. If we were interested, we could find the cost of "Sun" to "Sat": we just look in our matrix, we pick the "n" and the "t", because those are the last characters, and we find that the cost would be two.

Was this better than recursive? It took a net of 48 iterations, which in my personal opinion is a little better than 1647. It is a little memory intensive. Both of these examples, which you can feel free to play around with, actually iterate and step through, and reason about, along with all of my notes for this presentation, are on bit.ly slash going the distance, if you're ever interested. I'll also throw up a couple of reference links for the research that I did in coming up with these.

Okay, so this is a tool. Previously I was talking about programming, and I was saying algorithms are really cool because they do things. And by now you're like, string minus one, plus one, add cost, blah blah blah, what's going to be for lunch? You haven't seen it actually in action. I mentioned previously that I'm a horrible speller. I happen to be human, and I often get tired, and machines do not understand me very well when I get tired. So one day,
I was typing rails generate migration, but what came out was really something like "migratoon", and it totally just blew up in my face. I was super upset. I was stressed, and it was the simplest task known to mankind. Why could I not possibly do this? Things were just not working. I was upset. I don't even know where this came from. Who made this? It's amazing.

So I was asking, why can't the software that I'm using be more like Git? We know what you're trying to accomplish. We know you're trying to run a generator command, and we know the generator commands that are available. Why can't we be a little bit more helpful? So I submitted a pull request where we use Levenshtein distance. Basically, whenever we detect that you have an error, we compare the command that you gave us to all of the possible commands using Levenshtein distance (it's quick enough), and then we recommend the one with the smallest distance.

This is also similar to what Google does. If you've ever typed something into Google and misspelled it, it will give you a spelling suggestion. So this is really neat: Peter Norvig has a great paper online where he talks about how Google implemented the first spelling suggestions. Basically, in addition to the Levenshtein distance stuff that we talked about, Google slurped in a lot of real-world books, and in these books they counted words. Every single time a word came up, they counted it. So the word "a" probably came up a ton, and they're like, "a" is probably a real word. "The" is probably a real word. "Smorgasbord" is totally made up, because we didn't find it in any of these books. Like, you know, "dragon flash" or something. So the higher the count of that word, the higher the probability that it's going to get in there. Once we have this information, we can also get the edit distance between the input that you gave
it and the dictionary that we have. The lower the edit distance, the higher the probability. Put that all together and, boom, we show a suggestion.

Google is also really smart, and they totally cheat about this. Originally, running a Levenshtein distance across all of those words would probably not return in whatever point-something seconds with network lag. So they cache the correct spelling suggestions. They basically show you a spelling suggestion, and then you click on it, and they're like, ha, gotcha, that's what you actually wanted. And if most people search for that thing and also click on that spelling suggestion, then they know that's probably the right one.

Another really cool thing that came out relatively recently is the did_you_mean gem. Has anybody seen this? All right, so if you add the did_you_mean gem to your Gemfile, it will give you a really cool error. If you call, say, user logged in, it's like, you know what, this object doesn't have a method called that. It basically catches NoMethodErrors, finds all of the methods on that object, runs Levenshtein distance across all of them, and then suggests a method for you to use. Probably don't use this in production.

So this is really cool. This stuff isn't just in a textbook because people are bored and professors are trying to make your life horrible. These are actual, applicable things that you can add in. This just came out; somebody just thought of this. What else could we add it to? There's a ton of CLI commands. Yes, it's very specific, it's generally for dealing with misspellings, but as programmers we spell and misspell a lot of things. We type a lot of things. There are a ton of applications.
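The idea of comparing a mistyped command against every known command and recommending the closest one can be sketched as follows. This is a hypothetical illustration, not the code from the Rails pull request or from the did_you_mean gem: the names levenshtein, suggest, and GENERATORS, the list of commands, and the max_distance cutoff are all assumptions made for the example.

```ruby
# Iterative Levenshtein distance that keeps only the previous matrix row.
def levenshtein(str1, str2)
  previous = (0..str2.length).to_a
  str1.each_char.with_index(1) do |char1, i|
    current = [i]
    str2.each_char.with_index(1) do |char2, j|
      cost = char1 == char2 ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # match or substitution
      ].min
    end
    previous = current
  end
  previous.last
end

# Returns the closest candidate, or nil when nothing is close enough.
# Assumption: max_distance is an arbitrary cutoff chosen for this sketch.
def suggest(input, candidates, max_distance: 3)
  best = candidates.min_by { |candidate| levenshtein(input, candidate) }
  return nil if best.nil?
  levenshtein(input, best) <= max_distance ? best : nil
end

# Hypothetical list of available generator names.
GENERATORS = %w[controller migration model scaffold]

puts suggest("migratoon", GENERATORS) # => migration
```

A cutoff like max_distance keeps the helper from suggesting something unrelated when the input is nowhere near any known command.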
I'm sure there's a lot that we could add this to.

In preparing for this talk and doing research, originally I was like, oh, I'm going to do all the distance measurements. And that's like saying, I'm going to do a talk on all the algorithms, because there's a lot of them. We talked about Levenshtein distance and we talked about Hamming distance. There's longest common subsequence distance, there's Manhattan distance, tree distance, and there's also Jaro... Jaro-Winkler distance, which I've been told is nowadays what most people doing massive amounts of spelling suggestions would actually use instead of Levenshtein.

And you can get really creative. If you're going to be doing this often and frequently enough (you know, Google had to bootstrap those spelling suggestions at one point in time), you can do things like store the edit distance calculations or the counts in a trie, so that you can easily just give it all of your characters and it will give you back a list of possible spelling suggestions.

So this is just scraping the surface of all of those measurements. But really, at the end of the day, hopefully you're going to walk away and be like: algorithms, not just for CS undergrads anymore. They're really cool. There's a lot that we can learn from them. And if you're interested in learning more about algorithms in general, Wikipedia has an absolute ton of stuff. Rosetta Code is great. If you're not familiar with this project, they pick common algorithms and implement each algorithm in a bunch of different languages, though the Ruby version of Levenshtein distance was, like, so
It was like obviously not written by a ruby programmer I felt free to like fix that a little bit Or honestly the best way to learn more about algorithms and my personal recommendation is to give an algorithm talk Like seriously, it's probably gonna happen as soon as we open up this for questions Somebody's gonna be like, hey, did you hear about like this algorithm? And I'm like no and that's amazing and we should talk about this And and and algorithms are essentially these ways of of sharing knowledge and they explore how How how we can do things and do them efficiently? So this is the anti penultimate slide Which I've been informed means the third from the last And this is basically a nonsensical slide So does anybody have any questions? Okay, cool. Well, I'm gonna be I'm gonna be around all this code is is is online Like I highly recommend going out and playing with it actually exploring all of my notes are on that ruby gem Or on that on that github repo and if you wait maybe like five minutes I'm gonna actually push up a bunch of links to the read me that that help explain where a lot of a lot of My source material came from so thank you very much for coming and yeah, have a great day