Thank you for welcoming me here in Singapore. My name is Yuki, and today I'm going to talk about the did_you_mean gem: the experience and beyond. This is me: my name is Yuki Nishijima. You can find me on GitHub and Twitter. I think yesterday Matz called me Nishida-san, which is actually not right; my family name is Nishijima. So I wanted to ask him: did you mean Nishijima?

So yeah, this is me on GitHub. I'm the creator of the did_you_mean gem, and I also maintain a couple of community gems. I was born in Japan and live in New York right now, and I also used to live in the Philippines. I actually came to Singapore for RedDotRubyConf in 2011 and 2013, and I'm really excited to be here as a speaker this year.

I work for a company called Pivotal Labs. Pivotal Labs is an agile consultancy that helps build software with agile practices, like pair programming, TDD, and so on. I think some of you already know Pivotal Labs, because there is a branch in Singapore. This is our standard pair programming setup. We're also going to open a branch in Tokyo this summer, so if you are interested in working with us, just let me know.

So let's talk about today's main topic, the did_you_mean gem. Matz actually talked about it yesterday already, but let me give you a quick introduction, because some of you might not have used it before. Let's say you want to check whether a string starts with a specific letter, so maybe you call the starts_with? method. But it doesn't work, because String only responds to start_with?, not starts_with?. This often happens because you are not using Rails, or more specifically Active Support, because Active Support adds starts_with? as an alias. But if you use the did_you_mean gem, as soon as you get the error, it will automatically look for what you really wanted to call and suggest it to you. So you don't have to waste your time.
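A minimal sketch of the situation described above, assuming plain Ruby with no Active Support loaded:

```ruby
# Plain Ruby's String only defines the singular start_with?;
# the plural starts_with? is an Active Support alias.
puts "singapore".start_with?("s")   # => true

begin
  "singapore".starts_with?("s")     # no Active Support, so this raises
rescue NoMethodError => e
  # With the did_you_mean gem loaded, the error message also carries
  # a suggestion along the lines of "Did you mean?  start_with?".
  puts e.class                      # => NoMethodError
end
```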
You don't have to Google, and you don't have to go to RubyDoc; it's just going to look for what you really wanted to call.

Since last year, I have been getting many questions about this gem. One of them goes: it's great, but sometimes it doesn't correct when it should. For example, you type a long method name, you just don't remember the correct one, and did_you_mean fails to find it. And sometimes it displays too many corrections, because it's not super smart; for example, in a Rails app you type something like "api" in a path, and it's going to suggest a lot of things. Another example: I use a column name from the database, and did_you_mean just doesn't suggest it. This one is actually interesting, because did_you_mean is designed to work with NameError and NoMethodError, so if you mistype a column name or a table name, did_you_mean doesn't suggest anything at all.

So let's first learn how to write a spell checker in general, so that it's easy to understand. This is not really a Ruby-specific topic; it's about computer science. The history of building computer spell checkers is actually quite long, even longer than my age, so let's learn from that history.

First of all, what is a spell checker? Basically, it behaves like a function that takes a user input like this. Obviously, the input is what people actually typed, and the input usually has noise: you make a typo, you misspell something, or you don't remember a method name. Basically, it is just noise. The spell checker then gives us back the output, which is what was most likely intended. So it is actually pretty simple: it takes an input, and it gives us back the correct one. But what's inside the black box? Usually, a spell checker consists of three things.
The first one is a dictionary, which is basically just a set of words. The second one is a control mechanism, which decides what to return as a correction. The last one is optimization. Technically, a spell checker can work with only a dictionary and a control mechanism, but usually some optimization techniques are applied to improve the spell checker's performance and accuracy.

So what is a dictionary? Again, it is basically just a set of words, and it usually comes from an actual dictionary, which is why it is called a dictionary. A spell checker can have multiple dictionaries: for example, it can have both the Oxford dictionary and some other dictionary that is available on the web, and sometimes even more. This is because every dictionary out there has different characteristics, and one dictionary can't always cover everything. For example, British English is a bit different from American English, and you may want to implement a spell checker that can correct both. Or sometimes it shouldn't: "optimization" is sometimes spelled with a Z and sometimes with an S, and if you are in America, a spell checker shouldn't suggest the S spelling. Another example could be a spell checker for Japanese. I think this is also true for other languages like Chinese, but the reason it needs multiple dictionaries is that when you're writing Japanese, it is typical to also use English words, so you probably need two different dictionaries so that the spell checker can correct even the English words while you are writing.

The next one is the control mechanism. Mathematically, it is just a formula. It usually checks whether or not each one of the words in the dictionary is the right one. A spell checker can have multiple formulas, because one formula doesn't always cover everything, for the same reason that one dictionary can't.
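To make the dictionary and the control mechanism concrete, here is a toy sketch. The word list and the "longest shared prefix" rule are made up purely for illustration; real checkers use proper distance metrics:

```ruby
# A spell checker reduced to its two required parts:
# a dictionary (a set of words) and a control mechanism
# (a formula that decides which word to return).
DICTIONARY = %w[ruby rails bundler gem].freeze

# Control mechanism: return the dictionary word that shares the
# longest run of leading characters with the input.
def correct(input)
  DICTIONARY.max_by do |word|
    input.chars.zip(word.chars).take_while { |a, b| a == b }.size
  end
end

correct("rubi")    # => "ruby"
correct("bundlr")  # => "bundler"
```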
Since it is just a formula, some metrics must be calculated from the user input and the words in the dictionary. There are many kinds of metrics out there, like the string similarity measures Levenshtein, Jaro-Winkler, and Hamming, and a lot more. But naively scanning all the words in the dictionary will be painfully slow: in English there are something like four million words, and you really don't want to scan everything, because it takes time.

So, optimization. It basically improves performance or accuracy, and sometimes both. There are many, many optimization techniques out there, but they are usually context-specific: an optimization that works for one spell checker often can't be used by another. Still, it is really powerful, because very often the optimization is what makes a spell checker great. So if you are writing a spell checker, you really should optimize.

Now let's take a look at some examples. Here you can see "they were troubling" and "they where troubling". Obviously the second one is wrong, and it is easy for humans to choose the right one: "were" is the correct word. But for computers, it is really hard to choose which one is right, because computers are not smart. In this case, you can implement some kind of grammar analysis, because "were" can come between "they" and "troubling", but "where" can't sit in the middle of "they" and "troubling". Another example would be something like this. It looks really weird, but spell checkers are not only for humans. For example, if you talk to Siri, it can't tell the difference between "know" as a verb and "no" as an answer. In this case, you can build a dictionary of words with the same sound, so that the spell checker can pick the right one.
Now we know that a spell checker can have three things: a dictionary, a control mechanism, and optimization. So what is the dictionary of the did_you_mean gem, what control mechanism does it use, and are there any optimizations in it? The dictionary of the gem is simply a list of symbols, and it calculates the Levenshtein distance between the user input and each word in the dictionary, and it suggests the ones that are within a threshold.

So what is the Levenshtein distance? It is actually quite simple. Let's say you have two strings. In the previous example, we talked about start_with? and starts_with?. Here you can see start_with? and starts_with?, and obviously there is a one-character difference between them: the extra "s" between "start" and the underscore. If you remove that one letter from the second one, they become identical, which means the Levenshtein distance is one. Now let's take a look at this example of first_name and full_name. There is a three-letter difference, as well as one extra letter in first_name, so the Levenshtein distance is four.

The did_you_mean gem has just one optimization, which is a context-based dictionary. I'm not sure if I should call it an optimization, because it has worked this way since the beginning. If you want the list of all symbol names, you can just call Symbol.all_symbols. So how many symbols are defined in a Ruby process? As you can see, there are about 2,500 symbols if you just run Symbol.all_symbols.size in a plain Ruby process. So how many symbols are defined when you do rails new with the Rails defaults, run rake db:migrate, and then rails c? I'm going to ask you guys: raise your hand if you think it's 5,000; 10,000; 20,000; 50,000; 100,000. OK, the answer is about 20,000 symbols, which is quite a lot, because every time you get an error you really don't want to scan that many words; it takes time.
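The Levenshtein calculation described above can be sketched with the classic dynamic-programming table:

```ruby
# Classic dynamic-programming Levenshtein distance: the minimum number
# of single-character insertions, deletions, and substitutions needed
# to turn one string into the other.
def levenshtein(a, b)
  rows = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (0..b.length).each { |j| rows[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [
        rows[i - 1][j] + 1,        # deletion
        rows[i][j - 1] + 1,        # insertion
        rows[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end
  rows[a.length][b.length]
end

levenshtein("starts_with?", "start_with?")  # => 1
levenshtein("first_name", "full_name")      # => 4
```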
As you can see here, the number of methods available on each object is relatively small: for String we have about 126 methods, for Array about 246, and for Hash about 236; a User model has about 600 methods, and a user object has about 400. So when you get an error on an object like a hash, it doesn't have to scan all 20,000 symbols; it can just scan a few hundred methods instead.

The did_you_mean gem uses a pattern called the finder pattern. That means, for example, that when you get a NameError that says "uninitialized constant", it's going to use a constant finder, which only knows about the list of constant names. And if the error is a NoMethodError, it's going to use a method finder, which only knows about the method names you can call. So this is how the did_you_mean gem works; to be precise, this is how the latest version works, which is a bit different from the one that is available on GitHub.

Now we know how it works, but we don't know how accurate it is, because sometimes it doesn't suggest anything, and sometimes it suggests too many methods. How are we going to measure this? We can't just test it with Ruby symbols, because it's hard to collect typing data while you're programming, so I'm going to use existing data that is available on the internet. As the dictionary, I'm going to use the Simple English word list, which only contains essential English words; there is also a full English dictionary, but it has four million words, and I don't want to use it because it takes time. Besides, while programming I don't think we use really hard-to-remember words; you want to keep things simple, which means you mostly use essential words. I'm also going to use a list of correct and incorrect spelling pairs from the Birkbeck spelling error corpus; there was a study, and that data was used by it.
Everything is available on the web, and I'm going to upload these slides later so you can check them out. This is the result of the evaluation. As you can see, the accuracy right now is about 74%, which is actually not high. So why is it low? What kind of names can the spell checker not correct? There are many cases where I remember a method name incorrectly and the current spell checker doesn't catch it. So let's optimize it.

You may have already noticed that sometimes I say "mistype" and sometimes I say "misspell", and they are actually different. A study said that spell correctors that can correct mistypes can't always correct misspellings. It is easy to correct mistypes, because you can just calculate an edit distance with a small threshold: if you make a typo, for example you try to hit A and accidentally hit S, there will be just one character difference, and the checker can correct that mistype. But when it comes to misspellings, you don't remember the method name correctly in the first place: you don't know how to type it, so you try to guess, and the checker doesn't always catch the right one. Another study said that you usually remember the first part of a name correctly; not just method names, names in general. Yesterday Matz called me Nishida-san: he remembered the first part of my name but not the last part, so I guess that's a good example.

Now it's time to use the Jaro-Winkler distance. What is the Jaro-Winkler distance? It is basically the Jaro distance plus a prefix bonus. There is another distance metric called Jaro, and the prefix bonus was added on top of it because you usually remember the first part of a name correctly. If you add a prefix bonus, you can pick up the right one, because it gets the bonus. So how is the Jaro distance calculated?
There are two important metrics, m and t: the first one is the number of matching letters, and the second one is half the number of transpositions. Take a look at this example. Here you can see first_name, and the second string has two characters swapped. To calculate m, we check whether each letter also appears in the other string. Here you can see the arrows, and the question is: does it actually have to scan everything? The answer is no, because there is a matching window. Let's say you have two long strings, and the first one has the character A at the very beginning, while the second one has the character A at the very end. It doesn't make sense to match those two, because they are too far apart, so we don't have to check them. As you can see, every letter here appears in the other string, which means the number of matches m is 10. And there are two transposed letters, which means t is one, because it's half the number of transpositions.

Now that we know m and t, we can calculate the Jaro distance with this formula, (m/|s1| + m/|s2| + (m - t)/m) / 3, and it's going to be 0.966666….

The next thing we have to do is calculate the prefix bonus. To calculate it, we only look at the first four letters of each string; forget about the rest. We check whether each letter matches the one at the same position in the other string. Obviously the first letter matches, but the second one doesn't. Even if that second letter appears in the third position of the other string, we should stop counting as soon as a letter doesn't match. So in this case, the prefix match length l is just one. Then we calculate the bonus with the formula l * w * (1 - j), where w is a weight, usually 0.1, l is the number of prefix matches, which is one here, and j is the Jaro distance. That means the prefix bonus is going to be 0.0033333….
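The walk-through above can be reproduced in code. The pair "first_name" / "frist_name" (two swapped letters, giving m = 10 and t = 1) is my assumption for the strings on the slide:

```ruby
# Jaro similarity: based on m (number of matching letters, found within
# a matching window) and t (half the number of transpositions).
def jaro(s1, s2)
  return 1.0 if s1 == s2
  window = [s1.length, s2.length].max / 2 - 1
  window = 0 if window.negative?

  flags = Array.new(s2.length, false)
  matched1 = []
  s1.each_char.with_index do |c, i|
    lo = [0, i - window].max
    hi = [s2.length - 1, i + window].min
    (lo..hi).each do |j|
      next if flags[j] || s2[j] != c
      flags[j] = true
      matched1 << c
      break
    end
  end

  m = matched1.size
  return 0.0 if m.zero?

  matched2 = flags.each_index.select { |j| flags[j] }.map { |j| s2[j] }
  t = matched1.zip(matched2).count { |a, b| a != b } / 2.0
  (m.fdiv(s1.length) + m.fdiv(s2.length) + (m - t) / m) / 3
end

# Jaro-Winkler: Jaro plus a bonus for a matching prefix (up to 4 letters).
def jaro_winkler(s1, s2, weight = 0.1)
  j = jaro(s1, s2)
  prefix = 0
  s1.each_char.with_index do |c, i|
    break if i == 4 || c != s2[i]
    prefix += 1
  end
  j + prefix * weight * (1 - j)
end

jaro("first_name", "frist_name").round(4)         # => 0.9667
jaro_winkler("first_name", "frist_name").round(2) # => 0.97
```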
Since the Jaro-Winkler distance is just the Jaro distance plus the prefix bonus, we can combine the two, and we get 0.97. As you can see, the two strings are pretty close, which is why we get a value really close to one. If the two strings are identical, the Jaro-Winkler distance is exactly one.

Now let's talk about misspelling correction in the did_you_mean gem. It uses the Jaro-Winkler distance and picks the closest word, but only if no mistype corrections were found. In addition, the Levenshtein distance still has to be lower than the length of the shorter word, because the Jaro-Winkler distance can be really high even when the Levenshtein distance is also really high. So the Levenshtein distance should be reasonably low; otherwise it's going to suggest something unrelated.

Now let's redo the evaluation. We can use the same script and see how much it has improved. It's actually better: the accuracy increased by about 7%, so it is now about 80% accurate, which is great. But it is also true that about 20% of the time it is wrong. So which corrections didn't go well? This is one of them: "phase" and "faze". The reason they get confused is that they sound the same: if you hear "phase", you might think it is spelled f-a-z-e. That's wrong, but neither the Levenshtein nor the Jaro-Winkler distance can catch it, because the edit distance is not low enough and the first character is different. Other examples are "female" and "email", and "night" and "unite"; same problem. And the last one is interesting, because it always happens to me: you really don't know whether it is an S or a C, and how many S's or C's you have to type. It's really confusing. As you can see, most of the errors come from pairs that sound quite similar but have different letters, like C and S, or PH and F.
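One way to catch these sound-alike pairs is a phonetic key in the spirit of Soundex or Metaphone: normalize spelling patterns that sound the same, then compare the keys. The rewrite rules below are a made-up toy subset, just to show the idea:

```ruby
# Toy pronunciation key: map spellings that sound alike to one form.
# (These rules are illustrative only; real phonetic algorithms such as
# Metaphone have far more complete rule sets.)
def pronunciation_key(word)
  word.downcase
      .sub(/\Akn/, "n")            # silent k:  knight -> night
      .gsub("ph", "f")             # ph as f:   phase  -> fase
      .gsub(/([aeiou])gh/, '\1')   # silent gh: night  -> nit
      .tr("z", "s")                # z as s:    faze   -> fase
      .squeeze                     # collapse doubled letters
      .sub(/e\z/, "")              # drop silent final e
end

pronunciation_key("phase") == pronunciation_key("faze")    # => true
pronunciation_key("night") == pronunciation_key("knight")  # => true
```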
In other words, if I want to improve the did_you_mean gem even more, I should probably apply a pronunciation-based optimization.

Now let's talk about writing a finder. This is the last section of this talk, but it's really great. The reason you might want to write a finder is this: let's say you're writing a Rails app. You use Active Record, you mistype an attribute name, and you get an UnknownAttributeError. But did_you_mean doesn't correct your mistake, because it's not a NameError and it's not a NoMethodError, so the gem doesn't know how to correct it. I want something like this: I make a mistake in the attributes hash, and it should suggest the right name so that I can easily see that I'm doing something wrong. As I said earlier, the did_you_mean gem ships with a couple of finders by default, but you can also add a new one if you want, which is great. So let's implement it.

Here you can see a class called AttributeNameFinder, which includes DidYouMean::BaseFinder. I'm actually not sure if that's a good name; I should probably change it if I come up with a better one. What you really have to implement is just two methods: the initializer and the searches method. The initializer takes an exception object, and from it you can grab things like a binding object and the original message. What's important here is that you really have to use original_message, because this finder is evaluated while the exception is generating its message. If you call message instead, you'll get a stack overflow: it tries to generate the message, which calls the finder, which calls message again, and so on. So it's really important to use original_message. The searches method should return a hash where the key is the user input and the value is a dictionary. Here the class also has to respond to attribute_name and column_names, which you can implement like this.
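Reconstructed from the description above as a sketch. The finder API in the gem at that time (BaseFinder, the searches contract, the finder registry) may differ in detail from what is shown here, so a tiny Levenshtein-based stand-in for BaseFinder is included to keep the example self-contained and runnable:

```ruby
# Stand-in for the gem's base finder module: it turns the searches
# hash ({ user_input => dictionary }) into a list of suggestions.
# NOTE: this module and the names below are assumptions sketched from
# the talk; the actual did_you_mean internals may differ.
module BaseFinder
  def suggestions
    searches.flat_map do |input, dictionary|
      dictionary.select { |word| levenshtein(input.to_s, word.to_s) <= 2 }
    end
  end

  private

  # Compact row-by-row Levenshtein distance.
  def levenshtein(a, b)
    row = (0..b.length).to_a
    a.each_char.with_index(1) do |ca, i|
      prev_row, row = row, [i]
      b.each_char.with_index(1) do |cb, j|
        cost = ca == cb ? 0 : 1
        row << [prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + cost].min
      end
    end
    row.last
  end
end

# The finder itself: just two methods, the initializer and searches.
class AttributeNameFinder
  include BaseFinder

  def initialize(exception)
    # Use original_message, never message: this finder runs while the
    # error message is being generated, so calling message here would
    # recurse back into the finder and blow the stack.
    @attribute_name = exception.original_message[/attribute '(\w+)'/, 1]
    @model          = exception.model
  end

  # Maps each user input to the dictionary it should be checked against.
  def searches
    { @attribute_name => @model.column_names }
  end
end

# Hypothetical usage, with structs standing in for the real error
# object and Active Record model:
ErrorStub = Struct.new(:original_message, :model)
ModelStub = Struct.new(:column_names)

error = ErrorStub.new("unknown attribute 'first_naem' for User.",
                      ModelStub.new(%w[id first_name last_name]))
AttributeNameFinder.new(error).suggestions  # => ["first_name"]
```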
attribute_name comes from the original message, and column_names comes from the model's column_names. Here, the receiver is actually an Active Record class, and if you call column_names on it, you get the list of column names. And don't forget to add your new finder to the list of finders. So before, we got an error like this, but after you implement the finder, you get a suggestion like this, which is great. It is available on GitHub, so check it out.

As Matz said yesterday, the gem is going to be bundled when Ruby 2.3 comes out. But there are still a lot of things I have to do, like removing support for the other Ruby implementations, JRuby and Rubinius. If it is going to be bundled as part of Ruby, it shouldn't know about JRuby, it shouldn't know about Rubinius, it shouldn't know about old Rubies, and it shouldn't know about Rails, Bundler, or RubyGems. The next thing I have to do is stop monkey patching. Right now, the did_you_mean gem does monkey patching, and it also has a C extension. I'm expecting the next version of Ruby to include what the extension provides, so that, hopefully, I won't have to do monkey patching anymore. So yeah, there is still a lot to do, but hopefully I can ship it with the next version of Ruby. And one last thing I want to tell you today: the did_you_mean gem totally works with modules, too. That's pretty much it. Thank you so much.

We have time for a couple of quick questions. If you have any, please come up to the mic.

Hi, thanks for the talk. I have a question. I'm trying to implement TF-IDF. The problem I face is with the IDF, where I'm trying to look for a corpus with document frequencies. And if I get it, since it's quite large, what data format would be best to store it in so I can query it quickly? Can you say that again? So your question is: you want to implement the finder, but you want to change the format?
I'm trying to implement TF-IDF, which tries to find the importance of a word in a document. So I'm trying to find a good corpus, and if I find one, how would I best store it so that I can query it quickly? I don't know, actually. What I can think of is to implement something like a Vim plugin or an Emacs plugin that automatically captures what you type and sends it somewhere, so that you can collect what you typed and what you actually mistyped or misspelled. It's a difficult question, because I used a corpus that is available on the internet, but it was collected back around 1980, and it could be really old. So yeah, the evaluation script is not actually good enough.