So, thank you very much for joining my session after lunch. Welcome. This talk will be beginner friendly, so I won't dive deep into the details of how the package works. If you are into something really technical, I'm sorry, it may be lacking, but I'll tell you how I use the package, what the theory behind it is, and some tips for people who may be using it in the future. It's, you know, PyData style. So who here would identify themselves as data scientists or analysts working with data? Yeah, great. I hope you will find it useful. That's my Twitter there and my GitHub, and you will find the code on GitHub. Also remember to tag EuroPython, and please rate my talk afterwards. Thank you.
I'm Cheuk, or you can call me Cherry, as my friends do. I'm working at Hotelbeds, and I'm organizing the AI Club for Gender Minorities in London, where I'm based. We are trying to promote gender equality in technology. We're trying to encourage and empower gender minorities, like many women who are struggling and feel there is not enough support in the community, so we try to help. I am also a member of Python Sprints. If you are coming from London like me, next week we will have a sprint for pandas documentation, which is for gender minorities. So if you are a minority like me and you want to start contributing to pandas, you are welcome, and we will have other events for other libraries as well. If you want to contribute and you don't know how to get started, you're welcome to join, and if you're very experienced, you're welcome to help us as well. Okay, all of those things are clear, so: why are we doing this?
Matching company names, string matching. I was having a problem at work: there's a list of companies, which are my company's clients, and I want to find the similar names in it. But of course it's not limited to my company. Why do we want to match names? I had a funny encounter on Facebook, because somewhere in China I saw these pictures. Looks familiar, right? Okay, what about this one? It's giving credit to a company in the USA, obviously. And this one, okay, I have to skip it quickly because it's a very nasty typo. So if I have a list of companies and some of them look like those, I want to see: are they the coffee shop where we drink coffee and buy cakes? So can you tell me what the result of those would be? Yeah, all false. So I can't do it like that; everybody knows that.
So maybe I can be smart. Maybe I can use some string methods. Who knows what string method I could use to make the first one become true, to do some modification? Yeah, upper or lower. How about the second one? There are different ways, but I found a way to do it, which I'll show you. And then the last one, let's talk about the last one. I will show you what I could think of: I replace the space and I just trim the S. It's a bit stubborn, right? What if it's not "Starbucks" but "Starbucks" with an exclamation mark? Then it doesn't work. So we need something else to find all these similar strings. This is called fuzzy matching. It's just a funny name; "fuzzy" makes you think of, you know, animals. But yeah, that's Wikipedia's definition. It's too long.
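As a rough illustration of why exact comparison and hand-written string clean-up only go so far, here is a minimal sketch. The look-alike names and the `normalise` helper are made up for this example; they are not the names shown on the slides.

```python
def normalise(name):
    """Hypothetical clean-up: lower-case, drop spaces, trim one trailing 's'."""
    cleaned = name.lower().replace(" ", "")
    return cleaned[:-1] if cleaned.endswith("s") else cleaned


reference = "Starbucks"
# Made-up look-alike names, for illustration only.
lookalikes = ["starbucks", "Star Bucks", "Starbuck", "Starbucks!"]

for name in lookalikes:
    exact = name == reference
    cleaned = normalise(name) == normalise(reference)
    print(f"{name!r}: exact={exact}, after clean-up={cleaned}")
```

Every exact comparison is false. The clean-up rescues the first three names but not the one with the exclamation mark, and each new variation would need yet another rule.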
I just don't like to read long paragraphs, but basically what it means is matching two strings that are not exactly the same, where we want to find a way to score how similar they are.
Now we have something very smart coming up. This is named after a Russian scientist. I hope I pronounce it right; if you speak Russian you can tell me. Levenshtein, Levenshtein distance. I'm probably still saying it wrong, but forgive me. Basically there will be a number, an integer, that says how many alterations I have to make to change string A into string B. If I delete one character, that is one change. If I change a character, for example from an "a" to an "e", that is also one alteration, and the same goes for inserting something. So the bigger the Levenshtein distance, the more different the two strings are. It sounds intuitive, right?
So how can I do it in a program? With dynamic programming. It sounds smart, right? I got this picture; it's not my picture, so I should give credit: it's from GitHub. There's a JavaScript library that implemented this, but of course I'm using Python, so I'm not talking about that package. If you find my slides you can click on it and have a look at their GitHub repo. So this is the graph, or matrix, I don't know what you should call it. You can see that the top left-hand corner is zero because there's no change. Then we go from one character to one character: the first characters are the same, so basically you don't need to change anything.
So they will be the same, both "S". But if I go one step to the right, I'm adding one letter towards "Saturday". If I spell out "Saturday", I add one letter each time, so you can see the first line goes one, two, three, four, five, six, seven, eight. If I go down, it's adding letters towards "Sunday", and you can see that as well: it goes from zero up to the full word.
What if I have two characters? For example, from "Saturday" I have "Sa", and I want to change it towards "Sunday", so it should be "Su". I have to change the "a" to the "u", so if you look at the intersection of the "a" and the "u", there is a one: you have to change one character. And of course it's dynamic programming: first you have the small problem, then you expand it until the end, and at the end you find that the minimal change is three. Then you can work your way back up and find the minimum path of changes that turns "Saturday" into "Sunday". You can see there are some substitutions and some deletions, which you can read from the graph.
I'm not talking about that repo because it is in JavaScript. I'm using Python, so I can use FuzzyWuzzy, which is also a funny name, which I love. What I really like about it is that it doesn't just use the distance to compare two strings; it also provides some very useful features. For example the simple ratio, which is the basic one. That one is not magic.
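The table walked through above for "Saturday" and "Sunday" can be reproduced with a short dynamic-programming sketch. This is a plain textbook version written for illustration, not the code of the JavaScript library on the slide or of FuzzyWuzzy; the function names are our own.

```python
def levenshtein_table(row_word, column_word):
    """Build the full edit-distance table between two words."""
    rows, cols = len(row_word) + 1, len(column_word) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete i letters to reach the empty string
    for j in range(cols):
        table[0][j] = j  # insert j letters starting from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if row_word[i - 1] == column_word[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return table


def levenshtein(a, b):
    """The distance is the bottom-right cell of the table."""
    return levenshtein_table(a, b)[-1][-1]


table = levenshtein_table("Sunday", "Saturday")
print(table[0])      # first row: one step per letter of "Saturday"
print(table[2][2])   # "Su" versus "Sa"
print(levenshtein("Sunday", "Saturday"))
```

The first row counts up from 0 to 8, the cell for "Su" against "Sa" is 1, and the bottom-right cell is 3, matching the minimal change described in the talk.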
That's basic. But we can also have the partial ratio. For example, my name is Cheuk, but if you include my middle name it's Cheuk Ting. So are "Cheuk" and "Cheuk Ting" the same person? Are they both me? If I use the partial ratio they would be the same, because "Cheuk Ting" also contains "Cheuk", which is also my name. That gives you a score of a hundred, so they are the same. But if you use the simple ratio they won't be the same, because you have to add "Ting", about four characters, to get my name, so that gives you a lower FuzzyWuzzy matching score.
We also have the token sort ratio, which means that I tokenize each word and then the words can change order. For example, my name is Cheuk Ting Ho, where Ho is my surname. But in Chinese we put the surname first, so my name would actually be Ho Cheuk Ting. It is the same person, so I have to use the token sort ratio: the three names are the same, just the order has changed, because Western culture usually puts your given name first and then your surname, but in Chinese it's different. So it's still the same; it will give you a hundred.
And then the token set ratio. For example, you usually skip the middle name, so if I put my first name and my last name it will be Cheuk Ho. Is that the same as me? The token set ratio works on the tokens, and if one is a subset of the other it will also pass; it will also give you a score of a hundred. So that's very useful. Depending on the use case, it can really help you out. You can check out the GitHub repo. It's done by SeatGeek. It's a very popular library.
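To make the difference between the four scores concrete, here is a simplified sketch built only on Python's standard `difflib`. These helpers are stand-ins written for illustration; they are not FuzzyWuzzy's own code, and the library (where the corresponding functions are `fuzz.ratio`, `fuzz.partial_ratio`, `fuzz.token_sort_ratio` and `fuzz.token_set_ratio`) may return different numbers for inputs that do not match exactly. The name spelling used below is our assumption, since the recording is not clear.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Overall similarity of the two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best score of the shorter string against any equally long slice of the longer."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


def token_sort_ratio(a, b):
    """Compare after lower-casing and sorting the words, so word order is ignored."""
    return simple_ratio(
        " ".join(sorted(a.lower().split())),
        " ".join(sorted(b.lower().split())),
    )


def token_set_ratio(a, b):
    """Compare the shared words against each full word set; a subset scores 100."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(words_a & words_b))
    full_a = (common + " " + " ".join(sorted(words_a - words_b))).strip()
    full_b = (common + " " + " ".join(sorted(words_b - words_a))).strip()
    return max(
        simple_ratio(common, full_a),
        simple_ratio(common, full_b),
        simple_ratio(full_a, full_b),
    )


print("simple:    ", simple_ratio("Cheuk", "Cheuk Ting"))
print("partial:   ", partial_ratio("Cheuk", "Cheuk Ting"))
print("token sort:", token_sort_ratio("Cheuk Ting Ho", "Ho Cheuk Ting"))
print("token set: ", token_set_ratio("Cheuk Ho", "Cheuk Ting Ho"))
```

In this sketch the simple ratio stays below 100 for "Cheuk" against "Cheuk Ting", while the partial, token sort and token set comparisons each reach 100 for the cases described in the talk.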
I used it and I really like it, so you can go there and check it out. They also have a blog post about the difference between these four.
Okay, so my use case. I have the company data, but of course I won't use my company's client data because that's confidential. So I downloaded an example data set from an open-licence public database: all UK companies with limited liability have to make this information public, so you can download it. There are a lot of companies in this country, so I only use Cambridge, because there are a lot of startups there, which is very exciting. But still there are a lot of companies in that list, around 15k, so quite a lot.
Okay, so trick one. If I just use FuzzyWuzzy and check all these names against each other, a lot of them will have a high score, because there are some words that all companies use. I can have my own, you know, "Cheuk Ting Company" or whatever, and everybody can set up a company with "Company" in the name. It's a very common word, and "Limited" is very common as well. So matching on those is less meaningful. For example, my company is doing travel, hotel rooms and so on, so a lot of my clients have "travel" in the name, which is less meaningful. So I use a small trick: I build a count dictionary and look at the 30 most common words. This is also useful for NLP work. So for counting things, yeah, very convenient.
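A minimal sketch of the common-word count described above, assuming the names are plain strings and words are split on whitespace. The company names are made up, and `most_common_words` is a hypothetical helper, not code from the speaker's notebook.

```python
from collections import Counter


def most_common_words(company_names, top_n=30):
    """Return the top_n most frequent words across all names, ignoring case."""
    counts = Counter(
        word
        for name in company_names
        for word in name.upper().split()
    )
    return [word for word, _ in counts.most_common(top_n)]


# Made-up company names, for illustration only.
names = [
    "CAMBRIDGE WIDGETS LIMITED",
    "Fenland Travel Limited",
    "RIVER CAM TRAVEL COMPANY LIMITED",
]
print(most_common_words(names, top_n=2))
```

With these three names the two most frequent words are "LIMITED" and "TRAVEL", the kind of words that inflate scores without saying much about whether two companies are the same.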
I just used the same idea to do it.
Then another thing: I came across a problem because, remember, there are 15, almost 16 thousand companies. If we match every single one of them with all the others, that is a lot to compare. I don't want this to become a big project, right? I don't want a GPU and all that, calculating for a day. No, that's not going to work. And, you know, Python is not super fast. So I use another trick. I want to match company names that are highly similar, so I just assume that people won't make a mistake on the first character. For example, somebody creates an account and types in a company name but makes a typo: "Oh, it's wrong, I will open another account with the right name." So one person opens two accounts with highly similar names because they made a mistake on the first one. I just assume they won't make the mistake in the first character, because that's very obvious: usually you type something, you have a look at it, and if the first character is wrong you catch it. That's the trick, and it's good enough for me.
Remember I found the 30 most common words? When I do the fuzzy matching I deduct 10 points, like in a game: because you have this word, I deduct points from you. I check the score at the end to see if it's high, and if a pair has a high score just because they share a common word, it's not valid. So I have to deduct from the score to balance it out. Is it a very good trick? It may not be, but it works, and it's very simple and easy to implement. That's why I do it.
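The two tricks above, comparing only names that share a first character and deducting points for common words, could be sketched as follows. This is an illustration under assumptions: `base_score` is a `difflib` stand-in for whichever fuzzy score is used, and deducting 10 points per shared common word is our reading of the talk, not a confirmed detail of the speaker's notebook.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(names):
    """Yield only pairs of names that start with the same character."""
    buckets = defaultdict(list)
    for name in names:
        if name:
            buckets[name[0].upper()].append(name)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def base_score(a, b):
    """Stand-in 0-100 similarity; a fuzzy-matching library score would go here."""
    return round(100 * SequenceMatcher(None, a.upper(), b.upper()).ratio())


def adjusted_score(a, b, common_words, penalty=10):
    """Deduct `penalty` points for every common word that both names contain."""
    shared = set(a.upper().split()) & set(b.upper().split()) & set(common_words)
    return base_score(a, b) - penalty * len(shared)


# Made-up names, for illustration only.
names = ["Starbucks Limited", "Starbuck Limited", "Costa Limited"]
for a, b in candidate_pairs(names):
    print(a, "|", b, "|", adjusted_score(a, b, {"LIMITED"}))
```

Only the two names starting with "S" are compared at all. For roughly 16,000 names, comparing every pair would mean well over 100 million comparisons, so grouping by first character removes most of them, at the cost of missing typos in the first character.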
I think in a lot of cases it will work. It sounds very simple, but it works. Also, I chose to use the token sort ratio, because if somebody types in the name with the words swapped, the two should still be highly similar. I'm considering it word by word, so I chose that one.
Okay, so at the end, what did I find in that data? You can go to my GitHub and check it out; it's just a very simple Jupyter notebook, an example that you can refer back to. The rule is that if a pair's score is more than 85, the two names are considered the same. So what are they? Usually they are spelling mistakes: two names that differ by just an "s", like "Starbuck" and "Starbucks", or that have one less "l" or one less "i", things like that. It's very easy to make that kind of mistake, because an "i" and an "l" are easy to miss when you look at them, so it kind of makes sense. It may not always be a human typo, but I suspect it is. And number three is a "number two": there I would suspect somebody having multiple accounts. For example, if I signed up for an account last week with my name as the username, and this week I want another one, I'll just put a "2" at the end. That's the logic. There could also be, for example, an abbreviation; there could be some changes.
That could be intentional, so I wouldn't suspect that it is definitely a human error. And the funny thing I found is that some names look similar, like the ones I have here, but they don't just differ by one character. They could just be two companies that coincidentally have very similar names. That's the interesting kind that I would love to investigate. If they were my clients, I would maybe ask my colleagues to talk to them: do the two accounts both belong to you, or does somebody coincidentally have a similar name? So we need some human checking at the end.
But I can show you that it greatly reduces the work we have to do to find out whether it's a typo or two totally different clients. After applying the matching, the ones that got caught as highly similar are under 1% of the total. Of course my colleagues won't call all of the clients, but even just looking at them you would otherwise have to go through the long list, 15,889 of them. Now I just need to look at 57 of them, and maybe investigate those 57. So it really reduces the work that you or your colleagues have to do. I think it's a really good trick.
So, happy matching. I think I have a lot of time left, because I went through it very quickly, so you can just ask questions. Thank you very much.
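The final filtering step described in the talk, keeping only pairs that score more than 85 for human checking, might look like this. The scored pairs are made up and `flag_for_review` is a hypothetical helper; only the threshold of 85 and the figures 57 and 15,889 come from the talk.

```python
def flag_for_review(scored_pairs, threshold=85):
    """Keep the (name_a, name_b, score) tuples whose score exceeds the threshold."""
    return [pair for pair in scored_pairs if pair[2] > threshold]


# Made-up scores, for illustration only.
scored_pairs = [
    ("Starbucks Limited", "Starbuck Limited", 97),
    ("Fenland Travel", "Fenland Travel 2", 93),
    ("Cam Widgets", "Cam Gadgets", 73),
]
for name_a, name_b, score in flag_for_review(scored_pairs):
    print(f"check by hand: {name_a!r} vs {name_b!r} (score {score})")

# Share of the list left to inspect, using the figures quoted in the talk.
print(f"{57 / 15889:.2%}")
```

The last line prints 0.36%, which is the share of the 15,889 entries that still needs a human look when 57 are flagged.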
Yeah, so we have time for questions.
So FuzzyWuzzy uses four metrics to estimate whether it's a match or not. What are, let's say, the coefficients or the priority of these four metrics?
You mean like the token sort, the order? Yeah, so I can actually show what they are doing. It returns a score at the end, whether it's a perfect match or not. There are examples; for example here, the token sort matching, you can see, let me make it bigger.
Yes, but how does it eventually generate the final score from those four metrics?
Basically, if it's just the simple ratio, it will be the Levenshtein distance, normalized to give a score. I haven't really checked the code for this, but it will give you a hundred if it's a total match, as you can see here. If it's the partial ratio, it will give you a hundred if the part totally matches. It's by proportion as well: if you have a longer phrase, the tolerance for differences is higher, because it's by proportion.
Okay, so you pick which one you want to use; it's not that they eventually combine the four to generate one score?
Yeah, you can choose which logic you want to use. For example, if I applied this with the simple ratio, it wouldn't be a perfect match, right? So it's the logic that you choose.
Thank you.
Okay, we have another question.
Hello. I was curious: Levenshtein only uses the letters, so does it work equally well in all languages, or are there languages it has problems with? And do people use techniques like combining it with a dictionary from a language to get better results?
Actually, it doesn't care what language it is.
The idea is that it checks characters, so if it's not in English, if it's in French, it doesn't matter. If you have a word with one character of difference, it will give you a Levenshtein distance of one, because there's one character different. It doesn't matter whether it's a change of that character, a deletion, or adding one character.
This is more of a "what if" question. Suppose I've got two super big tables, say World of Warcraft players from one server and from another server, and I want to know if there might be a match, and let's assume these names are fuzzy. Not the best example, but let's assume. You would then maybe do a full join and check whether they're FuzzyWuzzy-matchable, but this won't scale very well unless you use some sort of heuristic. You already mentioned that you can take the first letter as a trick to make things faster. Are there other tricks?
For two tables, matching the names, I haven't thought about anything else at this level yet. But these four ratios already help a lot: for example, if two names have their words changed in order, that is already sorted out for you. If you want it to be super quick, other than requiring the first character to match, maybe you can check the length as well. But there is a limitation: if it's not a change of character but adding or removing one character, the lengths differ, so you can't simply check for equal length. For me, my trick is more or less like a patch, a quick fix. I get things done quickly.
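For the two-table case raised in this question, the first-character check and a length check can be combined into a cheap pre-filter. This is only a sketch: the player names are made up, `candidate_pairs` is a hypothetical helper, and allowing a small length gap (rather than requiring equal length) is our way of handling the limitation mentioned in the answer. It relies on the fact that the Levenshtein distance between two strings is never smaller than the difference in their lengths.

```python
from collections import defaultdict


def candidate_pairs(table_a, table_b, max_length_gap=2):
    """Yield cross-table pairs sharing a first letter and with similar lengths."""
    by_initial = defaultdict(list)
    for name in table_b:
        if name:
            by_initial[name[0].lower()].append(name)
    for name in table_a:
        if not name:
            continue
        for other in by_initial.get(name[0].lower(), []):
            # A length gap above max_length_gap implies an edit distance above it too.
            if abs(len(name) - len(other)) <= max_length_gap:
                yield name, other


# Made-up player names, for illustration only.
server_one = ["Thrall", "Jaina"]
server_two = ["Thral", "Jainaproudmoore", "Sylvanas"]
print(list(candidate_pairs(server_one, server_two)))
```

Only the pair "Thrall" and "Thral" survives the filter here, so only that pair would need the more expensive fuzzy score.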
So, yeah. Okay, another question.
Hi. Here we are comparing one to one. What if you want to compare one to a set? Let's say you have 10,000 names and you want to figure out how rare or unique a specific name is compared to the whole set. Would you also use FuzzyWuzzy, or would you go with something else?
So if I have one name and I have to match it against the other thousands to see if there's one that is very similar, is that what you mean?
No, actually: how would you measure how unique this name is compared to the whole data set?
Okay. It depends on which ratio you choose; the logic will be different. For example, here I chose the token sort ratio, so if the name's word order changes I still consider them the same. I can apply it because every single pair has a score. So if, for example, 10 out of the 10,000 got a score higher than 90, then I would say it's not unique. All of them need to score below a certain threshold before you can say that it's different from everything else.
So before, you said you picked 85, if I remember correctly, as the threshold?
Yeah, that's the threshold.
So how do you know that matches with a score below 85 can be ignored? You don't consider anything below 85. Did you run any tests to choose this parameter, or how did you pick it?
Yeah, I did run it a couple of times. I didn't show it, but with 85 it gives me 57 pairs that are considered the same, as matching, right?
So I look at them. But I can also change it, for example to 75, and then it will give me, let's say, 200, and I can compare the two. Or if I have prior knowledge that the mistakes won't be more than, let's say, 1%, then I can look at the number that got matched and compare it to the total: is it less than 1% or more than 1%? So I still need to fine-tune it. It's not absolute.
Okay, here we have another question.
Hi. Did you ever try it on longer strings? Could you, for example, find a company name in a large text? Would the library scale to that?
A company with a...?
So if you have a large text and you want to know if a company name appears in this text, do you think you could use the library for that?
Yeah, of course, if it's a not-so-big paragraph. Maybe you can use a window and slide through, and then do a partial ratio to see if the company name appears in that window. I haven't tried that, though. But I do think a sliding window would work: if you have a paragraph and your company name consists of, say, three words, then you can have a window of five words, slide it through, and see which windows have a high partial ratio. Then you can pick that chunk out and know that it's a match.
And so, the last question.
Yes. Have you tried phonetic algorithms? I think they would work very well for short names, because you only use it for short names, right, not for a long corpus?
Yeah, this one is for short names. So what's the algorithm that you mentioned? Phonetic algorithm, is it?
Oh, yeah, I haven't tried that one. I just tried this fuzzy matching, but I haven't tried that.
Okay, and just to answer: maybe finns plateau or fins methods could be a good match for your question.
I'm afraid we don't have any more time for questions, so let's thank Cheuk once again. Thank you.
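The sliding-window idea from the Q&A, which the speaker said had not been tried, could be sketched like this. Everything here is an assumption made for illustration: the company name and text are made up, `partial_score` is a `difflib` stand-in for a partial ratio, and the window of "name length plus two words" follows the three-word name and five-word window mentioned in the answer.

```python
from difflib import SequenceMatcher


def partial_score(needle, haystack):
    """Best 0-100 score of the shorter string against equally long slices of the longer."""
    needle, haystack = needle.lower(), haystack.lower()
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    width = len(needle)
    return max(
        round(100 * SequenceMatcher(None, needle, haystack[i:i + width]).ratio())
        for i in range(len(haystack) - width + 1)
    )


def find_name_in_text(company_name, text, extra_words=2, threshold=85):
    """Slide a word window over the text and keep chunks that score at or above threshold."""
    words = text.split()
    window = len(company_name.split()) + extra_words
    hits = []
    for start in range(max(1, len(words) - window + 1)):
        chunk = " ".join(words[start:start + window])
        score = partial_score(company_name, chunk)
        if score >= threshold:
            hits.append((score, chunk))
    return hits


# Made-up company name and text, for illustration only.
text = "We had a meeting with Acme Widgets Limited about the new contract yesterday"
for score, chunk in find_name_in_text("Acme Widgets Limited", text):
    print(score, "|", chunk)
```

Windows that contain the whole name score 100, so those chunks can be picked out for a closer look. How well this holds up on long documents or with noisy spellings would need to be tested.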