So, you may be wondering what correlate is. Technically, it's a library that I wrote. It solves a problem that I had, and let me go through what that problem is. First thing: I don't like the words corpora or corpus; I use the term "data set" instead. They mean kind of the same thing. My description of correlate is: correlate matches up values between two data sets which conceptually represent the same data. So imagine that you have two data sets that have some values inside, and there really should be matches between them. Like, this is really the same piece of data as this, but it's in a different format. How do you find matches between those two? You can do it by hand. The idea with correlate is to automate that process.

I realize this is still a little bit abstract, so I think it's best to go through a concrete example. I like to listen to old-time radio shows from the 40s and 50s. This is a list of MP3 files for a detective show called Boston Blackie. I downloaded this off of archive.org, and I really hate these file names. These are terrible file names. You can see what's in them: it's "Boston Blackie", then the date, then the episode number, then the name of the episode with no spaces and uppercase characters and all that sort of thing. And they kind of made up their own titles, too; I don't know what's going on there. Happily, there's very clean data for this sort of thing. This is an episode log of Boston Blackie which has the same episodes with the broadcast dates and much cleaner titles. So what I want to do is take this data and use it to rename the MP3 files, and make them nice clean file names that I'm going to enjoy. And correlate is just the tool to do that. I gave a lightning talk about this exact thing at PyCon this year; I knocked together a quick test case and found some nice examples in there. So this is an example of matching up an MP3 file to an episode.
I created my own episode object off of the episode log. The only thing these two have in common is the date. But correlate said, "this looks like a good match," and it's right: those are the matching episodes. On the other hand, in this one, the dates don't match up at all and the titles aren't really all that similar. But the words "John" and "Davis" both appeared in both places, and so correlate said, "I think these are the matching episodes." And again, it was correct. So correlate really does a little bit of magic figuring out what these matches are.

The important thing to keep in mind, though, is that correlate is not an algorithm. I would describe correlate as a heuristic. Correlate is not provably correct. It's not like merge sort or quicksort, where you run it and you can prove that it produced the correct output when you're done. Correlate is very much a guess. I've spent a lot of time just sort of thinking about the problem and trying to come up with ways to improve the quality of its guesses, but at the end of the day, it is a guess. The quality of its output is going to depend on the quality of the input you put into it. I can't prove that it's correct; it just seems to work okay.

So we're going to start with the basic API. When I started out on this, I thought: okay, how do I even represent this so that I can write a library to solve this problem? Obviously, you import the correlate library and you create a Correlator object. The Correlator object has a bunch of stuff hanging off of it. Datasets A and B: those are the two data sets you're going to stuff your data into. I also put them in a list in case it's convenient for you to iterate over them; I never use it. Print datasets is just for debugging: it dumps the contents of your data sets out to standard out. Sometimes I discover that I've been inputting my data wrong, like I've been breaking up the strings into individual characters.
It's weird; correlate still does a good job even in that case. But occasionally I will discover I was doing it wrong, and print datasets shows you what your data looks like from correlate's perspective. Finally, the correlate method itself does the actual correlation, and we're going to look at that more in a sec.

So the conceptual model for correlate is that you establish your values; these are the items in the two data sets. Correlate really doesn't examine the values at all. All it needs to know is whether two values are the same value or not. So it's doing equality testing, but nothing else. Values don't even have to be hashable. What you do then is establish keys that map to those values, and these are metadata that you have culled out of the values. So, for example, you parse the file name and pull out all the individual strings; I think it's best to lowercase them. I'm also parsing out the date, and I'm putting the date in the same format on both sides. Correlate is going to examine all those keys and find keys that two values have in common, and say: that looks like it might be a good match. In this example, we have "unlucky", "at", "cards", and the date 1945-10-04 in common between the two values. So correlate is going to consider those keys in common to figure out whether or not this is a good match.

Now I'm just going to show you an example of calling the API. This isn't really very useful, but I just want to show you the basic API. I create two values; they're just strings. Again, they could be any Python object; correlate doesn't care what the object is. I'm using strings just so it's nice to present when I print it out at the end. Then you call set on the data set you want to stuff the values into: data set A, data set B. You set a key equal to a value. You can set a key multiple times on a value. You can set a key to multiple values. You can do whatever you like.
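The key-extraction step described above — lowercased words plus a normalized date — might look something like this sketch. The filename format and the helper name are made up for illustration; this is not part of correlate's API:

```python
import re
from datetime import date

def keys_for_filename(filename):
    """Derive correlate-style keys from an MP3 filename: lowercase
    words plus a normalized date. The filename format and this helper
    are made up for illustration; they're not part of correlate.
    """
    keys = set()
    # Pull out a date like 1945-10-04 and normalize it to a date object,
    # so both data sets produce the identical key for the same date.
    m = re.search(r"(\d{4})[-_.]?(\d{2})[-_.]?(\d{2})", filename)
    if m:
        keys.add(date(int(m.group(1)), int(m.group(2)), int(m.group(3))))
    # Break CamelCase runs apart, then keep every lowercased word.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", filename)
    keys.update(word.lower() for word in re.findall(r"[A-Za-z]+", text))
    return keys

keys = keys_for_filename("BostonBlackie_1945-10-04_UnluckyAtCards.mp3")
# → {'boston', 'blackie', 'unlucky', 'at', 'cards', 'mp', date(1945, 10, 4)}
```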
As a matter of fact, setting keys more than once is very interesting to correlate; we'll look at that in a sec. In case you're just iterating over an iterable, like I am here, there's a set keys method which takes the iterable and sets all of its keys on the value, just to save you a little time. Then, when you call correlate, you get a result object. In the result object, the most important thing, of course, is the list of matches, sorted by confidence. Then unused A and unused B are values from the two data sets that correlate couldn't find a good match for. It's just showing you: yeah, I couldn't find a good match for this guy. Statistics is a dictionary mapping strings to numbers, basically timings. The idea is, if you're wondering where correlate is spending all of its time, this is a little bit of internal logging data that'll give you an idea of what's taking so long. And finally there's the match object itself; that's the thing we have a list of in the matches list up at top. It contains the value from data set A that we said matches this value from data set B, and then the score, which is a numeric confidence level in how good a match this is. Now, just to round the bases, I'm going to iterate over the matches. There's only one match, and it's this "Unlucky at Cards" match, and it computed a score of about 3.6. Where did this score come from? How did correlate compute it? We'll talk about that in a sec.

But let's start with the basic algorithm of how correlate does its work. You feed in all of your data, and you call correlate. Correlate, more or less, iterates over every value in data set A, compares it to every value in data set B, looks at all of the keys they have in common, and computes a score. Then it takes that score, that match object, and adds it to a list. Then it sorts that list by score.
Then it starts at the top and does this sort of greedy algorithm, where it says: okay, looking at this match, have I used value A or value B yet? Of course, for the very first one it hasn't used anything yet, so it says: oh, that's a good match, okay. So it stores that in a list it's going to return to you, and then it writes down: I've seen value A and I've seen value B now. Then it goes on to the next one and says: have I seen value A yet? Oh, I have; then this match is toast and I don't use it. It just goes on to the next one. So it's finding all of the matches that contain values that haven't been matched yet and adding those to the list it returns to you. There are flaws in all of this, so we're going to paper over a bunch of these flaws with some additional technology we're getting to later in the talk.

But at the heart of correlate is the scoring algorithm. I had to figure out what a good score was for a particular match, and I'm very lucky in that my intuition was right. The first thing I tried worked really well; then I tried some other things, they didn't work as well, and I went back to the first one. It turned out to be right all along. Let me say one thing here, though. This is dealing specifically with what I call exact keys. In correlate, you can use almost anything as a key: strings, integers, floats, datetime objects, custom objects if you want to. Those are all valid keys. There's a special kind of key in correlate called a fuzzy key, used for fuzzy comparisons. And just as a piece of terminology, any key that is not a fuzzy key, I call an exact key. So we're going to look at the scoring algorithm for exact keys now, and we'll get to fuzzy keys later.
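The sort-then-greedy selection described above can be sketched in a few lines. This is a standalone illustration, not correlate's actual implementation, and it assumes the pair scoring has already happened:

```python
def greedy_match(scored_pairs):
    """Greedy selection over scored candidate matches.

    scored_pairs: (score, value_a, value_b) for every candidate pair,
    scores already computed. Sort best-first, then keep a match only if
    neither of its values was claimed by an earlier, higher-scoring
    match.
    """
    kept = []
    seen_a, seen_b = set(), set()
    for score, a, b in sorted(scored_pairs, key=lambda t: t[0], reverse=True):
        if a in seen_a or b in seen_b:
            continue  # a better match already used this value; it's toast
        kept.append((score, a, b))
        seen_a.add(a)
        seen_b.add(b)
    return kept

pairs = [(3.6, "file1", "ep1"), (2.0, "file1", "ep2"), (1.5, "file2", "ep2")]
result = greedy_match(pairs)
# → [(3.6, 'file1', 'ep1'), (1.5, 'file2', 'ep2')]
```

The 2.0 match loses because the 3.6 match already claimed "file1".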
What I tried to do with my scoring algorithm: if two values have a key in common, I wanted to score more highly if the key was rare than if the key was common. So I had to figure out how to represent that. Ultimately, what I did was: count the number of times this key has been used in data set A, count the number of times it was used in data set B, multiply those numbers together, and that becomes our divisor. As an example, in my Boston Blackie data, the word "the" is used 75 times in the MP3 file names and 136 times in data set B, the episode list. So we multiply those together, and that's our divisor. That turns into a number that is about one ten-thousandth of a point, which is almost no signal. The fact that two values have the word "the" in common doesn't tell us a lot, and the score now reflects that. On the other hand, the word "jewel" doesn't come up very often: it's used three times in the MP3 file names and only once in the episode log. So if two values have the word "jewel" in common, that scores a third of a point, which is a lot more. That's a lot more signal, a lot more evidence that these two are a good match. And then the word "atkins", which is a last name, has only one appearance in each data set. If two values have that key in common, it's very likely that's a good match, and so it gets a huge score boost. To show you the example: the Atkins jewel thief match is a good match because the two values have the word "atkins" in common and the word "jewel" in common, and of course the broadcast date matches as well. In this other case, though, the word "jewel" didn't help, because we spelled it differently: it's "The Winthrop Jewel Robberies" versus "The Winthrop Jewelry Company Thefts".
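That divisor rule can be sketched as follows. This is an illustration of the scoring idea, using the usage counts quoted in the talk; it is not correlate's actual code:

```python
from collections import Counter

def exact_key_score(keys_a, keys_b, counts_a, counts_b):
    """Score the exact keys two values have in common.

    counts_a / counts_b: how many times each key is used across all of
    dataset A / dataset B. A shared key contributes 1 divided by the
    product of its two usage counts, so rare keys dominate the score.
    """
    score = 0.0
    for key in set(keys_a) & set(keys_b):
        score += 1.0 / (counts_a[key] * counts_b[key])
    return score

# Usage counts from the talk's Boston Blackie example:
counts_a = Counter({"the": 75, "jewel": 3, "atkins": 1})
counts_b = Counter({"the": 136, "jewel": 1, "atkins": 1})

score_the = exact_key_score({"the"}, {"the"}, counts_a, counts_b)
# ~1/10200 of a point: almost no signal
score_atkins = exact_key_score({"atkins"}, {"atkins"}, counts_a, counts_b)
# a full point: strong evidence of a match
```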
So comparing exact keys kind of failed us here, because we use "jewelry" in one place and "jewel" in the other. This is the sort of thing fuzzy keys are good at; we're going to talk about that. So already, with just that basic scoring algorithm and that basic greedy algorithm, correlate was working pretty well. But I thought about the problem a lot. This was actually my early-COVID coronavirus project, so I had a lot of time to myself; I would just sit on my mountaintop contemplating this problem. And I thought of a problem.

Let's consider this scenario. We have a value in data set A, and we map one key to it: "breakin". For you younger folks, Breakin' was the name of a movie in the 80s about break-dancing. On value B1 in data set B, we also map the key "breakin". But data set A also contains Electric Boogaloo, the sequel to Breakin', and we map the key "breakin" to it too. Now, what's the problem here? The problem is that both of these are going to score the same, because all we're doing is looking at the keys that match; we don't consider anything else in the scoring algorithm so far. So the idea was: okay, how do we prefer the top one over the bottom one? Because clearly, looking at it, the top one is a much better match than the bottom one. How do I highlight that? What I resolved to do was: I count the number of keys that matched, I divide it by the total number of keys mapped to that value on each side, and I multiply those two ratios together. That becomes a bonus I add to the score after I'm done adding up all the scores for the keys. So in this example, the top one gets a bonus of basically one point, and the bottom one gets a bonus of a quarter of a point, because it only has one out of four keys matched on value A2. That inflates the score of the top match, and we prefer that one. And again, we get the correct answer, so crisis averted.
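The bonus described above might look like this; again an illustrative sketch, with made-up key sets:

```python
def coverage_bonus(keys_a, keys_b):
    """Bonus preferring matches that account for more of each value's keys.

    Count the keys in common, divide by each side's total key count,
    and multiply the two ratios together.
    """
    matched = len(set(keys_a) & set(keys_b))
    return (matched / len(keys_a)) * (matched / len(keys_b))

# Both values have exactly one key and it matches: a full one-point bonus.
coverage_bonus({"breakin"}, {"breakin"})                                # 1.0
# Only one of the sequel's four keys matched: a quarter-point bonus.
coverage_bonus({"breakin", "2", "electric", "boogaloo"}, {"breakin"})   # 0.25
```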
Now, the thing, again, with correlate is that it is a guess; it is not an algorithm. It's dealing with messy data, and you kind of have to work with it a little bit. So there are some things you can do to increase your odds of getting good matches out of it. The first, and probably most important, is setting a minimum score. Consider: you have two data sets, and they are a perfect match for each other; every value in each data set has a corresponding value on the other side. In that case, correlate is already going to do a good job. If one is a subset of the other, where, again, every single item in the smaller one has an exact match in the larger one, correlate is going to do a great job. But what if you have some values in one or both data sets that just don't have a corresponding match on the other side? Well, correlate, bless its little heart, really wants to do well by you, and it's going to try to find matches even when those matches cannot exist. So it's going to find matches that are arguably wrong; they're just bad matches. What can you do about it?

Here's my observation. This is a graph of the scores of all the matches in my Boston Blackie data set, and you'll notice there's a huge drop-off at the end. In point of fact, it looks like there's a drop-off after about a score of 4.0, but it's not until we get to the very bitter end, a score below one, where the matches actually get terrible. What tends to happen is there's an inflection point. You start at the beginning of the list, and they're all great matches, and good matches; then, at a certain point, you hit something, and suddenly none of the matches are good. They're all junk. And there's an inflection point in the score there. In this case, it happens somewhere between 1.0 and 0.25. So all we need to do is pick a point in there, like 0.5, and tell correlate: you know what?
If you see any match that has a score lower than this, just ignore it; pretend it doesn't exist. Correlate won't keep those matches, and you won't have this junk at the end of your match list. Another way you can improve your matches is by weighting your keys a little bit. This is just a little bit of extra data. You can say: if you match with this key, give it a little bit of extra score. I just multiply that in when I do the basic scoring. So if I set the weight equal to 2 on this key and 2 on this key, then the total possible score for the match is going to be 4 rather than 1. You just pass that in; there's a weight parameter to set and set keys where you can specify the weight for that mapping.

Now I want to talk about rounds for a second. Here's where redundant keys come into play. Originally, I thought that if you mapped the same key to a value multiple times, that wasn't interesting. And again, this is one of those things where I just sat on my mountaintop contemplating, and I realized it's actually very interesting. I'm going to show you why. Here are two values, from a different, imaginary data set, just for examples: these are movie titles from the 70s and 80s. The Day the Clown Cried, and The Day of the Dolphin. In both cases, we have mapped the word "the" twice to the value. What does correlate do with that? Internally, when correlate is doing its comparison work, it splits everything out into what it calls rounds. A round is a set of all the unique keys mapped to a value, and then subsequent rounds represent duplicates of those keys, so they tend to get smaller very quickly. If you mapped the word "the" to a value five times, there would be five rounds, and the word "the" would appear in all of them. So here, for the value in data set A, round one is {the, day, clown, cried}, and round two just contains the word "the".
On the other side, round one is {the, day, of, dolphin}, and round two just contains the word "the". What's interesting is that, conceptually, the word "the" for the second time is a different key than the word "the" for the first time. "The" is very common in your data sets. Again, I'm making up numbers here, but imagine that "the" appeared, for the first time, 85 times in data set A and 76 times in data set B. Again, that puts us at roughly a ten-thousandth of a point. But "the" for the second time is a lot rarer, and if that only appears a handful of times, suddenly we're getting a much higher score from it. So my goal in telling you this is: if you use correlate and you have redundant keys, absolutely pass them in.

Another thing you can do to make your scores better is use what I'm calling ranking information. This is just the ordering of the data inside the data sets. If one or both of your data sets are unordered — there's really no sensible ordering to them — you can't use ranking. But if both of your data sets really are in order, where this value should come before that value over here, and this value should come before that value over there, then it's more likely that your matches are going to be local like that than that they're going to reach all the way across. The way you work with ranking in correlate: there's a special method on a data set where you can specify extra metadata about a value, and here there's only one parameter, which is ranking. You just pass in a number. Correlate automatically figures out the range; it has two strategies for how to score ranking, and it uses the one that's more successful. You just pass in the ranking. I think you also have to enable it on correlate: you have to set either a ranking bonus or a ranking factor, I don't remember. But it's time to talk about fuzzy keys.
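(Stepping back to rounds for a moment: the splitting described above can be sketched like this. An illustration, not correlate's internals.)

```python
from collections import Counter

def rounds(keys):
    """Split a list of keys (with possible repeats) into rounds.

    Round 1 holds every distinct key; round 2 holds the keys that were
    mapped at least twice; and so on. A key mapped five times shows up
    in five consecutive rounds.
    """
    counts = Counter(keys)
    result = []
    n = 1
    while True:
        rnd = {key for key, count in counts.items() if count >= n}
        if not rnd:
            return result
        result.append(rnd)
        n += 1

rounds(["the", "day", "the", "clown", "cried"])
# → [{'the', 'day', 'clown', 'cried'}, {'the'}]
```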
So fuzzy comparison — like fuzzy string comparison — is a way of examining two keys and saying: these look kind of similar. It's not "these are exactly the same string," but "these are pretty similar strings." There's a very popular library for this called FuzzyWuzzy; I prefer one called RapidFuzz, which has the same API but is MIT-licensed. So here we have two titles. The top one is from, again, the MP3s, and the bottom one is from the episode log, but they're for the same episode: "The Winthrop Jewel Robberies" versus "The Winthrop Jewelry Company Thefts". RapidFuzz compares those two and expresses how similar they are as a percentage. Here the number is 69, so it's saying these are about 70% similar strings. It's doing this purely lexicographically, though — just by examining the letters. So it says "jewel" and "jewelry" are very similar words, but it doesn't understand, for instance, that "robberies" and "thefts" are basically the same concept.

Now, fuzzy keys are very expensive in terms of CPU time; they slow correlate down a great deal. But sometimes you just have to use fuzzy keys, so fair enough. To use fuzzy keys, you create your own subclass of a base class that I established for you, called fuzzy key, and you have to implement a compare function. Compare returns a number from 0 to 1, and that's it. For speed reasons, I only compare fuzzy key objects of the exact same type to each other. The hardest thing was figuring out the scoring algorithm for fuzzy keys. This is something that took me months. I tried a bunch of different things; nothing felt right. Eventually, I realized that I had this successful scoring algorithm for exact keys, and I should make the fuzzy key scoring algorithm look the same, just more complicated, because it had to handle fuzzy keys. So I'm going to walk you through how we get from the exact one to the fuzzy one.
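The subclass-with-compare shape described above might look roughly like this. Note this sketch uses the standard library's difflib as a stand-in for RapidFuzz, and the class name FuzzyString is made up; correlate's real fuzzy key base class and method signatures may differ:

```python
import difflib

class FuzzyString:
    """Sketch of a fuzzy key: an object with a compare() method that
    returns a similarity from 0.0 to 1.0. This stand-in uses the
    standard library's difflib rather than RapidFuzz, and the class
    name is made up; correlate's real fuzzy key base class may differ.
    """
    def __init__(self, s):
        self.s = s.lower()

    def compare(self, other):
        # Ratio of matching characters: 0.0 = nothing alike, 1.0 = identical.
        return difflib.SequenceMatcher(None, self.s, other.s).ratio()

a = FuzzyString("The Winthrop Jewel Robberies")
b = FuzzyString("The Winthrop Jewelry Company Thefts")
similarity = a.compare(b)  # similar, but well short of identical
```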
The first observation is that you can multiply or divide by 1 as much as you want, and it doesn't change anything. So I'm going to add some multiplications and divisions by 1 here. And that 1 is really the score of the exact key: exact keys are binary; either they have a score of 0, not a match, or 1, a perfect match. So we can replace 1 with the score everywhere it's used. The final step is that the number of uses of the key in A, and the number of uses of the key in B, aren't actually what we're measuring; we're measuring score. So now I add up how much this key cumulatively scored when it was compared against things in data set B, and I add up how much this key scored when compared against things in data set A, and I divide by those. So the final score is: I take the score and multiply it by the ratio of that score to all scores using that key in data set A, multiplied by the ratio of that score to all scores using that key in data set B. That seems to work, finally, so I haven't touched it. And every so often I go: oh, I don't know if this is right; oh, I think it's right; oh, I don't know if it's right; oh, I think it's right.

The final little bit of technology I'm going to talk about is the match boiler and the grouper. This solves, again, a specific problem. Let's say we have not-such-good data, and we have a whole run of matches that have the exact same score. Which one of them should we pick? If we just pick the first one, then whichever one happens to be first wins, and that's not necessarily the best one. How do we decide which one is best? What I wound up doing was: let's run an experiment where we pick each of the ones with a duplicate score in turn, recursively examine the rest of the match list, and compute the cumulative score for all the rest of it. After we've tried all the experiments, we keep the one with the best score. That works, but it's expensive.
Because if we have eight values here, we remove one, and now we have seven values in a run, and we want to recursively run the match boiler on that. This becomes like an n-to-the-nth-power problem. So there's now a preprocessing step called the grouper, which observes that sometimes these things group together naturally, and sometimes they're off by themselves. For example — again, these are all made-up examples — these first three matches don't have anything in common with the other matches in the list. So if I select the first one, which uses value A and value G, that doesn't affect anything down the line; it's safe to go ahead and pick that off. All the matches that are off by themselves, I just commit to the match list immediately. Now I only have to do the match boiler experiments with the bottom five matches, because every match in that bottom five has at least one value in common with at least one other entry in that list; that's what groups them together. And there's nothing for it but to run the experiments.

The bad news is, again, we're talking about that exponential time. When I wrote my example using Boston Blackie — I came up with it while hanging out at PyCon, in order to do the lightning talk that night — I wrote a quick script and ran it, and it correlated for 20 minutes. And I said: okay, something's going on here; this isn't working. So I killed it, stared at it a little, and said: oh, I forgot the dates. I added the dates, and now it correlates in a tenth of a second. It just needed more data to work with. But I went back and said: okay, how long is that going to take if I don't have the dates? So I started that thing again before I left for EuroPython — I actually started it on July 6th at 1:30 AM — and it's still running a week later.
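(As an aside: the partitioning the grouper performs — connected components over matches that share a value — can be sketched like this. An illustration, not correlate's actual code.)

```python
def group_matches(matches):
    """Partition matches into groups connected by shared values.

    matches: list of (score, value_a, value_b) tuples. Two matches land
    in the same group if they share a value on either side, directly or
    through a chain of other matches. Matches that end up alone in
    their group don't compete with anything, so they can be committed
    to the result immediately; only the bigger groups need the
    expensive match-boiler experiments.
    """
    groups = []  # each group: {"a": set, "b": set, "matches": list}
    for match in matches:
        _, a, b = match
        # Find every existing group this match touches, and fuse them.
        touching = [g for g in groups if a in g["a"] or b in g["b"]]
        merged = {"a": {a}, "b": {b}, "matches": [match]}
        for g in touching:
            groups.remove(g)
            merged["a"] |= g["a"]
            merged["b"] |= g["b"]
            merged["matches"] += g["matches"]
        groups.append(merged)
    return [g["matches"] for g in groups]

matches = [(5, "a", "g"), (4, "b", "h"), (3, "c", "i"),
           (2, "d", "j"), (1, "e", "j")]
groups = group_matches(matches)
# the first three matches are independent; the last two share value "j"
```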
It's been over 11,000 minutes of CPU time, and it shows no signs of stopping. So my point in telling you this is: if correlate seems to be taking a while, maybe you just need to go back to your data and see if you can give it more to work with. The more you give correlate to work with, the better job it can do. That's really everything. I wrote a lot of documentation for correlate; it's up on GitHub, and you can, of course, install it with pip3. Thanks for your time.

[Host] Thanks, Larry. Thanks very much for that interesting talk. Do we have some questions about correlate from the audience? Oh, somebody's brave enough. By all means.

[Audience] So you've had to do a lot of hard work thinking about this problem. Isn't there some random forest machine learning algorithm — some meta-algorithm — where you just feed in examples of your matches and how you'd like them to correlate, and you build a model, and now it can correlate many things?

[Larry] That sounds like the future. I would have to say, first of all, I don't know anything about machine learning. But second of all, fundamentally, you'd need to establish a way for the machine learning algorithm to understand the data. So you would have to establish things like the scoring algorithm. Once you did that, maybe you could teach it to say: okay, I prefer these higher scores. Really, I'm not sure what it could do that was smarter. The only thing I can think of is that I may add some sort of maximum to the number of experiments we run in the match boiler, just to cut it down — like, okay, pick something at random. But I don't know how AI would make it better. On the other hand, given all the keynotes recently, I'm worried that adding AI to correlate may make it murder me in my sleep. Thank you.

[Host] Thanks. Second question, yes? Sure, please.

[Audience] Thank you very much for the talk; that was really lovely. I was thinking about the information retrieval domain while listening to the talk.
Have you had some inspiration there, or was it just completely random things coming to my mind when I heard this? Because when you create a set of keywords for an object, that object is kind of a document, in the terminology of information retrieval, and algorithms and measures like TF-IDF, for example, should work there.

[Larry] Oh, I don't know anything about — again, I'm off on my mountaintop just thinking about my problems for myself. I've used this for matching up podcasts with data scraped off of webpages. I've used it for — again, it's usually MP3 files and data scraped off of webpages. I haven't used it for anything like that. I don't know anything about document retrieval technology or document databases or anything like that. I'm sorry.

[Audience] Okay, thank you.

[Host] Okay, so that concludes the questions. Thank you very much again for the talk. Let's have another round of applause for Larry. Thank you.