Hi, everyone. My name is Vincent, and I've been working on a project called human-learn. I want to give a bit of background about how it started and why I'm doing this. I'm a data science practitioner, I do a lot of stuff with algorithms, but when I got started, it was actually not that common to do any machine learning. What was common back in the good old days was: you had some data, and then you had some domain experts who would come up with sensible rules. That way you could build a system to automate decisions, because you would get labels that came out of those rules. That was the common practice. Since then, we've taken a different route. It's a lot more fashionable, a lot more modern, to use machine learning. That's a different setup: you come up with labels and data, and then you have a machine learning model in the middle that figures the rules out. Now, all of this is well and good, and there are real benefits to the machine learning approach. But recently I've really started wondering: a lot of human knowledge went into those rules. Did we maybe lose something in the transition from rule-based systems to machine-learned ones? There are concerns like fairness and model explainability, but maybe it would be better to start thinking about systems differently, where we combine the two approaches.

Let's make that concrete with an example. Suppose you're in a fraud scenario, and you're interested in figuring out which profiles require a human to check them out because there might be a risk of fraud. If I were designing such a system, even without seeing any data whatsoever, I could say something like: look, if there's a child with a higher-than-median income, that's a bit weird. No eight-year-old should have more money than the average worker, so that's a risk; someone has to check it out. I would also argue that a person with ten bank accounts is kind of weird too; that should just be checked out. The crucial thing is that for both of these scenarios, I didn't need data to tell the system: hey, this is fishy, someone has to check this. And it's not just that I don't need any data to declare this; a machine learning model will never be able to learn these patterns if there's no data for them. So there's already something here suggesting we shouldn't just throw away the domain knowledge. We'd be building a better system if we combined it in a sensible way with a machine learning model instead. We can build this system further by adding more rules that are sensible, proven, and validated, and then finally we might fall back to a machine learning model; that can be a good system. But you might be wondering: do we have tools to make this easy? This is plausibly the kind of system we'd like, but how do we make it easier to build? That's something I've been exploring. In particular, one design choice I've made is that I want to do this for the scikit-learn ecosystem. And if we're doing something for scikit-learn, I figured human-learn is a nice name for a package that does what I'm trying to achieve here.
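Just to make that fraud example concrete before the demo, here's a minimal sketch of the rules-first, model-as-fallback idea. The column names, the age cutoff, and the median income value are all made up for illustration; nothing here is specific to human-learn yet.

```python
import numpy as np
import pandas as pd

MEDIAN_INCOME = 40_000  # assumed population median, purely illustrative


def flag_for_review(df: pd.DataFrame, fallback_model=None) -> np.ndarray:
    # Rule 1: a child with a higher-than-median income is suspicious.
    child_with_income = (df["age"] < 12) & (df["income"] > MEDIAN_INCOME)
    # Rule 2: ten or more bank accounts is suspicious on its own.
    many_accounts = df["n_bank_accounts"] >= 10
    flagged = (child_with_income | many_accounts).to_numpy()
    # Everything the rules don't catch can fall back to a trained model.
    if fallback_model is not None:
        rest = ~flagged
        flagged[rest] = fallback_model.predict(df[rest]) == 1
    return flagged.astype(int)
```

The point of the sketch: the two rules fire without any training data at all, and the model only ever sees the cases the domain knowledge doesn't already cover.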
What I would like to do now is give you a bit of a demo of this tool so I can show you what sort of features are in here. So let's switch to the notebook. Very first, super quick demo: you might have heard of the Titanic dataset. You have a column with people that either survived the disaster or didn't, and then you have features: the class of the passenger (you had a first, second, or third class ticket), the gender, the age, and how much money you paid for the ticket. Let's say I have a curious hypothesis that I want to turn into a model. Maybe, if you paid more money for your ticket, odds are you were on the upper deck, and if you were on the upper deck, you were probably closer to a lifeboat. Let's consider that as a rule. What's the easiest way in Python to define that rule? I think it's to just write a Python function. You say: look, here's a fare-based decision that I'm making; a data frame goes in, I have some threshold that I apply to the fare column, and if you're over that threshold, let's say you survived, and otherwise you didn't. That will be our "machine learning model". In Python this is great, but the downside is that it's not scikit-learn compatible. It's a function, and a scikit-learn compatible model needs to be an object with certain properties that a plain Python function doesn't have. So the first main tool that human-learn offers is a component called a FunctionClassifier. It lets you take any Python function that accepts a data frame, with any logic you like inside, and get back an object that is not only scikit-learn compatible but also searchable: any keyword argument of your function becomes something you can grid search over, because the FunctionClassifier does the translation to the grid search component for you. I'm not going to bore you with all the code that actually runs the grid search (I'm sweeping the fare threshold here, and I'll share the notebook), but one thing that's interesting about doing a grid search this way is that you're also doing exploratory data analysis. You get to test your hypothesis a bit more, because it's really interpretable: for this value of the threshold, we get a certain precision, accuracy, and recall. That's quite tangible, and you're doing human learning, which I thought was a cute thing.
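To give a feel for the pattern: a sketch, not the exact notebook code, assuming the Titanic frame `df` has `fare` and `survived` columns as usual.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from hulearn.classification import FunctionClassifier


def fare_based(dataf, threshold=10):
    # Predict "survived" whenever the fare paid exceeds the threshold.
    return np.array(dataf["fare"] > threshold).astype(int)


mod = FunctionClassifier(fare_based)
# `threshold` is a keyword argument of the function, so it is grid-searchable.
grid = GridSearchCV(
    mod,
    cv=2,
    param_grid={"threshold": np.linspace(0, 100, 30)},
    scoring={"accuracy": "accuracy", "precision": "precision", "recall": "recall"},
    refit="accuracy",
)
grid.fit(df, df["survived"])
```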
So, FunctionClassifiers: super useful, this is great. But we might be able to go a step further, because it's one thing to turn domain knowledge into a rule that's scikit-learn compatible; once we're here, we can also start wondering: can I make tools that make it easier for you to find meaningful rules? For that, I figured, let's do something with visualizations. So a feature that I've added as of, I should say, yesterday, is an interactive parallel coordinates chart. Every vertical axis you see here is a column in the data frame, and every line drawn across the axes is a row: one person, who had label one, was in this class, had this age, et cetera. This lets me quickly query my dataset and see if there are interesting patterns, and because the label has a color attached, you can play around a bit. So maybe people in first class had a bigger chance of surviving? Apparently that's true, you can see the color difference, but it mainly seems to be women who survived. Let's see if I include second class as well: okay, still fairly precise, I would say. If we include third class, then this rule no longer holds. Okay, so far so good. But the movie Titanic said "women and children first". So if we select the men: okay, the men also survived, but only if they were quite young. That's something I can pick up here quickly. A super quick demo, obviously, but this exploration tool just gave me two rules. One: if you were a woman in first or second class, you probably survived. Two: if you were a man in first or second class and you were young, you probably also survived. Those are easy rules. I can again put them in a function, again wrap that in the FunctionClassifier, and again put that in the grid search to see how well it does.
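The rule function could look something like this; again a sketch, assuming the usual `pclass`, `sex`, and `age` column names, with an age cutoff that's just a searchable guess.

```python
def class_based(dataf, age_cutoff=15):
    # Rules read off the parallel coordinates chart:
    # women in first/second class survived, and so did the young.
    women = (dataf["pclass"] < 3.0) & (dataf["sex"] == "female")
    children = (dataf["pclass"] < 3.0) & (dataf["age"] <= age_cutoff)
    return np.array(women | children).astype(int)


rule_model = FunctionClassifier(class_based)
# This drops straight into the same GridSearchCV as before,
# now sweeping `age_cutoff` instead of the fare threshold.
```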
So let's run it, run it, run it; this takes a little bit longer. There we go. When you run this, you see the accuracy is already around 80% and the precision is 95%. The recall isn't perfect, but I would still say this is a pretty useful model; 80% accuracy is not bad, especially for a very simple rule that I just found. And here's the cool thing. If I now start comparing that to a random forest classifier: the random forest definitely has a higher recall, but it has a worse accuracy and certainly a worse precision. That's the cool part: I get to compare my domain knowledge with what the machine learning model does. Not only does that let me have a more mature conversation with my data science colleagues and my domain-expert friends, it also means I can start asking questions: why is there a difference? What has the machine learning model learned that I might be able to turn into a rule? It's this idea of turning your exploratory data analysis into a model that allows not just you, the human, to teach the machine learning model, but also you, the human, to learn. That's kind of a meta thing, but I like it. It forces you to understand the story behind your data, which has to mean that fewer things go wrong in production as well. There's none of that Kaggle-style "just fit-predict the whole thing" anymore; you're actually playing with your data.

So that's one example of making rules easier to find, but we can do more: I can also make other widgets that help you make rules. Let's explore that a bit more visually, with another example on the penguins dataset. Here are four columns: column one plotted against column two, and below that, column three against column four. Also here, you can just look at the picture and ask: do I really need a machine learning algorithm to separate the blue dots, the green dots, and the red ones? Maybe the only thing I need is a drawing. So let's double-click and make a little drawing here; okay, that's red. Another little drawing here; let's say this is blue. And because it's just a drawing, you don't need a PhD to do this; people whose strength is domain knowledge can do this too. What's really nice is that the shape I just drew can itself be a rule: is a point inside of it, or outside of it? I'll do that for the other chart real quick as well, and I'll only do it for the blue part because there's a bit of a mix below, just like this. But effectively, I now have an object that contains data and that can also make a classification. So this package also has a component called an InteractiveClassifier. The main thing it does is listen to these interactive charts, learn from them, and apply that as an algorithm. I'll quickly run through some cells to confirm that the matplotlib charts here update: yes, we are making predictions, and it is scikit-learn compatible. Here too, I would argue the power comes from the fact that you're doing exploratory data analysis and that can be your first model. Just having that benchmark is great.
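Roughly, the flow looks like this. It's a sketch of the human-learn API from memory, so treat the exact import paths and argument names as approximate, and the penguin column names as the usual ones.

```python
from hulearn.experimental.interactive import InteractiveCharts
from hulearn.classification import InteractiveClassifier

# Draw polygons on two scatter charts of the penguins frame `df`.
charts = InteractiveCharts(df, labels="species")
charts.add_chart(x="bill_length_mm", y="bill_depth_mm")
charts.add_chart(x="flipper_length_mm", y="body_mass_g")

# The drawings are just data; they feed a scikit-learn compatible model.
X, y = df.drop(columns=["species"]), df["species"]
model = InteractiveClassifier(json_desc=charts.data())
preds = model.fit(X, y).predict_proba(X)
```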
And the applications of this aren't necessarily just for models, by the way. We can also say: if a point falls outside of the drawn area, then, yeah, totally, that's an outlier. That's also something the library supports. You can also make these clusters and turn them into a featurization step; also supported. And you can even combine it with pandas to make subselections of your data. For my employer, a while ago, I made a demo where we actually used it for labeling. You can combine this with ipywidgets inside your Jupyter notebook: we did a trick with language models and embeddings to cluster points together, and if you want to do bulk labeling, it was actually not a bad tactic. It's something we at Rasa are currently playing with. So also for that use case, just making it easier to play around with data, human-learn might be able to help.

But I hear some people say when I show this: okay, that's great Vincent, cool tool, nice demo, but is this really state of the art? So I want to show what I think is the coolest demo. I went to the blog that belongs to the deep learning tool Keras; cool tool, good API, lots of good stuff about it. There's a demo on the official blog that handles a credit card fraud detection use case, and it's properly explained as well; it's a good blog, I want to point that out. At the bottom, they do the deep learning thing and report a validation precision and recall. I started wondering: is it possible for me to get better performance than this by writing rules? So I loaded the same dataset, made a training set and a test set, and then made another parallel coordinates chart. For all intents and purposes I'm using a fancier tool here, called HiPlot, and the reason I'm using it for this demo is that it can do something extra. Note, by the way, that I'm just eyeballing the data here, where most people would do the deep learning thing right away. I can say: well, the fraud cases are orange, the non-fraud cases are blue, and it's kind of an easy selection here. What I can do in HiPlot, which is just a nice trick, is say: okay, exclude those rows, so I can look more precisely at these other columns; I can select, exclude, and zoom in. Again, I can come up with rules, and again I could put them in a FunctionClassifier. Since two days ago we also have a tool that lets you write CASE WHEN-style rules: if the first rule applies, that's the classification right away.
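In code, that looks roughly like this. The ruler names here are from memory, so treat them as approximate, and the thresholds on the dataset's anonymized V-columns are made up for illustration.

```python
from hulearn.experimental import CaseWhenRuler
from hulearn.classification import FunctionClassifier


def make_prediction(dataf):
    # First rule that matches wins; everything else gets the default.
    ruler = CaseWhenRuler(default=0)
    ruler.add_rule(lambda d: d["V11"] > 4, 1)
    ruler.add_rule(lambda d: d["V17"] < -3, 1)
    return ruler.predict(dataf)


clf = FunctionClassifier(make_prediction)
```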
I made a couple of rules just by eyeballing this on the train set, and I checked the performance on the test set: 47% precision, 40% recall, F1 score, all right. Remember, this is a super imbalanced dataset; it's the canonical example of a harder dataset because it's so unbalanced. If I quickly compare that to Keras, I seem to be outperforming a deep learning model with exploratory data analysis that you can eyeball. But that's not even the main benefit. The part that's really, really cool is that after selecting a column here and noticing, hey, it's this column that signals the fraud, I can ask the question: why is that? I'm actively involved with my data and trying to understand what's happening there. And it's the act of doing precisely this that I think more people should start doing. It's kind of ironic: I really wanted this package to be an easy way for you, as a human, to teach the machine learning system, but literally, if you follow this path of modeling, the human is learning too. I thought that was a really funny coincidence.

So that's a quick demo of human-learn. There's more in there, and I'm working on extra tools on top of it; the parallel coordinates widget is something I'm trying to take further as well, and I could use some help there, by the way. If you want to use it, definitely feel free, because it's a pip install away. Do note, it's not perfect. There's a risk of data leakage, and you need to keep thinking about what you're doing; it's certainly possible to add a rule that's bad, so keep that in mind. But the reason I'm making this is that I got a little frightened by the Kaggle attitude of "let's just fit, predict, and call it a day". In real life, machine learning literally needs to play by the rules. And I seriously think this human-learning idea will help out a lot in understanding and designing proper systems to make decisions. If nothing else, I hope it can make it easier to translate domain knowledge into benchmarks that machine learning models need to beat: a machine learning model can be great, but it really has to perform better than the one or two sentences a domain expert can tell you.

Having said all of that, I haven't even told you about my favorite feature, which is a poem. I do this with all my packages; it's something I learned from SymPy, but I try to hide a poem in every package I make, with lessons that I've learned. So if you go to human-learn and you import this, you get the following poem:

Why worry about the state of the art?
Maybe it's time that we all admit
that a one-size-fits-all suit is not bespoke
and it usually does not fit.
Computers can flow with tensors,
but they never really learn.
Natural intelligence is still a good idea,
and artificial stupidity a valid concern.
There are many ways to solve a problem,
but don't let fancy tools go to your head.
Why would you use a predefined solution
if you can create a custom one instead?

Thanks for listening. If you're interested in this human-learn project (and I do tons of these projects, by the way), definitely reach out to me on Slack. If you're interested in the full tutorial, I maintain a website called calmcode.io and there's a full tutorial there as well. And before we call it quits, I do want to give a really big shout-out to my employer, Rasa. They have been super supportive of me maintaining these open source projects on the side, which is nice, and they deserve that shout-out. We're also hiring, by the way. We might not have enough time for a lot of questions, but I will be on Slack: ask me anything.