So, hi everybody, I'm Alisa, and today I'm going to present my personal horror story of how I started with short text classification. Just some short background: I work for a startup from Hamburg that aggregates jobs and presents them to the end user. So we deal a lot with job descriptions, and believe me, these are the shortest and weirdest-formatted texts you can ever find. What I'm going to talk about is how I approached the problem, what information I found useful and what was missing, which model I chose, for what reasons, and how I trained it. Then there is a really funny story about how I actually deployed it, because we are a completely Java-based system, and afterwards a conclusion: did I learn anything, and can I do something better? Spoiler: yes, I can.

What can you actually do with text? Text can be classified. You can generate text, for instance with Markov models, if you just want to have some fun, or you can write a chatbot based on input you gave it previously. You can tag words as parts of speech, you can build syntax models, all that jazz. For our purposes, we also didn't know at first what we wanted to do. We could automatically detect synonyms — for instance 'developer' and 'Softwareentwickler'; in German it's quite natural that English terms also very often appear on the Internet as the official title of a job. We could detect new words, or generate a better description for a job offer we already know.

We decided to go with classification. Profession class or industry are incredibly hard to classify, because there is no canonical way to define those categories and they differ from country to country, so we had to do something else. Our marketing department suggested a binary classification into jobs that require education and jobs that do not. For instance, a babysitter normally does not require any education, while a developer or an architect does. It seems easy at first sight, but it's not.

I started to think: okay, I have this text, what can I do with it? The first thought that appeared in my head was keywords. There are lots of problems with keywords, though. First, as I mentioned, the quality of the texts themselves was awful. Sometimes we would get literally 'null'. That's a great description — I always wanted to work as a null, or a foreign null, I don't know. And there are really big descriptions that were apparently written by a human, but they contain lots of topics. Here you can see a very generic example: keywords for healthcare, for secretary, and the blue ones for working with papers, and so on and so forth. These are three different topics: the industry is the first, the profession class or seniority level can be the second, and the others might be profession keywords. They overlap, and it is simply not humanly possible to define all the keywords for all the classes, for all the items we have in our database, for all the languages we support, plus all the countries we are in.

My second thought was: hey, machine learning, it's kind of a hype now. But I had no labels, and I didn't want to hand-label more than a thousand items I'd have to read through. So unsupervised — let's go unsupervised. I tried several models out and ended up with LDA; if you want to read about it, there are links at the very end of my presentation. Basically, you train this model on a collection of texts and you give it, as an input, the number of topics you want to extract from all those texts. In the end, for each topic, you get something like a regression: keywords and their weights. You then throw a text against each topic's regression, get a score per topic, compare the scores and say: okay, the highest one is our topic.
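In code, that step looked roughly like the following — a minimal sketch assuming gensim's LdaModel, with a toy corpus and a made-up topic count:

```python
from gensim import corpora, models

# Toy corpus: each document is a pre-tokenized job description.
docs = [
    ["php", "entwickler", "team", "backend"],
    ["nurse", "hospital", "care", "shift"],
    ["secretary", "office", "papers", "phone"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# Train LDA with the number of topics fixed up front.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Score a new text against every topic and take the highest one.
new_doc = dictionary.doc2bow(["php", "backend", "developer"])
print(max(lda[new_doc], key=lambda pair: pair[1]))  # (topic_id, probability)
```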
It wasn't good. LDA just does not work well with short texts, and on top of that we had too much noise inside. Let's think again. Maybe I do have labeling after all. Actually, I did. In the months before I got this assignment we had worked with the so-called KldB, or ISCO — the first one is a German standard and the second an international standard for defining profession classes. You can see five digits. The German one is officially defined by the German state and comes from their sources. The first digits define the higher levels of the profession class, and then you go down into the depth of what the job actually does. An example: 4 stands for scientific professions, 43 for informatics, and 434 for software development. The following digits show exactly what your field is — for instance system administrator versus front-end developer; they don't go into a front-end/back-end differentiation, but it's that level of detail. And the last digit, from 0 to 9, shows how complex the task is. So technically, everything ending in 1 or 2 is a title that, from a human standpoint, requires no education: someone who helps out in a hospital, someone who helps out at a conference — I'm very sorry, guys, but no education required — and so on and so forth.

We had about 600,000 items labeled with these KldB codes; at the time, that was about 50% of our German database. The problem was in how we tried to use it: we had the official titles and we had our titles, and we tried to create a regex that would completely match the titles from our base to the official ones. The first problem was synonyms. The official list contains proper German titles — you will never find something like 'Python master' or 'administration guard' in there, and there are actually plenty of job offers with such titles out there. After synonyms, the next problem was the German language structure, where you can combine words to build a new word, and you can combine them in different orders. So regex is out. And also the quality of the titles in our own database was not as good as we would like. Sometimes it was just 'helper', and from the description you could see it wasn't even a helper but a secretary with almost all the tasks of a hiring manager or a CEO or something like that. So it was all rather inconsistent. Anyhow, I had my labels. Yay!

I dug into many tutorials. I really wanted to use Python, although I had the opportunity to look into Java — I actually took a look at Scala and Java libraries, but it's such a pain to set those up, and I like Python. And I really wanted to bring Python into our infrastructure. The most interesting things were NLTK, scikit-learn and gensim. Gensim I used mostly for unsupervised learning, but you shouldn't use it the way I did anymore — it's deprecated, and scikit-learn actually covers some of the gensim functionality. So I went mostly with scikit-learn and NLTK.

How can I evaluate a model? Well, there are lots of tools, and depending on the model you can do different things, but there is something called a confusion matrix. In a confusion matrix, the vertical side is the actual label of the text, and the horizontal side is the predicted label — what our model says it is. The true positives are the ones it got right. Imagine our binary classification with classes A and B: if our model said 'no, you are B' but the item was actually A, that is a false negative, and you can imagine the rest. The accuracy is how many labels were predicted correctly.
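As a tiny sketch of those two measures — here with scikit-learn and made-up labels:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical binary labels: A = education required, B = not required.
actual    = ["A", "A", "B", "A", "B", "B", "A", "B"]
predicted = ["A", "B", "B", "A", "B", "A", "A", "B"]

# Rows are the actual labels, columns are what the model predicted.
print(confusion_matrix(actual, predicted, labels=["A", "B"]))
# [[3 1]   <- one actual A predicted as B: a false negative for A
#  [1 3]]

# Accuracy: the share of labels the model got right.
print(accuracy_score(actual, predicted))  # 0.75
```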
I took accuracy as the main measure, plus the false negatives. In my case I wanted to minimize the bias towards 'education required', because it turned out we had a really small amount of non-educational job offers, and I didn't want to shrink that class even further.

I ended up checking four different models: Bernoulli naive Bayes, naive Bayes, support vector machines, and a decision tree. These are the models you will typically start with. Bernoulli and the decision tree went out after the first round because they yielded coin-flip accuracy. I trained the models on a data set of about 10,000 items — not that big a deal, but I just wanted to know their training times and out-of-the-box accuracy before I invested more time; I didn't have much time. So, second round: support vector machine versus naive Bayes. The problem with the support vector machine was that it took way longer to train, even though it yielded really good results, and, second, its bias towards 'education required' was much higher than with naive Bayes. That was enough for me at the time, and I decided: naive Bayes it is.

So I took a look at this model trained on 10,000 items, and the accuracy was about 70 percent. Not good enough. What can I do about it? I can work on two levels: I can tweak the data set itself, to balance out the amount of each label present in the training set, and I can tweak each item independently. Just a disclaimer: a balanced set works way better than an unbalanced one for all models, deep learning or not, but I had underestimated how big the impact of an unbalanced set is. So I ended up with a 50-50 labeled data set, and it couldn't be bigger than 50,000 items, because overall only about 5% of our items could carry the 'no education required' label.

Then I went to the second option: tweak each item in the data set separately. Since it is short text, there are not too many things you can do. You can add information: in my case, not all descriptions had the title inside, and sometimes it was crucial — for instance, the title would say 'no education required for this job' and there would be no other signal in the text. You can remove information: as I said previously, these texts have many different topics inside, and some of them — contact information, the start date of the job, maybe the salary — just don't matter that much. So I would take out, for instance, numbers, dates and emails. And one more thing I could do is stem the words. Stemming is the following. In German you have Koch and Köchin — cook and female cook — and they look quite different. A perfect world would reduce both to Koch, umlaut or not. What a stemmer does is shrink the word down to its root, maybe a bit more than that, so that in English I would catch 'running' and 'run' as one word without bloating the feature set. Spoiler alert: German stemmers are so bad.
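To make that concrete, here is roughly what it looks like with NLTK's Snowball stemmers — the exact stems depend on the stemmer version, but the German weakness shows:

```python
from nltk.stem.snowball import SnowballStemmer

german = SnowballStemmer("german")

# In a perfect world both forms of "cook" would collapse into one stem...
print(german.stem("Koch"))    # 'koch'
print(german.stem("Köchin"))  # 'kochin' -- the female suffix survives, so the stems differ

# English behaves the way you'd hope: inflections fold together.
english = SnowballStemmer("english")
print(english.stem("running"), english.stem("run"))  # 'run run'
```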
So, let's go through the transformation. Sorry, I'm really nervous. I prepared a job item — a description together with a title. I'm sorry it's in German, but I really want you to see how bad the stemmers are. The title is 'Softwareentwickler PHP', whereas the description is this big; just believe me, it says roughly: we are looking for really cool PHP developers, we are a really cool team, here are your responsibilities, here is what you are going to get from us.

What I wanted to do first was add information, so I concatenated title and description. I'll skip that slide, because you can imagine how it looked. I put the title in front — I actually tried different strategies, putting it in the front, in the back, in the middle of the text: no difference.

Then I wanted to remove some information. We can remove stop words. These are words specific to the language that normally do not carry any information, unless you are doing something like language detection. Naive Bayes works on a bag of words, and what goes in, for each text, is a normalized, tokenized list of words together with the label. You can see some real crappy things in here, like '25', this thing, this thing. So let's take out all the noise: all the punctuation and, as I said, dates, times and digits. Looks better — at least I don't see too many of the noise words anymore. And stemming. For instance, when I started working on these slides, I counted about five different versions of 'software development', 'software something'. Even fewer words are going in now. You can see 'innovativ' and 'personal' in there, and these stems will also catch further words built from them. So this is what goes into the model.

Now, I'm not kidding: this is all the code I had to write to create my model — in scikit-learn it looks almost exactly the same. And training it: I split the whole sample into a training set and a validation (testing) set, I format the data for both sets, I build my model, and I get the evaluation. That last one is a custom function that just prints your accuracy, your confusion matrix, and how long it took. I decided to go with pickle for saving the model, because TensorFlow wasn't the buzzword back then. It is now — we actually do use TensorFlow these days — but anyway.

This is the output of the pre-trained model. This one is trained on an English data set containing 5,000 items for each label; there are three labels: part-time, full-time and mixed-time job offers. You can see the accuracy is about what I mentioned in the beginning — those texts really aren't that good, and the training data set was really small. The more data you can get in, the better your model will be.
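Roughly, the whole training part fits on one screen. A minimal sketch of the same flow, assuming NLTK's NaiveBayesClassifier, with a made-up feature extractor and stub data standing in for the real set:

```python
import pickle
import nltk

def features(text):
    """Bag-of-words features after the cleanup steps above: token -> True."""
    return {token: True for token in text.lower().split()}

# In reality: cleaned (description, label) pairs; here just a tiny stub.
labeled = [("php entwickler team", "education"),
           ("helfer lager aushilfe", "no_education")] * 50

data = [(features(text), label) for text, label in labeled]
cut = int(len(data) * 0.8)
train_set, test_set = data[:cut], data[cut:]        # train/test split

model = nltk.NaiveBayesClassifier.train(train_set)  # build the model
print("accuracy:", nltk.classify.accuracy(model, test_set))

with open("model.pickle", "wb") as f:               # save it away with pickle
    pickle.dump(model, f)
```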
Now, I had completely forgotten that we are a Java system. I have my pickled model and — okay, what now? There were crazy ideas, like saving it as JSON and porting the parameters, or preserving only the feature set and putting it into some Java NLP library. No, no, no — that was a second of blackout. I could use Jython. Not a good idea: it is only compatible with Python 2, and my whole model was pickled away with C libraries under Python 3. Not going to work. I tried starting a Python script from inside the Java code. It took so long: every time the procedure ran the script, it would start the Python interpreter, do the thing, and stop the interpreter. Imagine you have like a million items — it's not even that many, but this alone would completely destroy our back-end performance. Should I rewrite the whole thing in Java? Nah.

Message brokers: I could connect our Java service with our Python service via Kafka or RabbitMQ. It is possible, but it's not always a good idea, because the versions of those tools are not always in sync on both sides; you have to keep an eye on that, so — yeah, no. So I borrowed a piece of the microservices approach and built a REST application. This is it, nothing more than this. We just use Flask as our web framework, we deployed it with Green Unicorn, and we use the Jersey client on the Java side. The two sides exchange a simple JSON that says: hey, I have this item with this title and this description, please tell me something. And as the result you get something back: hey, I'm the Python service, here is my answer, and I used this model.
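A minimal sketch of what such a service can look like — the endpoint name, JSON fields and feature extraction here are made up, not our exact contract:

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pickle", "rb") as f:   # the classifier pickled earlier
    model = pickle.load(f)

@app.route("/classify", methods=["POST"])
def classify():
    item = request.get_json()           # {"title": ..., "description": ...}
    text = "{} {}".format(item.get("title", ""), item.get("description", ""))
    label = model.classify({token: True for token in text.lower().split()})
    return jsonify({"service": "python-classifier", "label": label})

if __name__ == "__main__":
    app.run()  # in production this runs behind Green Unicorn (gunicorn)
```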
For instance — oh, I'm sorry. Now we are starting our Flask app, and here you are. As I said, this model was trained for the German market. So what I send is: hey, I have a text that says 'work for all helpers and people helping out', language German. What we see: education not required. Yay, we did something right. Let's continue and check something different. I'm going to just write... education level... oh, we don't need all of this. Oh yeah, I forgot to mention: when there are no known features in the text at all, the model will still give you a decision, but it is just a coin flip. For instance, for 'some text' it says education required — I'm not even sure whether 'some text' appears in the German feature set. But let's go forward with an expert job. Oh, thank you. For some reason, it says no education. The final accuracy I could get out of this model was 95 percent, so I guess we just found an edge case. That is 95 percent with outside restrictions applied: for instance, in Germany all healthcare jobs count as education required, no matter what the actual tasks are, and there are several other industries where this rule stands, at least for the German market.

So, did I solve the problem, or could something have been done better? It most definitely could be better. For instance, I could have spent a bit more time on research into how to work with the text; maybe I could have transformed it differently. If you are doing your first text classification, just don't go deep. Deep networks — convolutional neural networks, recurrent ones and so on — are good, they do yield good results, but you really need to be careful with those, and you have to invest way more time than with this approach. Another piece of advice of mine: try graphs first. You can map all your words to the nodes of a graph and then search for cycles or for certain subgraphs — for certain topics, for certain synonyms, or even for a context. You can do that, and it will actually be a bit faster, I guess. Plus you can combine the two approaches and check them against each other. Don't be afraid to alter the features: it is a bit easier to reverse engineer such a model, to know exactly what features it learned, and you can alter them for a better result — those models are still not better than humans. Another really good idea is monitoring over historical data. You should log all the decisions over all the items you processed — or, say, every tenth item — so that, first, you can check by sampling whether they are true or not, and second, you have labeled data: you can alter it a bit and then build an even better model.

As for evaluation methods, as I said, there are tons of them. You can use simple statistics, or cross-validation — you have probably heard about that; it's never a bad idea. If you are using a model that constantly relearns — that is also a possibility — then you should have at least a minimal quality test, so that if something goes wrong, you will be notified about it. It is also an interesting idea to have a golden-standard test: the very hardest edge cases. If the model can handle those, you can sleep well. So this is pretty much it. Invest your time in machine learning, it's pretty interesting. Thank you. Any questions?

Yeah, wonderful. And please, everyone, do give your feedback in the app for the talk — I think it was very good, so go rate it.

Thank you very much for your talk. I have a somewhat unrelated question — sorry, I'll repeat it because of the echo. You showed this job opening called 'Softwareentwickler PHP', and in the brackets there was 'm/w'. What does this m/w mean? Because I'm seeing it a lot in German job openings.

Yes — it's so that people don't assume a gender from the grammatical structure. This m/w means they are saying: yes, this position is for both genders.

Okay, great, thank you.

Thanks, it was a great presentation. I have a question about the models you chose. You compared support vector machines and naive Bayes, right? There is a lot of information about random forests, and about ensemble models in general, doing very well. Have you tried those models?

Thanks for your talk. I wanted to ask about the pre-work that you did with the data, especially the stemming. Maybe I missed it, or it was not super clear to me: how did you actually do it? Is there a pre-built thing you can use as a beginner to stem words in a language? And, immediately, the second part: the data you actually run your model on — you have to stem it as well if you trained on stemmed data, right?

Thank you very much, Alisa, for this wonderful talk. We have like a...