So, hello and welcome to this hands-on workshop about sentiment analysis, where I want to introduce three different approaches you can use to find out the sentiment of a text. One of them uses deep learning, another uses standard machine learning algorithms like logistic regression or decision trees, where you can actually choose which supervised algorithm you want to use, and the third one is a really easy one, a lexicon-based approach. The tool of choice for this workshop is the KNIME Analytics Platform, which is an open source tool, and KNIME is also the company I'm working for. My name is Catherine, and I'm a data scientist at KNIME. In my job I go to conferences like this one and I teach a lot of classes, so I show people how to analyze data in general and how they can use KNIME for this. About the schedule of the workshop: we will start now with a short introduction to sentiment analysis, what sentiment analysis is, and we will have a look at the three approaches I mentioned. Then, as the tool of choice is KNIME, I will give you a quick introduction to the tool, so you are prepared for the hands-on session. And then I will go in more detail through the different steps you have to perform during the hands-on session. So let's start with: what is sentiment analysis? When you plan to buy a new phone, what a lot of people do is go on Amazon or another platform and look at some reviews, so they know what other people say about this product. But this might be interesting not only for people who want to buy a new phone, but also for the manufacturers, because they can actually find out what people say that is positive about their product and what people don't like about it.
If you go on Amazon and you want to find a phone for yourself, you can read all the reviews. But for a company, there are not only 10 reviews, and there are many different platforms; they don't want to have someone sit there for hours and read reviews. So what they do is web-crawl all the information on the internet about what people say about their phone, and then they want to find out automatically: was this a positive review or was this a negative review? And later, once they have done this classification, they can do other text processing tasks, for example topic detection. So they find out what the topics are that customers like about the product and which ones they don't like. But first they have to classify: those are the positive reviews and those are the negative reviews. If you look at the two examples here, the language makes it really easy. In the positive one, we have words like "beautiful" and "exceeded my expectations", so there it is easy to say this is positive. So we could also train a model that only checks whether there is a positive word in there, and then we say it's positive. Or "a very bad experience": it has the word "bad" in it, so that is easy to classify. But what about sentences like this: "I like to travel sometimes." Is this now positive or negative? It's already not so easy anymore. And in the next sentence I use the word "like" again, but in "it looks like it's going to rain", "like" has a different meaning. So we have to teach an algorithm that words can have different meanings. And maybe we shouldn't look only at each single word: in the next sentence I say "I do not dislike it", which means I like it, even though I use the word "dislike", which has a negative meaning in many sentences. Some other examples where we can use sentiment analysis are, for example, classifying movie reviews, whether reviews about movies are positive or negative, and then the example I showed you in the beginning about products.
What do people say about my product? What do they like about it? What don't they like about it? Another option is to make predictions. If you want to predict market trends, then I think the component of how positively or negatively people talk about you on Twitter and Facebook is really important. So it's a really important feature to find out how many positive and negative reviews there are about you on the different platforms, in order to then estimate or predict market trends. To summarize the task of sentiment analysis: we want to determine the opinion expressed in a text, whether it's positive or negative. This can be a review; it can be any kind of text. So we want to find out what kind of opinion the person who wrote this text has, or what kind of emotion this person was in when he or she was writing it. The three approaches I want to introduce are the lexicon-based one, which is the easier, and also older, one; then the one with classical machine learning; and then the last one with deep learning. So let's talk about the first one, the lexicon-based approach. The lexicon-based approach relies on dictionaries. You have dictionaries for each language, normally published by different universities, where they say: in this dictionary we have only positive words, and in this dictionary we have only negative words. The idea is then that you go through your documents and count, for each document, how many positive words and how many negative words it contains. And then I can calculate a score indicating whether there are more positive or more negative words in this document. What I do for this is use a dictionary tagger: all the words in the document which are positive get a positive tag, all the negative words get a negative tag, and then I count.
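The counting idea behind the lexicon-based approach can be sketched in a few lines of Python. The two mini word lists here are made up for illustration; they are not the actual dictionaries published by universities:

```python
# Toy positive/negative lexicons (hypothetical, for illustration only).
POSITIVE = {"beautiful", "exceeded", "like", "great"}
NEGATIVE = {"bad", "dislike", "terrible"}

def sentiment_score(text):
    """Count positive and negative words; score > 0 means more positive words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no opinion words found at all
    return (pos - neg) / (pos + neg)

print(sentiment_score("a beautiful phone that exceeded my expectations"))  # 1.0
print(sentiment_score("a very bad experience"))                            # -1.0
```

As discussed above, negation like "not dislike" is exactly what this simple counting scheme gets wrong.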
This approach has one big advantage compared to the other approaches I will introduce: you don't need a labeled data set. For both the machine learning approach and the deep learning approach, you apply supervised learning, and therefore you need a labeled data set. So someone has to sit down first and label thousands of reviews before you can actually start. And in KNIME, we can build a workflow for this without writing code, because KNIME is based on a graphical interface. This is one of the workflows that you can build later in the hands-on session, and you'll see KNIME always has these blocks. A workflow always starts with some reading blocks, and each of these blocks performs one specific task. So who of you has used KNIME before? One? Okay. Just quickly so I know: who of you has been working with text mining before? So, more or less half of the people. Okay, just so I know how deep I have to go into detail. Yeah, I will give you an introduction to the tool and how you can build a workflow like this in a minute. Then the second approach is the machine learning approach. As I said, here the starting point is that we need some labeled data. So we start with labeled data, and then we have to extract our feature space from the documents before we can apply machine learning. The feature space normally is a numerical representation of each document, which tells us: this word is in the document, and this word is not in the document. So we have a zero-one encoding telling us which words are in the document. If we now take all the words and don't do any preprocessing, our feature space will be huge, because you will have words like "in", "the", "he", "she", and so on, all in your feature space. But what we normally do is try to extract keywords, or only words which are meaningful for our analysis, instead of using all of them.
That's what we do in the preprocessing step in text mining: we extract only the words which we need for our analysis. And once we have this, we can create our numerical representation for each document and then apply any supervised algorithm we want, because then we again have just numbers. We represent our documents as numbers, so we have the freedom of all the different algorithms. The advantage of this machine learning based approach is that it normally has much better performance than the lexicon-based approach. Because, as you remember, the lexicon-based approach only counts the words with positive and negative meaning based on the dictionaries. But can you remember the first example, where you had "not like"? This can change the meaning, even though "like" is maybe in the positive dictionary. But there are also different dictionaries for the lexicon-based approach. Yes, and you can also provide your own dictionaries. And this is the workflow which the second group, if you decide to work on the second approach, will build later. Again, you will see that you don't need to write code; we just have these different blocks for the different preprocessing steps, like the stop word filter the man here in the front was asking about. And then comes the third approach, which is the deep learning approach. Here we want to train a network to make predictions. The starting point is the same as for the machine learning based approach: we again need a labeled data set. But the preprocessing is a bit different. What we do for the machine learning based approach is that we really only encode, in this numerical representation, which words are in there. In the deep learning approach, what we do nowadays is say that a text is a time series. We are not only interested in which words are in there, but also in the sequence of the words.
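The zero-one document encoding described above can be sketched in plain Python. The two toy documents are made up, and a real workflow would first do the stop word filtering and other preprocessing:

```python
# Sketch: binary (zero-one) document-term encoding on a toy corpus.
docs = ["great phone great battery", "bad screen"]

# Build the vocabulary; in practice this happens after preprocessing.
vocab = sorted({w for d in docs for w in d.split()})

def encode(doc):
    """One row of the feature space: 1 if the vocabulary word occurs, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

vectors = [encode(d) for d in docs]
print(vocab)    # ['bad', 'battery', 'great', 'phone', 'screen']
print(vectors)  # [[0, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
```

With all words kept, `vocab` (and so every vector) grows with every distinct word in the corpus, which is exactly why the feature space explodes without preprocessing.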
But to train a network, we first have to bring all the documents into the same shape of numerical representation, because a network always wants the same shape for each sample. Therefore, what we do first is encode each word with a number, because a network doesn't understand words, it only understands numbers. So we do an index encoding where we say "he" gets the number one, "she" gets the number two, and so on. And then we have to make sure that each document has the same number of words in it. What we do for this is called truncation and zero padding. Truncation means I define a maximum number of words per document that I want to take into account, and if a document is too long, I cut off the end. And if it is too short, I just add zeros at the end, so that I have the same shape of numerical representation for each document. What you have to do then is define a network structure. What people use nowadays is, on the one hand, embedding layers, which give a better representation of each word than a one-hot encoding, and on the other hand LSTM layers, which are also used quite a lot for time series analysis in general. So what I normally do is the index encoding first, because also for the embedding you first need this encoding, but then I normally use an embedding layer. It just gives more meaning, more sense, to each of the words, so the network also understands the relationships between the words. That's why I normally use it for sentiment analysis. Where I don't use it is, for example, on character level: if you want to generate new text with a network on character level, that's a case where I don't use it, because there it doesn't make sense. And the advantage of deep learning over the classic machine learning approach is that if you have really big data sets, then this one can lead to better results.
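The index encoding plus truncation and zero padding described above might look like this, done by hand. Keras ships a helper for the padding step; this is just a self-contained toy version with made-up documents:

```python
# Toy documents, already tokenized into words.
docs = [["i", "like", "this", "phone", "a", "lot"], ["very", "bad"]]

# Assign each word an integer index, starting at 1 so 0 stays free for padding.
index = {}
for doc in docs:
    for w in doc:
        index.setdefault(w, len(index) + 1)

MAX_LEN = 4  # maximum number of words per document we take into account

def encode(doc):
    ids = [index[w] for w in doc][:MAX_LEN]  # truncation: cut off the end
    return ids + [0] * (MAX_LEN - len(ids))  # zero padding: fill up the end

print([encode(d) for d in docs])  # every sample now has the same shape
```

After this step, every document is a fixed-length integer sequence, which is the shape an embedding layer followed by an LSTM expects.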
But deep learning, I know it's a hype at the moment and everybody wants to do it, but deep learning really only works, or gets better results, if you have really a lot of data. So to train a network, you need data. Otherwise, you can take pre-trained networks and fine-tune them with your data. But to train a network from scratch, you really need a lot of data; it doesn't work with 100 samples. Do you have a question? [Audience question, partly inaudible, about whether to start directly with deep learning.] No, I would always start with some easy algorithms, like a random forest or something like this, also for benchmarking, to see: what do I get with an easy algorithm compared to deep learning? The random forest is not the easiest one, but what do I get with a random forest? Because this is already a really robust algorithm. And then I will try a deep learning network if I have the time. Yeah, it depends on your network. If you don't fine-tune your network, then depending on the amount of data, it's possible that you get better results with the random forest. To get a good neural network, you have to fine-tune your network; you have to try different network structures and layers. And compared with the random forest, it is also a time issue: the random forest also has parameters to fine-tune, but not as many as a deep learning network. So I would always try the random forest first, and if the random forest doesn't work at all, I would experiment with a neural network. But normally, the random forest already gives a good result. And here we see now a workflow where we use the Keras integration inside of KNIME to build and train a network. What we see here is that we can define our network structure in KNIME without writing Keras code. So for the different layers we have different layer nodes in KNIME, and you can just plug them together to define your network structure. And then here happens all the preprocessing, like truncation and zero padding.
And then we have the training of the model. OK, so as I said, the tool of choice for today is the KNIME Analytics Platform, and I want to give you now a quick introduction to the tool so that you are prepared for the hands-on session. So what is the KNIME Analytics Platform? The KNIME Analytics Platform is an open source tool that you can use to analyze data. And it works from end to end: from the reading part, over the preprocessing part and the machine learning part, all the way to deployment. As we will see, it is based on a graphical interface. And it provides a lot of extensions. Today we want to use the text mining extensions, but there are also other extensions, like the network mining extensions. Then there are a lot of extensions for chemoinformatics, and a lot of integrations like H2O. For chemoinformatics there is the RDKit library, which was originally developed for Python and which we integrated; with it you can, for example, analyze fingerprints and those kinds of things. So let's take a look at the platform. This is how the platform looks, and during my presentation you've already seen some workflows. If you open your platform, it looks like this, and now we have to start building our first workflow. This means if we want to start, we need a fresh canvas. So the first thing you have to do is create a new project. To create a new project, we go into the KNIME Explorer, which is here on the upper left. In the KNIME Explorer, you see all your projects, and you can start new ones. To start a new one, you can right click on the folder, which we call a workflow group, where you want to create your new workflow, and say "New KNIME Workflow". Then you can give it a name, like, for example, "demo ODSC", and you will get a fresh new canvas. Something else you can do in the KNIME Explorer: if you have some workflows from your colleagues, you can import them. And today you will get some workflows from me.
So you have here the USB sticks, and on them are the prepared workflows. Later, the first step you have to do is to import these into the KNIME Analytics Platform. This is also something we can do in the KNIME Explorer: you can just right click on LOCAL and say "Import KNIME Workflow", and then you can browse to the file which you want to import. So I say browse, and I go to my desktop. No, I go to the USB stick. And here I have the sentiment analysis workshop .knar file, and in it are all the prepared workflows for today. So later on, this will be your first task: to import the workflows. Then you will get a new folder where you have the three groups, and you can decide which group you want to work on, and also the data, and there are the solutions, but those are for later at home, not for now. But here is now our empty canvas. As I said, in KNIME we have these nodes which perform tasks on your data set. So now we have to create nodes. How can we create a node in KNIME to, for example, read in our data? Down here, we have the node repository, and in the node repository we have all our nodes, sorted by different categories. Here we have, for example, the IO section, where we can read something. In this example, I have in the data basics a file which is called adult.csv, and I want to read this file now with the KNIME Analytics Platform. For this I can use the File Reader node. To create a node, I have two options: I can either drag and drop the node from the node repository into the canvas, or I can double click it. Now I have the node here, and I see the traffic light below it. The traffic light always shows us the status of the node. At the moment it is red; this means the node needs some configuration. To configure the node, I again have two options. To open the configuration window, I can either double click on the node, or I can right click on the node and say "Configure".
And then, of course, if I want to read something, I have to define which file I want to read. So I can click on browse, and browse to the directory where my file is. As my file is already saved in the KNIME workspace, which I see here, I have an even faster way to do this: I can just drag and drop the file from the KNIME Explorer to the canvas. Then KNIME will guess which reader node I need for this kind of file, and it will set the path for me already. So yeah. You can also read from external sources, or from databases. Access to S3 is also possible, or your blob storage. [Question:] Suppose my product already has some database, and I want to use it with KNIME, but you don't support it. Can you customize your tool to support that database? So, for some of the databases we have dedicated connector nodes, for example for MySQL. And then we have a generic database connector node, and with this one you can connect to any database that has a JDBC driver. So if the database that you are using has a JDBC driver, you can download it, install it, register it in KNIME, and then connect to this database. If you have a database that doesn't have a JDBC driver, then you can develop your own nodes with the KNIME SDK, but then you have to write your own Java code. Or, if you write us on the forum and it is a database which a lot of people use, then maybe we build a connector node. But of course, we can't build a connector node for a database which you built on your own and only you are using. Or you can contribute to open source. Good point, yeah. To what? To Hadoop, yes. We have connectors for Hadoop: we have a Hive connector and an Impala connector. You are using what? Yeah, which is on top of HDFS. So if you have your own Hadoop environment where you use Hive or Impala, you can just connect your Hadoop environment with KNIME using the big data connectors.
And there we also have a Spark integration, and there is also a training course where we teach how to use the big data functionality inside of KNIME. OK, so now we have set the path automatically, and we see here the preview of the table we are reading in. We have different configuration options here, and we can now say OK. Now I see the traffic light has changed to yellow. This means the node is configured; it is ready to go, and I can execute it. To execute the node, I again have different options: I can use the toolbar here at the top, or I can right click on the node and say "Execute". What is really nice about KNIME is that you are always really close to your data. At any time, when a node is executed, you can take a look at the table which is created by this node. So you can always right click and then choose the last option (the name is always different depending on the node), and you can open the data set. In this case, we have here some data about some customers: we know whether they are female or male, so we know the gender, the age, and whether they earn more or less than 50k. The goal of my little demonstration now is to filter this data set so that we keep only the females and only some of the columns, just to show you how you can build this workflow in KNIME. So let's say our first task is that we want to keep only the females in our data set. This means we want to filter some rows, and this means we need another node, because for each task we have a specific node. So I can either go again manually through these different categories, or I can use the search box on top. I want to filter rows, so if I start typing "row filter", I will see that there is a Row Filter node. The search box actually has two modes. It has the mode which I'm using right now, which only looks for nodes that have exactly the text you are typing in the node name. In addition, we have the fuzzy search, which also shows you related nodes.
To switch, you can press on this magnifier and it will change the search mode. So now I want to create the next node, in this case a Row Filter. I can drag and drop it to the canvas, and then I have to connect the output port of the File Reader node with the input port of the Row Filter node. For this, I can click on the output port of the File Reader node and drag a connection to the input port of the Row Filter node. Another, quicker option: if I know I want to connect the Row Filter node to my File Reader node, I can select the File Reader node in the canvas and just double click on the Row Filter node in the repository. Then KNIME will create the node and already connect it to the selected one. So let's delete one of those. As I said, if the traffic light is red, this means the node needs some configuration, so we have to go into the configuration window of this node. We double click on the node, and you see the configuration window now looks different: each node has a specific configuration window with its own parameters. Now, I said I want to keep only the females. This means I have to look for the column "sex"; this is the column I want to filter on, the gender. And I have to define the matching pattern, and I say I want to have "Female". Here on the left, I can say whether I want to include or exclude the matching values, and I include the females. Then I say OK, and I can execute the node again. And then, as I see at the output table, I now have only females in my data set. And in this way, you can go on and on and build your whole workflow. What else do we have here in the KNIME platform? We have the workflow coach. This one gives you a recommendation, based on KNIME usage statistics, for which node you might want to use next if you select a node. Then we have the node description, which explains what the node does and what the different parameters in the configuration window are.
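For comparison, the same row and column filtering could be written in a few lines of Python. The three-row table here is a made-up stand-in for adult.csv, which has many more rows and columns:

```python
import csv
import io

# Toy stand-in for adult.csv.
raw = """age,sex,income
39,Female,<=50K
50,Male,>50K
28,Female,>50K
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Row filter: keep only the matching (Female) rows ...
females = [r for r in rows if r["sex"] == "Female"]
# ... column filter: keep only some of the columns.
result = [{"age": r["age"], "income": r["income"]} for r in females]
print(result)
```

Each list comprehension corresponds to one node in the workflow: the first to the Row Filter, the second to the Column Filter.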
We have the outline, for when your workflow is getting really big, and we have the console, which tells you what is going on behind the platform, so under the hood. What is also really helpful is making comments in your workflow, because today you know what you did, but maybe in two weeks not anymore. That's why comments are helpful. There are two ways to make comments in KNIME. You can either double click on the label below the node and write something like "read adult.csv", or you can make bigger comments with the annotation boxes. For those, you can right click anywhere in the workflow editor, click on "New Workflow Annotation", and then write here, for example, "my first workflow", and then move it around, change the shape, and so on. Are there any questions regarding using the tool? Yeah. Which one? This one? The first node that you see is the initial node. Mm-hmm. [Question about parallel branches.] Yeah, so you can have parallel branches. I could take here another Row Filter and say, OK, I have a second branch, one branch where I want to analyze the data for all the females. And now I have to know what happened to this one. I deleted it. You deleted it? I deleted it, yeah. OK. OK, if not, then we go back. Two tips if you do text mining inside of KNIME. KNIME is based on Java and on Eclipse, and when you start it, it starts a Java virtual machine. When this virtual machine is created, you define how much heap you give to it. The default setting of KNIME, I think, is two or four gigabytes, and this might not be enough if you analyze text. What you can do then is go into the knime.ini file and change it there to give it more heap. For example, I normally use about 8 gigabytes of heap for text mining. Another useful additional extension is the Palladian extension for KNIME. This is a community extension which is good for web crawling, and for text mining as well.
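For reference, the heap setting lives in the knime.ini file next to the KNIME executable; it is the JVM -Xmx line (the exact default value depends on your installation). For example, to give KNIME 8 gigabytes of heap, the line would read:

```
-Xmx8g
```

After changing it, restart KNIME so the new Java virtual machine is started with the larger heap.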
And another extension which I recommend for text processing is the XML processing extension, which allows you to parse and process XML documents. With the Palladian extension, you get back the HTML code, and then there is a content extractor node which extracts only the document text, so that you don't have the whole markup around it anymore. That is part of the Palladian extension, which is part of the open source product. You're welcome. OK, so for those of you who haven't worked with text mining yet: what is the general process of text mining? The first thing you normally do, of course, is read and parse your data. For this you have different options in KNIME, because your data can come in really different kinds of formats: it can be in a CSV file, like we will have today, or it could be a PDF. [Question about how hard parsing is.] It depends on the shape. If it's really just text, then a PDF is easier. If your document has tables, that already makes it much harder; it really depends on the problem. For example, if you do web crawling, websites nowadays have really complex structures, and that's much harder. If it is something like a blog post which you are crawling, then it's really simple markup and it's easy to extract. If you have a lot of small text boxes on the website you want to crawl, then it gets harder. Images as well. For images we have another extension, the image processing extension, but that's not a text mining extension. Then comes the next part, the enrichment part. In the enrichment part, we want to add additional information to our text. We want to say, for each word, which kind of word it is. There we have different tag options; in text mining this is called tagging. You have, for example, the part of speech tags, where you say for each word: is this a verb, is this a noun, or is this an adjective?
Or another example of tagging is named entity tagging, where you say: OK, this is a person, this is an organization, this is a location, and so on. So there are algorithms which can tag your documents and attach additional information to each word, and this happens in the enrichment part. Then we come to the preprocessing part. As I said earlier today, the preprocessing part is where we try to create our feature space, and where we decide which words we want to have in our analysis and which words we don't want to have. That is preprocessing. [Question:] Say I have my own rules for which values are missing and my own logic for filling them, for example with the mean, or some other way of filling those values, some customized things specific to my business. Can I do that through drag and drop with what you are providing? Yes, you can. There is a Missing Value node, and in the Missing Value node you can define for each column how you want to fill in the missing values: whether you want to use the mean, or whether you want to use a fixed value. This is also possible; there is a node in KNIME called the Missing Value node where you can do these things. And if you want to make a transformation to your data set which we don't provide, then we have a lot of integrations, like the Python integration or the Java integration, which allow you to write your own snippet of code inside the platform. So you can grab a node, like the Python Script node or the Java Snippet node, and then write your own code inside the platform. That's possible. So, once we are done with the preprocessing, the next step of a text mining project is to get our numerical representation for each document.
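The per-column behavior of the Missing Value node discussed in the question above can be sketched like this; the little table and the column names are made up:

```python
# Toy table with missing values in a numeric and a string column.
rows = [
    {"age": 30, "city": "Rome"},
    {"age": None, "city": None},
    {"age": 40, "city": "Oslo"},
]

# Numeric column: fill with the mean of the present values.
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)

for r in rows:
    if r["age"] is None:
        r["age"] = mean_age       # mean imputation
    if r["city"] is None:
        r["city"] = "unknown"     # fixed-value imputation
print(rows)
```

This mirrors the node's configuration: one strategy per column, e.g. mean for numbers and a fixed value for strings.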
This is the transformation part, where we also often use frequencies: how often does each word occur in the document? And then the last step is that we can do classification, like we do today, but also clustering, if you want to find clusters of documents, so which documents are talking about the same topic, like topic detection, or we can try to visualize our text, for example in a word cloud. For all of this, we have in KNIME additional data types; by additional I mean in addition to doubles and strings and numbers and so on. We have a document cell, and in a document cell you have not only the whole text, but also meta information like the author, the title, the category, the source, and so on. In addition, if you do the enrichment part, we also save in this document the tags for each word. And the second additional data type that we have for text mining is the term cell. In the term cell, you always have the word and the tag. With the term cell we represent, for example, a bag of words, for those of you who are familiar with this. So as I said, the first step is always the data access step. We already talked about how our data can be in different data sources, and KNIME provides different connector nodes and different reader nodes. We have the File Reader node, an Excel reader node, a CSV Reader node, and so on. But you can also connect, as I said, to any database with a JDBC driver, or you can connect with the Hive connector to a big data environment, or you can get data from the web with the GET Request node or the Palladian extension. But when we read our data, for example from a CSV file, we have it in a string format. And as I said, we have this additional document type in KNIME, so we have to transform our data from the string format into this document format so that we can use our text mining extension. For this, we have a node which is called Strings To Document.
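The frequency-based transformation step mentioned above can be sketched with one Counter per document, on two made-up reviews:

```python
from collections import Counter

# Term frequencies per document: how often does each word occur?
docs = ["great phone great camera", "bad bad screen"]
tf = [Counter(d.split()) for d in docs]

print(tf[0]["great"])   # occurs twice in the first review
print(tf[1]["screen"])  # occurs once in the second review
```

These per-document counts are the raw material for frequency-based feature vectors, one row per document and one column per vocabulary term.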
You will find this node in the node repository under Text Processing, Transformations. If you double click on the node, it will open the configuration window, and then you have to define whether your document has a title, and if yes, in which column the title is at the moment, in which column you have your whole text, in which column you have your author, and so on. So you have to make the settings for your document and your data set in the configuration window. Then you have a document cell, and in the document cell you will only see the title, not the whole text. But we have a node, called the Document Viewer node, if you want to inspect the text in between during your analysis. If you execute the Document Viewer node, you will see all the information again, and you can double click on one of your texts; then you see the full text again. So you can visualize your text: after tagging the document, you can go through it and look at which tag each word actually got, and so on. Another option to get document cells in KNIME is to use one of the parser nodes, for example the PDF Parser node to read in PDFs, which will then directly create a document cell. There we also included the Tika Parser node, and the Tika Parser node has the Apache Tika library in the background, which allows you to read from really a lot of different data sources. So there are many options available. Today, in the hands-on session in a minute, we want to work on the IMDb data set. We have movie reviews, and we want to find out the sentiment of these reviews. And the first step, no matter which approach you want to work on, is to read this data set. For this, we will use the File Reader node. Then we have to transform it into documents, so we use the Strings To Document node.
And then we can use the Column Filter to delete all the columns where we had the full text, the author, and so on. Then comes the second part. If you remember the overview of a text mining project, after reading in the data comes the enrichment part, where we want to tag our documents, so we want to assign some semantic information to each term. And there are different tags: it can be a part-of-speech tag, or it can be a tag that you define yourself, or it could be a named-entity tag. So there are different tags available, and therefore we also have in KNIME different tagger nodes. One of the two tagger nodes we want to use today is the Dictionary Tagger node, for the lexicon-based approach, because as I said, we want to count how many positive and how many negative words we have. So what we do is we use the Dictionary Tagger node twice, once with the lexicon of the positive words and once with the lexicon of the negative words, and then go through the document and look which word in this document is positive, so it's in the lexicon of positive words. And if it is in there, then we give it the tag positive. Therefore, we can go into the configuration window and say the tag type we want to assign is sentiment, and then, depending on whether we have the positive or the negative lexicon, say it gets positive or negative tags.

[Audience question] Yeah, so if someone doesn't give feedback, then I can't classify it; I can classify only the samples which I have. But I think nowadays there are also people who actually write positive things; it's not just that people write only negative things. I know that people like to complain, that's right. But it's not only about reviews where you can use sentiment analysis.

[Audience question about other languages] Also other languages, but I think not the 22 different Indian languages, because I learned during the conference that you have more than 20 different languages.
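The dictionary tagging idea can be sketched in a few lines of Python (the two word sets here are tiny, made-up lexicons, not the ones used in the workshop): go over the tokens and attach a sentiment tag to every word found in one of the lexicons.

```python
# Made-up mini lexicons for illustration only.
POSITIVE = {"beautiful", "great", "like", "love"}
NEGATIVE = {"bad", "dislike", "boring", "awful"}

def tag_sentiment(tokens):
    """Attach a POSITIVE/NEGATIVE tag to each lexicon word, None otherwise."""
    tagged = []
    for word in tokens:
        if word.lower() in POSITIVE:
            tagged.append((word, "POSITIVE"))
        elif word.lower() in NEGATIVE:
            tagged.append((word, "NEGATIVE"))
        else:
            tagged.append((word, None))  # no sentiment tag
    return tagged

print(tag_sentiment(["a", "beautiful", "but", "boring", "movie"]))
# → [('a', None), ('beautiful', 'POSITIVE'), ('but', None), ('boring', 'NEGATIVE'), ('movie', None)]
```

In KNIME, the same effect comes from running the Dictionary Tagger node twice, once per lexicon.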
We support, for example, a package for Chinese; there is a package for Italian and Spanish. But for Hindi and so on, we don't have a package yet.

[Audience question about handling other languages] So the reading of the text works, and after that it's all machine learning on what you have. For example, I think you can read it in, and then for the tokenization, where you decide what is one word, you can do whitespace tokenization, which will find what a word is in an Indian language. And then you can learn your own named-entity model, for example. For the POS tagger, maybe there are libraries from Stanford University; they have quite a lot of libraries, so maybe there is something. And for the stop word filtering, you can also provide your own list of stop words; I'm pretty sure there are stop word lists for the different Indian languages as well.

Another question? [Audience question] I haven't worked on this yet. Maybe there are already approaches where they try to find this out, but I think this is really hard; I think this is more what they do at the moment in research, in academia. I haven't worked on a project like this yet.

[Audience question about word embeddings] So we have two deep learning integrations in KNIME: one is the Keras integration, and one is the DL4J integration. In the DL4J integration, we have Word2Vec, and with the Keras integration you can also read in the word embeddings from Google, or GloVe, and use those.

OK, then back to the taggers. So we have the Dictionary Tagger, which we will use in the lexicon-based approach, and then we have the POS tagger. What is part-of-speech tagging? As I said, for the machine learning part we have to do this preprocessing, and here, for this case, I want to take only nouns, adjectives, and verbs into account.
And therefore, I have to tag my words and decide what is a noun, what is a verb, and so on. This is something I can do with the part-of-speech tagger, where I go through the document and look at what is a noun, what is a verb, and so on. This is also called grammatical tagging, and it works based on the definition of the word and also on the context. There are different algorithms available, and we have here the POS Tagger node, and in addition also the Stanford Tagger node, which provides different algorithms for this task of part-of-speech tagging. And as I said, depending on which workflow you want to work on later, you can either use the Dictionary Tagger or the POS Tagger. So now we have read our data and did the enrichment; now comes the preprocessing part. Here the aim is to reduce the feature space so that it doesn't grow too much, and for that we have two options: on one hand, we can filter out unnecessary words, and on the other hand, we can normalize the terms, so that we reduce them to their stem.

[Audience question] You mean the algorithms that we are using in the POS tagger? There we use other libraries, mainly from OpenNLP and from Stanford.

OK, and for this preprocessing, we have in KNIME different nodes, and this is actually where we spend a lot of time if we work on the machine learning part. So we can say we are not interested in the punctuation, so we can use a Punctuation Erasure node. Then we can say numbers are not important, so we can use a Number Filter node. Then we can say we are not interested in small words, so words with less than three characters, and we delete those. Then there are lists of stop words, so we can use either a built-in stop word list or another list, and we delete all the stop words in our documents, and so on.
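These filtering steps can be sketched in plain Python (the stop word list here is a made-up mini list; in KNIME each step is its own node applied per document):

```python
import string

STOP_WORDS = {"the", "a", "to", "and", "is"}  # tiny illustrative list

def preprocess(tokens):
    """Punctuation erasure, number filter, short-word filter,
    case conversion, and stop word filter, in that order."""
    out = []
    for t in tokens:
        t = t.strip(string.punctuation)        # punctuation erasure
        if not t:
            continue
        if any(ch.isdigit() for ch in t):      # number filter
            continue
        if len(t) < 3:                         # filter words shorter than 3 chars
            continue
        t = t.lower()                          # case conversion
        if t in STOP_WORDS:                    # stop word filter
            continue
        out.append(t)
    return out

print(preprocess(["The", "phone,", "costs", "399", "and", "is", "OK!"]))
# → ['phone', 'costs']
```

Each filter shrinks the vocabulary, which is exactly the feature-space reduction described above.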
And the last thing which we want to use in both examples today is the Tag Filter node. With the Tag Filter node, we can say we only want to keep words in our document that have a certain tag. In the lexicon-based approach, those are the words which have either the tag positive or the tag negative; in the machine learning approach, those are all the words that have a tag saying: I'm a noun, I'm an adjective, or I'm a verb. Then comes the transformation part. So now we did our enrichment, we did our preprocessing, and now we actually want to create our numerical representation. The first thing we do there is we create a bag of words. The bag of words is a representation of our document where we just know which words occur in this document. This bag of words we can already use, for example, for similarity search: documents that talk about a similar topic will have the same words in their bag of words. So here we have an example of what a bag of words is. We have the sentences: John likes to watch movies. Mary likes movies too. John also likes to watch football games. The bag of words of this document would contain John, likes, to, watch, movies; Mary is also in there, and likes is also in there, but each word appears only once. So we lose the word order when we create a bag of words. Then one option is that I only say a word is in my document or it's not, and the other option is that I also say this word occurs three times in this document, and this word occurs two times, and so on. Therefore we have different frequency nodes in KNIME, which append an additional column to your data set and say how often each word occurs in this document. For example, you can use the TF node, which is the term frequency node. And then, out of this bag of words and with these frequencies, we can finally create our document vector, which is the numerical representation of our document.
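The bag-of-words example from the talk can be reproduced in a couple of lines of Python (the tokenization here is just a naive split-and-strip, simpler than KNIME's tokenizers):

```python
from collections import Counter

text = ("John likes to watch movies. Mary likes movies too. "
        "John also likes to watch football games.")

# Bag of words: keep only which words occur (and how often);
# the word order is lost.
tokens = [w.strip(".").lower() for w in text.split()]
bag = Counter(tokens)

print(sorted(bag))    # vocabulary: each word appears only once
print(bag["likes"])   # term frequency: 'likes' occurs 3 times
```

`Counter` already gives the term frequencies, which is what the TF node computes per document.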
So this document vector is then either a one-hot encoding, which just says which words are in there or not, or we use the frequencies in this document vector, which tell us this word occurs twice in this document or three times. And as you see here, you get one column for each word that is still in your documents after the preprocessing. So what we want to do today is create a bag of words, use the TF node to count the term frequency for each document, and then create the document vector. Now we have a numerical representation, and we can actually do our classification. And here the two approaches are quite similar; the deep learning approach I will talk a little bit more about in a second. If you do the lexicon-based approach, we have to count how many positive and how many negative words occur, and this is something we can do with the Pivoting node. Then we can calculate whether we have more positive or more negative words in there, and use the Rule Engine node to make the classification. For the supervised algorithm, we have to extract our class information again. Then we create a training set and a test set with the Partitioning node, and then we can use any algorithm in KNIME for supervised learning. In this example, I use the decision tree, and all algorithms in KNIME follow the same motif: we always have a learner node and then a predictor node. So we have the Decision Tree Learner node, which trains the model, and you see the output port is now different, it is a blue square, so we save the model in there, and then we can feed this model into the Decision Tree Predictor node to make our predictions, and then evaluate our model with the Scorer node. In the Partitioning node, you can say how much data you want to use for training and how much for testing.
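The counting-plus-rule logic of the lexicon-based approach (the Pivoting plus Rule Engine steps) can be sketched like this; note that breaking a tie toward positive is an arbitrary choice made for illustration:

```python
def classify(tagged_terms):
    """Count POSITIVE and NEGATIVE tags (the pivoting step), then
    apply a simple rule (the rule engine step) to pick a class."""
    pos = sum(1 for _, tag in tagged_terms if tag == "POSITIVE")
    neg = sum(1 for _, tag in tagged_terms if tag == "NEGATIVE")
    return "positive" if pos >= neg else "negative"

review = [("beautiful", "POSITIVE"), ("boring", "NEGATIVE"),
          ("great", "POSITIVE"), ("movie", None)]
print(classify(review))  # → positive
```

A KNIME workflow expresses the same idea with a Pivoting node producing the two counts and a Rule Engine node comparing them.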
The preprocessing for deep learning is a little bit different. We have in the beginning a text, as I said, and then we want a numerical representation of the text, but we don't only want to know which word is in the text and which is not, we are also interested in the order. So what we do is we first create a dictionary. We have our text here on the left, and we want to get to something like we see here on the right. So we need a dictionary where we say, for each word, this is the number I want to assign, and then I can use the Dictionary Replacer node to replace all the words in my document with their numbers, and what I get is a document which has only numbers inside. The second step we need in deep learning is that the network always has to have the same input shape for training, and as I said, what we do here is we decide on a maximum number of words per document. If a document is too short, so it has too few words, then we fill it up with zeros, and if it is too long, then we truncate it, and we take only the first, for example, 20 words or 1,000 words. It depends: if you analyze Twitter data, then you know quite well how big the documents are; if you have really different documents, you have to find a trade-off. For the session today, I prepared some metanode templates for you, because here you need a few more nodes in the preprocessing. So if you decide to work in the deep learning group, what you will do in the preprocessing step is: you read in the documents and you do the strings-to-document transformation like in the other groups. You read in addition this dictionary between each word and its number, and then you do the truncation, where you cut off too-long documents, and the zero padding, and then you are done and can go on with defining the network and training the network. This is then the next step.
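The index-replacement plus truncation/zero-padding steps can be sketched in a few lines (the vocabulary here is made up; libraries like Keras offer the same in `pad_sequences`):

```python
def to_fixed_length(index_sequence, max_len):
    """Truncate sequences that are too long and zero-pad the ones that
    are too short, so every document gets the same input shape."""
    seq = index_sequence[:max_len]               # truncation
    return seq + [0] * (max_len - len(seq))      # zero padding

vocab = {"it": 1, "looks": 2, "like": 3, "rain": 4}  # made-up word-to-index dictionary
doc = [vocab[w] for w in ["it", "looks", "like", "rain"]]  # dictionary replacement

print(to_fixed_length(doc, 6))  # → [1, 2, 3, 4, 0, 0]
print(to_fixed_length(doc, 3))  # → [1, 2, 3]
```

Unlike the bag of words, the sequence of indices preserves the word order, which is what the LSTM later exploits.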
So it is now similar to all the other algorithms that you can use in KNIME, like the Decision Tree Learner and Predictor nodes: we also have here a learner node and an executor node, but the network learner node needs a network in addition. The network that I suggest for today is an input layer, a Keras embedding layer, then an LSTM layer, and in the end a dense layer with a sigmoid activation function, because it's a classification problem.

[Audience question] Yeah, so at the moment we are using TensorFlow as the back end.

[Audience question] In TensorFlow we have TensorBoard, which shows you the graph visualization, right? Can we have something similar? Yeah, so while training the network, you can right click on the node and say View Learning Monitor, and then it shows you how the accuracy changes, how the loss changes, and how the log loss changes. So this is also available.

[Audience question] Does this run on the GPU? So this depends: you have to set up the Keras environment in Python, and then you have to define where your Python executable is. When you install Keras, you can decide whether you install Keras for CPU or GPU, and then it depends on your Keras installation whether it runs on the CPU or the GPU; both are possible. And to switch? No, you just need to change the Keras installation.

Okay, so as I said already, today's use case is that we want to analyze the sentiment of some movie reviews, and we want to assign the correct sentiment to the documents. I prepared three workflows for you. So you import the .knar file from the USB stick to your machine, and you can choose in which group you want to work.
So we have group one, group two, group three; in your KNIME Explorer you will see this folder structure, and you can open a workflow by double clicking on it, and then you will find a description of what you have to do, which nodes you need, and how you can configure them correctly to build your workflow in KNIME. One last thing: because you attended today's session and you're attending ODSC, you get a book about text mining, which is called From Words to Wisdom, and with the code ODSC/India/2018 you can download this book for free from the KNIME website, from KNIME Press. Oh, everybody's writing. So on the USB sticks on the table, you will find the .knar file, and in addition also the installation file. So if you haven't installed KNIME yet, you can install KNIME from the USB stick; otherwise you could download it from our website, but because not all of us have internet connection here, I provided it on the USB sticks as well. In addition to the normal installation, we also need the text mining extension. For those of you who haven't installed it yet, what you have to do is: you go to your KNIME Analytics Platform, then you go to File, Install KNIME Extensions, and then you can search for text processing. I have it installed already, but you can check the checkbox here for text processing and install the Text Processing extension in addition. If you don't have internet connection here and you haven't installed the Text Processing extension, you have to change the update site; is there someone? Otherwise I can help that person manually. Is there someone here who hasn't installed the text processing? Do you have internet connection or not? Okay, then you can just install it like I showed. Okay, and if there are any problems, then I will walk around. Yeah? I will come. Just import the workshop file, and then you can choose.
I think the best thing is if you make groups now, so that you can help each other, and I will walk around and don't have to explain everything at each table. So first of all, can you please raise your hand if you want to work on the lexicon-based approach? No one? You would like to? Okay, so you make your own group. Who wants to work on the machine-learning-based approach? Okay, maybe the machine-learning-based group can meet here. And who wants to work on the deep learning approach? Have you already installed TensorFlow and Keras? Okay, then we can try, yeah. So maybe that group can meet here at the right table, and then we can start. So again: the lexicon-based approach here in the front, the machine-learning-based approach here in the front, and the deep learning approach here on the right. You can just type here text mining; text processing is the name.

[Inaudible audience question about whether Keras and TensorFlow come pre-installed] Keras and TensorFlow, you mean? So you need to have a Python environment; I do it with Anaconda, and I have a special environment for KNIME, and in this one I have installed Keras and TensorFlow, and this is outside of KNIME. Then in the KNIME settings, I can define where my Python executable is, and then it will use this one. The extensions themselves are already there; it's only for the deep learning that you have to set up the Python environment. For the other approaches, everything is already included. For the lexicon-based and the machine learning approach, there is no need for Python or anything like that; that is all available in KNIME.
And these dictionaries I downloaded from a web page of a university, and they are in the data set folder.

[Audience question] Yeah, yes, you can take it home and use your own lexicon.

So, what is your problem? The USB stick doesn't work? Maybe you can try another one. No, it's here. And do you have it installed already? No? Okay, then you can use this one.

No, you have to create the nodes yourself. It's only the description, and it's on you now to build the workflow. So the first thing is that we want to read a data set from the data folder, and it's now on you to build the workflow. Pardon? So, I can show it again. Once you imported the workflow group, you have this folder, which is called sentiment analysis workshop. You want to work on the deep learning approach? Then you go to group three, deep learning, and you double click, and it will open this workflow. In my workshop description, I put a link to what you have to do to set up the Python environment before you do this. You have Python, and you have Keras installed, and in which version? Okay, but which Keras version and which TensorFlow version, because there are many versions. So what you have to do first in general, because this works with Python: you go to File, Preferences, and you have to define where your Python executable is; you go to KNIME, and then Python. Let's see whether we see it somewhere else.

[Audience question] If you install Keras and TensorFlow in a virtual environment, can that be linked, or does it have to be in the main installation?
So you mean if you have a Python environment with Anaconda? Yeah, so there is an option; we have a blog post which explains what you can do, so you can have your own environment, and it explains how you can link this environment to KNIME, so that KNIME uses it. Okay, okay. I will come in a second.

So if I open File, Preferences, and KNIME, I have here also Python. That's strange. Did you install it from the USB stick? You just copied that folder? Oh, sorry. Maybe you can work on machine learning instead. Yeah, I'm sorry about this; maybe there was something wrong. Maybe you can try the machine learning approach, and then I can show you the results of the deep learning approach in the end. I'm sorry about this; there must be something wrong with the installation I put on the USB stick.

Okay, I will quickly explain it in more detail. When you open these workflows, you have these yellow boxes, and I will show it now for the machine learning group. In the yellow boxes, you will find the description of what you are supposed to do to build the workflow. For example, and this is the same for both groups, it says: read the data set imdb.samples. This data set you have in the data folder, and you can just drag and drop it to your workflow; it will set the path for you. You can say okay, and you can execute the node. Then you can continue with the next step, which says: use the Strings to Document node to create your documents. So the next step is, I go to the node repository and look for the Strings to Document node. I bring it to my workflow editor, connect it to my File Reader node, and now I can configure this node.
So I can go inside, and it tells me already in this text box how I'm supposed to configure the node. It says the title column should be the index column, so I change this one to the index. The full text is the text column. And I want to activate that I have some category information, where I want to use the column sentiment, and I don't have an author. Then I can say okay, and I can execute both of them. If I look at the output table, I see that I now have in addition this document column, where I have not only the title but also the full text, as I said. And then in this description of the workflow, it says: next, use the Column Filter node. And so you can go step by step and build your own workflow. It's the same for group one with the dictionary-based approach. So, I'm sorry that I didn't explain this in the first place.

[Audience question] Yeah, so I take all the information from one row and create the document, and in this document I put, on one hand, the text itself, structured in this KNIME document cell. As I said in my presentation, we have an additional cell type, which is the document cell, and in this document cell we save the whole text and also the meta information, like the author, the title, or the category, like here.

Yes, and that's because you haven't installed the Text Processing extension. So if you go to File... okay, you have installed the text mining extension. Then you can take a look and write here: strings to document. So what you can do, you can take the USB stick. Ah, you have it already. Then go back to KNIME. So open the platform, yeah. Then you go to File, and then to install extensions. Okay, that doesn't work. Then you say okay, and you go back to File.
And then to Preferences, and then Install/Update. Open this one, Available Software Sites, and then you say Add, and then Local. Then you go to the installation files and you choose this folder; it's called org, this one. Then you open it. Yeah, that's okay, you're good.

So once I have the data, I need to do something else over here; wait a minute. You probably don't have the Text Processing extension installed yet. So: File, Install KNIME Extensions. Okay, and just apply and close, or do I have to restart? Then you have to restart. No, now you have to say okay, apply and close, and then go to File, Install KNIME Extensions, because now we're installing the extension from the USB stick. And now you can look for the Text Processing extension.

You really have installed it? I see the problem: the node is called Strings to Document, and you wrote string to document. There we go.

You got the imdb sample CSV? We don't find one. It's called Strings to Document, not string, but strings. Ah, okay, but you haven't installed the KNIME Text Processing extension yet, did you? And you don't have internet connection, correct, as far as I see? Okay, because then we have to install it from the USB stick now. So if you go to your KNIME installation, you go to the KNIME preferences, because normally you would install it from our update sites, but now we want to install it from the USB stick. Then we go to Install/Update, open this one, Available Software Sites, Add, Local, then into the installation files, and choose org, because there are all the updates. Choose this one, the whole folder. No, org, this one.
And then open. You can give it a name, say local update site or something like this. Then you uncheck those two checkboxes, otherwise it won't work, and say apply and close. Now we can install from the USB stick: you go to File, Install KNIME Extensions, and then text processing. You can type in text processing, and then you will have the node. Yeah. And then next again. Yeah.

So how is it going? Do you have internet connection? Okay, if you don't have internet... okay, then you go to File, Install Extensions, and then you can look for text processing. Unfortunately you need to install it from the USB stick, so we have to change the update site. Then we click next.

You need to charge the laptop? I have a Mac charger, that won't work; maybe you can ask one of the others. No, charging is there. I'll come in a second.

Okay. And now what you want to do is you want to use the prepared workflows. So you right click here, Import KNIME Workflow, Browse, and then you go to the USB stick, and here we have our workshop material, and you import it into KNIME. Now you have here the prepared workflows, and you want to work on machine learning. And here you have the description: the first thing you want to do is read this imdb data set, and we have the data here in this data folder, so you can just drag and drop it. Yeah, that's correct.

But when I double click, here it says part-of-speech tags. Yeah, you don't need to select anything there; you can just say whether you want to replace the document with the tagged one, and then you can just say okay. So I don't have to select the part-of-speech tags? No. And then you can just execute this node. So, here is the question: you've got documents. So we got the File Reader.
And then we did Strings to Document, and then we don't know what this next node is. Okay, you want to work on the deep learning approach? But for the deep learning you have to set up the Python integration as well. You have? And if you go to the KNIME preferences, because we had problems there with setting up Python... Normally, here is the option Python, and this option is not available at the moment. I don't know; there is something wrong with the installation from my USB stick. Maybe you have to install it again later from the normal update site. Okay, but then it's strange that you don't have it. Because there you normally have to set where your Python executable is, and only then the deep learning will work. Only for the deep learning do we need the Python integration. So you can build the workflow, but unfortunately you won't be able to execute it at the moment. But you can still build the workflow. For the truncate and so on, I built these templates; they're in this template folder. You go into templates, and then you need the truncate template from here. Yeah. Okay.

Okay, maybe I can also show it at the front. So we want to install the extension from the USB drive. We need to go to Add, Local, on the USB stick, installation files, org, open, and I leave the name empty for now. Then I say apply and close, then KNIME, Install KNIME Extensions. It's working. Okay. And then we look for text processing, and then we say next, finish. And now it is installing from the USB stick, and then it will restart.

[Audience question about the POS tagger] The POS tagger can do only POS tagging, so there is actually nothing special you have to configure there.

I think there are around 15 minutes left. Is this okay for everyone? Because we have only 15 minutes left.
I will show first how to build the machine learning approach, and then I will go through the lexicon approach and show the result, and also for the deep learning approach. And if you rebuild this workflow afterwards, I don't know whether you go to another presentation next or not, but if you have questions afterwards, I will be at the booth, and you can come with your laptop and we can set it up, so you are prepared to do this as homework at home. I'm sorry that we are running a bit out of time.

So I'm now in this machine-learning-based approach. As I started already, I read our documents in and did the Strings to Document step. Now I want to delete the columns which I don't need anymore, so I look for a Column Filter node in the node repository. I double click to connect it to the Strings to Document node, go into the configuration window, and say I now only want to have my document column. I say okay, and I can execute. If I look at the output, I see I now have only the document cells left, and in these document cells we have all the information: the full text, and in addition also the category and so on. Next, we want to do part-of-speech tagging, and therefore I look for the POS Tagger node in my node repository, double click, and I have it connected to the Column Filter node. If I go into the configuration window, I only have to say on which column I want to do this POS tagging, so this is our document column, then which tokenizer I want to use, and then whether I want to replace this document column or create a new column, where we have the documents again but in addition with the tag information. So I can execute this one. Then, as I said, we can visualize these document cells and also the tags with the Document Viewer node. So I add the Document Viewer node; I will show you this one as well.
So we have here again the document cell, and I can say okay and execute, to actually take a look at the documents. Here I see only the title and the category, but if I double click on one, I see the whole text. And if I click on search and say that I want to see the part-of-speech tags, show the tags, then I see here the tags for each word. So we did the part-of-speech tagging, and now each word has a tag saying which part of speech it is. Then we can close this one, and I can also delete this node again. Okay, and now we come to the preprocessing step. Here the description asks us to use the Punctuation Erasure node. So I look for the Punctuation Erasure node, bring it to my workflow editor, and connect it to the POS Tagger. Also for this node, the only thing I have to do is say whether I want to replace my original document or not. I say I replace it, and then I can execute. The next step is that I want to filter numbers, so I take the Number Filter node to delete the numbers. I look for the Number Filter, connect it, and again in the configuration window I can say whether I want to filter only terms representing a number, or whether I also want to filter terms that merely contain a number. So for example, if I select the second option, a word like S3 would be deleted as well, because it contains a number; if I select the first one, then S3 would still be in there. And I again have the option to replace, and then I can execute. And here now come all these different preprocessing steps; I will copy this from the solution just to speed up a bit, because we are running out of time. In the description, you will find that you will need the N Chars Filter, where we say we want to filter terms that have less than three characters. Then we have the Stop Word Filter, where we can use a built-in list, for example for English, French, German, and so on.
But we can also use our own list of stop words. Then we have the Case Converter, because we say it doesn't matter whether our words are uppercase or lowercase, so we convert the case — this is part of the normalization. And I can use the Snowball Stemmer to reduce my words to their stems, which also reduces the dimensionality of the feature space. And then I can execute all of them. The next step is the Tag Filter node. We have done the part-of-speech tagging, and we say we want to take only verbs, nouns, and adjectives into our analysis. Therefore we can use the Tag Filter node: because we have the part-of-speech tags, we can now filter based on these tags. In the Tag Filter node, I say that I want to apply it to my documents, and I take the part-of-speech tag type. And there you can look up what these different abbreviations mean. All tags starting with J — JJ and so on — are tags for adjectives, so I say I want to have all of them. All tags starting with N are the nouns, and the verbs all start with V. And now I lost that — that happens if you want to do it quickly. So we have the verbs, we have the nouns, and we have the adjectives. Then I can say OK and execute. So we are now done with the preprocessing, and the next step is the transformation. Here we first want to create a bag of words, so I look for the Bag of Words node and connect it to the Tag Filter node. In the configuration window, I have to say from which document column I want to create the bag of words. I see I have the original document and the preprocessed document, and I want to create it from the preprocessed document; only the preprocessed document should remain in my table afterwards. And now I can execute it.
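Chained together, these normalization steps look roughly like the following Python sketch. The stop word list is a tiny stand-in, and the stemmer here is a crude suffix-stripper just to show the effect — the real Snowball Stemmer applies a whole cascade of rules:

```python
STOP_WORDS = {"the", "a", "is", "and", "i", "it"}  # tiny stand-in list

def filter_short(tokens, n=3):
    """N Chars Filter: drop terms with fewer than n characters."""
    return [t for t in tokens if len(t) >= n]

def case_convert(tokens):
    """Case Converter: normalize everything to lowercase."""
    return [t.lower() for t in tokens]

def filter_stop_words(tokens):
    """Stop Word Filter: drop common words that carry little sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude stand-in for the Snowball Stemmer: strip a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["The", "cameras", "exceeded", "my", "expectations"]
result = [stem(t) for t in filter_stop_words(case_convert(filter_short(tokens)))]
print(result)  # ['camera', 'exceed', 'expectation']
```

Note how "cameras" and a hypothetical "camera" now map to the same stem — that is the dimensionality reduction mentioned above: fewer distinct terms means fewer columns in the document vectors later.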
If I look at the output table now, I see that for each document I have the terms occurring in that document. The next step is that we want to use the TF node to calculate the term frequency for the bag of words. Here I have the option of either the relative term frequency or the absolute term frequency. I want the absolute term frequency; that's why I'm unchecking this checkbox. In the next step we have to create our document vectors, so we look for the Document Vector node. In the Document Vector node, we have to say from which column we want to create the vectors — this is our preprocessed document column — and whether we want to create a bit vector or not. In our case we don't want a bit vector; we want to use our term frequencies. And we don't want a collection cell, because we want to apply a decision tree in a minute. Then we can execute. Because we want to use a supervised learning algorithm, we need our category information again. In our document cells we still have the category information — whether a review is positive or negative — and we can extract it again with the Category To Class node. So I look for the Category To Class node — here we go — say from which column I want to extract the class, and execute. Now we want to create a training set and a test set, which happens with the Partitioning node. With the Partitioning node, we can say that we want to use, for example, 75% for training, and that we want to do stratified sampling on the document class. Then we can execute this one. And as I said, we always have this learner–predictor motif in KNIME.
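The transformation from bag of words to document vectors can be sketched in a few lines of Python — a minimal illustration with invented toy documents, showing absolute term frequencies laid out as one column per vocabulary term:

```python
from collections import Counter

# Two toy preprocessed documents (invented for illustration).
docs = [
    ["great", "phone", "great", "camera"],
    ["terrible", "battery", "bad", "phone"],
]

# Bag of words + absolute term frequency: one term count per document.
bags = [Counter(doc) for doc in docs]

# Document vector: one column per vocabulary term, TF as the value
# (the non-bit-vector option of the Document Vector node).
vocab = sorted({term for bag in bags for term in bag})
vectors = [[bag.get(term, 0) for term in vocab] for bag in bags]

print(vocab)    # ['bad', 'battery', 'camera', 'great', 'phone', 'terrible']
print(vectors)  # [[0, 0, 1, 2, 1, 0], [1, 1, 0, 0, 1, 1]]
```

With the bit-vector option you would store only 0/1 for presence; here the value 2 for "great" in the first document keeps the information that the term occurred twice.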
So here we now want to use a decision tree. I look for the Decision Tree Learner node and then for the Decision Tree Predictor node. Into the Decision Tree Learner node I feed my training set, and if I go into the configuration window, I have to define the target class — the document class — and which quality measure I want to use, and then there are different parameters for the decision tree that I can set here. Then I can execute the node to actually train the decision tree. I feed the model into the Decision Tree Predictor node, and in addition I feed in the test set, and I can execute both of them. The last step is that we want to evaluate our model, and for that we have the Scorer node, so I double-click on this one. This takes a while now to train the tree. With the Scorer node I can evaluate how good our model is: I compare how often my prediction was correct and how often it was wrong, and then I see the confusion matrix, the accuracy, and those kinds of things. I think we have to stop now. OK, so I configured the Scorer node. Here I have to say where I have my true values — in our case, that's the document class — and where I have my predictions. Then I can say OK, execute, and open the view, and I see the confusion matrix. I see that I have an accuracy of 87% on this data set using the decision tree. OK. So if you have any questions regarding the other approaches, I will be at the booth the whole day. You are more than welcome to stop by, and I will also explain the other approaches and show you how you can build them. Thank you for attending our workshop, and maybe we'll see you later at the booth.
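The learner–predictor–scorer chain has a close equivalent in scikit-learn, sketched below. This is a stand-in, not the KNIME implementation, and the tiny document vectors and labels are invented for illustration:

```python
# Sketch of Decision Tree Learner -> Predictor -> Scorer using scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy document vectors (term frequencies) and their sentiment classes.
X_train = [[2, 0, 1], [0, 2, 0], [1, 0, 2], [0, 1, 0]]
y_train = ["pos", "neg", "pos", "neg"]
X_test = [[1, 0, 1], [0, 1, 0]]
y_test = ["pos", "neg"]

# Learner: fit the tree on the training partition (gini = one of the
# quality measures selectable in the Learner dialog).
learner = DecisionTreeClassifier(criterion="gini", random_state=0)
learner.fit(X_train, y_train)

# Predictor: apply the trained model to the test partition.
y_pred = learner.predict(X_test)

# Scorer: compare predictions with the true document class.
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

On this toy data the tree separates the classes perfectly; on the real review data set in the workshop, the same chain reported 87% accuracy.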