Text pre-processing is an essential step in any text-mining pipeline, as it determines what kind of data the machine learning and data exploration algorithms receive. It generally comprises several steps, so let's look at some of the most important ones. We'll use the BBC News data set from the Datasets widget as an example; it contains several articles from BBC News. As before, we'll use the Corpus widget to indicate that our data comprises both a title and some content. Let's inspect the data in a Word Cloud to get a general feel for the corpus and its content. It could be more informative: unsurprisingly, the most frequent elements are English function words such as "the", "to", "in", and "of", as well as some punctuation. So we need to clean this up a bit, and we can do so in the Preprocess Text widget. I'll quickly insert it between the Corpus widget and our Word Cloud.

First we lowercase the text, then we tokenize it. That means we break it into the core units of analysis, typically words. Finally, we can remove some of the stop words. Stop words are words without semantic meaning, like "the", "and", "in", and so on. In the preview on the left we can see a sample of the widget's output, and we can look at the actual results in the Word Cloud widget. Great: we see that words such as "said", "year", and "us" are now the most frequent words in our corpus.

But let's look at the widget again. In the list on the left we have both "year" and "years". The meaning is the same; the distinction is only the grammatical number. It might make more sense to transform such words into a standard form. This transformation of words into their base form is called normalization: for example, "working" should become "work", "oranges" becomes "orange", and "bigger" becomes "big". We'll use the UDPipe lemmatizer, a fast normalization option for English, so we just drag and drop the option between tokenization and filtering. Here the order really is important. Great: "years" and "year" are now a single "year".
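The pipeline above (lowercase, tokenize, remove stop words, normalize) can be sketched in plain Python. This is only an illustration of the idea, not what Orange runs internally: the stop-word list and lemma table below are tiny hand-made stand-ins, whereas a real lemmatizer such as UDPipe uses a trained model rather than a lookup dictionary.

```python
import re

# Tiny illustrative stand-ins for real linguistic resources.
STOP_WORDS = {"the", "to", "in", "and", "of", "a", "that", "for"}
LEMMAS = {"years": "year", "said": "say", "working": "work", "bigger": "big"}

def preprocess(text):
    text = text.lower()                                   # 1. lowercase
    tokens = re.findall(r"\w+", text)                     # 2. tokenize into words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. drop stop words
    return [LEMMAS.get(t, t) for t in tokens]             # 4. normalize to base forms

print(preprocess("The team said working for years got bigger results"))
# → ['team', 'say', 'work', 'year', 'got', 'big', 'results']
```

Note that the order of steps matters here too: lemmatizing before filtering would let "said" slip past a stop-word list that only contains base forms, which mirrors why the lemmatizer is dragged between tokenization and filtering in the widget.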
"Said" has become "say", and so on. We can also select a word in the word cloud and inspect all the documents containing it, for example "win". Observe the documents in the Corpus Viewer widget: here are the 532 news articles that mention winning. Let's find each of its appearances. I can enter the regular expression \bwin\b into the filter box, which will locate the exact occurrences of the word "win", as \b denotes a word boundary. And it looks like "win" doesn't appear only in the sports category.

This concludes our pre-processing. If our data is big, we might want to perform some additional filtering, and if we wanted to distinguish between nouns and verbs, we could apply POS tagging, which adds part-of-speech labels to the data. Determining the best steps for your analysis can take a lot of time, and it will probably depend heavily on your exact field. But the more you practice, the easier it will be to decide which steps yield the best results.
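The word-boundary behavior of \b is easy to check with Python's re module. The sentences below are made-up examples, not lines from the BBC corpus; they just show that \bwin\b matches "win" as a whole word but skips "winning" and "winner":

```python
import re

# \b is a zero-width word boundary: it matches between a word
# character and a non-word character, so \bwin\b only matches
# "win" when it stands alone as a word.
pattern = re.compile(r"\bwin\b")

texts = [
    "a crucial win for the team",  # matches: "win" is a whole word
    "their winning streak continues",  # no match: part of "winning"
    "the winner takes it all",     # no match: part of "winner"
]
print([bool(pattern.search(t)) for t in texts])
# → [True, False, False]
```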