Hopefully you've gotten some emails from me about the resources we're going to use today, and these are the packages we'll be working with. Having said that, I always like to start with the Duke University Land Acknowledgement, so I'll take just a moment, if you'll give me that privilege, to honor the land in Durham, North Carolina. Duke University sits on the ancestral land of the Shakori, the Eno, and the Catawba people. This institution of higher education is built on land stolen from those people. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope that we can glimpse an understanding of these histories by recognizing the origins of our collective journeys. So I want to thank you for giving me the opportunity to read that; I think it's very important. Of course, we're not going to talk about those issues today, so I only ask that if you have an opportunity to right some injustices that you see, perhaps you can do so.

The goal of today: this is part of a sub-series of the R workshops that I do, called Case Studies, where typically I'm teaching people how to use tidyverse techniques in their R training — I teach workshops on the tidyverse. So the goal today is to emphasize and apply some of that learning in a specific space. I want to note that I am not a text analysis scholar. I don't publish papers on these things, and I'm really not an expert in them. What I am is an enthusiast for data and data manipulation. So I'm going to try to reveal some aspects of data wrangling and text analysis that will be useful to you if you're new to this; if you're not new to this, I hope I'll expose some things you haven't seen before. But I want to point out that text analysis is a dynamic field — it's been around for a while, and it's changing rapidly. I'm going to recommend at least one book, but I'll also say that to really do this well you'll need a lot more background than we'll be able to cover in the next two hours. I'll try to point out some resources where you can learn more, and I'm certainly always willing to consult with you and offer what assistance I can. But I want to make it clear that since I am not an expert in the field, I may not be able to give you a really nuanced answer to intricate questions that revolve around your particular text.

In any case, we're going to talk about data cleaning, because it's always super important, especially in text analysis — I'll just touch on that briefly. We're going to talk about tokenization. We're going to talk about visualizing, and we'll use word clouds as the main method, but only because that works well in this workshop; we'll also talk about, and maybe do, one or two other visualizations today, and I'll show you some methods of visualizing your analysis. We'll try a little sentiment analysis, and as time allows, we'll do a word frequency analysis with a TF-IDF approach — and we'll define that. Now, the main text that we're going to use — you can download this for free.
Actually, it's not a download — it's up on the website, and you can buy it if you want a paper copy. It's this text called Text Mining with R, by Julia Silge and David Robinson. David Robinson was also involved with the tidytext package — I believe both of them were, since they wrote this book together. You can get the package from this website, and there's documentation there. Julia Silge also collected the works of Jane Austen, and it's always nice to start with a clean corpus for your analysis, so we'll use those Jane Austen books. If you've never read any Jane Austen, certainly as a librarian I would recommend her. She's long since passed, but she wrote several novels — six that were published, back in the early 19th century; sorry, I don't know the exact dates, but they're in the public domain, which makes them easy to work with. She deals a lot with English high society and the implications of marriage and relationships, and many people know the novel Pride and Prejudice, which has been made into several movies. So a lot of people are familiar with this text, and it's a really fun way to start learning text mining.

Okay, so this is the website Rfun, which is a sub-branded site for my center. It consists of a whole series of workshops, in different modules, about the tidyverse. In essence, once you're learning the tidyverse, that's why we do these case studies — to see how it can be applied. But if I'm covering things you haven't heard before, just know that there are modules here that can be to your benefit. Each one of these modules has code and slides and practice data so you can do the work yourself. It's not the only source for learning R, but I want to bring at least this first module, Quick Start with R, to your attention, because as I scroll down here — in the last year I took some of my introductory courses and shortened all the videos to make them useful for a flipped-workshop model of teaching over Zoom. So if I cover things about dplyr that you're not sure of, you can watch a 15-minute video here, or I recommend the Quick Start. If we're talking about joins — left joins and merges — or pivoting data, or things like R Markdown and R projects, there are little videos on all of that stuff that I just want you to be aware of.

In any case, the backbone of the text mining that Julia Silge talks about is this idea of working with tidy data, which is the backbone of all the tidyverse work. That's the notion that every variable is in a column and every observation is in a row — variables, or columns, run vertically, and observations run across the rows horizontally. The observational unit is a table, sometimes known as a data frame, or in the tidyverse context a tibble — essentially a rectangular grid of data. Now, R has lots of different data containers, or data types: you can have matrices, you can have lists, you can have vectors. Data frames are essentially a collection of vectors, and the important aspect is that each vector, or column, has to be all the same data type — it can be all character data or all numeric data — whereas lists can hold multiple kinds, and matrices are usually numeric. We're going to be dealing with these tables; that's the advantage. And so what Silge and Robinson did is adapt tidyverse techniques into their package called tidytext.
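To make that concrete, here's a tiny sketch — my own made-up example, not from the book — of a tibble where each column is a vector of a single type:

```r
# A tiny tidy table: each variable is a column, each observation is a row,
# and each column (a vector) holds exactly one data type.
library(tibble)

books_df <- tibble(
  title = c("Sense and Sensibility", "Pride and Prejudice"),  # character column
  year  = c(1811, 1813)                                       # numeric column
)
books_df
```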
The tidytext idea is that a token is a meaningful unit of text. That token can be a number of different things: it can be a single word, a sentence, a paragraph, a whole document, even a tweet — it depends on the context. But a token is the meaningful unit. Tokenization is then the idea of splitting your text up into these tokens, so that what you end up with is a table with one token per row. We're going to work our way through an example of that. That's why we're using this approach: once you've learned something about the tidyverse, you can manipulate these texts more simply.

There are a couple of other data structures that are important. We're working with string, or character, data, and that can all live in a vector. It's also important to mention the term corpus: a corpus is a collection of documents, but it's usually not just the string data — it can include metadata about the documents and things like that, so you have to handle it properly to do your analysis. And then, as we move into the more advanced aspects, there's a concept called a document-term matrix, which is a sparse matrix with one row for each document and one column for each term. That's where we get to the concept of TF-IDF. Just to cut to the end, TF-IDF is the idea of weighing your tokens in the context of a document, which is in turn in the context of a collection of documents, a corpus. Not all words are of equal value, and the question is how you use that in your analysis: you want to focus on words that carry meaning and ignore other words — stop words such as leading articles and conjunctions — which may carry little meaning. Of course, all of this gets even more challenging if you're working with tweets, for example, where hashtags, abbreviations, and other kinds of terms may have more meaning than the algorithms would typically acknowledge. That's some of the art of all this: you have to know what your text is about as well.

Moving beyond that — and we're not going to cover these today, but I want you to be aware — there are at least two other packages that a lot of text mining or text analysis people work with. One is the tm package. I find it harder to work with than the tidytext package, but I have a big bias toward the tidyverse, which I find easy to work with; if you're more of a base R person, you probably won't find tm hard to work with. I'll couch that with the observation that R comes out of the statistical world, so it's no surprise that it's really easy to work with numbers in R. Many people are surprised to learn that you can manipulate text in R; it's really not that much more difficult than working with numbers — there are just certain techniques and packages that make working with text a more manageable process. In any case, the tm package is something you may want to look into, as well as the quanteda package. I'm not going to go into those, but I want you to be aware of them. Building on what you're learning today, or may already know, it's also helpful to know about the gutenbergr package, which gives you access to Project Gutenberg texts — all texts in the public domain.
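Since we may not get to it in code today, here's a rough sketch of what that TF-IDF weighting looks like with tidytext, along the lines of the book's chapter on word and document frequencies. It borrows unnest_tokens() and count(), which we'll meet properly in a few minutes — treat it as a preview, not a recipe.

```r
# A sketch of the TF-IDF idea: count words per book, then let bind_tf_idf()
# weight each word by how distinctive it is to its book within the corpus.
library(dplyr)
library(tidytext)
library(janeaustenr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%      # one word per row
  count(book, word, sort = TRUE)     # n = term frequency within each book

book_words %>%
  bind_tf_idf(word, book, n) %>%     # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))              # the most "characteristic" words per book float to the top
```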
And you might want to work with those Project Gutenberg texts just to practice techniques or learn different things about different texts. Project Gutenberg is nice because, like I said, it's in the public domain, and I'd say you may face only minimal data cleaning when you're working with Project Gutenberg, as opposed to gathering raw data yourself — but that doesn't mean you won't have to do any data cleaning.

A little more about that book I told you about — it's free and it's online, Text Mining with R. There are a couple of chapters I think are particularly interesting. One is Chapter 2, on sentiment analysis; that's mostly what we're going to talk about today. And as time allows, we'll talk a little about TF-IDF for analyzing word and document frequencies. A lot of people talk about topic modeling for unsupervised classification; we probably won't get to that, but the book is a nice initial introduction, and indeed Julia Silge has gone on to write another book that focuses more on that kind of thing, which I can also recommend. It's also worth noting that there are three separate case studies; I think Chapter 7's case study on Twitter archives is pretty approachable, but they're all approachable. This is a very concise book — it's not long, and the explanations are clear — but it's only a starting point. For further study, another resource you can get for free is the Summer Institutes in Computational Social Science, abbreviated SICSS — you can go to sicss.io. In particular, there's a URL behind this link right here that will take you to their curriculum, which is online, and you can learn a lot about text analysis there. Chris Bail just published a book drawing on analysis of Twitter archives, and he and Matthew Salganik co-founded this institute. If you're at all involved in the social sciences, you might want to keep an eye out, because they run it every summer, across the world in several locations, and if you're a budding social scientist you may want to get involved in those kinds of things. But as Chris Bail would point out very clearly, this is a fast-moving field, and today we're barely going to scratch the surface of an introduction. We will try to focus on a practical approach just to get started.

What you should see now is the GitHub repository that has the materials we're sharing today. If you have not worked with GitHub repositories, the easiest thing to do is to click on this green button and scroll down to Download ZIP, and then, if you want to look at the slides, be sure to unzip or expand the zip folder locally. I do have a workshop on Git and version control, so we won't cover this much here; the easiest thing is just to download and then expand. But if you want to follow along without downloading, I did make a PDF version, which will display on GitHub. In RStudio, if you've expanded the folder and you double-click on the file workshop_textmining.Rproj — an R project file — it will open up all of this code as an R project, ready to go, with version control available. The other advantage of R projects is that you don't have to set the working directory, and it's a discrete, self-contained R unit for this topic, in this case text mining. So I'm going to start out here with tokenization.
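Before we open that tokenization notebook: since Project Gutenberg came up, here's a minimal sketch of pulling a practice text with the gutenbergr package. The ID below is what I believe is Pride and Prejudice's Project Gutenberg number — double-check it with gutenberg_works() rather than taking my word for it.

```r
# A quick sketch of grabbing a public-domain text with gutenbergr.
library(gutenbergr)

gutenberg_works(author == "Austen, Jane")   # browse available Austen titles and their IDs
pride <- gutenberg_download(1342)           # 1342 is assumed to be Pride and Prejudice
pride                                       # a tibble with gutenberg_id and text columns
```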
So I'm going to open up this document called 00tokenization.rmd. This is just an R Markdown document that outputs an HTML notebook. At this point we're starting with these two libraries, and we're going to start with this poem by Emily Dickinson. It's four lines, and you can see in this first step that what we're doing is making a text vector that consists of four elements, and each element ends with a dash — I'm not exactly certain why; I've lifted this right out of Silge and Robinson's book. In order to make this a tidy text object, we first want to put it into a table. You can use the tibble() function and give every line its own number by making a numeric vector of four elements and adding it as one variable; the other variable is the text vector. If we put those together in a tibble, we then have the line number associated with each of the four lines in the poem.

Then there's this unnest_tokens() function, which comes from tidytext, and it's really useful because it allows us to get one token per row. We just run that data frame through the function: the word argument identifies that our token is going to be a word — that's the part that's variable; it could be a paragraph, a document, a sentence, a tweet, and you can double-check the documentation on unnest_tokens() for the options — and then the text we're tokenizing, the corpus, comes from this column right here. So you can see what we've got: each word broken out onto its own row — that's the token — and we know which line it came from because we added those line numbers. That's a really valuable step; you want to be able to track where each token comes from.

So, on to 01textmining.rmd, but let's run through what's happening here. You can ignore the YAML header at the top; you can see that we've got four libraries loading here. The janeaustenr package is going to be our source of text. Then we just take a look at the Austen books, and you can display those — I gave you the setup already: it's each of the six novels, starting with Sense and Sensibility, which was published in 1811. You can see that at least the beginning lines are pretty straightforward: the title and the author, some of that metadata I was talking about, and as you scroll through you get the actual text of the novel in the text column. And I think we might get to this particular line: "They lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance." I bring that to your attention because when we get to sentiment analysis we want to assign some kind of sentiment to words, but we don't assign sentiment to every word, and it's always a little shocking to me how few words actually get a sentiment assigned from a sentence like that. So think in your head: I wonder how many words in line 16 would get a sentiment assigned to them? It's an interesting little quiz. The austen_books() function, if you just want to know which six books are in that collection, lists them right there. And then, just a note on data cleaning.
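Before that note, here's roughly what the poem chunk looks like — a sketch following the book's first chapter, with the lines as I recall them from the book:

```r
# A sketch of the steps just described: build a character vector, put it in a
# tibble with line numbers, then unnest to one token (one word) per row.
library(dplyr)     # tibble() is re-exported by dplyr
library(tidytext)

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text_df <- tibble(line = 1:4, text = text)   # one row per line of the poem

text_df %>%
  unnest_tokens(word, text)                  # token = word; the input comes from the text column
```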
On data cleaning: I just like to point out that there was a New York Times article — a very detailed article, as Times articles tend to be — published about five or six years ago, where they interviewed lots of different data scientists about big data, which was a huge term at the time, and put forth the conjecture that data cleaning is 80% of any data project. Whether or not that number is exactly right, it seemed to fit the tenor of the article and to be generally agreed with. I don't know that there was empirical evidence behind it, but it's an important point nonetheless, and particularly important when it comes to text mining, because text is notoriously messy. Particularly when you're adding in other languages, you have diacritical marks that you don't have in an American English or British English dataset. All of this text cleaning is going to affect you. Everybody wants to do the analysis and make the visualizations — of course, that's where it gets really fun to share — but you should approach this knowing that cleaning text, whether you get it from Project Gutenberg or from, say, a corpus of UN documentation, is going to be a significant chunk of your time. You could easily spend days or weeks solving little problems like how to translate certain UTF codes, or what to do with things that don't import cleanly into R. I can try to help you with some of that, but please be prepared to know that this is a very sanitized presentation where we don't have to clean any text.

Okay, so what we're going to do here is start with austen_books(). We're going to use the tidyverse function group_by(), and we're going to do the same thing we did before with the poem: add a line number to every line of each book, but we want those line numbers to be unique to the book. We don't want to start at line one and then consecutively and contiguously count all the way up through the sixth novel. That's why we're using group_by(): to restart the count at the beginning of each book. Then we ungroup, because a grouped data frame isn't useful going forward. What you see there on the right-hand side are the line numbers for Sense and Sensibility. If I execute that code chunk, you'll see that I have 73,000 rows of data — that's 73,000 lines of text across all six novels. And I'm pretty certain that if we page through to the 100th screen, we're just at the 1,000th row and we're still in Sense and Sensibility. You could do a little data munging if you want to see one of the other novels, but that's all we get in this preview, and that's okay.

The next step is to tokenize. This is a repeat of what we did before, and after this I'm going to invite you to do a little exercise. The important part here is that when you tokenize, you have one token per row. When we tokenize, we now have three-quarters of a million rows, because each of the lines we had previously was much longer. The first line, line 1, consists of only three words — that's the title. If we scroll forward to line 16 — remember, we pulled that one out for your attention — you can see there are about five words there, and on the next screen another seven or eight. And the next thing we want to do is remove stop words.
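Before we get to stop words, here's roughly what those two chunks look like in code — a sketch following the book's first chapter (the book also tags chapter numbers with a regex at this step, which I'm leaving out):

```r
# Number the lines within each book, then tokenize to one word per row.
library(dplyr)
library(tidytext)
library(janeaustenr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%   # the line count restarts for each book
  ungroup()                               # a grouped data frame isn't useful downstream

tidy_books <- original_books %>%
  unnest_tokens(word, text)               # ~73,000 lines become roughly 3/4 of a million word rows
```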
Now, stop words are the idea that some words just aren't all that interesting to analyze — leading articles, conjunctions, all kinds of things — and there are stop word dictionaries. We're going to use the anti_join() function to say, in effect: from the tidy_books data frame, which has roughly three-quarters of a million rows at this point because we've tokenized it, remove any words that also appear in the stop words dictionary. When we run that, we're down to about 325,000 — so roughly a third of the words we originally exposed for analysis are still left, while the other half a million or so we just summarily threw out.

I'll point out that we used the anti_join() function right there, which is part of dplyr, and here's a little link. If you haven't seen this before, the anti-join works like this: we keep everything from the original word set that does not have a match in the stop word set. In these text manipulations, sometimes we do an anti-join, sometimes an inner join, sometimes a semi-join, and for all of those joins it's helpful to have that visualization. Again, if you're not comfortable with joins, you can watch the video on the Rfun site that I showed you earlier.

I also want to point out that you can customize your dictionary. You, as the analyst, may know certain words that aren't worth keeping around for whatever reason. I don't have a specific example, so I made up a totally bogus one — I'm getting old enough now that I forget where some of these things come from, but Volkswagen used to run a commercial where they referred to driving their cars as "Fahrvergnügen," and they built a whole definition and ad campaign around it; I don't even know whether it's a real German word. There's no reason in the world why Fahrvergnügen would show up in Jane Austen's novels, which were written long before that campaign, but I made a custom dictionary just in case, and I can use that custom stop words dictionary to throw out one additional stop word from my Jane Austen corpus. That's totally made up, of course, but you can see how it could be applied going forward.

The other thing I want to do here — and this is basically the whole goal — is that now that I've tokenized, I can start to analyze and visualize my text. So I'm going to take those remaining words, with the stop words thrown out, count them up, and return the count in sorted order. You can see that the most frequent word in this case is "Mr," followed by "Mrs," followed by "Must," and so on down the list. There are about 14,000 unique words, apparently, in that set of — what was it — a quarter of a million or so that we started off with. Then I can take that list of word frequencies and visualize it, and right now I'm going to do that with a package called wordcloud2. In this case it's actually an interactive visualization: if you hover over a word you can get the count. I've found that word cloud packages can be a little fussy; I'm showing you this one because I like the output, but there's another one in the exercise that's a little easier to work with. At this point, though, let me just note that word clouds are somewhat maligned, and maybe for good reason — let's talk about that.
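First, here's a sketch of the chunks we just ran — the stop word removal, the made-up custom dictionary, the counts, and the wordcloud2 call. The custom word is lowercase because unnest_tokens() lowercases tokens by default.

```r
# Remove stop words with anti_join(), optionally extend the stop list, then
# count word frequencies and draw an interactive word cloud.
library(dplyr)
library(tidytext)
library(wordcloud2)

data(stop_words)                                       # the stop word dictionary shipped with tidytext

custom_stop_words <- bind_rows(
  tibble(word = "fahrvergnugen", lexicon = "custom"),  # the bogus example from the talk
  stop_words
)

tidy_books <- tidy_books %>%
  anti_join(custom_stop_words, by = "word")            # keep only words NOT in the stop list

word_counts <- tidy_books %>%
  count(word, sort = TRUE)                             # ~14,000 unique words, most frequent first

wordcloud2(word_counts)                                # uses the first two columns: word and n
```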
So, about word clouds being maligned: I work in the Center for Data and Visualization Sciences, and we run across this a lot. For example, pie charts are much maligned — there are lots of people who hate pie charts, and there are some good reasons to hate pie charts. Likewise, there are lots of people who hate word clouds, and there are some good reasons for hating word clouds: they're not very scientific, and they're hard to perceive accurately. But it's also important to recognize what the rules are, what the standard is within your discipline, and to know why you're breaking those rules if you do. In this case I'm showing it to you because it has eye-popping visual appeal and can tell part of the story, but I want to encourage you never to stop with a word cloud, because there's more information to impart. Oh, and here's the other word cloud library — this one's not as pretty, but it's a little easier to work with, and sometimes it prints better in documents.

And now we're getting to the your-turn part. Because this is a case study, it's intended to be a demonstration rather than just me lecturing at you, so I'd like to invite you — let me zoom back out — to open up the file called x_01.Rmd and work on it for about the next five minutes, and then we'll come back together; that should reinforce some of the things I've told you today. If you don't want to do that, that's fine, but if you do, those are the questions, and either of the two answer documents below it — the answers as an .Rmd or as an .nb.html — will give you a chance to check your work. I'm going to set my timer for five minutes.

I see somebody asked what the with() function does, and I want to double-check that I know where that's coming from. Hmm, that's interesting — I must have borrowed that code from somewhere; you know what they say, the best code is stolen code. It's not a function I typically use, and I'm pretty certain it's not a tidyverse function — it's a base R function — but it's important for handing something to the word cloud function: with() evaluates an expression using the columns of a data frame, which is how the word and count columns get passed along to the word cloud. If I go over to my previous example, am I using with() there? Yes, I'm using with() there. That's the beautiful thing about R: there are so many functions to learn. Again, the answers are there.

So let's move on and get to sentiment analysis. In RStudio, I'm at line 209, and here we're going to use a function that's part of tidytext called get_sentiments(). There are different sentiment dictionaries, which we'll cover in just a minute, but we're going to start with the Bing dictionary, and what we're going to do specifically is focus on only the words that Bing classifies as positive.
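Roughly, those first few lines look like this — a sketch of pulling the Bing lexicon, keeping its positive words, and counting how often they show up in the Austen corpus:

```r
# Get the Bing sentiment lexicon, keep only its positive words, then use
# semi_join() to count how often those words appear in tidy_books.
library(dplyr)
library(tidytext)

bing_positive <- get_sentiments("bing") %>%
  filter(sentiment == "positive")

tidy_books %>%
  semi_join(bing_positive, by = "word") %>%   # keep only rows whose word is on the positive list
  count(word, sort = TRUE)
```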
So if I just run those first couple of lines without the filter — which I can do right here — you'll see that there are negative words like "2-faced" and "abnormal" and "abolished." But we're going to drop those for the moment and keep just the words that Bing classifies as positive, which is right there. Then we're going to use the semi_join() function to take the intersection of the tidy books and the positive words, and we'll return that as a sorted set: we get 670 words that the Bing dictionary ranks as positive — for example, the word "well" is used 401 times. What we can do with that is visualize a sentiment score.

But then let's go on and build ourselves a little algorithm: sentiment equals positive minus negative. What we're going to do is take a certain segment of text, count how many words are positive and how many are negative, subtract, and get an overall value for that segment. While we did need the words tokenized into single-word units to get the sentiment, we're now going to go back and use 80-line increments as the chunk we apply this algorithm to. I don't have a good explanation for that number — it's the number used in the Silge book. I think it's context-specific, but the way Silge explains it, if you have too many lines it muddies the analysis, and if you have too few lines you don't get a very rich number, so she somehow settled on 80, and I don't believe she really explains it. It's much like my experience attending machine learning workshops: there are often certain numbers that people just settle on as convention. She applies that with a particular base R integer division calculation right here.

So let's go ahead and run all that and break it down. For the Jane Austen sentiment data frame, the first thing she does is join with the Bing dictionary, and then she counts the sentiment words by those 80-line segments — that's where you get this index number, so that's lines 0 through 79, or 1 through 80, then the next 80, and the next 80. Then you can see that she pivots the data wider — again, moving back to the tidyverse — because if you just do this first part, you get a really long data frame where negative sits below positive for each 80-line segment, and she wants to pivot that so you can do the math in a tidyverse context. That's what's going on here, and you can see we now have three columns, so we can just use the mutate() command to get the final answer. What we end up with is a score for each 80-line increment. Presumably, if a score is negative, that's a particularly dark passage, as opposed to a positive score, which is a sunnier passage, and my recollection of the Jane Austen books I'm familiar with is that they're generally positive books.

But let's visualize that using the power of ggplot. I took each one of those segments and, almost like a time series, made a bar graph over the segments, and then I colored the bars by the sentiment value: if it was a negative number I made it red, and if it was a positive number I made it green. Then you can use that really nice ggplot feature, facet_wrap(), to separate each novel into its own panel.
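Here's a sketch of that whole sequence — the join, the 80-line index via integer division, the pivot, the mutate, and the faceted plot. The red/green coloring is my reconstruction of what was shown on screen; the book's own version fills by book instead.

```r
# sentiment = positive - negative, computed over 80-line chunks of each novel,
# then plotted with one facet per book.
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%            # 80-line chunks via integer division
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                          # the score for each chunk

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "darkgreen", "FALSE" = "red")) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")                  # one panel per novel
```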
It's a nice way of visualizing the flow of each of the novels over time. You can see in Pride and Prejudice there's this whole horrible chunk right here where things are just downhill and sad, and for those of you who know the novel, I'm guessing this is where Mr. Darcy awkwardly tries to propose and Elizabeth reacts in a way that, in hindsight, she wishes she hadn't — it's all very dramatic, and eventually, as you can see here, it ends rather nicely. I don't want to give it away if you haven't read it or seen one of the movies.

Anyway, let's look at the most common positive and negative words. What am I doing here? I'm joining the tidy books with Bing and counting those words, in preparation for this visualization: taking all the tidy books, joining with the Bing words, and getting a sentiment count for each word, which I can then visualize with ggplot. This is a little big for the screen — let me change the figure size, maybe to 5 by 7, so it displays a little better. You can see, again, that the word "well" comes up as the most frequent positively associated word. And that brings up a whole other point worth contemplating. If you're familiar with the novels, Jane Austen spends a lot of time talking about an individual's association with marriage, so if you're a "miss," you're unmarried — but Jane Austen doesn't necessarily treat that status as negative; it's simply the modern sentiment analysis dictionary that tags that word as negative. So I think this is a particularly problematic outcome to visualize: it doesn't really represent the text as well as we might want. It's a brute-force method, which should then cause us to go back and think about how we might alter our dictionaries or our analysis. In any case, it makes for a nice visualization, and one that I drew straight from the book, so I can recommend it to you.

Moving on to dictionaries: I wanted to point out that there are different kinds of sentiment dictionaries — the Bing dictionary, the Loughran dictionary, the NRC, and the AFINN — and it's worth at least taking a look at them. The Bing dictionary identifies words as positive or negative; the Loughran dictionary does the same kind of thing. The NRC dictionary sorts words into certain categories — I couldn't remember offhand how many, but things like trust, fear, negative, sadness, anger. And then there's the AFINN dictionary, which actually puts words on a numeric scale from negative to positive, so you can do interesting things with AFINN as well, which is what we'll do next. Oh, and this is where I was figuring out how many categories are in the NRC characterization: there are 10 — words are characterized into these 10 different categories. And I guess it's sort of interesting — it is interesting to me — let me do this just to give you an idea of what I think is maybe a limitation, or at least something to keep in mind: summarize the sum of n. For the NRC dictionary, there are about 13,000 words characterized into one of those categories. If you think about that in comparison to the number of words we use and write with, I don't know what the implications are — again, if you're a text analyst you'll have a better foundation for thinking through that implication — but I just think it's fascinating that we really are talking about a small subset of words.
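Here's a sketch of the two things from this stretch — the most-common positive and negative words chart, and the quick NRC checks. Note that get_sentiments("nrc") (and "afinn") may prompt a one-time lexicon download via the textdata package.

```r
# Most common positive and negative words, plus the NRC category counts.
library(dplyr)
library(ggplot2)
library(tidytext)

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%                 # top 10 words for each sentiment
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y")

get_sentiments("nrc") %>%
  count(sentiment)                         # the 10 NRC categories

get_sentiments("nrc") %>%
  count(sentiment) %>%
  summarize(total = sum(n))                # total word-category pairs in the lexicon
```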
And of course we're placing our faith in solid scholarship, so it's probably not a problem, but it might be worth knowing more about the details so that it's not such a black box you're applying to your analysis. In any case, let's take the book Emma and apply the AFINN sentiment dictionary — the one that assigns words numeric values. When we do that, we can see words like "clever" and "rich" ranked with their scores, and we can analyze that as well. There we're just getting a straight word count; we could do another word cloud, or a simple ordered bar chart. And here I'm taking a variation on Silge's algorithm, assigning 80-word increments of that book using the AFINN dictionary, and then visualizing it — first with word clouds, and then again with slightly more colors to bring out the nuance between negative and positive words, because clearly there are way more positive words. That's all that happens there. And then there's one more visualization. So that's as much as I can tell you about how you might go about applying sentiment to your corpus.

Oh, somebody asked — I see Ninel asked — the Bing dictionary only functions with English text; does it also work with other languages? That's a great question. There are other sentiment dictionaries; I don't know of any for other languages off the top of my head, but I think googling around will help you find out. It's a really good question, not only for sentiment dictionaries but also for stop words, because there are all kinds of articles and conjunctions in other languages that you'll need to take into account.
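To round out that Emma example, here's a sketch of the AFINN steps described above — filter to one novel, join with AFINN's numeric scores, and count:

```r
# Emma + AFINN: the AFINN lexicon scores words on a numeric scale
# (roughly -5 to +5), so the join adds a `value` column to each word.
library(dplyr)
library(tidytext)

emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(word, value, sort = TRUE)

emma_afinn   # words like "clever" and "rich" show up with their scores and counts
```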