Great, thank you all so much. I am really happy to be here. What I'm going to be talking about today is text mining and natural language processing, and how we can get from raw text that we need to analyze to interesting visualizations. But I'm going to be talking about these text mining tasks from a particular perspective: the perspective of applying tidy data principles and using tidy tools. The argument I want to make with you today is that applying tidy data principles can make text mining easier. It can make the analysis easier, and it can make our move from analysis to visualization easier. By easier, I mean easier to reason about the things we want to do, and easier to integrate these tasks into workflows that are already in wide use by data analysts and data scientists, so that we have a lot of resources about how to go about them effectively.

The work I'm going to be talking about today is based on an R package called tidytext. I'm one of the authors of this package, along with my collaborator, Dave Robinson. This work is centered within the R ecosystem of tidyverse tools, which includes packages like dplyr, tidyr, broom, and ggplot2. The point of the tidytext package is to provide functions and supporting data sets so that analysts, data scientists, and data journalists can approach analyzing text with the same powerful infrastructure that has proven so effective for data manipulation and data visualization, extending that infrastructure to text analysis.

I know that in this room there's probably a range of familiarity with what I mean when I say the phrase tidy data, from the main developer of these packages, who is sitting somewhere here, to people who are probably unfamiliar with the term. I'm not going to go into a super esoteric definition, but I am going to talk a little bit about what I mean by tidy data specifically in the context of text.

So how is text stored? We all produce a lot of text; we all use words a lot. But how is text stored when we bring it into a computer and want to analyze it? If you first read text into R, or any programming language, you're going to store it as a string; in R we often call this a character vector. If you have a whole bunch of them, you can put them together, annotate them with some information like the title or the author, and form what we call a corpus. Or we might store text in something called a document-term matrix. This is a sparse matrix that has a row for every document and a column for every term, which is usually a word. There are a lot of zeros in this matrix, but there will be a value wherever a document contains a word. These are all examples of data structures for storing information about text when using code to analyze text. But none of these is a tidy data structure in the sense I'm going to talk about today. So let's talk about a tidy data structure.
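To make those storage forms concrete, here is a minimal sketch. The talk names the structures, not a package, so using the tm package for the corpus and document-term matrix objects is an assumption on my part:

```r
library(tm)

# A string: in R, a character vector
raw <- c("It is a truth universally acknowledged,",
         "that a single man in possession of a good fortune,",
         "must be in want of a wife.")

# A corpus: a collection of documents, which can carry metadata
corpus <- Corpus(VectorSource(raw))

# A document-term matrix: a sparse matrix with a row for every document
# and a column for every term, mostly zeros
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```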
So the big takeaway I want you all to keep in mind when you think about a tidy data structure in the context of text is that we're going to have one observation per row. An observation, in the context of text, is the observational unit of text that you're interested in. Often that means a single word, but it doesn't have to be a word; depending on the analytic question you're asking, it could be something like an n-gram, or a sentence. But let's think about one observation per row as the thing you're interested in studying, like, say, a word.

So let's step through a quick, small example using part of a poem by Emily Dickinson, who wrote some lovely texts in her time. These are four lines from one of Emily Dickinson's poems, and here we're assigning them to a character vector in R called text. If I look at what is stored in text, it is just a character vector. I can load the dplyr package, put this into a data frame, and add another column that tells me which line of the poem we're on. This text is now in a data frame, but it's not yet in a tidy format data frame. So let's load the tidytext package and use the function unnest_tokens to transform this data from an untidy data structure into a tidy one. Instead of having each line of the poem on one row, we're going to have a single word on each row, because the observational unit we're interested in is the single word. We want one observation per row, which in this case means one word per row. Notice a couple of things about the data frame we have after this process. We still have the other columns we set up before, so we still know which line of the poem each of these words came from. We also removed punctuation and converted the words to lowercase; these are default options in the tidytext functions, and they are often the best choices for analysis moving forward.

So that was a tiny bit of text. Now let's look at a bigger chunk of text. Let's take the six completed, published novels of Jane Austen and work on converting these to a tidy text format. In R, you can access the complete text of Jane Austen's novels in the janeaustenr package. Yes, that's one of my packages. So let's load janeaustenr, along with dplyr and stringr, and set up a data frame using code like this (sketched below). The output of that code is a data frame with a few columns. The text column contains the actual text content of the books, and then book, linenumber, and chapter annotate each row of text: which book did this line come from, which line of the book are we on, and which chapter of the book are we in? The linenumber and chapter restart when we get to a new book. So we have a data frame, but we don't yet have a tidy data frame; let's use unnest_tokens to do that. Notice that we don't have a text column anymore; we have a word column. Also notice that the data frame is much longer. We have a lot more rows, which is what we would expect here. And notice that we still have all the other information we had before: we still know which book, which chapter, and which line each one of these words came from.
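A minimal sketch of both steps, following the code in the tidytext book at tidytextmining.com; the four Dickinson lines and the chapter-detecting regular expression are the ones used there, so treat the details as illustrative:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(stringr)

# Four lines of Emily Dickinson, stored as a character vector
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text_df <- tibble(line = 1:4, text = text)

# One word per row: the tidy text format
text_df %>%
  unnest_tokens(word, text)

# The same idea at a larger scale: Jane Austen's six novels,
# annotated with line number and chapter before tokenizing
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()

tidy_books <- original_books %>%
  unnest_tokens(word, text)
```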
So congratulations. We did it. We converted our text to a tidy format. I'm very proud. This is very exciting. But why might we be interested in this approach? Why am I standing up here talking to you about applying tidy data principles to text analysis? The reason this approach is powerful is that text mining tasks we often need to do become natural extensions of common tidyverse operations. For example, if you're doing a text analysis, you often need to remove stop words. Stop words are words that are very common in a language and are not interesting for an analysis; in English, these are words like "the", "of", and "to". If your data is in a tidy format, you can remove stop words using dplyr's anti_join, so you can remove them easily. After we've done that, we can ask: what are the most common words that Jane Austen used in her novels? Then we can simply take one more step and make a visualization of that (see the sketch below). We see words like miss, time, dear, lady, and sir, and we also see some proper names of characters: Fanny, Emma, and Elizabeth. So we're able to get to a visualization that communicates the results of our analysis with just a handful of lines of dplyr and ggplot2 code, because we used tidy data principles with our text.

This is an example of an analysis based on just frequencies, basically just counting. Another counting-based analysis that I did within the past year uses a data set of pop lyrics: songs that were on the Billboard year-end Hot 100 from about the 1960s to the present. And I asked the question: which states are mentioned most often in pop lyrics? I took a similar approach, where I took the raw text, transformed it to a tidy data format, and then made a visualization to answer the question. This map was made with ggplot2, and we can see that a state like California is mentioned most often. The New York mentions are, of course, mentions of New York City, not New York State. But California is a very populous state, so maybe we would prefer to divide by the population of each state, and instead look at which states are mentioned in pop lyrics most often relative to how many people live there. Here, we see that states like Hawaii and Montana are mentioned in pop lyrics very often relative to how many people live there. These are states that have a big impact on pop music and culture relative to how many people are actually there.

So this is an example of the kind of analysis that is possible using these tidy data tools, moving through a pipeline from raw text to visualization using dplyr and ggplot2, a pipeline that is well understood and well supported, and that gives us visualizations that I'm proud of and can share. This one was picked up by the Washington Post's Wonkblog a couple of months ago, and I was excited to see it being shared around. So that was what tidy data is and how we can do analysis based on counting and frequency. The next thing I want to talk about is sentiment analysis.
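Here is a minimal sketch of that stop-word removal and counting pipeline, assuming the tidy_books data frame from the earlier sketch; the filter threshold is just an illustrative cutoff for plotting:

```r
library(ggplot2)

data(stop_words)  # a data set of English stop words shipped with tidytext

tidy_books %>%
  anti_join(stop_words) %>%      # remove stop words with a dplyr anti_join
  count(word, sort = TRUE) %>%   # count the remaining words
  filter(n > 600) %>%            # an illustrative cutoff
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col()
```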
So sentiment analysis is a kind of analysis that asks questions of text like: what is the emotional content of this text, or what opinion is being expressed in this text? Sentiment analysis is often done with sentiment lexicons, or dictionaries. These are lists of words that have been assigned scores. They might be scored from negative five to positive five, saying how negative or how positive a word is, or they might be scored in a binary fashion: this is a negative word, this is a positive word, this is a joy word, this is a sadness word. The way you assess the sentiment of a section of text is by adding up the sentiment of all the words that make it up. If your text data is stored in a tidy format, you can implement sentiment analysis with an inner join operation.

So let's go back to Jane Austen. If we have our Jane Austen text in a tidy format and we use an inner join with one of the sentiment lexicons, then afterward what we have is the words that were in both data sets, and we can add them up. We can use just a couple more lines of dplyr and tidyr code here, actually, and the data frame we end up with shows us how the sentiment changes during the narrative arcs of these six novels. We can see during which parts of the books more positive words or more negative words are being used. We're able to get to this visualization, which I made again with ggplot2, because the data is in a tidy format and I can use an inner join to implement sentiment analysis. The sections of extended negative or positive sentiment correspond to plot events that we as human readers understand to be good and bad things. So we're seeing real effects and gaining real insight by doing this. For example, if you look at Pride and Prejudice, the section near the middle of the book, where you see the first extended stretch of really negative sentiment, that's where Mr. Darcy proposes for the first time, so badly, and Elizabeth is very upset and angry with him. And the section about three quarters of the way through, where you again see an extended stretch of negative sentiment, that's where Lydia elopes with Mr. Wickham and it is such a terrible scandal. So we're able to identify plot events that reflect our human understanding of what this text is saying, by embracing tidy data principles and making visualizations in this way.

It's also important, when we're doing sentiment analysis, to understand which words are contributing to each sentiment, and when we keep our data in a tidy format it's easy to get at this. Here, for example, we're doing the inner join again, but instead of only adding up sentiment, we count by word and sentiment. Then we can see how much each word contributes to each sentiment: which were the negative words and which were the positive words. And we can again make visualizations to get a better understanding of what's going on. On the positive side, these words all look good, right? These are the Jane Austen words like happy, love, good, great, pleasure, happiness. And some of the words on the negative side are fine.
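A sketch of both steps, along the lines of the code in the tidytext book; the Bing lexicon, the 80-line index chunks, and pivot_wider in place of the book's older spread are my choices here, so treat them as one possible setup:

```r
library(tidyr)

# Net sentiment across the narrative arc of each novel,
# computed over chunks of 80 lines of text
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")

# Which words contribute most to each sentiment?
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
```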
Those are words like poor, doubt, impossible, afraid, scarcely. But that top word, miss: in Jane Austen, that is used as a title for a young unmarried woman. So this is an example where the sentiment lexicon has misidentified a word in our text that is actually a neutral word. It's important that we explore which words are driving the sentiment in which direction, and tidy data tools make it easy to explore what is going on with a sentiment analysis. In this case, we could add miss to a custom list of stop words, use anti_join to remove it, and then redo our sentiment analysis and see how that changes our results. This is some of the power of applying tidy data principles: it is easy to iterate, easy to keep going in this process, because these pipelines are effective and easy to reason about.

So far we've talked about what tidy data is when it comes to text, and we've talked about sentiment analysis. The last big thing I want to talk about is: how can we programmatically understand what a document is about? We might say, okay, let's look at the words that are used most often, something like term frequency. Term frequency is when we take the number of times a word is used in a document and divide it by the total number of words in that document. The idea is: if a document uses a word a lot of times, that is an important word in that document. But again, we run into the problem of some words being very common but not important.

A more sophisticated way to approach that problem is something called inverse document frequency. Inverse document frequency is a weight, and it works like this: it's the natural log of a ratio, idf(term) = ln(number of documents / number of documents containing the term). Let's say we have a collection of documents, and all of the documents contain a word. Then this ratio will be one, the natural log of one is zero, and that word gets weighted down to zero. But let's say only one or a few of the documents in our collection contain a word. Then the ratio will be bigger than one, the natural log will be a bigger number, and so our IDF, our inverse document frequency, will be a bigger number. If you take term frequency and multiply it by inverse document frequency, you end up with something called tf-idf. It's a statistic, something you measure about a word in a collection of documents. And it's a heuristic quantity, which means that people don't really have a great theoretical backing for why it works. But what it is meant to do is identify words that are common but not too common: words that are common in one document within a collection of documents.

So let's look at how this works, going back to Jane Austen, naturally. First, let's look at term frequency. We can easily calculate term frequency from scratch using dplyr verbs (a sketch follows below), and we get a data frame that tells us, for every book and word, how many times the word is used and how many words the book has in total. We can make a visualization of this and see the distribution of n divided by total, which is exactly what term frequency is.
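A minimal sketch of the custom stop word and the from-scratch term frequency counts, following the patterns in the tidytext book:

```r
# Add "miss" to a custom stop word list, then remove it with anti_join
custom_stop_words <- bind_rows(tibble(word = "miss", lexicon = "custom"),
                               stop_words)

tidy_books %>%
  anti_join(custom_stop_words)

# Term frequency from scratch with dplyr verbs:
# n is how often each word appears in each book; total is the book's word count
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)
# term frequency is then n / total
```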
So you notice that these distributions are very similar to each other. We see that there are a lot of words used just a few times, and then we have these very long tails: just a few words that are used many, many times. Those are the words like the, of, and to out there in the long tails. What tf-idf is going to try to do is look at this distribution and find the words that are important for each document compared to the other documents.

The tidytext package has an implementation of tf-idf; the function is bind_tf_idf, and the output will be a data frame like this (see the sketch below). Notice we've got a tf column now, which is just the same value we got before by dividing n by total. Notice also that idf, and thus tf_idf, are zero for these examples, and that's because all six of Jane Austen's novels contain the words the, to, and, and of. But if instead we take this data frame and arrange it, sorting so that we look at the words with the highest tf-idf, then we see these words. If you're familiar with Jane Austen's novels, you know who these people are, and even if you're not, you can tell that these are proper nouns: the names of people and places in Jane Austen's novels. So what have we learned? We got to a visualization like this using tidy data principles, and we learned that across the corpus of Jane Austen's novels, her word choice was similar from one novel to another, and the things that most distinguish one novel from another are the proper nouns. The proper nouns are what make the novels most different from each other in a linguistic sense. We're able to see that because of the way we've applied these tools.

Another application of tf-idf I want to talk about is related to the NASA Datanauts program, which is something I've been a part of for the past year, and which has been really interesting and fun. It's been a project where I've been involved with data science learners and data science mentors, and one of the projects I've been involved in is understanding NASA data sets. NASA has over 32,000 data sets, and they have metadata on these data sets: things like which subsection of NASA they came from, the title, the description, and the keywords they're tagged with. What NASA wants to do is understand the connections between these data sets in better ways, using machine learning techniques. I've been involved in topic modeling and various other techniques, but one of the things I did was apply tf-idf to the description fields in just the same way I talked about in the last example, and the results look like this. This is a subset of the NASA data sets. If you look at NASA data sets tagged with the keyword seismology, the high tf-idf words are words like risk, earthquake, acceleration, and hazard. But if you look at NASA data sets tagged with, for example, the keyword budget, you see very different high tf-idf words: words like the Office of Management and Budget, or fiscal and financial. So measuring tf-idf for the words in description fields got us to meaningful insight with two very different kinds of text.
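A sketch of the tf-idf step, assuming the book_words data frame from the previous sketch:

```r
# bind_tf_idf adds tf, idf, and tf_idf columns to a tidy count data frame
book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)

# The highest tf-idf words: the words most characteristic of each novel
book_tf_idf %>%
  arrange(desc(tf_idf))
```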
We had narrative fiction from a couple hundred years ago, and we have technical description fields on data sets, and in both cases we got real insight, because this is a flexible and powerful technique.

One thing I'm just going to mention briefly is that so far, in everything I've talked about today, the observational unit was the single word, but that is certainly not the only observational unit we might be interested in when it comes to text. We can also look at things at the n-gram level. For example, if you look at bigrams, pairs of words within a text, you can then ask questions about networks of words, or questions about how negation is affecting your sentiment analysis. You can ask very interesting and complex questions by moving from the single-word level to larger sections of text, and this is all still possible while approaching your data with tidy data principles (there is a sketch of this at the end). I'll just throw up something else from the NASA data analysis: a network of tags from the NASA data sets, showing how often, or how strongly correlated, different tags appear together. There are some tags that are correlated with a correlation of one, meaning they always come together, and the correlation goes down from there, where some tags are used together less often. We're able to make a network, see what the universe of tags on NASA's data sets looks like, and try to understand it. So this kind of analysis is possible within this whole universe of tools.

It is not necessary or required that text data be kept in a tidy data structure through all stages of an analysis. You can have a workflow where, for example, you pre-process, filter, and do initial exploratory data analysis using tidy data tools, then cast to, say, a document-term matrix so you can run a machine learning algorithm that involves the linear algebra of the matrix, and then tidy the output of your statistical modeling procedure and use, say, broom and ggplot2 to visualize it. So tidy data principles with text are powerful and fun, and a great way to use tools that are in wide use, but they're not mutually exclusive with other data structures or other approaches where those can also be used.

My collaborator Dave and I are writing a book on this. What the book sets out to do is provide background and examples for people who want to apply this approach to the text analysis they want to do. The book is available in its entirety online at tidytextmining.com, and it's also available for pre-order on Amazon. It has chapters with introductory material and ways to get started, and the last few chapters are case studies: beginning-to-end case studies with real-world, actual data. How do we read in the data, how do we clean it, how do we do things like sentiment analysis or topic modeling, and how do we gain insight from it? So there are whole beginning-to-end case studies there at the end. We're really excited to have this resource available for people who want to use this approach to text mining. And with that, I will say thank you very much.
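For reference, the n-gram tokenization and the tidy-to-matrix casting mentioned above might look like this minimal sketch, assuming the data frames from the earlier sketches:

```r
# Tokenize into bigrams: the observational unit becomes a pair of words
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Cast a tidy data frame to a document-term matrix for modeling;
# model output can then be tidied again downstream, for example with broom
austen_dtm <- tidy_books %>%
  count(book, word) %>%
  cast_dtm(book, word, n)
```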