Hello everyone and welcome. My name is Eric Fransen. We would like to thank you for joining us today for this webinar, a production of Dataversity, with our speaker, Nick Pendar of SkyTree. Today Nick will be discussing machine learning techniques for analyzing unstructured business data. Just a few quick points to get us started. Due to the very large number of people attending these sessions, all attendees will be muted during the webinar. We will be collecting questions in the Q&A box in the bottom right-hand corner of your screen, and we encourage you to submit them there. You can also use the chat window to send me messages if you need to. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information that may come up during the webinar. All of that will be posted at Dataversity.net within two business days. This webinar is part of a series, and we look forward to seeing you at future webinars in the Smart Data Webinar Series; the archived collection is available at Dataversity.net as well.

And now a few words about our speaker. As a natural language processing expert, Nick Pendar applies machine learning and data mining techniques to textual data in order to classify, extract, and organize information from a variety of sources. Nick received his PhD from the University of Toronto in 2005 and in the same year started an academic position at Iowa State University, where he conducted and directed research on NLP and text categorization for various educational and legal purposes. Prior to joining SkyTree as a senior data scientist, Nick also held engineering and R&D positions at Groupon, Uptake, and H5. He has published papers and given numerous talks on the topic of NLP to a variety of audiences for over 15 years. He has also filed multiple patents and is an active member of several related professional organizations and conferences. Please join me in welcoming Nick Pendar. Nick, welcome.

Thank you, Eric, for that great introduction. So today, instead of providing a laundry list of different techniques we use in text analytics, I decided to introduce some of the techniques that are available and then go a little deeper into one technique that we recently used at SkyTree for a specific problem. That way we get a highlight of some of the available techniques, but we also get into the details of what essentially amounts to data preparation for machine learning using text data, in the context that I'm going to provide. So without further ado, let's get to it. I'm going to go through the introduction and talk about some of the uses of text data and where text data comes from, then get deeper into the case study, talk about the challenges of obtaining training data, and present the experiments and the conclusions.

So, over the last few years that I've been working with text data and doing text analytics, I've noticed that there are two general sources of text data, internal and external to any organization. Internally, you have email and chat, all sorts of internal communication across team members in an organization. People also publish white papers, patents, and other types of IP documents. There are business documents that get produced, like notes and purchase orders. And machines produce lots of data as well.
For instance, if you have a large computer system, those systems usually produce a lot of logs. Those machines need servicing, so people service them and produce notes about what happened to the machines, along with a lot of other technical data. Obviously, you can think of other sources of text data within an organization. People throw out numbers like up to 80% of an organization's data coming in unstructured textual form. On top of that, you can add the external data sources: the web, which everybody is familiar with, social media, news, public records, electronic health records if you're in healthcare, professional publications if you're an R&D shop, libraries if you're a knowledge worker. So there is a lot of text data available to us, and the speed at which this data accumulates and its sheer mass make it impossible for anyone to process or ingest it in any meaningful way. This is when people resort to machine learning in order to make sense of the text data they have.

You can roughly categorize the uses of text data into three. The first is recommendation or search. For instance, you're interested in a document that is relevant to you, so you're asking the system to show you documents that are relevant to you. This could be in response to a query, such as a Google search: you have an information need, you type in your query, and you get documents relevant to you. Sometimes you have a number of documents and you're interested in documents that are similar to them; this comes up in a variety of contexts, for instance in exploratory analysis. Or you're interested in documents that people like you are also interested in. This could be for personal enrichment or for professional purposes.

Another use case for text data is categorization and e-discovery. Essentially, you want to see documents that are specific to a topic, and in all of these cases the machine learning techniques we use involve some form of classification. For instance, if you are under litigation and you're required to provide documents that relate to marketing practices for a particular product, you're basically asking the system to classify documents: whether they belong to the topic you're interested in or not, or whether they are close to the documents of interest to you or not. Sometimes an organization is interested in deleting a number of documents. They don't want to keep everything around; they don't want to keep duplicates or old documents. But they're also legally obligated to retain some documents. So they're interested in finding the documents that are either duplicates or do not belong to any IP or anything they might be required to produce in the case of litigation. That's another form of classification: this document can be deleted or not.

And then there is a vast sea of analytics. People talk about text analytics in a variety of different forms and contexts. Essentially you can ask, what is X thinking about Y? What is my customer thinking about my product, for instance? Or why did my customer make a certain decision? For instance, why did my customer leave me?
Or why are people closing their accounts with us, or coming to us, or whatever? In these contexts we're still using classification. We're still classifying documents in terms of: is this relevant to this topic? Is this relevant to a customer? Is this document related to a customer leaving or not? And on top of that, we use classical NLP techniques and information extraction techniques to extract information from a document after we've determined that it is of interest to us.

So when you look at all of these use cases, at the crux of them we have a classifier. The document is being classified in terms of whether it's related to your topic or not, related to your use case or not, and so on. And that makes sense, because people are classifiers at heart. Whatever we do every day, we are constantly classifying things. I'm walking down the street and somebody is approaching me, and I classify this person as friendly or not. I'm hiking and stepping on rocks, and I'm classifying these rocks as good footholds or not. So thinking of problems as classification problems is just natural for people. It's also a good way to get rid of unnecessary data so you can focus on the data that's relevant to you.

There's another term for classification in machine learning: supervised learning. The reason is that we usually collect some data for which we know the labels. For instance, documents that are related to sports and documents that are not related to sports, if our category of interest is sports. We collect these documents with labels we already know, and we train a classifier. The classifier learns the patterns, in this case the lexical patterns, of these documents, and at that point the classifier is able to predict whether a new document is related to sports or not, given its past history. That's why we call it supervised: we are providing labels to the classifier. And the cleaner and better the data you provide to your classifier, the better the classifier you get. However, that is exactly where the challenge lies: finding clean data, data that doesn't include a lot of errors and is relevant, and also enough data. You don't want to give your classifier only a handful of examples and expect it to perform well on new data, because it might not see enough patterns in that data set. So oftentimes, almost all of the time I would say, most of a data scientist's job is spent finding the data, cleaning the data, and preparing it for a supervised machine learning algorithm.

So how do we do that? In the case study I'm going to discuss now, a client came to us with access to the entire Twitter firehose, and they were interested in classifying each tweet into a number of categories. Basically, they wanted to tag all the tweets that came to them, store those tweets, and then provide dashboards to their clients showing them relevant tweets. When you look at Twitter, there are about 500 million tweets a day, and about 150 million of those are in English. The challenges of processing that type of data set, as anyone who has heard the term big data knows, are the data size and the velocity of the data.
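To make the supervised-learning setup described a moment ago concrete, here is a minimal sketch in Python with invented toy sentences; scikit-learn is used purely for illustration and is not the tooling from the talk. A handful of documents with known labels train a classifier, which then predicts the label of a new document.

```python
# Minimal sketch of supervised text classification on invented toy data.
# scikit-learn is used only for illustration, not the tooling from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Documents with labels we already know: 1 = sports, 0 = not sports.
docs = [
    "The team won the basketball game in overtime",
    "He scored a late goal in the soccer match",
    "The quarterly earnings report beat expectations",
    "The new phone ships with a faster processor",
]
labels = [1, 1, 0, 0]

# The classifier learns the lexical patterns of the labeled documents.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# It can then predict the label of a previously unseen document.
print(model.predict(["Another basketball game is on tonight"]))  # expected: [1]
```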
Another challenge is that tweets are very short, and that means people use acronyms and alternate spellings. Sometimes tweets are so short that they contain no signal for the classifier at all. And given that Twitter is basically live and people talk about things that are happening to them right at that moment, the data is also very dynamic in nature. Topics pop up, they change over time, and sometimes they completely disappear. So training a classifier for tweets is extremely difficult. Another challenge is that we don't have training data for these topics. Tweets have only been around for so long; we don't have training data even for regular topics like sports and finance, and for new topics that emerge we definitely don't have any training data.

So how do we create training data? This is the problem: we are interested in training a classifier, but we don't have any data to give it to learn from. There are several approaches you can take to find training data. One is you can say, okay, I'm going to hire some people, give them some tweets, and train them for a class: if you see these types of tweets, they all belong to sports, and the others don't. Usually this sort of thing is accurate; you get pretty good training data out of it, but it's extremely expensive. It doesn't scale at all. It only works for smaller data sets, for cases where you really need human judgments, for instance where the topic is extremely difficult, and in cases where the topics don't change as much. So that approach was out the window for us.

One approach we looked at was, why don't we use hashtags? People use hashtags to label their tweets for topics or other things, and we might be able to use hashtags, maybe cluster them and find topic markers there. Maybe some kind of pattern would emerge. There's actually very interesting research on clustering and finding patterns in hashtags. We looked at it, but it turned out that hashtags are extremely noisy, because people use hashtags in all sorts of ways. They just use them for their own purposes, not for tagging tweets with a class. So you have hashtags like TBT, throwback Thursday, or hashtags that apps put in to link users back to their app, a lot of different hashtags that don't actually mean anything in terms of the category. Those introduce an enormous amount of noise into the data and make it almost impossible to get any useful meaning from them. So after a short period of looking at hashtags and trying to cluster them to find topics, we decided to abandon that because it was too noisy. It's a good research topic, but it wasn't useful, at least for us, in a context where we were working with a client that wanted a production system.

Another approach is to say, okay, let's find a set of keywords that are not ambiguous, keywords we are sure signal a topic. For instance, soccer pretty much always means we're talking about the sport, and especially in the context of tweets, the occurrence of one or two of these words is enough to signal that the tweet is about sports. If I see soccer and basketball, the tweet is definitely about sports, nothing else. So we can say, I'm going to collect a number of unambiguous keywords and use those keywords to search through my data set. Anything that contains those keywords would be a positive example for my class.
That is also relatively accurate, because you're using unambiguous keywords, but it's hard to curate. How do you find an unambiguous keyword? And how many is enough? So you get low recall: no matter how many keywords I have accumulated, I'm not guaranteed to have high coverage of all the relevant keywords.

The last approach is to find a comprehensive keyword set. We are going to make a very large keyword set. It will definitely have noise in it, because if I have a large keyword set, some of those keywords might be ambiguous, but at this point we are hoping that the machine learning techniques we're using will be able to pick the signal out of the noise and train a good classifier. So we decided to take that last approach and create a comprehensive keyword set.

Now the question is, how do we come up with these comprehensive keyword sets? Again, we cannot go to people and ask them to give us comprehensive keyword sets, because that runs into problem number one: it's very expensive. So we decided to build keyword sets using an external knowledge source, something like Wikipedia. People have already put a lot of effort into Wikipedia, or Freebase, things like that, where there is a lot of knowledge already curated by people. Let's make use of that knowledge and create a keyword set from it. Another option is to find unambiguous hashtags or Twitter handles. Unambiguous hashtags and Twitter handles are good, but they don't give you enough coverage, whereas using a knowledge graph such as Wikipedia and extracting keywords from it gives you a lot of keywords and relatively good coverage.

So for example, in this case we created keyword sets for sports and for NBA by starting a crawl of Wikipedia articles at a node like Sports or NBA, collecting the titles of the linked Wikipedia articles, and treating those titles as keywords. For NBA we got something like 10,000 keywords, and that gave us about a 1% yield on the data we had at our disposal, which is good coverage. But like I said, the challenge is again that these keywords are not necessarily completely unambiguous. There is noise in there, and we're hoping that the classifier will pick the signal out of that noise. For example, among the keywords we collected for NBA, on the left-hand side you can see names of players and names of NBA teams; for sports, we got a list of names of sports, things like aikido and aerobatics.

So I have a number of keywords; what do I do with them? We define that any tweet containing these keywords is positive for my class. For NBA, any tweet that contains one of those names of NBA players or teams would be a positive tweet. You could even score that, but in this case we just went binary and said any tweet that contained the name of a sport would be related to sports. So far so good. What about negative examples? One way is to assume that any tweet that does not contain any of those keywords is negative. This is not true in general, because there's no guarantee that our keyword set is complete, so there could be tweets related to our topic for which we just didn't have the keyword.
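Backing up a step, the Wikipedia crawl described above can be approximated with the public MediaWiki API. The sketch below shows one plausible way to do it, a breadth-first walk over article links down to a small depth; it is not the crawler actually used in the project, and it leaves out the frequency-based cleanup mentioned later in the Q&A.

```python
# Illustrative sketch: collect candidate keywords by crawling Wikipedia
# article links from a seed topic via the public MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def linked_titles(title):
    """Return titles of main-namespace articles linked from one Wikipedia page."""
    params = {"action": "query", "format": "json", "prop": "links",
              "titles": title, "plnamespace": 0, "pllimit": "max"}
    titles, cont = [], {}
    while True:
        data = requests.get(API, params={**params, **cont}).json()
        for page in data["query"]["pages"].values():
            titles += [link["title"] for link in page.get("links", [])]
        if "continue" not in data:
            return titles
        cont = data["continue"]

def crawl_keywords(seed, depth=2, limit=10000):
    """Breadth-first crawl from a seed article, collecting linked article
    titles as candidate keywords down to a fixed depth."""
    seen, frontier = {seed}, [seed]
    for _ in range(depth):
        next_frontier = []
        for title in frontier:
            for t in linked_titles(title):
                if t not in seen:
                    seen.add(t)
                    next_frontier.append(t)
                if len(seen) >= limit:
                    return sorted(seen)
        frontier = next_frontier
    return sorted(seen)

# Example (requires network access):
# keywords = crawl_keywords("National Basketball Association", depth=1)
```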
Another way to choose negative examples was to train a general density estimation model on the tweets we deemed positive, estimate the probability of any other tweet under that distribution, and say whether the tweet belongs to that distribution or not. Tweets that had very low probability of belonging to the same distribution as our positive set, we deemed negative. So we ran two experiments. In one, the negative set was a uniform sample of anything that was not positive based on our definition. In the other, the negative set was selected from the tail of the distribution based on the density estimation. It turned out that the density estimation approach actually resulted in a worse model, so we went with the first approach: basically, we sample uniformly from the tweets that did not contain any of our keywords.

So let me step back a little and review this approach, because there was a lot of detail there. We had an initial concept, for example NBA. We identified a keyword set using our external knowledge source, Wikipedia, and came up with an initial keyword set, things like names of NBA players and NBA teams. Then we searched our unlabeled documents, our unlabeled tweets, for the occurrence of these keywords, and we defined anything that contained those keywords as positive and anything that didn't as negative. So that is our training data. But we didn't stop there. We said, okay, we're now going to treat this as a supervised text classification problem and identify features from this set. Identifying features basically means you create a dictionary of all the words you see in this data set and compute a score for how significant each word is in terms of signaling the category. So now we are essentially extending our initial keyword set by several orders of magnitude; the keyword set goes from 10,000 words to maybe a million. And each word is no longer binary, does it occur or does it not occur; each word now has a weight based on how many times we've seen it in positive or negative tweets. At that point we represent these tweets in a machine-learning-friendly way, and now I have a data set that is ready for machine learning.

So I haven't done any machine learning so far. But as I said, most of a data scientist's time goes into preparing data, especially when dealing with unstructured data sets; this is where the bulk of the problem lies. Now I have a data set that's ready for machine learning, and at this point we trained a machine learning model. The goal was to train a model that is good enough from the beginning to put into production, so that whatever it says belongs to the category actually belongs to the category. I'm not too worried about missing things. Over time, as we get more and more data with user input, we can enhance our training set, and with a better training set we can retrain periodically and improve the model over time. So we trained a model based on the occurrence of single words and two-word sequences, and we used a feature that SkyTree provides called AutoModel: basically, find the best parameters for the best model and just train a model for me. For this experiment we optimized the model for precision at the top 25%. Basically, I asked the software to train a classifier such that the top 25% of what it says belongs to the category actually belongs to the category.
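As a rough stand-in for the pipeline just reviewed: SkyTree's AutoModel is proprietary, so the sketch below substitutes scikit-learn's logistic regression, and the tweets and keywords are invented. Keeping only the top 25% of new tweets by predicted probability is a crude approximation of optimizing precision at the top 25%.

```python
# Sketch of the weak-labeling -> featurization -> classification pipeline,
# with invented tweets and keywords and logistic regression as a stand-in
# for the proprietary AutoModel described in the talk.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

keywords = {"stephen curry", "boston celtics", "lakers"}   # seed keyword set
tweets = ["Huge fourth quarter from Stephen Curry tonight",
          "Boston Celtics looking strong this season",
          "Trying a new pasta recipe for dinner",
          "Traffic on the bridge is terrible again"]

# Weak labels: positive if the tweet contains any seed keyword, negative otherwise.
labels = [1 if any(kw in t.lower() for kw in keywords) else 0 for t in tweets]

# Features: single words plus two-word sequences over the weakly labeled set.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(tweets)

clf = LogisticRegression().fit(X, labels)

# Score new, unseen tweets and keep only the top 25% by predicted probability.
new_tweets = ["Celtics win again tonight", "Best pasta I have ever made",
              "Curry drops 40 points", "New bridge toll starts Monday"]
scores = clf.predict_proba(vectorizer.transform(new_tweets))[:, 1]
cutoff = np.quantile(scores, 0.75)
print([t for t, s in zip(new_tweets, scores) if s >= cutoff])
```

In a real system the probability cutoff would be chosen on held-out data rather than on the batch being scored; the quantile here just keeps the toy example self-contained.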
So once that happened, we actually got pretty good precision and recall on cross-validation data, on the data that we created. In this table, the first column, the threshold, is the probability threshold we use to say whether something belongs to the category or not. This is on the sports data. Precision is the percentage of the tweets the model said belong to the category that actually did belong to the category, and recall is the percentage of the tweets we wanted to capture that we actually captured. As you see, we get fairly good precision and recall on this data set. But we shouldn't celebrate yet. I know we have confetti and balloons up there, but we shouldn't celebrate yet, because we created this data set artificially: we said anything that contained these keywords was positive and anything that didn't was negative. So getting good results on this type of training data is not a surprise. The question right now is, does this model actually work on new data, and does it go beyond the keywords? Basically, the question I'm asking is, I trained a classifier, but am I doing any better than just using those keywords?

So at this point we took the model and gave it some completely new, completely unseen data, and then we looked at how the classifier performed. It turned out that the classifier, because it had built this extra dictionary that was much larger than the initial keyword set, had captured a number of interesting patterns in the data. For instance, in the context of NBA, it had picked up names of basketball players that were not in our keyword set. Me not being a sports fan, I didn't know them, so I actually had to go and search to figure out what it was that it had picked up. It had learned, for example, that Al Jefferson is a basketball player. And in the fourth example, I couldn't find any single lexical item that would signal NBA, but it was the collection of terms that told the classifier this is likely an NBA tweet. We trained another model for the topic machine learning, and the system learned that latent semantic analysis, or latent semantic indexing, LSI, is a term related to machine learning. It learned that hidden Markov models and data science are related to machine learning. These are keywords and phrases that did not occur in the initial keyword set. So it looked like our system was learning something beyond the initial keyword set, which was the goal. We didn't want to just have a static keyword set and stop there, because especially in tweets, topics change, as I said before.

So we ran another experiment just to make sure. We said, how about we take the keyword set, remove one of the keywords, train the classifier, and see if it picks that keyword up. So we trained the classifier for sports, but we removed the word baseball from our initial keyword set, and did the same as before: we trained the model, ran it on new data, and checked whether it had picked up the word baseball, which it did. There were a number of tweets where the only signal in the tweet was the word baseball. Same thing for NBA: we trained a model for the topic NBA without using the word NBA, and it picked up that that word is significant. All of that comes from the fact that we are doing feature selection after creating the training data, and the features contain a lot more keywords than we started with.
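One way to run the "did it learn beyond the seed keywords?" check just described is to train the same kind of simple model and inspect its strongest positive features; with real data, a deliberately held-out seed word such as baseball showing up near the top of the list suggests the classifier has generalized past the initial keyword set. The tweets below are invented.

```python
# Sketch: inspect the strongest learned features of a simple text classifier
# to see which words it relies on to call something positive.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_positive_features(texts, labels, k=5):
    """Return the k features with the largest positive weights, i.e. the
    words and bigrams the classifier leans on most for the positive class."""
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)
    names = np.array(vec.get_feature_names_out())
    order = np.argsort(clf.coef_[0])[::-1][:k]
    return list(zip(names[order], np.round(clf.coef_[0][order], 3)))

texts = ["Home run in the ninth inning wins it",        # sports, no seed keyword
         "The pitcher threw a no hitter tonight",        # sports, no seed keyword
         "New phone has a much bigger battery",          # not sports
         "Stock prices fell sharply again today"]        # not sports
labels = [1, 1, 0, 0]
print(top_positive_features(texts, labels))
```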
So to conclude: creating a training set, especially for large data sets and for difficult topics, is a daunting task. In this particular case we were able to leverage Wikipedia, and it's usually good advice to go look at what other people have done and try to make use of it; in this case it was Wikipedia, a crowdsourced kind of knowledge that we were able to use. We trained a model for high precision that we could use in production. At the same time, the model was performing well enough that it was going beyond the initial keyword set and could be retrained and improved over time. And at this point I'll stop and open the floor for questions. Thank you.

Thank you, Nick. A couple of questions have already come in. I'll remind people to please go ahead and drop those questions into the Q&A module in the lower right-hand part of your screen. Nick, to start off, right around slide 16, a question came in asking how we test the results.

There's a common technique called cross-validation, where you divide your data set into, say, five folds. You train a model on four of them and test on the fifth, and you repeat until all of your data has been used both as training and as test data. These numbers are based on that, which in this context, in this study, would actually be misleading, because we selected the data artificially: any data that contained one of the keywords was positive, and anything that didn't was negative. So it's not a surprise that we get good results there. That's why we went further and did those tests on unseen data.

Okay. Our next questioner says, thanks, Nick, for this clear explanation so far, but don't we need to create a lot of variables on the data under research? Isn't that even more difficult than preparing a good training set?

Let's see if I can understand the context of this question. The variables that we created here were on this slide, the right-hand-side box, identifying machine learning features. That's where we are creating the variables, the features, for our training data set, and that's a variable set that's bigger than our initial keyword set. Is this the context of the question, or was it further back in the initial slides?

No, it came in later. A real quick question in the meantime, in case that questioner would like to post a clarification. This question is, please repeat what recall means.

Recall means the percentage of the items that we wanted to capture that we actually captured. So for instance, if there are 5,000 documents we were expecting to capture and we got half of those, we get 50% recall.

Okay, thank you. Can you talk more about how you get from topics like the NBA to initial keywords using Wikipedia?

Absolutely. This is the semi-manual piece of it. We start at the topic NBA on Wikipedia; you can search for it and find the best match for NBA. Now you have the starting point, and at that point you crawl Wikipedia and collect the Wikipedia articles linked from it, down to some depth, like five or six levels. That way you collect a large number of these keywords. Then you can do some computation on the frequency of things and on repetitions, which I didn't go into, so that you get a cleaner keyword set rather than taking everything you see.

Okay. And we actually got a couple of questions asking about tooling. What technologies and tools were used in this project?
Okay, so for the aggregation of the data and the aggregation of the keywords, we were using dumps of Wikipedia and Python code. Once we had the training set, the feature selection, the right-hand-side box on this slide again, we did with SkyTree software, and the machine learning aspect of it, the AutoModel and the smart search on this slide, those are SkyTree software as well.

Okay. That is all the questions we have at the moment; I'll give people a couple of minutes to add more. I was curious, what kind of human resources would be required within an organization to run this kind of a program?

Definitely a data scientist, I would say. The data scientist would help frame the problem, help scope it, and basically create the training set for it, or at least advise on the collection of the data. At that point, if you get into human annotation of the data, then you might need some subject matter experts, depending on the domain. For instance, if you're in the legal domain and you're doing e-discovery, you probably need access to paralegals or people who are well-versed in the legal domain to say whether a document is related to the litigation or not. In healthcare, you probably need a healthcare practitioner, such as a nurse, to tell you whether, for instance, a record contains a certain type of medication or a certain type of disease. But if it's just regular business data, you probably need people who can tell you whether, for instance, this place is a restaurant or a hotel or a spa, things like that. So that's when you need human annotation. A lot of times data scientists recruit others as human annotators; basically they say, hey, how about you label some data for me, and they try to get labels from multiple people so that they mitigate mistakes and collect high-quality data. Beyond that, if you want to put things in production, you obviously need software engineers and systems engineers to build or buy the infrastructure and the software that's necessary for it.

Okay. The next question is, can embeddings from RNNs, and I'm assuming the questioner means recurrent neural networks, possibly even at the character level, be used for semantic reasoning? And if yes, from which layer, and how would you construct dense vector representations of sentences?

The short answer is yes. There's research on that; people are using RNNs for semantic analysis. The challenge usually is that you need a lot of training data for this sort of thing, and the computational cost of the approach is high. But in general, RNNs work well, and if you have enough computational resources, you get good results.

Next question. Can ontologies be used as a substitute for, or in conjunction with, keywords for classification? Supposedly ontologies capture a certain amount of semantics, which might help in the analysis.

Yes, ontologies are very useful, and in fact you can think of Wikipedia as a kind of ontology. It's a graph more than an ontology, but in general, yes. However, in this type of context where you want to search for the occurrence of keywords, each node in the ontology should be associated with a set of keywords, because a node in the ontology is itself just a phrase, and searching for that one phrase would not be useful.
You might take a similar approach to ours and start from a node in the ontology and collect every other node underneath it, down to a certain depth. That might be an approach worth taking a look at.

Okay. A request here: can you talk more about how you get from keywords to machine learning features?

So for every keyword, we know how many times we see it in a positive document and how many times we see it in a negative document, and there are a number of statistics you can calculate based on that information. Actually, even without that, you can calculate term frequency times inverse document frequency, or TF-IDF. TF-IDF gives you a good weighting of the importance of the words in the corpus, or you can calculate more informed weights for each feature based on the occurrence of the words in the positive and the negative sets. Bi-normal separation is a good example; chi-squared is another good one. There are a number of statistics you can calculate. So once you calculate a score for every word, you have a dictionary, and for every word in the dictionary you have a score. At that point you can represent a document as a high-dimensional vector; this approach is called the vector space representation of a document. Each dimension of the vector corresponds to one of the words in your dictionary, and its value corresponds to, perhaps, the frequency of occurrence of the word in the document multiplied by the weight you calculated in the previous step. At this point each document is represented as a high-dimensional vector, and now you have a matrix that you can feed to your machine learning algorithm.

Okay. A question here about the positives and negatives you were talking about. Do you reclassify some negatives to positives, or positives to negatives, based on the new keywords found using ML?

Yes. Well, this data is time-sensitive, so as time goes by you probably want to refresh the training set altogether. But as time goes by, based on user interaction, where you can ask users whether something actually is related to the topic or not, or based on whether they click on it or not, you can decide that a new document or a new tweet is actually positive or negative. At that point you can collect more data and add it to your training set. Or you can use some kind of active learning approach, where you have your classifier ask for clarification: you train the model, run it over a new set of data, pick the data points that are closest to its decision boundary, the data it's most uncertain about, and present those to a user and say, hey, how about this, is this related to my topic or not? At that point you take the user's input and add it to your training set, which in effect means some of the things the classifier said were negative might end up being positive, or vice versa. So in a manner of speaking, the short answer is yes, but you probably want to do this on new data that's coming in so that you're also getting fresh data.

Thank you. Is a classifier equivalent to a feature in the context of a keyword?

No. The classifier looks at the features, looks at the keywords, and finds patterns in those, so those are two different things. Basically, a feature is one of the inputs to a classifier.
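To make the "keywords to features" answer above concrete, here is a small sketch that computes TF-IDF weights by hand so the frequency-times-weight arithmetic is visible. The three toy documents are invented; a real system would use a library vectorizer, a much larger vocabulary, and possibly a more informed weight such as bi-normal separation or chi-squared.

```python
# Sketch of the vector space representation: one dimension per dictionary word,
# value = term frequency in the document times the word's IDF weight.
import math
from collections import Counter

docs = ["curry scores again for the warriors",
        "the warriors win the title",
        "new pasta recipe for dinner"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def idf(word):
    """Inverse document frequency: rarer words get a higher weight."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(docs) / df)

def to_vector(doc):
    """Represent one tokenized document as a vector over the whole vocabulary."""
    tf = Counter(doc)
    return [tf[w] * idf(w) for w in vocab]

# Each document becomes a row of a matrix that can be fed to a learning algorithm.
for text, row in zip(docs, (to_vector(d) for d in tokenized)):
    print(text, "->", [round(v, 2) for v in row])
```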
How do you handle negations, such as those often found in disclaimers?

Good one. In this context you really don't need to handle negation, because we are building a classifier for a topic. If somebody says, I like soccer, or, I don't like soccer, they're talking about soccer nonetheless, especially in tweets, which are very short. You could think of a case where somebody says, my document is not about sports, I'm going to talk about something else, but in those contexts, again, because we're only interested in the topic, the lexical patterns signal the topic regardless of negation within the document. However, there are cases where negation does matter, for instance in sentiment analysis, where somebody says, I love this hotel, or, I don't love this hotel, and things like that. At that point there are several approaches you can take. One is to make sure the negation words are in your feature set and use a nonlinear classifier to pick up the co-occurrence patterns of those negation words with the other words; that will basically mitigate the problem. Another approach is to featurize your document not as one complete bag of words but as a collection of bags of sentences, for instance, to maintain the locality of the negation and the words surrounding it. Another approach would be to use n-grams, sequences of words that would include those negations, but that requires a lot of data; if you don't have enough data, n-grams will probably not be useful. There are several ways. In some other problems you might try to parse the sentences and capture the fact that a negation was present close to a verb, for instance. Those are more involved, and usually they take a lot of time with diminishing ROI. So I would start with just a simple bag of words and then incrementally make the system more complex as needed.

All right. Thank you so much, Nick Pendar of SkyTree, for this great presentation and Q&A. I'm afraid that is all the time we have for today. Just to remind everyone, we will be posting the recorded webinar and slides to Dataversity.net in the On Demand Webinars section within two business days, and I will also send out a follow-up email to attendees to let you know how to access that material. The next Smart Data webinar will be on the second Thursday of November, that's Thursday, November 12th, at the same time, and our topic will be a semantic solution for financial regulatory compliance with Michael Bennett of the EDMC, the Enterprise Data Management Council. Thank you all again. Thank you, Nick. Thank you for today, and I hope you all have a fantastic day.

My pleasure. Thank you for your time.