And that's my colleague Ganesh. We're both at a firm called Gramener, and we're into data visualization. This is a talk on visualizing text, which is slightly unusual and certainly a new field. Text is not something we analyze too well; we're a lot better with structured, numeric, quantitative data. And text is certainly not something we know how to visualize well. Actually, there aren't too many things we know how to visualize well. So this is a fairly emerging field, and I'll talk about some of the advances that have been made, some of the things we've learned, some examples we've created, some examples we've seen that are really good, and I hope you'll get something out of it. My interest in text and text visualization goes back about a decade and a half, to when I was first introduced to Calvin and Hobbes. Now, Calvin is an amazing character. Calvin's dad is actually an even more amazing character for giving advice like this. In fact, there's a subreddit about explaining things exactly the way Calvin's dad does. Now, I love giving examples from Calvin and Hobbes, and when I was at BCG, creating slides, every now and then I'd want to pick that one strip where Calvin was talking about not understanding stuff, or TV being the best medium, or whatever, and the trouble is, you just can't find it. Do a Google search: nope. uComics, which distributes Calvin and Hobbes, does not have a searchable interface. So in 2001, when I got my first laptop and was commuting between Bandra and Nariman Point every day on the train, I spent all that time, between 2001 and 2006, typing out every single Calvin and Hobbes strip, and I completed the task in London. I typed it into an Excel sheet where, as I scrolled up and down the rows, I could see the different strips. 
In fact, I can show you the Excel sheet. That's what it looks like: as I go up and down, it shows me the strip, and I would type in, okay, what's this, "taste it, you love it", blah blah blah. That was my workflow for typing this in. It took a long time, but we finally got there. Once the content exists, it's a question of how one searches it. With Excel you can copy and paste it anywhere and search it, but I figured a web interface would be slightly easier, and I created a small Calvin and Hobbes site where you can search. So for example, if I want to search for the Tracer Bullet strips: okay, that's the first Tracer Bullet coming up, that's the second one, that's the third one. Or I can find all the ones where snowballs were mentioned, or Spaceman Spiff, or Miss Wormwood. Good stuff. It went up on Reddit and a few other places a number of times, and on one of those happy occasions it was noticed by uComics' legal team, and the site is now shut down. I wrote back saying: look, I'm happy to offer this text, feel free to use it on your site, I'll build it for you if you want. No response. But the good part is that the text exists, and I have this corpus, and I started playing around with it, which marks the start of my interest in text analysis. I asked: what does Bill Watterson talk about? What are the kinds of words he uses more often? And that brings us to the classic example of text visualization, the one piece of text visualization that everyone recognizes today: the word cloud. 
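Before any word cloud can be drawn, you need word frequencies. A minimal sketch of that counting step might look like this; the sample text is illustrative, not the actual corpus.

```python
# Minimal sketch: count word frequencies, the raw input for any word cloud.
import re
from collections import Counter

def word_frequencies(text, top=10):
    """Lowercase, tokenize on letters/apostrophes, and count."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

sample = "Calvin and Hobbes. Calvin talks to Hobbes. Hobbes grins."
print(word_frequencies(sample, top=3))
# → [('hobbes', 3), ('calvin', 2), ('and', 1)]
```

A word cloud layout engine then just maps these counts to font sizes and packs the words so they fit.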
You take words and lay them out such that the size of each word is proportional to the frequency with which it's mentioned, packed so that they just about fit. So clearly Calvin and Hobbes is a lot about Calvin, to a good extent about Hobbes, then "no", "good", "see", "mom", "like" and so on, and you get a sense of what this is about. If you want a very quick sense of what any piece of text is about, this is a great way of doing it, and I'll show you how you can create a word cloud within a few seconds. Now, word clouds come in many shapes and forms. There's a guy called Jeff Clark, who runs the site Neoformix, where you'll find some incredible examples. For example, here's a word cloud of words that appear on Apple's website, shaped into Apple's logo. A word cloud doesn't have to be a homogeneous, arbitrarily shaped mass; you can shape it into pretty much anything you want, and the shape itself can convey meaning. This, for example, is a picture of Obama, but it's actually made up of a word cloud, and the words in it are the words he used in a State of the Union speech: a lot about change and hope and so on, with many words repeated. You'll also see on that site a picture of Albert Einstein in which the word "Einstein", in varying sizes, is used to create the image. This is a combination of data and art: the shape of the picture is mostly the art part, but to the extent that the sizes of the words are proportional to the frequency with which they're used, that's the data part. Now, it's possible to create a word cloud extremely easily. There's a site called wordle.net, which in my opinion is one of the best word cloud sites, and all it takes is this. I'm going to take the Prime Minister's Independence Day speech from 2011, copy it, go to wordle.net/create, paste the text, click Go, and 
there you go, that's all it takes. Let's look at how you can play around with this. I don't like this font; Coolvetica is nice, or this Gothic one is pretty decent. I don't like this color scheme; I can make it plain black and white, or Indian earthy, or whatever. I can go for straighter edges if I like straight edges, but rounder edges tend to look good. You can lay it out in a random way, or half horizontal and half vertical, or entirely vertical, but usually horizontal layouts are the easiest to read, and you can play around with the word schemes and so on. It's fast, and it's very tough to get this kind of smooth, beautiful layout any other way. And you can see what he's talking about: he's talking about the country a whole lot; he's talking about brothers and sisters, in fact a little too much, probably; he's certainly talking about the government; he's talking about prices, about the political, about the Parliament, about development, about people, economic matters. You don't have to read, what was it, 20 paragraphs to get a sense of what this is about; you know what he's talking about. In fact, one of the partners we had at BCG said: look, I don't need to read any of Abdul Kalam's speeches, I know what he's going to talk about. He's going to talk about children, he's going to talk about science, he's going to talk about education. You don't always need to read something to get the gist of what is being said, and word clouds help a lot there. The trouble with word clouds, though, is that you don't get the context in which a word is mentioned. When he says "country", what is he saying about the country? When he says "brothers" or "sisters", what is he telling his brothers and sisters to do? When he says "also": also what? The context is a little difficult to get at, and there is a technique called concordance which is popularly used to bring 
this about. Let me give you an example of concordance in Calvin and Hobbes. Say I want to know what Calvin thinks about girls. Okay, so Calvin thinks girls are slimy, slimy, slimy; the bulk of it is that he certainly doesn't like them, and the words associated with "girls" are "slugs" and "sissy" and "Susie", which sort of rhymes. He has in fact gone as far as creating a club, G.R.O.S.S., Get Rid Of Slimy girlS, for a purpose that is fairly obvious. Then you can just explore. What does he think about "secret"? Click on "secret": okay, it's a lot about secret notes. What does he think is important? Actually, he thinks lots of things are important: "I'm expecting important calls", "doing big things". You get a sense of the context. Now, what if we could combine these two techniques, a concordance with a word cloud? No such tool existed, and to my knowledge none exists, so we ended up having to build one. It's available at sip.s-anand.net. Now, I'll explain what the SIP 
bit does, but first let me demonstrate it. These are all the words in Calvin and Hobbes; the larger the word, the more often it occurs. So "Calvin" is pretty common, "Calvinball" is pretty common; the smaller words, like "biology" and "bidding", are not so common. You can play around with this a little: you can ignore the common words, you can ignore the infrequent words, leaving the words that are mentioned fairly often, and you can adjust the scale up and down. The contrast, now, bears a bit of explanation. You may have heard of Amazon's Statistically Improbable Phrases; they popularized this term when describing books. They said: if you want to find out what a book is about, to get the key elements of the plot, you don't need to read the book; we'll just give you the improbable phrases. The improbable phrases are simply those that occur in the book reasonably often, but not as often in the English language. Say this book has the phrase "Spaceman Spiff" 5% of the time, while the English language has it 0.05% of the time; that's an extremely improbable phrase. Whereas something like "good luck" occurs 10% of the time here and 5% of the time in the English corpus, so it's not such a big deal. That's what's going on behind the scenes here: the darker the word, the more improbable it is in the English language; the lighter the word, the more probable. So if I want to see the words that are very improbable, I can just increase the contrast: Hobbes, Spiff, Susie, these are among the most often used words that are quite improbable in the English language. 
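That contrast score can be sketched as a simple frequency ratio; the frequency tables below are toy stand-ins for the Calvin and Hobbes and English corpus statistics, not real data.

```python
# Sketch of "statistically improbable phrase" scoring: the ratio of a term's
# frequency in this corpus vs. a reference (English) corpus. A floor avoids
# division by zero for terms absent from the reference corpus.
def sip_score(term, corpus_freq, reference_freq, floor=1e-6):
    """Higher score = more improbable in ordinary English."""
    return corpus_freq.get(term, 0.0) / max(reference_freq.get(term, 0.0), floor)

corpus = {"spaceman spiff": 0.05, "good luck": 0.10}      # share of this text
english = {"spaceman spiff": 0.0005, "good luck": 0.05}   # share of English
print(sip_score("spaceman spiff", corpus, english))  # → 100.0
print(sip_score("good luck", corpus, english))       # → 2.0
```

Mapping that score to ink darkness (and raw frequency to font size) gives the two visual channels the demo uses.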
Stuff like this: you can play around with it and see where it gets you. The site is sip.s-anand.net, and it brings together two techniques of text visualization: a word cloud and a concordance. Where does one use this sort of thing? Are there business examples where it gets used? Actually, yes, in a number of places. To give you a couple: we were doing a piece of work with a bank where we took all of the text mentioned in their transactions, actually from checks. Take all of the text written in these check transactions, the payee, the description, whatever, put it all together, and create a word cloud. A lot of it is about cash, so people pay cash, but you can also see some details about where the money is going: for instance, a fair bit is going to HDFC, a bit to other lines, a bit to escrow accounts. You can also see the kinds of names these checks are being written to: a lot of Shahs and Patels; this is, incidentally, Bombay data, though I suspect it may be the case even otherwise. A lot of payments are being made to companies: limited companies, public limited companies, LIC and so on. Then you can start saying, okay, these are the ones to focus on. The organization we were working with was looking to see which organizations they needed to partner with to minimize their payment costs, so we made a simple list and said: those are your top 10 payees to worry about. LIC payments, MTNL payments and BSES payments between them cover the vast majority of where people write checks to, and this despite us not being able to track the various accounts to which each check was being sent. So it helps summarize what is available in an unstructured form. Now for a very different example. 
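The concordance half of that combination can be sketched as a simple keyword-in-context (KWIC) listing; the window size and sample line here are illustrative, not the real corpus.

```python
# A minimal keyword-in-context (KWIC) concordance: each hit is shown with a
# few words of surrounding context, the matched keyword uppercased.
import re

def concordance(text, keyword, window=3):
    """Return each occurrence of `keyword` with `window` words on each side."""
    words = re.findall(r"\w+", text)
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append(" ".join(left + [w.upper()] + right))
    return hits

text = "Calvin thinks girls are slimy and girls are like slugs"
for line in concordance(text, "girls", window=2):
    print(line)
# → Calvin thinks GIRLS are slimy
# → slimy and GIRLS are like
```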
Let's take the HasGeek job board. For those of you who are not aware, HasGeek has a job board where people can post jobs, and I pulled out the descriptions of all the jobs posted since last August. Those are the words people are looking for. Lots of experience; if you're a fresher, you may as well not apply. Bangalore, that's where everything happens; in fact, you'd be hard-pressed to find many other cities, Mumbai to a certain extent, but just look at the contrast between them. You'd better know JavaScript well. Next to experience, you should have lots of knowledge and skills and be good at teamwork, and be a hardcore web developer or web designer, one or the other. Feel free to ask questions as we go along, by the way. Is it possible to use a phrase or an n-gram to do this? Yes, and I'll show you an example where we do that. The trouble with n-grams is differentiating between the multiple n's. Say I have "call center": that's two words, "call" and "center", plus the phrase "call center". How do I know when "call" is used independently and when the phrase "call center" is used together? Tough problem; I don't have the full answer, and some people have techniques I'm not aware of. But a simple approach is to take not just every single word but also the bigrams and trigrams and so on, and I'll show you an example where we did unigrams, bigrams and trigrams and walk through it. Yes, the question is: what does the site visual.ly do, does it do something similar? Not quite; they're more an aggregator of visualizations. You send them a visualization and they'll show you what it's about. 
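Counting unigrams, bigrams and trigrams together, as just described, can be sketched like this; the tokenization is deliberately naive and the sample text is made up.

```python
# Sketch: count all n-grams from unigrams up to trigrams in one pass.
import re
from collections import Counter

def ngram_counts(text, max_n=3):
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

counts = ngram_counts("call the call center about the call center")
print(counts["call"])         # → 3
print(counts["call center"])  # → 2
```

Note the overlap problem the talk raises: every "call center" also bumps the standalone "call" count, which is exactly why a disambiguation heuristic is needed.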
The question is: can't we use another corpus, like Google search queries, to find out if a phrase like "call center" occurs more often than not? Absolutely, and you don't even need a third-party corpus; you can just look within your own data set. For instance, if a word like "call" is invariably followed by "center", you may as well knock off the standalone word, which is what I've done. It's a simple algorithm: if a word is used with another word more than 50% of the time, just knock off the individual word. It's a very crude heuristic, but it works. Well, since we're talking about this, I'll show you the combined version. Let's ignore the common English words: what are people looking for? Primarily, people in Bangalore doing analytics. What do they want in analytics? Data analytics and web analytics, that's pretty much it. What do they want with Facebook? They want to launch a Facebook and Twitter presence, Facebook Connect, Facebook campaigns. What do they want with optimization? SEM optimization, performance optimization, search engine optimization: just two kinds of optimization that people want, search engine and performance. You get a sense of what people are looking for just by playing around with this kind of data set. Another way of using this: we took survey data where a services organization had asked its customers to fill out a series of questions, rating them on a scale of one to five, where one is bad and five is good. 
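That crude 50% heuristic just mentioned can be sketched roughly as follows; the threshold, the minimum-count guard (added here so one-off pairs don't trigger), and the sample text are all illustrative.

```python
# Sketch of the 50% collocation heuristic: if a word is followed by the same
# next word more than half the time, drop the standalone word, keep the bigram.
import re
from collections import Counter

def absorbed_words(text, threshold=0.5, min_count=2):
    words = re.findall(r"\w+", text.lower())
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    drop = {}
    for (a, b), n in bigram_counts.items():
        # require the bigram to repeat, so one-off pairs don't trigger
        if n >= min_count and n / word_counts[a] > threshold:
            drop[a] = f"{a} {b}"   # replace standalone `a` with the bigram
    return drop

text = "call center staff call center hours call center call me"
print(absorbed_words(text))  # → {'call': 'call center'}
```

Here "call" is followed by "center" in 3 of its 4 occurrences (75% > 50%), so the standalone word is absorbed into the phrase.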
I said, okay, let's take all of the good ratings together and see what words are common there; that's the green stuff on top. And let's put all the bad ratings together at the bottom and see what the problem is. Clearly, this organization is great from a flexibility perspective, has good teamwork, is decent from a customer support perspective, and is not bad at communication; but the trouble is technical quality. Interestingly, "communication" occurs in both places: some people think communication is good, some think it's bad. Without having to go through hundreds of pages of comments, you get a sense of where we're doing well and where we're not. Those are some examples of where this can be applied in the business world. Here is an example where we took n-grams. In my previous life, we were looking at intranet search logs and asked: what are the words that are searched most often? The word searched most often, if you can see this, was "management", followed by "case", followed by "value", and so on. The question is: what is the context in which these words are being searched for? Take "management": it's primarily searched in the context of change management, 8% of the time; program management, 6% of the time; performance management, and so on. So I get an idea of what this is about. When somebody searches for "portfolio", click on that: what does that mean? Portfolio is mainly about portfolio management. This is one of those cases where, looking at the two terms "portfolio" and "portfolio management", you may as well knock off "portfolio" as a standalone word, because almost half the time it's associated with "portfolio management", though it's also associated to a certain extent with portfolio rationalization, portfolio analysis and so on. These are ways of exploring 
this data set, a search data set, in a somewhat semi-structured fashion. One thing that's missing here is that, while I can get the context at any point, I can't do deep explorations. For example, I can start with "application" and see how it's used; within "application" it's used a lot with "mobile"; click on "mobile", see how that's used, and so on. But it would be nice to have some kind of network diagram, sort of like the related-searches interface Google used to have, where you click on a word, it shows you the related concepts, you click on one of those, and so on. Obviously, I couldn't resist doing this with Calvin and Hobbes. This is what it looks like; let me zoom in. If you start with "Calvin", the closest associated words, the ones immediately next to it, are "bed", "stop", "doing", "Hobbes", "mom"; "Hobbes" is a lot closer, "just" and "come" are a lot closer. Let me click on some of these words. Take "doing": doing what? Calvin is mostly doing homework, or not. Or "bed": Calvin has to go to bed tonight. "Here": I don't know what that is. "Time": there's a lot around time, so bedtime, clearly, and time machine; Calvin plays around a lot with the time machine. What about "dad"? Dad said something about Calvin; not as interesting; none of these have any interesting words. "Mom" is sure about a number of things; mom just said something or the other. This is another way of playing around with text data: once you know the relationships between pairs of words, what's closer and what's further away, you can start exploring concepts, search terms, pretty much anything that follows the structure of a network, and see where that takes you. Let me show you some examples that have been used elsewhere. This, for instance, is 
a Twitter stream graph. You search for a phrase, in this case "data visualization", and it shows you a timeline where different words are depicted as an area graph. In this period there were lots of mentions of words associated with data visualization: the Guardian was talking a lot about data visualization; "team", "arcade", "retro" were heavily associated; at this particular time there was a lot of mention of Twitter, I presume that's Twitter Inc., and a lot of mention of social networks. That's another way of seeing the ebb and flow of words over time in the context of a specific word. This is available on Jeff Clark's site, Neoformix, and you can try out various words; it tends to fail every now and then, but you might get lucky. Another is Trendsmap, where you take the same words and plot them geographically instead of on a timeline: what are the words trending in specific locations? For example, in India, when I took this screenshot, the central regions were talking a lot about Kargil and revenue and Modi; Amitabh was fairly popular in Mumbai, isn't he always; and furniture, Muslims, riots were being mentioned a fair bit in the south. This is a real-time feed you can check out on Trendsmap. That doesn't look too good, so let's go on to another example: a visualization of the sentiment of the Bible, by the OpenBible.info website. Sentiment analysis, for those of you familiar with text analysis, is a very hot topic, and the crux of it is to take a phrase and determine whether it carries a positive sentiment: whether somebody tweeting about you likes you or doesn't, whether somebody tweeting about a topic likes the topic or not. There are varying degrees of accuracy in sentiment analysis; I'm not going to go into the technique of the 
analysis; let's just assume there are ways of identifying the sentiment of a given phrase somewhat accurately. This visualization starts with the Old Testament and moves on to the New Testament. The Bible starts on a slightly positive note, becomes increasingly negative for a while, slight respite there, fairly negative, and then things look good. So after the time of Moses things are good, which we sort of know; around the time of Joshua things are good; but between Joshua and David there is a bit of strife; during the monarchy things are mostly good, but then it starts dipping, all the way to here, at which point, starting with Jesus, there's the New Testament, and that's around the time when there was a lot of resistance to what he was saying, the formation of the early church, and so on. Now, I should take a brief detour here to talk about circular visualizations, because they look really neat and classy. This is not quite text, but I do want to mention it: we were playing around with audio, and took a bunch of songs to see if we could plot them in a similar way. Take Eric Clapton's Wonderful Tonight. The song starts here; these are the low beats, these are the high beats, so that's the spectrum of the song. The darker the color on the boundary, the deeper the bass; the brighter the color on the inside, the higher the treble; and all the frequencies in between are shown here. Wonderful Tonight is a relatively uniform song; there aren't too many changes. Whereas if you look at Queen's We Will Rock You, you can see the pattern is completely different. The duration of the song is depicted by the length of the arc. I just want to show that circular visualizations can compress a fair bit of information into a relatively compact space. Another way of looking at data is on a 
time series, shown in a somewhat different way. This is a calendar that shows the sentiment of the Calvin and Hobbes strip published on each day. It started on the 18th of November, goes on to December, January and so on, over a long run. He starts off on a happy note, not so happy the next couple of days, but you'll notice that he's extremely neutral on Sundays: all of the Sunday strips, which are the color strips, have absolutely no strong sentiment one way or the other. That was a relatively happy time, more or less, but it starts turning darker. One might want to test whether Bill Watterson was a bad-Monday person, whether his Monday comics tend to be a lot more negative than usual. I'll show you a slightly longer version of this. That's the point, actually: maybe he was being very neutral on Saturdays rather than Sundays; we'll never know. He did have the habit of writing fairly early, so it may not be a reflection of his mood on the day the strip ran; but it certainly is a reflection of the format, because the Sunday format is a color strip, larger than on any other day. And you can get a sense of the pattern over time: he starts off reasonably polarized, heavy reds and greens, but over time it tones down; in this middle region he's not that opinionated; there was a stretch where he was fairly okay; and towards the end it starts picking up again. Sorry, the question was: how do you figure out if the emotion is positive or negative, how do you figure out the sentiment? There are enough APIs to do that; in fact, at the end of the talk, if we have time, I'll show you the code used to generate this. I generated it this morning, and it doesn't take too long to create something like this. Another way of looking at data is this one, done for the US presidential elections. On any 
given day, you show a couple of bars representing the day, and the reds and blues indicate Republican or Democrat. Sorry, let me step back: this is about news sentiment toward either party or either candidate. There are six combinations. The reds are mentions of the Republicans, the blues are mentions of the Democrats; if a bar has blocks in between, it's a mention of the party; if it's a solid block, like this one, it's a mention of the candidate, McCain in this case; and the darker it is, the larger the number of news items talking about the same topic. You take news items, aggregate them by topic, and say: I have 20 news items on this topic, five on that one; the 20 become darker, the five lighter. You plot this and get a pattern like this one. For instance, this was October 10th, when Palin was accused of abusing power; I don't know the full context, but there was apparently a large number of media mentions, and you can see that the reds are generally below the line: the further below the line, the more negative. So here it's the position that indicates the sentiment, not the color, because color is being used for the party. Consistently you'll find the reds, especially the dark reds, below the line, barring one exception, where McCain said that no, that intervention was not unlawful, which garnered a bit of good press; but otherwise the party consistently got negative coverage. This is an example of an actual analysis used by the party, created at the University of Konstanz in Germany. Another way of looking at content: take Obama's State of the Union speech in 2010 versus 2011. The bubbles are sized by how often each word is used; the words on the left are words he used more often in 
2010, the words on the right he used more often in 2011, and the words in between were used in both, in roughly the proportion of their position. You can see that 2010 is largely about "financial", "bill", "problems", "million", "pay", all the financial crisis issues, whereas 2011 is about the "future", "best", "success", "high", "idea", "education", "research", "technology": clearly an election year. I'll now ask my colleague Ganesh to walk you through some examples we've done in somewhat unusual spaces. Thanks, Anand. We'll talk about some interesting case studies from the Indian context; I'll stay away from Calvin and Hobbes, not my cup of tea. We wanted to take an Indian epic and visualize a large volume of text, and the Mahabharata was an ideal candidate because it has over 1.8 million words spread across 18 chapters and over 500 sub-chapters. We took the complete text of the Mahabharata in English and created some visualizations on top of it, and we show a sample here. The first is a simple occurrence visualization: it shows where each character of the Mahabharata is mentioned in each of the chapters. We have 18 chapters, and the length of each box you see is proportional to the volume of text in that sub-chapter. If I click on any one character, it shows in which chapters the character is mentioned. Devayani, for instance, is not a central character; she appears in a sub-plot, and there are hundreds of characters like this, and hundreds of stories in the Mahabharata apart from the central one. This tool helps us interactively explore where each character appears and what kind of interaction there is between them; for instance, we could see where Karna and Kunti were separated and where they were united again, and so on. The position is relative to where the mention occurs, whether at the start of the text or the end of the chapter, so it's 
again relative to the specific mention. So that's one analysis. Then we built a network diagram to see the closeness of different characters, because the first visualization only shows occurrences, whereas we wanted to see how close the characters are to each other. We took every pair of characters in the Mahabharata and looked at how many words separate them: it could be five words, 50 words or 500 words. The average degree of word separation between two characters was computed over the complete text, and we created a network diagram from that: each character is shown, and the position in the network and the connections are based on how close the characters are. We can see that Yudhishthira is the central character in the plot, with others around him; Bhima and Karna are also fairly well networked, if I may use the term; and there are many other characters, like Nakula and Sahadeva, who interact more with certain specific characters but sit on the periphery of the plot. For instance, Gandhari, apart from Dhritarashtra, is quite close to Vidura and also to Kunti. So this brings out a lot of interesting stories, and analysis becomes possible. This is an interactive version, available on our website; you can play around with it and see the strength of the network and how it moves around. Sorry, I didn't get the question. Yes, it's actually the mentions, the number of words by which two characters are separated. It could be direct interaction between the characters, or they could just be spaced apart, talking to different characters; the closest we could do was look at the number of words separating the characters, which is an approximation for their closeness. Sorry, the question was how do you handle people having different 
names. For instance, Arjuna has a hundred names in the text; to the extent that we knew the names, we merged them all into "Arjuna". Krishna has even more names, and to the extent that we could, we managed: Krishna, Janardana, Govinda, the whole works. We left out some of the edge cases; some domain knowledge does help. I'll move on to the next one, since we're running out of time. We did an analysis of all the tweets in India: we took one week's worth of geocoded tweets from India, about 80,000 of them, so this includes only the geocoded tweets, and did a document comparison analysis, similar to the US case study you saw. Here we've shown a bubble chart in which the size of a bubble is proportional to the number of mentions of that word, separated into two sections: words used by people with few followers versus words used by people with many followers. Certain things clearly come out. People with many followers tend to use a lot of hashtags, and they tend to be, if I may use the word, more polite, with good mornings, good days, thanks and so on; people with few followers tend to use words like "no", "traffic" and "hi" a lot more, and the correlation between low follower counts and these three words is very high. There are other contrast analyses we've done; in the interest of time I'm skipping those, but they're available on our blog. I'll move on to the next one, which looks at which names score better marks. This is an interesting, favorite data set: it's from the Tamil Nadu board exams, the 10th and 12th standard board exams, with records of over 10 lakh students across the past five years. What we've done in this analysis is build an interactive treemap showing the top 5,000 names, where the boxes show the number of 
occurrences of that particular name. Obviously the bigger boxes mean the names are more popular, the smaller boxes the reverse, and the colour indicates the score: dark blue means higher marks, closer to white means lower. Certain interesting things come out. If I look at high marks, certain specific kinds of names come up: Shweta, Shriya and more urban, city kinds of names. Apart from that we also saw a lot of Agarwals, Guptas, Jains and other North Indian names coming up here, and this is from Tamil Nadu. At low marks you again see some big boxes, and many of the names are more traditional names like Murugesan, Ayyappan and so on. Another interesting observation is that some of the more common names have bigger boxes and their marks average out, whereas the less common names have either very high or very low scores, obviously because of the averaging effect. So this is one of the analyses we could do on this.

That's right, yeah, we haven't factored that in. There's another visualization we've got that does, but the rough rule of thumb is that, by and large, the polarised marks have a smaller standard deviation and the ones in the middle have a higher standard deviation.

I'll just quickly wrap up. The one point that we wanted to convey is that text, which is generally considered unstructured, has a fair bit of metadata that you can extract out of it if you just try hard enough. If there's one thing that we want you to remember, it is: look hard within text. You'll find weird structures, like similarity, associated metadata and so on, from which you can pull out the data and then visualize it.

One last, a second of plug about us: our organization is called Gramener, and you can find out more about some of our work, including some of these examples, at gramener.com. We are into data visualization and you can reach us at this location. I guess we are out of time, so we probably won't take questions, since we have lunch right after this. We have
time for a couple of questions, if you have any. Otherwise, yeah, sorry, go on.

The question is which platform it is made on: Python. Like in the example of the Mahabharata, for the nouns that you were looking for, were you using any parser or was it just your own knowledge? In a sentence, if you are using a Stanford parser and looking for the nouns, then you can have all the combinations of nouns. So did you use those parsers or something?

Let me just rephrase that question as: how do we figure out the names? No, we didn't really need a parser. It actually turned out to be simple enough to get it to 90% accuracy by just looking at the capitalized words. We said: take all the words that begin with a capital letter and ignore those words that are already in the dictionary. That gets you to 90% accuracy, and then we just did a manual filtering.

Sorry, that wasn't the question? Yes, absolutely, it turned out that in this particular case it was mostly names. The comment was that those might have been places or might have been names, so how do you know; did we do anything smarter than that? It turns out it just wasn't required in this particular instance. I am not saying that a parser is not a good thing; in fact a parser is a phenomenal thing. It's just that there are occasions when a simple heuristic works reasonably well too, and we just got lucky here.

My question is related to my experience of text processing, in which two typical problems come up. One is time: sometimes, if you look at the text of the Mahabharata, many things are discussed in the context of the past, or as background flashbacks, so if you interpret by that time it goes wrong. The second, bigger problem I actually faced was that words like "you" and "I" were sometimes not the words themselves but were representing something else. How do we deal with these kinds of problems? So the question is: one, the Mahabharata talks about different periods of time; how do
we handle that? Secondly, words like "you", "we" and "I" have different meanings in different contexts; how do we handle that? In this particular example we just didn't, and again, given that this is a talk on text visualization, I don't want to go into the techniques around text analysis, if you will, but let me just say that it's a pretty tough problem.

You mentioned sentiment analysis; do you have any open source APIs you can tell us about? The question was whether we have any open source API for sentiment analysis. No, but a whole bunch of other people do. In fact, let me see if I can find the code that we used to do this analysis. The one we used was ViralHeat; ViralHeat provides an API where you can just pass in a phrase and get the sentiment for it. It was the first one that I googled and found early this morning; you will find hundreds of others. Python's NLTK has some parsers which will help you extract the meaningful text to send to these or any other APIs, and there are a number of others as well.

One last question. The question was how scalable NLTK is. It is a bit slow, but that's the nature of the algorithms themselves. To the extent that you can cache it, do some of the processing offline and reuse those results, you are fine. I haven't seen too many analyses of text data at a very large scale, so we haven't really encountered this problem. The quick answer is that NLTK is probably not scalable, but I don't know of a better alternative in Python.

Great presentation. Maybe one thing you may want to share with the group is what all the different tools are that you use, besides Python, NLTK and the ViralHeat that you talked about just now. So your question was what various tools we recommend people use. The hidden tool behind half of these analyses is actually Excel: take the data, find a set of keywords, do an =FIND() for a particular keyword, see if it exists or not. That's what led
to a bunch of the visualizations. The whole of Calvin and Hobbes was typed in Excel, so I would add Excel to the portfolio, and the reason I'm saying that is because I'm assuming there is a reasonable number of non-programmers in this room as well. For example, if you wanted to create a word cloud, and a lot of people want to mock up word clouds for marketing purposes, the way I would do it is: open up Excel, type a bunch of words, and next to each one type a formula, I'm not going to show it to you, we don't have time, that says =REPT(word, 10). That will repeat the word 10 times; repeat the next word 5 times, and so on, so you can have a bunch of words at the frequencies you want. Copy and paste that text into Wordle and you're done. For the non-programmer, I would say that between Excel and Wordle you've covered the bulk of the text visualisation that you can probably hope to do reliably in the near future. Beyond that, I wouldn't necessarily say Python and NLTK are the only toolset one can use; there are a number of APIs emerging. The thing is, the state of this is not so stable today that I would be able to suggest something concrete, but apart from Python and NLTK, regular expressions are your best friend.

Thank you, Ajay. Sorry, thank you, Anand and Ganesh. Thanks everyone, we'll break for lunch now and be back at two.
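[Editor's note] The word-separation measure behind the Mahabharata network diagram can be sketched roughly as follows. This is an illustrative reconstruction, not the speakers' actual code: the sample text, the character list and the choice of "nearest mention" as the gap are all assumptions.

```python
from itertools import combinations

def mention_positions(tokens, name):
    """Token indices at which a character's name appears."""
    return [i for i, tok in enumerate(tokens) if tok == name]

def average_separation(tokens, name_a, name_b):
    """Average number of tokens between mentions of two characters.

    For each mention of name_a, find the nearest mention of name_b and
    average those gaps; smaller values mean the characters appear closer
    together in the text.
    """
    pos_a = mention_positions(tokens, name_a)
    pos_b = mention_positions(tokens, name_b)
    if not pos_a or not pos_b:
        return None
    gaps = [min(abs(a - b) for b in pos_b) for a in pos_a]
    return sum(gaps) / len(gaps)

def closeness_network(tokens, characters):
    """Edge weights for every pair of characters, keyed by (name, name)."""
    return {
        (a, b): average_separation(tokens, a, b)
        for a, b in combinations(characters, 2)
    }

tokens = ("Yudhishthira spoke to Bhima and Bhima answered "
          "Yudhishthira while Karna waited far away").split()
network = closeness_network(tokens, ["Yudhishthira", "Bhima", "Karna"])
```

The edge weights would then drive a layout tool (the talk's interactive diagram); here Yudhishthira–Bhima comes out closer (smaller average gap) than Yudhishthira–Karna.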
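[Editor's note] The capitalised-word heuristic from the Q&A (take capitalised words, drop dictionary words, then filter by hand) might look something like this; the sample sentence and the tiny stand-in dictionary are placeholders, not the speakers' data.

```python
import re

# Stand-in for a real English word list (e.g. NLTK's or /usr/share/dict/words).
DICTIONARY = {"the", "then", "him", "and", "before", "battle", "answered"}

def candidate_names(text):
    """Capitalised words that are not ordinary dictionary words.

    This is the ~90%-accuracy heuristic from the talk: any word starting
    with a capital letter and absent from the dictionary is treated as a
    probable proper noun; the remainder gets cleaned up manually.
    """
    words = re.findall(r"\b[A-Z][a-z]+\b", text)
    return sorted({w for w in words if w.lower() not in DICTIONARY})

text = "Then Arjuna spoke to Krishna before the battle, and Krishna answered him."
print(candidate_names(text))  # → ['Arjuna', 'Krishna']
```

As the talk notes, this makes no attempt to distinguish names from places; it simply wasn't needed for this corpus.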
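[Editor's note] The Excel `=REPT(word, n)` trick for mocking up a Wordle input can be reproduced in a couple of lines; the word/weight pairs here are invented examples.

```python
def wordle_text(frequencies):
    """Repeat each word in proportion to its weight, as Wordle expects.

    Mirrors the Excel =REPT() trick from the talk: Wordle sizes a word by
    how many times it appears in the pasted text, so repeating a word n
    times gives it weight n in the cloud.
    """
    return " ".join(" ".join([word] * count) for word, count in frequencies.items())

print(wordle_text({"tiger": 10, "snowball": 5}))
```

Pasting the resulting string into Wordle produces a cloud with "tiger" twice the weight of "snowball".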