I'd like to introduce Anand S. He's one of India's top data scientists and he's here to talk about natural language generation. One other note: I'd like to remind everybody that at noon, after this talk, there's a group photo in the main lecture hall, which we'll find if we just follow the crowd.
This is a bit soft, but I think it's on. Are you able to hear me? No? A little soft? Okay. I can be a little louder, maybe about that loud. Great. Thanks.
So we're going to be talking about natural language generation, but first we're going to be talking about why we need it in the first place. We love creating some fairly complex visuals at Gramener, and one of the things that we did for a client who asked us, "Tell me why my workflow is delayed," was put together this little visual which said, look, your process goes from a draft stage to a submitted stage, then to waiting for approval, then planning, and so on. So these are the stages that your document goes through. The size of each of these circles is how many documents are waiting at that stage, and the color represents how long the documents have been waiting there. So the red circle, which is completed, is all of those documents that have gone all the way to completed, but haven't been closed and are just sitting there for a very long time. And the flows here represent the transitions from one state to another. So from planning, there's a whole lot of stuff that seems to be going into in progress, and from in progress, they largely go to completed or to closed when going forward. But there's a big chunk that's going backwards all the way to submitted, meaning that they go right to in progress and then get resubmitted, which is clearly a problem, and this is also a relatively slow step.
And with this, what they were able to figure out was that all of the forward flows, the ones that were on top, that is, documents going from left to right, seemed to be going fast, and the documents that were going from right to left, that is, where they're going backwards, seemed to be going slow. So this was an organization where the forward flows were fine. They were doing things fast, but the trouble was they were doing the wrong things, and therefore stuff had to get resubmitted.
Now, what I've done is compressed approximately half an hour's worth of explanation into about a minute's worth of explanation. And there's just no hope in hell that anyone has of being able to figure out this visual without that explanation. It's cool, but it's not simple. And this is among the more complex explanations. I've had people stare at simple tables, simple bar charts, unidimensional bar charts and ask the question, "So what am I supposed to do with this?" In fact, there is an entire industry of people whose job it is to take the analysis that somebody's done and interpret it for the other person. And the boss just says, "Look, tell me what I should do. And then if I disagree, I will ask you why, and then give me the supporting evidence. And if we need to go deeper, we'll do that."
The question is, can we automate that? Can we take the stuff that the person in the middle is doing and communicate that to the end audience? Well, what does the person in the middle do? Most of the time, they're really just talking. So it's words more than anything else, which is where the natural language part comes in. It's not only words, there's more to it than that, but what I'm gonna be focusing on is just the words part of it.
Let's start with some simple stuff. We were working with the state of Uttar Pradesh. And they said, we want a dashboard that shows what we should be doing from a health perspective.
It showed a series of metrics like the ratio of Pentavalent 3 to BCG, or the percentage of PW screened for HIV. I have no idea what these measures are, I'm not a medical person. But it effectively showed what the performance of each of the districts was. And what the district health administrator said was, apart from this, just give me a summary that tells me in English what's happening.
So this summary says, "Based on the latest data refresh in Jan 2019," so that tells me how old or new this is, "Uttar Pradesh has a composite score of 0.41. Your best indicator is the ratio of this to this," which hopefully they'll understand, "that has a score of 1.03 as compared to last month." Now ideally this should have been worded as "3% higher than last month." That's how humans talk. But this isn't a particularly good report, and that's why I'm starting with this. It's a very basic one. "The next best performing indicator is this, which has a score of 35.55 as compared to last month. Here are your top performing districts. Here are your worst performing districts."
Now the reason I'm showing you this is because these five lines were the five things that the district health administrators said they really wanted to know. When we asked them, "Why are you looking at all of this data? What action will you finally take?", they said, "See, we want to know where we're doing well so we can tell our bosses. We want to know where we're doing badly so that we can drive the people. We want to know how we're doing compared to our peers. That's it. These are the questions we want answered. And if you can just give us those answers, the rest of it is just proof, which we'll get to when we want to talk to somebody."
That's a really simple, templatized summary, because we said we're going to generate this using a program. The numbers are reasonably straightforward. The text is fairly straightforward to generate, but for grammatical errors and numerical errors.
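
A templatized summary of this kind can be sketched in a few lines of Python. This is only an illustration, not the code behind the dashboard: the function names, the indicator ratios other than 1.03, and the district names and scores are all made up, and it assumes each indicator comes as a ratio of this month to last month. It also words the change the way a person would ("3% higher than last month").

```python
def describe_change(ratio: float) -> str:
    """Turn a this-month / last-month ratio into the phrase a person would say."""
    pct = round((ratio - 1) * 100)
    if pct > 0:
        return f"{pct}% higher than last month"
    if pct < 0:
        return f"{-pct}% lower than last month"
    return "unchanged from last month"


def summarize(state, refreshed, composite, indicators, districts, n=2):
    """Build the five-line summary.

    indicators: indicator name -> ratio versus last month (assumed input shape)
    districts:  district name -> composite score (assumed input shape)
    """
    ranked_ind = sorted(indicators.items(), key=lambda kv: kv[1], reverse=True)
    ranked_dist = sorted(districts, key=districts.get, reverse=True)
    (best, best_ratio), (second, second_ratio) = ranked_ind[:2]
    worst_first = ranked_dist[-n:][::-1]  # lowest score listed first
    return [
        f"Based on the latest data refresh in {refreshed}, "
        f"{state} has a composite score of {composite:.2f}.",
        f"Your best indicator is {best}, which is {describe_change(best_ratio)}.",
        f"The next best performing indicator is {second}, "
        f"which is {describe_change(second_ratio)}.",
        f"Your top performing districts are {', '.join(ranked_dist[:n])}.",
        f"Your worst performing districts are {', '.join(worst_first)}.",
    ]


if __name__ == "__main__":
    # Hypothetical data: only 0.41 and 1.03 are quoted in the talk.
    indicators = {"Indicator A": 1.03, "Indicator B": 1.01, "Indicator C": 0.97}
    districts = {"District P": 0.62, "District Q": 0.55,
                 "District R": 0.33, "District S": 0.21}
    for line in summarize("Uttar Pradesh", "the latest month", 0.41,
                          indicators, districts):
        print(line)
```

Each of the five sentences answers one of the five questions the administrators asked, so the template stays fixed while only the ranked values change with each data refresh.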
This can be done in a way that you can see how it can be done from data.
Let me take another report that we created. There is a Doing Business report that the World Bank generates. This is a report that has information around how easy it is to do business, how easy it is to start a business, how easy it is to deal with construction permits, and so on, across various countries. So I search for Singapore. For instance, I thought there was a doing business in Singapore. Wasn't there in the last one? Doesn't matter. Let me take the answer to the question, how easy is it to start a business? Now, each of these little bubbles represents one country. And the country that's easiest to start a business in is New Zealand, which is in East Asia and the Pacific, followed by Canada, followed by Hong Kong, and so on. And that's the quantitative information.
But apart from that, there's a little summary here, and I'm going to focus on the one on the bottom right. It says that the time taken to complete a procedure with regard to starting a business is half a day in New Zealand, whereas in Venezuela, that would take you 230 days. An interesting tidbit that augments what we are seeing in this. Or for instance, India's among the bottom 10 with regard to dealing with construction permits. Or when it comes to getting electricity, setting up an electricity connection in South Korea takes less time and effort, but the process is very painful in Bangladesh and Venezuela, which ranked the lowest. Electricity is gold to Venezuela, and it costs you about 704 cents per kilowatt hour, whereas the next highest is the islands, where it costs just 67 cents, and that's a factor of 10 reduction already there.
These are not entirely automatically constructed. There is a certain amount of manual intervention in two ways. The first is we know exactly what these categories are. So this is related to electricity.
So we have the option of talking about electricity-related stuff and coming up with anecdotes that are specifically electricity related. Secondly, the data doesn't change more often than once a year. So what we can do is have a base layer that throws out some automated stuff, but then on top of that allow people to override it with comments and corrections that they want in specific areas. So you have a base case which at least generates some meaningful stuff, if not interesting or useful, and we have specific areas where people can intervene and put in interesting observations on top of that. We're still in the realm of semi-automation. We get some basic stuff, but we are able to go a little further than that and improve on it manually.
Is there a way in which we can make this dynamic? That is, what we have is, for static data, the option of creating these kinds of templates. Could this be done entirely dynamically?
Let me tell you a fairly interesting scenario that we were working on. We had data for about 200,000 students for whom there was a large survey that was done. The survey effectively had information on their performance, which is the marks, as well as on their behavior: how often they watch TV, how often they play games, what their parents' educational characteristics are, what their reading habits are, and so on. So these are some of the factors that were there. And we started exploring these. So for example, what is the impact of watching television on, let's say, a big parameter, science? It turns out that children who never watch television tend to score the lowest, whereas children who watch TV once a month do a bit better, once a week do a bit better still, and those who watch every day tend to do slightly worse. Which means that TV watching in moderation seems to be the answer, roughly once a week, which is also the answer for social science. Watching TV once a week seems to be about the best.
But that is not true for reading, because the more you watch TV, the better your reading ability becomes. And these are statistically significant, incidentally. Mathematics is perhaps the more interesting one. It turns out that TV just kills your mathematical ability. Compared to watching TV once a week or once a month or never, watching TV every day is even worse. This is particularly surprising because the children who don't watch TV at all are usually the ones that don't have television, and they are economically disadvantaged in any case. Watching TV every day puts your mathematical ability worse than even those kids, and that's saying something.
But now this is narrated in a simple templatized way below, saying that watching TV, or rather the right kind of TV-watching behavior, can improve your math marks by as much as 1.5%. That's the gap between here and here. Watching TV once a week is best: students score about 33.4%. Whereas every day is the worst: students score 31.9%. You can see how this is constructed. Just say what's the best, what's the worst, what's the gap, and tell people what's happening. And this kind of annotation is present for every single one of these.
And you also make a commentary about whether this is common. So it turns out that for watching TV, for reading, watching TV every day is the most common behavior. 67% of the students exhibit that behavior. And you add a comment saying luckily every day is the most common. Whereas for mathematics, every day is the most common and that is the worst behavior. So you say sadly this is the wrong behavior that you exhibit. So we bring in a certain amount of emotion into this as well, and I will be talking more about emotion generation as we go on to the next experimental topic.
Now this can be abstracted out into a master summary which looks at how much each of these factors influences your marks. So watching TV, it turns out, influences between 2 and 5% of your marks.
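
The best, worst and gap narration for the TV-watching charts, along with the "luckily" or "sadly" comment, can be sketched roughly as below. This is a minimal illustration rather than the actual implementation: the function name and input shapes are assumptions, the 33.4%, 31.9% and 67% figures are the ones quoted in the talk, and every other number is a placeholder.

```python
def narrate(factor, subject, scores, shares):
    """Narrate one factor-versus-subject chart.

    scores: behavior -> average mark in percent (assumed input shape)
    shares: behavior -> percent of students showing that behavior
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    gap = scores[best] - scores[worst]
    common = max(shares, key=shares.get)

    lines = [
        f"{factor} can improve {subject} marks by as much as {gap:.1f}%.",
        f"{best} is best: students score {scores[best]:.1f}%.",
        f"{worst} is worst: students score {scores[worst]:.1f}%.",
    ]
    # The emotional comment depends on whether the most common
    # behavior is also the best or the worst one.
    detail = f"{common} is the most common behavior ({shares[common]:.0f}% of students)."
    if common == best:
        lines.append("Luckily, " + detail)
    elif common == worst:
        lines.append("Sadly, " + detail)
    else:
        lines.append(detail)
    return lines


if __name__ == "__main__":
    # 33.4, 31.9 and 67 come from the talk; the rest are placeholders.
    math_scores = {"Once a week": 33.4, "Once a month": 33.0,
                   "Never": 32.6, "Every day": 31.9}
    tv_shares = {"Every day": 67, "Once a week": 15,
                 "Once a month": 8, "Never": 10}
    for line in narrate("Watching TV", "mathematics", math_scores, tv_shares):
        print(line)
```

Because the function only needs the per-behavior averages and shares, the same template can be re-run on any filtered subset of students.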
Playing games influences about 2 to 5% of your marks, but the factors that influence your marks overall, father's education and occupation, mother's education and occupation, make the largest difference. And that's another kind of summary that says here are the factors that influence your marks. As an administrator I would just look at this column on the left and say, here are the factors that influence me most, here are the factors that influence me least, and these are the subjects that are influenced by the various factors, which can be filtered dynamically based on students that are below poverty or above poverty. So if I wanna look at just the boys, is there a difference between boys and girls in terms of the factors that affect them? It turns out that there is, but we won't go into the details of that. And the template and the narration automatically get adjusted for these statements. Since all of this stuff that's coming in is automated, you can do it at any level of data, for any filtering of data. There is no chart that you create that cannot be annotated at least by a simple summary that explains what it is.
Let's take this to the next logical step. Before that, here is how we solved our original problem, which is this paragraph, right? That one had a bigger issue, in which we were also looking to explain what factors were driving the delay. So it turned out that, for example, the service requests that came from certain divisions, certain customers, certain entities were huge problems, and it didn't matter which organization they came from, which site they came from, which country they came from. So we put in a summary of what is really causing the delay in text, which made it a whole lot easier for people to read, understand and figure out. At the end of the day all they had to do was read the summary and just say, okay, here's some pretty picture, I don't really understand it, but if I get the text, that's fine.
That's in fact how it ended up getting used. But we can take this forward to a dynamic space as well. Actually, I'm gonna skip this one. I'm gonna skip this, yeah, let's go through this.
On the right hand side is a visualization of the impact of the budget that is announced in India every year, on the day of the budget. So the budget is announced during the day, and the stock prices move during the day. The difference in stock prices between the morning and the evening, or to be precise, the market capitalization, is what the chart on the right shows. Now the size of the box here represents the market capitalization of the industry. So this is banking and financial services, within which there are a set of companies. These are the oil and gas companies, and that's the sector. So the size represents the market capitalization and the color represents how much it's changed, whether it's increased or decreased.
So it turns out that in this graph, on the 28th of February, 2005, which is the budget day, the market grew by about 1.6%. 18 of these sectors grew, that's in green, and four of them shrank, that's the ones in red or orange. BFSI grew the most, by about 3%. Oil and gas grew by about 1.7%, that's the second largest. Services was the one that was most impacted by retail and real estate, which shrunk by about 1.4%. The one that had the highest growth, the greenest one, is Food and Beverage, and the one that had the big drop was services, which is the reddest.
What I'm doing is literally reading off the left, which is exactly the way in which we want people to be talking through a visual or a set of numbers as well. So what this does, more than the technology behind it, is make it independent of the interpreter. The person who sits in the middle, understands the analysis and says something, needs to follow a script. So that brings in a certain amount of standardization.
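
The budget-day narration can be approximated with a short Python sketch. It is only an illustration under stated assumptions: the function name and the input shape (market capitalization per sector at open and at close) are invented here, the sample figures are hypothetical, and the overall market change is assumed to be the change in total market capitalization.

```python
def narrate_market(day, sectors):
    """Narrate one budget day.

    sectors: sector name -> (market cap at open, market cap at close),
             in any consistent unit (assumed input shape).
    """
    change = {name: (close - open_) / open_ * 100
              for name, (open_, close) in sectors.items()}
    total_open = sum(open_ for open_, _ in sectors.values())
    total_close = sum(close for _, close in sectors.values())
    market = (total_close - total_open) / total_open * 100

    grew = [name for name, pct in change.items() if pct > 0]
    shrank = [name for name, pct in change.items() if pct < 0]
    top = max(change, key=change.get)
    bottom = min(change, key=change.get)

    verb = "grew" if market >= 0 else "shrank"
    lines = [
        f"On {day}, the market {verb} by about {abs(market):.1f}%.",
        f"{len(grew)} sectors grew and {len(shrank)} shrank.",
    ]
    # Only mention a top gainer or a biggest drop if one actually exists.
    if change[top] > 0:
        lines.append(f"{top} grew the most, by about {change[top]:.1f}%.")
    if change[bottom] < 0:
        lines.append(f"{bottom} fell the most, by about {abs(change[bottom]):.1f}%.")
    return lines


if __name__ == "__main__":
    # Hypothetical market caps, chosen only to give round percentages.
    sample = {
        "BFSI": (1000.0, 1030.0),
        "Oil and gas": (800.0, 813.6),
        "Services": (500.0, 493.0),
        "FMCG": (200.0, 201.0),
    }
    for line in narrate_market("28 February 2005", sample):
        print(line)
```

Since every sentence is derived from the same table that drives the chart, the text and the visual cannot disagree, which is what makes the script independent of whoever reads it out.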
People say the same thing, and you don't need much by way of training to read this stuff. What this means is that it also makes it possible to put this kind of stuff to a much larger audience, the public for instance. We're working with news organizations in the media to see if we can create visuals on television, digital as well as print, that are entirely self-explanatory, to the point where no one needs to interpret them, but at the same time don't compromise on the sophistication of the message, the way in which we communicate the message. It could be complex analysis, it could be complex numbers, it could be complex charts, but provide the annotation along with it, preferably in an automated way, because that saves the effort.
This can be taken to an extreme. One retailer said, "Look, what we do is create PowerPoint presentations out of these charts. There is a chart system, we create a presentation. Can you automate that?" We said, well, there's a couple of ways of doing that. One way is to automate the generation of PowerPoint itself, which is interesting. So I'm gonna give you an aside on that. It is entirely possible to come up with an interactive PowerPoint presentation that can tell a story. This one doesn't have narratives, but I am still gonna talk about it. This is a dynamically generated presentation where the sales revenue of a retailer is broken up by geography, within which by channel, within which by product, against which you have how well it's doing in that region. Red is not so good, yellow is kind of okay, green is you're doing well. And you can see that, for example, in the UK, in the stores channel, product nine is not doing so well. So we can ask the question, why is that happening? Is it because product nine is generally not doing well? Click on that. That does a drill-down in PowerPoint into product nine and shows a breakup. And we find that product nine is not doing well almost anywhere.
The only place where it's kind of doing well is in the US stores channel. Is that because the US is generally doing well? Click, that takes you to the US and shows you a breakup of the US. And you can click on pretty much any spot and get in there.
Or another way of looking at interactive visuals in PowerPoint, let's take that one. It's enabling editing. This is a slightly complex one. Okay, what we have here is the breakup of downloads from an app store by geography. Most of the downloads came from the US, followed by China, followed by Germany. By model: Android 4 versus iOS 4, and you can see that's a relatively old visual. By category: were games mostly downloaded, or was it educational apps, and so on. And we can start exploring where these downloads came from. So Android 4 downloads mostly came through games. The games downloads mostly came from the newbie segment and the double income, no kids segment. And these two are segments that are not growing. Now this allows us to perform pretty much full-fledged sophisticated explorations in PowerPoint itself. So the challenge of generating PowerPoint that does sophisticated interactive visualizations is not the problem. The problem, however, is that this still requires an explanation, and can we provide that explanation, which is what PowerPoint does manually.
So we put together a dashboard like this, which mirrored what our clients normally use for their sales analysis. It says, look, retailer number six, that's the anonymous name of the retailer, here are your insights for week number one in 2015. Your sales improved this week, but they've dipped compared to the same period last year. If you look at the US retail division, then your sales improved, but again have dipped compared to the same period last year. If you look at where your growth came from by product, it mostly came from this brand zero to four ink last form 50p, whatever that product is.
And if you look at one other subdivision, then it's mostly your ring view binder that increased your sales. Now we can start drilling into this, which could answer the question, how exactly did my sales improve but dip compared to the previous week? That has a set of narratives that go into the next level of detail. My sales grew by almost 150% this week, but the same week last year, it was 11.2% higher and we're less. And you can see the numbers against the charts just here and there, which allows me to drill down one level further. So what exactly was the growth driven by? Which were the products that grew? Which were the products that didn't grow? Go down one level further into that, into a particular.