All right. So this talk is not as much about deep learning as you would expect. It turns out that there is a lot you can get done with basic Python and basic analytics, and the more sophisticated stuff just helps you push the envelope a little further. A few years ago at the O'Reilly AI conference, Tim O'Reilly said that the economy is built on stories. This ought to be obvious, but you see that it is very difficult in general to separate facts from opinions, and therefore to protect consumers against propaganda. Now, we in the tech space are somewhat robust to this. We hear a bunch of different narratives, sometimes even competing narratives, so much that we have grown somewhat sensitive to them. But unfortunately, what this means is that we have built a safe space around ourselves. We usually don't narrate things. We usually have only analysis, only code, only logic. And the insight that is to be drawn from this analysis, the action points, what the user is supposed to do with it, that is often left as an exercise to the reader. Breaking out of this safe space and being able to actually draw prescriptive insight from a piece of data is what this talk is broadly about.

So yes, the economy is built on stories, but I for one am a terrible storyteller. I can talk endlessly about analytics, about engineering, about data science. But when I have to present some analysis to a business stakeholder, I am often clueless as to what I'm supposed to say. The easiest defense for me is to simply shove a bunch of analytics in somebody's face and say, this is a dashboard, figure it out. And that doesn't always work, because you probably don't want your users to have to reproduce what you've done every time.

So natural language as a modality is highly underrated. I say that because visuals in dashboards, or code in general, can be ambiguous. That sounds a little counterintuitive, but a piece of analysis can be interpreted in many ways, whereas language can be made arbitrarily unambiguous. I can make a very finite, very specific point. Secondly, it also helps with accessibility and readability. People are more likely to read things when they are text, and even to the trained eye, looking at a dashboard is fairly daunting.

So this is an example. Maybe take a moment to look at it and try to figure out what it means or what it represents. If you have some ideas, just park them to the side; you might want to come back to this point later to see how correct you were. This is a bunch of service requests that were filed with a logistics company. The bubbles you see are the different stages through which a request is resolved. The size of a bubble is how many requests are at that stage, going all the way from drafting a request to closing it. And the color represents how long they have stayed in that stage: the redder it is, the longer requests have been pending at that stage, and the greener it is, the more quickly they have passed through it. Now, what is interesting about this is that almost a third of all requests are consistently in the in-progress stage. Many of them do go over to the completed stage, but then they get stuck there. If a request is completed, then why is it not closed? That's one big question that you can draw from this.
And the paths that you see show the volume of requests that goes from one stage to another, and the colors again mean the same thing: how long requests have stayed there. So this is a fairly complicated application that we built. It was meant to be interactive, and it is interactive. But when we delivered it, we figured out that people were quite clueless about it. It's not intuitive at all. If I hadn't given this explanation, there's no way people could have figured it out on their own. So then we started adding text to it, which is something people can just read, and if they want to drill down further, they can always do that.

But here's the problem with these sorts of dashboards. Let's take a moment to think about what we do when we build dashboards. We have some data. We have some analytics on the data. We think about it. We do a bunch of fairly routine and then maybe case-specific operations on the data. And then something clicks: I get an insight, a nugget, something like that. And I put a widget in my dashboard which reflects that, or at least allows people to see it. Now, the problem there is that I'm going from this data set, which is fairly ambiguous, I don't know yet what it is, and I'm crunching it down into an insight. And then I'm increasing its complexity again by putting it back into a dashboard, because ultimately you want people to have some exploration capability. You can't just put bullet points up as a dashboard; that would be just, well, text, and that can't be a standalone piece of information. We want clients, or stakeholders in general, who are using the dashboard to have some exploratory capability.

The problem is that we tend to get carried away with this a lot. Excel is not a dashboard. Excel is where you do the analysis. So you don't want your dashboard to replicate tons and tons of Excel features, because you want the analytics to be restricted in a certain way. A dashboard has to be simpler than the data it explains. It can't just throw data in your face where you are iterating and filtering things endlessly with arbitrarily high complexity. We can't allow that. The point is that we are giving users some flexibility only so that they can replicate the insights, not the analysis itself. The analysis has to be done beforehand. And it has to lean more towards prescriptive rather than descriptive analytics. Descriptive analytics can be more or less static, but what is it that you need to do with the data? That is the bigger question that most people want answered.

The easiest way to do that is, well, that's where we're coming from: natural language generation. Maybe directly have a piece of text ready, the prescriptive text, which is exactly what I need to say when I'm standing in front of a client presenting some material or showing some insights through a dashboard. This is an example. It comes from a health survey that was done nationwide, and this is the Uttar Pradesh part of it. Most of the widgets that you see here, let's park them to the side for now; they are self-explanatory as it is. But there is something that needs to be said, and that is where, again, the NLG part comes in. If you look at the executive summary, that's what people are most interested in. I'm not sure if it's visible from the back, but I'll read it out anyway.
It says that, based on the latest data refresh of Jan 2019, UP had a composite score of 0.41. Now, let's say that 0.41 is some target metric that you're trying to optimize. And the best-performing indicator is the ratio of pentavalent 3 to BCG. Pentavalent 3 is a vaccine, BCG is also a vaccine, and for some domain-specific reason the ratio in which they are administered is important. That apparently has the biggest impact on your composite health score. The next best-performing indicator is the percent of PW screened for HIV against estimated pregnancies. I am not a medical person, I don't know what that means, but the idea is that we ought to be able to generate this out of the box, with an intent that is provided to you, so that whoever understands this can read it and make decisions based on it. And then there are some names of the best-performing districts for the composite score, and some of the least-performing districts.

Now, the simplest way to automate generation of something like this is to simply use a template. If I wanted to write a simple template to reflect this, I'd come up with something like the sketch shown below. It's a Jinja, Django, Tornado template, whatever you want to call it; a standard template. And since we are trying to estimate this column called composite score, you can, say, assume that it's a regression, and you can get coefficients of all the factors back from it. That coefficient is literally just the impact the factor has. So let's say you fit an estimator. Then you have a date which you can format, and a data frame which is your raw data, and you can do a bunch of different things with them. That is how this works.

But when I actually render this template, there are still tons of problems with it. For example, in the third line from the top, you see the mean that I'm calculating. That can have arbitrarily high precision. I won't get 0.41 as expected; I'll probably get 0.4001 or something like that. So it needs to be trimmed. That's one thing. Secondly, in estimator.features I'm expecting at least a couple of elements, which may not be true. I could get a dump of data where my estimator is actually fitting on just one factor, so that might raise an IndexError. There are a bunch of reasons why the template rendering itself might fail, and I need it to be robust, because this is an application where more and more data is going to keep coming in, and I need to keep rendering meaningful text without errors.

So obviously, the next thing I might want to do is make this smart somehow. But the problem is that not everybody can write templates, let alone pandas or Python. And there are templating engines, Tornado especially, which allow you to write arbitrarily complex code within the template itself. So you can write practically any sort of Python code inside a template, which is not something most analysts can do. If I want analysts to be able to generate this and embed it in a dashboard, I would have to expect them to know not just a little Python but a fair amount of it, a lot of pandas at least, pandas and scikit-learn and things like that.

So what we tried to do was automate the template creation process entirely. We just said: give us the text. You take your data, you transform it, you do whatever it is you work with. You could do it in Excel, you could do it in Tableau, whatever; just give it a tabular structure and write some text based on it, and we'll generate the template from that straight away. We started guessing the template.
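Before moving on to how the automatic approach works, here is a minimal sketch of the kind of hand-written Jinja2 summary template referred to above. It is not the exact template from the talk; the attribute and column names (composite_score, estimator.coef_, and so on) are illustrative assumptions. The float formatting and the length guard address the two failure modes just mentioned: arbitrary precision in the mean, and an IndexError when the estimator has only one feature.

```python
import jinja2

# A hypothetical executive-summary template. `date`, `state`, `df` and
# `features` are supplied by the caller; none of these names are taken from
# the actual project.
SUMMARY = """\
Based on the latest data refresh of {{ date.strftime('%b %Y') }}, {{ state }} \
had a composite score of {{ '%.2f'|format(df['composite_score'].mean()) }}. \
The best-performing indicator is {{ features[0] }}.\
{% if features|length > 1 %} The next best-performing indicator is {{ features[1] }}.{% endif %}
"""

def render_summary(df, date, state, estimator, feature_names):
    # Rank indicators by the magnitude of their regression coefficients,
    # i.e. by the impact each one has on the composite score.
    ranked = sorted(zip(feature_names, estimator.coef_),
                    key=lambda kv: abs(kv[1]), reverse=True)
    features = [name for name, _ in ranked]
    return jinja2.Template(SUMMARY).render(df=df, date=date,
                                           state=state, features=features)
```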
Now, this process itself has certain assumptions. The first is that every insight you state about the data, everything that you say about the data, comes from some operation on the data. You have to eliminate all the things that you know from your domain; that comes into the picture later. The idea is that if NLG is to work, then all the insights you might want to communicate about the data have to come from some operation on the data itself. They can't come from a third-party source.

The second assumption is that intent can be inferred from the operation itself. I know that when I'm sorting columns, I'm trying to find the extremes. When I'm doing a group-by, I'm trying to figure out along which dimensions my data has interesting patterns. If I'm filtering, then again, I'm focusing on a subset of the rows. So intent, to some extent, can be inferred from the operation that is being carried out. And of course, since this is not exactly a machine learning problem, I can just ask the user to give me their intent as well, but that complicates the process a little bit, because if the user has to keep doing a lot of manual intervention, then my generation capabilities don't really hold any water.

The third assumption is that, for the sake of this particular functionality, all operations are DIDO: dataframe in, dataframe out. If your output happens to be just a scalar figure, let it be a data frame with a single cell in it. If it's just a column or a row, let it be a pandas Series. But it has to be something compatible; it can't be just an integer or a float. That could also be managed, but for the sake of compatibility it's easier to simply enforce the restriction that all operations are dataframe in, dataframe out. Just like scikit-learn says array in, array out. I think it does support dataframes too, but it still expects arrays to be present.

And the last assumption is that operations need to be present in a parsable logical form. This is the most difficult condition to satisfy. Since you are inferring intent from the operation, if the operation itself is not straightforward, if it is just arbitrary pandas code, or maybe an Excel formula, or HTML, or whatever, it is a nightmare to parse. So there have to be very opinionated ways of doing a particular operation, and it has to be concise.

So in theory, at least, I can say that if I have data, if I have operations on that data, and if I have some representative text which narrates some insights about the data, then I can reverse-engineer this process to automatically create a template like the one we saw. In theory this is possible, and to some extent we've actually done it. So let's see how that happens. For the last two assumptions, we have a component called FormHandler. You can think of FormHandler as dataframes for the web. FormHandler comes from a package called Gramex, which is an open-source platform for building data-driven web applications, and FormHandler is its primary data model. It is a way of exposing data through the web and exposing pandas operations through the browser, through HTTP requests.
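As a rough illustration of what "pandas operations over HTTP" means here: the query-string keywords below (_by, _c, _sort) follow the DSL described a little later in the talk, but the endpoint, file name, and exact URL shape are assumptions made for this sketch. The pandas snippet is the equivalent operation done locally.

```python
# Hypothetical FormHandler-style request. The _by / _c / _sort keywords are
# the DSL described later in the talk; the endpoint and column names are
# assumptions for this sketch.
#
#   GET /sales?_by=Region&_c=Sales|sum&_sort=-Sales|sum
#
# Roughly the same operation done locally with pandas:
import pandas as pd

df = pd.read_csv('sales.csv')                        # one row per sale (hypothetical file)
result = (df.groupby('Region')[['Sales']].sum()      # group by region, total the sales
            .sort_values('Sales', ascending=False))  # largest region first
print(result)
```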
So let me just try and see if we can open that up. Oh, okay, I'm offline. But anyway, this is something that allows you to do minimal pandas operations through the web: typically not expensive, numerically intensive operations, but typical data manipulation, data munging, slicing and dicing, those sorts of operations. So here's an example of it. This data set represents a bunch of sales done in a fictional supermarket. Every row is one sale. And let's say that I'm interested in finding out which region, and region here is just West, East, Central, and South, had the most sales. So I'm going to group by region and sort the sales column in descending order. And then, when I get the grouped and sorted output, I say that the West region recorded the most sales, amounting to nearly half a million dollars, followed by the East and the Central regions. That's the insight I draw from it. At this stage it's not generating anything; the table you see at the bottom is what the analyst looks at, and they simply type this down.

Now, what we can do with this information: as you see, West is a named entity which is present in my data as some value in a column. Region is an inflected form of the name of a column. And similarly, when I say most, that most is an adjective, a comparative adjective that corresponds to an operation that I've done. This time it doesn't depend on the data; it depends on the operation. So again, another example of how operations can be used to infer intent: I'm trying to find out extremes, most sales, least sales, things like that. And I say, okay, in my operations pipeline I have something called sort, so most probably corresponds to that. And if it was something else, then I wouldn't write most; I'd write something else. Similarly, half a million is a humanized representation of this figure, 533,000 and something. I can't print that figure there, because this is not how humans talk. People would rather say half a million than five, three, three, eight, nine, five. And again, I need some other NLP functionality to make that happen and map it back from my template to this.

The good thing is, spaCy is great at this. I can just parse my text straight away. And if you look at the entities it's found, it's found the locations, and it knows that most of these locations are present in one column of my table. You see it also identifies that nearly half a million dollars is a named entity with a label of quantity, which means that I can just go back to my data frame, look at all the columns which are of a quantitative type, and look for it there. However, I still need to solve the problem of how to convert the string half a million dollars back into 533,895 or something like that. So there are, of course, some places where human intervention is required. It's not fully automated, but there are multiple heuristics you can apply to get this to work.

And then there is just a simple function called templatize, where you provide the doc object and you provide the data, along with a formal representation of what you've done with the data. That is the pandas operation, represented in a very minimal DSL, a DSL that spares you from writing pandas code. That _by is a keyword in the DSL which means you're supposed to group by the Region column; as you see, it's a list, so I could add multiple columns, group by one and then the other, or group by several at the same time. And _c means pick up this column, so Sales|sum is what I get back from the grouped object.
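A sketch of the step just described: parse the analyst's sentence with spaCy, look at the entities it finds, and hand the sentence, the operations, and the data to the templatize function. The call at the end follows the talk's description (text plus operations plus dataframe); treat the exact signature, the _sort argument, and the entity labels in the comments as assumptions.

```python
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_lg')          # any model with NER enabled
df = pd.read_csv('sales.csv')               # hypothetical: one row per sale

text = ('The West region recorded the most sales, amounting to nearly half a '
        'million dollars, followed by the East and the Central regions.')
doc = nlp(text)

# Which named entities did spaCy find? Labels depend on the model, but
# typically 'West' comes back as a location and the money phrase as a
# quantity/money entity; those can then be matched against column values
# and the numeric columns of the dataframe.
for ent in doc.ents:
    print(ent.text, ent.label_)

# The operations that produced the table the analyst was looking at,
# expressed with the _by / _c keywords of the DSL described above.
fh_args = {'_by': ['Region'], '_c': ['Sales|sum'], '_sort': ['-Sales|sum']}

# Hedged: called here the way the talk describes it (text + operations +
# data); check the nlg project for the exact API.
from nlg import templatize
template = templatize(doc, fh_args, df)
print(template)
```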
The templatize function is then just going to go through the entities, figure out which columns, column headers, index names, and cell values correspond to each of these entities, and try to put that back in, filling the blanks in the template itself. And that gives me this template. You see it's picked up the fact that West, East, and Central are all df.index values. They're all in my index, because after I group by, Region becomes my index column, and its name is Region, capitalized. In my text I am not using region with a capital R, so it's lower-casing that. These are inflections that can be automatically detected through spaCy by comparing two different token objects: I have a token or a span that I get by rendering the template, and I have a span that I get from the raw document. I can just compare them, detect the inflection, and actually apply that inflection so that my text becomes more automated. And you see at the end there are functions called nlg.humanize and nlg.pluralize. The last word that we had was regions; we found out where the word region lies in the data frame, but pluralizing it is, again, another form of detecting an inflection automatically and applying it in my template. There it is.

That by itself still doesn't suffice, because there are tons of things that could go wrong with this template alone. If a new piece of data comes in, any of these variables could fail for whatever reason. It's quite possible that my index may not have a name. Somebody might send me a table that is already grouped, or it might come through a web API where data frames don't carry an index, so the name may not exist. It could just be an empty string, and then that part would go away. So I may need to replace the word region, taken from the name of the index, with something else if it is found, or maybe just hard-code it, depending on what the user specifies. The Sales column, where the value for West lives, may not exist; that is another place where it can fail. There are a bunch of places, and any of them can fail for any reason. So we need to inject conditionals into the template so that it renders these things only if they are present; otherwise, you don't talk about them. And you see there are four regions. I may get a data frame which has only one region, in which case the entire point is moot. I wouldn't want to say this particular line at all; I don't want to narrate it at all. So there might be some conditions based on which I decide whether I want to say something or not, and the whole thing belongs in an if block. That is also the sort of functionality we need to make this properly automated, so that it doesn't require any intervention.
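A minimal sketch of the kind of guarded template this describes, again assuming Jinja2 and the grouped sales table from the example. Each reference to the data sits behind a conditional, so a missing index name, a missing column, or a table with only one region simply suppresses the sentence instead of raising an error.

```python
import jinja2

# Every data access is guarded: no index name or too few rows means the
# sentence is skipped entirely, and the amount is only mentioned when the
# Sales column actually exists.
GUARDED = """\
{% if df.index.name and df|length > 1 %}\
The {{ df.index[0] }} {{ df.index.name|lower }} recorded the most sales\
{% if 'Sales' in df.columns %}, amounting to {{ '%.0f'|format(df['Sales'].iloc[0]) }}{% endif %}, \
followed by the {{ df.index[1:3]|join(' and the ') }} {{ df.index.name|lower }}s.\
{% endif %}
"""

def narrate(df):
    return jinja2.Template(GUARDED).render(df=df).strip()
```

With the grouped table from earlier, this renders the original sentence; with a single-region table or an unnamed index, it renders nothing at all.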
And like I said, there is a bunch of post-processing you can do to handle issues like this. One is that you can make the... oh, I think that was a power outage, was it? So, you can make the template smart by handling various conditions: what if the data frame contains only two rows, or what if the sorting order changes? Can I change most to least, depending on whether I'm sorting in ascending or descending order? And how do I map adjectives to operations anyway? Do I have a bank of adjectives ready whose order I can simply change, or something like that? Maybe. Then grammar checks. A bunch of times pluralization or singularization doesn't work, because, say, there is a noun and there is a verb associated with that noun, and if the noun changes its number, then the verb also has to change its number. These are things that can be detected fairly easily, but there is no upper limit to what sort of grammatical problems you could see as new data comes in and grammatically terrible text keeps getting generated. One great way to detect and actually apply inflections is a spaCy extension called pyinflect; a quick sketch of it follows below. It's an amazing piece of work, and it's really great that spaCy 2.0 has extensions now. Earlier I guess it was difficult to hack around because there was so much C involved, but life has changed since extensions came out. And there is also a package which does humanization. I'll be sharing the slides so you can see all these links for yourself. And there are a bunch of other heuristics you could apply.

In general, what we would like, to make this truly NLG, is all of these things. What I've done so far is based mostly on pandas and a small DSL that reflects pandas. But people, like I said, may not always use pandas or even the FormHandler that we have. People might do their analysis in Excel or SQL and just want to convert those operations into text. So for the automatic template generation engine, we might want multiple backends: just like we have converted pandas code to templates, we might want to convert Excel formulas or SQL queries into text as well, or at least templatize the variables to some extent. Operations on the data convey the intent, and the same intent should have very similar operations. If I'm just trying to pick out a particular value, a particular filter, or a particular grouping, even in pandas there are, well, maybe not infinitely many, but multiple possible ways of doing that, and they should be somewhat similar. I can't end up with multiple different narratives just because I'm trying to do the same thing. Therefore it is possible to infer intent from code and code from intent, in theory at least. We could maybe have sequence-to-sequence models later which can do this.

The other thing is that we might want to have an IDE for this. This is currently all happening in Jupyter notebooks and some very limited-featured web apps where we have data exposed through a table. I think we're almost out of time, but I'll just quickly wrap up. We have data exposed through tables; you write your text, you click a button, and it templatizes. But then I can't interact with it further, so I have to take that down into Python. There's an automatically generated script that I have to download, send to a developer who can fix it, and then have sent back to me. This is terrible. What I want, ideally, is an IDE that just allows me to select a piece of text, templatize it, and refer to the data frame, whatever your backend may be: Excel, SQL, or pandas. That is something that we would like to have.
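The quick sketch of the two helpers mentioned a moment ago: pyinflect registers an inflect extension on spaCy tokens, and the humanize package produces the "half a million"-style numbers used in the narrative. The tags and outputs in the comments are the usual ones, but exact formatting can vary by version.

```python
import humanize
import spacy
import pyinflect  # noqa: F401  (importing registers the ._.inflect extension)

nlp = spacy.load('en_core_web_sm')

token = nlp('region')[0]
print(token._.inflect('NNS'))     # plural noun tag -> 'regions'

print(humanize.intcomma(533895))  # -> '533,895'
print(humanize.intword(533895))   # -> roughly '533.9 thousand' (version-dependent)
```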
And if we do make this IDE, one great advantage is that we'll be able to collect training data in the form of natural language text mapped to code. So I can map code to natural language, which would be awesome. Finally, another thing that is easy to narrate is ML. Machine learning models tend to be very structured. Unless they're deep learning models, if it's something like a scikit-learn estimator, I can easily take it apart, see inside it, and narrate, say, what the coefficients are and what they mean for my data. I can do the narration with ML itself: like I said, we could have sequence-to-sequence models which map code to data, or variables to how they're rendered in the final output, and so on. You can do automatic statistical analysis; you could do a bunch of things upfront without having to wait for human intervention at all, and just narrate the results from that. The most important thing that I would like to see is domain-specific language models. If I know that I'm going to talk to a banker, or to somebody from the healthcare or commerce industries, I might want domain-specific language models so that I can pick up the vocabulary from that particular domain and talk about it further. So thank you, and I'll be happy to take any questions. This is the link; please try it out, and bugs, patches, pull requests, everything is welcome. Thank you.

Yeah, we have time for questions. Any questions?

Hi, so thanks for an awesome talk. (Thank you.) It's been really great. So I have two suggestions that I don't know whether you've considered. First of all, for the IDE, there's this tool that's recently come out that Ines has been playing with, called Streamlit. (Stream? Streamlit. Streamlit, okay.) It's an open-source tool and it gives you an easy way to build dashboard widgets and things with code. So I think this might be a good way to make the GUI in an easy way, so that you can basically solve that problem. (Great. Yeah, absolutely. First thing I'm gonna try out.) And then another one is about explaining the deep learning models. I definitely agree that simple models are easier to explain. There's a trick which Geoff Hinton did in one paper where you have a deep learning model and then you train a decision tree to replicate its decisions. So you get the ability of the deep learning model to learn something. (Oh, yes, yes, absolutely.) And then you can train a decision tree from that, and a decision tree is obviously very good for explanation. (Absolutely, yes.) So that's another thing that you might try, and it might help with this.

Yes, yes. It often happens that you train a deep learning model end-to-end and you realize that what it's actually learned is a fairly simple model. So you might still prune it and get maybe just a linear fit out of it, who knows? But the fact that it needs to learn that thing may itself not be obvious, which is why you had to go through the deep learning model. But yeah, that definitely is true.

Hi. (Hello, yeah.) Great presentation. (Thank you.) Just a quick question about spaCy. So spaCy, be it spaCy or Stanford or whatever it may be, has its own pitfalls. It's not always accurate in predicting entities and part of speech. (Yeah.)
So when you are using such a system, you know, with all due respect to spaCy, all you are doing is error multiplication, you know? Error multiplication. So if your entity is wrong, obviously your prediction is going to be wrong, right? (Yeah.) And calculations and numbers will all be wrong. (Yeah.) So have you written rules on top of it?

Yes, yes, yeah. So it depends on that. By the way, go to the Explosion stall; they have something called the Zen of spaCy, which is really brilliant to read. One of the things they mention there is that if you combine entity recognition with rule-based or phrase matchers, you'll get a lot of false positives. But that's still okay, because even if there are false positives, at least there are no false negatives, right? And like I said, if we are able to build this IDE successfully, and if I can capture how people are selecting pieces of text and assigning them as variables, I can use that information to train my own entity recognizer, which builds on top of the large model here. So that's how I would deal with it. But first that IDE has to be in place, so I can collect the training data. And of course, if you do have false positives, the IDE, I'm sure, can give people straightforward ways to simply knock them off, and that itself is training data: which entities are my rules and the default entities picking up together, and which of them are not relevant. If I can just capture that, that itself is a lot of training data for me.

So by your rules, do you mean context-free grammar rules, or regular expression rules, or some business logic?

Mostly these are rules that are inferred from spaCy attributes. (Attributes, okay.) So POS tagging, regular expressions, all of them can be compiled into a spaCy rule. There is something called a phrase matcher or a rule-based matcher, which picks up things according to these rules. So I can say that, for example, zero or more adjectives followed by at least one noun is a noun phrase. (Okay, yeah.) I can say zero or more with the asterisk operator and nouns with a plus operator. So you can apply POS tags and regexes together to make a rule. (Right, I got the point. Thank you.)

Okay, we have time for one last question. Yeah, there. It should be a quick question, no follow-ups; people will want to break. I'll be around for the rest of the day, so.

For the grammaticality part of it, I would suggest you use language models to score the sentences, to see whether the score is above a threshold, so that the sentences you form, your templates, are grammatical.

Brilliant idea, yes. Not something that had occurred to me, great. But what kind of a score is this? Scoring what exactly?

Just the perplexity, or the loss itself, would be the measure. Whether the sentence is likely.

Is this a likely sequence of words? Yeah. Okay, yeah, absolutely, yes. That could be an easy fix against grammar errors, maybe, I don't know.
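To make the noun-phrase rule described in the exchange above concrete, here is a small sketch of a spaCy Matcher pattern for "zero or more adjectives followed by one or more nouns". The Matcher.add signature shown is the spaCy v3 one; older releases pass the pattern slightly differently.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [
    {'POS': 'ADJ', 'OP': '*'},   # zero or more adjectives (the * operator)
    {'POS': 'NOUN', 'OP': '+'},  # one or more nouns (the + operator)
]
matcher.add('NOUN_PHRASE', [pattern])

doc = nlp('The west region recorded the most sales last quarter.')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # candidate noun phrases (overlapping matches included)
```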